| 1 | +# docx_comment_parser |
| 2 | + |
| 3 | +A fast, memory-efficient C++17 shared library (DLL/SO) that extracts **all comment metadata** from `.docx` files, with full Python bindings via pybind11. |
| 4 | + |
| 5 | +## Features |
| 6 | + |
| 7 | +| Feature | Details | |
| 8 | +|---|---| |
| 9 | +| Comment fields | id, author, date, initials, full text, paragraph style | |
| 10 | +| Anchoring | referenced document text (via `commentRangeStart/End`) | |
| 11 | +| Threading | parent/reply relationships (OOXML 2016+ `commentsExtended.xml`) | |
| 12 | +| Resolution and stats | `done` flag, earliest/latest dates, per-author filtering | |
| 13 | +| Batch parsing | Thread-pool with configurable parallelism | |
| 14 | +| Memory | ZIP entries inflated one at a time; SAX streaming for the document body, so no DOM is built for `document.xml` | |
| 15 | +| Dependencies | libxml2, zlib (packaged for Linux and macOS; installable via vcpkg on Windows) | |
| 16 | +| Python | pybind11 extension module, GIL released during batch parsing | |
| 17 | + |
| 18 | +--- |
| 19 | + |
| 20 | +## Building |
| 21 | + |
| 22 | +### Prerequisites |
| 23 | + |
| 24 | +**Linux / macOS** |
| 25 | +```bash |
| 26 | +sudo apt install libxml2-dev zlib1g-dev # Debian/Ubuntu |
| 27 | +brew install libxml2 zlib # macOS |
| 28 | +pip install pybind11 cmake |
| 29 | +``` |
| 30 | + |
| 31 | +**Windows** |
| 32 | +Install [vcpkg](https://github.com/microsoft/vcpkg), then: |
| 33 | +```powershell |
| 34 | +vcpkg install libxml2 zlib pybind11 |
| 35 | +``` |
| 36 | + |
| 37 | +### CMake (recommended) |
| 38 | + |
| 39 | +```bash |
| 40 | +cmake -B build -DCMAKE_BUILD_TYPE=Release |
| 41 | +cmake --build build --parallel |
| 42 | +# Optionally run tests: |
| 43 | +cd build && ctest --output-on-failure |
| 44 | +``` |
| 45 | + |
| 46 | +This produces: |
| 47 | +- `build/libdocx_comment_parser.so` (Linux) / `.dylib` (macOS) / `.dll` (Windows) |
| 48 | +- `build/_docx_comment_parser*.so` – Python extension |
| 49 | + |
| 50 | +### pip (Python only) |
| 51 | + |
| 52 | +```bash |
| 53 | +pip install pybind11 |
| 54 | +pip install . |
| 55 | +``` |
| 56 | + |
| 57 | +--- |
| 58 | + |
| 59 | +## Python Usage |
| 60 | + |
| 61 | +```python |
| 62 | +import docx_comment_parser as dcp |
| 63 | + |
| 64 | +# ── Single file ────────────────────────────────────────────────────────────── |
| 65 | +parser = dcp.DocxParser() |
| 66 | +parser.parse("report.docx") |
| 67 | + |
| 68 | +for c in parser.comments(): |
| 69 | +    print(f"[{c.id}] {c.author} ({c.date}): {c.text[:80]}") |
| 70 | +    if c.referenced_text: |
| 71 | +        print(f" ↳ anchored to: '{c.referenced_text[:60]}'") |
| 72 | +    if c.is_reply: |
| 73 | +        print(f" ↳ reply to comment #{c.parent_id}") |
| 74 | + |
| 75 | +# Filter by author |
| 76 | +for c in parser.by_author("Alice"): |
| 77 | +    print(c.to_dict()) |
| 78 | + |
| 79 | +# Get full thread for a root comment |
| 80 | +for c in parser.thread(0): |
| 81 | +    indent = " " if c.is_reply else "" |
| 82 | +    print(f"{indent}[{c.id}] {c.author}: {c.text}") |
| 83 | + |
| 84 | +# Stats |
| 85 | +s = parser.stats() |
| 86 | +print(f"Total: {s.total_comments}, Authors: {s.unique_authors}") |
| 87 | +print(f"Date range: {s.earliest_date} → {s.latest_date}") |
| 88 | + |
| 89 | +# ── Batch (parallel) ───────────────────────────────────────────────────────── |
| 90 | +import glob |
| 91 | + |
| 92 | +bp = dcp.BatchParser(max_threads=0) # 0 = auto |
| 93 | +files = glob.glob("/documents/**/*.docx", recursive=True) |
| 94 | +bp.parse_all(files) |
| 95 | + |
| 96 | +for f in files: |
| 97 | +    if f in bp.errors(): |
| 98 | +        print(f"ERROR {f}: {bp.errors()[f]}") |
| 99 | +        continue |
| 100 | +    s = bp.stats(f) |
| 101 | +    print(f"{f}: {s.total_comments} comments by {len(s.unique_authors)} authors") |
| 102 | + |
| 103 | +bp.release_all() # free memory |
| 104 | +``` |
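
As a small follow-on to the usage above, unresolved root comments could be tallied per author. This is only a sketch: it relies on the `author`, `is_reply`, and `done` attributes listed under CommentMetadata fields, and `open_comments_by_author` is a hypothetical helper, not part of the library.

```python
from collections import Counter


def open_comments_by_author(comments):
    """Count unresolved root comments per author.

    `comments` is any iterable of objects exposing `author`, `is_reply`
    and `done` (assumed to match the CommentMetadata fields).
    """
    counts = Counter()
    for c in comments:
        # Replies are skipped so that each thread is counted once.
        if not c.is_reply and not c.done:
            counts[c.author] += 1
    return dict(counts)
```

Under those assumptions it would be called as `open_comments_by_author(parser.comments())` after `parser.parse(...)`.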
| 105 | + |
| 106 | +--- |
| 107 | + |
| 108 | +## C++ Usage |
| 109 | + |
| 110 | +```cpp |
| 111 | +#include "docx_comment_parser.h" |
| 112 | + |
| 113 | +// Single file |
| 114 | +docx::DocxParser parser; |
| 115 | +parser.parse("report.docx"); |
| 116 | + |
| 117 | +for (const auto& c : parser.comments()) { |
| 118 | +    std::cout << c.id << " | " << c.author << " | " << c.text << "\n"; |
| 119 | +} |
| 120 | + |
| 121 | +// Batch |
| 122 | +docx::BatchParser bp(/*threads=*/4); |
| 123 | +bp.parse_all({"a.docx", "b.docx", "c.docx"}); |
| 124 | +for (const auto& [path, err] : bp.errors()) |
| 125 | +    std::cerr << "Failed: " << path << ": " << err << "\n"; |
| 126 | +bp.release_all(); |
| 127 | +``` |
| 128 | + |
| 129 | +--- |
| 130 | + |
| 131 | +## CommentMetadata fields |
| 132 | + |
| 133 | +| Field | Type | Source | |
| 134 | +|---|---|---| |
| 135 | +| `id` | `int` | `w:id` attribute | |
| 136 | +| `author` | `str` | `w:author` | |
| 137 | +| `date` | `str` | `w:date` (ISO-8601) | |
| 138 | +| `initials` | `str` | `w:initials` | |
| 139 | +| `text` | `str` | Full plain-text of comment body | |
| 140 | +| `paragraph_style` | `str` | Style of first paragraph in comment | |
| 141 | +| `referenced_text` | `str` | Document text anchored by this comment | |
| 142 | +| `is_reply` | `bool` | True if this is a threaded reply | |
| 143 | +| `parent_id` | `int` | id of parent comment (-1 if root) | |
| 144 | +| `replies` | `list[CommentRef]` | Direct replies (populated on parent) | |
| 145 | +| `para_id` | `str` | OOXML 2016+ paragraph ID | |
| 146 | +| `para_id_parent` | `str` | Parent paragraph ID (before id resolution) | |
| 147 | +| `done` | `bool` | Resolved/done flag (`commentsExtended.xml`) | |
| 148 | +| `thread_ids` | `list[int]` | Ordered ids in this thread (root only) | |
| 149 | +| `paragraph_index` | `int` | 0-based paragraph in document body | |
| 150 | +| `run_index` | `int` | 0-based run within paragraph | |
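
The threading fields in the table relate to each other in a way that a short example may make clearer. The sketch below shows one plausible way `para_id_parent` could be resolved into `is_reply`, `parent_id`, and `thread_ids`. The `Comment` dataclass and `link_threads` function are hypothetical stand-ins, not the library's code, and the choice to list the root's own id first in `thread_ids` is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Comment:
    """Hypothetical stand-in carrying only the threading fields."""
    id: int
    para_id: str
    para_id_parent: str = ""   # empty for a root comment
    is_reply: bool = False
    parent_id: int = -1        # -1 marks a root comment
    thread_ids: list = field(default_factory=list)


def link_threads(comments):
    """Resolve paragraph ids into parent ids and per-root thread lists."""
    by_para = {c.para_id: c for c in comments if c.para_id}
    by_id = {c.id: c for c in comments}

    # Pass 1: map each reply's para_id_parent to the parent's numeric id.
    for c in comments:
        parent = by_para.get(c.para_id_parent) if c.para_id_parent else None
        if parent is not None and parent is not c:
            c.is_reply = True
            c.parent_id = parent.id

    # Pass 2: walk up to the root and record the comment in its thread.
    for c in comments:
        root, seen = c, {c.id}
        while root.parent_id != -1 and root.parent_id not in seen:
            root = by_id[root.parent_id]
            seen.add(root.id)  # guards against cyclic references
        root.thread_ids.append(c.id)
    return comments
```

With comments supplied in document order, a root ends up with its own id followed by its replies, while replies keep an empty `thread_ids`.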
| 151 | + |
| 152 | +--- |
| 153 | + |
| 154 | +## Architecture |
| 155 | + |
| 156 | +``` |
| 157 | +docx_comment_parser/ |
| 158 | +├── include/ |
| 159 | +│   ├── docx_comment_parser.h   # Public API (CommentMetadata, DocxParser, BatchParser) |
| 160 | +│   ├── zip_reader.h            # ZIP reader interface (zlib only, no libzip) |
| 161 | +│   └── xml_utils.h             # Lightweight libxml2 helpers |
| 162 | +├── src/ |
| 163 | +│   ├── zip_reader.cpp          # Memory-mapped ZIP + inflate |
| 164 | +│   ├── docx_parser.cpp         # Core: comments.xml (DOM) + document.xml (SAX) |
| 165 | +│   └── batch_parser.cpp        # Thread-pool batch processing |
| 166 | +├── python/ |
| 167 | +│   └── python_bindings.cpp     # pybind11 module |
| 168 | +├── tests/ |
| 169 | +│   └── test_docx_parser.cpp    # Self-contained test suite |
| 170 | +├── CMakeLists.txt |
| 171 | +└── setup.py |
| 172 | +``` |
| 173 | + |
| 174 | +### Memory model |
| 175 | + |
| 176 | +- **ZIP entries** are memory-mapped and inflated one at a time; no entry's data is kept in memory while another is being read. |
| 177 | +- **`comments.xml`** is parsed with libxml2 DOM (typically < 100 KB). |
| 178 | +- **`document.xml`** (which can be very large) is streamed with libxml2 SAX2; only the anchor text accumulator is kept in memory. |
| 179 | +- **BatchParser** runs one `DocxParser` per thread; results can be individually `release()`d to reclaim memory after use. |
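
The streaming approach for `document.xml` can be illustrated in a few lines. The sketch below uses Python's standard `zipfile` and `xml.sax` modules in place of the library's zlib and libxml2 code, so it demonstrates the technique rather than the actual implementation. `AnchorTextHandler` and `anchored_text` are hypothetical names, and `w:id` values are assumed to be integers.

```python
import xml.sax
import zipfile
from xml.sax.handler import ContentHandler, feature_namespaces

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


class AnchorTextHandler(ContentHandler):
    """Accumulate the text between commentRangeStart/End markers."""

    def __init__(self):
        super().__init__()
        self.open_ids = set()   # comment ids whose range is currently open
        self.anchors = {}       # comment id -> list of text fragments
        self.in_text = False    # True while inside a <w:t> element

    def startElementNS(self, name, qname, attrs):
        uri, local = name
        if uri != W_NS:
            return
        if local == "commentRangeStart":
            cid = int(attrs.get((W_NS, "id")))
            self.open_ids.add(cid)
            self.anchors.setdefault(cid, [])
        elif local == "commentRangeEnd":
            self.open_ids.discard(int(attrs.get((W_NS, "id"))))
        elif local == "t":
            self.in_text = True

    def endElementNS(self, name, qname):
        if name == (W_NS, "t"):
            self.in_text = False

    def characters(self, content):
        # Only the anchor accumulators grow; no element tree is built.
        if self.in_text:
            for cid in self.open_ids:
                self.anchors[cid].append(content)


def anchored_text(docx_file):
    """Stream word/document.xml out of the archive through a SAX handler."""
    handler = AnchorTextHandler()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    with zipfile.ZipFile(docx_file) as zf, zf.open("word/document.xml") as entry:
        parser.parse(entry)  # reads the inflated entry in chunks
    return {cid: "".join(parts) for cid, parts in handler.anchors.items()}
```

Because every open range receives the text, overlapping or nested comment ranges each collect their own anchor string.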
| 180 | + |
| 181 | +--- |
| 182 | + |
| 183 | +## License |
| 184 | + |
| 185 | +MIT |