DOCX Creation, Editing, and Analysis
Read the relevant reference file completely before starting work:
- Creating a new document: read references/docx-js.md
- Editing an existing document: read references/ooxml.md
Workflow Decision Tree
| Task | Workflow | Reference |
|---|---|---|
| Read/analyse content | Text extraction (pandoc) or Raw XML | None needed |
| Create new document | docx-js (JavaScript) | references/docx-js.md |
| Edit your own doc (simple) | OOXML editing | references/ooxml.md |
| Edit someone else's doc | Redlining workflow (recommended) | references/ooxml.md |
| Legal/business/government | Redlining workflow (required) | references/ooxml.md |
Reading and Analysing Content
Text Extraction (Default)
Convert the document to markdown with pandoc:
pandoc --track-changes=all path-to-file.docx -o output.md
# Options: --track-changes=accept (pandoc's own default) / reject / all
Prefer --track-changes=all in this workflow so that revision history is preserved. Use accept only when the user wants clean text without markup.
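The extraction command can also be driven from a script. The following is a minimal sketch, not part of the referenced tooling: the helper names `pandoc_cmd` and `extract` are hypothetical, and only the pandoc flags already shown above are used.

```python
import subprocess

# The three values pandoc accepts for --track-changes
TRACK_MODES = {"accept", "reject", "all"}

def pandoc_cmd(src, out, track_changes="all"):
    """Build the pandoc argument list for DOCX-to-markdown extraction."""
    if track_changes not in TRACK_MODES:
        raise ValueError(f"unknown track-changes mode: {track_changes!r}")
    return ["pandoc", f"--track-changes={track_changes}", src, "-o", out]

def extract(src, out, track_changes="all"):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_cmd(src, out, track_changes), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.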
Raw XML Access
Use raw XML when you need: comments, complex formatting, document structure, embedded media, or metadata.
python ooxml/scripts/unpack.py <office_file> <output_directory>
Key files after unpacking:
- word/document.xml -- main document body
- word/comments.xml -- comments referenced in document.xml
- word/media/ -- embedded images and media
- Tracked changes use <w:ins> (insertions) and <w:del> (deletions) tags
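For a quick look at tracked changes without unpacking, the main document part can be read straight from the archive, since a .docx file is a ZIP container. The sketch below uses only the Python standard library and is an illustration, not a replacement for the unpack script; the function names are hypothetical. For untrusted files, the defusedxml parser listed under Dependencies would be the safer choice.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace bound to the w: prefix
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def read_document_xml(docx):
    """Return the raw bytes of word/document.xml from a .docx path or file object."""
    with zipfile.ZipFile(docx) as z:
        return z.read("word/document.xml")

def tracked_changes(document_xml):
    """Return (inserted_texts, deleted_texts) found in <w:ins> and <w:del>."""
    root = ET.fromstring(document_xml)
    inserted = ["".join(t.text or "" for t in ins.iterfind(".//w:t", NS))
                for ins in root.iter(f"{{{W}}}ins")]
    deleted = ["".join(t.text or "" for t in d.iterfind(".//w:delText", NS))
               for d in root.iter(f"{{{W}}}del")]
    return inserted, deleted

# Simplified sample; real documents carry id, author and date attributes
# on <w:ins> and <w:del>.
sample = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:t xml:space="preserve">within </w:t></w:r>'
    '<w:del><w:r><w:delText>30</w:delText></w:r></w:del>'
    '<w:ins><w:r><w:t>60</w:t></w:r></w:ins>'
    '<w:r><w:t xml:space="preserve"> days</w:t></w:r>'
    '</w:p></w:body></w:document>'
)
print(tracked_changes(sample))  # → (['60'], ['30'])
```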
Creating a New Word Document
Use docx-js (JavaScript/TypeScript) for new documents.
- Read references/docx-js.md completely
- Write a script using Document, Paragraph, TextRun components
- Export with Packer.toBuffer()
- Verify the output opens in Word/LibreOffice without errors
Action:
- Read references/docx-js.md
- Create script with Document, Paragraph, TextRun, numbering config for bullets
- Run: node memo.js
- Verify: soffice --headless --convert-to pdf memo.docx && pdftoppm -jpeg -r 150 memo.pdf preview
Editing an Existing Word Document
Use the Document library (Python) from scripts/document.py. It handles infrastructure setup automatically (people.xml, RSIDs, settings.xml, comments, relationships, content types).
Standard Editing Workflow
- Read references/ooxml.md completely (focus on "Document Library" section)
- Unpack: python ooxml/scripts/unpack.py <file.docx> <output_dir>
- Edit using Document library methods
- Pack: python ooxml/scripts/pack.py <output_dir> <result.docx>
- Verify: convert to markdown and check output
Action:
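from scripts.document import Document

# Load the unpacked document directory
doc = Document('unpacked', track_revisions=True)
# Find the run containing the text to change
# (this example assumes the run's full text is "within 30 days")
node = doc["word/document.xml"].get_node(tag="w:r", contains="30 days")
# Reuse the run's formatting (w:rPr), if it has any, in every new run
rpr = tags[0].toxml() if (tags := node.getElementsByTagName("w:rPr")) else ""
# [unchanged] + [deletion] + [insertion] + [unchanged];
# "ORIGINAL" is a placeholder for the original run's RSID
replacement = (
f'<w:r w:rsidR="ORIGINAL">{rpr}<w:t>within </w:t></w:r>'
f'<w:del><w:r>{rpr}<w:delText>30</w:delText></w:r></w:del>'
f'<w:ins><w:r>{rpr}<w:t>60</w:t></w:r></w:ins>'
f'<w:r w:rsidR="ORIGINAL">{rpr}<w:t> days</w:t></w:r>'
)
doc["word/document.xml"].replace_node(node, replacement)
doc.save()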
Redlining Workflow (Document Review with Tracked Changes)
Plan tracked changes in markdown before implementing in OOXML. Group related changes into batches of 3-10 for manageable debugging.
Principle: Minimal, Precise Edits. Only mark text that actually changes. Repeating unchanged text makes edits harder to review. Break replacements into: [unchanged text] + [deletion] + [insertion] + [unchanged text]. Preserve the original run's RSID for unchanged text.
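The four-part split described above can be generated by a small helper. This is a sketch under stated assumptions: `tracked_replacement` is a hypothetical name, not a Document library method, and it mirrors the element shapes of the editing example earlier in this document without adding anything. In raw OOXML, a `<w:t>` with leading or trailing spaces normally needs `xml:space="preserve"`; the sketch assumes that is handled elsewhere, as the earlier example does.

```python
from xml.sax.saxutils import escape

def tracked_replacement(before, old, new, after, rpr="", rsid=None):
    """Build [unchanged] + [deletion] + [insertion] + [unchanged] run XML.

    rpr  -- serialized <w:rPr> of the original run, or "" if it has none
    rsid -- RSID of the original run, applied only to the unchanged runs
    """
    keep = f' w:rsidR="{rsid}"' if rsid else ""
    parts = []
    if before:  # unchanged text before the edit
        parts.append(f"<w:r{keep}>{rpr}<w:t>{escape(before)}</w:t></w:r>")
    # only the text that actually changes is marked
    parts.append(f"<w:del><w:r>{rpr}<w:delText>{escape(old)}</w:delText></w:r></w:del>")
    parts.append(f"<w:ins><w:r>{rpr}<w:t>{escape(new)}</w:t></w:r></w:ins>")
    if after:  # unchanged text after the edit
        parts.append(f"<w:r{keep}>{rpr}<w:t>{escape(after)}</w:t></w:r>")
    return "".join(parts)

xml = tracked_replacement("within ", "30", "60", " days")
```

Escaping the text keeps the fragment well-formed when a phrase contains `&` or `<`.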
Step-by-Step
- Get markdown representation:
  pandoc --track-changes=all path-to-file.docx -o current.md
- Identify and group changes. Organise into batches by section, type, or proximity. Use these location methods for finding text in XML:
  - Section/heading numbers (e.g., "Section 3.2")
  - Grep patterns with unique surrounding text
  - Document structure (e.g., "first paragraph after Heading 2")
  - Do NOT use markdown line numbers -- they do not map to XML structure
- Read documentation and unpack:
  - Read references/ooxml.md -- focus on "Document Library" and "Tracked Change Patterns"
  - Unpack: python ooxml/scripts/unpack.py <file.docx> <dir>
  - Note the suggested RSID from unpack script
- Implement changes in batches. For each batch:
  - Grep word/document.xml to verify current text and line numbers (they shift after each script)
  - Write a script using get_node to find nodes, then replace_node, suggest_deletion, or insert_after
  - Run the script and verify with doc.save()
- Pack the document:
  python ooxml/scripts/pack.py unpacked reviewed-document.docx
- Final verification:
  pandoc --track-changes=all reviewed-document.docx -o verification.md
  grep "original phrase" verification.md # Should NOT match
  grep "replacement phrase" verification.md # Should match
Batch plan:
- Batch 1 (Term changes): "2 years" to "1 year" in Section 5
- Batch 2 (Jurisdiction): "New York" to "Delaware" in Section 8
Per batch: grep for text, write script, run, verify. After all batches, pack and do final verification.
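The two grep checks in the final verification can be run for a whole batch plan at once. The following is a minimal sketch; `check_redlines` is a hypothetical helper, and which pandoc mode produced the text is left to the caller. Note that in `all` mode deleted text is still present inside deletion markup, so a phrase that lies wholly within a deletion would still be found.

```python
def check_redlines(markdown, changes):
    """Return a list of problems; an empty list means every change verified.

    changes -- iterable of (original_phrase, replacement_phrase) pairs
    """
    problems = []
    for old, new in changes:
        if old in markdown:      # mirrors: grep "original phrase" should NOT match
            problems.append(f"still present: {old!r}")
        if new not in markdown:  # mirrors: grep "replacement phrase" should match
            problems.append(f"missing: {new!r}")
    return problems

plan = [("a term of 2 years", "a term of 1 year"),
        ("laws of New York", "laws of Delaware")]
text = "for a term of 1 year, governed by the laws of Delaware."
assert check_redlines(text, plan) == []
```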
Method Selection Guide
| Scenario | Method |
|---|---|
| Change part of regular text | replace_node() with <w:del>/<w:ins> |
| Delete entire run or paragraph | suggest_deletion() |
| Reject another author's insertion | revert_insertion() (NOT suggest_deletion()) |
| Restore another author's deletion | revert_deletion() |
| Partially modify another author's change | replace_node() with nested <w:ins>/<w:del> |
Converting Documents to Images
Two-step process for visual analysis:
# Step 1: DOCX to PDF
soffice --headless --convert-to pdf document.docx
# Step 2: PDF pages to JPEG
pdftoppm -jpeg -r 150 document.pdf page
# Creates page-1.jpg, page-2.jpg, etc.
# For specific pages only:
pdftoppm -jpeg -r 150 -f 2 -l 5 document.pdf page
Use -r 150 for a good quality/size balance. Increase to 300 for print-quality output.
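When previews are generated repeatedly, the two commands can be assembled by a helper. This is a sketch only: `preview_cmds` is a hypothetical name, and it uses just the soffice and pdftoppm flags shown above. It assumes soffice writes the PDF into the current working directory, which is its behaviour when no output directory is given.

```python
from pathlib import Path

def preview_cmds(docx, prefix="page", dpi=150, first=None, last=None):
    """Return (docx-to-pdf, pdf-to-jpeg) argument lists for the two-step conversion."""
    # soffice names the PDF after the input file and writes it to the cwd
    pdf = Path(docx).with_suffix(".pdf").name
    to_pdf = ["soffice", "--headless", "--convert-to", "pdf", str(docx)]
    to_jpeg = ["pdftoppm", "-jpeg", "-r", str(dpi)]
    if first is not None:  # optional first page
        to_jpeg += ["-f", str(first)]
    if last is not None:   # optional last page
        to_jpeg += ["-l", str(last)]
    to_jpeg += [pdf, prefix]
    return to_pdf, to_jpeg

to_pdf, to_jpeg = preview_cmds("document.docx", first=2, last=5)
```

Each list can be passed to `subprocess.run(..., check=True)` in order, so a failed PDF conversion stops before the image step.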
Code Style
Write concise code. Avoid verbose variable names, redundant operations, and unnecessary print statements.
Dependencies
Install if not available:
| Dependency | Install | Purpose |
|---|---|---|
| pandoc | brew install pandoc or apt-get install pandoc | Text extraction |
| docx | npm install -g docx | Creating new documents |
| LibreOffice | brew install --cask libreoffice or apt-get install libreoffice | PDF conversion |
| Poppler | brew install poppler or apt-get install poppler-utils | PDF to images |
| defusedxml | pip install defusedxml | Secure XML parsing |
References
| File | Purpose |
|---|---|
| references/docx-js.md | docx-js API patterns for creating new documents |
| references/ooxml.md | OOXML XML patterns, Document library API, tracked changes |
