Document Creation
Core Principle
Documents are deliverables. Match the format to the audience: Word for collaborative editing, PDF for final distribution, PowerPoint for presentations, Excel for data and models. Every document should look like it was crafted by a professional, not generated by a script.
Format Selection
| Need | Format | Why |
|---|---|---|
| Collaborative editing, track changes | .docx (Word) | Native revision tracking, comments, universal compatibility |
| Final distribution, no editing | .pdf | Preserves formatting, works everywhere, print-ready |
| Presentations, pitch decks | .pptx (PowerPoint) | Slide-based, speaker notes, animations |
| Data, calculations, models | .xlsx (Excel) | Formulas, charts, pivot tables, dynamic recalculation |
Word Documents (.docx)
Reading Content
# Text extraction with pandoc
pandoc document.docx -o output.md
# With tracked changes preserved
pandoc --track-changes=all document.docx -o output.md
Creating New Documents
Use the docx npm package (JavaScript) for creating Word documents programmatically.
npm install -g docx
const { Document, Packer, Paragraph, TextRun, HeadingLevel } = require('docx');
const fs = require('fs');
const doc = new Document({
sections: [{
properties: {
page: {
size: { width: 12240, height: 15840 }, // US Letter in DXA
margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 }
}
},
children: [
new Paragraph({
heading: HeadingLevel.HEADING_1,
children: [new TextRun("Document Title")]
}),
new Paragraph({
children: [new TextRun("Body text goes here.")]
}),
]
}]
});
Packer.toBuffer(doc).then(buffer => fs.writeFileSync("output.docx", buffer));
Critical Rules for Word Documents
- Set page size explicitly (defaults to A4, not US Letter)
- Never use \n for line breaks — use separate Paragraph elements
- Never use unicode bullets — use LevelFormat.BULLET with numbering config
- Tables need dual widths: columnWidths on table AND width on each cell
- Always use WidthType.DXA, never PERCENTAGE (breaks in Google Docs)
- ImageRun requires the 'type' parameter (png, jpg, etc.)
- Override built-in heading styles with exact IDs: "Heading1", "Heading2"
PDF Documents
Reading PDFs
# Text extraction
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
for page in pdf.pages:
print(page.extract_text())
# Table extraction
with pdfplumber.open("document.pdf") as pdf:
for page in pdf.pages:
tables = page.extract_tables()
for table in tables:
for row in table:
print(row)
Creating PDFs
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []
story.append(Paragraph("Report Title", styles['Title']))
story.append(Spacer(1, 12))
story.append(Paragraph("Body text content.", styles['Normal']))
doc.build(story)
PDF Operations
from pypdf import PdfReader, PdfWriter
# Merge PDFs
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf"]:
reader = PdfReader(pdf_file)
for page in reader.pages:
writer.add_page(page)
with open("merged.pdf", "wb") as output:
writer.write(output)
Critical Rules for PDFs
- Never use Unicode subscript/superscript characters in ReportLab
(renders as black boxes) — use <sub> and <super> tags instead
- Use pdfplumber for reading, reportlab for creating, pypdf for manipulation
- For scanned PDFs, use pytesseract OCR (convert to images first)
PowerPoint Presentations (.pptx)
Reading Content
# Text extraction
python -m markitdown presentation.pptx
Design Principles
BEFORE STARTING:
→ Pick a bold, content-informed color palette
→ One color dominates (60-70%), 1-2 supporting, one accent
→ Dark backgrounds for title/conclusion, light for content
→ Commit to ONE visual motif and repeat it across slides
EVERY SLIDE NEEDS:
→ A visual element — image, chart, icon, or shape
→ Text-only slides are forgettable
→ Vary layouts: two-column, icon rows, grids, half-bleed images
TYPOGRAPHY:
→ Choose a header font with personality + clean body font
→ Titles: 36-44pt bold | Body: 14-16pt | Captions: 10-12pt
→ Never default to Arial — use Georgia, Cambria, Trebuchet MS
AVOID:
→ Same layout on every slide
→ Centered body text (left-align paragraphs and lists)
→ Accent lines under titles (hallmark of AI-generated slides)
→ Text-only slides with plain bullets
Creating from Scratch
Use pptxgenjs for programmatic creation:
npm install -g pptxgenjs
Excel Spreadsheets (.xlsx)
Reading Data
import pandas as pd
df = pd.read_excel('file.xlsx') # First sheet
all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets
Creating Spreadsheets
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
wb = Workbook()
sheet = wb.active
# Add data
sheet['A1'] = 'Revenue'
sheet['B1'] = 'Q1'
sheet.append(['Product A', 50000])
# ALWAYS use formulas, not hardcoded calculations
sheet['B10'] = '=SUM(B2:B9)'
# Formatting
sheet['A1'].font = Font(bold=True)
sheet.column_dimensions['A'].width = 20
wb.save('output.xlsx')
Critical Rules for Excel
- ALWAYS use Excel formulas, never hardcode calculated values
BAD: sheet['B10'] = 5000 (calculated in Python)
GOOD: sheet['B10'] = '=SUM(B2:B9)'
- Financial models: blue text for inputs, black for formulas,
green for cross-sheet links, red for external links
- Number formatting:
→ Currency: $#,##0 with units in headers
→ Percentages: 0.0% (one decimal)
→ Negatives: parentheses (123) not -123
→ Years: text format ("2024" not "2,024")
- Place all assumptions in separate cells, never inline in formulas
- Document data sources for hardcoded values
- Recalculate after creating (formulas stored as strings, not values)
Anti-Patterns
| Anti-Pattern | Why It Hurts | Do This Instead |
|---|---|---|
| Hardcoding calculations in Python | Spreadsheet becomes static, can't update | Use Excel formulas for all calculations |
| Generic blue slides with bullets | Looks AI-generated, forgettable | Bold color palette, visual elements, varied layouts |
| PDFs with Unicode subscripts in ReportLab | Renders as black boxes | Use <sub> and <super> markup tags |
| Word tables with percentage widths | Breaks in Google Docs | Always use WidthType.DXA |
| No tracked changes in collaborative docs | Edits are invisible, trust erodes | Use Word's tracked changes for all modifications |
| Dumping raw data into Excel | Unreadable, no insights | Format with headers, formulas, conditional formatting |
Power Move
"Create a [document type] for [purpose]. Use professional formatting, proper typography, and make it look like it was designed by someone who cares about the details. Include [specific content requirements]. Output as [format]."
The agent becomes your document production team — creating polished, professional deliverables in any format.
