Document Processor
Analyze, summarize, and convert office documents.
Supported Formats
| Format | Read | Write | Tools |
|---|---|---|---|
| PDF | ✅ | ✅ | pdfplumber, pypdf |
| DOCX | ✅ | ✅ | python-docx |
| XLSX | ✅ | ✅ | openpyxl |
| PPTX | ✅ | ✅ | python-pptx |
Quick Reference
PDF Text Extraction
import pdfplumber
with pdfplumber.open("doc.pdf") as pdf:
    # extract_text() may return None or "" for pages with no extractable text
    text = "\n".join(p.extract_text() or "" for p in pdf.pages)
Excel Reading
import openpyxl
wb = openpyxl.load_workbook("data.xlsx")
ws = wb.active
data = [[cell.value for cell in row] for row in ws.iter_rows()]
Word Document
from docx import Document
doc = Document("report.docx")
text = "\n".join(p.text for p in doc.paragraphs)
Workflows
Summarize PDF
- Extract text with pdfplumber
- Pass to Claude for summarization
- Output markdown summary
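The three steps above can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not a definitive implementation: `extract_pdf_text`, `build_prompt`, and `summarize_text` are hypothetical helper names, the prompt wording and the `max_chars` limit are illustrative, and `summarize` stands in for whatever function actually sends the prompt to Claude and returns its reply.

```python
from typing import Callable


def extract_pdf_text(path: str) -> str:
    """Step 1: extract text from every page with pdfplumber."""
    import pdfplumber  # imported lazily so the helpers below work without it

    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def build_prompt(text: str, max_chars: int = 20_000) -> str:
    """Step 2a: wrap the (possibly truncated) text in a summarization prompt."""
    # The instruction wording and the character limit are assumptions.
    return (
        "Summarize the following document as markdown bullet points:\n\n"
        + text[:max_chars]
    )


def summarize_text(
    text: str, summarize: Callable[[str], str], title: str = "Summary"
) -> str:
    """Steps 2b-3: call the model and return a markdown summary."""
    # `summarize` is a hypothetical callable: prompt in, model reply out.
    summary = summarize(build_prompt(text)).strip()
    return f"# {title}\n\n{summary}\n"
```

A call might look like `summarize_text(extract_pdf_text("doc.pdf"), summarize=call_claude)`, where `call_claude` is a placeholder for your own client code. Keeping the model call behind a plain callable makes the pipeline easy to test with a stub.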
Convert Excel to CSV
import pandas as pd
# read_excel loads only the first sheet by default and needs openpyxl for .xlsx files
df = pd.read_excel("data.xlsx")
df.to_csv("data.csv", index=False)
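For workbooks with more than one sheet, a variant that writes one CSV per worksheet may be useful. The following is a rough sketch that swaps pandas for openpyxl plus the standard `csv` module; `write_rows` and `workbook_to_csvs` are hypothetical helper names, and naming each file after its sheet title is an assumption.

```python
import csv
from pathlib import Path


def write_rows(rows, csv_path):
    """Write an iterable of row tuples to csv_path; None cells become empty."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def workbook_to_csvs(xlsx_path, out_dir):
    """Export every worksheet to <out_dir>/<sheet title>.csv; return the paths."""
    import openpyxl  # imported lazily; only this function needs it

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # data_only=True returns cached formula results instead of formula strings
    wb = openpyxl.load_workbook(xlsx_path, read_only=True, data_only=True)
    paths = []
    try:
        for ws in wb.worksheets:
            path = out_dir / f"{ws.title}.csv"
            write_rows(ws.iter_rows(values_only=True), path)
            paths.append(path)
    finally:
        wb.close()  # read-only workbooks keep the file handle open
    return paths
```

Calling `workbook_to_csvs("data.xlsx", "csv_out")` would then produce one file per sheet in `csv_out/`.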
Extract Tables from PDF
import pdfplumber
with pdfplumber.open("doc.pdf") as pdf:
    # Each table is a list of rows; each row is a list of cell strings (or None)
    tables = pdf.pages[0].extract_tables()
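Extracted tables arrive as nested lists, so a small converter is usually needed before they can be shown or passed on as markdown. The sketch below is one possible approach: `table_to_markdown` is a hypothetical helper, and treating the first row as the header is an assumption that will not hold for every PDF.

```python
def table_to_markdown(table):
    """Render one table (a list of rows) as markdown, first row as header."""
    if not table:
        return ""

    def clean(cell):
        # Empty cells are None; newlines and pipes would break a markdown row.
        if cell is None:
            return ""
        return str(cell).replace("\n", " ").replace("|", "\\|").strip()

    rows = [[clean(c) for c in row] for row in table]
    width = max(len(r) for r in rows)
    # Pad short rows so every row has the same number of columns.
    rows = [r + [""] * (width - len(r)) for r in rows]
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * width]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)
```

With the `tables` list from the snippet above, `"\n\n".join(table_to_markdown(t) for t in tables)` would give one markdown table per extracted table.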
Best Practices
- Use pdfplumber for complex PDFs (tables, layouts)
- Use pypdf for simple text extraction
- Convert extracted content to markdown before passing it to a model
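As an illustration of the markdown conversion suggested above, here is a minimal sketch for Word documents. It assumes the document uses the default English style names ("Title", "Heading 1", "List Bullet", and so on), and `paragraphs_to_markdown` and `docx_to_markdown` are hypothetical helper names. Tables and inline formatting are not handled.

```python
import re


def paragraphs_to_markdown(paragraphs):
    """Convert (style_name, text) pairs into a markdown string."""
    lines = []
    for style, text in paragraphs:
        text = text.strip()
        if not text:
            continue  # skip empty paragraphs
        heading = re.fullmatch(r"Heading (\d)", style)
        if heading:
            # Markdown has six heading levels; Word allows up to nine.
            level = min(int(heading.group(1)), 6)
            lines.append("#" * level + " " + text)
        elif style == "Title":
            lines.append("# " + text)
        elif style.startswith("List Bullet"):
            lines.append("- " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)


def docx_to_markdown(path):
    """Read a .docx file with python-docx and convert its paragraphs."""
    from docx import Document  # python-docx, imported lazily

    doc = Document(path)
    return paragraphs_to_markdown((p.style.name, p.text) for p in doc.paragraphs)
```

Separating the style mapping from the file reading keeps the conversion rules easy to adjust for documents that use custom style names.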
