Document to Markdown Conversion
Batch convert documents to markdown format, preserving tracked changes, comments, and other markup.
Usage
/convert-to-md [directory]
Supported Formats
| Format | Method | Notes |
|---|---|---|
| DOCX | pandoc --track-changes=all | Preserves comments & tracked changes |
| PDF | PyMuPDF | Text extraction |
| XLSX | pandas | Converts to markdown tables |
| TXT | rename | Direct rename to .md |
| PPTX | pandoc | Slide content to markdown |
| MSG | extract-msg | Email metadata + body |
| DOC | textutil | macOS native (fallback) |
| DOTX | pandoc | Word templates |
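The table above amounts to a lookup from file extension to conversion method. A minimal sketch of that lookup follows; the names `CONVERTERS`, `pick_method`, and `plan` are hypothetical and not part of the command itself.

```python
from pathlib import Path

# Hypothetical lookup table mirroring the Supported Formats table.
CONVERTERS = {
    ".docx": "pandoc --track-changes=all",
    ".pdf": "PyMuPDF",
    ".xlsx": "pandas",
    ".txt": "rename",
    ".pptx": "pandoc",
    ".msg": "extract-msg",
    ".doc": "textutil",
    ".dotx": "pandoc",
}


def pick_method(path):
    """Return the conversion method for a file, or None if unsupported."""
    return CONVERTERS.get(Path(path).suffix.lower())


def plan(directory):
    """Group the supported files in a directory by conversion method."""
    groups = {}
    for p in sorted(Path(directory).iterdir()):
        method = pick_method(p)
        if p.is_file() and method is not None:
            groups.setdefault(method, []).append(p.name)
    return groups
```

Calling `plan(".")` before a run would show which files each method is expected to handle, without converting anything.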
Process
- Install dependencies (if needed):

      uv add pymupdf pandas openpyxl tabulate extract-msg

- Convert DOCX (preserves comments/edits):

      for f in *.docx; do
        # Delete the original only if pandoc succeeded
        pandoc --track-changes=all -f docx -t markdown -o "${f%.docx}.md" "$f" && rm "$f"
      done

- Convert PDF:

      import fitz  # PyMuPDF
      from pathlib import Path

      for pdf in Path(".").glob("*.pdf"):
          # Join the text of every page, separated by blank lines
          with fitz.open(pdf) as doc:
              text = "\n\n".join(page.get_text() for page in doc)
          pdf.with_suffix(".md").write_text(text.strip(), encoding="utf-8")
          pdf.unlink()

- Convert XLSX to tables:

      import pandas as pd
      from pathlib import Path

      for xlsx in Path(".").glob("*.xlsx"):
          content = f"# {xlsx.stem}\n\n"
          # One heading and one markdown table per sheet
          with pd.ExcelFile(xlsx) as xls:
              for sheet in xls.sheet_names:
                  df = pd.read_excel(xls, sheet_name=sheet)
                  content += f"## {sheet}\n\n{df.to_markdown(index=False)}\n\n"
          xlsx.with_suffix(".md").write_text(content, encoding="utf-8")
          xlsx.unlink()

- Convert TXT:

      for f in *.txt; do mv "$f" "${f%.txt}.md"; done

- Convert MSG:

      import extract_msg

      msg = extract_msg.Message("file.msg")
      content = f"# {msg.subject}\n\n**From:** {msg.sender}\n**Date:** {msg.date}\n\n{msg.body}"

- Clean up: Remove *:Zone.Identifier files (Windows metadata)
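The clean-up step has no accompanying snippet. One way it might be done in Python is sketched below, assuming the metadata shows up as ordinary files whose names end in `:Zone.Identifier` (as it typically does when files are copied from Windows into a Linux or WSL filesystem). The function name `remove_zone_identifiers` is hypothetical.

```python
from pathlib import Path


def remove_zone_identifiers(directory):
    """Delete files ending in ':Zone.Identifier' and return their names."""
    removed = []
    for p in sorted(Path(directory).iterdir()):
        # Assumption: the metadata is a regular sidecar file, not an NTFS stream.
        if p.is_file() and p.name.endswith(":Zone.Identifier"):
            p.unlink()
            removed.append(p.name)
    return removed
```

Returning the removed names makes it easy to log what was cleaned up at the end of a batch.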
Behavior
- Deletes original files after successful conversion
- Skips files that already have a .md counterpart
- Reports failures without stopping batch
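The three behaviors above can be sketched as a single batch loop. This is an illustration under stated assumptions, not the command's actual implementation: `convert_batch` and its `converters` argument (a dict mapping a file suffix to a function that returns markdown text) are hypothetical.

```python
from pathlib import Path


def convert_batch(directory, converters):
    """Convert supported files in `directory`, collecting results per file.

    `converters` maps a lowercase suffix such as ".txt" to a callable that
    takes the source Path and returns markdown text (hypothetical interface).
    """
    results = {"converted": [], "skipped": [], "failed": []}
    for src in sorted(Path(directory).iterdir()):
        convert = converters.get(src.suffix.lower())
        if not src.is_file() or convert is None:
            continue
        dest = src.with_suffix(".md")
        if dest.exists():
            # Skip files that already have a .md counterpart.
            results["skipped"].append(src.name)
            continue
        try:
            dest.write_text(convert(src), encoding="utf-8")
        except Exception as exc:
            # Report the failure and keep going; drop any partial output.
            dest.unlink(missing_ok=True)
            results["failed"].append((src.name, str(exc)))
            continue
        # Delete the original only after a successful conversion.
        src.unlink()
        results["converted"].append(src.name)
    return results
```

Because failures are recorded and the loop continues, one unreadable file does not block the rest of the batch, and its original is left in place for a retry.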
Dependencies
- pandoc (system): DOCX, PPTX, DOTX conversion
- textutil (macOS): DOC fallback
- pymupdf (Python): PDF text extraction
- pandas, openpyxl, tabulate (Python): XLSX tables
- extract-msg (Python): Outlook MSG files
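Before starting a batch it can help to check which of these dependencies are actually available. The sketch below uses only the standard library; the function name `missing_dependencies` is hypothetical, and the module list assumes the usual import names (`fitz` for pymupdf, `extract_msg` for extract-msg).

```python
import importlib.util
import shutil

SYSTEM_TOOLS = ["pandoc", "textutil"]
# Import names, which differ from package names for pymupdf and extract-msg.
PYTHON_MODULES = ["fitz", "pandas", "openpyxl", "tabulate", "extract_msg"]


def missing_dependencies(tools=SYSTEM_TOOLS, modules=PYTHON_MODULES):
    """Return the names of tools not on PATH and modules that cannot be found."""
    missing = [t for t in tools if shutil.which(t) is None]
    missing += [m for m in modules if importlib.util.find_spec(m) is None]
    return missing
```

On systems other than macOS, `textutil` will normally be reported as missing, which only matters if the batch contains DOC files.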
