Unstructured PDF Generation

Generate realistic synthetic PDF documents using LLM for RAG (Retrieval-Augmented Generation) and unstructured data use cases.

Overview

This skill uses the generate_pdf_documents MCP tool to create professional PDF documents with:

LLM-generated content based on your description
Accompanying JSON files with questions and evaluation guidelines (for RAG testing)
Automatic upload to Unity Catalog Volumes

Quick Start

Use the generate_pdf_documents MCP tool:

catalog: "my_catalog"
schema: "my_schema"
description: "Technical documentation for a cloud infrastructure platform including setup guides, troubleshooting procedures, and API references."
count: 10

This generates 10 PDF documents and saves them to /Volumes/my_catalog/my_schema/raw_data/pdf_documents/ (using default volume and folder).

With Custom Location

Use the generate_pdf_documents MCP tool:

catalog: "my_catalog"
schema: "my_schema"
description: "HR policy documents..."
count: 10
volume: "custom_volume"
folder: "hr_policies"
overwrite_folder: true

Parameters

Parameter	Type	Required	Default	Description
`catalog`	string	Yes	-	Unity Catalog name
`schema`	string	Yes	-	Schema name
`description`	string	Yes	-	Detailed description of what PDFs should contain
`count`	int	Yes	-	Number of PDFs to generate
`volume`	string	No	`raw_data`	Volume name (created if not exists)
`folder`	string	No	`pdf_documents`	Folder within volume for output files
`doc_size`	string	No	`MEDIUM`	Document size: `SMALL` (~1 page), `MEDIUM` (~5 pages), `LARGE` (~10+ pages)
`overwrite_folder`	bool	No	`false`	If true, deletes existing folder contents first

Document Size Guide

SMALL: ~1 page, concise content. Best for quick demos or testing.
MEDIUM: ~4-6 pages, comprehensive coverage. Good balance for most use cases.
LARGE: ~10+ pages, exhaustive documentation. Use for thorough RAG evaluation.

Output Files

For each document, the tool creates two files:

PDF file (<model_id>.pdf): The generated document
JSON file (<model_id>.json): Metadata for RAG evaluation

JSON Structure

{
  "title": "API Authentication Guide",
  "category": "Technical",
  "pdf_path": "/Volumes/catalog/schema/volume/folder/doc_001.pdf",
  "question": "What authentication methods are supported by the API?",
  "guideline": "Answer should mention OAuth 2.0, API keys, and JWT tokens with their use cases."
}

Common Patterns

Pattern 1: HR Policy Documents

Use the generate_pdf_documents MCP tool:

catalog: "ai_dev_kit"
schema: "hr_demo"
description: "HR policy documents for a technology company including employee handbook, leave policies, performance review procedures, benefits guide, and workplace conduct guidelines."
count: 15
folder: "hr_policies"
overwrite_folder: true

Pattern 2: Technical Documentation

Use the generate_pdf_documents MCP tool:

catalog: "ai_dev_kit"
schema: "tech_docs"
description: "Technical documentation for a SaaS analytics platform including installation guides, API references, troubleshooting procedures, security best practices, and integration tutorials."
count: 20
folder: "product_docs"
overwrite_folder: true

Pattern 3: Financial Reports

Use the generate_pdf_documents MCP tool:

catalog: "ai_dev_kit"
schema: "finance_demo"
description: "Financial documents for a retail company including quarterly reports, expense policies, budget guidelines, and audit procedures."
count: 12
folder: "reports"
overwrite_folder: true

Pattern 4: Training Materials

Use the generate_pdf_documents MCP tool:

catalog: "ai_dev_kit"
schema: "training"
description: "Training materials for new software developers including onboarding guides, coding standards, code review procedures, and deployment workflows."
count: 8
folder: "courses"
overwrite_folder: true

Workflow

Ask for destination: Default to ai_dev_kit catalog, ask user for schema name
Get description: Ask what kind of documents they need
Generate PDFs: Call generate_pdf_documents MCP tool with appropriate parameters
Verify output: Check the volume path for generated files

Best Practices

Detailed descriptions: The more specific your description, the better the generated content
- BAD: "Generate some HR documents"
- GOOD: "HR policy documents for a technology company including employee handbook covering remote work policies, leave policies with PTO and sick leave details, performance review procedures with quarterly and annual cycles, and workplace conduct guidelines"
Appropriate count:
- For demos: 5-10 documents
- For RAG testing: 15-30 documents
- For comprehensive evaluation: 50+ documents
Folder organization: Use descriptive folder names that indicate content type
- hr_policies/
- technical_docs/
- training_materials/
Use overwrite_folder: Set to true when regenerating to ensure clean state

Integration with RAG Pipelines

The generated JSON files are designed for RAG evaluation:

Ingest PDFs: Use the PDF files as source documents for your vector database
Test retrieval: Use the question field to query your RAG system
Evaluate answers: Use the guideline field to assess if the RAG response is correct

Example evaluation workflow:

# Load questions from JSON files
questions = load_json_files(f"/Volumes/{catalog}/{schema}/{volume}/{folder}/*.json")

for q in questions:
    # Query RAG system
    response = rag_system.query(q["question"])

    # Evaluate using guideline
    is_correct = evaluate_response(response, q["guideline"])

Environment Configuration

The tool requires LLM configuration via environment variables:

# Databricks Foundation Models (default)
LLM_PROVIDER=DATABRICKS
DATABRICKS_MODEL=databricks-meta-llama-3-3-70b-instruct

# Or Azure OpenAI
LLM_PROVIDER=AZURE
AZURE_OPENAI_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_DEPLOYMENT=gpt-4o

Common Issues

Issue	Solution
"No LLM endpoint configured"	Set `DATABRICKS_MODEL` or `AZURE_OPENAI_DEPLOYMENT` environment variable
"Volume does not exist"	The tool creates volumes automatically; ensure you have CREATE VOLUME permission
"PDF generation timeout"	Reduce `count` or check LLM endpoint availability
Low quality content	Provide more detailed `description` with specific topics and document types

Related Skills

databricks-agent-bricks - Create Knowledge Assistants that ingest the generated PDFs
databricks-vector-search - Index generated documents for semantic search and RAG
databricks-synthetic-data-generation - Generate structured tabular data (complement to unstructured PDFs)
databricks-mlflow-evaluation - Evaluate RAG systems using the generated question/guideline pairs

databricks-unstructured-pdf-generationSafety --Repository

Package Files

Unstructured PDF Generation

Overview

Quick Start

With Custom Location

Parameters

Document Size Guide

Output Files

JSON Structure

Common Patterns

Pattern 1: HR Policy Documents

Pattern 2: Technical Documentation

Pattern 3: Financial Reports

Pattern 4: Training Materials

Workflow

Best Practices

Integration with RAG Pipelines

Environment Configuration

Common Issues

Related Skills

Install

AI Quality Score

Metadata

Tags

databricks-unstructured-pdf-generationSafety --Repository ShareFavorite skill

Package Files

Unstructured PDF Generation

Overview

Quick Start

With Custom Location

Parameters

Document Size Guide

Output Files

JSON Structure

Common Patterns

Pattern 1: HR Policy Documents

Pattern 2: Technical Documentation

Pattern 3: Financial Reports

Pattern 4: Training Materials

Workflow

Best Practices

Integration with RAG Pipelines

Environment Configuration

Common Issues

Related Skills

Install

AI Quality Score

Metadata

Tags

databricks-unstructured-pdf-generationSafety --Repository