Scientific Data Extraction Skill

Overview

This skill provides comprehensive guidance for extracting structured data from scientific literature across multiple input formats (PDF, HTML, images, plain text). It auto-detects the scientific domain to recommend specialized tools when appropriate (particularly for chemistry and materials science) and employs a hierarchical extraction approach with multi-method validation for high-confidence results.

When to Use This Skill

Use this skill when you need to:

Extract numerical data from scientific papers, reports, or documents
Digitize graphs and plots to recover underlying data points
Parse tables from PDFs or images into structured formats (CSV, DataFrame, JSON)
Extract chemical/materials data including properties, reactions, compounds, and structures
Convert unstructured text to structured JSON or tabular formats
Validate extracted data through multi-method cross-checking
Process document batches with consistent extraction methodology

Input Format Detection

The first step is identifying the input format and routing to appropriate tools:

Plain Text (.txt, .md)

Domain detection via keyword analysis
NLP-based entity extraction (spaCy, Stanza)
Regex patterns for structured data (numbers with units, chemical formulas)
LLM-based structured extraction
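
As a rough illustration of the regex route for numbers with units, the sketch below pulls value, optional uncertainty, and unit out of a sentence. The unit list, the pattern, and the function name find_values are assumptions made for this example, not part of any library, and a real pipeline would need a much larger unit vocabulary.

```python
import re

# Illustrative unit list (an assumption); extend it for the properties you need.
UNITS = r"(?:eV|K|°C|GPa|MPa|nm|S/cm|%)"
VALUE_UNIT = re.compile(
    r"(?P<value>[-+]?\d+(?:\.\d+)?)"             # number, optional sign and decimals
    r"(?:\s*±\s*(?P<error>\d+(?:\.\d+)?))?"      # optional "± uncertainty"
    r"\s*(?P<unit>" + UNITS + r")(?![A-Za-z])"   # unit, not followed by a letter
)

def find_values(text):
    """Return a list of dicts with value, error (or None), and unit."""
    results = []
    for m in VALUE_UNIT.finditer(text):
        results.append({
            "value": float(m.group("value")),
            "error": float(m.group("error")) if m.group("error") else None,
            "unit": m.group("unit"),
        })
    return results

sample = "The bandgap of TiO2 was 3.2 ± 0.1 eV at 300 K."
hits = find_values(sample)
```

The digit in "TiO2" is skipped because no unit follows it; associating each hit with the right compound still needs NER or an LLM pass.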

HTML (.html, web pages)

HTML parsing with BeautifulSoup + lxml
Table detection and extraction
Text content extraction with structure preservation
Domain-specific processing after text extraction

PDF (.pdf)

Priority	Tool	Speed	Use Case
Quick	PyMuPDF4LLM	~0.12s	Initial exploration, large batches
Standard	GROBID	Medium	Research-grade, reference parsing
Standard	Docling	Medium	Layout-aware, complex documents
Tables	Camelot	Fast	Bordered tables
Tables	Tabula	Fast	General tables
Tables	pdfplumber	Medium	Complex table structures
Deep	Marker-PDF	Slower	Scanned documents with OCR

Images (.png, .jpg, .tiff)

Content Type	Recommended Approach
Document scan	OCR (Tesseract/Surya) then text pipeline
Graph/Plot	WebPlotDigitizer workflow or LLM vision
Table image	Table Transformer or LLM vision
Chemical structure	OSRA/MolVec for SMILES conversion
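
Format routing can be sketched as a content check followed by an extension lookup. The function name detect_format, the returned labels, and the optional head parameter are assumptions for illustration; the byte signatures are the standard ones for PDF, PNG, JPEG, and TIFF.

```python
from pathlib import Path

EXTENSION_MAP = {
    ".txt": "text", ".md": "text",
    ".html": "html", ".htm": "html",
    ".pdf": "pdf",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".tif": "image", ".tiff": "image",
}

# Leading bytes ("magic numbers") for formats with a fixed signature.
MAGIC_BYTES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "image"),
    (b"\xff\xd8\xff", "image"),   # JPEG
    (b"II*\x00", "image"),        # TIFF, little-endian
    (b"MM\x00*", "image"),        # TIFF, big-endian
]

def detect_format(path, head=None):
    """Guess the input format from file content first, then from the extension.

    head lets the caller pass the first bytes directly (useful for streams);
    when it is None the bytes are read from path.
    """
    if head is None:
        with open(path, "rb") as fh:
            head = fh.read(16)
    for signature, fmt in MAGIC_BYTES:
        if head.startswith(signature):
            return fmt
    return EXTENSION_MAP.get(Path(path).suffix.lower(), "unknown")
```

Checking content before the extension catches mislabeled files, such as a PDF saved with a .txt suffix.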

Domain Detection

The skill automatically detects scientific domain to apply specialized tools:

Chemistry/Materials Indicators

Chemical formulas (H2O, NaCl, TiO2)
SMILES strings, InChI identifiers
Reaction arrows (→, ⟶, ⇌)
Property keywords: melting point, bandgap, conductivity, yield, purity
Material names and IUPAC nomenclature
Spectroscopic data patterns (NMR shifts, IR peaks)
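
One way to turn the indicators above into a detector is to count how many distinct indicator types appear in the text. This is a minimal sketch: the patterns, the two-indicator threshold, and the name detect_domain are assumptions. SMILES detection is left out because it is hard to do reliably with a regex, and the formula pattern requires a digit, so it misses digit-free formulas such as NaCl in exchange for fewer false positives on acronyms.

```python
import re

CHEM_PATTERNS = {
    # Two or more element-like tokens containing a digit: H2O, TiO2 (not "PDF").
    "formula": re.compile(r"\b(?=[A-Za-z]*\d)(?:[A-Z][a-z]?\d*){2,}\b"),
    "inchi": re.compile(r"InChI=1S?/"),
    "reaction_arrow": re.compile(r"[→⟶⇌]"),
    "property_keyword": re.compile(
        r"\b(melting point|bandgap|conductivity|yield|purity)\b", re.IGNORECASE),
    "nmr_shift": re.compile(r"δ\s*\d+(?:\.\d+)?"),
}

def detect_domain(text, min_indicators=2):
    """Return (domain, matched) where matched lists the indicator types found."""
    matched = [name for name, pattern in CHEM_PATTERNS.items()
               if pattern.search(text)]
    domain = "chemistry" if len(matched) >= min_indicators else "general"
    return domain, matched
```

Requiring two different indicator types keeps a lone word such as "yield" in an agronomy paper from triggering the chemistry tools.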

When Chemistry/Materials Detected

Apply specialized tools:

ChemDataExtractor v2: Property extraction, entity recognition, table parsing
OpenChemIE: Reaction extraction from text, tables, and figures
Domain-specific NER: Chemical named entity recognition

General Scientific Domain

Use general-purpose extraction:

Standard NLP pipelines
LLM-based structured extraction
Template-based parsing

Extraction Method Hierarchy

Apply methods in order of increasing complexity based on requirements:

Level 1: Quick Extraction (Speed Priority)

When to use: Initial exploration, large document batches, simple structured data

# Quick PDF to text with PyMuPDF4LLM
import pymupdf4llm
text = pymupdf4llm.to_markdown("paper.pdf")

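# Quick HTML parsing
from bs4 import BeautifulSoup
with open("paper.html", encoding="utf-8") as fh:
    html_content = fh.read()
soup = BeautifulSoup(html_content, 'lxml')
tables = soup.find_all('table')  # list of <table> elements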

Expected confidence: Lower, suitable for screening

Level 2: Standard Extraction (Balanced)

When to use: Research-grade extraction, structure preservation needed

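# GROBID for structured PDF parsing (scipdf_parser client; needs a running GROBID server)
import scipdf  # the scipdf_parser project installs the module as "scipdf"
article = scipdf.parse_pdf_to_dict("paper.pdf")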

# Docling for layout-aware extraction
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("paper.pdf")

# Camelot for bordered tables
import camelot
tables = camelot.read_pdf("paper.pdf", flavor='lattice')
df = tables[0].df

Expected confidence: Medium-high

Level 3: Deep Extraction (Accuracy Priority)

When to use: Publication-quality data, domain-specific extraction

# ChemDataExtractor for chemistry documents
from chemdataextractor import Document
doc = Document.from_file("paper.pdf")
records = doc.records

# OpenChemIE for reaction extraction from the text of a PDF
from openchemie import OpenChemIE
model = OpenChemIE()
reactions = model.extract_reactions_from_text_in_pdf("paper.pdf")

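# Marker-PDF with OCR for scanned documents
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
converter = PdfConverter(artifact_dict=create_model_dict())  # loads layout/OCR models
result = converter("scanned_paper.pdf")
text, _, images = text_from_rendered(result)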

Expected confidence: High

Level 4: LLM-Enhanced Extraction

When to use: Complex figures, ambiguous data, validation needed

# LLM-based structured extraction
# Fill the placeholder with text_prompt.format(text=document_text) before sending.
text_prompt = """
Extract all numerical data from this text as JSON:
- Property name
- Value (number only)
- Unit
- Context (what material/compound)

Text: {text}
"""

# LLM vision for graph interpretation (send together with the graph image)
graph_prompt = """
Analyze this graph image and extract:
1. X-axis label and range
2. Y-axis label and range
3. All data points as (x, y) pairs
4. Any error bars or uncertainty indicators
"""

Expected confidence: Highest when combined with validation
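
The four levels above can be driven by a small escalation loop that tries the cheapest method first and stops at the first acceptable result. This is only a sketch: extract_with_escalation, the (name, callable) pairs, and the is_good_enough callback are hypothetical names, and the acceptance test is whatever quality check fits your data.

```python
def extract_with_escalation(source, extractors, is_good_enough):
    """Try extractors in order (cheapest first); stop at the first acceptable result.

    extractors is a list of (name, callable) pairs; each callable takes the
    source and returns an extraction result. Returns (result, attempts), where
    attempts records what happened at each level for the methodology log.
    """
    attempts = []
    for name, extractor in extractors:
        try:
            result = extractor(source)
        except Exception as exc:  # a failing tool should not stop the pipeline
            attempts.append((name, f"failed: {exc}"))
            continue
        if is_good_enough(result):
            attempts.append((name, "accepted"))
            return result, attempts
        attempts.append((name, "rejected"))
    return None, attempts
```

In practice the callables would wrap the Level 1 to Level 3 tools shown above, and the attempts list can be stored alongside the output to document which method produced each result.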

Multi-Method Validation Pipeline

For high-confidence results, use multiple extraction methods and validate:

Step 1: Primary Extraction

Select method based on input type and domain, extract structured data.

Step 2: Secondary Extraction

Run alternative method on same source, compare results and flag discrepancies.
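
The comparison in Step 2 might look like the sketch below, assuming each method's output has already been reduced to a dictionary keyed by (entity, property). The function name compare_extractions and the 1% relative tolerance are assumptions chosen for illustration.

```python
import math

def compare_extractions(primary, secondary, rel_tol=0.01):
    """Compare two {(entity, property): value} dicts from different methods.

    Returns (agreed, discrepancies): keys whose values match within rel_tol,
    and a list of dicts describing every mismatch or missing value.
    """
    agreed, discrepancies = [], []
    for key in sorted(set(primary) | set(secondary)):
        a, b = primary.get(key), secondary.get(key)
        if a is None or b is None:
            discrepancies.append({"key": key, "primary": a, "secondary": b,
                                  "reason": "missing in one method"})
        elif math.isclose(a, b, rel_tol=rel_tol):
            agreed.append(key)
        else:
            discrepancies.append({"key": key, "primary": a, "secondary": b,
                                  "reason": "values differ"})
    return agreed, discrepancies
```

A value found by only one method is treated as a discrepancy rather than silently kept, so it ends up in front of a reviewer.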

Step 3: LLM Verification Queries

Ask targeted questions to verify extracted data:

"Is this value X consistent with the context Y?"
"Does unit Z make sense for property P?"
"Are there any missing data points in the expected range?"

Step 4: Confidence Scoring

confidence = {
    "score": 0.0,  # 0-1 scale
    "level": "HIGH|MEDIUM|LOW|REVIEW",
    "methods_agreed": [],  # List of methods that produced same result
    "discrepancies": [],   # Any disagreements between methods
    "verification_notes": ""  # LLM verification outcome
}

# Scoring rules:
# - Single method: max 0.7
# - Two methods agree: 0.8
# - Two methods + LLM verification: 0.9
# - Multiple methods + LLM + database cross-reference: 0.95+
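
The scoring rules above translate into a small function. The scores come from the rules; the mapping from score to level is an assumption, since the rules do not say where HIGH, MEDIUM, LOW, and REVIEW begin, and a single method is simply given its 0.7 cap here.

```python
def score_confidence(n_methods_agreed, llm_verified=False, db_confirmed=False):
    """Return (score, level) for one extracted value."""
    if n_methods_agreed >= 2 and llm_verified and db_confirmed:
        score = 0.95
    elif n_methods_agreed >= 2 and llm_verified:
        score = 0.9
    elif n_methods_agreed >= 2:
        score = 0.8
    elif n_methods_agreed == 1:
        score = 0.7  # single-method cap
    else:
        score = 0.0  # nothing extracted

    # Assumed thresholds for the level label (not defined by the rules above).
    if score >= 0.9:
        level = "HIGH"
    elif score >= 0.8:
        level = "MEDIUM"
    elif score >= 0.7:
        level = "LOW"
    else:
        level = "REVIEW"
    return score, level
```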

Step 5: Database Cross-Reference (Optional)

For chemistry/materials, compare against known databases:

Materials Project
AFLOW
PubChem
NIST databases

Flag significant deviations from expected ranges.
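
Once a reference range has been looked up, flagging can be as simple as the sketch below. The function name check_against_reference, the three labels, and the 10% margin are assumptions; the database lookup itself is not shown.

```python
def check_against_reference(value, low, high, margin=0.1):
    """Classify a value against a reference range [low, high].

    margin widens the range by a fraction of its width; values inside the
    widened band are "borderline", anything further out is flagged.
    """
    width = high - low
    if low <= value <= high:
        return "within_range"
    if low - margin * width <= value <= high + margin * width:
        return "borderline"
    return "significant_deviation"
```

For instance, with a reference range of 3.0 to 3.4 eV for the TiO2 bandgap, an extracted 3.2 falls within range, while an extracted 32.0 (a typical dropped decimal point) is flagged as a significant deviation.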

Output Format

Structure extracted data consistently:

{
  "extraction_metadata": {
    "source": "path/to/document.pdf",
    "source_type": "pdf",
    "domain_detected": "chemistry",
    "methods_used": ["grobid", "chemdataextractor", "llm_verification"],
    "timestamp": "2025-01-18T..."
  },
  "extracted_data": [
    {
      "data_type": "material_property",
      "entity": "TiO2",
      "property": "bandgap",
      "value": 3.2,
      "unit": "eV",
      "source_location": {
        "page": 4,
        "section": "Results",
        "table_id": "Table 2",
        "row": 3
      },
      "confidence": {
        "score": 0.95,
        "level": "HIGH",
        "methods_agreed": ["chemdataextractor", "llm_extraction"],
        "verification_notes": "Value consistent with literature range 3.0-3.4 eV"
      }
    }
  ],
  "validation_summary": {
    "total_extracted": 47,
    "high_confidence": 38,
    "medium_confidence": 7,
    "needs_review": 2,
    "discrepancies": []
  }
}
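
The validation_summary block can be derived from the records rather than filled in by hand. The sketch below assumes each record carries a confidence object shaped like the one in Step 4, and it counts both LOW and REVIEW as needs_review, which is an assumption: the example output has no separate low-confidence bucket.

```python
from collections import Counter

def summarize(records):
    """Build a validation_summary dict from a list of extracted records."""
    levels = Counter(r["confidence"]["level"] for r in records)
    return {
        "total_extracted": len(records),
        "high_confidence": levels["HIGH"],
        "medium_confidence": levels["MEDIUM"],
        # Assumption: LOW and REVIEW are both routed to manual review.
        "needs_review": levels["LOW"] + levels["REVIEW"],
        "discrepancies": [d for r in records
                          for d in r["confidence"].get("discrepancies", [])],
    }
```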

Step-by-Step Instructions

For PDF Data Extraction

Identify document type: Scanned or text-based PDF
Choose extraction level: Based on accuracy requirements
Detect domain: Check for chemistry/materials indicators
Extract text/structure: Use appropriate tool from hierarchy
Extract tables separately: Use Camelot, Tabula, or pdfplumber
Apply domain tools: If chemistry detected, use ChemDataExtractor
Validate: Run secondary extraction or LLM verification
Format output: Structure as JSON with confidence scores

For Graph/Plot Digitization

Assess graph quality: Resolution, clarity, labeling
Identify graph type: Line plot, scatter, bar chart, contour
Choose method:
- Simple, clear graphs: WebPlotDigitizer (manual calibration)
- Complex or batch: LLM vision extraction
Calibrate axes: Define coordinate system
Extract data points: Manual selection or automatic detection
Validate: Check extracted points against visual inspection
Export: CSV or JSON format with uncertainty estimates
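
The axis calibration step amounts to a linear map from pixel position to data value, built from two known points per axis (taken in log space for logarithmic axes). The sketch below shows the arithmetic; make_axis_mapper and the calibration numbers are made up for illustration.

```python
import math

def make_axis_mapper(pixel_a, value_a, pixel_b, value_b, log_scale=False):
    """Build a pixel -> data-value function from two calibration points on one axis."""
    if pixel_a == pixel_b:
        raise ValueError("calibration points must be at different pixel positions")
    if log_scale:
        # Interpolate in log10 space, then convert back.
        value_a, value_b = math.log10(value_a), math.log10(value_b)
    slope = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel):
        v = value_a + slope * (pixel - pixel_a)
        return 10 ** v if log_scale else v

    return to_value

# Made-up calibration: the x axis runs 0..10 between pixel columns 100 and 600;
# the y axis runs 0..1 between pixel rows 400 (bottom) and 50 (top), because
# image row numbers grow downward.
x_of = make_axis_mapper(100, 0.0, 600, 10.0)
y_of = make_axis_mapper(400, 0.0, 50, 1.0)
point = (x_of(350), y_of(225))
```

Here the clicked pixel (350, 225) sits halfway along both axes. The uncertainty of each recovered point is at least the data value of one pixel, abs(slope), which is one reasonable basis for the uncertainty estimates mentioned in the export step.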

For Table Extraction

Identify table type: Bordered (lattice) or borderless (stream)
Choose tool:
- Bordered: Camelot with flavor='lattice'
- Borderless: Tabula or Camelot with flavor='stream'
- Complex: pdfplumber for fine-grained control
Extract to DataFrame: Review structure and headers
Clean data: Fix merged cells, missing values, formatting
Apply domain parsing: Convert units, parse chemical formulas
Validate: Compare against source visually
Export: CSV, JSON, or integrate into dataset
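
The cleaning step often comes down to two operations: carrying a merged cell's label down into the blank rows beneath it, and turning placeholder strings into real missing values. The sketch below does both on plain lists of rows, assuming merged cells arrive as empty strings; clean_rows and the MISSING set are illustrative names, and on a pandas DataFrame the forward-fill part corresponds to ffill.

```python
# Placeholder strings treated as missing (an illustrative, incomplete set).
MISSING = {"", "-", "–", "n/a", "na", "nd"}

def clean_rows(rows, fill_columns=(0,)):
    """Forward-fill blank cells in fill_columns and map missing markers to None."""
    cleaned, last_seen = [], {}
    for row in rows:
        new_row = []
        for i, cell in enumerate(row):
            text = (cell or "").strip()
            if i in fill_columns:
                if text:
                    last_seen[i] = text            # remember the latest label
                else:
                    text = last_seen.get(i, "")    # reuse it for merged cells
                new_row.append(text or None)
            else:
                new_row.append(None if text.lower() in MISSING else text)
        cleaned.append(new_row)
    return cleaned

# Made-up table: the first column was a merged cell spanning two rows.
raw = [
    ["sample A", "run 1", "1.0"],
    ["",         "run 2", "2.0"],
    ["sample B", "run 1", "n/a"],
]
table = clean_rows(raw)
```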

For Chemistry/Materials Extraction

Confirm domain: Verify chemistry/materials content
Choose specialized tool:
- Properties: ChemDataExtractor v2
- Reactions: OpenChemIE
- Structures from images: OSRA/MolVec
Configure extraction: Set up parsers for target properties
Run extraction: Process document with domain tools
Post-process: Normalize units, standardize identifiers
Cross-reference: Compare against databases (Materials Project, PubChem)
Validate: LLM verification of unusual values
Export: Structured JSON with confidence scores
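
Unit normalization in the post-processing step can be sketched as a lookup table of conversions to one canonical unit per quantity. The table below covers only a few units and the choice of canonical units is an assumption; normalize_unit is a hypothetical helper, not part of ChemDataExtractor.

```python
# unit -> (canonical unit, conversion function); deliberately incomplete.
CONVERSIONS = {
    "K":   ("K",   lambda v: v),
    "°C":  ("K",   lambda v: v + 273.15),
    "eV":  ("eV",  lambda v: v),
    "meV": ("eV",  lambda v: v / 1000.0),
    "GPa": ("GPa", lambda v: v),
    "MPa": ("GPa", lambda v: v / 1000.0),
}

def normalize_unit(value, unit):
    """Return (value, unit, converted) with the value in its canonical unit.

    Unknown units are passed through unchanged with converted=False so they
    can be flagged for review instead of being silently dropped.
    """
    if unit not in CONVERSIONS:
        return value, unit, False
    canonical, convert = CONVERSIONS[unit]
    return convert(value), canonical, True
```

Keeping the original value and unit alongside the normalized pair in the output record preserves traceability back to the source.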

Best Practices

Always start with format detection - Correct tool selection depends on accurate format identification
Use the simplest method that works - Start at Level 1 and escalate only if needed
Preserve source location - Track page numbers, sections, table IDs for traceability
Validate unusual values - Any value outside expected ranges should be flagged and verified
Document extraction methodology - Record which tools and settings produced each data point
Handle uncertainty explicitly - Include error bounds when available, note when values are approximate
Cross-reference chemistry data - Compare against known databases as a sanity check whenever a matching entry exists
Use LLM verification judiciously - Most valuable for complex figures and ambiguous cases

Requirements

Core Python Packages

pymupdf4llm: Quick PDF extraction
pdfplumber: Detailed PDF analysis
camelot-py: Table extraction (requires ghostscript)
beautifulsoup4, lxml: HTML parsing
spacy: NLP processing
pandas: Data manipulation

Domain-Specific (Chemistry)

chemdataextractor: Chemistry NLP (v2 recommended)
openchemie: Reaction extraction

Optional

tabula-py: Table extraction (requires Java)
grobid (server): Academic PDF parsing
docling: IBM document converter
marker-pdf: OCR-capable PDF conversion
tesseract or surya: OCR engines

Limitations

Scanned documents require OCR - Quality depends on scan resolution and OCR accuracy
Complex table structures - Merged cells, nested headers may require manual correction
Graph digitization is approximate - Precision limited by image resolution and calibration
Domain tools are specialized - Chemistry tools won't work well on biology or physics texts
LLM extraction can hallucinate - Always validate with source or alternative method
Some PDFs resist extraction - DRM-protected files may not be extractable at all, and image-only files yield no text until OCR is applied

Related Skills

literature-review: For systematic literature searching and synthesis
scientific-reviewer: For evaluating extracted data quality
materials-databases: For cross-referencing extracted chemistry/materials data
python-plotting: For visualizing extracted data

References

See the references/ directory for detailed documentation on:

pdf-tools.md: Comprehensive PDF extraction tool comparison
table-extraction.md: Table extraction methods and code examples
graph-digitization.md: Graph data extraction techniques
chemistry-tools.md: ChemDataExtractor and OpenChemIE usage
llm-extraction.md: LLM-based extraction patterns and validation

See the examples/ directory for complete workflows:

extract-from-pdf.md: End-to-end PDF extraction example
extract-table-data.md: Table extraction comparison
digitize-graph.md: Graph digitization guide
chemistry-extraction.md: Chemistry-specific extraction workflow
