# Extraction Pipeline Design
This document describes the extraction pipeline for converting raw PDF and HTML documents into structured data.
## Objectives
- Extract text content preserving semantic structure
- Identify and parse tabular data
- Handle multi-column layouts and complex formatting
- Normalize encoding and special characters
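The encoding-normalization objective can be sketched with Python's standard library alone. This is a minimal illustration, not the pipeline's actual implementation; the function name `normalize_text` is hypothetical.

```python
import unicodedata


def normalize_text(raw: str) -> str:
    """Normalize extracted text: Unicode NFKC, strip control chars, tidy whitespace."""
    # NFKC folds compatibility characters (ligatures, full-width forms) to canonical forms
    text = unicodedata.normalize("NFKC", raw)
    # Drop non-printable control characters except newlines and tabs
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse intra-line whitespace runs left behind by layout extraction; drop blank lines
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

For example, NFKC turns the ligature `ﬁ` into the two letters `fi`, so downstream matching does not depend on how the PDF happened to encode the glyph.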
## PDF Extraction Pipeline
```mermaid
flowchart TD
    PDF[PDF Files]
    Layout[\LayoutLM/] -.- DLN{{DocLayNet}}
    EPDF[Structured PDF Files]
    PDF ~~~ DLN ~~~ PExt
    PExt>Raw Extraction]
    PNorm>Primary Normalization]
    PCorr>Structural Cleanup]
    PRep>Repair]
    POut@{ shape: lin-rect, label: "Data With Syntax Repaired" }
    PDF -.-> Layout -.-> EPDF -.-> PExt
    PDF ==> PExt --> PNorm --> PCorr --> PRep --> POut
```
### Stage Descriptions
| Stage | Input | Output | Tools |
|---|---|---|---|
| Raw Extraction | PDF binary | Raw text + positions | PyMuPDF |
| Primary Normalization | Raw text | UTF-8 normalized | Python stdlib |
| Structural Cleanup | Normalized text | Logical sections | Custom rules |
| Repair | Sections | Valid structured data | Validation |
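The four stages in the table compose naturally as plain functions. Below is a hedged sketch of that composition: the stage names mirror the table, but the bodies are placeholders (real raw extraction would go through PyMuPDF's `fitz` API, which is omitted here so the example stays self-contained).

```python
import unicodedata
from functools import reduce


def raw_extraction(data: str) -> str:
    # Placeholder: the real stage would open the PDF with PyMuPDF and pull text + positions
    return data


def primary_normalization(text: str) -> str:
    # UTF-8 / Unicode normalization via the stdlib, per the table's "Python stdlib" tooling
    return unicodedata.normalize("NFKC", text)


def structural_cleanup(text: str) -> str:
    # Illustrative rule: treat blank lines as section boundaries
    return "\n".join(seg.strip() for seg in text.split("\n\n") if seg.strip())


def repair(text: str) -> str:
    # Validation hook: refuse to emit an empty document
    assert text, "empty document after cleanup"
    return text


PIPELINE = [raw_extraction, primary_normalization, structural_cleanup, repair]


def run_pipeline(data: str) -> str:
    """Thread the input through every stage in order."""
    return reduce(lambda acc, stage: stage(acc), PIPELINE, data)
```

Keeping each stage as a pure function makes it easy to test stages in isolation and to reuse the same driver for the HTML pipeline below.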
## HTML Extraction Pipeline
```mermaid
flowchart TD
    HTML[HTML Files]
    DOM[/Layout DOM\] -.- DLD{{DOMLayoutData}}
    EHTML[Structured HTML Files]
    HTML ~~~ DLD ~~~ HExt
    HExt>Raw Extraction]
    HNorm>Primary Normalization]
    HCorr>Structural Cleanup]
    HRep>Repair]
    HOut@{ shape: lin-rect, label: "Data With Syntax Repaired" }
    HTML -.-> DOM -.-> EHTML -.-> HExt
    HTML ==> HExt --> HNorm --> HCorr --> HRep --> HOut
```
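The HTML raw-extraction stage can be illustrated with the stdlib `html.parser` module. This is a simplified sketch, not the pipeline's actual extractor: the block-tag set and the `TextExtractor` class name are illustrative choices, and a production version would rely on the layout DOM shown in the diagram.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style, preserving block boundaries."""

    SKIP = {"script", "style"}                        # invisible content
    BLOCK = {"p", "div", "h1", "h2", "h3", "li", "tr"}  # tags that start a new line

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

    def text(self) -> str:
        lines = "".join(self.parts).splitlines()
        return "\n".join(" ".join(ln.split()) for ln in lines if ln.strip())


def extract_html_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()
```

Mapping block-level tags to line breaks is what preserves the "semantic structure" objective: paragraph and heading boundaries survive into the plain-text output instead of being flattened into one run of text.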
## Known Challenges
- **Scanned PDFs**: image-only pages require OCR preprocessing before text extraction
- **Multi-column layouts**: column detection can fail, interleaving text across columns
- **Tables spanning pages**: header rows are hard to re-associate across page breaks
- **Special characters**: pharmaceutical symbols and units (e.g. µg) are often mis-encoded
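The special-characters challenge is often handled with an explicit replacement table. The sketch below is illustrative only: the mapping is a small sample, not an exhaustive list of the symbols the pipeline actually handles, and `normalize_symbols` is a hypothetical name.

```python
# Sample of symbols commonly mangled or ambiguous in extracted pharmaceutical text
SYMBOL_MAP = {
    "\u00b5": "\u03bc",  # micro sign -> Greek mu, so "µg" and "μg" compare equal
    "\u2212": "-",       # Unicode minus sign -> ASCII hyphen-minus
    "\u2264": "<=",      # less-than-or-equal
    "\u2265": ">=",      # greater-than-or-equal
    "\u00b0": " deg ",   # degree sign
}


def normalize_symbols(text: str) -> str:
    """Replace ambiguous symbols with canonical spellings, one mapping at a time."""
    for src, dst in SYMBOL_MAP.items():
        text = text.replace(src, dst)
    return text
```

Collapsing the micro sign (U+00B5) and Greek mu (U+03BC) to one codepoint matters in practice: dosages like "5 µg" appear with either character depending on the source document's fonts.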
Implementation: see `drugslm.process` for code documentation.