# Extraction Pipeline Design
This document describes the extraction pipeline for converting raw PDF and HTML documents into structured data.
## Objectives
- Extract text content preserving semantic structure
- Identify and parse tabular data
- Handle multi-column layouts and complex formatting
- Normalize encoding and special characters
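The encoding-normalization objective can be sketched with Python's standard library alone. This is a minimal illustration, not the pipeline's actual implementation; the function name `normalize_text` is hypothetical.

```python
import unicodedata


def normalize_text(raw: str) -> str:
    """Normalize extracted text: Unicode NFKC, strip control chars, tidy whitespace."""
    # NFKC folds compatibility characters (ligatures, full-width forms) to canonical forms
    text = unicodedata.normalize("NFKC", raw)
    # Drop non-printable control characters except newlines and tabs
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse intra-line whitespace runs left behind by layout extraction; drop blank lines
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

For example, NFKC turns the ligature `ﬁ` into the two letters `fi`, so downstream matching does not depend on how the PDF happened to encode the glyph.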
## PDF Extraction Pipeline
```mermaid
flowchart TD
    PDF[PDF Files]
    Layout[\LayoutLM/] -.- DLN{{DocLayNet}}
    EPDF[Structured PDF Files]
    PDF ~~~ DLN ~~~ PExt
    PExt>Raw Extraction]
    PNorm>Primary Normalization]
    PCorr>Structural Cleanup]
    PRep>Repair]
    POut@{ shape: lin-rect, label: "Data With Syntax Repaired" }
    PDF -.-> Layout -.-> EPDF -.-> PExt
    PDF ==> PExt --> PNorm --> PCorr --> PRep --> POut
```
### Stage Descriptions
| Stage | Input | Output | Tools |
|---|---|---|---|
| Raw Extraction | PDF binary | Raw text + positions | PyMuPDF |
| Primary Normalization | Raw text | UTF-8 normalized | Python stdlib |
| Structural Cleanup | Normalized text | Logical sections | Custom rules |
| Repair | Sections | Valid structured data | Validation |
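The four stages in the table compose naturally as plain functions. Below is a hedged sketch of that composition: the stage names mirror the table, but the bodies are placeholders (real raw extraction would go through PyMuPDF's `fitz` API, which is omitted here so the example stays self-contained).

```python
import unicodedata
from functools import reduce


def raw_extraction(data: str) -> str:
    # Placeholder: the real stage would open the PDF with PyMuPDF and pull text + positions
    return data


def primary_normalization(text: str) -> str:
    # UTF-8 / Unicode normalization via the stdlib, per the table's "Python stdlib" tooling
    return unicodedata.normalize("NFKC", text)


def structural_cleanup(text: str) -> str:
    # Illustrative rule: treat blank lines as section boundaries
    return "\n".join(seg.strip() for seg in text.split("\n\n") if seg.strip())


def repair(text: str) -> str:
    # Validation hook: refuse to emit an empty document
    assert text, "empty document after cleanup"
    return text


PIPELINE = [raw_extraction, primary_normalization, structural_cleanup, repair]


def run_pipeline(data: str) -> str:
    """Thread the input through every stage in order."""
    return reduce(lambda acc, stage: stage(acc), PIPELINE, data)
```

Keeping each stage as a pure function makes it easy to test stages in isolation and to reuse the same driver for the HTML pipeline below.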
## HTML Extraction Pipeline
```mermaid
flowchart TD
    HTML[HTML Files]
    DOM[/Layout DOM\] -.- DLD{{DOMLayoutData}}
    EHTML[Structured HTML Files]
    HTML ~~~ DLD ~~~ HExt
    HExt>Raw Extraction]
    HNorm>Primary Normalization]
    HCorr>Structural Cleanup]
    HRep>Repair]
    HOut@{ shape: lin-rect, label: "Data With Syntax Repaired" }
    HTML -.-> DOM -.-> EHTML -.-> HExt
    HTML ==> HExt --> HNorm --> HCorr --> HRep --> HOut
```
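The HTML raw-extraction stage can be illustrated with the stdlib `html.parser` module. This is a simplified sketch, not the pipeline's actual extractor: the block-tag set and the `TextExtractor` class name are illustrative choices, and a production version would rely on the layout DOM shown in the diagram.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style, preserving block boundaries."""

    SKIP = {"script", "style"}                        # invisible content
    BLOCK = {"p", "div", "h1", "h2", "h3", "li", "tr"}  # tags that start a new line

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

    def text(self) -> str:
        lines = "".join(self.parts).splitlines()
        return "\n".join(" ".join(ln.split()) for ln in lines if ln.strip())


def extract_html_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()
```

Mapping block-level tags to line breaks is what preserves the "semantic structure" objective: paragraph and heading boundaries survive into the plain-text output instead of being flattened into one run of text.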
## Known Challenges
- **Scanned PDFs**: image-only pages require OCR preprocessing before text extraction
- **Multi-column layouts**: column detection can fail, interleaving text across columns
- **Tables spanning pages**: header rows are hard to re-associate across page breaks
- **Special characters**: pharmaceutical symbols and units (e.g. µg) are often mis-encoded
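The special-characters challenge is often handled with an explicit replacement table. The sketch below is illustrative only: the mapping is a small sample, not an exhaustive list of the symbols the pipeline actually handles, and `normalize_symbols` is a hypothetical name.

```python
# Sample of symbols commonly mangled or ambiguous in extracted pharmaceutical text
SYMBOL_MAP = {
    "\u00b5": "\u03bc",  # micro sign -> Greek mu, so "µg" and "μg" compare equal
    "\u2212": "-",       # Unicode minus sign -> ASCII hyphen-minus
    "\u2264": "<=",      # less-than-or-equal
    "\u2265": ">=",      # greater-than-or-equal
    "\u00b0": " deg ",   # degree sign
}


def normalize_symbols(text: str) -> str:
    """Replace ambiguous symbols with canonical spellings, one mapping at a time."""
    for src, dst in SYMBOL_MAP.items():
        text = text.replace(src, dst)
    return text
```

Collapsing the micro sign (U+00B5) and Greek mu (U+03BC) to one codepoint matters in practice: dosages like "5 µg" appear with either character depending on the source document's fonts.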
Implementation: see `drugslm.process` for code documentation.