Extraction Pipeline Design

This document describes the two extraction pipelines that convert raw PDF and HTML documents into structured data.

🎯 Objectives

  1. Extract text content preserving semantic structure
  2. Identify and parse tabular data
  3. Handle multi-column layouts and complex formatting
  4. Normalize encoding and special characters (a brief sketch follows this list)
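
Objective 4 is concrete enough to pin down with an example, since pharmaceutical text mixes micro signs, superscript units, and no-break spaces. Below is a minimal sketch using only the Python stdlib; choosing NFKC as the normalization form is an assumption, not a documented decision of this pipeline.

```python
import unicodedata

def normalize_chars(raw: str) -> str:
    # Assumed helper, not part of drugslm.process. NFKC folds
    # compatibility characters: MICRO SIGN (U+00B5) becomes GREEK SMALL
    # LETTER MU (U+03BC), superscript two becomes "2", and no-break
    # spaces become ordinary spaces.
    return unicodedata.normalize("NFKC", raw)

print(normalize_chars("50\u00b5g/m\u00b2\u00a0daily"))  # -> 50μg/m2 daily
```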

📄 PDF Extraction Pipeline

```mermaid
flowchart TD
    PDF[PDF Files]
    Layout[\LayoutLM/] -.- DLN{{DocLayNet}}
    EPDF[Structured PDF Files]

    PDF ~~~ DLN ~~~ PExt

    PExt>Raw Extraction]
    PNorm>Primary Normalization]
    PCorr>Structural Cleanup]
    PRep>Repair]
    POut@{ shape: lin-rect, label: "Data With Syntax Repaired" }

    PDF -.-> Layout -.-> EPDF -.-> PExt
    PDF ==> PExt --> PNorm --> PCorr --> PRep --> POut
```

Stage Descriptions

| Stage | Input | Output | Tools |
| --- | --- | --- | --- |
| Raw Extraction | PDF binary | Raw text + positions | PyMuPDF |
| Primary Normalization | Raw text | UTF-8 normalized text | Python stdlib |
| Structural Cleanup | Normalized text | Logical sections | Custom rules |
| Repair | Sections | Valid structured data | Validation |
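
A minimal sketch of how these four stages might chain together, assuming PyMuPDF (imported as fitz) for raw extraction as the table indicates; the function names, the block-level granularity, and the elided rule sets are illustrative assumptions, not the documented implementation.

```python
import unicodedata
import fitz  # PyMuPDF

def raw_extraction(path: str) -> list[dict]:
    # Stage 1: raw text plus positions, one record per text block.
    records = []
    with fitz.open(path) as doc:
        for page in doc:
            # Each block is (x0, y0, x1, y1, text, block_no, block_type).
            for x0, y0, x1, y1, text, _no, btype in page.get_text("blocks"):
                if btype != 0:  # keep text blocks, skip image blocks
                    continue
                records.append({"page": page.number,
                                "bbox": (x0, y0, x1, y1),
                                "text": text})
    return records

def primary_normalization(records: list[dict]) -> list[dict]:
    # Stage 2: Unicode canonicalization, stdlib only (see earlier sketch).
    for r in records:
        r["text"] = unicodedata.normalize("NFKC", r["text"])
    return records

def structural_cleanup(records: list[dict]) -> list[dict]:
    # Stage 3: custom rules grouping blocks into logical sections.
    ...  # rule set omitted; depends on the document class

def repair(sections: list[dict]) -> list[dict]:
    # Stage 4: validate sections and emit "data with syntax repaired".
    ...  # validation omitted
```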

🌐 HTML Extraction Pipeline

```mermaid
flowchart TD
    HTML[HTML Files]
    DOM[/Layout DOM\] -.- DLD{{DOMLayoutData}}
    EHTML[Structured HTML Files]

    HTML ~~~ DLD ~~~ HExt

    HExt>Raw Extraction]
    HNorm>Primary Normalization]
    HCorr>Structural Cleanup]
    HRep>Repair]
    HOut@{ shape: lin-rect, label: "Data With Syntax Repaired" }

    HTML -.-> DOM -.-> EHTML -.-> HExt
    HTML ==> HExt --> HNorm --> HCorr --> HRep --> HOut
```
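
The HTML stages mirror the PDF ones, but this document names no extraction tools for them. As one illustration of the raw-extraction stage, here is a minimal sketch using Python's stdlib html.parser; the class name and the set of tags it keeps are assumptions, not part of the pipeline's documented design.

```python
from html.parser import HTMLParser

class RawExtractor(HTMLParser):
    # Assumed stage-1 helper: walk the DOM and keep text with its
    # enclosing block-level tag, handing (tag, text) pairs downstream.
    BLOCK_TAGS = {"h1", "h2", "h3", "p", "li", "td", "th"}

    def __init__(self):
        super().__init__()
        self.stack = []   # open-tag stack, a crude stand-in for DOM position
        self.blocks = []  # (tag, text) pairs passed to normalization

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] in self.BLOCK_TAGS:
            self.blocks.append((self.stack[-1], text))

extractor = RawExtractor()
extractor.feed("<h1>Aspirin</h1><p>Dose: 500 mg</p>")
print(extractor.blocks)  # [('h1', 'Aspirin'), ('p', 'Dose: 500 mg')]
```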

⚠️ Known Challenges

  1. Scanned PDFs - Require OCR preprocessing (a detection sketch follows this list)
  2. Multi-column layouts - Column detection can fail
  3. Tables spanning pages - Header detection issues
  4. Special characters - Pharmaceutical symbols and units
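
Challenge 1 can at least be detected before stage 1 runs: a page that carries images but yields almost no text layer is likely a scan and should be routed through OCR first. A minimal sketch with PyMuPDF; the helper name and the character threshold are assumptions for illustration.

```python
import fitz  # PyMuPDF

def needs_ocr(path: str, min_chars: int = 25) -> bool:
    # Assumed heuristic, not part of the documented pipeline: treat a
    # page with images but no meaningful text layer as a scanned page.
    with fitz.open(path) as doc:
        for page in doc:
            has_text = len(page.get_text("text").strip()) >= min_chars
            has_images = bool(page.get_images(full=True))
            if has_images and not has_text:
                return True
    return False
```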

Implementation: See drugslm.process for code documentation.