1 Data Acquisition and Preparation

A. Data Sources & Ingestion

flowchart TD
    classDef bronze fill:#faf6f0,stroke:#8B4513,stroke-width:2px,color:#000
    classDef silver fill:#f5f5f5,stroke:#607D8B,stroke-width:2px,color:#000
    classDef proc fill:#f0fff4,stroke:#00A843,stroke-width:2px,color:#000
    classDef external fill:#f5f5f5,stroke:#616161,stroke-width:2px,color:#000

    ANVISA{{"ANVISA Drugs"}}:::external
    Scraper([Web Scraper & Indexer]):::proc
    Downloader([Batch PDF Downloader]):::proc

    MetaLinks["URL List & Metadata"]:::bronze
    PDFs[PDF Files]:::bronze

    ANVISA --- Scraper --> MetaLinks --- Downloader --> PDFs

    click ANVISA "https://consultas.anvisa.gov.br/#/bulario/" "Access ANVISA Electronic Drug Information" _blank
    click Scraper "../../../reference/datasets/sources/anvisa/categories" "View Scraper Code"

B. Segmentation & Extraction

flowchart TD
    classDef bronze fill:#faf6f0,stroke:#8B4513,stroke-width:2px,color:#000
    classDef silver fill:#f5f5f5,stroke:#607D8B,stroke-width:2px,color:#000
    classDef model fill:#f8f0ff,stroke:#7E57C2,stroke-width:2px,stroke-dasharray: 5 5,color:#000

    PDFs[PDF Files]:::bronze

    LayoutModel[/Layout Analysis\]:::model
    OCREngine[/OCR Engine\]:::model

    BBoxJSON@{ shape: lin-rect, label: "Layout Bounding Boxes" }
    TextBlocks@{ shape: lin-rect, label: "Extracted Text Blocks" }

    PDFs --- LayoutModel --> BBoxJSON --- OCREngine
    PDFs --- OCREngine --> TextBlocks

    class BBoxJSON,TextBlocks silver

C. Normalization & Structuring

flowchart TD
    classDef silver fill:#f5f5f5,stroke:#607D8B,stroke-width:2px,color:#000
    classDef proc fill:#f0fff4,stroke:#00A843,stroke-width:2px,color:#000
    classDef model fill:#f8f0ff,stroke:#7E57C2,stroke-width:2px,stroke-dasharray: 5 5,color:#000

    TextBlocks@{ shape: lin-rect, label: "Extracted Text Blocks" }

    SectionTagger[/Section Tagger\]:::model
    TextNormalizer([Text Normalizer]):::proc

    TaggedJSON@{ shape: lin-rect, label: "Tagged JSON" }
    NormalizedJSON@{ shape: lin-rect, label: "Normalized JSON" }

    TextBlocks --- SectionTagger --> TaggedJSON --- TextNormalizer --> NormalizedJSON

    class TextBlocks,TaggedJSON,NormalizedJSON silver
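
As a rough illustration of what the Text Normalizer step might do, the sketch below cleans each tagged block with Unicode NFKC normalization, rejoins words hyphenated across line breaks, and collapses whitespace. The specific rules and the `section` and `text` field names are assumptions made for this example; the pipeline's actual normalization rules are not described here.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply a few generic cleanup rules (assumed, not the pipeline's real rules)."""
    # Fold compatibility characters (ligatures, non-breaking spaces) into plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words split by a hyphen at a line break, e.g. "Posolo-\ngia".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse any run of whitespace, including newlines, into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def normalize_blocks(blocks: list[dict]) -> list[dict]:
    """Return copies of the tagged blocks with their text normalized.

    Each block is assumed to look like {"section": ..., "text": ...}.
    """
    return [{**block, "text": normalize_text(block["text"])} for block in blocks]


if __name__ == "__main__":
    tagged = [{"section": "posologia", "text": "Posolo-\ngia:\u00a0 tomar  1\ncomprimido"}]
    print(normalize_blocks(tagged))
```

The hyphen rule is deliberately naive: it would also merge a genuinely hyphenated compound that happens to break at a line end, so a real normalizer would likely need a dictionary check.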

D. Curation & Consolidation

This stage gives meaning to the data that was extracted and normalized as JSON. Because this data carries high semantic value, the intent was to use it to deduplicate and consolidate records so that no information is lost or left duplicated, even when the same fact is written in different ways. Knowing the nature of each piece of data, what it means, and how the pieces relate to one another, the final output would keep the equivalent versions separate for future use, alongside a central version containing all of the extracted information.

flowchart TD
    classDef silver fill:#f5f5f5,stroke:#607D8B,stroke-width:2px,color:#000
    classDef gold fill:#fffbf0,stroke:#FF8F00,stroke-width:2px,color:#000
    classDef proc fill:#f0fff4,stroke:#00A843,stroke-width:2px,color:#000
    classDef model fill:#f8f0ff,stroke:#7E57C2,stroke-width:2px,stroke-dasharray: 5 5,color:#000

    NormalizedJSON@{ shape: lin-rect, label: "Normalized JSON" }

    BioNER[/Biomedical NER\]:::model
    EntJSON@{ shape: lin-rect, label: "Labeled Entities" }

    EntityRes([Entity Resolution]):::proc
    Groups@{ shape: lin-rect, label: "Entity Groups" }

    SemSim[/Semantic Similarity\]:::model
    DupMap@{ shape: lin-rect, label: "Duplicate Mapping" }

    MergeEngine([Merge Engine]):::proc
    RefinedJSON[[Refined JSON]]:::gold

    EntJSON & Groups --- MergeEngine

    NormalizedJSON --- BioNER --> EntJSON --- EntityRes --> Groups --- SemSim
    SemSim --> DupMap --- MergeEngine --> RefinedJSON

    class NormalizedJSON,EntJSON,Groups,DupMap silver
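
The consolidation idea in this stage (one central version plus the equivalent variants kept aside) can be sketched as follows. This is a minimal sketch under stated assumptions: `difflib.SequenceMatcher` stands in for the Semantic Similarity model, the 0.85 threshold is arbitrary, and the function names and sample values are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in for the Semantic Similarity model: character-level overlap ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_duplicate_mapping(values: list[str], threshold: float = 0.85) -> dict[str, list[str]]:
    """Map each canonical value to the later variants judged equivalent to it."""
    mapping: dict[str, list[str]] = {}
    for value in values:
        for canonical in mapping:
            if similarity(value, canonical) >= threshold:
                # Keep the variant instead of discarding it.
                mapping[canonical].append(value)
                break
        else:
            # No close match: this value starts its own group.
            mapping[value] = []
    return mapping


def merge(values: list[str], threshold: float = 0.85) -> dict:
    """Produce a central list plus the equivalents preserved per canonical value."""
    mapping = build_duplicate_mapping(values, threshold)
    return {"central": list(mapping), "equivalents": mapping}


if __name__ == "__main__":
    # Illustrative values only.
    print(merge(["Paracetamol 500 mg", "paracetamol 500mg", "Ibuprofeno 200 mg"]))
```

The first value seen in each group becomes the canonical one here; a real Merge Engine would presumably choose the canonical form using the entity groups rather than input order.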

E. Generation & Validation

Instruction Generation

flowchart TD
    classDef bronze fill:#faf6f0,stroke:#8B4513,stroke-width:2px,color:#000
    classDef silver fill:#f5f5f5,stroke:#607D8B,stroke-width:2px,color:#000
    classDef gold fill:#fffbf0,stroke:#FF8F00,stroke-width:2px,color:#000
    classDef proc fill:#f0fff4,stroke:#00A843,stroke-width:2px,color:#000
    classDef model fill:#f8f0ff,stroke:#7E57C2,stroke-width:2px,stroke-dasharray: 5 5,color:#000
    classDef human fill:#fff0f5,stroke:#C2185B,stroke-width:2px,stroke-dasharray: 5 5,color:#000

    Experts1[\Domain Experts/]:::human
    Template[Sentence Skeleton]:::bronze

    Experts1 --> Template

    RefinedJSON[[Refined JSON]]:::gold

    GeneratorLLM[/Generator LLM\]:::model
    Drafts@{ shape: lin-rect, label: "Draft Instructions" }

    Template --- GeneratorLLM
    RefinedJSON --- GeneratorLLM
    GeneratorLLM --> Drafts

    HeuristicValidator([Heuristic Validator]):::proc
    CandidatePool@{ shape: lin-rect, label: "Heuristic-Passed Pool" }

    Drafts --- HeuristicValidator --> CandidatePool

    Sampler([Sampler]):::proc
    SampleBatch@{ shape: lin-rect, label: "Sample Batch" }

    CandidatePool --- Sampler --> SampleBatch

    ExpertsReview[\Domain Experts Review/]:::human

    SampleBatch --- ExpertsReview

    Promotion([Promotion Engine]):::proc
    DiscardBatch([Discard Engine]):::proc
    HumanDataset[[Human Validated Dataset]]:::gold

    ExpertsReview -->|approve| Promotion --> HumanDataset
    ExpertsReview -->|reject| DiscardBatch

    class Drafts,CandidatePool,SampleBatch silver
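
To make the Heuristic Validator and Sampler steps concrete, here is a small sketch. The checks (non-empty fields, a length window), the `instruction` and `response` field names, and the fixed seed are all assumptions for illustration; the actual heuristics and sampling strategy are not specified on this page.

```python
import random


def passes_heuristics(draft: dict, min_len: int = 10, max_len: int = 500) -> bool:
    """Hypothetical checks: both fields are non-empty and the instruction length is in range."""
    instruction = draft.get("instruction", "").strip()
    response = draft.get("response", "").strip()
    if not instruction or not response:
        return False
    return min_len <= len(instruction) <= max_len


def sample_batch(pool: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Draw up to k drafts without replacement; the seed keeps the draw reproducible."""
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


if __name__ == "__main__":
    drafts = [
        {"instruction": "What is the maximum daily dose?", "response": "See the leaflet."},
        {"instruction": "Too short", "response": "x"},
        {"instruction": "How should this medicine be stored?", "response": ""},
    ]
    candidate_pool = [d for d in drafts if passes_heuristics(d)]
    print(sample_batch(candidate_pool, k=2))
```

Only the sampled drafts would go to expert review; how an approval or rejection of the sample applies to the rest of the pool is left open here, as it is in the diagram.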

Knowledge Graph Construction

flowchart TD
    classDef silver fill:#f5f5f5,stroke:#607D8B,stroke-width:2px,color:#000
    classDef gold fill:#fffbf0,stroke:#FF8F00,stroke-width:2px,color:#000
    classDef proc fill:#f0fff4,stroke:#00A843,stroke-width:2px,color:#000
    classDef model fill:#f8f0ff,stroke:#7E57C2,stroke-width:2px,stroke-dasharray: 5 5,color:#000
    classDef store fill:#f0f7ff,stroke:#1565C0,stroke-width:2px,color:#000

    RefinedJSON[[Refined JSON]]:::gold
    %% EntJSON@{ shape: lin-rect, label: "Labeled Entities" }

    RelExtractor[/Relation Extraction\]:::model
    GraphBuilder([Graph Ingestion]):::proc

    TriplesList@{ shape: lin-rect, label: "Triples List (S-P-O)" }
    KnowledgeGraph[[Knowledge Graph]]:::store
    KGValidator([KG Validator]):::proc
    GraphWalker([Graph Walker]):::proc

    RefinedJSON --- RelExtractor
    %% EntJSON --- RelExtractor
    RelExtractor --> TriplesList --- GraphBuilder --> KnowledgeGraph --- KGValidator & GraphWalker

    class TriplesList silver
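
The path from the Triples List to the Graph Walker can be illustrated with a plain adjacency map. This is a sketch only: the knowledge graph's real storage backend is not named here, and the entity and predicate names below are invented placeholders.

```python
from collections import defaultdict
from collections.abc import Iterator

Triple = tuple[str, str, str]  # (subject, predicate, object)


def build_graph(triples: list[Triple]) -> dict[str, list[tuple[str, str]]]:
    """Index the triples by subject so outgoing edges can be looked up quickly."""
    graph: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


def walk(graph: dict, start: str, max_depth: int = 2) -> Iterator[list[Triple]]:
    """Yield every path of 1..max_depth edges that begins at `start`."""

    def _walk(node: str, path: list[Triple], depth: int) -> Iterator[list[Triple]]:
        if depth == max_depth:
            return  # The depth bound also stops cycles from recursing forever.
        for predicate, obj in graph.get(node, []):
            new_path = path + [(node, predicate, obj)]
            yield new_path
            yield from _walk(obj, new_path, depth + 1)

    yield from _walk(start, [], 0)


if __name__ == "__main__":
    triples = [
        ("DrugA", "has_active_ingredient", "SubstanceX"),
        ("SubstanceX", "contraindicated_in", "ConditionY"),
    ]
    for path in walk(build_graph(triples), "DrugA"):
        print(path)
```

Each yielded path is a chain of facts that a sentence skeleton could later be filled with.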

Instruction Generation

flowchart TD
    classDef bronze fill:#faf6f0,stroke:#8B4513,stroke-width:2px,color:#000
    classDef silver fill:#f5f5f5,stroke:#607D8B,stroke-width:2px,color:#000
    classDef gold fill:#fffbf0,stroke:#FF8F00,stroke-width:2px,color:#000
    classDef proc fill:#f0fff4,stroke:#00A843,stroke-width:2px,color:#000
    classDef model fill:#f8f0ff,stroke:#7E57C2,stroke-width:2px,stroke-dasharray: 5 5,color:#000
    classDef human fill:#fff0f5,stroke:#C2185B,stroke-width:2px,stroke-dasharray: 5 5,color:#000
    classDef store fill:#f0f7ff,stroke:#1565C0,stroke-width:2px,color:#000

    Experts[\Domain Experts/]:::human

    Template[Sentence Skeleton]:::bronze
    GraphWalker([Graph Walker]):::proc
    DetDataset[[Deterministic Dataset]]:::gold

    RefinedJSON[[Refined JSON]]:::gold

    Experts --> Template

    GeneratorLLM[/Generator LLM\]:::model
    DraftInstr@{ shape: lin-rect, label: "Draft Instructions" }


    Template --- GeneratorLLM
    Template --- GraphWalker --> DetDataset
    DetDataset --- GeneratorLLM
    GeneratorLLM --> DraftInstr

    RefinedJSON --- GeneratorLLM

    KGValidator([KG Validator]):::proc
    Rejected@{ shape: lin-rect, label: "Rejected" }
    GenDataset[[Generative Dataset]]:::gold

    DraftInstr --- KGValidator
    KGValidator -->|pass| GenDataset
    KGValidator -->|fail| Rejected

    class DraftInstr,Rejected silver
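
One way to picture the KG Validator's pass/fail split is shown below. It assumes, purely for illustration, that each draft instruction carries the triples it claims in a `facts` field, and that a draft passes only when every claimed triple exists in the knowledge graph. The real validator may work quite differently, for example by extracting facts from the generated text.

```python
Triple = tuple[str, str, str]  # (subject, predicate, object)


def validate_drafts(drafts: list[dict], kg_triples: list[Triple]) -> tuple[list[dict], list[dict]]:
    """Split drafts into (passed, rejected) by checking their claimed facts against the KG."""
    known = set(kg_triples)
    passed: list[dict] = []
    rejected: list[dict] = []
    for draft in drafts:
        # "facts" is a hypothetical field holding the triples the draft relies on.
        facts = [tuple(fact) for fact in draft.get("facts", [])]
        if facts and all(fact in known for fact in facts):
            passed.append(draft)
        else:
            # Drafts with no facts or with any unknown fact are rejected.
            rejected.append(draft)
    return passed, rejected


if __name__ == "__main__":
    kg = [("DrugA", "has_active_ingredient", "SubstanceX")]
    drafts = [
        {"text": "DrugA contains SubstanceX.", "facts": [("DrugA", "has_active_ingredient", "SubstanceX")]},
        {"text": "DrugA treats ConditionZ.", "facts": [("DrugA", "treats", "ConditionZ")]},
    ]
    gen_dataset, rejected = validate_drafts(drafts, kg)
    print(len(gen_dataset), len(rejected))
```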