1 Data Acquisition and Preparation
A. Data Sources & Ingestion
flowchart TD
classDef bronze fill:#faf6f0,stroke:#8B4513,stroke-width:2px,color:#000
classDef silver fill:#f5f5f5,stroke:#607D8B,stroke-width:2px,color:#000
classDef proc fill:#f0fff4,stroke:#00A843,stroke-width:2px,color:#000
classDef external fill:#f5f5f5,stroke:#616161,stroke-width:2px,color:#000
ANVISA{{"ANVISA Drugs"}}:::external
Scraper([Web Scraper & Indexer]):::proc
Downloader([Batch PDF Downloader]):::proc
MetaLinks["URL List & Metadata"]:::bronze
PDFs[PDF Files]:::bronze
ANVISA --- Scraper --> MetaLinks --- Downloader --> PDFs
click ANVISA "https://consultas.anvisa.gov.br/#/bulario/" "Access ANVISA Electronic Drug Information" _blank
click Scraper "../../../reference/datasets/sources/anvisa/categories" "View Scraper Code"
B. Segmentation & Extraction
flowchart TD
classDef bronze fill:#faf6f0,stroke:#8B4513,stroke-width:2px,color:#000
classDef silver fill:#f5f5f5,stroke:#607D8B,stroke-width:2px,color:#000
classDef model fill:#f8f0ff,stroke:#7E57C2,stroke-width:2px,stroke-dasharray: 5 5,color:#000
PDFs[PDF Files]:::bronze
LayoutModel[/Layout Analysis\]:::model
OCREngine[/OCR Engine\]:::model
BBoxJSON@{ shape: lin-rect, label: "Layout Bounding Boxes" }
TextBlocks@{ shape: lin-rect, label: "Extracted Text Blocks" }
PDFs --- LayoutModel --> BBoxJSON --- OCREngine
PDFs --- OCREngine --> TextBlocks
class BBoxJSON,TextBlocks silver
C. Normalization & Structuring
flowchart TD
classDef silver fill:#f5f5f5,stroke:#607D8B,stroke-width:2px,color:#000
classDef proc fill:#f0fff4,stroke:#00A843,stroke-width:2px,color:#000
classDef model fill:#f8f0ff,stroke:#7E57C2,stroke-width:2px,stroke-dasharray: 5 5,color:#000
TextBlocks@{ shape: lin-rect, label: "Extracted Text Blocks" }
SectionTagger[/Section Tagger\]:::model
TextNormalizer([Text Normalizer]):::proc
TaggedJSON@{ shape: lin-rect, label: "Tagged JSON" }
NormalizedJSON@{ shape: lin-rect, label: "Normalized JSON" }
TextBlocks --- SectionTagger --> TaggedJSON --- TextNormalizer --> NormalizedJSON
class TextBlocks,TaggedJSON,NormalizedJSON silver
D. Curation & Consolidation
This stage is responsible for making sense of the extracted and normalized JSON data. Because that data carries high semantic value, the intent was to use it to deduplicate and consolidate the records so that no information is lost or left duplicated, even when it is worded differently. Knowing the nature of each data item, what it means, and how it relates to the others, the final output would keep the equivalent versions separate for future use, alongside a central version containing all the extracted information.
flowchart TD
classDef silver fill:#f5f5f5,stroke:#607D8B,stroke-width:2px,color:#000
classDef gold fill:#fffbf0,stroke:#FF8F00,stroke-width:2px,color:#000
classDef proc fill:#f0fff4,stroke:#00A843,stroke-width:2px,color:#000
classDef model fill:#f8f0ff,stroke:#7E57C2,stroke-width:2px,stroke-dasharray: 5 5,color:#000
NormalizedJSON@{ shape: lin-rect, label: "Normalized JSON" }
BioNER[/Biomedical NER\]:::model
EntJSON@{ shape: lin-rect, label: "Labeled Entities" }
EntityRes([Entity Resolution]):::proc
Groups@{ shape: lin-rect, label: "Entity Groups" }
SemSim[/Semantic Similarity\]:::model
DupMap@{ shape: lin-rect, label: "Duplicate Mapping" }
MergeEngine([Merge Engine]):::proc
RefinedJSON[[Refined JSON]]:::gold
EntJSON & Groups --- MergeEngine
NormalizedJSON --- BioNER --> EntJSON --- EntityRes --> Groups --- SemSim
SemSim --> DupMap --- MergeEngine --> RefinedJSON
class NormalizedJSON,EntJSON,Groups,DupMap silver
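The deduplication and merge idea in this stage can be sketched in a few lines. This is only an illustrative sketch, not the pipeline's implementation: the function names (`similarity`, `build_duplicate_groups`, `merge_groups`), the `0.8` threshold, and the output keys are all hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for the Semantic Similarity model, which would normally compare meaning rather than characters.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level ratio in [0, 1]; an assumed stand-in for a semantic model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_duplicate_groups(values: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Greedily place each value in the first group whose first member is similar enough."""
    groups: list[list[str]] = []
    for value in values:
        for group in groups:
            if similarity(value, group[0]) >= threshold:
                group.append(value)
                break
        else:
            # No existing group matched, so this value starts a new one.
            groups.append([value])
    return groups


def merge_groups(groups: list[list[str]]) -> list[dict]:
    """Emit one central value per group while preserving every equivalent variant."""
    return [
        {"canonical": max(group, key=len), "variants": sorted(set(group))}
        for group in groups
    ]


if __name__ == "__main__":
    raw = ["Dor de cabeça", "dor de cabeca", "Náusea"]
    for record in merge_groups(build_duplicate_groups(raw)):
        print(record)
```

Choosing the longest string as the central value is an arbitrary rule for the sketch; a real merge step would more likely combine fields from all variants so that no information is dropped.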
E. Generation & Validation
Instruction Generation (Expert-Validated)
flowchart TD
classDef bronze fill:#faf6f0,stroke:#8B4513,stroke-width:2px,color:#000
classDef silver fill:#f5f5f5,stroke:#607D8B,stroke-width:2px,color:#000
classDef gold fill:#fffbf0,stroke:#FF8F00,stroke-width:2px,color:#000
classDef proc fill:#f0fff4,stroke:#00A843,stroke-width:2px,color:#000
classDef model fill:#f8f0ff,stroke:#7E57C2,stroke-width:2px,stroke-dasharray: 5 5,color:#000
classDef human fill:#fff0f5,stroke:#C2185B,stroke-width:2px,stroke-dasharray: 5 5,color:#000
Experts1[\Domain Experts/]:::human
Template[Sentence Skeleton]:::bronze
Experts1 --> Template
RefinedJSON[[Refined JSON]]:::gold
GeneratorLLM[/Generator LLM\]:::model
Drafts@{ shape: lin-rect, label: "Draft Instructions" }
Template --- GeneratorLLM
RefinedJSON --- GeneratorLLM
GeneratorLLM --> Drafts
HeuristicValidator([Heuristic Validator]):::proc
CandidatePool@{ shape: lin-rect, label: "Heuristic-Passed Pool" }
Drafts --- HeuristicValidator --> CandidatePool
Sampler([Sampler]):::proc
SampleBatch@{ shape: lin-rect, label: "Sample Batch" }
CandidatePool --- Sampler --> SampleBatch
ExpertsReview[\Domain Experts Review/]:::human
SampleBatch --- ExpertsReview
Promotion([Promotion Engine]):::proc
DiscardBatch([Discard Engine]):::proc
HumanDataset[[Human Validated Dataset]]:::gold
ExpertsReview -->|approve| Promotion --> HumanDataset
ExpertsReview -->|reject| DiscardBatch
class Drafts,CandidatePool,SampleBatch silver
Knowledge Graph Construction
flowchart TD
classDef silver fill:#f5f5f5,stroke:#607D8B,stroke-width:2px,color:#000
classDef gold fill:#fffbf0,stroke:#FF8F00,stroke-width:2px,color:#000
classDef proc fill:#f0fff4,stroke:#00A843,stroke-width:2px,color:#000
classDef model fill:#f8f0ff,stroke:#7E57C2,stroke-width:2px,stroke-dasharray: 5 5,color:#000
classDef store fill:#f0f7ff,stroke:#1565C0,stroke-width:2px,color:#000
RefinedJSON[[Refined JSON]]:::gold
%% EntJSON@{ shape: lin-rect, label: "Labeled Entities" }
RelExtractor[/Relation Extraction\]:::model
GraphBuilder([Graph Ingestion]):::proc
TriplesList@{ shape: lin-rect, label: "Triples List (S-P-O)" }
KnowledgeGraph[[Knowledge Graph]]:::store
KGValidator([KG Validator]):::proc
GraphWalker([Graph Walker]):::proc
RefinedJSON --- RelExtractor
%% EntJSON --- RelExtractor
RelExtractor --> TriplesList --- GraphBuilder --> KnowledgeGraph --- KGValidator & GraphWalker
class TriplesList silver
Instruction Generation (Knowledge-Graph-Validated)
flowchart TD
classDef bronze fill:#faf6f0,stroke:#8B4513,stroke-width:2px,color:#000
classDef silver fill:#f5f5f5,stroke:#607D8B,stroke-width:2px,color:#000
classDef gold fill:#fffbf0,stroke:#FF8F00,stroke-width:2px,color:#000
classDef proc fill:#f0fff4,stroke:#00A843,stroke-width:2px,color:#000
classDef model fill:#f8f0ff,stroke:#7E57C2,stroke-width:2px,stroke-dasharray: 5 5,color:#000
classDef human fill:#fff0f5,stroke:#C2185B,stroke-width:2px,stroke-dasharray: 5 5,color:#000
classDef store fill:#f0f7ff,stroke:#1565C0,stroke-width:2px,color:#000
Experts[\Domain Experts/]:::human
Template[Sentence Skeleton]:::bronze
GraphWalker([Graph Walker]):::proc
DetDataset[[Deterministic Dataset]]:::gold
RefinedJSON[[Refined JSON]]:::gold
Experts --> Template
GeneratorLLM[/Generator LLM\]:::model
DraftInstr@{ shape: lin-rect, label: "Draft Instructions" }
Template --- GeneratorLLM
Template --- GraphWalker --> DetDataset
DetDataset --- GeneratorLLM
GeneratorLLM --> DraftInstr
RefinedJSON --- GeneratorLLM
KGValidator([KG Validator]):::proc
Rejected@{ shape: lin-rect, label: "Rejected" }
GenDataset[[Generative Dataset]]:::gold
DraftInstr --- KGValidator
KGValidator -->|pass| GenDataset
KGValidator -->|fail| Rejected
class DraftInstr,Rejected silver