Datasets Design

This section documents the design decisions and workflows for the data pipeline. It covers how pharmaceutical data is acquired, cleaned, transformed, and prepared for model training.

📊 Overview

The Datasets module follows a three-stage architecture (Sources, Transform, Features), with the output of the final stage written to storage:

flowchart LR
    Sources[Sources] --> Transform[Transform]
    Transform --> Features[Features]
    Features --> Storage[(Storage)]
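
| Stage     | Purpose                                | Status      |
|-----------|----------------------------------------|-------------|
| Sources   | Data acquisition from external portals | In Progress |
| Transform | Cleaning, normalization, extraction    | Planned     |
| Features  | Tokenization, embedding, metadata      | Planned     |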
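
As a rough illustration of how the three stages could compose, the sketch below chains them as plain Python functions. It is not the module's actual code: `Record`, `sources`, `transform`, `features`, and `run_pipeline` are hypothetical names, the sample strings are made up, and a plain list stands in for storage. The Features stage here covers only tokenization and metadata, not embedding.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    """Hypothetical record passed between stages."""
    raw: str
    text: str = ""
    tokens: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


def sources() -> List[Record]:
    # Stand-in for acquisition from an external portal;
    # the strings below are invented sample data.
    return [
        Record(raw="  Aspirin 500 MG   tablet "),
        Record(raw="IBUPROFEN  200mg"),
    ]


def transform(records: List[Record]) -> List[Record]:
    # Cleaning and normalization: collapse whitespace and lowercase.
    for record in records:
        record.text = " ".join(record.raw.split()).lower()
    return records


def features(records: List[Record]) -> List[Record]:
    # Tokenization (whitespace split) plus simple metadata.
    for record in records:
        record.tokens = record.text.split()
        record.metadata = {"n_tokens": str(len(record.tokens))}
    return records


def run_pipeline(storage: List[Record]) -> None:
    # Sources -> Transform -> Features -> Storage
    storage.extend(features(transform(sources())))


if __name__ == "__main__":
    store: List[Record] = []
    run_pipeline(store)
    for record in store:
        print(record.text, record.tokens, record.metadata)
```

Keeping each stage as a function from records to records means a planned stage can be stubbed out or swapped without touching the others.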

📂 Subsections

  • Sources - Scraping and download workflows
  • Transform - Data cleaning and extraction pipelines
  • Features - Feature engineering (coming soon)

See also: API Reference - Datasets for code documentation.