Datasets Design
This section documents the design decisions and workflows for the data pipeline. It covers how pharmaceutical data is acquired, cleaned, transformed, and prepared for model training.
📊 Overview
The Datasets module follows a three-stage architecture (Sources, Transform, Features), with the final stage writing its output to storage:
flowchart LR
    Sources[Sources] --> Transform[Transform]
    Transform --> Features[Features]
    Features --> Storage[(Storage)]
| Stage | Purpose | Status |
|---|---|---|
| Sources | Data acquisition from external portals | In Progress |
| Transform | Cleaning, normalization, extraction | Planned |
| Features | Tokenization, embedding, metadata | Planned |
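
To make the data flow concrete, the stages above can be sketched as three composable functions that feed a store. This is a minimal illustration only: the `Record` shape, the function names (`acquire`, `transform`, `featurize`, `run_pipeline`), the whitespace-based tokenization, and the in-memory `dict` used as storage are all assumptions for this sketch, not the module's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical unit of data passed between stages."""
    source_id: str
    text: str
    metadata: dict = field(default_factory=dict)


def acquire() -> list[Record]:
    """Sources stage: stand-in for scraping or downloading from a portal."""
    return [Record("doc-1", "  Example   LABEL text \n")]


def transform(records: list[Record]) -> list[Record]:
    """Transform stage: here, only collapses runs of whitespace."""
    return [
        Record(r.source_id, " ".join(r.text.split()), dict(r.metadata))
        for r in records
    ]


def featurize(records: list[Record]) -> list[Record]:
    """Features stage: naive whitespace tokenization plus a token count."""
    out = []
    for r in records:
        tokens = r.text.lower().split()
        meta = {**r.metadata, "tokens": tokens, "n_tokens": len(tokens)}
        out.append(Record(r.source_id, r.text, meta))
    return out


def run_pipeline(storage: dict) -> dict:
    """Chain the three stages and write the results to storage."""
    for record in featurize(transform(acquire())):
        storage[record.source_id] = record
    return storage


if __name__ == "__main__":
    store = run_pipeline({})
    print(store["doc-1"].metadata["tokens"])
```

Keeping each stage a plain function from records to records means a stage can be developed and tested on its own, which suits a pipeline where only Sources is currently in progress.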
📂 Subsections
- Sources - Scraping and download workflows
- Transform - Data cleaning and extraction pipelines
- Features - Feature engineering (coming soon)
See also: API Reference - Datasets for code documentation.