DrugSLM - Small Language Model for Drug Information

Master's Thesis Project | Federal University of Paraná (UFPR) | Computer Science Department

DrugSLM is a specialized Small Language Model (SLM) trained on drug package inserts and pharmacological databases, designed to understand pharmaceutical information and to generate answers that are both accurate and easy to understand.

🎓 Academic Context

This project is part of a Master's thesis in Computer Science at the Federal University of Paraná (UFPR), Curitiba, Brazil. The research aims to:

Democratize access to complex pharmacological information
Transform pharmaceutical documentation data into instruction datasets
Develop a method to validate instructions and responses
Expand the vocabulary of tokenizers and embedding models for the Portuguese-language pharmacological domain
Specialize small language models in the pharmacological domain
Apply improvement strategies that use resources efficiently
Establish a Safety and Ethical-Clinical Alignment Framework

Researcher: Vinícius de Lima Gonçalves
Advisor: Professor Eduardo Todt, PhD
Institution: Department of Computer Science, UFPR

🎯 Project Vision

The reliability and quality of a small language model's output likely depend directly on the quality of its training data. The data therefore needs to be carefully extracted, structured, and processed, using labeling techniques and classical artificial intelligence techniques, so that each instruction generated for training can be classified as factually correct or incorrect before training proceeds.
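To make the idea of classifying generated instructions more concrete, here is a minimal Python sketch. It is not the validation method of this project: the `InstructionRecord` schema, the `token_overlap` heuristic, the `0.8` threshold, and the sample passage are all assumptions chosen for illustration. The sketch labels a generated response as supported only when most of its words also appear in the source passage it was derived from.

```python
import re
from dataclasses import dataclass


@dataclass
class InstructionRecord:
    """One candidate training example (hypothetical schema)."""
    instruction: str
    response: str
    source_text: str          # passage the pair was generated from
    label: str = "unverified"


def tokenize(text: str) -> set:
    """Lowercased set of word tokens; \\w also matches accented letters."""
    return set(re.findall(r"\w+", text.lower()))


def token_overlap(response: str, source_text: str) -> float:
    """Fraction of distinct response tokens that also occur in the source."""
    response_tokens = tokenize(response)
    if not response_tokens:
        return 0.0
    return len(response_tokens & tokenize(source_text)) / len(response_tokens)


def label_record(record: InstructionRecord, threshold: float = 0.8) -> InstructionRecord:
    """Mark the record 'supported' or 'unsupported' (assumed threshold)."""
    score = token_overlap(record.response, record.source_text)
    record.label = "supported" if score >= threshold else "unsupported"
    return record


def training_subset(records: list) -> list:
    """Keep only the records whose response is supported by its source."""
    return [r for r in records if label_record(r).label == "supported"]


# Made-up passage for illustration only; not taken from a real package insert.
SOURCE = "Take one tablet every 8 hours with water. Do not exceed three tablets per day."

candidates = [
    InstructionRecord("How often should I take it?",
                      "Take one tablet every 8 hours with water.", SOURCE),
    InstructionRecord("How often should I take it?",
                      "Take two tablets every 4 hours with alcohol.", SOURCE),
]

for rec in training_subset(candidates):
    print(rec.instruction, "->", rec.response)
```

Word overlap is only a crude stand-in: it cannot detect a response that reuses the right words with the wrong meaning, which is why a real pipeline would likely combine labeling with stronger classifiers, as the paragraph above suggests.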

🧬 Project Lifecycle and Roadmap


%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '20px', 'fontFamily': 'arial' }}}%%

flowchart LR

    classDef phase fill:#f0f4f8,stroke:#2c3e50,stroke-width:1px,color:#2c3e50, text-decoration: none;

    P1(Data Acquisition<br/>& Preparation):::phase
    P2(System Design<br/>& Modeling):::phase
    P3(Training<br/>& Optimization):::phase
    P4(Evaluation<br/>& Validation):::phase
    P5(Qualitative Assessment):::phase

    P1 ==> P2 ==> P3 ==> P4 ==> P5

    click P1 "architecture/roadmap/#phase-1-data-acquisition-and-preparation" "Go to Phase 1: Data Acquisition and Preparation"
    click P2 "architecture/roadmap/#phase-2-modeling-and-system-design" "Go to Phase 2: Modeling and System Design"
    click P3 "architecture/roadmap/#phase-3-training-and-optimization" "Go to Phase 3: Training and Optimization"
    click P4 "architecture/roadmap/#phase-4-evaluation-and-validation" "Go to Phase 4: Evaluation and Validation"
    click P5 "architecture/roadmap/#phase-5-qualitative-assessment" "Go to Phase 5: Qualitative Assessment"

Click a node in the diagram above to explore the details of each phase, including its extraction and transformation steps, training strategies, and validation metrics.