🇮🇹 🏥 💻 MISTIC

Welcome to MISTIC, a pipeline for Metastases Italian Sentence Transformers Inference Classification.

The research team for this work includes: Livia Lilli, Mario Santoro, Valeria Masiello, Stefano Patarnello, Luca Tagliaferri, Fabio Marazzi, Nikola Dino Capocchiano.

Repository structure

|-- scripts
|   |-- data_to_sentences.py
|   |-- fine_tune.py
|   |-- inference.py
|   |-- sample_data.py
|-- utils
|   |-- classifier.py
|   |-- data_processor.py
|   |-- sentencer.py
|   |-- topic_selector.py
|-- README.md
|-- requirements.txt

How to use

Install requirements

pip install -r requirements.txt

Data Segmentation

python scripts/data_to_sentences.py

The above command performs EHR segmentation. Input data is expected to have "id" and "text" columns, as follows:

id	text

The output is a table of EHR sentences with the following structure:

id	sent_id	splitted_text
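As an illustration of the input and output shapes only, here is a minimal sketch of segmentation. It assumes a naive punctuation-based splitter, and the names `split_records` and `_SENTENCE_BOUNDARY` are hypothetical. The actual logic in `utils/sentencer.py` may differ.

```python
import re

# Assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
# This is a stand-in for whatever utils/sentencer.py really does.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_records(records):
    """Turn rows with "id" and "text" into one row per sentence."""
    rows = []
    for record in records:
        pieces = _SENTENCE_BOUNDARY.split(record["text"])
        sentences = [p.strip() for p in pieces if p.strip()]
        for sent_id, sentence in enumerate(sentences):
            rows.append(
                {"id": record["id"], "sent_id": sent_id, "splitted_text": sentence}
            )
    return rows


example = [{"id": 1, "text": "Paziente in buone condizioni. Non evidenza di metastasi."}]
sentence_rows = split_records(example)
```

Each input record yields one output row per sentence, and `sent_id` restarts at 0 for every `id`.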

Sampling data for Training and Inference

python scripts/sample_data.py

The above command filters and samples the training data by topic. It also applies the full preprocessing pipeline (segmentation followed by topic filtering) to the input gold standards so that they can be used in the inference phase. Training data sampling takes as input a dataset of sentences annotated by the SAS text-analytics pipeline. The dataset must contain the columns "parole chiave" and "livello_categoria_1", which hold the key lemmas and the presence/absence concepts associated with those lemmas:

id	sent_id	splitted_text	livello_categoria_1	parole chiave
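Topic filtering and sampling of this kind could look roughly like the sketch below. It is not the repository's implementation. The function `sample_by_topic`, its parameters, the label values "presenza"/"assenza", and the per-label sampling strategy are all assumptions made for illustration. Only the column names come from the table above.

```python
import random
from collections import defaultdict


def sample_by_topic(rows, topic_lemmas, per_label, seed=0):
    """Keep rows whose key lemma is on topic, then sample up to
    `per_label` rows for each value of "livello_categoria_1".

    Hypothetical helper: the real sampling in scripts/sample_data.py
    may use different rules.
    """
    rng = random.Random(seed)  # fixed seed for reproducible samples
    by_label = defaultdict(list)
    for row in rows:
        if row["parole chiave"].lower() in topic_lemmas:
            by_label[row["livello_categoria_1"]].append(row)

    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        sampled.extend(rng.sample(group, min(per_label, len(group))))
    return sampled


# Toy annotated sentences; label values are assumed, not taken from the repo.
annotated = [
    {"id": 1, "sent_id": 0, "splitted_text": "Metastasi epatiche.",
     "livello_categoria_1": "presenza", "parole chiave": "metastasi"},
    {"id": 2, "sent_id": 0, "splitted_text": "Metastasi ossee.",
     "livello_categoria_1": "presenza", "parole chiave": "metastasi"},
    {"id": 3, "sent_id": 1, "splitted_text": "Metastasi polmonari.",
     "livello_categoria_1": "presenza", "parole chiave": "metastasi"},
    {"id": 4, "sent_id": 0, "splitted_text": "Non metastasi.",
     "livello_categoria_1": "assenza", "parole chiave": "metastasi"},
    {"id": 5, "sent_id": 0, "splitted_text": "Febbre assente.",
     "livello_categoria_1": "assenza", "parole chiave": "febbre"},
]
training_rows = sample_by_topic(annotated, {"metastasi"}, per_label=2)
```

Capping the sample per label is one simple way to keep the presence and absence classes from becoming too unbalanced.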

Gold standards are a subset of EHRs manually annotated by experts for the final model evaluation. The GS table must have the following structure:

id	text	gold

Fine-Tuning

python scripts/fine_tune.py

The above command fine-tunes MISTIC with the given input parameters on the previously generated training data. Model checkpoints are saved to the "results" directory.

Inference

python scripts/inference.py

The above command evaluates the fine-tuned MISTIC model at sentence level on the gold standards previously processed in phase 3. The final classification is then made at the level of the whole EHR by applying an OR operation to the labels of its individual sentences. The output table has the following structure:

id	text	gold	classification
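The OR aggregation from sentence labels to an EHR-level label can be sketched as follows. This is a minimal example that assumes sentence predictions arrive as `(id, label)` pairs in which a truthy label means the positive class. The function name `classify_ehr` is hypothetical.

```python
def classify_ehr(sentence_predictions):
    """Combine sentence-level labels into one label per EHR using OR.

    Assumption: each item is (ehr_id, label) and a truthy label marks
    a positive sentence. An EHR is positive if any sentence is positive.
    """
    ehr_labels = {}
    for ehr_id, label in sentence_predictions:
        ehr_labels[ehr_id] = ehr_labels.get(ehr_id, False) or bool(label)
    return ehr_labels


# EHR 1 has one positive sentence, EHR 2 has none.
predictions = [(1, 0), (1, 1), (2, 0), (2, 0)]
ehr_level = classify_ehr(predictions)
```

With OR aggregation a single positive sentence is enough to mark the whole EHR as positive, so sentence-level false positives carry over directly to the EHR level.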
