How Effective is In-Context Learning with Large Language Models for Rare Cell Identification in Single-Cell Expression Data?
The rapid development of single-cell genomics calls for more powerful computational tools to distinguish between cell phenotypes. Identifying rare cells is one of the most important challenges in this area. Traditional data-driven approaches typically rely on feature selection to find key genes for anomaly detection, and they often require extensive training data or domain-specific knowledge.
In contrast, large language models (LLMs) have demonstrated strong generalization abilities in various scientific research fields, presenting new opportunities for rare cell identification. This repository accompanies our paper, where we conduct the first comprehensive evaluation of in-context learning with LLMs for rare cell identification. Our approach employs a chain-of-thought prompting strategy, integrating latent space analysis and cross-query comparisons to generate scores for identifying rare cells.
- First evaluation of LLMs for rare cell identification using in-context learning.
- Novel prompting strategy combining chain-of-thought reasoning with latent space analysis and cross-query comparisons.
- Competitive performance of LLMs compared with traditional optimization-based methods on benchmark datasets.
- Minimal dependence on extensive training data or expert-defined feature selection, demonstrating the generalization potential of LLMs in genomics.
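To make the prompting idea more concrete, the following is a minimal sketch of how a chain-of-thought prompt with latent coordinates and a comparison against other cells might be assembled. It is not the code in `src/llm_prompting.py`: the function names `latent_embedding` and `build_prompt`, the use of PCA as the latent space, the nearest-neighbor comparison, and the prompt wording are all assumptions made for illustration. No LLM is called here; the sketch only builds the prompt text.

```python
import numpy as np


def latent_embedding(expr, n_components=2):
    """Project cells (rows) onto the top principal components.

    PCA via SVD is an assumed stand-in for the latent space analysis.
    """
    centered = expr - expr.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def build_prompt(cell_ids, latent, query_idx, n_neighbors=3):
    """Assemble a chain-of-thought style prompt for one query cell.

    The comparison against nearest cells is a hypothetical reading of
    "cross-query comparisons"; the wording is illustrative only.
    """
    dists = np.linalg.norm(latent - latent[query_idx], axis=1)
    order = np.argsort(dists)
    neighbors = [int(i) for i in order if i != query_idx][:n_neighbors]

    lines = [
        "You are analysing single-cell expression data.",
        f"Query cell {cell_ids[query_idx]} has latent coordinates "
        f"{np.round(latent[query_idx], 2).tolist()}.",
        "Nearest cells in the latent space:",
    ]
    for i in neighbors:
        coords = np.round(latent[i], 2).tolist()
        lines.append(f"- {cell_ids[i]}: coordinates {coords}, distance {dists[i]:.2f}")
    lines += [
        "Think step by step: compare the query cell with the cells above,",
        "then decide how isolated it is from the main populations.",
        "Finish with a line 'SCORE: <number between 0 and 1>',",
        "where higher means more likely to be a rare cell.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    expr = rng.poisson(2.0, size=(6, 10)).astype(float)  # toy counts: 6 cells x 10 genes
    ids = [f"cell_{i}" for i in range(6)]
    print(build_prompt(ids, latent_embedding(expr), query_idx=0))
```

Asking for a fixed `SCORE:` line at the end is one simple way to make the model's answer easy to parse into a per-cell score.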
├── data/ # Benchmark datasets for rare cell identification
├── src/ # Implementation of our methodology
│   ├── preprocessing.py # Data preprocessing scripts
│   ├── llm_prompting.py # Chain-of-thought prompting strategy
│   └── evaluation.py # Performance evaluation scripts
├── results/ # Experimental results and analysis
├── README.md # Project documentation
└── requirements.txt # Required dependencies
To set up the environment, clone this repository and install the required dependencies:
$ cd RareCellAgent
$ pip install -r requirements.txt

Prepare the single-cell expression datasets and apply preprocessing:
$ python src/preprocessing.py --input data/raw_data.csv --output data/processed_data.csv

Execute the LLM-based rare cell identification pipeline:
$ python src/llm_prompting.py --input data/processed_data.csv --output results/llm_predictions.csv

Assess the performance of the LLM-based approach against traditional methods:
$ python src/evaluation.py --predictions results/llm_predictions.csv --ground_truth data/labels.csv

We evaluate our approach on publicly available single-cell expression datasets, including:
- Chung
- Darmanis
- Goolam
- Immuno
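
For readers who want to sanity-check predictions on one of these datasets by hand, the sketch below shows one common way to compare per-cell rare-cell scores with ground-truth labels. It is an illustration only: the metrics actually reported by `src/evaluation.py` and the layout of the prediction and label files are not described here, so the function name `rare_cell_metrics`, the 0.5 threshold, and the choice of precision, recall, and F1 are assumptions.

```python
import numpy as np


def rare_cell_metrics(scores, labels, threshold=0.5):
    """Precision, recall and F1 for the rare class.

    scores: per-cell scores, higher = more likely rare (assumed convention).
    labels: 1/True for rare cells, 0/False otherwise.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    predicted = scores >= threshold

    tp = int(np.sum(predicted & labels))    # rare cells correctly flagged
    fp = int(np.sum(predicted & ~labels))   # common cells wrongly flagged
    fn = int(np.sum(~predicted & labels))   # rare cells missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy values, not results from the paper.
    print(rare_cell_metrics([0.9, 0.8, 0.1, 0.2, 0.7], [1, 1, 0, 0, 0]))
```

Because rare cells are a small minority by definition, metrics focused on the rare class such as F1 tend to be more informative than plain accuracy.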