scMulan Model
Model Details
- Model Name: scMulan
- Version: 1.0 [deeplife version]
- Type: Foundation model for single-cell biology
- Original Paper: scMulan: A Multitask Generative Pre-Trained Language Model for Single-Cell Analysis
- Original Implementation: scMulan GitHub Repository
Model Description
scMulan is a foundation and generative model for single-cell gene expression.
Intended Use
scMulan is designed for researchers working with single-cell RNA sequencing (scRNA-seq) data. It can be used for:
- Zero-shot cell type annotations
- Zero-shot batch integration
- Conditional cell generation
Training Data
The model was trained on hECA-10M, a subset of the hECA dataset. It comprises more than 10 million high-quality single-cell transcriptomes from vital human organs and tissues. The 2000 most highly variable genes across the dataset were selected as model input. Each transcriptome is accompanied by metadata attributes including organ, donor age, donor gender, sequencing technology, and cell type.
Performance
- In the paper, scMulan achieves better cell type prediction accuracy than scGPT, Geneformer and CellTypist.
- It is competitive with a finetuned scGPT model on batch integration, and performs better than the other tested models.
- Conditional generation quality is evaluated through Q-Q plots and UMAPs.
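As a rough illustration of what a Q-Q comparison between real and generated expression values involves, the sketch below pairs matching quantiles of two samples and reports the largest gap between them. It is not the evaluation code from the paper; the function names `qq_points` and `max_qq_deviation` and the choice of 10 quantile bins are assumptions made for this example.

```python
from statistics import quantiles


def qq_points(real, generated, n_quantiles=10):
    """Pair up matching quantiles of two samples (hypothetical helper).

    Each returned tuple is (real_quantile, generated_quantile); plotting
    these pairs against the diagonal gives a Q-Q plot.
    """
    real_q = quantiles(real, n=n_quantiles, method="inclusive")
    gen_q = quantiles(generated, n=n_quantiles, method="inclusive")
    return list(zip(real_q, gen_q))


def max_qq_deviation(points):
    """Largest absolute gap between paired quantiles (0 means identical)."""
    return max(abs(r - g) for r, g in points)


# Toy values standing in for one gene's expression in real vs. generated cells.
real_values = [float(x) for x in range(101)]
generated_values = [x + 5.0 for x in real_values]

points = qq_points(real_values, generated_values)
deviation = max_qq_deviation(points)
```

Points that lie close to the diagonal indicate that the generated distribution matches the real one; a constant offset, as in the toy data here, shows up as a uniform shift away from it.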
Limitations
- The pretrained model has only seen 2000 genes.
- The generated data has greater cell sparsity than real data.
- Information is missing from the authors' GitHub on how to run the model for generation.
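Two of these limitations can be checked on your own data before relying on the model. The sketch below measures how many of the model's genes appear in a dataset and what fraction of an expression matrix is zero. It is plain Python written for this card, not part of the DeepLife package; `gene_coverage`, `zero_fraction`, and the gene lists are illustrative assumptions, and the real model gene list would have to be obtained from the model's own resources.

```python
def gene_coverage(dataset_genes, model_genes):
    """Report how much of the model's gene vocabulary a dataset covers."""
    model_set = set(model_genes)
    dataset_set = set(dataset_genes)
    shared = model_set & dataset_set
    fraction = len(shared) / len(model_set) if model_set else 0.0
    return {
        "n_shared": len(shared),
        "fraction_of_model_genes": fraction,
        "missing_from_dataset": sorted(model_set - dataset_set),
    }


def zero_fraction(matrix):
    """Fraction of entries equal to zero in a cells-by-genes matrix."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total if total else 0.0


# Hypothetical gene lists and a tiny 2x2 count matrix, for illustration only.
coverage = gene_coverage(
    dataset_genes=["CD3E", "CD4", "GAPDH"],
    model_genes=["CD3E", "CD4", "MS4A1", "CD19"],
)
sparsity = zero_fraction([[0, 1], [0, 0]])
```

A low `fraction_of_model_genes` suggests that many of the genes the model expects are absent from the dataset, and comparing `zero_fraction` between real and generated matrices gives a quick view of the sparsity gap.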
Ethical Considerations
Users should be aware that while the data used to train scMulan is anonymized, it represents human tissue samples and should be treated with appropriate respect and consideration. Researchers using this model should adhere to ethical guidelines for human subjects research.
Usage
To use the scMulan model within the DeepLife ML Infra:
Install the package:
pip install deeplife-mlinfra
Import and use the model:
import anndata as ad
from huggingface_hub import hf_hub_download

from dl_models.models.scmulan.model import ScMulanModel
from dl_models.models.scmulan.processor import ScMulanProcessor

# Load the model and preprocessor
model = ScMulanModel.from_pretrained("deeplife/scmulan_model")
preprocessor = ScMulanProcessor.from_pretrained("deeplife/scmulan_model")
model.eval()

# Load your data (example using a sample dataset)
filepath = hf_hub_download(
    repo_id="deeplife/h5ad_samples",
    filename="GSE136831small.h5ad",
    repo_type="dataset",
)
adata = ad.read_h5ad(filepath)

# Preprocess and create a dataloader
dataloader = preprocessor.transform_to_dataloader(adata, batch_size=256)

# Get embeddings and cell type predictions for the first batch only
for batch in dataloader:
    coarse_cell_types, fine_cell_types, hidden = model.get_cell_types_and_embeddings(batch)
    print(coarse_cell_types)
    print(fine_cell_types)
    break
For more detailed usage instructions, please refer to the documentation.
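To summarize predictions over a whole dataset rather than one batch, the per-batch labels can be tallied. The sketch below assumes that each batch of predictions is a sequence of string-like labels, which this card does not confirm; `tally_predictions` is a hypothetical helper, and the labels shown are made up.

```python
from collections import Counter


def tally_predictions(label_batches):
    """Count predicted labels across batches (hypothetical helper).

    Assumes each element of label_batches is an iterable of labels,
    e.g. the coarse cell types returned for one batch.
    """
    counts = Counter()
    for labels in label_batches:
        counts.update(str(label) for label in labels)
    return counts


# Made-up labels standing in for predictions collected from two batches.
collected = [["T cell", "B cell"], ["T cell"]]
counts = tally_predictions(collected)
top_label, top_count = counts.most_common(1)[0]
```

In practice, `collected` would be filled by appending the predicted labels inside the dataloader loop, with the `break` removed so that every batch is processed.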
Citation
If you use this model in your research, please cite both the original scMulan paper and the DeepLife ML Infra package:
@InProceedings{10.1007/978-1-0716-3989-4_57,
author="Bian, Haiyang and Chen, Yixin and Dong, Xiaomin and Li, Chen and Hao, Minsheng and Chen, Sijie and Hu, Jinyi and Sun, Maosong and Wei, Lei and Zhang, Xuegong",
editor="Ma, Jian",
title="scMulan: A Multitask Generative Pre-Trained Language Model for Single-Cell Analysis",
booktitle="Research in Computational Molecular Biology",
year="2024",
publisher="Springer Nature Switzerland",
address="Cham",
pages="479--482",
isbn="978-1-0716-3989-4"
}
@software{deeplife_mlinfra,
title={DeepLife ML Infra: Infrastructure for Biological Deep Learning Models},
author={DeepLife AI Team},
year={2023},
url={https://github.com/deeplifeai/deeplife-mlinfra},
version={1.0.0}
}
Contact
For questions or issues related to this model implementation in DeepLife ML Infra, please open an issue in the repository.
For questions about the original scMulan model, please contact the authors of the paper.