In Silico Perturbation Stats: Gene ID and Ensembl ID not matching up
Hello,
I am tokenizing my .h5ad scRNA-seq data using:
from geneformer import TranscriptomeTokenizer
tk = TranscriptomeTokenizer(nproc=16)
tk.tokenize_data(file_path,
                 file_path,
                 "token2",
                 file_format="h5ad")
I then do in silico perturbation using:
from geneformer import InSilicoPerturber
isp = InSilicoPerturber(perturb_type="delete",
                        perturb_rank_shift=None,
                        genes_to_perturb=["ENSG00000179750"],
                        model_type="Pretrained",
                        num_classes=0,
                        emb_mode="cls_and_gene",
                        filter_data=None,
                        cell_states_to_model=None,
                        state_embs_dict=None,
                        max_ncells=None,
                        emb_layer=0,
                        forward_batch_size=50,
                        nproc=8)
# Provide the path to the saved dataset directory
isp.perturb_data("/content/Geneformer",
                 "/content/drive/MyDrive/Cov2-scRNASeq_GSE145926/Tokenizer/token.dataset",
                 "/content/drive/MyDrive/Cov2-scRNASeq_GSE145926/In_Silico3",
                 "APOBEC3B_Deletion3")
I then get stats using:
from geneformer import InSilicoPerturberStats
ispstats = InSilicoPerturberStats(mode="aggregate_gene_shifts",
                                  genes_perturbed=["ENSG00000179750"])
ispstats.get_stats("/content/drive/MyDrive/Cov2-scRNASeq_GSE145926/In_Silico3",
                   None,
                   "/content/drive/MyDrive/Cov2-scRNASeq_GSE145926",
                   "APOBEC3B_Deletion3")
When I open the .csv stats file, the Ensembl IDs and gene names do not appear to match up. This is the case for both the perturbed gene and the affected genes. Is something going wrong in my tokenizing step?
Thank you very much.
Very sorry for the confusion. The Ensembl IDs and gene names do in fact match up; the file was just displaying a less well-known gene synonym.
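For anyone who runs into the same thing, here is a rough sketch of how the ID/name pairs in a stats file could be sanity-checked before suspecting the tokenizer. It is not part of Geneformer. The column names "Ensembl_ID" and "Gene_name" are my assumption about the stats CSV layout, the REFERENCE table is hypothetical, and the alias strings in it are placeholders rather than real annotation data. In practice you would fill the table from whatever gene annotation source you trust.

```python
# Hypothetical reference table: Ensembl ID -> (primary symbol, set of aliases).
# The alias strings below are placeholders, NOT real gene synonyms.
REFERENCE = {
    "ENSG00000179750": ("APOBEC3B", {"ALIAS_1", "ALIAS_2"}),
}


def classify_name(ensembl_id, gene_name, reference=REFERENCE):
    """Classify one ID/name pair as 'primary', 'alias', 'mismatch', or 'unknown_id'."""
    entry = reference.get(ensembl_id)
    if entry is None:
        # The ID is not in the reference table, so nothing can be concluded.
        return "unknown_id"
    primary, aliases = entry
    name = gene_name.strip().upper()
    if name == primary.upper():
        return "primary"
    if name in {alias.upper() for alias in aliases}:
        # The name differs from the primary symbol but is a listed synonym.
        return "alias"
    return "mismatch"


def check_rows(rows, id_col="Ensembl_ID", name_col="Gene_name", reference=REFERENCE):
    """Classify every row of a table given as dicts (e.g. from csv.DictReader).

    id_col and name_col are assumed column names; adjust them to your file.
    """
    return [
        (row[id_col], row[name_col], classify_name(row[id_col], row[name_col], reference))
        for row in rows
    ]
```

To run it on a real file, open the CSV, wrap it in csv.DictReader, and pass the reader to check_rows. Rows that come back as "alias" are the situation I hit: the pairing is correct, but the name shown is a synonym rather than the symbol I expected. Only rows that come back as "mismatch" would point to a real problem.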
Thanks
Ben