In Silico Perturbation Stats: Gene ID and Ensembl ID not matching up

#465
by bfixman - opened

Hello,

I am tokenizing my .h5ad scRNA-seq data using:
from geneformer import TranscriptomeTokenizer

tk = TranscriptomeTokenizer(nproc=16)
tk.tokenize_data(file_path,
                 file_path,
                 "token2",
                 file_format="h5ad")

I then do in silico perturbation using:
from geneformer import InSilicoPerturber

isp = InSilicoPerturber(perturb_type="delete",
                        perturb_rank_shift=None,
                        genes_to_perturb=["ENSG00000179750"],
                        model_type="Pretrained",
                        num_classes=0,
                        emb_mode="cls_and_gene",
                        filter_data=None,
                        cell_states_to_model=None,
                        state_embs_dict=None,
                        max_ncells=None,
                        emb_layer=0,
                        forward_batch_size=50,
                        nproc=8)

# Provide the path to the saved dataset directory

isp.perturb_data("/content/Geneformer",
                 "/content/drive/MyDrive/Cov2-scRNASeq_GSE145926/Tokenizer/token.dataset",
                 "/content/drive/MyDrive/Cov2-scRNASeq_GSE145926/In_Silico3",
                 "APOBEC3B_Deletion3")

I then get stats using:
from geneformer import InSilicoPerturberStats

ispstats = InSilicoPerturberStats(mode="aggregate_gene_shifts",
                                  genes_perturbed=["ENSG00000179750"])

ispstats.get_stats("/content/drive/MyDrive/Cov2-scRNASeq_GSE145926/In_Silico3",
                   None,
                   "/content/drive/MyDrive/Cov2-scRNASeq_GSE145926",
                   "APOBEC3B_Deletion3")

When I open the .csv stats file, the Ensembl IDs and gene names do not appear to match up. This is the case for both the perturbed gene and the affected genes. Is something going wrong in my tokenizing step?

Thank you very much.

Very sorry for the confusion. The Ensembl IDs and gene names do in fact match up; the file was just displaying a lesser-known gene synonym.
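
For anyone who runs into the same apparent mismatch, here is a minimal sketch of how one might tell a synonym apart from a genuine mismatch. It is an illustration under assumptions, not part of Geneformer: the `REFERENCE` table, the `SYNONYM_A` alias, the helper names, and the `Ensembl_ID` / `Gene_name` column names are all hypothetical placeholders, so substitute the real header names from your stats file and a real annotation source.

```python
import csv
import io

# Hypothetical reference table: Ensembl ID -> (primary symbol, known synonyms).
# In practice this would be built from a gene annotation resource.
# "SYNONYM_A" is a placeholder, not a real alias.
REFERENCE = {
    "ENSG00000179750": ("APOBEC3B", {"SYNONYM_A"}),
}


def classify_pair(ensembl_id, gene_name, reference=REFERENCE):
    """Label one (Ensembl ID, gene name) pair.

    Returns 'primary', 'synonym', 'mismatch', or 'unknown_id'.
    """
    entry = reference.get(ensembl_id)
    if entry is None:
        return "unknown_id"
    primary, synonyms = entry
    if gene_name == primary:
        return "primary"
    if gene_name in synonyms:
        return "synonym"
    return "mismatch"


def check_stats_csv(handle, id_col="Ensembl_ID", name_col="Gene_name"):
    """Classify every (ID, name) pair in a CSV read from an open file handle.

    The column names are assumptions; pass the ones used in your file.
    """
    reader = csv.DictReader(handle)
    return [
        (row[id_col], row[name_col], classify_pair(row[id_col], row[name_col]))
        for row in reader
    ]


# Tiny in-memory stand-in for the stats file (made-up rows).
sample = io.StringIO(
    "Ensembl_ID,Gene_name\n"
    "ENSG00000179750,SYNONYM_A\n"
    "ENSG00000179750,APOBEC3B\n"
    "ENSG00000179750,SOMETHING_ELSE\n"
)
results = check_stats_csv(sample)
for row in results:
    print(row)
```

Rows labelled `synonym` are the harmless case described above; only rows labelled `mismatch` would suggest a real problem upstream, for example in tokenization.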
Thanks
Ben

bfixman changed discussion status to closed
