9 days ago

Hello,

I am tokenizing my .h5ad scRNA-seq data using:
from geneformer import TranscriptomeTokenizer

tk = TranscriptomeTokenizer(nproc=16)
tk.tokenize_data(file_path,
                 file_path,
                 "token2",
                 file_format="h5ad")

I then run in silico perturbation using:
from geneformer import InSilicoPerturber

isp = InSilicoPerturber(perturb_type="delete",
                        perturb_rank_shift=None,
                        genes_to_perturb=["ENSG00000179750"],
                        model_type="Pretrained",
                        num_classes=0,
                        emb_mode="cls_and_gene",
                        filter_data=None,
                        cell_states_to_model=None,
                        state_embs_dict=None,
                        max_ncells=None,
                        emb_layer=0,
                        forward_batch_size=50,
                        nproc=8)

# Provide the path to the saved dataset directory
isp.perturb_data("/content/Geneformer",
                 "/content/drive/MyDrive/Cov2-scRNASeq_GSE145926/Tokenizer/token.dataset",
                 "/content/drive/MyDrive/Cov2-scRNASeq_GSE145926/In_Silico3",
                 "APOBEC3B_Deletion3")

I then get stats using:
from geneformer import InSilicoPerturberStats
ispstats = InSilicoPerturberStats(mode="aggregate_gene_shifts",
                                  genes_perturbed=["ENSG00000179750"])

ispstats.get_stats("/content/drive/MyDrive/Cov2-scRNASeq_GSE145926/In_Silico3",
                   None,
                   "/content/drive/MyDrive/Cov2-scRNASeq_GSE145926",
                   "APOBEC3B_Deletion3")

When I open the .csv stats file, the Ensembl IDs and gene names do not appear to match up. This is the case for both the perturbed gene and the affected genes. Is something going wrong in my tokenization step?

Thank you very much.

bfixman

about 16 hours ago

Very sorry for the confusion. The Ensembl IDs and gene names do in fact match up; the file was just displaying a less well-known gene synonym.
Thanks
Ben
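
For anyone who runs into the same apparent mismatch, here is a minimal sketch of how to check whether a gene name in the stats file is just a synonym of the approved symbol. It is not part of Geneformer. The alias table, the helper names (`classify_name`, `check_stats_csv`), and the column names `Ensembl_ID` and `Gene_name` are all assumptions for illustration. The single alias shown is only an example; a real table should be built from an authoritative source such as HGNC, and the column names should be adjusted to the actual CSV header.

```python
import csv
import io

# Illustrative alias table: Ensembl ID -> (approved symbol, known synonyms).
# Assumption: in practice this is built from an authoritative source (e.g. HGNC);
# the one alias listed here is an example, not a complete list.
GENE_ALIASES = {
    "ENSG00000179750": ("APOBEC3B", {"A3B"}),
}


def classify_name(ensembl_id, gene_name, aliases=GENE_ALIASES):
    """Return 'approved', 'synonym', 'mismatch', or 'unknown_id'."""
    if ensembl_id not in aliases:
        return "unknown_id"
    approved, synonyms = aliases[ensembl_id]
    name = gene_name.strip().upper()  # compare case-insensitively
    if name == approved.upper():
        return "approved"
    if name in {s.upper() for s in synonyms}:
        return "synonym"
    return "mismatch"


def check_stats_csv(handle, id_col="Ensembl_ID", name_col="Gene_name"):
    """Classify every row of a stats CSV; the column names are assumptions."""
    return [
        (row[id_col], row[name_col], classify_name(row[id_col], row[name_col]))
        for row in csv.DictReader(handle)
    ]


# Usage with an in-memory CSV standing in for the real stats file.
example = io.StringIO("Ensembl_ID,Gene_name\nENSG00000179750,A3B\n")
for ensembl_id, name, status in check_stats_csv(example):
    print(ensembl_id, name, status)
```

A row reported as `synonym` means the ID and name agree and only the displayed symbol differs, while `mismatch` would point to a real problem worth investigating.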

bfixman changed discussion status to closed about 16 hours ago


In Silico Perturbation Stats: Gene ID and Ensembl ID not matching up
