Problems in replicating in silico treatment analysis

#140
by mriee - opened

Thank you for your outstanding work. I ran into difficulties while trying to reproduce your results from the "in silico treatment analysis". I set up my configuration following the in_silico_perturbation.ipynb notebook:

from geneformer import InSilicoPerturber, InSilicoPerturberStats

isp = InSilicoPerturber(perturb_type="delete",
                        perturb_rank_shift=None,
                        genes_to_perturb="all",
                        combos=0,
                        anchor_gene=None,
                        model_type="CellClassifier",
                        num_classes=3,
                        emb_mode="cell",
                        cell_emb_style="mean_pool",
                        filter_data={"cell_type":["Cardiomyocyte1","Cardiomyocyte2","Cardiomyocyte3"]},
                        cell_states_to_model={"disease":(["dcm"],["nf"],["hcm"])},
                        max_ncells=None,
                        emb_layer=0,
                        forward_batch_size=400,
                        nproc=20)
isp.perturb_data("path/to/fine-tuned-model",
                 "path/to/Genecorpus-30M/example_input_files/cell_classification/disease_classification/human_dcm_hcm_nf.dataset",
                 "path/to/perturb_output_directory",
                 "output_prefix")
ispstats = InSilicoPerturberStats(mode="goal_state_shift",
                                  genes_perturbed="all",
                                  combos=0,
                                  anchor_gene=None,
                                  cell_states_to_model={"disease":(["dcm"],["nf"],["hcm"])})
ispstats.get_stats("path/to/perturb_output_directory",
                   None,
                   "path/to/stats_output_directory",
                   "stats_prefix")

However, the Shift values I obtained differ significantly from those in the DCM_del_tx worksheet of Supplementary Table 12. The detection counts also differ widely (N_Detections in my results versus #_Detections in the table). For example, for the gene RYR2 (Gene_Token=17173, Ensembl_ID=ENSG00000198626), N_Detections in my results was 29981, whereas #_Detections in Table 12 was 9611. I understand that the paper may have used a subset, so I counted the number of times RYR2 appeared in each individual with dcm among all cardiomyocytes (cell_type=Cardiomyocyte1/2/3 and disease=dcm):

# 'individual': counts
{'1430': 2091,
 '1371': 4578,
 '1300': 1884,
 '1617': 3152,
 '1290': 1818,
 '1437': 2889,
 '1358': 3739,
 '1304': 4280,
 '1504': 1115,
 '1472': 3850,
 '1606': 585}

As you can see, their sum is 29981, but I can't figure out which subset would add up to 9611. Did I miss something in the data processing? Thank you!
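For anyone who wants to repeat this check, here is a minimal sketch of how such a tally could be computed. It works on plain Python dicts rather than the real dataset. The column names cell_type, disease and individual follow the fields mentioned above, while input_ids (the list of gene tokens per cell) and the toy records, including the "Fibroblast1" label, are assumptions made only for illustration.

```python
from collections import Counter

RYR2_TOKEN = 17173  # Gene_Token for RYR2, as quoted in the post above


def detections_per_individual(cells, gene_token, cell_types, disease):
    """Count, per individual, the cells of the given types and disease state
    whose token list contains gene_token."""
    counts = Counter()
    for cell in cells:
        if cell["cell_type"] in cell_types and cell["disease"] == disease:
            if gene_token in cell["input_ids"]:
                counts[cell["individual"]] += 1
    return dict(counts)


# Made-up records, only to show the assumed shape of each row.
toy_cells = [
    {"cell_type": "Cardiomyocyte1", "disease": "dcm", "individual": "A", "input_ids": [17173, 5]},
    {"cell_type": "Cardiomyocyte2", "disease": "dcm", "individual": "A", "input_ids": [9, 17173]},
    {"cell_type": "Cardiomyocyte1", "disease": "nf", "individual": "B", "input_ids": [17173]},
    {"cell_type": "Fibroblast1", "disease": "dcm", "individual": "C", "input_ids": [17173]},
    {"cell_type": "Cardiomyocyte3", "disease": "dcm", "individual": "C", "input_ids": [4, 8]},
]
cardiomyocytes = {"Cardiomyocyte1", "Cardiomyocyte2", "Cardiomyocyte3"}
toy_counts = detections_per_individual(toy_cells, RYR2_TOKEN, cardiomyocytes, "dcm")

# The per-individual counts reported in the post above.
reported = {"1430": 2091, "1371": 4578, "1300": 1884, "1617": 3152,
            "1290": 1818, "1437": 2889, "1358": 3739, "1304": 4280,
            "1504": 1115, "1472": 3850, "1606": 585}
total = sum(reported.values())
fraction_in_table = 9611 / total  # share of my detections that Table 12 reports
print(toy_counts, total, round(fraction_in_table, 2))
```

With a real tokenized dataset, the same function could be fed the rows directly, provided the column names match these assumptions.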

Thank you for your question and interest in Geneformer! When we ran our analyses, the code was not yet packaged into modules the way it is now (we packaged it afterwards to make it easier for others to use), but the way you are setting up the in silico perturber is consistent with how we ran the analyses. We did not select a particular subset of individuals; however, our cluster had a maximum job time limit, so we had to run the analysis on a subset of cells drawn from the dataset as a whole. One thing to keep in mind: because the statistics depend on the number of detections, running more cells may lead to more genes being called significant.
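To illustrate the point about subsets, here is a rough sketch of how capping the number of cells bounds the detection count for a gene. This is not how the original subset was drawn (that is not stated in this thread). The helper names subsample_cells and count_detections, the input_ids field, the seed and the synthetic cells are all hypothetical.

```python
import random


def subsample_cells(cells, max_ncells, seed=0):
    """Return at most max_ncells cells chosen uniformly at random.
    max_ncells=None keeps every cell."""
    cells = list(cells)
    if max_ncells is None or max_ncells >= len(cells):
        return cells
    return random.Random(seed).sample(cells, max_ncells)


def count_detections(cells, gene_token):
    """Number of cells whose token list contains gene_token."""
    return sum(1 for cell in cells if gene_token in cell["input_ids"])


# Synthetic data: 1000 cells, every other one contains token 17173.
cells = [{"input_ids": [17173, i] if i % 2 == 0 else [i]} for i in range(1000)]

full_count = count_detections(cells, 17173)
subset = subsample_cells(cells, 300)
subset_count = count_detections(subset, 17173)

# A gene cannot be detected more often than there are cells in the run,
# so a smaller run yields a smaller detection count.
print(full_count, len(subset), subset_count)
```

Since a detection count can never exceed the number of cells processed, a run over all cells and a run over a time-limited subset will report different counts for the same gene.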

Shouldn't the input for ispstats.get_stats() be the intermediate files generated by isp.perturb_data()?

Yes, that is correct. If you run perturb_data with more cells, get_stats will have results from more cells available, so the number of detections will be higher because each gene is observed in more cells.

I made a mistake when composing my post. I actually ran the following code:

ispstats.get_stats("path/to/output_directory",
                   None,
                   "path/to/output_directory",
                   "stats_prefix")

It reads the cosine similarity data from the output of the first step and then saves the stats in the same directory in CSV format. I have corrected the relevant statement in the middle of my post.
