In_silico_perturber runtime

#137
by FarzanehN - opened

I used the in_silico_perturber module to evaluate the impact of gene deletions on shifting DCM cell embeddings towards the normal state. To prevent out-of-memory (OOM) errors, I carefully selected the following parameters (a rough sketch of how I pass them follows the list):

perturb_type="delete",
perturb_rank_shift=None,
genes_to_perturb='all',
combos=0,
anchor_gene=None,
model_type="CellClassifier",
num_classes=2,
emb_mode="cell",
cell_emb_style="mean_pool",
filter_data=None,
cell_states_to_model={"cell_type":(["dcm"],["normal"],[])},
max_ncells=200,
emb_layer=0,
forward_batch_size=10,
nproc=16
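
For completeness, here is roughly how I pass these arguments (paths are placeholders rather than my actual directories, and I am quoting the perturb_data call from memory, so the exact signature may differ slightly):

from geneformer import InSilicoPerturber

isp = InSilicoPerturber(perturb_type="delete",
                        perturb_rank_shift=None,
                        genes_to_perturb="all",
                        combos=0,
                        anchor_gene=None,
                        model_type="CellClassifier",
                        num_classes=2,
                        emb_mode="cell",
                        cell_emb_style="mean_pool",
                        filter_data=None,
                        cell_states_to_model={"cell_type": (["dcm"], ["normal"], [])},
                        max_ncells=200,
                        emb_layer=0,
                        forward_batch_size=10,
                        nproc=16)

# fine-tuned model checkpoint, tokenized .dataset input, and output location are placeholders
isp.perturb_data("path/to/fine_tuned_model",
                 "path/to/tokenized.dataset",
                 "path/to/output_dir",
                 "output_prefix")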

Running the code on an NVIDIA Tesla V100 took a considerable amount of time, around 4 hours. Given the current setup, it won't be feasible to run it on 780k cells even with 4 GPUs. Is this long runtime expected? What do you recommend to reduce the runtime? Thanks

Thank you for your question! The batch size you are using is extremely small. I would expect batch sizes of 200+ on a 40G V100. Are you running out of memory with batch sizes larger than 10?

That's correct. I am running it on a V100 with 16GB of GPU memory.

Thank you for clarifying your GPU size. Can you monitor with nvtop and confirm the GPU cache is emptied between batches? If not, please let me know. I also assume you are using the 6-layer model rather than the 12-layer model, but if you are not, you may consider using the 6-layer model given the resource limitation. You can also try distributing the job with DeepSpeed if you have additional 16G GPUs. This will shard the model so that you don't need to replicate it on each GPU, leaving more room for batches.
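
As a rough illustration of what to look for (this is not the exact perturber code, just the pattern): after each forward batch, intermediate tensors should be freed and the CUDA cache emptied, and the allocated/reserved numbers should drop back down, which is also what nvtop should show.

import torch

with torch.no_grad():
    for batch in batches:  # placeholder: iterable of tokenized minibatches already on the GPU
        outputs = model(**batch)
        # ... copy only the embeddings you need to the CPU here ...
        del outputs
        torch.cuda.empty_cache()
        print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB, "
              f"reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")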

ctheodoris changed discussion status to closed

Thank you for your input. To run in_silico_perturber.py with DeepSpeed-Inference, I added the following call inside the load_model function:

model = deepspeed.init_inference(model,
                                 replace_with_kernel_inject=True,
                                 tensor_parallel={"enabled": True, "tp_size": 4, "mpu": None, "tp_group": None},
                                 enable_cuda_graph=True,
                                 dtype=torch.float,
                                 )

But I am getting "RuntimeError: The size of tensor a (512) must match the size of tensor b (0) at non-singleton dimension 1". Would you be able to provide any suggestions?
FYI, I am running the recently updated code.

FarzanehN changed discussion status to open

I estimated that with max_ncells=None, running on eight T4 GPUs, it would take 24 days to process all cells.

Hello, I have tried the in_silico_perturbation.ipynb code located in the example folder. I used the dataset located at ./Genecorpus-30M/example_input_files/cell_classification/disease_classification/human_dcm_hcm_nf.dataset as the input file. The maximum VRAM usage is approximately 18GB when I set forward_batch_size to 100 in your code. So, if you are using a 16G V100, I suggest trying a batch size slightly smaller than 100 for better speed, instead of using 10. I hope this helps.
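
If it helps, this is roughly how I would probe for the largest forward_batch_size that fits on a 16G card (a sketch only; paths are placeholders and the remaining InSilicoPerturber arguments should mirror your configuration above):

import torch
from geneformer import InSilicoPerturber

common_kwargs = dict(perturb_type="delete",
                     genes_to_perturb="all",
                     model_type="CellClassifier",
                     num_classes=2,
                     emb_mode="cell",
                     cell_states_to_model={"cell_type": (["dcm"], ["normal"], [])},
                     max_ncells=50,  # small subset just for the probe
                     nproc=16)

for batch_size in (90, 70, 50, 30):
    try:
        isp = InSilicoPerturber(forward_batch_size=batch_size, **common_kwargs)
        isp.perturb_data("path/to/fine_tuned_model",
                         "path/to/tokenized.dataset",
                         "path/to/probe_output",
                         f"probe_bs{batch_size}")
        print(f"forward_batch_size={batch_size} fits")
        break
    except RuntimeError as err:
        if "out of memory" not in str(err):
            raise
        torch.cuda.empty_cache()
        print(f"forward_batch_size={batch_size} ran out of memory, trying smaller")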

Thank you all for your input on this discussion. Regarding DeepSpeed, we have used it for training but not for inference. I would suggest finding out which step in the code is causing the tensor mismatch. Sometimes if there is a remainder batch that contains just one tensor, its dimensions will differ from the other batches, which were 3-dimensional tensors, and this will cause an issue when stacking them together. If you are distributing the model with DeepSpeed, it should allow larger batch sizes, which would make this issue less likely. If you are not able to run larger batch sizes with DeepSpeed than without it, I would suggest looking into how the analysis is set up to ensure the model is distributed properly. Also, as mentioned before, if you are using the 12-layer model and encountering memory limitations, I would suggest trying the 6-layer model (outer directory of this repository), which is more memory-efficient.
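
To illustrate the kind of shape mismatch I mean, here is a toy example (the shapes are made up and this is not the actual perturber code): most batches are 3-dimensional (batch, sequence length, hidden size), but a remainder batch of a single cell that has been squeezed loses its batch dimension, and combining the batches then fails.

import torch

full_batches = [torch.randn(10, 512, 256) for _ in range(3)]  # (batch, seq_len, hidden)
remainder = torch.randn(512, 256)  # single-cell remainder squeezed down to 2 dimensions

try:
    torch.cat(full_batches + [remainder], dim=0)
except RuntimeError as err:
    print(err)  # dimension mismatch when combining the batches

# Restoring the batch dimension makes the shapes consistent again:
fixed = torch.cat(full_batches + [remainder.unsqueeze(0)], dim=0)
print(fixed.shape)  # torch.Size([31, 512, 256])

Separately, if you are partitioning across 4 GPUs, it is worth double-checking that the script is launched with the DeepSpeed launcher (for example, deepspeed --num_gpus 4 your_script.py, where the script name is a placeholder); if the tensor-parallel group is not set up as expected, a partition could end up empty, which might be one source of a size-0 tensor.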

ctheodoris changed discussion status to closed
