ctheodoris/Geneformer · Question about the handling of ejection fraction in the dataset

Jul 30, 2023

Thank you for your enthusiastic answers in the previous discussions. I noticed this sentence in the Supplementary Information Methods:

We only included non-failing heart samples that had a documented normal ejection fraction.

Has the human_dcm_hcm_nf.dataset provided in Genecorpus-30M already been filtered this way? While processing the data, I noticed that lvef is nan for some rows, for example:

from datasets import load_from_disk
data = load_from_disk(dataset_path)

data[45045]

{'input_ids':  ...,
 'length': 2048,
 'cell_type': 'Cardiomyocyte1',
 'individual': '1540',
 'age': 72.0,
 'sex': 'Female',
 'disease': 'nf',
 'lvef': nan}

data[89755]

{'input_ids': ...,
 'length': 2048,
 'cell_type': 'Cardiomyocyte1',
 'individual': '1603',
 'age': 69.0,
 'sex': 'Female',
 'disease': 'nf',
 'lvef': nan}

In the dataset, I found 16 individuals labeled nf, two of whom have lvef as nan. However, the text indicates that the number of non-failing individuals is 9+4. I would therefore like to confirm the data processing procedure, especially the selection criteria. Thank you!
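For anyone who wants to reproduce this kind of check, here is a minimal sketch of how the nf individuals and their lvef values could be tallied. It assumes only the field names shown in the rows above ('individual', 'disease', 'lvef'). The helper names and the toy rows are hypothetical, and the rows labeled toy_A and toy_B and their lvef values are made up for illustration.

```python
import math
from collections import defaultdict


def lvef_by_individual(rows, disease="nf"):
    """Map each individual with the given disease label to the set of
    LVEF values seen in their rows. Missing values (None or NaN) are
    stored as None so that they compare equal to each other."""
    seen = defaultdict(set)
    for row in rows:
        if row["disease"] != disease:
            continue
        lvef = row["lvef"]
        missing = lvef is None or (isinstance(lvef, float) and math.isnan(lvef))
        seen[row["individual"]].add(None if missing else lvef)
    return dict(seen)


def individuals_missing_lvef(rows, disease="nf"):
    """Return the sorted IDs of individuals whose rows never carry an LVEF."""
    by_individual = lvef_by_individual(rows, disease)
    return sorted(ind for ind, values in by_individual.items() if values == {None})


# Toy rows for illustration only; they are not taken from the dataset.
toy_rows = [
    {"individual": "1540", "disease": "nf", "lvef": float("nan")},
    {"individual": "1540", "disease": "nf", "lvef": float("nan")},
    {"individual": "1603", "disease": "nf", "lvef": float("nan")},
    {"individual": "toy_A", "disease": "nf", "lvef": 60.0},
    {"individual": "toy_B", "disease": "dcm", "lvef": 20.0},
]

nf_individuals = lvef_by_individual(toy_rows)
missing = individuals_missing_lvef(toy_rows)
print(len(nf_individuals), missing)
```

Iterating over a Hugging Face dataset yields one dict per row, so the same functions should accept `data` directly in place of `toy_rows`, although dropping the `input_ids` column first would probably make the loop faster.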

ctheodoris

Owner Jul 31, 2023

Thank you for your question! You are correct in your reading of the Methods: "We only included non-failing heart samples that had a documented normal ejection fraction." The two individuals you indicated (1540 and 1603) had no LVEF documented. A third individual (1547) had an abnormal LVEF. The three individuals excluded from the fine-tuning for disease classification were: ["1547","1540","1603"]
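A minimal sketch of how that exclusion could be applied before fine-tuning follows. The list of IDs comes from the reply above; the names `EXCLUDED_INDIVIDUALS` and `keep_for_disease_classification`, and the toy rows, are hypothetical. This is not necessarily how the original filtering was implemented. Removing these three from the 16 nf individuals counted in the question leaves 13, which agrees with the 9+4 in the text.

```python
# IDs are stored as strings in the 'individual' field, as in the rows shown above.
EXCLUDED_INDIVIDUALS = frozenset({"1547", "1540", "1603"})


def keep_for_disease_classification(example):
    """Return True for cells from individuals that are not on the exclusion list."""
    return example["individual"] not in EXCLUDED_INDIVIDUALS


# Toy rows for illustration only; toy_A and toy_B are made-up IDs.
toy_cells = [
    {"individual": "1540", "disease": "nf"},
    {"individual": "toy_A", "disease": "nf"},
    {"individual": "1547", "disease": "nf"},
    {"individual": "toy_B", "disease": "dcm"},
    {"individual": "1603", "disease": "nf"},
]

kept = [cell for cell in toy_cells if keep_for_disease_classification(cell)]
print([cell["individual"] for cell in kept])

# With the dataset loaded via load_from_disk, the same predicate can be
# passed to the datasets library's filter method:
#     filtered = data.filter(keep_for_disease_classification)
```

Keeping the predicate as a plain function of one row means it can be checked on small hand-written examples before it is run over the full dataset.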

ctheodoris changed discussion status to closed Jul 31, 2023