Question about the handling of ejection fraction in the dataset

#142
by mriee - opened

Thank you for the enthusiastic answers in the previous discussions. I noticed this sentence in the Supplementary Information Methods:

We only included non-failing heart samples that had a documented normal ejection fraction.

I would like to ask whether the human_dcm_hcm_nf.dataset provided in Genecorpus-30M has already been filtered this way. While processing the data, I noticed that lvef is nan for some rows, for example:

from datasets import load_from_disk
data = load_from_disk(dataset_path)

data[45045]

{'input_ids':  ...,
 'length': 2048,
 'cell_type': 'Cardiomyocyte1',
 'individual': '1540',
 'age': 72.0,
 'sex': 'Female',
 'disease': 'nf',
 'lvef': nan}

data[89755]

{'input_ids': ...,
 'length': 2048,
 'cell_type': 'Cardiomyocyte1',
 'individual': '1603',
 'age': 69.0,
 'sex': 'Female',
 'disease': 'nf',
 'lvef': nan}

In the dataset, I found 16 individuals labeled nf, two of whom have lvef as nan. However, the text gives the number of non-failing individuals as 9+4. Could you confirm the data processing procedure, especially the selection criteria? Thank you!
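For readers who want to reproduce this count, here is a minimal sketch of how the nf individuals and their missing lvef values could be tallied. It assumes each row is a dict with the 'individual', 'disease', and 'lvef' keys shown above. The function name `summarize_nf_lvef` and the toy rows are hypothetical and stand in for the real dataset; the individual IDs "0001" and "0002" and their lvef values are made up.

```python
import math


def summarize_nf_lvef(rows):
    """Return (all nf individuals, nf individuals whose lvef is missing)."""
    nf_ids, missing_ids = set(), set()
    for row in rows:
        if row["disease"] != "nf":
            continue
        nf_ids.add(row["individual"])
        lvef = row["lvef"]
        # Treat both None and NaN as "not documented".
        if lvef is None or math.isnan(lvef):
            missing_ids.add(row["individual"])
    return nf_ids, missing_ids


# Toy rows for illustration only; "0001" and "0002" are invented.
toy_rows = [
    {"individual": "1540", "disease": "nf", "lvef": float("nan")},
    {"individual": "1603", "disease": "nf", "lvef": float("nan")},
    {"individual": "0001", "disease": "nf", "lvef": 60.0},
    {"individual": "0002", "disease": "dcm", "lvef": 20.0},
]

nf_ids, missing_ids = summarize_nf_lvef(toy_rows)
print(len(nf_ids), sorted(missing_ids))
```

Iterating over the loaded dataset yields one dict per row, so passing `data` in place of `toy_rows` should work, although it may be slow because every row also carries its 'input_ids'.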

Thank you for your question! You are correct in your reading of the Methods: "We only included non-failing heart samples that had a documented normal ejection fraction." The two individuals you indicated had no LVEF documented. There is a third that had an abnormal LVEF. The three individuals excluded from the fine-tuning for disease classification were: ["1547","1540","1603"]
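To apply the same exclusion locally, something like the sketch below may be enough. It is not the authors' code: the names `EXCLUDED_INDIVIDUALS` and `keep_for_classification` are hypothetical, the toy rows are invented apart from the three IDs listed above, and it assumes 'individual' is stored as a string, as in the rows printed earlier in this thread.

```python
# The three individuals named in the reply above.
EXCLUDED_INDIVIDUALS = {"1547", "1540", "1603"}


def keep_for_classification(row):
    """True if the row's individual is not on the exclusion list."""
    return row["individual"] not in EXCLUDED_INDIVIDUALS


# Toy rows for illustration only; "0001" is an invented ID.
example_rows = [
    {"individual": "1540", "disease": "nf"},
    {"individual": "1547", "disease": "nf"},
    {"individual": "1603", "disease": "nf"},
    {"individual": "0001", "disease": "nf"},
]

kept_rows = [row for row in example_rows if keep_for_classification(row)]
print([row["individual"] for row in kept_rows])

# With the Hugging Face dataset loaded as `data`, the same predicate
# could be passed to Dataset.filter:
#     filtered = data.filter(keep_for_classification)
```

Filtering by individual ID rather than by lvef value also drops the individual with an abnormal LVEF, which a NaN check alone would miss.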

ctheodoris changed discussion status to closed
