Question about the handling of ejection fraction in the dataset
Thank you for your enthusiastic answers in the previous discussions. I noticed this sentence in the Methods section of the Supplementary Information:
"We only included non-failing heart samples that had a documented normal ejection fraction."
I would like to ask whether the human_dcm_hcm_nf.dataset provided in Genecorpus-30M has already been filtered this way. While processing the data, I noticed that lvef is nan for some rows, for example:
from datasets import load_from_disk

data = load_from_disk(dataset_path)
data[45045]
{'input_ids': ...,
'length': 2048,
'cell_type': 'Cardiomyocyte1',
'individual': '1540',
'age': 72.0,
'sex': 'Female',
'disease': 'nf',
'lvef': nan}
data[89755]
{'input_ids': ...,
'length': 2048,
'cell_type': 'Cardiomyocyte1',
'individual': '1603',
'age': 69.0,
'sex': 'Female',
'disease': 'nf',
'lvef': nan}
In the dataset, I found 16 individuals labeled nf, two of whom have lvef as nan. However, the text gives the number of non-failing individuals as 9+4 = 13. So I would like to confirm the data processing procedure, especially the selection criteria. Thank you!
Thank you for your question! You are correct in your reading of the Methods: "We only included non-failing heart samples that had a documented normal ejection fraction." The two individuals you indicated had no LVEF documented, and a third had an abnormal LVEF. That accounts for the difference between the 16 nf individuals in the dataset and the 13 reported. The three individuals excluded from the fine-tuning for disease classification were: ["1547","1540","1603"]
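For anyone reproducing this selection, here is a minimal sketch of how the exclusion could be applied. It is not the authors' actual preprocessing code. The helper names `keep_for_classification` and `nf_individuals_missing_lvef` are hypothetical, and `toy_rows` is made-up illustration data that only mirrors the field names shown in the question (the row with individual "9999" and its lvef value are invented).

```python
import math

# Individuals reported above as excluded from disease-classification fine-tuning.
EXCLUDED_INDIVIDUALS = {"1547", "1540", "1603"}


def keep_for_classification(example):
    """Return True if the row's individual is not on the exclusion list."""
    return str(example["individual"]) not in EXCLUDED_INDIVIDUALS


def nf_individuals_missing_lvef(rows):
    """Return sorted IDs of non-failing individuals whose lvef is None or NaN."""
    missing = set()
    for row in rows:
        lvef = row["lvef"]
        if row["disease"] == "nf" and (lvef is None or math.isnan(lvef)):
            missing.add(str(row["individual"]))
    return sorted(missing)


# Toy rows for illustration only (assumption: same keys as the real dataset).
toy_rows = [
    {"individual": "1540", "disease": "nf", "lvef": float("nan")},
    {"individual": "1603", "disease": "nf", "lvef": float("nan")},
    {"individual": "9999", "disease": "nf", "lvef": 60.0},  # hypothetical ID and value
]

missing_ids = nf_individuals_missing_lvef(toy_rows)
kept_rows = [row for row in toy_rows if keep_for_classification(row)]
print(missing_ids)
print([row["individual"] for row in kept_rows])
```

On the real dataset loaded with `load_from_disk`, the same predicate could be passed to the Hugging Face `datasets` method `Dataset.filter`, for example `data.filter(keep_for_classification)`, assuming the `individual` column holds the IDs as shown in the question.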