Clarification on `gene_class_dict` Format
The documentation in Geneformer/geneformer/classifier.py specifies the expected format for the `gene_class_dict` parameter as:
gene_class_dict : None, dict
| Gene classes to fine-tune model to distinguish.
| Dictionary in format: {Gene_label_A: list(geneA1, geneA2, ...),
| Gene_label_B: list(geneB1, geneB2, ...)}
| Gene values should be Ensembl IDs.
However, consider the following function from Geneformer/geneformer/classifier_utils.py:
def label_gene_classes(example, class_id_dict, gene_class_dict):
    return [
        class_id_dict.get(gene_class_dict.get(token_id, -100), -100)
        for token_id in example["input_ids"]
    ]
Here `gene_class_dict` appears to be keyed by `token_id` (likely Ensembl IDs or similar identifiers), with a `Gene_label` as each value. This contradicts the documentation above, which describes keys as `Gene_label` and values as lists of genes (Ensembl IDs).
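To make the mismatch concrete, here is a minimal sketch that runs the function above on toy data. The token IDs and label names are made up for illustration, and I am assuming `class_id_dict` maps each label to an integer class index. I also use token IDs rather than Ensembl IDs inside the documented-style dictionary, to give that format its best chance of matching `input_ids`.

```python
def label_gene_classes(example, class_id_dict, gene_class_dict):
    # Copied from the snippet above.
    return [
        class_id_dict.get(gene_class_dict.get(token_id, -100), -100)
        for token_id in example["input_ids"]
    ]


# Toy values for illustration only; these are not real Geneformer tokens.
example = {"input_ids": [10, 11, 12, 99]}
class_id_dict = {"Gene_label_A": 0, "Gene_label_B": 1}  # assumed label -> class index

# Structure described in the docstring: label -> list of genes.
documented = {"Gene_label_A": [10, 11], "Gene_label_B": [12]}

# Structure the function seems to expect: token -> label.
inverted = {10: "Gene_label_A", 11: "Gene_label_A", 12: "Gene_label_B"}

labels_documented = label_gene_classes(example, class_id_dict, documented)
labels_inverted = label_gene_classes(example, class_id_dict, inverted)

print(labels_documented)  # [-100, -100, -100, -100]
print(labels_inverted)    # [0, 0, 1, -100]
```

With the documented structure every position falls back to -100, because no token ID is ever a key of the dictionary. So, as far as I can tell, the documented format can only work if something upstream converts it before this function is called.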
Questions
Should `gene_class_dict` be structured as `{Gene_label_A: [geneA1, geneA2, ...], Gene_label_B: [geneB1, geneB2, ...]}`, as per the documentation?
Or should it be structured as `{geneA1: Gene_label_A, geneA2: Gene_label_A, ...}`, as implied by the function `label_gene_classes`?
If the intended structure is the first one (as per the documentation), could you clarify how `label_gene_classes` processes such a structure, or provide a corrected example?
Additional Context
- The `example["input_ids"]` values in the function are looked up directly as `gene_class_dict` keys. If `gene_class_dict` were structured as `{Gene_label: [gene1, gene2, ...]}`, the function logic would not align, because `get()` matches individual keys and never searches inside the list values.
Looking forward to your clarification. Thank you!