ctheodoris/Geneformer · Clarification on `gene_class_dict` Format

The documentation in Geneformer/geneformer/classifier.py specifies the expected format for the gene_class_dict parameter as:

gene_class_dict : None, dict
| Gene classes to fine-tune model to distinguish.
| Dictionary in format: {Gene_label_A: list(geneA1, geneA2, ...),
|                        Gene_label_B: list(geneB1, geneB2, ...)}
| Gene values should be Ensembl IDs.

However, consider the function label_gene_classes in Geneformer/geneformer/classifier_utils.py, shown below:

def label_gene_classes(example, class_id_dict, gene_class_dict):
    return [
        class_id_dict.get(gene_class_dict.get(token_id, -100), -100)
        for token_id in example["input_ids"]
    ]

It seems that this function expects gene_class_dict to map each token_id (likely Ensembl IDs or similar identifiers) to its Gene_label. This contradicts the documentation above, which describes a structure where the keys are Gene_label and the values are lists of genes (Ensembl IDs).

Questions

Should gene_class_dict be structured as:

{Gene_label_A: [geneA1, geneA2, ...], Gene_label_B: [geneB1, geneB2, ...]}

as per the documentation?

Or, should it be structured as:
```
{geneA1: Gene_label_A, geneA2: Gene_label_A, ...}
```
as implied by the function label_gene_classes?
If the intended structure is the first one (as per the documentation), could you clarify how label_gene_classes processes such a structure, or provide a corrected example?

Additional Context

Each token_id in example["input_ids"] is looked up directly as a key of gene_class_dict. If gene_class_dict were structured as {Gene_label: [gene1, gene2, ...]}, that lookup would never match, because get() compares token_id against the dictionary's keys (the labels) and does not search inside the list values. Every position would then fall back to -100.
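To make the mismatch concrete, here is a minimal sketch of what I am seeing. It copies label_gene_classes as quoted above and calls it with both dictionary shapes. The gene IDs, token IDs, the gene_token_dict mapping, and the invert_gene_class_dict helper are all hypothetical placeholders of mine, not taken from the library. I am only assuming that some inversion step like this would be needed before the function is called, and I do not know whether Geneformer performs one internally.

```python
def label_gene_classes(example, class_id_dict, gene_class_dict):
    # Copied from the snippet quoted above.
    return [
        class_id_dict.get(gene_class_dict.get(token_id, -100), -100)
        for token_id in example["input_ids"]
    ]


def invert_gene_class_dict(gene_class_dict, gene_token_dict):
    """Hypothetical helper: flatten {label: [gene, ...]} into {token_id: label}."""
    token_to_label = {}
    for label, genes in gene_class_dict.items():
        for gene in genes:
            token_id = gene_token_dict.get(gene)
            if token_id is not None:  # skip genes without a token
                token_to_label[token_id] = label
    return token_to_label


# Placeholder data (not real Ensembl IDs or real Geneformer token IDs).
documented = {"label_A": ["GENE_A1", "GENE_A2"], "label_B": ["GENE_B1"]}
gene_token_dict = {"GENE_A1": 10, "GENE_A2": 11, "GENE_B1": 12, "GENE_C1": 13}
class_id_dict = {"label_A": 0, "label_B": 1}
example = {"input_ids": [10, 13, 12, 11]}

# Documented shape passed as-is: no token ID is a key, so nothing is labeled.
labels_documented = label_gene_classes(example, class_id_dict, documented)
print(labels_documented)  # [-100, -100, -100, -100]

# Inverted shape: each token ID maps to its label, unlabeled genes get -100.
token_to_label = invert_gene_class_dict(documented, gene_token_dict)
labels_inverted = label_gene_classes(example, class_id_dict, token_to_label)
print(labels_inverted)  # [0, -100, 1, 0]
```

With the documented shape every position is ignored (-100), while the inverted shape produces the per-gene class IDs I would expect, which is why I suspect either the docstring is out of date or a conversion happens somewhere I have not found.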

Looking forward to your clarification. Thank you!
