Christina Theodoris committed
Commit: c34ead6
1 Parent(s): d468697
Add further explanation regarding input file format for transcriptome tokenizer
examples/tokenizing_scRNAseq_data.ipynb
CHANGED
@@ -17,7 +17,7 @@
 "source": [
 "#### Input data is a directory with .loom files containing raw counts from single cell RNAseq data, including all genes detected in the transcriptome without feature selection. \n",
 "\n",
-"#### Genes should be labeled with Ensembl IDs (row attribute \"ensembl_id\"), which provide a unique identifer for conversion to tokens.\n",
+"#### Genes should be labeled with Ensembl IDs (row attribute \"ensembl_id\"), which provide a unique identifer for conversion to tokens. Cells should be labeled with the total read count in the cell (column attribute \"n_counts\") to be used for normalization.\n",
 "\n",
 "#### No cell metadata is required, but custom cell attributes may be passed onto the tokenized dataset by providing a dictionary of custom attributes to be added, which is formatted as loom_col_attr_name : desired_dataset_col_attr_name. For example, if the original .loom dataset has column attributes \"cell_type\" and \"organ_major\" and one would like to retain these attributes as labels in the tokenized dataset with the new names \"cell_type\" and \"organ\", respectively, the following custom attribute dictionary should be provided: {\"cell_type\": \"cell_type\", \"organ_major\": \"organ\"}. \n",
 "\n",
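The custom attribute dictionary described in the notebook text maps each .loom column attribute name to the column name wanted in the tokenized dataset. The sketch below illustrates that mapping on plain Python dictionaries. It is not Geneformer's implementation: the helper name rename_cell_attributes and the sample values are hypothetical and exist only for illustration.

```python
def rename_cell_attributes(loom_col_attrs, custom_attr_name_dict):
    """Keep only the requested cell attributes and store them under their new names.

    loom_col_attrs: dict of column attribute name -> per-cell values (hypothetical input).
    custom_attr_name_dict: dict of loom_col_attr_name -> desired_dataset_col_attr_name.
    """
    # Fail early if a requested attribute is absent from the input.
    missing = [name for name in custom_attr_name_dict if name not in loom_col_attrs]
    if missing:
        raise KeyError(f"column attributes not found in input: {missing}")
    return {
        new_name: list(loom_col_attrs[old_name])
        for old_name, new_name in custom_attr_name_dict.items()
    }


# Illustrative values only; a real .loom file would supply these attributes.
loom_col_attrs = {
    "cell_type": ["cardiomyocyte", "fibroblast"],
    "organ_major": ["heart", "heart"],
    "n_counts": [5120, 2304],
}
custom_attr_name_dict = {"cell_type": "cell_type", "organ_major": "organ"}

dataset_cols = rename_cell_attributes(loom_col_attrs, custom_attr_name_dict)
print(dataset_cols)
```

Attributes that are not listed in the dictionary (here "n_counts") are simply not carried over by this sketch, and "organ_major" ends up under the new name "organ".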
geneformer/tokenizer.py
CHANGED
@@ -1,6 +1,13 @@
 """
 Geneformer tokenizer.
 
+Input data:
+Required format: raw counts scRNAseq data without feature selection as .loom file
+Required row (gene) attribute: "ensembl_id"; Ensembl ID for each gene
+Required col (cell) attribute: "n_counts"; total read counts in that cell
+Optional col (cell) attribute: "filter_pass"; binary indicator of whether cell should be tokenized based on user-defined filtering criteria
+Optional col (cell) attributes: any other cell metadata can be passed on to the tokenized dataset as a custom attribute dictionary as shown below
+
 Usage:
 from geneformer import TranscriptomeTokenizer
 tk = TranscriptomeTokenizer({"cell_type": "cell_type", "organ_major": "organ_major"}, nproc=4)
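The docstring added in this commit lists the attributes the input .loom file is expected to carry. As a rough sketch of how those attributes might be prepared before writing the file, the example below derives "n_counts" and an optional "filter_pass" flag from a small genes-by-cells raw count matrix. The helper name build_loom_attrs, the min_counts threshold, and the count values are assumptions made for illustration; the filtering rule in particular is user-defined and is not specified by the commit.

```python
def build_loom_attrs(counts, ensembl_ids, min_counts):
    """Build row and column attribute dicts for a genes x cells raw count matrix.

    counts: list of rows, one row per gene, one entry per cell.
    ensembl_ids: one Ensembl gene ID per row of counts.
    min_counts: hypothetical user-defined threshold for the "filter_pass" flag.
    """
    if len(counts) != len(ensembl_ids):
        raise ValueError("need exactly one Ensembl ID per gene (row)")
    # Total read count per cell is the sum down each column.
    n_counts = [sum(cell) for cell in zip(*counts)]
    # Binary indicator: 1 if the cell should be tokenized, 0 otherwise.
    filter_pass = [1 if total >= min_counts else 0 for total in n_counts]
    row_attrs = {"ensembl_id": list(ensembl_ids)}
    col_attrs = {"n_counts": n_counts, "filter_pass": filter_pass}
    return row_attrs, col_attrs


# Toy matrix: 3 genes (rows) x 3 cells (columns), illustrative values only.
counts = [
    [3, 0, 1],
    [0, 2, 0],
    [5, 1, 0],
]
ensembl_ids = ["ENSG00000141510", "ENSG00000012048", "ENSG00000146648"]

row_attrs, col_attrs = build_loom_attrs(counts, ensembl_ids, min_counts=2)
print(row_attrs)
print(col_attrs)
```

The resulting dictionaries would still need to be converted to arrays and written alongside the count matrix with a loom library before the tokenizer could read them; that step is omitted here.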