Christina Theodoris commited on
Commit
2a0dcbe
·
1 Parent(s): 10d3f10

add sphinx docs

Browse files
README.md CHANGED
@@ -3,28 +3,30 @@ datasets: ctheodoris/Genecorpus-30M
3
  license: apache-2.0
4
  ---
5
  # Geneformer
6
- Geneformer is a foundation transformer model pretrained on a large-scale corpus of ~30 million single cell transcriptomes to enable context-aware predictions in settings with limited data in network biology.
7
 
8
  See [our manuscript](https://rdcu.be/ddrx0) for details.
9
 
10
  # Model Description
11
- Geneformer is a foundation transformer model pretrained on [Genecorpus-30M](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M), a pretraining corpus comprised of ~30 million single cell transcriptomes from a broad range of human tissues. We excluded cells with high mutational burdens (e.g. malignant cells and immortalized cell lines) that could lead to substantial network rewiring without companion genome sequencing to facilitate interpretation. Each single cell’s transcriptome is presented to the model as a rank value encoding where genes are ranked by their expression in that cell normalized by their expression across the entire Genecorpus-30M. The rank value encoding provides a nonparametric representation of that cell’s transcriptome and takes advantage of the many observations of each gene’s expression across Genecorpus-30M to prioritize genes that distinguish cell state. Specifically, this method will deprioritize ubiquitously highly-expressed housekeeping genes by normalizing them to a lower rank. Conversely, genes such as transcription factors that may be lowly expressed when they are expressed but highly distinguish cell state will move to a higher rank within the encoding. Furthermore, this rank-based approach may be more robust against technical artifacts that may systematically bias the absolute transcript counts value while the overall relative ranking of genes within each cell remains more stable.
12
 
13
  The rank value encoding of each single cell’s transcriptome then proceeds through six transformer encoder units. Pretraining was accomplished using a masked learning objective where 15% of the genes within each transcriptome were masked and the model was trained to predict which gene should be within each masked position in that specific cell state using the context of the remaining unmasked genes. A major strength of this approach is that it is entirely self-supervised and can be accomplished on completely unlabeled data, which allows the inclusion of large amounts of training data without being restricted to samples with accompanying labels.
14
 
15
- We detail applications and results in [our manuscript](https://rdcu.be/ddrx0).
16
 
17
- During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the model’s attention weights in a completely self-supervised manner. Fine-tuning Geneformer towards a diverse panel of downstream tasks relevant to chromatin and network dynamics using limited task-specific data demonstrated that Geneformer consistently boosted predictive accuracy. Applied to disease modeling with limited patient data, Geneformer identified candidate therapeutic targets. Overall, Geneformer represents a pretrained deep learning model from which fine-tuning towards a broad range of downstream applications can be pursued to accelerate discovery of key network regulators and candidate therapeutic targets.
18
 
19
  In [our manuscript](https://rdcu.be/ddrx0), we report results for the 6 layer Geneformer model pretrained on Genecorpus-30M. We additionally provide within this repository a 12 layer Geneformer model, scaled up with retained width:depth aspect ratio, also pretrained on Genecorpus-30M.
20
 
 
 
21
  # Application
22
  The pretrained Geneformer model can be used directly for zero-shot learning, for example for in silico perturbation analysis, or by fine-tuning towards the relevant downstream task, such as gene or cell state classification.
23
 
24
  Example applications demonstrated in [our manuscript](https://rdcu.be/ddrx0) include:
25
 
26
  *Fine-tuning*:
27
- - transcription factor dosage sensitivity
28
  - chromatin dynamics (bivalently marked promoters)
29
  - transcription factor regulatory range
30
  - gene network centrality
@@ -64,6 +66,6 @@ For usage, see [examples](https://huggingface.co/ctheodoris/Geneformer/tree/main
64
  - extracting and plotting cell embeddings
65
  - in silico perturbation
66
 
67
- Please note that the fine-tuning examples are meant to be generally applicable and the input datasets and labels will vary dependent on the downstream task. Example input files for a few of the downstream tasks demonstrated in the manuscript are located within the [example_input_files directory](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files) in the dataset repository, but these only represent a few example fine-tuning applications.
68
 
69
- Please note that GPU resources are required for efficient usage of Geneformer. Additionally, we strongly recommend tuning hyperparameters for each downstream fine-tuning application as this can significantly boost predictive potential in the downstream task (e.g. max learning rate, learning schedule, number of layers to freeze, etc.).
 
3
  license: apache-2.0
4
  ---
5
  # Geneformer
6
+ Geneformer is a foundation transformer model pretrained on a large-scale corpus of ~30 million single cell transcriptomes to enable context-aware predictions in settings with limited data in network biology.
7
 
8
  See [our manuscript](https://rdcu.be/ddrx0) for details.
9
 
10
  # Model Description
11
+ Geneformer is a foundation transformer model pretrained on [Genecorpus-30M](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M), a pretraining corpus comprised of ~30 million single cell transcriptomes from a broad range of human tissues. We excluded cells with high mutational burdens (e.g. malignant cells and immortalized cell lines) that could lead to substantial network rewiring without companion genome sequencing to facilitate interpretation. Each single cell’s transcriptome is presented to the model as a rank value encoding where genes are ranked by their expression in that cell normalized by their expression across the entire Genecorpus-30M. The rank value encoding provides a nonparametric representation of that cell’s transcriptome and takes advantage of the many observations of each gene’s expression across Genecorpus-30M to prioritize genes that distinguish cell state. Specifically, this method will deprioritize ubiquitously highly-expressed housekeeping genes by normalizing them to a lower rank. Conversely, genes such as transcription factors that may be lowly expressed when they are expressed but highly distinguish cell state will move to a higher rank within the encoding. Furthermore, this rank-based approach may be more robust against technical artifacts that may systematically bias the absolute transcript counts value while the overall relative ranking of genes within each cell remains more stable.
12
 
13
  The rank value encoding of each single cell’s transcriptome then proceeds through six transformer encoder units. Pretraining was accomplished using a masked learning objective where 15% of the genes within each transcriptome were masked and the model was trained to predict which gene should be within each masked position in that specific cell state using the context of the remaining unmasked genes. A major strength of this approach is that it is entirely self-supervised and can be accomplished on completely unlabeled data, which allows the inclusion of large amounts of training data without being restricted to samples with accompanying labels.
14
 
15
+ We detail applications and results in [our manuscript](https://rdcu.be/ddrx0).
16
 
17
+ During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the model’s attention weights in a completely self-supervised manner. With both zero-shot learning and fine-tuning with limited task-specific data, Geneformer consistently boosted predictive accuracy in a diverse panel of downstream tasks relevant to chromatin and network dynamics. In silico perturbation with zero-shot learning identified a novel transcription factor in cardiomyocytes that we experimentally validated to be critical to their ability to generate contractile force. In silico treatment with limited patient data revealed candidate therapeutic targets for cardiomyopathy that we experimentally validated to significantly improve the ability of cardiomyocytes to generate contractile force in an iPSC model of the disease. Overall, Geneformer represents a foundational deep learning model pretrained on ~30 million human single cell transcriptomes to gain a fundamental understanding of gene network dynamics that can now be democratized to a vast array of downstream tasks to accelerate discovery of key network regulators and candidate therapeutic targets.
18
 
19
  In [our manuscript](https://rdcu.be/ddrx0), we report results for the 6 layer Geneformer model pretrained on Genecorpus-30M. We additionally provide within this repository a 12 layer Geneformer model, scaled up with retained width:depth aspect ratio, also pretrained on Genecorpus-30M.
20
 
21
+ Both the 6 and 12 layer Geneformer models were pretrained in June 2021.
22
+
23
  # Application
24
  The pretrained Geneformer model can be used directly for zero-shot learning, for example for in silico perturbation analysis, or by fine-tuning towards the relevant downstream task, such as gene or cell state classification.
25
 
26
  Example applications demonstrated in [our manuscript](https://rdcu.be/ddrx0) include:
27
 
28
  *Fine-tuning*:
29
+ - transcription factor dosage sensitivity
30
  - chromatin dynamics (bivalently marked promoters)
31
  - transcription factor regulatory range
32
  - gene network centrality
 
66
  - extracting and plotting cell embeddings
67
  - in silico perturbation
68
 
69
+ Please note that the fine-tuning examples are meant to be generally applicable and the input datasets and labels will vary dependent on the downstream task. Example input files for a few of the downstream tasks demonstrated in the manuscript are located within the [example_input_files directory](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files) in the dataset repository, but these only represent a few example fine-tuning applications.
70
 
71
+ Please note that GPU resources are required for efficient usage of Geneformer. Additionally, we strongly recommend tuning hyperparameters for each downstream fine-tuning application as this can significantly boost predictive potential in the downstream task (e.g. max learning rate, learning schedule, number of layers to freeze, etc.).
docs/Makefile ADDED
@@ -0,0 +1,20 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Minimal makefile for Sphinx documentation
2
+ #
3
+
4
+ # You can set these variables from the command line, and also
5
+ # from the environment for the first two.
6
+ SPHINXOPTS ?=
7
+ SPHINXBUILD ?= sphinx-build
8
+ SOURCEDIR = source
9
+ BUILDDIR = build
10
+
11
+ # Put it first so that "make" without argument is like "make help".
12
+ help:
13
+ @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
14
+
15
+ .PHONY: help Makefile
16
+
17
+ # Catch-all target: route all unknown targets to Sphinx using the new
18
+ # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
19
+ %: Makefile
20
+ @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
docs/make.bat ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ @ECHO OFF
2
+
3
+ pushd %~dp0
4
+
5
+ REM Command file for Sphinx documentation
6
+
7
+ if "%SPHINXBUILD%" == "" (
8
+ set SPHINXBUILD=sphinx-build
9
+ )
10
+ set SOURCEDIR=source
11
+ set BUILDDIR=build
12
+
13
+ %SPHINXBUILD% >NUL 2>NUL
14
+ if errorlevel 9009 (
15
+ echo.
16
+ echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
17
+ echo.installed, then set the SPHINXBUILD environment variable to point
18
+ echo.to the full path of the 'sphinx-build' executable. Alternatively you
19
+ echo.may add the Sphinx directory to PATH.
20
+ echo.
21
+ echo.If you don't have Sphinx installed, grab it from
22
+ echo.https://www.sphinx-doc.org/
23
+ exit /b 1
24
+ )
25
+
26
+ if "%1" == "" goto help
27
+
28
+ %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
29
+ goto end
30
+
31
+ :help
32
+ %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
33
+
34
+ :end
35
+ popd
docs/source/_static/css/custom.css ADDED
@@ -0,0 +1,34 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ .wy-side-nav-search, .wy-nav-top {
2
+ background: linear-gradient(15deg, #13547a 0%, #80d0c7 100%);
3
+ }
4
+
5
+
6
+ /* unvisited link */
7
+ .wy-nav-content a:link {
8
+ color: #067abd;
9
+ }
10
+
11
+ /* visited link */
12
+ .wy-nav-content a:visited {
13
+ color: #4b827c;
14
+ }
15
+
16
+ /* mouse over link */
17
+ .wy-nav-content a:hover {
18
+ color: #80d0c7;
19
+ }
20
+
21
+ /* selected link */
22
+ .wy-nav-content a:active {
23
+ color: #4b827c;
24
+ }
25
+
26
+
27
+
28
+ .sig.sig-object {
29
+ padding: 5px 5px 5px 5px;
30
+ background-color: #e6e6e6;
31
+ border-style: solid;
32
+ border-color: black;
33
+ border-width: 1px 0;
34
+ }
docs/source/_static/gf_logo.png ADDED
docs/source/about.rst ADDED
@@ -0,0 +1,45 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ About
2
+ =====
3
+
4
+ Model Description
5
+ -----------------
6
+
7
+ **Geneformer** is a context-aware, attention-based deep learning model pretrained on a large-scale corpus of ~30 million single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology. During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the attention weights of the model in a completely self-supervised manner. With both zero-shot learning and fine-tuning with limited task-specific data, Geneformer consistently boosted predictive accuracy in a diverse panel of downstream tasks relevant to chromatin and network dynamics. In silico perturbation with zero-shot learning identified a novel transcription factor in cardiomyocytes that we experimentally validated to be critical to their ability to generate contractile force. In silico treatment with limited patient data revealed candidate therapeutic targets for cardiomyopathy that we experimentally validated to significantly improve the ability of cardiomyocytes to generate contractile force in an iPSC model of the disease. Overall, Geneformer represents a foundational deep learning model pretrained on ~30 million human single cell transcriptomes to gain a fundamental understanding of gene network dynamics that can now be democratized to a vast array of downstream tasks to accelerate discovery of key network regulators and candidate therapeutic targets.
8
+
9
+ In `our manuscript <https://rdcu.be/ddrx0>`_, we report results for the 6 layer Geneformer model pretrained on Genecorpus-30M. We additionally provide within the repository a 12 layer Geneformer model, scaled up with retained width:depth aspect ratio, also pretrained on Genecorpus-30M.
10
+
11
+ Both the 6 and 12 layer Geneformer models were pretrained in June 2021.
12
+
13
+ Application
14
+ -----------
15
+
16
+ The pretrained Geneformer model can be used directly for zero-shot learning, for example for in silico perturbation analysis, or by fine-tuning towards the relevant downstream task, such as gene or cell state classification.
17
+
18
+ Example applications demonstrated in `our manuscript <https://rdcu.be/ddrx0>`_ include:
19
+
20
+ | *Fine-tuning*:
21
+ | - transcription factor dosage sensitivity
22
+ | - chromatin dynamics (bivalently marked promoters)
23
+ | - transcription factor regulatory range
24
+ | - gene network centrality
25
+ | - transcription factor targets
26
+ | - cell type annotation
27
+ | - batch integration
28
+ | - cell state classification across differentiation
29
+ | - disease classification
30
+ | - in silico perturbation to determine disease-driving genes
31
+ | - in silico treatment to determine candidate therapeutic targets
32
+
33
+ | *Zero-shot learning*:
34
+ | - batch integration
35
+ | - gene context specificity
36
+ | - in silico reprogramming
37
+ | - in silico differentiation
38
+ | - in silico perturbation to determine impact on cell state
39
+ | - in silico perturbation to determine transcription factor targets
40
+ | - in silico perturbation to determine transcription factor cooperativity
41
+
42
+ Citation
43
+ --------
44
+
45
+ | C V Theodoris #, L Xiao, A Chopra, M D Chaffin, Z R Al Sayed, M C Hill, H Mantineo, E Brydon, Z Zeng, X S Liu, P T Ellinor #. `Transfer learning enables predictions in network biology. <https://rdcu.be/ddrx0>`_ *Nature*, 31 May 2023. (# co-corresponding authors)
docs/source/api.rst ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ API
2
+ ===
3
+
4
+ Tokenizer
5
+ ---------
6
+
7
+ .. toctree::
8
+ :maxdepth: 1
9
+
10
+ geneformer.tokenizer
11
+
12
+ Embedding Extractor
13
+ -------------------
14
+
15
+ .. toctree::
16
+ :maxdepth: 1
17
+
18
+ geneformer.emb_extractor
19
+
20
+ In Silico Perturber
21
+ -------------------
22
+
23
+ .. toctree::
24
+ :maxdepth: 1
25
+
26
+ geneformer.in_silico_perturber
27
+
28
+
29
+ In Silico Perturber Stats
30
+ -------------------------
31
+
32
+ .. toctree::
33
+ :maxdepth: 1
34
+
35
+ geneformer.in_silico_perturber_stats
docs/source/conf.py ADDED
@@ -0,0 +1,80 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Configuration file for the Sphinx documentation builder.
2
+ #
3
+ # For the full list of built-in configuration values, see the documentation:
4
+ # https://www.sphinx-doc.org/en/master/usage/configuration.html
5
+
6
+ import pathlib
7
+ import re
8
+ import sys
9
+
10
+ from sphinx.ext import autodoc
11
+
12
+ sys.path.insert(0, pathlib.Path(__file__).parents[2].resolve().as_posix())
13
+
14
+
15
+ # -- Project information -----------------------------------------------------
16
+ # https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
17
+
18
+ project = "geneformer"
19
+ copyright = "2024, Christina Theodoris"
20
+ author = "Christina Theodoris"
21
+ release = "0.1.0"
22
+ repository_url = "https://huggingface.co/ctheodoris/Geneformer"
23
+
24
+ # -- General configuration ---------------------------------------------------
25
+ # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
26
+
27
+ extensions = [
28
+ "sphinx.ext.autodoc",
29
+ "sphinx.ext.autosummary",
30
+ "nbsphinx",
31
+ "sphinx.ext.viewcode",
32
+ "sphinx.ext.doctest",
33
+ ]
34
+
35
+ templates_path = ["_templates"]
36
+ exclude_patterns = [
37
+ "**.ipynb_checkpoints",
38
+ ]
39
+ autoclass_content = "both"
40
+
41
+
42
+ class MockedClassDocumenter(autodoc.ClassDocumenter):
43
+ def add_line(self, line: str, source: str, *lineno: int) -> None:
44
+ if line == " Bases: :py:class:`object`":
45
+ return
46
+ super().add_line(line, source, *lineno)
47
+
48
+
49
+ autodoc.ClassDocumenter = MockedClassDocumenter
50
+ add_module_names = False
51
+
52
+
53
+ def process_signature(app, what, name, obj, options, signature, return_annotation):
54
+ # loop through each line in the docstring and replace path with
55
+ # the generic path text
56
+ signature = re.sub(r"PosixPath\(.*?\)", "FILEPATH", signature)
57
+ return (signature, None)
58
+
59
+
60
+ def setup(app):
61
+ app.connect("autodoc-process-signature", process_signature)
62
+
63
+
64
+ # -- Options for HTML output -------------------------------------------------
65
+ # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
66
+
67
+ html_theme = "sphinx_rtd_theme"
68
+ html_show_sphinx = False
69
+ html_static_path = ["_static"]
70
+ html_logo = "_static/gf_logo.png"
71
+ html_theme_options = {
72
+ "collapse_navigation": False,
73
+ "sticky_navigation": True,
74
+ "navigation_depth": 3,
75
+ "logo_only": True,
76
+ }
77
+ html_css_files = [
78
+ "css/custom.css",
79
+ ]
80
+ html_show_sourcelink = False
docs/source/geneformer.emb_extractor.rst ADDED
@@ -0,0 +1,26 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ geneformer.emb\_extractor
2
+ =========================
3
+
4
+ .. automodule:: geneformer.emb_extractor
5
+ :members:
6
+ :undoc-members:
7
+ :show-inheritance:
8
+ :exclude-members:
9
+ accumulate_tdigests,
10
+ gen_heatmap_class_colors,
11
+ gen_heatmap_class_dict,
12
+ get_embs,
13
+ label_cell_embs,
14
+ label_gene_embs,
15
+ make_colorbar,
16
+ plot_heatmap,
17
+ plot_umap,
18
+ summarize_gene_embs,
19
+ tdigest_mean,
20
+ tdigest_median,
21
+ test_emb,
22
+ update_tdigest_dict,
23
+ update_tdigest_dict_mean,
24
+ update_tdigest_dict_median,
25
+ valid_option_dict,
26
+ validate_options
docs/source/geneformer.in_silico_perturber.rst ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ geneformer.in\_silico\_perturber
2
+ =======================================
3
+
4
+ .. automodule:: geneformer.in_silico_perturber
5
+ :members:
6
+ :undoc-members:
7
+ :show-inheritance:
8
+ :exclude-members: valid_option_dict, validate_options, apply_additional_filters, isp_perturb_all, isp_perturb_set, update_perturbation_dictionary
docs/source/geneformer.in_silico_perturber_stats.rst ADDED
@@ -0,0 +1,25 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ geneformer.in\_silico\_perturber\_stats
2
+ ==============================================
3
+
4
+ .. automodule:: geneformer.in_silico_perturber_stats
5
+ :members:
6
+ :undoc-members:
7
+ :show-inheritance:
8
+ :exclude-members:
9
+ find,
10
+ get_fdr,
11
+ get_gene_list,
12
+ get_impact_component,
13
+ invert_dict,
14
+ isp_aggregate_gene_shifts,
15
+ isp_aggregate_grouped_perturb,
16
+ isp_stats_mixture_model,
17
+ isp_stats_to_goal_state,
18
+ isp_stats_vs_null,
19
+ n_detections,
20
+ read_dict,
21
+ read_dictionaries,
22
+ token_to_gene_name,
23
+ token_tuple_to_ensembl_ids,
24
+ valid_option_dict,
25
+ validate_options
docs/source/geneformer.tokenizer.rst ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ geneformer.tokenizer
2
+ ====================
3
+
4
+ .. automodule:: geneformer.tokenizer
5
+ :members:
6
+ :undoc-members:
7
+ :show-inheritance:
8
+ :exclude-members:
9
+ create_dataset,
10
+ tokenize_anndata,
11
+ tokenize_files,
12
+ tokenize_loom,
13
+ rank_genes,
14
+ tokenize_cell
docs/source/getstarted.rst ADDED
@@ -0,0 +1,36 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Getting Started
2
+ ===============
3
+
4
+ Installation
5
+ ------------
6
+
7
+ Geneformer installation instructions.
8
+
9
+ Make sure you have git-lfs installed (https://git-lfs.com).
10
+
11
+ .. code-block:: bash
12
+
13
+ git lfs install
14
+ git clone https://huggingface.co/ctheodoris/Geneformer
15
+ cd Geneformer
16
+ pip install .
17
+
18
+
19
+ Tutorials
20
+ ---------
21
+
22
+ | See `examples <https://huggingface.co/ctheodoris/Geneformer/tree/main/examples>`_ for:
23
+ | - tokenizing transcriptomes
24
+ | - pretraining
25
+ | - hyperparameter tuning
26
+ | - fine-tuning
27
+ | - extracting and plotting cell embeddings
28
+ | - in silico perturbation
29
+
30
+ Please note that the fine-tuning examples are meant to be generally applicable and the input datasets and labels will vary dependent on the downstream task. Example input files for a few of the downstream tasks demonstrated in the manuscript are located within the `example_input_files directory <https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files>`_ in the dataset repository, but these only represent a few example fine-tuning applications.
31
+
32
+
33
+ Tips
34
+ ----
35
+
36
+ Please note that GPU resources are required for efficient usage of Geneformer. Additionally, we strongly recommend tuning hyperparameters for each downstream fine-tuning application as this can significantly boost predictive potential in the downstream task (e.g. max learning rate, learning schedule, number of layers to freeze, etc.).
docs/source/index.rst ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Geneformer
2
+ ==========
3
+
4
+ Geneformer is a foundation transformer model pretrained on a large-scale corpus of ~30 million single cell transcriptomes to enable context-aware predictions in network biology.
5
+
6
+ See `our manuscript <https://rdcu.be/ddrx0>`_ for details.
7
+
8
+ Table of Contents
9
+ -----------------
10
+
11
+ .. toctree::
12
+ :maxdepth: 2
13
+
14
+ about
15
+ getstarted
16
+ api
geneformer/emb_extractor.py CHANGED
@@ -1,29 +1,11 @@
1
  """
2
  Geneformer embedding extractor.
3
 
4
- Usage:
5
- from geneformer import EmbExtractor
6
- embex = EmbExtractor(model_type="CellClassifier",
7
- num_classes=3,
8
- emb_mode="cell",
9
- cell_emb_style="mean_pool",
10
- gene_emb_style="mean_pool",
11
- filter_data={"cell_type":["cardiomyocyte"]},
12
- max_ncells=1000,
13
- max_ncells_to_plot=1000,
14
- emb_layer=-1,
15
- emb_label=["disease","cell_type"],
16
- labels_to_plot=["disease","cell_type"],
17
- nproc=16,
18
- summary_stat=None)
19
- embs = embex.extract_embs("path/to/model",
20
- "path/to/input_data",
21
- "path/to/output_directory",
22
- "output_prefix")
23
- embex.plot_embs(embs=embs,
24
- plot_style="heatmap",
25
- output_directory="path/to/output_directory",
26
- output_prefix="output_prefix")
27
 
28
  """
29
 
@@ -414,51 +396,69 @@ class EmbExtractor:
414
  Initialize embedding extractor.
415
 
416
  Parameters
417
- ----------
418
  model_type : {"Pretrained","GeneClassifier","CellClassifier"}
419
- Whether model is the pretrained Geneformer or a fine-tuned gene or cell classifier.
420
  num_classes : int
421
- If model is a gene or cell classifier, specify number of classes it was trained to classify.
422
- For the pretrained Geneformer model, number of classes is 0 as it is not a classifier.
423
  emb_mode : {"cell","gene"}
424
- Whether to output cell or gene embeddings.
425
  cell_emb_style : "mean_pool"
426
- Method for summarizing cell embeddings.
427
- Currently only option is mean pooling of gene embeddings for given cell.
428
  gene_emb_style : "mean_pool"
429
- Method for summarizing gene embeddings.
430
- Currently only option is mean pooling of contextual gene embeddings for given gene.
431
  filter_data : None, dict
432
- Default is to extract embeddings from all input data.
433
- Otherwise, dictionary specifying .dataset column name and list of values to filter by.
434
  max_ncells : None, int
435
- Maximum number of cells to extract embeddings from.
436
- Default is 1000 cells randomly sampled from input data.
437
- If None, will extract embeddings from all cells.
438
  emb_layer : {-1, 0}
439
- Embedding layer to extract.
440
- The last layer is most specifically weighted to optimize the given learning objective.
441
- Generally, it is best to extract the 2nd to last layer to get a more general representation.
442
- -1: 2nd to last layer
443
- 0: last layer
444
  emb_label : None, list
445
- List of column name(s) in .dataset to add as labels to embedding output.
446
  labels_to_plot : None, list
447
- Cell labels to plot.
448
- Shown as color bar in heatmap.
449
- Shown as cell color in umap.
450
- Plotting umap requires labels to plot.
451
  forward_batch_size : int
452
- Batch size for forward pass.
453
  nproc : int
454
- Number of CPU processes to use.
455
  summary_stat : {None, "mean", "median", "exact_mean", "exact_median"}
456
- If exact_mean or exact_median, outputs only exact mean or median embedding of input data.
457
- If mean or median, outputs only approximated mean or median embedding of input data.
458
- Non-exact recommended if encountering memory constraints while generating goal embedding positions.
459
- Non-exact is slower but more memory-efficient.
460
  token_dictionary_file : Path
461
- Path to pickle file containing token dictionary (Ensembl ID:token).
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
462
  """
463
 
464
  self.model_type = model_type
@@ -533,20 +533,32 @@ class EmbExtractor:
533
  Extract embeddings from input data and save as results in output_directory.
534
 
535
  Parameters
536
- ----------
537
  model_directory : Path
538
- Path to directory containing model
539
  input_data_file : Path
540
- Path to directory containing .dataset inputs
541
  output_directory : Path
542
- Path to directory where embedding data will be saved as csv
543
  output_prefix : str
544
- Prefix for output file
545
  output_torch_embs : bool
546
- Whether or not to also output the embeddings as a tensor.
547
- Note, if true, will output embeddings as both dataframe and tensor.
548
  cell_state : dict
549
- Cell state key and value for state embedding extraction.
 
 
 
 
 
 
 
 
 
 
 
 
550
  """
551
 
552
  filtered_input_data = pu.load_and_filter(
@@ -618,41 +630,43 @@ class EmbExtractor:
618
  Extract exact mean or exact median cell state embedding positions from input data and save as results in output_directory.
619
 
620
  Parameters
621
- ----------
622
  cell_states_to_model : None, dict
623
- Cell states to model if testing perturbations that achieve goal state change.
624
- Four-item dictionary with keys: state_key, start_state, goal_state, and alt_states
625
- state_key: key specifying name of column in .dataset that defines the start/goal states
626
- start_state: value in the state_key column that specifies the start state
627
- goal_state: value in the state_key column taht specifies the goal end state
628
- alt_states: list of values in the state_key column that specify the alternate end states
629
- For example: {"state_key": "disease",
630
- "start_state": "dcm",
631
- "goal_state": "nf",
632
- "alt_states": ["hcm", "other1", "other2"]}
 
633
  model_directory : Path
634
- Path to directory containing model
635
  input_data_file : Path
636
- Path to directory containing .dataset inputs
637
  output_directory : Path
638
- Path to directory where embedding data will be saved as csv
639
  output_prefix : str
640
- Prefix for output file
641
  output_torch_embs : bool
642
- Whether or not to also output the embeddings as a tensor.
643
- Note, if true, will output embeddings as both dataframe and tensor.
644
 
645
  Outputs
646
- ----------
647
- Outputs state_embs_dict for use with in silico perturber.
648
- Format is dictionary of embedding positions of each cell state to model shifts from/towards.
649
- Keys specify each possible cell state to model.
650
- Values are target embedding positions as torch.tensor.
651
- For example: {"nf": emb_nf,
652
- "hcm": emb_hcm,
653
- "dcm": emb_dcm,
654
- "other1": emb_other1,
655
- "other2": emb_other2}
 
656
  """
657
 
658
  pu.validate_cell_states_to_model(cell_states_to_model)
@@ -708,21 +722,33 @@ class EmbExtractor:
708
  Plot embeddings, coloring by provided labels.
709
 
710
  Parameters
711
- ----------
712
  embs : pandas.core.frame.DataFrame
713
- Pandas dataframe containing embeddings output from extract_embs
714
  plot_style : str
715
- Style of plot: "heatmap" or "umap"
716
  output_directory : Path
717
- Path to directory where plots will be saved as pdf
718
  output_prefix : str
719
- Prefix for output file
720
  max_ncells_to_plot : None, int
721
- Maximum number of cells to plot.
722
- Default is 1000 cells randomly sampled from embeddings.
723
- If None, will plot embeddings from all cells.
724
  kwargs_dict : dict
725
- Dictionary of kwargs to pass to plotting function.
 
 
 
 
 
 
 
 
 
 
 
 
726
  """
727
 
728
  if plot_style not in ["heatmap", "umap"]:
 
1
  """
2
  Geneformer embedding extractor.
3
 
4
+ **Description:**
5
+
6
+ | Extracts gene or cell embeddings.
7
+ | Plots cell embeddings as heatmaps or UMAPs.
8
+ | Generates cell state embedding dictionary for use with InSilicoPerturber.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9
 
10
  """
11
 
 
396
  Initialize embedding extractor.
397
 
398
  Parameters
399
+ ~~~~~~~~~~
400
  model_type : {"Pretrained","GeneClassifier","CellClassifier"}
401
+ | Whether model is the pretrained Geneformer or a fine-tuned gene or cell classifier.
402
  num_classes : int
403
+ | If model is a gene or cell classifier, specify number of classes it was trained to classify.
404
+ | For the pretrained Geneformer model, number of classes is 0 as it is not a classifier.
405
  emb_mode : {"cell","gene"}
406
+ | Whether to output cell or gene embeddings.
407
  cell_emb_style : "mean_pool"
408
+ | Method for summarizing cell embeddings.
409
+ | Currently only option is mean pooling of gene embeddings for given cell.
410
  gene_emb_style : "mean_pool"
411
+ | Method for summarizing gene embeddings.
412
+ | Currently only option is mean pooling of contextual gene embeddings for given gene.
413
  filter_data : None, dict
414
+ | Default is to extract embeddings from all input data.
415
+ | Otherwise, dictionary specifying .dataset column name and list of values to filter by.
416
  max_ncells : None, int
417
+ | Maximum number of cells to extract embeddings from.
418
+ | Default is 1000 cells randomly sampled from input data.
419
+ | If None, will extract embeddings from all cells.
420
  emb_layer : {-1, 0}
421
+ | Embedding layer to extract.
422
+ | The last layer is most specifically weighted to optimize the given learning objective.
423
+ | Generally, it is best to extract the 2nd to last layer to get a more general representation.
424
+ | -1: 2nd to last layer
425
+ | 0: last layer
426
  emb_label : None, list
427
+ | List of column name(s) in .dataset to add as labels to embedding output.
428
  labels_to_plot : None, list
429
+ | Cell labels to plot.
430
+ | Shown as color bar in heatmap.
431
+ | Shown as cell color in umap.
432
+ | Plotting umap requires labels to plot.
433
  forward_batch_size : int
434
+ | Batch size for forward pass.
435
  nproc : int
436
+ | Number of CPU processes to use.
437
  summary_stat : {None, "mean", "median", "exact_mean", "exact_median"}
438
+ | If exact_mean or exact_median, outputs only exact mean or median embedding of input data.
439
+ | If mean or median, outputs only approximated mean or median embedding of input data.
440
+ | Non-exact recommended if encountering memory constraints while generating goal embedding positions.
441
+ | Non-exact is slower but more memory-efficient.
442
  token_dictionary_file : Path
443
+ | Path to pickle file containing token dictionary (Ensembl ID:token).
444
+
445
+ Examples
446
+ ~~~~~~~~
447
+
448
+ .. code-block :: python
449
+
450
+ >>> from geneformer import EmbExtractor
451
+ >>> embex = EmbExtractor(model_type="CellClassifier",
452
+ ... num_classes=3,
453
+ ... emb_mode="cell",
454
+ ... filter_data={"cell_type":["cardiomyocyte"]},
455
+ ... max_ncells=1000,
456
+ ... max_ncells_to_plot=1000,
457
+ ... emb_layer=-1,
458
+ ... emb_label=["disease","cell_type"],
459
+ ... labels_to_plot=["disease","cell_type"],
460
+ ... )
461
+
462
  """
463
 
464
  self.model_type = model_type
 
533
  Extract embeddings from input data and save as results in output_directory.
534
 
535
  Parameters
536
+ ~~~~~~~~~~
537
  model_directory : Path
538
+ | Path to directory containing model
539
  input_data_file : Path
540
+ | Path to directory containing .dataset inputs
541
  output_directory : Path
542
+ | Path to directory where embedding data will be saved as csv
543
  output_prefix : str
544
+ | Prefix for output file
545
  output_torch_embs : bool
546
+ | Whether or not to also output the embeddings as a tensor.
547
+ | Note, if true, will output embeddings as both dataframe and tensor.
548
  cell_state : dict
549
+ | Cell state key and value for state embedding extraction.
550
+
551
+ Examples
552
+ ~~~~~~~~
553
+
554
+ .. code-block :: python
555
+
556
+ >>> embs = embex.extract_embs("path/to/model",
557
+ ... "path/to/input_data",
558
+ ... "path/to/output_directory",
559
+ ... "output_prefix",
560
+ ... )
561
+
562
  """
563
 
564
  filtered_input_data = pu.load_and_filter(
 
630
  Extract exact mean or exact median cell state embedding positions from input data and save as results in output_directory.
631
 
632
  Parameters
633
+ ~~~~~~~~~~
634
  cell_states_to_model : None, dict
635
+ | Cell states to model if testing perturbations that achieve goal state change.
636
+ | Four-item dictionary with keys: state_key, start_state, goal_state, and alt_states
637
+ | state_key: key specifying name of column in .dataset that defines the start/goal states
638
+ | start_state: value in the state_key column that specifies the start state
639
+ | goal_state: value in the state_key column taht specifies the goal end state
640
+ | alt_states: list of values in the state_key column that specify the alternate end states
641
+ | For example:
642
+ | {"state_key": "disease",
643
+ | "start_state": "dcm",
644
+ | "goal_state": "nf",
645
+ | "alt_states": ["hcm", "other1", "other2"]}
646
  model_directory : Path
647
+ | Path to directory containing model
648
  input_data_file : Path
649
+ | Path to directory containing .dataset inputs
650
  output_directory : Path
651
+ | Path to directory where embedding data will be saved as csv
652
  output_prefix : str
653
+ | Prefix for output file
654
  output_torch_embs : bool
655
+ | Whether or not to also output the embeddings as a tensor.
656
+ | Note, if true, will output embeddings as both dataframe and tensor.
657
 
658
  Outputs
659
+ ~~~~~~~
660
+ | Outputs state_embs_dict for use with in silico perturber.
661
+ | Format is dictionary of embedding positions of each cell state to model shifts from/towards.
662
+ | Keys specify each possible cell state to model.
663
+ | Values are target embedding positions as torch.tensor.
664
+ | For example:
665
+ | {"nf": emb_nf,
666
+ | "hcm": emb_hcm,
667
+ | "dcm": emb_dcm,
668
+ | "other1": emb_other1,
669
+ | "other2": emb_other2}
670
  """
671
 
672
  pu.validate_cell_states_to_model(cell_states_to_model)
 
722
  Plot embeddings, coloring by provided labels.
723
 
724
  Parameters
725
+ ~~~~~~~~~~
726
  embs : pandas.core.frame.DataFrame
727
+ | Pandas dataframe containing embeddings output from extract_embs
728
  plot_style : str
729
+ | Style of plot: "heatmap" or "umap"
730
  output_directory : Path
731
+ | Path to directory where plots will be saved as pdf
732
  output_prefix : str
733
+ | Prefix for output file
734
  max_ncells_to_plot : None, int
735
+ | Maximum number of cells to plot.
736
+ | Default is 1000 cells randomly sampled from embeddings.
737
+ | If None, will plot embeddings from all cells.
738
  kwargs_dict : dict
739
+ | Dictionary of kwargs to pass to plotting function.
740
+
741
+ Examples
742
+ ~~~~~~~~
743
+
744
+ .. code-block :: python
745
+
746
+ >>> embex.plot_embs(embs=embs,
747
+ ... plot_style="heatmap",
748
+ ... output_directory="path/to/output_directory",
749
+ ... output_prefix="output_prefix",
750
+ ... )
751
+
752
  """
753
 
754
  if plot_style not in ["heatmap", "umap"]:
geneformer/in_silico_perturber.py CHANGED
@@ -1,28 +1,35 @@
1
  """
2
  Geneformer in silico perturber.
3
 
4
- Usage:
5
- from geneformer import InSilicoPerturber
6
- isp = InSilicoPerturber(perturb_type="delete",
7
- perturb_rank_shift=None,
8
- genes_to_perturb="all",
9
- combos=0,
10
- anchor_gene=None,
11
- model_type="CellClassifier",
12
- num_classes=0,
13
- emb_mode="cell",
14
- cell_emb_style="mean_pool",
15
- filter_data={"cell_type":["cardiomyocyte"]},
16
- cell_states_to_model={"state_key": "disease", "start_state": "dcm", "goal_state": "nf", "alt_states": ["hcm", "other1", "other2"]},
17
- state_embs_dict ={"nf": emb_nf, "hcm": emb_hcm, "dcm": emb_dcm, "other1": emb_other1, "other2": emb_other2},
18
- max_ncells=None,
19
- emb_layer=0,
20
- forward_batch_size=100,
21
- nproc=16)
22
- isp.perturb_data("path/to/model",
23
- "path/to/input_data",
24
- "path/to/output_directory",
25
- "output_prefix")
 
 
 
 
 
 
 
26
  """
27
 
28
  import logging
@@ -94,89 +101,89 @@ class InSilicoPerturber:
94
  Initialize in silico perturber.
95
 
96
  Parameters
97
- ----------
98
- perturb_type : {"delete","overexpress","inhibit","activate"}
99
- Type of perturbation.
100
- "delete": delete gene from rank value encoding
101
- "overexpress": move gene to front of rank value encoding
102
- "inhibit": move gene to lower quartile of rank value encoding
103
- "activate": move gene to higher quartile of rank value encoding
104
- perturb_rank_shift : None, {1,2,3}
105
- Number of quartiles by which to shift rank of gene.
106
- For example, if perturb_type="activate" and perturb_rank_shift=1:
107
- genes in 4th quartile will move to middle of 3rd quartile.
108
- genes in 3rd quartile will move to middle of 2nd quartile.
109
- genes in 2nd quartile will move to middle of 1st quartile.
110
- genes in 1st quartile will move to front of rank value encoding.
111
- For example, if perturb_type="inhibit" and perturb_rank_shift=2:
112
- genes in 1st quartile will move to middle of 3rd quartile.
113
- genes in 2nd quartile will move to middle of 4th quartile.
114
- genes in 3rd or 4th quartile will move to bottom of rank value encoding.
115
  genes_to_perturb : "all", list
116
- Default is perturbing each gene detected in each cell in the dataset.
117
- Otherwise, may provide a list of ENSEMBL IDs of genes to perturb.
118
- If gene list is provided, then perturber will only test perturbing them all together
119
- (rather than testing each possible combination of the provided genes).
120
  combos : {0,1}
121
- Whether to perturb genes individually (0) or in pairs (1).
122
  anchor_gene : None, str
123
- ENSEMBL ID of gene to use as anchor in combination perturbations.
124
- For example, if combos=1 and anchor_gene="ENSG00000148400":
125
- anchor gene will be perturbed in combination with each other gene.
126
- model_type : {"Pretrained","GeneClassifier","CellClassifier"}
127
- Whether model is the pretrained Geneformer or a fine-tuned gene or cell classifier.
128
  num_classes : int
129
- If model is a gene or cell classifier, specify number of classes it was trained to classify.
130
- For the pretrained Geneformer model, number of classes is 0 as it is not a classifier.
131
- emb_mode : {"cell","cell_and_gene"}
132
- Whether to output impact of perturbation on cell and/or gene embeddings.
133
- Gene embedding shifts only available as compared to original cell, not comparing to goal state.
134
  cell_emb_style : "mean_pool"
135
- Method for summarizing cell embeddings.
136
- Currently only option is mean pooling of gene embeddings for given cell.
137
  filter_data : None, dict
138
- Default is to use all input data for in silico perturbation study.
139
- Otherwise, dictionary specifying .dataset column name and list of values to filter by.
140
  cell_states_to_model : None, dict
141
- Cell states to model if testing perturbations that achieve goal state change.
142
- Four-item dictionary with keys: state_key, start_state, goal_state, and alt_states
143
- state_key: key specifying name of column in .dataset that defines the start/goal states
144
- start_state: value in the state_key column that specifies the start state
145
- goal_state: value in the state_key column taht specifies the goal end state
146
- alt_states: list of values in the state_key column that specify the alternate end states
147
- For example: {"state_key": "disease",
148
- "start_state": "dcm",
149
- "goal_state": "nf",
150
- "alt_states": ["hcm", "other1", "other2"]}
151
  state_embs_dict : None, dict
152
- Embedding positions of each cell state to model shifts from/towards (e.g. mean or median).
153
- Dictionary with keys specifying each possible cell state to model.
154
- Values are target embedding positions as torch.tensor.
155
- For example: {"nf": emb_nf,
156
- "hcm": emb_hcm,
157
- "dcm": emb_dcm,
158
- "other1": emb_other1,
159
- "other2": emb_other2}
160
  max_ncells : None, int
161
- Maximum number of cells to test.
162
- If None, will test all cells.
163
  cell_inds_to_perturb : "all", list
164
- Default is perturbing each cell in the dataset.
165
- Otherwise, may provide a dict of indices of cells to perturb with keys start_ind and end_ind.
166
- start_ind: the first index to perturb.
167
- end_ind: the last index to perturb (exclusive).
168
- Indices will be selected *after* the filter_data criteria and sorting.
169
- Useful for splitting extremely large datasets across separate GPUs.
170
  emb_layer : {-1, 0}
171
- Embedding layer to use for quantification.
172
- 0: last layer (recommended for questions closely tied to model's training objective)
173
- -1: 2nd to last layer (recommended for questions requiring more general representations)
174
  forward_batch_size : int
175
- Batch size for forward pass.
176
  nproc : int
177
- Number of CPU processes to use.
178
  token_dictionary_file : Path
179
- Path to pickle file containing token dictionary (Ensembl ID:token).
180
  """
181
 
182
  self.perturb_type = perturb_type
@@ -392,15 +399,15 @@ class InSilicoPerturber:
392
  Perturb genes in input data and save as results in output_directory.
393
 
394
  Parameters
395
- ----------
396
  model_directory : Path
397
- Path to directory containing model
398
  input_data_file : Path
399
- Path to directory containing .dataset inputs
400
  output_directory : Path
401
- Path to directory where perturbation data will be saved as batched pickle files
402
  output_prefix : str
403
- Prefix for output files
404
  """
405
 
406
  ### format output path ###
 
1
  """
2
  Geneformer in silico perturber.
3
 
4
+ **Usage:**
5
+
6
+ .. code-block :: python
7
+
8
+ >>> from geneformer import InSilicoPerturber
9
+ >>> isp = InSilicoPerturber(perturb_type="delete",
10
+ ... perturb_rank_shift=None,
11
+ ... genes_to_perturb="all",
12
+ ... model_type="CellClassifier",
13
+ ... num_classes=0,
14
+ ... emb_mode="cell",
15
+ ... filter_data={"cell_type":["cardiomyocyte"]},
16
+ ... cell_states_to_model={"state_key": "disease", "start_state": "dcm", "goal_state": "nf", "alt_states": ["hcm", "other1", "other2"]},
17
+ ... state_embs_dict ={"nf": emb_nf, "hcm": emb_hcm, "dcm": emb_dcm, "other1": emb_other1, "other2": emb_other2},
18
+ ... max_ncells=None,
19
+ ... emb_layer=0,
20
+ ... forward_batch_size=100,
21
+ ... nproc=16)
22
+ >>> isp.perturb_data("path/to/model",
23
+ ... "path/to/input_data",
24
+ ... "path/to/output_directory",
25
+ ... "output_prefix")
26
+
27
+ **Description:**
28
+
29
+ | Performs in silico perturbation (e.g. deletion or overexpression) of defined set of genes or all genes in sample of cells.
30
+ | Outputs impact of perturbation on cell or gene embeddings.
31
+ | Output files are analyzed with ``in_silico_perturber_stats``.
32
+
33
  """
34
 
35
  import logging
 
101
  Initialize in silico perturber.
102
 
103
  Parameters
104
+ ~~~~~~~~~~
105
+ perturb_type : {"delete", "overexpress", "inhibit", "activate"}
106
+ | Type of perturbation.
107
+ | "delete": delete gene from rank value encoding
108
+ | "overexpress": move gene to front of rank value encoding
109
+ | *(TBA)* "inhibit": move gene to lower quartile of rank value encoding
110
+ | *(TBA)* "activate": move gene to higher quartile of rank value encoding
111
+ *(TBA)* perturb_rank_shift : None, {1,2,3}
112
+ | Number of quartiles by which to shift rank of gene.
113
+ | For example, if perturb_type="activate" and perturb_rank_shift=1:
114
+ | genes in 4th quartile will move to middle of 3rd quartile.
115
+ | genes in 3rd quartile will move to middle of 2nd quartile.
116
+ | genes in 2nd quartile will move to middle of 1st quartile.
117
+ | genes in 1st quartile will move to front of rank value encoding.
118
+ | For example, if perturb_type="inhibit" and perturb_rank_shift=2:
119
+ | genes in 1st quartile will move to middle of 3rd quartile.
120
+ | genes in 2nd quartile will move to middle of 4th quartile.
121
+ | genes in 3rd or 4th quartile will move to bottom of rank value encoding.
122
  genes_to_perturb : "all", list
123
+ | Default is perturbing each gene detected in each cell in the dataset.
124
+ | Otherwise, may provide a list of ENSEMBL IDs of genes to perturb.
125
+ | If gene list is provided, then perturber will only test perturbing them all together
126
+ | (rather than testing each possible combination of the provided genes).
127
  combos : {0,1}
128
+ | Whether to perturb genes individually (0) or in pairs (1).
129
  anchor_gene : None, str
130
+ | ENSEMBL ID of gene to use as anchor in combination perturbations.
131
+ | For example, if combos=1 and anchor_gene="ENSG00000148400":
132
+ | anchor gene will be perturbed in combination with each other gene.
133
+ model_type : {"Pretrained", "GeneClassifier", "CellClassifier"}
134
+ | Whether model is the pretrained Geneformer or a fine-tuned gene or cell classifier.
135
  num_classes : int
136
+ | If model is a gene or cell classifier, specify number of classes it was trained to classify.
137
+ | For the pretrained Geneformer model, number of classes is 0 as it is not a classifier.
138
+ emb_mode : {"cell", "cell_and_gene"}
139
+ | Whether to output impact of perturbation on cell and/or gene embeddings.
140
+ | Gene embedding shifts only available as compared to original cell, not comparing to goal state.
141
  cell_emb_style : "mean_pool"
142
+ | Method for summarizing cell embeddings.
143
+ | Currently only option is mean pooling of gene embeddings for given cell.
144
  filter_data : None, dict
145
+ | Default is to use all input data for in silico perturbation study.
146
+ | Otherwise, dictionary specifying .dataset column name and list of values to filter by.
147
  cell_states_to_model : None, dict
148
+ | Cell states to model if testing perturbations that achieve goal state change.
149
+ | Four-item dictionary with keys: state_key, start_state, goal_state, and alt_states
150
+ | state_key: key specifying name of column in .dataset that defines the start/goal states
151
+ | start_state: value in the state_key column that specifies the start state
152
+ | goal_state: value in the state_key column taht specifies the goal end state
153
+ | alt_states: list of values in the state_key column that specify the alternate end states
154
+ | For example: {"state_key": "disease",
155
+ | "start_state": "dcm",
156
+ | "goal_state": "nf",
157
+ | "alt_states": ["hcm", "other1", "other2"]}
158
  state_embs_dict : None, dict
159
+ | Embedding positions of each cell state to model shifts from/towards (e.g. mean or median).
160
+ | Dictionary with keys specifying each possible cell state to model.
161
+ | Values are target embedding positions as torch.tensor.
162
+ | For example: {"nf": emb_nf,
163
+ | "hcm": emb_hcm,
164
+ | "dcm": emb_dcm,
165
+ | "other1": emb_other1,
166
+ | "other2": emb_other2}
167
  max_ncells : None, int
168
+ | Maximum number of cells to test.
169
+ | If None, will test all cells.
170
  cell_inds_to_perturb : "all", list
171
+ | Default is perturbing each cell in the dataset.
172
+ | Otherwise, may provide a dict of indices of cells to perturb with keys start_ind and end_ind.
173
+ | start_ind: the first index to perturb.
174
+ | end_ind: the last index to perturb (exclusive).
175
+ | Indices will be selected *after* the filter_data criteria and sorting.
176
+ | Useful for splitting extremely large datasets across separate GPUs.
177
  emb_layer : {-1, 0}
178
+ | Embedding layer to use for quantification.
179
+ | 0: last layer (recommended for questions closely tied to model's training objective)
180
+ | -1: 2nd to last layer (recommended for questions requiring more general representations)
181
  forward_batch_size : int
182
+ | Batch size for forward pass.
183
  nproc : int
184
+ | Number of CPU processes to use.
185
  token_dictionary_file : Path
186
+ | Path to pickle file containing token dictionary (Ensembl ID:token).
187
  """
188
 
189
  self.perturb_type = perturb_type
 
399
  Perturb genes in input data and save as results in output_directory.
400
 
401
  Parameters
402
+ ~~~~~~~~~~
403
  model_directory : Path
404
+ | Path to directory containing model
405
  input_data_file : Path
406
+ | Path to directory containing .dataset inputs
407
  output_directory : Path
408
+ | Path to directory where perturbation data will be saved as batched pickle files
409
  output_prefix : str
410
+ | Prefix for output files
411
  """
412
 
413
  ### format output path ###
geneformer/in_silico_perturber_stats.py CHANGED
@@ -1,19 +1,27 @@
1
  """
2
  Geneformer in silico perturber stats generator.
3
 
4
- Usage:
5
- from geneformer import InSilicoPerturberStats
6
- ispstats = InSilicoPerturberStats(mode="goal_state_shift",
7
- combos=0,
8
- anchor_gene=None,
9
- cell_states_to_model={"state_key": "disease",
10
- "start_state": "dcm",
11
- "goal_state": "nf",
12
- "alt_states": ["hcm", "other1", "other2"]})
13
- ispstats.get_stats("path/to/input_data",
14
- None,
15
- "path/to/output_directory",
16
- "output_prefix")
 
 
 
 
 
 
 
 
17
  """
18
 
19
 
@@ -645,41 +653,41 @@ class InSilicoPerturberStats:
645
  Initialize in silico perturber stats generator.
646
 
647
  Parameters
648
- ----------
649
- mode : {"goal_state_shift","vs_null","mixture_model","aggregate_data","aggregate_gene_shifts"}
650
- Type of stats.
651
- "goal_state_shift": perturbation vs. random for desired cell state shift
652
- "vs_null": perturbation vs. null from provided null distribution dataset
653
- "mixture_model": perturbation in impact vs. no impact component of mixture model (no goal direction)
654
- "aggregate_data": aggregates cosine shifts for single perturbation in multiple cells
655
- "aggregate_gene_shifts": aggregates cosine shifts of genes in response to perturbation(s)
656
  genes_perturbed : "all", list
657
- Genes perturbed in isp experiment.
658
- Default is assuming genes_to_perturb in isp experiment was "all" (each gene in each cell).
659
- Otherwise, may provide a list of ENSEMBL IDs of genes perturbed as a group all together.
660
  combos : {0,1,2}
661
- Whether to perturb genes individually (0), in pairs (1), or in triplets (2).
662
  anchor_gene : None, str
663
- ENSEMBL ID of gene to use as anchor in combination perturbations or in testing effect on downstream genes.
664
- For example, if combos=1 and anchor_gene="ENSG00000136574":
665
- analyzes data for anchor gene perturbed in combination with each other gene.
666
- However, if combos=0 and anchor_gene="ENSG00000136574":
667
- analyzes data for the effect of anchor gene's perturbation on the embedding of each other gene.
668
  cell_states_to_model: None, dict
669
- Cell states to model if testing perturbations that achieve goal state change.
670
- Four-item dictionary with keys: state_key, start_state, goal_state, and alt_states
671
- state_key: key specifying name of column in .dataset that defines the start/goal states
672
- start_state: value in the state_key column that specifies the start state
673
- goal_state: value in the state_key column taht specifies the goal end state
674
- alt_states: list of values in the state_key column that specify the alternate end states
675
- For example: {"state_key": "disease",
676
- "start_state": "dcm",
677
- "goal_state": "nf",
678
- "alt_states": ["hcm", "other1", "other2"]}
679
  token_dictionary_file : Path
680
- Path to pickle file containing token dictionary (Ensembl ID:token).
681
  gene_name_id_dictionary_file : Path
682
- Path to pickle file containing gene name to ID dictionary (gene name:Ensembl ID).
683
  """
684
 
685
  self.mode = mode
@@ -847,64 +855,64 @@ class InSilicoPerturberStats:
847
  Get stats for in silico perturbation data and save as results in output_directory.
848
 
849
  Parameters
850
- ----------
851
  input_data_directory : Path
852
- Path to directory containing cos_sim dictionary inputs
853
  null_dist_data_directory : Path
854
- Path to directory containing null distribution cos_sim dictionary inputs
855
  output_directory : Path
856
- Path to directory where perturbation data will be saved as .csv
857
  output_prefix : str
858
- Prefix for output .csv
859
  null_dict_list: dict
860
- List of loaded null distribtion dictionary if more than one comparison vs. the null is to be performed
861
 
862
  Outputs
863
- ----------
864
  Definition of possible columns in .csv output file.
865
 
866
- Of note, not all columns will be present in all output files.
867
- Some columns are specific to particular perturbation modes.
868
-
869
- "Gene": gene token
870
- "Gene_name": gene name
871
- "Ensembl_ID": gene Ensembl ID
872
- "N_Detections": number of cells in which each gene or gene combination was detected in the input dataset
873
- "Sig": 1 if FDR<0.05, otherwise 0
874
-
875
- "Shift_to_goal_end": cosine shift from start state towards goal end state in response to given perturbation
876
- "Shift_to_alt_end": cosine shift from start state towards alternate end state in response to given perturbation
877
- "Goal_end_vs_random_pval": pvalue of cosine shift from start state towards goal end state by Wilcoxon
878
- pvalue compares shift caused by perturbing given gene compared to random genes
879
- "Alt_end_vs_random_pval": pvalue of cosine shift from start state towards alternate end state by Wilcoxon
880
- pvalue compares shift caused by perturbing given gene compared to random genes
881
- "Goal_end_FDR": Benjamini-Hochberg correction of "Goal_end_vs_random_pval"
882
- "Alt_end_FDR": Benjamini-Hochberg correction of "Alt_end_vs_random_pval"
883
-
884
- "Test_avg_shift": cosine shift in response to given perturbation in cells from test distribution
885
- "Null_avg_shift": cosine shift in response to given perturbation in cells from null distribution (e.g. random cells)
886
- "Test_vs_null_avg_shift": difference in cosine shift in cells from test vs. null distribution
887
- (i.e. "Test_avg_shift" minus "Null_avg_shift")
888
- "Test_vs_null_pval": pvalue of cosine shift in test vs. null distribution
889
- "Test_vs_null_FDR": Benjamini-Hochberg correction of "Test_vs_null_pval"
890
- "N_Detections_test": "N_Detections" in cells from test distribution
891
- "N_Detections_null": "N_Detections" in cells from null distribution
892
-
893
- "Anchor_shift": cosine shift in response to given perturbation of anchor gene
894
- "Test_token_shift": cosine shift in response to given perturbation of test gene
895
- "Sum_of_indiv_shifts": sum of cosine shifts in response to individually perturbing test and anchor genes
896
- "Combo_shift": cosine shift in response to given perturbation of both anchor and test gene(s) in combination
897
- "Combo_minus_sum_shift": difference of cosine shifts in response combo perturbation vs. sum of individual perturbations
898
- (i.e. "Combo_shift" minus "Sum_of_indiv_shifts")
899
- "Impact_component": whether the given perturbation was modeled to be within the impact component by the mixture model
900
- 1: within impact component; 0: not within impact component
901
- "Impact_component_percent": percent of cells in which given perturbation was modeled to be within impact component
902
-
903
- In case of aggregating gene shifts:
904
- "Perturbed": ID(s) of gene(s) being perturbed
905
- "Affected": ID of affected gene or "cell_emb" indicating the impact on the cell embedding as a whole
906
- "Cosine_shift_mean": mean of cosine shift of modeled perturbation on affected gene or cell
907
- "Cosine_shift_stdev": standard deviation of cosine shift of modeled perturbation on affected gene or cell
908
  """
909
 
910
  if self.mode not in [
 
1
  """
2
  Geneformer in silico perturber stats generator.
3
 
4
+ **Usage:**
5
+
6
+ .. code-block :: python
7
+
8
+ >>> from geneformer import InSilicoPerturberStats
9
+ >>> ispstats = InSilicoPerturberStats(mode="goal_state_shift",
10
+ ... cell_states_to_model={"state_key": "disease",
11
+ ... "start_state": "dcm",
12
+ ... "goal_state": "nf",
13
+ ... "alt_states": ["hcm", "other1", "other2"]})
14
+ ... )
15
+ >>> ispstats.get_stats("path/to/input_data",
16
+ ... None,
17
+ ... "path/to/output_directory",
18
+ ... "output_prefix")
19
+
20
+ **Description:**
21
+
22
+ | Collates data or calculates stats for in silico perturbations based on type of statistics specified in InSilicoPerturberStats.
23
+ | Input data is raw in silico perturbation results in the form of dictionaries outputted by ``in_silico_perturber``.
24
+
25
  """
26
 
27
 
 
653
  Initialize in silico perturber stats generator.
654
 
655
  Parameters
656
+ ~~~~~~~~~~
657
+ mode : {"goal_state_shift", "vs_null", "mixture_model", "aggregate_data", "aggregate_gene_shifts"}
658
+ | Type of stats.
659
+ | "goal_state_shift": perturbation vs. random for desired cell state shift
660
+ | "vs_null": perturbation vs. null from provided null distribution dataset
661
+ | "mixture_model": perturbation in impact vs. no impact component of mixture model (no goal direction)
662
+ | "aggregate_data": aggregates cosine shifts for single perturbation in multiple cells
663
+ | "aggregate_gene_shifts": aggregates cosine shifts of genes in response to perturbation(s)
664
  genes_perturbed : "all", list
665
+ | Genes perturbed in isp experiment.
666
+ | Default is assuming genes_to_perturb in isp experiment was "all" (each gene in each cell).
667
+ | Otherwise, may provide a list of ENSEMBL IDs of genes perturbed as a group all together.
668
  combos : {0,1,2}
669
+ | Whether to perturb genes individually (0), in pairs (1), or in triplets (2).
670
  anchor_gene : None, str
671
+ | ENSEMBL ID of gene to use as anchor in combination perturbations or in testing effect on downstream genes.
672
+ | For example, if combos=1 and anchor_gene="ENSG00000136574":
673
+ | analyzes data for anchor gene perturbed in combination with each other gene.
674
+ | However, if combos=0 and anchor_gene="ENSG00000136574":
675
+ | analyzes data for the effect of anchor gene's perturbation on the embedding of each other gene.
676
  cell_states_to_model: None, dict
677
+ | Cell states to model if testing perturbations that achieve goal state change.
678
+ | Four-item dictionary with keys: state_key, start_state, goal_state, and alt_states
679
+ | state_key: key specifying name of column in .dataset that defines the start/goal states
680
+ | start_state: value in the state_key column that specifies the start state
681
+ | goal_state: value in the state_key column taht specifies the goal end state
682
+ | alt_states: list of values in the state_key column that specify the alternate end states
683
+ | For example: {"state_key": "disease",
684
+ | "start_state": "dcm",
685
+ | "goal_state": "nf",
686
+ | "alt_states": ["hcm", "other1", "other2"]}
687
  token_dictionary_file : Path
688
+ | Path to pickle file containing token dictionary (Ensembl ID:token).
689
  gene_name_id_dictionary_file : Path
690
+ | Path to pickle file containing gene name to ID dictionary (gene name:Ensembl ID).
691
  """
692
 
693
  self.mode = mode
 
855
  Get stats for in silico perturbation data and save as results in output_directory.
856
 
857
  Parameters
858
+ ~~~~~~~~~~
859
  input_data_directory : Path
860
+ | Path to directory containing cos_sim dictionary inputs
861
  null_dist_data_directory : Path
862
+ | Path to directory containing null distribution cos_sim dictionary inputs
863
  output_directory : Path
864
+ | Path to directory where perturbation data will be saved as .csv
865
  output_prefix : str
866
+ | Prefix for output .csv
867
  null_dict_list: dict
868
+ | List of loaded null distribtion dictionary if more than one comparison vs. the null is to be performed
869
 
870
  Outputs
871
+ ~~~~~~~
872
  Definition of possible columns in .csv output file.
873
 
874
+ | Of note, not all columns will be present in all output files.
875
+ | Some columns are specific to particular perturbation modes.
876
+
877
+ | "Gene": gene token
878
+ | "Gene_name": gene name
879
+ | "Ensembl_ID": gene Ensembl ID
880
+ | "N_Detections": number of cells in which each gene or gene combination was detected in the input dataset
881
+ | "Sig": 1 if FDR<0.05, otherwise 0
882
+
883
+ | "Shift_to_goal_end": cosine shift from start state towards goal end state in response to given perturbation
884
+ | "Shift_to_alt_end": cosine shift from start state towards alternate end state in response to given perturbation
885
+ | "Goal_end_vs_random_pval": pvalue of cosine shift from start state towards goal end state by Wilcoxon
886
+ | pvalue compares shift caused by perturbing given gene compared to random genes
887
+ | "Alt_end_vs_random_pval": pvalue of cosine shift from start state towards alternate end state by Wilcoxon
888
+ | pvalue compares shift caused by perturbing given gene compared to random genes
889
+ | "Goal_end_FDR": Benjamini-Hochberg correction of "Goal_end_vs_random_pval"
890
+ | "Alt_end_FDR": Benjamini-Hochberg correction of "Alt_end_vs_random_pval"
891
+
892
+ | "Test_avg_shift": cosine shift in response to given perturbation in cells from test distribution
893
+ | "Null_avg_shift": cosine shift in response to given perturbation in cells from null distribution (e.g. random cells)
894
+ | "Test_vs_null_avg_shift": difference in cosine shift in cells from test vs. null distribution
895
+ | (i.e. "Test_avg_shift" minus "Null_avg_shift")
896
+ | "Test_vs_null_pval": pvalue of cosine shift in test vs. null distribution
897
+ | "Test_vs_null_FDR": Benjamini-Hochberg correction of "Test_vs_null_pval"
898
+ | "N_Detections_test": "N_Detections" in cells from test distribution
899
+ | "N_Detections_null": "N_Detections" in cells from null distribution
900
+
901
+ | "Anchor_shift": cosine shift in response to given perturbation of anchor gene
902
+ | "Test_token_shift": cosine shift in response to given perturbation of test gene
903
+ | "Sum_of_indiv_shifts": sum of cosine shifts in response to individually perturbing test and anchor genes
904
+ | "Combo_shift": cosine shift in response to given perturbation of both anchor and test gene(s) in combination
905
+ | "Combo_minus_sum_shift": difference of cosine shifts in response combo perturbation vs. sum of individual perturbations
906
+ | (i.e. "Combo_shift" minus "Sum_of_indiv_shifts")
907
+ | "Impact_component": whether the given perturbation was modeled to be within the impact component by the mixture model
908
+ | 1: within impact component; 0: not within impact component
909
+ | "Impact_component_percent": percent of cells in which given perturbation was modeled to be within impact component
910
+
911
+ | In case of aggregating gene shifts:
912
+ | "Perturbed": ID(s) of gene(s) being perturbed
913
+ | "Affected": ID of affected gene or "cell_emb" indicating the impact on the cell embedding as a whole
914
+ | "Cosine_shift_mean": mean of cosine shift of modeled perturbation on affected gene or cell
915
+ | "Cosine_shift_stdev": standard deviation of cosine shift of modeled perturbation on affected gene or cell
916
  """
917
 
918
  if self.mode not in [
geneformer/tokenizer.py CHANGED
@@ -1,17 +1,37 @@
1
  """
2
  Geneformer tokenizer.
3
 
4
- Input data:
5
- Required format: raw counts scRNAseq data without feature selection as .loom file
6
- Required row (gene) attribute: "ensembl_id"; Ensembl ID for each gene
7
- Required col (cell) attribute: "n_counts"; total read counts in that cell
8
- Optional col (cell) attribute: "filter_pass"; binary indicator of whether cell should be tokenized based on user-defined filtering criteria
9
- Optional col (cell) attributes: any other cell metadata can be passed on to the tokenized dataset as a custom attribute dictionary as shown below
10
-
11
- Usage:
12
- from geneformer import TranscriptomeTokenizer
13
- tk = TranscriptomeTokenizer({"cell_type": "cell_type", "organ_major": "organ_major"}, nproc=4)
14
- tk.tokenize_data("data_directory", "output_directory", "output_prefix")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
15
  """
16
 
17
  from __future__ import annotations
@@ -68,11 +88,11 @@ class TranscriptomeTokenizer:
68
  Initialize tokenizer.
69
 
70
  Parameters
71
- ----------
72
  custom_attr_name_dict : None, dict
73
- Dictionary of custom attributes to be added to the dataset.
74
- Keys are the names of the attributes in the loom file.
75
- Values are the names of the attributes in the dataset.
76
  nproc : int
77
  Number of processes to use for dataset mapping.
78
  chunk_size: int = 512
@@ -119,7 +139,7 @@ class TranscriptomeTokenizer:
119
  Tokenize .loom files in data_directory and save as tokenized .dataset in output_directory.
120
 
121
  Parameters
122
- ----------
123
  data_directory : Path
124
  Path to directory containing loom files or anndata files
125
  output_directory : Path
 
1
  """
2
  Geneformer tokenizer.
3
 
4
+ **Input data:**
5
+
6
+ | *Required format:* raw counts scRNAseq data without feature selection as .loom or anndata file.
7
+ | *Required row (gene) attribute:* "ensembl_id"; Ensembl ID for each gene.
8
+ | *Required col (cell) attribute:* "n_counts"; total read counts in that cell.
9
+
10
+ | *Optional col (cell) attribute:* "filter_pass"; binary indicator of whether cell should be tokenized based on user-defined filtering criteria.
11
+ | *Optional col (cell) attributes:* any other cell metadata can be passed on to the tokenized dataset as a custom attribute dictionary as shown below.
12
+
13
+ **Usage:**
14
+
15
+ .. code-block :: python
16
+
17
+ >>> from geneformer import TranscriptomeTokenizer
18
+ >>> tk = TranscriptomeTokenizer({"cell_type": "cell_type", "organ_major": "organ"}, nproc=4)
19
+ >>> tk.tokenize_data("data_directory", "output_directory", "output_prefix")
20
+
21
+ **Description:**
22
+
23
+ | Input data is a directory with .loom or .h5ad files containing raw counts from single cell RNAseq data, including all genes detected in the transcriptome without feature selection. The input file type is specified by the argument file_format in the tokenize_data function.
24
+
25
+ | The discussion below references the .loom file format, but the analagous labels are required for .h5ad files, just that they will be column instead of row attributes and vice versa due to the transposed format of the two file types.
26
+
27
+ | Genes should be labeled with Ensembl IDs (loom row attribute "ensembl_id"), which provide a unique identifer for conversion to tokens. Other forms of gene annotations (e.g. gene names) can be converted to Ensembl IDs via Ensembl Biomart. Cells should be labeled with the total read count in the cell (loom column attribute "n_counts") to be used for normalization.
28
+
29
+ | No cell metadata is required, but custom cell attributes may be passed onto the tokenized dataset by providing a dictionary of custom attributes to be added, which is formatted as loom_col_attr_name : desired_dataset_col_attr_name. For example, if the original .loom dataset has column attributes "cell_type" and "organ_major" and one would like to retain these attributes as labels in the tokenized dataset with the new names "cell_type" and "organ", respectively, the following custom attribute dictionary should be provided: {"cell_type": "cell_type", "organ_major": "organ"}.
30
+
31
+ | Additionally, if the original .loom file contains a cell column attribute called "filter_pass", this column will be used as a binary indicator of whether to include these cells in the tokenized data. All cells with "1" in this attribute will be tokenized, whereas the others will be excluded. One may use this column to indicate QC filtering or other criteria for selection for inclusion in the final tokenized dataset.
32
+
33
+ | If one's data is in other formats besides .loom or .h5ad, one can use the relevant tools (such as Anndata tools) to convert the file to a .loom or .h5ad format prior to running the transcriptome tokenizer.
34
+
35
  """
36
 
37
  from __future__ import annotations
 
88
  Initialize tokenizer.
89
 
90
  Parameters
91
+ ~~~~~~~~~~
92
  custom_attr_name_dict : None, dict
93
+ | Dictionary of custom attributes to be added to the dataset.
94
+ | Keys are the names of the attributes in the loom file.
95
+ | Values are the names of the attributes in the dataset.
96
  nproc : int
97
  Number of processes to use for dataset mapping.
98
  chunk_size: int = 512
 
139
  Tokenize .loom files in data_directory and save as tokenized .dataset in output_directory.
140
 
141
  Parameters
142
+ ~~~~~~~~~~
143
  data_directory : Path
144
  Path to directory containing loom files or anndata files
145
  output_directory : Path