TCRT5 model (pre-trained)

Model description

This is the pre-trained checkpoint used to finetune TCRT5, a seq2seq model for the conditional generation of T-cell receptor (TCR) sequences given a target peptide-MHC (pMHC). The model is built on the T5 architecture and implemented with the associated HuggingFace transformers abstractions. It is released along with the TCR-TRANSLATE paper.

Intended uses & limitations

This model is released to be used for seq2seq finetuning on custom datasets. It may be useful for either pMHC -> TCR (TCR design) or TCR -> pMHC (TCR de-orphanization) sequence generation. Additionally, it can be used (though it has not been tested in this capacity) for finetuning on classification- or regression-style tasks involving sequence representations of TCRs (CDR3β) and pMHCs (peptide + pseudo-sequence).

How to use

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("dkarthikeyan1/tcrt5_pre_tcrdb")
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_pre_tcrdb")

# Input format: [PMHC]<peptide>[SEP]<MHC pseudo-sequence>[EOS]
pmhc = "[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]"
encoded_pmhc = tokenizer(pmhc, return_tensors="pt")

# Encoder representations can be useful for classification/regression downstream tasks
enc_outputs = tcrt5.encoder(**encoded_pmhc)

Limitations

As it stands, the model was jointly pre-trained on peptide-pseudosequence pairs and CDR3β sequences. As such, sequences comprised of just a peptide, CDR3α, or other parts of the TCR would be out-of-distribution (OOD).

Training data

TCRT5 was pre-trained on masked span reconstruction over a dataset built around ~14M CDR3β sequences from TCRdb as well as ~780k peptide-pseudosequence pairs taken from IEDB. To correct for the data imbalance, the pMHC pairs were upsampled to bring the TCR:pMHC sequence ratio to 70:30.
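A back-of-the-envelope check of that upsampling step (using the approximate counts above, so the resulting factor is only illustrative):

```python
n_tcr = 14_000_000  # ~14M CDR3β sequences from TCRdb
n_pmhc = 780_000    # ~780k peptide-pseudosequence pairs from IEDB

# Hold the TCR count at 70% of the mixture and solve for the pMHC
# count needed to fill the remaining 30%.
target_pmhc = n_tcr * 30 / 70
upsample_factor = target_pmhc / n_pmhc

print(round(upsample_factor, 2))  # ~7.69: each pMHC pair appears ~7-8 times
```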

Training procedure

Preprocessing

All amino acid sequences and V/J gene names were standardized using the tidytcells package. MHC allele information was standardized using mhcgnomes before mapping alleles to the MHC pseudo-sequences defined in NetMHCpan.

Pre-training

TCRT5 was pretrained with a masked language modeling (MLM) objective: span reconstruction similar to the original pre-training loss of the T5 paper. For a given sequence, 15% of the tokens are masked using contiguous spans of random length from 1 to 3, marked with the sentinel tokens introduced in the T5 paper. The entire masked sequence is then passed into the model, which is trained to reconstruct a concatenated target comprised of each sentinel token followed by the tokens it masked. This forces the model to learn richer k-mer dependencies within the masked sequences.

Masks 'mlm_probability' of the tokens, grouped into spans of up to 'max_span_length', according to the following algorithm:
        * Randomly generate span lengths that add up to round(mlm_probability * seq_len) (ignoring pad tokens) for each sequence.
        * Ensure that spans are not directly adjacent, so that max_span_length is observed.
        * Once the span masks are generated according to T5 conventions, mask the inputs and generate the targets.

    Example Input:

    CASSLGQGYEQYF

    Masked Input:

    CASSLG[X]GY[Y]F

    Target:

    [X]Q[Y]EQY[Z]

Hyperparameters:

| #Enc. | #Dec. | Vocab. Size | D_model | Num Attn. Heads | Dropout | D_ff |
|---|---|---|---|---|---|---|
| 10 | 10 | 128 | 256 | 16 | 0.1 | 1024 |

Training arguments:

| Bsz. | LR | Steps | Weight Decay | Warmup |
|---|---|---|---|---|
| 512 | 3e-4 | 168k (~4 epochs) | 0.1 | 500 |
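The architecture row corresponds to standard `transformers` `T5Config` fields. A sketch of that mapping, which is my reading of the table rather than the authors' published config:

```python
# Architecture hyperparameters from the table above, expressed as the
# standard transformers.T5Config keyword arguments (mapping is my reading
# of the table, not the authors' published config file).
tcrt5_config_kwargs = {
    "num_layers": 10,          # #Enc.
    "num_decoder_layers": 10,  # #Dec.
    "vocab_size": 128,         # Vocab. Size
    "d_model": 256,            # D_model
    "num_heads": 16,           # Num Attn. Heads
    "dropout_rate": 0.1,       # Dropout
    "d_ff": 1024,              # D_ff
}

# Usage (requires transformers):
#   from transformers import T5Config, T5ForConditionalGeneration
#   model = T5ForConditionalGeneration(T5Config(**tcrt5_config_kwargs))
print(tcrt5_config_kwargs["d_model"])  # 256
```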

Hardware

  • Hardware Type: NVIDIA A100 80GB PCIe
  • Hours used: 60
  • Carbon Emitted: 6.48 kg CO2 eq.

Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

BibTeX entry and citation info

@article{dkarthikeyan2024tcrtranslate,
  title={TCR-TRANSLATE: Conditional Generation of Real Antigen Specific T-cell Receptor Sequences},
  author={Dhuvarakesh Karthikeyan and Colin Raffel and Benjamin Vincent and Alex Rubinsteyn},
  journal={bioRxiv},
  year={2024},
}