TCRT5 model (pre-trained)

Model description

This is the pre-trained checkpoint used to finetune TCRT5, a seq2seq model for the conditional generation of T-cell receptor (TCR) sequences given a target peptide-MHC (pMHC). The model is built on the T5 architecture and implemented with the associated HuggingFace transformers abstractions. It is released along with the TCR-TRANSLATE paper.

Intended uses & limitations

This model is released to be used for seq2seq finetuning on custom datasets. It may be useful for either pMHC -> TCR (TCR design) or TCR -> pMHC (TCR de-orphanization) sequence generation. Additionally, it can be used (though it has not been tested in this capacity) for finetuning on classification- or regression-style tasks involving sequence representations of TCRs (CDR3β) and pMHCs (peptide + pseudo-sequence).

How to use

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("dkarthikeyan1/tcrt5_pre_tcrdb")
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_pre_tcrdb")

# Input format: [PMHC]<peptide>[SEP]<MHC pseudo-sequence>[EOS]
pmhc = "[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]"
encoded_pmhc = tokenizer(pmhc, return_tensors="pt")

# Encoder representations can be useful for classification/regression downstream tasks
enc_outputs = tcrt5.encoder(**encoded_pmhc)

Limitations

As it stands, the model was jointly pre-trained on peptide-pseudosequence pairs and CDR3β sequences. As such, sequences comprised of just a peptide, CDR3α, or other parts of the TCR would be out-of-distribution (OOD).

Training data

TCRT5 was pre-trained on masked span reconstruction over a dataset built around ~14M CDR3β sequences from TCRdb as well as ~780k peptide-pseudosequence pairs taken from IEDB. To correct for the data imbalance, the pMHC pairs were upsampled to bring the TCR:pMHC sequence ratio to 70:30.
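A back-of-the-envelope check of that upsampling step (using the approximate counts above, so the resulting factor is only illustrative):

```python
n_tcr = 14_000_000  # ~14M CDR3β sequences from TCRdb
n_pmhc = 780_000    # ~780k peptide-pseudosequence pairs from IEDB

# Hold the TCR count at 70% of the mixture and solve for the pMHC
# count needed to fill the remaining 30%.
target_pmhc = n_tcr * 30 / 70
upsample_factor = target_pmhc / n_pmhc

print(round(upsample_factor, 2))  # ~7.69: each pMHC pair appears ~7-8 times
```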

Training procedure

Preprocessing

All amino acid sequences and V/J gene names were standardized using the tidytcells package. MHC allele information was standardized using mhcgnomes before mapping alleles to the MHC pseudo-sequences defined in NetMHCpan.

Pre-training

TCRT5 was pretrained with a masked language modeling (MLM) objective: span reconstruction similar to the original pre-training loss of the T5 paper. For a given sequence, 15% of the tokens are masked using contiguous spans of random length from 1 to 3, marked with the sentinel tokens introduced in the T5 paper. The entire masked sequence is then passed into the model, which is trained to reconstruct a concatenated target comprised of each sentinel token followed by the tokens it masked. This forces the model to learn richer k-mer dependencies within the masked sequences.

Masks 'mlm_probability' of the tokens, grouped into spans of up to 'max_span_length', according to the following algorithm:
        * Randomly generate span lengths that add up to round(mlm_probability * seq_len) (ignoring pad tokens) for each sequence.
        * Ensure that spans are not directly adjacent, so that max_span_length is observed.
        * Once the span masks are generated according to T5 conventions, mask the inputs and generate the targets.

    Example Input:

    CASSLGQGYEQYF

    Masked Input:

    CASSLG[X]GY[Y]F

    Target:

    [X]Q[Y]EQY[Z]

Hyperparameters:

| #Enc. | #Dec. | Vocab. Size | D_model | Num Attn. Heads | Dropout | D_ff |
|---|---|---|---|---|---|---|
| 10 | 10 | 128 | 256 | 16 | 0.1 | 1024 |

Training arguments:

| Bsz. | LR | Steps | Weight Decay | Warmup |
|---|---|---|---|---|
| 512 | 3e-4 | 168k (~4 epochs) | 0.1 | 500 |
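The architecture row corresponds to standard `transformers` `T5Config` fields. A sketch of that mapping, which is my reading of the table rather than the authors' published config:

```python
# Architecture hyperparameters from the table above, expressed as the
# standard transformers.T5Config keyword arguments (mapping is my reading
# of the table, not the authors' published config file).
tcrt5_config_kwargs = {
    "num_layers": 10,          # #Enc.
    "num_decoder_layers": 10,  # #Dec.
    "vocab_size": 128,         # Vocab. Size
    "d_model": 256,            # D_model
    "num_heads": 16,           # Num Attn. Heads
    "dropout_rate": 0.1,       # Dropout
    "d_ff": 1024,              # D_ff
}

# Usage (requires transformers):
#   from transformers import T5Config, T5ForConditionalGeneration
#   model = T5ForConditionalGeneration(T5Config(**tcrt5_config_kwargs))
print(tcrt5_config_kwargs["d_model"])  # 256
```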

Hardware

  • Hardware Type: NVIDIA A100 80GB PCIe
  • Hours used: 60
  • Carbon Emitted: 6.48 kg CO2 eq.

Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

BibTeX entry and citation info

@article{dkarthikeyan2024tcrtranslate,
  title={TCR-TRANSLATE: Conditional Generation of Real Antigen Specific T-cell Receptor Sequences},
  author={Dhuvarakesh Karthikeyan and Colin Raffel and Benjamin Vincent and Alex Rubinsteyn},
  journal={bioRxiv},
  year={2024},
}