cdsBERT

Model description

cdsBERT is a protein language model (pLM) with a codon vocabulary. It was seeded with ProtBERT and trained with MELD, a novel vocabulary extension pipeline. The resulting latent space is highly biologically relevant and performs very well at Enzyme Commission (EC) number prediction. This repository holds the full-precision checkpoint obtained after training with the masked language modeling (MLM) objective on 4 million coding sequence (CDS) examples.

How to use

# Imports
import torch
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained('lhallee/cdsBERT') # load model
tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT') # load tokenizer
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # gather device
model.to(device) # move to device
model.eval() # put in eval mode

sequence = '( Z [MASK] V L P Y G D E K L S P Y G D G G D V G Q I F s C # L Q D T N N F F G A g Q N K % O P K L G Q I G % S K % u u i e d d R i d D V L k n ( T D K @ p p ^ v ]' #  CCDS207.1|Hs110|chr1

# Create a fill-mask prediction pipeline on the same device as the model
unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer, device=device)
# Predict the masked token
prediction = unmasker(sequence)
print(prediction)
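
For an input with a single [MASK] token, the fill-mask pipeline returns a list of dicts with the keys score, token, token_str, and sequence. The sketch below shows one way to reduce that list to the top candidates. The helper name top_tokens and the example values are hypothetical and only illustrate the shape of the output; real tokens and scores from cdsBERT will differ.

```python
def top_tokens(predictions, k=3):
    """Return the k highest-scoring (token_str, score) pairs.

    `predictions` is the list of dicts returned by a fill-mask pipeline
    for an input that contains exactly one [MASK] token.
    """
    ranked = sorted(predictions, key=lambda p: p['score'], reverse=True)
    return [(p['token_str'], round(p['score'], 4)) for p in ranked[:k]]


# Hypothetical pipeline output, for illustration only.
example = [
    {'score': 0.10, 'token': 7, 'token_str': 'V', 'sequence': '...'},
    {'score': 0.62, 'token': 5, 'token_str': 'L', 'sequence': '...'},
    {'score': 0.21, 'token': 9, 'token_str': 'G', 'sequence': '...'},
]

for token_str, score in top_tokens(example, k=2):
    print(f'{token_str}\t{score}')
```

With the real pipeline, top_tokens(prediction) can be called directly on the result above. If the input contains several [MASK] tokens, the pipeline returns one such list per mask, so the helper would need to be applied to each inner list.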

Intended use and limitations

cdsBERT serves as a general-purpose protein language model with a codon vocabulary. Out of the box, it supports feature extraction. Fine-tuning with Hugging Face transformers model classes such as BertForSequenceClassification enables downstream classification and regression tasks. This post-MLM checkpoint can perform mask filling, while the cdsBERT+ checkpoint has a more biochemically relevant latent space.
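
As a rough illustration of feature extraction, the sketch below mean-pools the final hidden layer into one vector per sequence. Mean pooling is an assumption made here for illustration, not necessarily the pooling used by the authors, and the helper names mean_pool and embed are hypothetical. Note that this average also includes the special [CLS] and [SEP] positions.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


@torch.no_grad()
def embed(sequences, model, tokenizer, device):
    """Return one mean-pooled vector per sequence (hypothetical helper)."""
    batch = tokenizer(sequences, return_tensors='pt', padding=True).to(device)
    # Requesting hidden states works with BertForMaskedLM as loaded above;
    # hidden_states[-1] is the output of the final encoder layer.
    outputs = model(**batch, output_hidden_states=True)
    return mean_pool(outputs.hidden_states[-1], batch['attention_mask'])
```

With the model, tokenizer, and device from the usage section, embed([sequence], model, tokenizer, device) would return a tensor of shape (1, hidden_size) that can feed a downstream classifier or regressor.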

Our lab

The Gleghorn lab is an interdisciplinary research group at the University of Delaware that focuses on solving translational problems with our expertise in engineering, biology, and chemistry. We develop inexpensive and reliable tools to study organ development, maternal-fetal health, and drug delivery. Recently, we began exploring protein language models, and we strive to make protein design and annotation accessible.

Please cite

@article{Hallee_cds_2023,
  author = {Logan Hallee and Nikolaos Rafailidis and Jason P. Gleghorn},
  title = {cdsBERT - Extending Protein Language Models with Codon Awareness},
  year = {2023},
  doi = {10.1101/2023.09.15.558027},
  publisher = {Cold Spring Harbor Laboratory},
  journal = {bioRxiv}
}
