---
language:
- en
license: mit
library_name: peft
tags:
- esm
- esm2
- ESM-2
- protein language model
- LoRA
- Low Rank Adaptation
- biology
- CAFA-5
- protein function prediction
datasets:
- AmelieSchreiber/cafa_5
base_model: AmelieSchreiber/cafa_5_protein_function_prediction
---

# ESM-2 LoRA for CAFA-5 Protein Function Prediction
This is a Low Rank Adaptation (LoRA) of [cafa_5_protein_function_prediction](https://huggingface.co/AmelieSchreiber/cafa_5_protein_function_prediction), 
which is a fine-tuned (without LoRA) version of `facebook/esm2_t6_8M_UR50D`, for the same task. For more information 
on training a sequence classifier language model with LoRA, [see here](https://github.com/huggingface/peft/blob/main/examples/sequence_classification/LoRA.ipynb). 
Note that the linked example is for natural language processing and must be adapted to this use case by swapping in a protein language model such as ESM-2. 

## Training procedure
Using Hugging Face's Parameter-Efficient Fine-Tuning (PEFT) library, a Low Rank Adaptation was trained for 
3 epochs on the CAFA-5 protein sequences dataset with an 80/20 train/test split. The dataset can be 
[found here](https://huggingface.co/datasets/AmelieSchreiber/cafa_5). Somewhat naively, the model was trained on
the `train_sequences.fasta` file of protein sequences, with the `train_terms.tsv` file serving as the labels. 
The gene ontology is a hierarchy, so labels lower in the hierarchy should be weighted more, or the 
graph structure should otherwise be taken into account. The model achieved the following metrics:

```
Epoch: 3,
Validation Loss: 0.0031,
Validation Micro F1: 0.3752,
Validation Macro F1: 0.9968,
Validation Micro Precision: 0.5287,
Validation Macro Precision: 0.9992,
Validation Micro Recall: 0.2911,
Validation Macro Recall: 0.9968
```
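
The training procedure notes that the hierarchical structure of the gene ontology should be taken into account. A minimal sketch of two ways to do so is shown below: propagating each annotation to all of its ancestors, and weighting terms by their depth. The toy ontology and the function names are hypothetical; in `go-basic.obo` the parent links would come from the `is_a:` lines of each `[Term]` entry.

```python
# Toy ontology (hypothetical IDs): each term maps to its direct parents.
toy_parents = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "protein_binding": ["binding"],
    "kinase": ["catalysis", "binding"],
}

def ancestors(term, parents):
    """Return every ancestor of `term`, following parent links iteratively."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen

def propagate(terms, parents):
    """Close a set of annotations under the ancestor relation."""
    closed = set(terms)
    for term in terms:
        closed |= ancestors(term, parents)
    return closed

def depth(term, parents):
    """Length of the longest path from `term` up to a root term."""
    direct = parents.get(term, [])
    if not direct:
        return 0
    return 1 + max(depth(p, parents) for p in direct)

# Deeper (more specific) terms receive larger weights.
term_weights = {t: 1.0 + depth(t, toy_parents) for t in toy_parents}

print(sorted(propagate({"kinase"}, toy_parents)))
print(term_weights)
```

Whether depth-based weights or ancestor propagation works better for this task is an open question that this card does not evaluate.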

Future iterations of this model will likely need to take class weighting into account. 
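
One common form of class weighting for multi-label problems is to scale the positive term of the loss for each label by the ratio of negative to positive examples. The sketch below computes such weights in plain Python on made-up data; it is an illustration, not the method used for this model, and `positive_class_weights` is a hypothetical helper.

```python
def positive_class_weights(label_matrix):
    """Per-label weight = (# negatives) / (# positives).

    `label_matrix` is a list of rows, one per protein, each a list of 0/1
    indicators with one entry per label. Labels with no positives get 1.0.
    """
    num_examples = len(label_matrix)
    num_labels = len(label_matrix[0])
    weights = []
    for j in range(num_labels):
        positives = sum(row[j] for row in label_matrix)
        negatives = num_examples - positives
        weights.append(negatives / positives if positives else 1.0)
    return weights

# Made-up labels for 4 proteins and 3 terms
toy_labels = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
weights = positive_class_weights(toy_labels)
print(weights)

# With PyTorch, these could be passed to the loss as:
#   torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(weights))
```

Rare labels end up with weights above 1 and frequent labels with weights below 1, so a missed rare term is penalized more heavily.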

### Framework versions

- PEFT 0.4.0

## Using the Model

To use the model, download the data [from here](https://huggingface.co/datasets/AmelieSchreiber/cafa_5), 
adjust the file paths in the code below to match their locations on your machine, and run:

```python
import torch
from transformers import AutoTokenizer
from Bio import SeqIO

# Step 1: Data Preprocessing
fasta_file = "data/Train/train_sequences.fasta"
tsv_file = "data/Train/train_terms.tsv"

fasta_data = {}
tsv_data = {}

for record in SeqIO.parse(fasta_file, "fasta"):
    fasta_data[record.id] = str(record.seq)

with open(tsv_file, 'r') as f:
    for line in f:
        parts = line.strip().split("\t")
        tsv_data[parts[0]] = parts[1:]

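# NOTE: the iteration order of a set of strings can change between Python processes
# (string hashing is randomized unless PYTHONHASHSEED is fixed), so these indices only
# match the model's output units if this list is in the same order as during training.
unique_terms = list(set(term for terms in tsv_data.values() for term in terms))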

def parse_fasta(file_path):
    """
    Parses a FASTA file and returns a list of sequences.
    """
    with open(file_path, 'r') as f:
        content = f.readlines()

    sequences = []
    current_sequence = ""

    for line in content:
        if line.startswith(">"):
            if current_sequence:
                sequences.append(current_sequence)
                current_sequence = ""
        else:
            current_sequence += line.strip()

    if current_sequence:
        sequences.append(current_sequence)

    return sequences

# Parse the provided FASTA file
fasta_file_path = "data/Test/testsuperset.fasta"
protein_sequences = parse_fasta(fasta_file_path)
# protein_sequences[:3]  # Displaying the first 3 sequences for verification

# 1. Parsing the go-basic.obo file (Assuming this is still needed)
def parse_obo_file(file_path):
    with open(file_path, 'r') as f:
        data = f.read().split("[Term]")
        
    terms = []
    for entry in data[1:]:
        lines = entry.strip().split("\n")
        term = {}
        for line in lines:
            if line.startswith("id:"):
                term["id"] = line.split("id:")[1].strip()
            elif line.startswith("name:"):
                term["name"] = line.split("name:")[1].strip()
            elif line.startswith("namespace:"):
                term["namespace"] = line.split("namespace:")[1].strip()
            elif line.startswith("def:"):
                term["definition"] = line.split("def:")[1].split('"')[1]
        terms.append(term)
    return terms

# Path to go-basic.obo (modify if your copy is stored elsewhere)
obo_file_path = "data/Train/go-basic.obo"
parsed_terms = parse_obo_file(obo_file_path)

# 2. Load the base model, the LoRA adapter, and the tokenizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# LoRA adapter repository (replace with your own Hugging Face hub model name if needed)
model_id = "AmelieSchreiber/esm2_t6_8M_UR50D_cafa5_lora"
# Fine-tuned model that this adapter was trained on top of (see the model card header)
base_model_id = "AmelieSchreiber/cafa_5_protein_function_prediction"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# First, load the underlying base model
base_model = AutoModelForSequenceClassification.from_pretrained(base_model_id)

# Then, attach the LoRA adapter weights with PEFT
model = PeftModel.from_pretrained(base_model, model_id)
loaded_model = model
loaded_tokenizer = tokenizer

# 3. The predict_protein_function function
def predict_protein_function(sequence, model, tokenizer, go_terms):
    inputs = tokenizer(sequence, return_tensors="pt", padding=True, truncation=True, max_length=1022)
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.sigmoid(outputs.logits)
        predicted_indices = torch.where(predictions > 0.05)[1].tolist()
    
    functions = []
    for idx in predicted_indices:
        term_id = unique_terms[idx]  # Use the unique_terms list from your training script
        for term in go_terms:
            if term["id"] == term_id:
                functions.append(term["name"])
                break
                
    return functions

# 4. Predicting protein function for the sequences in the FASTA file
protein_functions = {}
for seq in protein_sequences[:20]:  # Using only the first 20 sequences for demonstration
    predicted_functions = predict_protein_function(seq, loaded_model, loaded_tokenizer, parsed_terms)
    protein_functions[seq[:20] + "..."] = predicted_functions  # Using first 20 characters as key

protein_functions
```
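
In the script above, `unique_terms` is built from a `set`, so its order can differ between Python processes, and index `i` at inference time is not guaranteed to refer to the same GO term as at training time. A minimal sketch of one way to pin the order when you control the training script is shown below; the helper names, file name, and example terms are hypothetical, and this does not recover the order used to train the published adapter.

```python
import json
import os
import tempfile

def save_label_order(terms, path):
    """Write a deterministic (sorted, de-duplicated) label list to JSON."""
    ordered = sorted(set(terms))
    with open(path, "w") as f:
        json.dump(ordered, f)
    return ordered

def load_label_order(path):
    """Read the label list back; index i is the same in every process."""
    with open(path) as f:
        return json.load(f)

# Made-up terms standing in for the contents of train_terms.tsv
example_terms = ["GO:0000003", "GO:0000001", "GO:0000003", "GO:0000002"]

with tempfile.TemporaryDirectory() as tmp_dir:
    label_path = os.path.join(tmp_dir, "label_order.json")
    saved = save_label_order(example_terms, label_path)  # at training time
    loaded = load_label_order(label_path)                # at inference time

# Map each term to the index of its output unit
term_to_index = {term: i for i, term in enumerate(loaded)}
print(loaded)
```

Saving this file alongside the adapter weights lets the inference script replace `list(set(...))` with a single call to `load_label_order`.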