DistilRoBERTa-query-wellformedness
This model is DistilRoBERTa base fine-tuned for regression on Google's query wellformedness dataset, which contains 25,100 queries from the Paralex corpus. Each query was annotated by five raters with a binary (0/1) judgment of whether it is well-formed, and the averaged judgments form the continuous rating the model is trained to predict.
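As a rough illustration of how five rater judgments could become a single regression target, the sketch below averages binary votes into a score between 0 and 1. It assumes each rater gives a 0/1 vote and that the target is the plain mean; the function name and the example votes are hypothetical and are not taken from the dataset.

```python
def wellformedness_target(rater_votes):
    """Average binary rater votes (1 = well-formed, 0 = not) into a score in [0, 1]."""
    if not rater_votes or any(v not in (0, 1) for v in rater_votes):
        raise ValueError("expected a non-empty list of 0/1 votes")
    return sum(rater_votes) / len(rater_votes)


# Hypothetical votes from five raters for one query.
example_votes = [1, 1, 0, 1, 1]
print(wellformedness_target(example_votes))  # 0.8
```

Under this assumption, five raters can only produce the targets 0.0, 0.2, 0.4, 0.6, 0.8 and 1.0, while the model itself outputs any value in between.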
Model description
A regression head is appended to the DistilRoBERTa model to adapt it to regression. The head must be loaded alongside the base model at inference time; without it, the predictions are not accurate. The model scores a query for completeness and grammatical correctness on a scale from 0 to 1, where 1 indicates a well-formed query.
Usage
The Inference API is disabled because this is a regression task rather than a text classification task, and Hugging Face does not provide a pipeline for regression. Because of the training dataset, the model performs better on queries phrased as questions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("AdamCodd/distilroberta-query-wellformedness")

class RegressionModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained("AdamCodd/distilroberta-query-wellformedness")
        self.regression_head = torch.nn.Linear(self.model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, **kwargs):
        # Encode with the base model, then score the first token's hidden state
        outputs = self.model.base_model(input_ids=input_ids, attention_mask=attention_mask)
        rating = self.regression_head(outputs.last_hidden_state[:, 0, :])
        rating = torch.sigmoid(rating)  # squash the score into (0, 1)
        # Drop only the last dimension so a single-sentence batch still returns a list
        return rating.squeeze(-1)

regression_model = RegressionModel()

# Do not forget to set the correct path to load the regression head
regression_model.regression_head.load_state_dict(torch.load("path_to_the_regression_head.pth"))
regression_model.eval()

# Examples
sentences = [
    "The cat and dog in the yard.",
    "she don't like apples.",
    "Is rain sunny days sometimes?",
    "She enjoys reading books and playing chess.",
    "How many planets are there in our solar system?"
]

inputs = tokenizer(sentences, truncation=True, padding=True, return_tensors='pt')

with torch.no_grad():
    outputs = regression_model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])

predictions = outputs.tolist()

for i, rating in enumerate(predictions):
    print(f'Sentence: {sentences[i]}')
    print(f'Predicted Rating: {rating}\n')
Output:
Sentence: The cat and dog in the yard.
Predicted Rating: 0.20430190861225128
Sentence: she don't like apples.
Predicted Rating: 0.08289700001478195
Sentence: Is rain sunny days sometimes?
Predicted Rating: 0.20011138916015625
Sentence: She enjoys reading books and playing chess.
Predicted Rating: 0.8915354013442993
Sentence: How many planets are there in our solar system?
Predicted Rating: 0.974799394607544
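One possible way to use these scores downstream is to keep only the queries rated above a cutoff. The sketch below is illustrative only: the `filter_well_formed` helper and the 0.5 threshold are assumptions rather than part of the model, and the ratings are the predictions above rounded to three decimals.

```python
def filter_well_formed(sentences, ratings, threshold=0.5):
    """Return the sentences whose predicted rating is at least `threshold`."""
    return [s for s, r in zip(sentences, ratings) if r >= threshold]


sentences = [
    "The cat and dog in the yard.",
    "she don't like apples.",
    "Is rain sunny days sometimes?",
    "She enjoys reading books and playing chess.",
    "How many planets are there in our solar system?",
]
# Predicted ratings from the output above, rounded to three decimals.
ratings = [0.204, 0.083, 0.200, 0.892, 0.975]

for sentence in filter_well_formed(sentences, ratings):
    print(sentence)
# She enjoys reading books and playing chess.
# How many planets are there in our solar system?
```

A suitable threshold depends on the application and is best chosen on held-out data.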
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 400
- num_epochs: 5
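To make the linear scheduler with 400 warmup steps concrete, here is a minimal pure-Python sketch of the learning rate at a given optimizer step: it rises linearly from 0 to the base rate during warmup, then decays linearly to 0. The total number of training steps is not stated in this card, so `total_steps=5000` below is a hypothetical value chosen only for illustration.

```python
def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_steps=400):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# total_steps is an assumed value; the real one depends on dataset size and batch size.
for step in (0, 200, 400, 2700, 5000):
    print(step, linear_warmup_lr(step, total_steps=5000))
```

With these settings the rate peaks at 2e-05 exactly when warmup ends at step 400.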
Training results
Metrics: Mean Squared Error, R-Squared, Mean Absolute Error
- test_loss: 0.061837393790483475
- test_mse: 0.061837393790483475
- test_r2: 0.5726782083511353
- test_mae: 0.183049738407135
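For readers who want to compute the same three metrics on their own predictions, the following is a minimal sketch using the standard definitions of MSE, MAE and R-squared. It is not the evaluation code used for this model, and the sample values are made up. The reported test_loss equals test_mse, which suggests the loss was mean squared error.

```python
def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE and R-squared for paired true and predicted ratings."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all true values are equal")
    return {
        "mse": ss_res / n,
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Made-up ratings for illustration only.
y_true = [0.0, 0.2, 0.8, 1.0]
y_pred = [0.1, 0.3, 0.7, 0.9]
print(regression_metrics(y_true, y_pred))
```

An R-squared of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean rating.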
Framework versions
- Transformers 4.34.1
- PyTorch Lightning 2.1.0
- Tokenizers 0.14.1
Evaluation results
- Loss (self-reported): 0.062
- Validation Mean Squared Error (self-reported): 0.062
- Validation R-Squared (self-reported): 0.573
- Validation Mean Absolute Error (self-reported): 0.183