Model Card for Model ID
A Conditional Random Field (CRF) model for named entity recognition with handcrafted features. It recognizes three entity types: Violation-on, Violation-by, and Law. The dataset uses the BIO tagging format. The model achieves a macro-F1 score of 0.32.
Model Details
Model Description
The model was developed for the LegalLens 2024 competition, held as part of the Natural Legal Language Processing (NLLP) Workshop 2024. It uses handcrafted features to identify named entities tagged in the BIO format.
- Developed by: Shashank M Chakravarthy
- Funded by [optional]: NA
- Shared by [optional]: NA
- Model type: Statistical Model
- Language(s) (NLP): English
- License: Apache 2.0 License
- Finetuned from model [optional]: NA
Model Sources [optional]
- Repository: NA
- Paper [optional]: [https://aclanthology.org/2024.nllp-1.33.pdf]
- Demo [optional]: NA
Uses
The model is used to detect named entities in unstructured text. The model can be extended to other entities with further modification to the handcrafted features.
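Because the model emits one BIO tag per token, a small post-processing step is usually needed to turn the tags into entity spans. The following is a minimal sketch of that step; `bio_to_spans` is a hypothetical helper and the label name `LAW` is an assumed spelling for illustration, so check the dataset for the exact tag strings.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open span.
            current_tokens.append(token)
        else:
            # "O" or a stray I- tag closes the open span; the token is skipped.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Assumed label spelling, for illustration only.
tokens = ["The", "company", "violated", "the", "Clean", "Air", "Act", "."]
tags = ["O", "O", "O", "O", "B-LAW", "I-LAW", "I-LAW", "O"]
print(bio_to_spans(tokens, tags))
```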
Direct Use
The model can be used directly on unstructured text once the text has been tokenized and each token has been assigned a POS tag. The accompanying files include the evaluation script.
Downstream Use [optional]
Out-of-Scope Use
The features of this model are handcrafted for detecting violations and laws in text. The model can be used on other legal text that contains similar entities.
Bias, Risks, and Limitations
The main limitation comes from handcrafting the features.
Recommendations
If the text used for prediction is not preprocessed with POS tags, the model will not perform as it is designed to.
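One way to catch improperly processed input early is to validate it before feature extraction. The sketch below assumes the input layout used by the script further down, where each sentence is a list of `(token, pos_tag)` pairs; `check_sentences` is a hypothetical helper, not part of the released files.

```python
def check_sentences(sentences):
    """Raise ValueError unless every item is a (token, pos_tag) pair of strings."""
    for s_idx, sent in enumerate(sentences):
        for t_idx, item in enumerate(sent):
            if not (isinstance(item, (tuple, list)) and len(item) == 2):
                raise ValueError(
                    f"sentence {s_idx}, token {t_idx}: expected (token, pos_tag)"
                )
            token, pos = item
            if not isinstance(token, str) or not isinstance(pos, str) or not pos:
                raise ValueError(
                    f"sentence {s_idx}, token {t_idx}: missing or invalid POS tag"
                )


# Passes silently: every token carries a POS tag.
check_sentences([[("The", "DT"), ("court", "NN"), ("ruled", "VBD")]])
```

Calling the check right before building the feature dictionaries turns a silent drop in quality into an explicit error.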
How to Get Started with the Model
Use the code below to get started with the model.
Load libraries
import ast
import pandas as pd
import joblib
import nltk
from nltk import pos_tag
import string
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
Download the required NLTK resources if they are not already present
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download("averaged_perceptron_tagger")
Class for grouping tokens as sentences (redundant if text processed directly)
class getsentence(object):
    '''
    This class is used to get the sentences from the dataset.
    Converts from BIO format to sentences using their sentence numbers
    '''
    def __init__(self, data):
        self.n_sent = 1.0
        self.data = data
        self.empty = False
        self.grouped = self.data.groupby("sentence_num").apply(self._agg_func)
        self.sentences = [s for s in self.grouped]

    def _agg_func(self, s):
        return [(w, p) for w, p in zip(s["token"].values.tolist(),
                                       s["pos_tag"].values.tolist())]
Creates features for words in a sentence (code can be reduced using iteration)
def word2features(sent, i):
    '''
    This method is used to extract features from the words in the sentence.
    The main features extracted are:
    - word.lower(): The word in lowercase
    - word.isdigit(): If the word is a digit
    - word.punct(): If the word is a punctuation
    - postag: The pos tag of the word
    - word.lemma(): The lemma of the word
    - word.stem(): The stem of the word
    The features (not all) are also extracted for the 4 previous and 4 next words.
    '''
    wordnet_lemmatizer = WordNetLemmatizer()
    porter_stemmer = PorterStemmer()
    word = sent[i][0]
    postag = sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word.isdigit()': word.isdigit(),
        # Check if it is punctuation
        'word.punct()': word in string.punctuation,
        'postag': postag,
        # Lemma of the word
        'word.lemma()': wordnet_lemmatizer.lemmatize(word),
        # Stem of the word
        'word.stem()': porter_stemmer.stem(word)
    }
    # Features of up to 4 previous words; BOS marks the first token
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.isdigit()': word1.isdigit(),
            '-1:word.punct()': word1 in string.punctuation,
            '-1:postag': postag1
        })
        if i - 2 >= 0:
            features.update({
                '-2:word.lower()': sent[i-2][0].lower(),
                '-2:word.isdigit()': sent[i-2][0].isdigit(),
                '-2:word.punct()': sent[i-2][0] in string.punctuation,
                '-2:postag': sent[i-2][1]
            })
        if i - 3 >= 0:
            features.update({
                '-3:word.lower()': sent[i-3][0].lower(),
                '-3:word.isdigit()': sent[i-3][0].isdigit(),
                '-3:word.punct()': sent[i-3][0] in string.punctuation,
                '-3:postag': sent[i-3][1]
            })
        if i - 4 >= 0:
            features.update({
                '-4:word.lower()': sent[i-4][0].lower(),
                '-4:word.isdigit()': sent[i-4][0].isdigit(),
                '-4:word.punct()': sent[i-4][0] in string.punctuation,
                '-4:postag': sent[i-4][1]
            })
    else:
        features['BOS'] = True
    # Features of up to 4 next words; EOS marks the last token
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.isdigit()': word1.isdigit(),
            '+1:word.punct()': word1 in string.punctuation,
            '+1:postag': postag1
        })
        if i + 2 < len(sent):
            features.update({
                '+2:word.lower()': sent[i+2][0].lower(),
                '+2:word.isdigit()': sent[i+2][0].isdigit(),
                '+2:word.punct()': sent[i+2][0] in string.punctuation,
                '+2:postag': sent[i+2][1]
            })
        if i + 3 < len(sent):
            features.update({
                '+3:word.lower()': sent[i+3][0].lower(),
                '+3:word.isdigit()': sent[i+3][0].isdigit(),
                '+3:word.punct()': sent[i+3][0] in string.punctuation,
                '+3:postag': sent[i+3][1]
            })
        if i + 4 < len(sent):
            features.update({
                '+4:word.lower()': sent[i+4][0].lower(),
                '+4:word.isdigit()': sent[i+4][0].isdigit(),
                '+4:word.punct()': sent[i+4][0] in string.punctuation,
                '+4:postag': sent[i+4][1]
            })
    else:
        features['EOS'] = True
    return features
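The neighbour blocks in `word2features` repeat the same four features for every offset from -4 to +4, so they can be produced with a single loop. The sketch below covers only the context-window part and leaves out the lemma and stem features; `context_features` and its `window` parameter are hypothetical names, and the feature keys are kept identical to the ones above so that a trained model would see the same names.

```python
import string


def context_features(sent, i, window=4):
    """Neighbour features for token i, for offsets -window..+window."""
    features = {}
    if i == 0:
        features["BOS"] = True  # first token of the sentence
    if i == len(sent) - 1:
        features["EOS"] = True  # last token of the sentence
    for offset in range(-window, window + 1):
        j = i + offset
        if offset == 0 or not 0 <= j < len(sent):
            continue  # skip the token itself and out-of-range neighbours
        word, postag = sent[j]
        prefix = f"{offset:+d}"  # "-1", "+1", "+2", ...
        features.update({
            f"{prefix}:word.lower()": word.lower(),
            f"{prefix}:word.isdigit()": word.isdigit(),
            f"{prefix}:word.punct()": word in string.punctuation,
            f"{prefix}:postag": postag,
        })
    return features


sent = [("The", "DT"), ("court", "NN"), (".", ".")]
print(sorted(context_features(sent, 0)))
```

Merging this dictionary into the per-token features would replace the eight hand-written blocks while keeping the window size configurable.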
Obtain features for a given sentence
def sent2features(sent):
    '''
    This method is used to extract features from the sentence.
    '''
    return [word2features(sent, i) for i in range(len(sent))]
Load file from your directory
df_eval = pd.read_excel("testset_NER_LegalLens.xlsx")
Evaluate data type and create pos_tags for each token
df_eval["tokens"] = df_eval["tokens"].apply(ast.literal_eval)
df_eval['pos_tags'] = df_eval['tokens'].apply(lambda x: [tag[1]
for tag in pos_tag(x)])
Aggregate tokens to sentences
data_eval = []
for i in range(len(df_eval)):
    for j in range(len(df_eval["tokens"][i])):
        data_eval.append(
            {
                "sentence_num": i+1,
                "id": df_eval["id"][i],
                "token": df_eval["tokens"][i][j],
                "pos_tag": df_eval["pos_tags"][i][j],
            }
        )
data_eval = pd.DataFrame(data_eval)
getter = getsentence(data_eval)
sentences_eval = getter.sentences
X_eval = [sent2features(s) for s in sentences_eval]
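As the note on the grouping class says, flattening the rows and regrouping them is redundant when each row already holds a token list and a matching POS-tag list. A minimal sketch of the direct route follows; `to_sentences` is a hypothetical helper, and it assumes the two lists in each row have equal length.

```python
def to_sentences(token_lists, pos_tag_lists):
    """Pair each row's tokens with its POS tags, one sentence per row."""
    sentences = []
    for tokens, tags in zip(token_lists, pos_tag_lists):
        if len(tokens) != len(tags):
            raise ValueError("tokens and POS tags must have the same length")
        sentences.append(list(zip(tokens, tags)))
    return sentences


rows_tokens = [["The", "court", "ruled"], ["Appeal", "denied"]]
rows_tags = [["DT", "NN", "VBD"], ["NN", "VBN"]]
print(to_sentences(rows_tokens, rows_tags))
```

With the columns used in this script, the call would be `to_sentences(df_eval["tokens"], df_eval["pos_tags"])`. Unlike the group-by route, this keeps exactly one entry per row even when a row has no tokens, so the predictions stay aligned with the DataFrame.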
Load model from your directory
crf = joblib.load("../models/crf.pkl")
y_pred_eval = crf.predict(X_eval)
print("NER tags predicted.")
df_eval["ner_tags"] = y_pred_eval
df_eval.drop(columns=["pos_tags"], inplace=True)
print("Saving the predictions...")
df_eval.to_csv("predictions_NERLens.csv", index=False)
print("Predictions saved.")
Training Details
Training Data
[https://huggingface.co/datasets/darrow-ai/LegalLensNER]
Training Procedure
The token column was first parsed from its string form into Python lists, and POS tags were then created for each token in the text. The model was trained on a CPU using the handcrafted features. Training takes around 20-30 minutes on this dataset.
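The datatype step matters because a list column read back from a spreadsheet or CSV arrives as a string, not a list. A small sketch of the conversion, mirroring the `ast.literal_eval` call in the script above; `parse_tokens` is a hypothetical wrapper that also accepts values that are already lists.

```python
import ast


def parse_tokens(value):
    """Return a list of tokens from a list or its string representation."""
    if isinstance(value, list):
        return value
    parsed = ast.literal_eval(value)  # safely evaluates Python literals only
    if not isinstance(parsed, list):
        raise ValueError("expected a list of tokens")
    return parsed


raw = "['The', 'Clean', 'Air', 'Act']"  # how the cell looks after loading
print(parse_tokens(raw))
```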
Preprocessing [optional]
Every token was assigned a POS tag using the NLTK library.
Training Hyperparameters
- Training regime: NA
Speeds, Sizes, Times [optional]
NA
Evaluation
The model was evaluated using macro-F1 score. A score of 0.32 was obtained on unseen test data.
Testing Data, Factors & Metrics
Testing Data
[https://huggingface.co/datasets/darrow-ai/LegalLensNER]
Factors
[More Information Needed]
Metrics
Macro-F1 score. It averages the per-class F1 scores with equal weight, so it reflects the true performance of the model and mitigates the performance boost created by highly skewed entities in the dataset.
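To illustrate why macro averaging counters label skew, the sketch below computes a token-level macro-F1 in plain Python. It is an illustration under stated assumptions, not the competition scorer: the official evaluation may score at the entity level or exclude the `O` label, and the tag names are assumed.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# A tagger that predicts "O" everywhere on a skewed sample.
y_true = ["O"] * 8 + ["B-LAW", "I-LAW"]
y_pred = ["O"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1(y_true, y_pred):.2f}")
```

On this sample the all-`O` tagger reaches 80% token accuracy, but its macro-F1 is far lower because the two entity labels each contribute an F1 of zero.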
Results
0.32 macro-F1 score on unseen data.
Summary
The model was designed and developed to tackle the NER task in unstructured text.
Model Examination [optional]
NA
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: 13th Gen Intel(R) Core(TM) i7-1365U
- Hours used: 0.5 hours
- Cloud Provider: NA
- Compute Region: NA
- Carbon Emitted: Unknown
Technical Specifications [optional]
Model Architecture and Objective
[More Information Needed]
Compute Infrastructure
[More Information Needed]
Hardware
[More Information Needed]
Software
[More Information Needed]
Citation [optional]
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Glossary [optional]
[More Information Needed]
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
[More Information Needed]
Model Card Contact
[More Information Needed]