It looks like the config file at '/tmp/tmpjk2l9r6n' is not a valid JSON file.
Hi,
I am trying to run the model and I’m getting this error when running select & load model.
Downloading (…)ptain-1337/CrudeBERT:
55.6k/? [00:00<00:00, 1.99MB/s]
JSONDecodeError Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/transformers/configuration_utils.py in _get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
657 # Load config dict
--> 658 config_dict = cls._dict_from_json_file(resolved_config_file)
659 config_dict["_commit_hash"] = commit_hash
8 frames
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/transformers/configuration_utils.py in _get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
659 config_dict["_commit_hash"] = commit_hash
660 except (json.JSONDecodeError, UnicodeDecodeError):
--> 661 raise EnvironmentError(
662 f"It looks like the config file at '{resolved_config_file}' is not a valid JSON file."
663 )
OSError: It looks like the config file at '/tmp/tmpjk2l9r6n' is not a valid JSON file.
Can you help me with that? Thank you.
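One way to confirm what the error message reports is to inspect a local copy of the config file directly. The following is only a minimal sketch: `check_json_file` is a hypothetical helper (not part of transformers), and it assumes you have the file on disk, since the temporary path in the traceback is usually gone by the time you look.

```python
import json
from pathlib import Path


def check_json_file(path):
    """Return (ok, detail) describing whether the file at `path` parses as JSON."""
    raw = Path(path).read_bytes()
    try:
        json.loads(raw.decode("utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError) as exc:
        # Include the first bytes so an empty file or non-JSON content is easy to spot.
        return False, f"{type(exc).__name__}: {exc}; first bytes: {raw[:60]!r}"
    return True, "valid JSON"
```

Calling `check_json_file("crude_bert_config.json")` on a downloaded copy returns `(True, "valid JSON")` when the file is fine; otherwise the second element names the exception and shows how the file begins.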
Hi,
I believe that error was caused on my side.
I'll upload the files for JSON and config again.
Please let me know if that works for you.
Best Regards,
Hopefully it will work now :)
hi Captain-1337! Read your paper, looks cool, I'd like to play with your model, however I'm seeing this on the Hosted Inference API: Can't load tokenizer using from_pretrained, please update its configuration: stat: path should be string, bytes, os.PathLike or integer, not NoneType
Dear @rjwm
I appreciate your interest.
I'll try to fix it by Wednesday and notify you once done.
Best Regards
Hi,
I would like to apply this model to my dataset, but I can't load the tokenizer using from_pretrained and get the same error:
stat: path should be string, bytes, os.PathLike or integer, not NoneType
Can you please help me?
Best regards
Hi,
I believe that error is unrelated to the tokenizer.
Try setting the path as follows:
import sys
from pathlib import Path
sys.path.append('C:/Users/USERNAME/Desktop/finbert')
project_dir = Path.cwd().parent
path = project_dir/'Language_Models'/'CrudeBERT'
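Before passing `path` on, it may help to check that the folder really contains the files the model needs. This is a rough sketch under assumptions: `missing_model_files` is a hypothetical helper, and the two file names are the ones listed in the repository (crude_bert_config.json and crude_bert_model.bin).

```python
from pathlib import Path

# File names as published in the CrudeBERT repository.
REQUIRED_FILES = ("crude_bert_config.json", "crude_bert_model.bin")


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the names in `required` that are not present in `model_dir`."""
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).is_file()]
```

An empty list from `missing_model_files(path)` means both files were found; any name in the list points to a download or path problem rather than a tokenizer problem.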
Warmest regards
Here is a quick guide on how to use CrudeBERT.
Step one:
Download the two files (crude_bert_config.json and crude_bert_model.bin)
from https://huggingface.co/Captain-1337/CrudeBERT/tree/main
Step two:
Create a Jupyter Notebook script in the same folder where the files are stored and include the code mentioned below:
Code:
import torch
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
import pandas as pd
# List of example headlines
headlines = [
"Major Explosion, Fire at Oil Refinery in Southeast Philadelphia",
"PETROLEOS confirms Gulf of Mexico oil platform accident",
"CASUALTIES FEARED AT OIL ACCIDENT NEAR IRANS BORDER",
"EIA Chief expects Global Oil Demand Growth 1 M B/D to 2011",
"Turkey Jan-Oct Crude Imports +98.5% To 57.9M MT",
"China’s crude oil imports up 78.30% in February 2019",
"Russia Energy Agency: Sees Oil Output put Flat In 2005",
"Malaysia Oil Production Steady This Year At 700,000 B/D",
"ExxonMobil:Nigerian Oil Output Unaffected By Union Threat",
"Yukos July Oil Output Flat On Mo, 1.73M B/D - Prime-Tass",
"2nd UPDATE: Mexico’s Oil Output Unaffected By Hurricane",
"UPDATE: Ecuador July Oil Exports Flat On Mo At 337,000 B/D",
"China February Crude Imports -16.0% On Year",
"Turkey May Crude Imports down 11.0% On Year",
"Japan June Crude Oil Imports decrease 10.9% On Yr",
"Iran’s Feb Oil Exports +20.9% On Mo at 1.56M B/D - Official",
"Apache announces large petroleum discovery in Philadelphia",
"Turkey finds oil near Syria, Iraq border"
]
example_headlines = pd.DataFrame(headlines, columns=["Headline"])
config_path = './crude_bert_config.json'
model_path = './crude_bert_model.bin'
# Load the configuration
config = AutoConfig.from_pretrained(config_path)
# Create the model from the configuration
model = AutoModelForSequenceClassification.from_config(config)
# Load the model's state dictionary
state_dict = torch.load(model_path)
# Inspect keys; if "bert.embeddings.position_ids" is unexpected, remove or adjust it
state_dict.pop("bert.embeddings.position_ids", None)
# Load the adjusted state dictionary into the model
model.load_state_dict(state_dict, strict=False) # Using strict=False to ignore non-critical mismatches
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Define the prediction function
def predict_to_df(texts, model, tokenizer):
    model.eval()
    data = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=64)
        with torch.no_grad():
            outputs = model(**inputs)
        logits = outputs.logits
        softmax_scores = torch.nn.functional.softmax(logits, dim=-1)
        pred_label_id = torch.argmax(softmax_scores, dim=-1).item()
        class_names = ['positive', 'negative', 'neutral']
        predicted_label = class_names[pred_label_id]
        data.append([text, predicted_label])
    df = pd.DataFrame(data, columns=["Headline", "Classification"])
    return df
# Create DataFrame
example_headlines = pd.DataFrame(headlines, columns=["Headline"])
# Apply classification
result_df = predict_to_df(example_headlines['Headline'].tolist(), model, tokenizer)
result_df
Step three:
Execute the cells of the Jupyter Notebook.
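Because the guide loads the weights with strict=False, mismatched keys are ignored silently. If you want to see which keys differ, something like the sketch below could help. `compare_keys` is a hypothetical helper written for illustration, and the key names in the example are placeholders apart from "bert.embeddings.position_ids", which is the one mentioned in the guide.

```python
def compare_keys(expected_keys, loaded_keys):
    """Return (missing, unexpected) key lists, each sorted alphabetically."""
    expected, loaded = set(expected_keys), set(loaded_keys)
    missing = sorted(expected - loaded)      # the model wants these, the file lacks them
    unexpected = sorted(loaded - expected)   # the file has these, the model does not
    return missing, unexpected


# Illustration with plain strings instead of real tensors.
expected = ["bert.embeddings.word_embeddings.weight", "classifier.weight"]
loaded = expected + ["bert.embeddings.position_ids"]
missing, unexpected = compare_keys(expected, loaded)
print(missing)     # []
print(unexpected)  # ['bert.embeddings.position_ids']
```

In the notebook this would be called as `compare_keys(model.state_dict().keys(), state_dict.keys())`. The value returned by `model.load_state_dict(...)` also carries `missing_keys` and `unexpected_keys`, so printing it gives similar information.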
If you face any difficulties or have other questions, contact me here or on LinkedIn.
FYI: I took the example headlines from one of our recent publications:
So, your classification output should reflect this as well.
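If you also want a confidence value next to each label, the softmax and argmax step from the prediction function can be illustrated in plain Python. This is only a sketch for one row of logits: `label_from_logits` is a hypothetical helper, and the class order is the one used in the guide.

```python
import math

CLASS_NAMES = ["positive", "negative", "neutral"]  # order taken from the guide


def label_from_logits(logits, class_names=CLASS_NAMES):
    """Return (label, probability) for a single row of logits."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    return class_names[best], probs[best]
```

Inside the notebook, the input for one headline would be `outputs.logits[0].tolist()`.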