It looks like the config file at '/tmp/tmpjk2l9r6n' is not a valid JSON file.

#1
by burcukoc - opened

Hi,

I am trying to run the model and I'm getting this error at the "select & load model" step.

Downloading (…)ptain-1337/CrudeBERT:
55.6k/? [00:00<00:00, 1.99MB/s]

JSONDecodeError Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/transformers/configuration_utils.py in _get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
657 # Load config dict
--> 658 config_dict = cls._dict_from_json_file(resolved_config_file)
659 config_dict["_commit_hash"] = commit_hash

(8 intermediate frames omitted)
JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

OSError Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/transformers/configuration_utils.py in _get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
659 config_dict["_commit_hash"] = commit_hash
660 except (json.JSONDecodeError, UnicodeDecodeError):
--> 661 raise EnvironmentError(
662 f"It looks like the config file at '{resolved_config_file}' is not a valid JSON file."
663 )

OSError: It looks like the config file at '/tmp/tmpjk2l9r6n' is not a valid JSON file.

Can you help me with that? Thank you.
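
In case it helps with diagnosis, here is a minimal sketch for checking what actually ended up in the downloaded file (using the temp path from the traceback above; substitute whatever path your error reports):

import json

resolved_config_file = '/tmp/tmpjk2l9r6n'  # path reported by the traceback

# Peek at the first bytes: HTML or an error page here would mean the download
# itself failed rather than the config being malformed.
with open(resolved_config_file, 'rb') as f:
    print(f.read(200))

# Then try to parse it the same way transformers does.
try:
    with open(resolved_config_file, encoding='utf-8') as f:
        json.load(f)
    print('Config parses as valid JSON.')
except json.JSONDecodeError as e:
    print(f'Not valid JSON: {e}')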

Hi,

I believe that error was caused on my side.
I'll upload the JSON and config files again.
Please let me know if that works for you.

Best Regards,

Hopefully it will work now :)
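
One way to verify on your side is to re-run the load while forcing a fresh download, so a stale cached copy of the old file is not reused (a sketch, assuming the repo serves a standard config.json as the original load attempt implies; force_download is a from_pretrained keyword that redownloads instead of using the cache):

from transformers import AutoConfig

# Bypass the previously cached (corrupted) file and fetch the re-uploaded one.
config = AutoConfig.from_pretrained('Captain-1337/CrudeBERT', force_download=True)
print(config)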

Hi Captain-1337! I read your paper and it looks cool. I'd like to play with your model, but I'm seeing this on the Hosted Inference API: Can't load tokenizer using from_pretrained, please update its configuration: stat: path should be string, bytes, os.PathLike or integer, not NoneType

Dear @rjwm

I appreciate your interest.
I'll try to fix it by Wednesday and notify you once done.

Best Regards

Hi,

I would like to apply this model to my dataset, but I can't load the tokenizer using from_pretrained and get the same error:
stat: path should be string, bytes, os.PathLike or integer, not NoneType

Can you please help me?
Best regards

Hi,

I believe that error is unrelated to the tokenizer.
Try to set the path the following way:

import sys
from pathlib import Path

sys.path.append('C:/Users/USERNAME/Desktop/finbert')
project_dir = Path.cwd().parent

path = project_dir / 'Language_Models' / 'CrudeBERT'

Warmest regards
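
If adjusting the path does not resolve it: the NoneType message usually means from_pretrained found no tokenizer files to load. A workaround, and the same one used in the guide below, is to take the tokenizer from the base model instead (assumption: CrudeBERT is fine-tuned from bert-base-uncased):

from transformers import AutoTokenizer

# The CrudeBERT repo ships no tokenizer files, so load the base BERT tokenizer.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')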

Here is a quick guide on how you can use CrudeBERT

Step one:

Download the two files (crude_bert_config.json and crude_bert_model.bin)
from https://huggingface.co/Captain-1337/CrudeBERT/tree/main
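
If you prefer fetching the files programmatically, here is a sketch using huggingface_hub with the filenames listed in the repo:

from huggingface_hub import hf_hub_download

# The returned paths point into the local cache; they can be used directly
# as config_path and model_path in the script below.
config_path = hf_hub_download(repo_id='Captain-1337/CrudeBERT',
                              filename='crude_bert_config.json')
model_path = hf_hub_download(repo_id='Captain-1337/CrudeBERT',
                             filename='crude_bert_model.bin')
print(config_path, model_path)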

Step two:

Create a Jupyter notebook in the same folder where the files are stored and add the code below:

Code:

import torch
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
import pandas as pd

# List of example headlines
headlines = [
    "Major Explosion, Fire at Oil Refinery in Southeast Philadelphia",
    "PETROLEOS confirms Gulf of Mexico oil platform accident",
    "CASUALTIES FEARED AT OIL ACCIDENT NEAR IRANS BORDER",
    "EIA Chief expects Global Oil Demand Growth 1 M B/D to 2011",
    "Turkey Jan-Oct Crude Imports +98.5% To 57.9M MT",
    "China’s crude oil imports up 78.30% in February 2019",
    "Russia Energy Agency: Sees Oil Output put Flat In 2005",
    "Malaysia Oil Production Steady This Year At 700,000 B/D",
    "ExxonMobil:Nigerian Oil Output Unaffected By Union Threat",
    "Yukos July Oil Output Flat On Mo, 1.73M B/D - Prime-Tass",
    "2nd UPDATE: Mexico’s Oil Output Unaffected By Hurricane",
    "UPDATE: Ecuador July Oil Exports Flat On Mo At 337,000 B/D",
    "China February Crude Imports -16.0% On Year",
    "Turkey May Crude Imports down 11.0% On Year",
    "Japan June Crude Oil Imports decrease 10.9% On Yr",
    "Iran’s Feb Oil Exports +20.9% On Mo at 1.56M B/D - Official",
    "Apache announces large petroleum discovery in Philadelphia",
    "Turkey finds oil near Syria, Iraq border"
]
example_headlines = pd.DataFrame(headlines, columns=["Headline"])

config_path = './crude_bert_config.json'
model_path = './crude_bert_model.bin'

# Load the configuration
config = AutoConfig.from_pretrained(config_path)

# Create the model from the configuration
model = AutoModelForSequenceClassification.from_config(config)

# Load the model's state dictionary
state_dict = torch.load(model_path)

# Inspect the keys; if "bert.embeddings.position_ids" is unexpected, remove it
state_dict.pop("bert.embeddings.position_ids", None)

# Load the adjusted state dictionary into the model
model.load_state_dict(state_dict, strict=False)  # strict=False ignores non-critical mismatches

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Define the prediction function
def predict_to_df(texts, model, tokenizer):
    model.eval()
    data = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=64)
        with torch.no_grad():
            outputs = model(**inputs)
        logits = outputs.logits
        softmax_scores = torch.nn.functional.softmax(logits, dim=-1)
        pred_label_id = torch.argmax(softmax_scores, dim=-1).item()
        class_names = ['positive', 'negative', 'neutral']
        predicted_label = class_names[pred_label_id]
        data.append([text, predicted_label])
    df = pd.DataFrame(data, columns=["Headline", "Classification"])
    return df

# Apply classification to the example headlines
result_df = predict_to_df(example_headlines['Headline'].tolist(), model, tokenizer)
result_df
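
If you want to see exactly what strict=False skipped, load_state_dict returns the mismatched keys; an optional addition to the script above:

# Optional: report what strict=False ignored.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print('Missing keys:', missing)        # weights the model expected but the file lacks
print('Unexpected keys:', unexpected)  # entries in the file the model has no slot for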

Step three:

Execute the cells of the Jupyter Notebook.

If you face any difficulties or have other questions, contact me here or on LinkedIn.

FYI: I took the example headlines from one of our recent publications:
[image: the example headlines and their classifications, from the publication]

So, your classification output should reflect this as well.
