arxiv:2310.14282

NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval

Published on Oct 22, 2023

Abstract

Recognizing entities in texts is a central need in many information-seeking scenarios, and indeed, Named Entity Recognition (NER) is arguably one of the most successful examples of a widely adopted NLP task and corresponding NLP technology. Recent advances in large language models (LLMs) appear to provide effective solutions (also) for NER tasks that were traditionally handled with dedicated models, often matching or surpassing the abilities of the dedicated models. Should NER be considered a solved problem? We argue to the contrary: the capabilities provided by LLMs are not the end of NER research, but rather an exciting beginning. They allow taking NER to the next level, tackling increasingly more useful, and increasingly more challenging, variants. We present three variants of the NER task, together with a dataset to support them. The first is a move towards more fine-grained -- and intersectional -- entity types. The second is a move towards zero-shot recognition and extraction of these fine-grained types based on entity-type labels. The third, and most challenging, is the move from the recognition setup to a novel retrieval setup, where the query is a zero-shot entity type, and the expected result is all the sentences from a large, pre-indexed corpus that contain entities of these types, and their corresponding spans. We show that all of these are far from being solved. We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types, to facilitate research towards all of these three goals.

Community

Dataset is now available here: https://github.com/katzurik/NERetrieve

Nice! It would be cool to have it on the Hub as well, cc @davanstrien

/cc @tomaarsen Could be very interesting for new SpanMarker models.

500 entity types 🤯
I'm a tad busy today, so I'm unsure when I'll have time for this, but it would be awesome to get a model out there for this, on each of the 3 (?) "levels".

There's also another challenge:

I had a look at their first task here.

The corpus has the following format (NERetrive_IR_corpus.jsonl):

{"id": "XAZbWYcB1INCf0Uy5i5T", 
"doc_id": "International Laser Display Association-3",
"para_index": 3,
"title": "International Laser Display Association",
"content": "In April 2006, the ILDA lobbied for the rights to sell lasers for informational display and/or entertainment. This requires persuading the Food and Drug Administration's Center for Devices and Radiological Health (CDRH; www.FDA.gov/cdrh) to update the administrative process and ensure laser light shows and displays are safer for the public."}

Train file (NERetrive_IR_train.jsonl):

{"id": "XAZbWYcB1INCf0Uy5i5T", 
"doc_id": "International Laser Display Association-3", 
"title": "International Laser Display Association", 
"page_id": "2142294", 
"para_index": 3, 
"tagged_entities": {"Trade association": {"International_Laser_Display_Association": {"ilda": [[4]]}}}, 
"tagged_entity_types": ["Trade association"]}

So one would need to read in both files and join them on "id" to construct a CoNLL-like training file.
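A minimal sketch of that join, assuming each inner list in "tagged_entities" (such as [4]) holds whitespace-token indices of one mention and that a multi-token mention covers the range from its smallest to its largest index. That reading is inferred from the single example above, not confirmed by the dataset documentation. The helper names (read_jsonl, to_bio, build_conll) and the B-/I- label scheme are illustrative choices.

```python
import json


def read_jsonl(path):
    """Load a .jsonl file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def to_bio(content, tagged_entities):
    """Turn one paragraph and its annotations into (token, tag) pairs.

    ASSUMPTION: each span is a list of whitespace-token indices and a
    mention covers min(span)..max(span) inclusive.
    """
    tokens = content.split()
    tags = ["O"] * len(tokens)
    for entity_type, entities in tagged_entities.items():
        label = entity_type.replace(" ", "_")
        for mentions in entities.values():
            for spans in mentions.values():
                for span in spans:
                    start, end = min(span), max(span)
                    if end >= len(tokens):
                        continue  # index does not fit this paragraph
                    for i in range(start, end + 1):
                        tags[i] = ("B-" if i == start else "I-") + label
    return list(zip(tokens, tags))


def build_conll(corpus_rows, train_rows):
    """Join train rows to corpus rows on "id" and yield tagged paragraphs."""
    content_by_id = {row["id"]: row["content"] for row in corpus_rows}
    for row in train_rows:
        content = content_by_id.get(row["id"])
        if content is None:
            continue  # no matching paragraph in the corpus file
        yield row["id"], to_bio(content, row["tagged_entities"])
```

Usage would look like `build_conll(read_jsonl("NERetrive_IR_corpus.jsonl"), read_jsonl("NERetrive_IR_train.jsonl"))`. For the full corpus it may be preferable to stream the corpus file rather than hold every paragraph in memory.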

More challenges:

The named entity "ILDA" has index 4 in the paragraph when it is split on whitespace. But for NER we usually expect a properly tokenized corpus, where punctuation is separated from words. So one would need to tokenize the corpus and re-map each whitespace index to the matching token in the new tokenization. Or do we leave the corpus as it is?
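One possible way to do the re-mapping is to go through character offsets: find the character span of the whitespace token, then pick the finer-grained tokens that overlap it. The sketch below uses a simple regex tokenizer from the standard library as a stand-in for a real NER tokenizer. The function names and the tokenization rule are assumptions made for illustration.

```python
import re


def whitespace_spans(text):
    """Character (start, end) offsets of each whitespace-delimited token."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def fine_tokens(text):
    """Stand-in tokenizer: runs of word characters, or single punctuation marks."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def remap(text, ws_indices):
    """Map whitespace-token indices to indices in the finer tokenization."""
    ws = whitespace_spans(text)
    start = ws[min(ws_indices)][0]
    end = ws[max(ws_indices)][1]
    return [i for i, (_, s, e) in enumerate(fine_tokens(text))
            if s < end and e > start]


text = "In April 2006, the ILDA lobbied"
print(remap(text, [4]))  # [5], because "2006," splits into "2006" and ","
```

Punctuation attached to a mention in the whitespace split (for example "ILDA," at the end of a clause) would be included in the re-mapped span, so a follow-up step that trims the span against the annotated surface form may be needed.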

Hey! I am one of the authors of this paper,
I was not aware of the discussion here (:

@stefan-it
Is there any way I can help you? You can also contact me at [email protected]
