---
tags:
- learned sparse
- transformers
- retrieval
- passage-retrieval
- document-expansion
- bag-of-words
license: apache-2.0
language: en
base_model:
- atomic-canyon/fermi-bert-1024
---



# fermi-1024: Sparse Retrieval Model for Nuclear Power

This **sparse retrieval model** is optimized for nuclear-specific applications. It encodes both queries and documents into high-dimensional sparse vectors, where the non-zero dimensions correspond to specific tokens in the vocabulary and their values indicate the relative importance of those tokens.

The vocabulary, and thus the sparse embeddings, are based on a nuclear-specific tokenizer. For example, terms like "NRC" are represented as single tokens rather than being split into multiple tokens. This approach improves both accuracy and efficiency. To achieve this, we trained a nuclear-specific [BERT base model](https://huggingface.co/atomic-canyon/fermi-bert-1024).

### Specifications

- **Developed by:** [Atomic Canyon](https://atomic-canyon.com/)
- **Finetuned from model:** [fermi-bert-1024](https://huggingface.co/atomic-canyon/fermi-bert-1024)
- **Context Length:** 1024
- **Vocab Size:** 30522
- **License:** `Apache 2.0`

## Training

`fermi-1024` was trained on the [MS MARCO Passage Dataset](https://microsoft.github.io/msmarco/) using the [LSR framework](https://github.com/thongnt99/learned-sparse-retrieval), with [ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) as the teacher model. Training ran on the Oak Ridge National Laboratory [Frontier supercomputer](https://www.olcf.ornl.gov/frontier/) using AMD MI250X GPUs.

## Evaluation

The sparse embedding model was primarily evaluated for its effectiveness in information retrieval within the nuclear energy domain. Due to the absence of domain-specific benchmarks, we developed [FermiBench](https://huggingface.co/datasets/atomic-canyon/FermiBench) to assess the model's performance on nuclear-related texts. In addition, the model was tested on the MS MARCO dev split and the BEIR benchmark to ensure broader applicability. The model demonstrates strong retrieval capabilities, particularly on nuclear-specific jargon and documents.
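For readers unfamiliar with the quality metric reported in this evaluation, the following is a minimal sketch of NDCG@10 for a single query. It assumes the common linear-gain formulation; the function names and toy relevance judgments are hypothetical, and the released benchmark tooling may differ in details such as tie handling.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: document ids in the order returned by the retriever.
    qrels: dict mapping document id -> graded relevance (missing ids count as 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)  # best possible ordering
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical toy judgments: d1 is highly relevant, d2 is partially relevant.
qrels = {"d1": 2, "d2": 1}
perfect = ndcg_at_k(["d1", "d2", "d3"], qrels)  # relevant documents ranked first
swapped = ndcg_at_k(["d3", "d2", "d1"], qrels)  # best document ranked last
print(perfect, swapped)
```

A benchmark score is then the mean of this value over all queries; a ranking that places the most relevant documents first scores 1.0, and any other ordering scores lower.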

Although there are standard benchmarks and tooling for evaluating dense embedding models, we found no open, standardized tooling for evaluating sparse embedding models. To support the community, we are [releasing our benchmark tooling](https://github.com/atomic-canyon/fermi), built on top of [BEIR](https://github.com/beir-cellar/beir) and [pyserini](https://github.com/castorini/pyserini). All evaluation numbers were produced with that tool and should therefore be reproducible.

| Model                             | FermiBench NDCG@10 | FermiBench FLOPS | MS MARCO Dev NDCG@10 | BEIR* NDCG@10 | BEIR* FLOPS |
| --------------------------------- | ------------------ | ---------------- | -------------------- | ------------- | ----------- |
| fermi-512                         | 0.74               | 7.07             | 0.45                 | 0.46          | 9.14        |
| fermi-1024                        | 0.72               | 4.75             | 0.44                 | 0.46          | 7.5         |
| splade-cocondenser-ensembledistil | 0.64               | 12.9             | 0.45                 | 0.46          | 12.4        |

\* The BEIR results use a subset of the benchmark: trec-covid, nfcorpus, arguana, scidocs, and scifact.

### Efficiency

Given the massive scale of documentation in nuclear energy, efficiency is crucial. Our model addresses this in several ways:

- With a 1024-token context, the model needs half as many embeddings to cover a long document as a 512-token model such as fermi-512, significantly lowering computational costs.
- The custom tokenizer, designed for nuclear-specific jargon, encodes documents and queries using fewer tokens, improving computational efficiency.
- Additionally, our models produce sparser vectors, reducing FLOPS (see the table above) and, as a secondary benefit, lowering storage requirements for indexing.

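To make the link between sparsity and the FLOPS columns concrete, here is a minimal sketch of one way such a number can be estimated. It assumes the SPLADE-style definition (the expected number of multiplications in a query-document dot product, estimated from how often each token is active in queries and in documents). This definition, the function names, and the toy vectors are assumptions for illustration; the released benchmark tooling may compute the metric differently.

```python
from collections import Counter


def activation_rates(sparse_vectors):
    """Fraction of vectors in which each token has a non-zero weight."""
    counts = Counter()
    for vec in sparse_vectors:
        counts.update(token for token, weight in vec.items() if weight != 0)
    n = len(sparse_vectors)
    return {token: count / n for token, count in counts.items()}


def estimate_flops(query_vectors, document_vectors):
    """Expected number of multiplications per query-document dot product.

    Assumed SPLADE-style estimate: sum over tokens of
    P(token active in a query) * P(token active in a document).
    """
    q_rates = activation_rates(query_vectors)
    d_rates = activation_rates(document_vectors)
    return sum(rate * d_rates.get(token, 0.0) for token, rate in q_rates.items())


# Hypothetical toy vectors in {token: weight} form.
queries = [{"heat": 1.2, "load": 0.8}, {"fuel": 1.0, "load": 0.5}]
documents = [
    {"heat": 0.9, "load": 0.7, "fuel": 0.3},
    {"fuel": 1.1, "storage": 0.6},
]
flops = estimate_flops(queries, documents)
print(flops)  # 1.25
```

The `{token: weight}` dictionaries have the same shape as the output of `transform_sparse_vector_to_dict` in the Usage section. Under this estimate, vectors with fewer active tokens yield fewer multiplications per scored pair, which is why sparser outputs lower both scoring cost and index size.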
## Usage

```python
import itertools
import torch

from transformers import AutoModelForMaskedLM, AutoTokenizer


# get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
def get_sparse_vector(feature, output):
    values, _ = torch.max(output * feature["attention_mask"].unsqueeze(-1), dim=1)
    values = torch.log(1 + torch.relu(values))
    values[:, special_token_ids] = 0
    return values


# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices, token_indices = torch.nonzero(sparse_vector, as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices, token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0] + number_of_tokens_for_each_sample))
    for i in range(len(end_idxs) - 1):
        token_strings = tokens[end_idxs[i]:end_idxs[i + 1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i + 1]]
        output.append(dict(zip(token_strings, weights)))
    return output


# load the model
model = AutoModelForMaskedLM.from_pretrained("atomic-canyon/fermi-1024")
tokenizer = AutoTokenizer.from_pretrained("atomic-canyon/fermi-1024")

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
id_to_token = [""] * tokenizer.vocab_size
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token

query = "What is the maximum heat load per spent fuel assembly for the EOS-37PTH?"
document = "For the EOS-37PTH DSC, add two new heat load zone configurations (HLZCs) for the EOS37PTH for higher heat load assemblies, up to 3.5 kW/assembly, that also allow for damaged and failed fuel storage."

# encode the query & document
feature = tokenizer([query, document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
output = model(**feature)[0]
sparse_vector = get_sparse_vector(feature, output)

# get similarity score
sim_score = torch.matmul(sparse_vector[0], sparse_vector[1])
print(sim_score)

query_token_weight, document_query_token_weight = transform_sparse_vector_to_dict(sparse_vector)
for token in sorted(query_token_weight, key=lambda x: query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s" % (query_token_weight[token], document_query_token_weight[token], token))
```

# Acknowledgement

This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.