gbpatentdata/lt-patent-inventor-linking

This is a LinkTransformer model. At its core this model this is a sentence transformer model sentence-transformers model - it just wraps around the class. This model has been fine-tuned on the model: sentence-transformers/all-mpnet-base-v2. It is pretrained for the language: en.

Usage (Sentence-Transformers)

To use this model using sentence-transformers:

from sentence_transformers import SentenceTransformer

# load 
model = SentenceTransformer("matthewleechen/lt-patent-inventor-linking")

Usage (LinkTransformer)

To use this model for clustering with LinkTransformer installed:

import linktransformer as lt
import pandas as pd

df_lm_matched = lt.cluster_rows(df, # df should be a dataset of unique patent-inventors 
                                model='matthewleechen/lt-patent-inventor-linking',
                                on=['name', 'occupation', 'year', 'address', 'firm', 'patent_title'], # cluster on these variables 
                                cluster_type='SLINK', # use SLINK algorithm 
                                cluster_params={ # default params 
                                    'threshold': 0.1,
                                    'min cluster size': 1,
                                    'metric': 'cosine'
                                }
    )
)

Evaluation

We evaluate using the standard LinkTransformer information retrieval metrics. Our test set evaluations are available here.

Training

The model was trained with the parameters:

DataLoader:

torch.utils.data.dataloader.DataLoader of length 31 with parameters:

{'batch_size': 64, 'sampler': 'torch.utils.data.dataloader._InfiniteConstantSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

linktransformer.modified_sbert.losses.SupConLoss_wandb

Parameters of the fit()-Method:

{
    "epochs": 100,
    "evaluation_steps": 16,
    "evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 3100,
    "weight_decay": 0.01
}
LinkTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Normalize()
)

Citation

If you use our model or custom training/evaluation data in your research, please cite our accompanying paper as follows:

@article{bct2025,
  title = {300 Years of British Patents},
  author = {Enrico Berkes and Matthew Lee Chen and Matteo Tranchero},
  journal = {arXiv preprint arXiv:2401.12345},
  year = {2025},
  url = {https://arxiv.org/abs/2401.12345}
}

Please also cite the original LinkTransformer authors:

@misc{arora2023linktransformer,
                  title={LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models},
                  author={Abhishek Arora and Melissa Dell},
                  year={2023},
                  eprint={2309.00789},
                  archivePrefix={arXiv},
                  primaryClass={cs.CL}
                }
Downloads last month
4
Safetensors
Model size
109M params
Tensor type
F32
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.