colbert-acl/index/metadata.json
{
"config": {
"query_token_id": "[unused0]",
"doc_token_id": "[unused1]",
"query_token": "[Q]",
"doc_token": "[D]",
"ncells": null,
"centroid_score_threshold": null,
"ndocs": null,
"load_index_with_mmap": false,
"index_path": "index",
"nbits": 2,
"kmeans_niters": 4,
"resume": false,
"similarity": "cosine",
"bsize": 64,
"accumsteps": 1,
"lr": 3e-6,
"maxsteps": 500000,
"save_every": null,
"warmup": null,
"warmup_bert": null,
"relu": false,
"nway": 2,
"use_ib_negatives": false,
"reranker": false,
"distillation_alpha": 1.0,
"ignore_scores": false,
"model_name": null,
"query_maxlen": 32,
"attend_to_mask_tokens": false,
"interaction": "colbert",
"dim": 128,
"doc_maxlen": 512,
"mask_punctuation": true,
"checkpoint": "colbert-ir\/colbertv2.0",
"triples": null,
"collection": [
"list with 67577 elements starting with...",
[
"In recent years, several end-to-end online translation systems have been proposed to successfully incorporate human post-editing feedback in the translation workflow. The performance of these systems in a multi-domain translation environment (involving different text genres, post-editing styles, machine translation systems) within the automatic post-editing (APE) task has not been thoroughly investigated yet. In this work, we show that when used in the APE framework the existing online systems are not robust towards domain changes in the incoming data stream. In particular, these systems lack in the capability to learn and use domain-specific post-editing rules from a pool of multi-domain data sets. To cope with this problem, we propose an online learning framework that generates more reliable translations with significantly better quality as compared with the existing online and batch systems. Our framework includes: i) an instance selection technique based on information retrieval that helps to build domain-specific APE systems, and ii) an optimization procedure to tune the feature weights of the log-linear model that allows the decoder to improve the post-editing quality.",
"We assessed how different machine translation (MT) systems affect the post-editing (PE) process and product of professional English\u2013Spanish translators. Our model found that for each 1-point increase in BLEU, there is a PE time decrease of 0.16 seconds per word, about 3-4%. The MT system with the lowest BLEU score produced the output that was post-edited to the lowest quality and with the highest PE effort, measured both in HTER and actual PE operations.",
"Computer-aided translation (CAT) tools often use a translation memory (TM) as the key resource to assist translators. A TM contains translation units (TU) which are made up of source and target language segments; translators use the target segments in the TU suggested by the CAT tool by converting them into the desired translation. Proposals from TMs could be made more useful by using techniques such as fuzzy-match repair (FMR) which modify words in the target segment corresponding to mismatches identified in the source segment. Modifications in the target segment are done by translating the mismatched source sub-segments using an external source of bilingual information (SBI) and applying the translations to the corresponding positions in the target segment. Several combinations of translated sub-segments can be applied to the target segment which can produce multiple repair candidates. We provide a formal algorithmic description of a method that is capable of using any SBI to generate all possible fuzzy-match repairs and perform an oracle evaluation on three different language pairs to ascertain the potential of the method to improve translation productivity. Using DGT-TM translation memories and the machine system Apertium as the single source to build repair operators in three different language pairs, we show that the best repaired fuzzy matches are consistently closer to reference translations than either machine-translated segments or unrepaired fuzzy matches."
]
],
"queries": null,
"index_name": "index",
"overwrite": false,
"root": "\/coc\/pskynet6\/dheineman3\/colbert-acl\/acl-search\/experiments",
"experiment": "notebook",
"index_root": null,
"name": "2024-08\/24\/01.35.22",
"rank": 0,
"nranks": 2,
"amp": true,
"gpus": 4
},
"num_chunks": 3,
"num_partitions": 32768,
"num_embeddings": 12298798,
"avg_doclen": 181.996803646211
}
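
For context, below is a minimal sketch of how an index carrying this metadata is typically loaded and queried with the ColBERT library (pip install colbert-ai, from stanford-futuredata/ColBERT). The index name ("index") and experiment ("notebook") come from the config above; the root path, query string, and k are illustrative placeholders, not values taken from this file.

    # Sketch: load the index described by this metadata and run a search.
    # Assumes the colbert-ai package; paths below are placeholders.
    from colbert import Searcher
    from colbert.infra import Run, RunConfig, ColBERTConfig

    with Run().context(RunConfig(nranks=1, experiment="notebook")):
        # root should point at the experiments directory recorded
        # under "root" in the config above.
        config = ColBERTConfig(root="/path/to/experiments")
        searcher = Searcher(index="index", config=config)

        # search() returns parallel lists of passage ids, ranks, and scores.
        pids, ranks, scores = searcher.search(
            "online learning for automatic post-editing", k=3
        )
        for pid, rank, score in zip(pids, ranks, scores):
            print(f"[{rank}] {score:.2f} {searcher.collection[pid][:80]}")

As a sanity check on the trailing statistics: 12298798 embeddings over the 67577 collection passages gives 12298798 / 67577 ≈ 181.9968 tokens per passage, matching avg_doclen, and num_partitions (32768 = 2^15) is the number of k-means centroids used for the nbits = 2 residual compression.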