Congratulations on your new Huggingface Repo!
I’ve been very excited about your CBETA translation project since I learned of it from your announcement posted in the FB Digital Humanities group a few months ago, and I have been looking forward to seeing it up on HuggingFace. I’ve already looked for it a few times and didn’t see it, but figured I might have missed it somehow. But I looked today and there it was!
I figured it would be some form of encoder-decoder but I wasn’t sure what kind. Now I see it’s MBART – seems like an excellent choice!
Is this discussion page a good place to ask questions, or would you recommend a better one?
Hi John,
Thank you very much for your message! Indeed we uploaded the model just today, since there have been (and still are) some issues with the model behavior. Most of them are sorted out now; some tasks remain to be solved in the future.
Sure, feel free to ask any questions that you might have here!
Best,
Sebastian
Hi Sebastian,
It’s great to know you welcome questions. Here’s a couple from me. :)
I’m just starting to explore translation with transformers, and I think of the original encoder-decoder architecture of the 2017 “Attention Is All You Need” paper as the ultimate baseline. Do you know how MBART’s performance compares to it when trained on the same sentence-pair dataset?
Also, do you know if it’s possible to use non-sentence-pair as well as sentence-pair training data to improve the modeling of the source language in the encoder and of the target language in the decoder? Maybe in separate pre-training and fine-tuning steps?
-- John
As to my second question, about using unpaired data to supplement paired data, the MBART paper makes it clear that training MBART is a two-step process consisting of (1) pre-training on a multilingual set of unpaired language-tagged sentences in autoencoder fashion, followed by (2) fine-tuning on a set of paired language-tagged sentences, as with the original 2017 encoder-decoder. The authors note that the improvement gotten from pre-training goes down as the number of paired sentences gets larger, and that a large enough paired set can ultimately “wash out” the effects of the pre-training, but the improvement is usually significant in the typical use case where the amount of paired data is modest in comparison with that of the unpaired data.
As to my first question, about the performance of MBART compared to the original transformer, the authors don’t seem to address it directly. But they do use as one of their baselines an MBART model initialized with random parameters and not changed by pre-training, which they call Random, and then fine-tuned with paired data. The performance of the Random model starts out at 0 but goes up with fine-tuning, and with enough paired data it can get into the same performance range as a real pre-trained model with the same amount of fine-tuning and ultimately match it. I’m thinking that their Random MBART model is essentially equivalent to an original transformer model. If so, the original model would show similar performance (plus or minus other improvements in MBART) to the Random model. But of more practical significance, it means that an original transformer model trained with a given modest set of source-target paired data will not perform as well as an MBART model pre-trained with unpaired data for the source and target languages and fine-tuned with the same paired data as the original model.
I’m not sure I have all this right, but given that you created your Linguae Dharmae model using MBART, I’m curious: did you include unpaired Buddhist Classical Chinese and Buddhist translation English (or any Chinese and any English data for that matter) beyond what’s in the paired data in your multilingual pre-training data? I understand that you may not want to talk about your system at this level of detail in this venue, but I’d be interested to hear anything you have to say on this topic.
-- John
A P.S. to my previous message. I realize after looking at your repo and the linked HuggingFace docs that the most likely answer as to what pre-training data you used is the data embodied in the already pretrained HF MBART model (or more likely the MBART-50 model, to judge from your tokenizer config), and that you did not need to do any pre-training at all. (Though maybe you could have done an additional round of pre-training, adding Buddhist or other Classical Chinese texts to the zh_CN data in the model. The fact that you are depending on the pre-trained zh_CN = PRC Chinese data for your Chinese modeling fits with the instruction in your README.md to make sure the Chinese input data you pass to the inference engine is in Simplified = PRC spelling.)
Hi John,
Regarding the question of the performance of a vanilla transformer vs. pretrained encoder-decoder models such as MT5 or MBART, the consensus here is that for low-resource languages (<1M sentence pairs of training data), the pretrained models can perform better (it always depends on the individual case, of course). Once enough training data is available, the difference diminishes.
To my understanding as well, an MBART model initialized with random parameters is essentially just a Transformer.
The current Linguae Dharmae Chinese->English model has not seen any additional monolingual pretraining for Chinese, English or any other language, but we are working on this.
Indeed we finetuned the MBART-50-many-to-one model. We are preparing a detailed paper where we will compare all the different settings, parameters etc. that we have tried so far.
And yes we use simplified Chinese since that is what the tokenizers of the pretrained models generally use.
Hi Sebastian,
Apologies for not seeing your comment and responding to it sooner. Thanks for sanity checking my previous suppositions and for sharing information about your training.
I am excited to hear that you and your team are working on the paper you describe and look forward to reading it when it comes out. I hope you will consider putting it in an accessible online place and including a link to it in your model card.
I’m very interested in the potential of Deep Learning to play a significant role in Digital Humanities, especially in fields in which texts are abundant and a major focus of study. Your Linguae Dharmae project belongs to two such fields, Sinology and Buddhist Studies, and is probably the first or second practical Deep Learning-based NLP application in both fields. (I believe some systems now used for Chinese OCR and possibly also character component analysis are also Deep Learning-based, though they may not use transformers.)
But IMO your project can also claim a place in the field of AI research, on the grounds that you, as “domain experts”, are able to evaluate the performance of your system at a much more sophisticated level than the dominant practice within that field, which mostly involves running standard benchmarks with standard tasks and datasets and evaluating the results with standard metrics (in your case BLEU).
You have already identified recurring problems (for example relating to personal and place names), and specific changes that improve performance. Even if you stay with your current architecture and limit your experiments to changes in parameters and training data, your findings should be of interest to AI researchers, and could even inspire them to make architectural changes that will give better results “out of the box”. Or you could move on to experimenting with changes in the architecture and maybe find some good ones yourselves and pass them on to the AI research community.
An area where this seems especially likely to me is vocabulary control, both in single-language models and across all the languages in a multi-language model – or maybe just a selection of them – all at once. One possibility I can imagine is a new fine-tuning head for a large pre-trained single- or multi-language model body. It could be a self-standing head that could be run on the original pre-trained model to produce a new version of the pre-trained model that the existing fine-tuning head could then be run on, or it could be prepended to the existing fine-tuning head, which in its modified form could be run directly on the original pre-trained model.
I’m not looking for any answers here, but I’d be interested in any comments you might have.
Hi John,
Thank you very much for your encouraging remarks! Indeed, in some applications such as OCR, deep learning techniques have been used for a while on Chinese data as well. Also the current BLOOM model is getting very good at Literary and Buddhist Chinese content (just ask it a question such as '第八識謂' for a demonstration): https://huggingface.co/bigscience/bloom
So yes, in the coming years we will very likely see big changes to how we apply deep learning techniques to Chinese source material.
I hope we can make a contribution to AI research as such, let's see! Your thoughts on how to tackle vocabulary control are very interesting. I think this is perhaps one of the biggest weaknesses of the current neural-based models: as impressive as they are on translation tasks, the unpredictability of named entity translation is an issue that seems difficult to resolve even with bigger amounts of training data. We will see what we can achieve here... It's a bit early yet to propose changes in the architecture, as we don't even have the 'baseline' for Buddhist NMT settled, but the ideas you gave are certainly worthwhile to explore at a point further down the road.
This is all very exciting indeed!
Hi Sebastian,
I wasn’t aware of the BigScience BLOOM project, thanks for bringing it to my attention. But so far I haven’t noticed any instances of applying big multilingual causal LM decoders (GPT-type models) to translation or related tasks such as cross-language summarization. Also, unless I missed it, there don’t seem to be any ways to fine-tune them after (pre-)training such as there are with multilingual encoders (mBERT etc.) and encoder-decoder combos (mBART etc.). Yes, you can do another round of training the base model with additional data pertinent to your task, but otherwise all you have is one- or few-shot learning via prompts.
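To make "few-shot learning via prompts" concrete, here is a minimal sketch of how a translation prompt for a causal LM might be assembled. The prompt layout, the `build_prompt` helper, and the example sentences are all assumptions for illustration; none of them is a documented convention of BLOOM or any other model, and the sketch only builds the text, it does not call a model.

```python
def build_prompt(examples, source, src_lang="Chinese", tgt_lang="English"):
    """Assemble a few-shot translation prompt (hypothetical layout).

    examples: list of (source_sentence, target_sentence) demonstrations.
    source:   the sentence the model is asked to translate.
    """
    lines = []
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
        lines.append("")  # blank line between demonstrations
    # The final block is left open so the model continues with the translation.
    lines.append(f"{src_lang}: {source}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


# Illustrative demonstrations only, not taken from any training set.
demos = [
    ("如是我聞", "Thus have I heard"),
    ("色即是空", "Form is emptiness"),
]
prompt = build_prompt(demos, "諸行無常")
print(prompt)
```

The resulting string would be passed to the model as-is, and whatever the model generates after the last "English:" would be read as its translation.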
Big LMs can already do surprisingly well on NLP tasks classed as NLU (as in GLUE and SuperGLUE), and no doubt will keep getting better - and this includes the multilingual models, which seem to exhibit significant transfer between languages. Again I may have missed something, but as far as I know their ability to do translation has not yet been demonstrated. However, there is one very big LM that is trained on equal, very large, amounts of English and Chinese text, namely Beijing Academy of Artificial Intelligence (BAAI)’s 悟道 Wu Dao. I’ve only read a little about it and I saw no mention of translation, but that doesn’t mean it can’t do it. And all it would take to find out is to access the public API and try a prompt or two. :) But even if it can’t, it is still possible that the ability will “emerge” as the models get bigger and become more balanced in their language coverage.
Re what I call vocabulary control, I was browsing through the papers in Proceedings of the Sixth Conference on Machine Translation (2021), all of which are linked (with pdf download options) here:
https://aclanthology.org/volumes/2021.wmt-1/
and I discovered a shared task in the Findings of the 2021 Conference on Machine Translation (WMT21) called “machine translation using terminologies”. It is treated in a separate document, Alam et al., 2021, “Findings of the WMT Shared Task on Machine Translation Using Terminologies” (link in Proceedings). The domain was COVID-19, a newcomer (as we all remember) to the general biomedical field.
The provided training data included, for each language pair, parallel terminology lists and a substantial number of source and target sentence pairs annotated identically with the source and target terms. Most of the entered systems used their own annotations of the training data (which they could derive from the provided ones) to “force” the desired target terms into the translations, and got positive results from doing so. But as the authors of the findings observe in their discussion, annotation is a rigid approach as well as a sometimes brittle one (especially in the face of non-trivial morphology), and while it may be appropriate in cases in which a completely consistent mapping of terms is required, it may not be the best choice for general translation tasks.
But one system, from Huawei (Wang et al., 2021, “Huawei aarc’s submission to the wmt21 biomedical task: domain adaptation as a practical task”), is different. It makes no use of annotations and does inference on “raw” source inputs, no differently from any typical translation system. Yet it scores highest of the English-Chinese task entries on all criteria. And the reason is clear: the base system has already been adapted to the general biomedical domain, including COVID-19. In fact, as far as I can tell, the system used for the terminologies task is identical to that developed for the biomedical task, and did not use the terminologies data for anything except final testing. (Note that it was disqualified as the official winner because it doesn’t limit itself to the permitted range of training data.)
But the Huawei system does incorporate cross-language terminology, including COVID-19 terminology, in an interesting indirect way. It is called Dictionary-Based Data Augmentation (DDA) and is described in detail in Wei Peng et al., 2020, “Dictionary-based Data Augmentation for Cross-Domain Neural Machine Translation” (https://arxiv.org/abs/2004.02577). The basic idea is to generate source-target sentence pairs (actually lots of them, including multiples for the individual word pairs) that contain the source words and target words in parallel contexts appropriate for the words. The generated sentence pairs are effectively translations and, as such, can be used for fine-tuning a base generic multilingual model, no differently than whatever “real” sentence pairs are available.
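As a rough illustration of the general idea only, the sketch below expands a small term dictionary into synthetic sentence pairs by slotting each term pair into parallel templates. This template approach is a deliberately simplified stand-in and is not the generation method of the DDA paper; the `TERMS` entries, the `TEMPLATES`, and the `expand` helper are all hypothetical.

```python
import random

# Toy dictionary entries (illustrative only): source term, target term, category.
TERMS = [
    ("舍利弗", "Śāriputra", "person"),
    ("王舍城", "Rājagṛha", "place"),
]

# Hypothetical parallel templates per category; "{term}" marks the slot.
TEMPLATES = {
    "person": [
        ("爾時{term}白佛言", "At that time {term} said to the Buddha"),
        ("{term}從座而起", "{term} rose from his seat"),
    ],
    "place": [
        ("佛在{term}", "The Buddha was staying at {term}"),
    ],
}


def expand(terms, templates, pairs_per_term=2, seed=0):
    """Generate up to pairs_per_term synthetic sentence pairs for each term."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pairs = []
    for src_term, tgt_term, category in terms:
        options = templates.get(category, [])
        k = min(pairs_per_term, len(options))
        # Pick k distinct templates and fill the same slot on both sides.
        for src_tpl, tgt_tpl in rng.sample(options, k):
            pairs.append((src_tpl.format(term=src_term),
                          tgt_tpl.format(term=tgt_term)))
    return pairs


synthetic_pairs = expand(TERMS, TEMPLATES)
for src, tgt in synthetic_pairs:
    print(src, "=>", tgt)
```

The two knobs that matter for experimentation are how the contexts are formed (here, fixed templates) and how many pairs are generated per term (`pairs_per_term`).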
You may already be familiar with this concept, but if not, I thought you might be interested in it. I’m sure that, as a very mature field, Buddhist Studies must have its share of onomastic (or as we would now say, named entity) resources covering people (or rather personal beings, including both human and non-human, mortal and divine), places, and unique or otherwise special things (all cases including both historical and mythological), as well as resources focused on religious concepts and terms. Plus there will inevitably be indices or concordances to the sutras and other foundational texts. And at least some of these resources are bound to be available in digital form.
If the situation is more or less as I imagine it, it should be possible to get Chinese-English lists of various categories of named entities, either directly from already existing sources or by linking through Sanskrit. And, once you select what you want (say, only items occurring in the sutras, or maybe a subset of that), you have your dictionary of terms, with basic categorical information as a bonus. And with that you can start trying out ways to expand them into fabricated sentence pairs to use for additional fine-tuning.
A simple metric of how well a given model is doing on named entities is to get a translation of some sample of your to-be-translated dataset, collect the output sentence pairs that contain at least one dictionary term either on the source side or the target side, and count how many paired terms appear on both sides (good) vs. mismatched terms or no term at all on one side (not good). Using this metric you can do a new translation of your chosen sample, compare it to the earlier translation, and see how the performance (good – not good) changed. And if the results look at all promising, you can start experimenting with your expansion method (including both sentence formation and the number of sentence pairs to be generated) and go on from there.
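The counting described above could be sketched roughly as follows. It assumes the dictionary maps each source term to a single target rendering and that naive, case-insensitive substring matching is good enough, which it will not be for terms with several accepted renderings or inflected forms. The `term_score` helper and the sample data are hypothetical.

```python
def term_score(pairs, dictionary):
    """Count matched and unmatched dictionary terms in (source, output) pairs.

    good: the source term and its target rendering both appear.
    bad:  only one of the two appears (mismatch or missing term).
    """
    good = bad = 0
    for source, output in pairs:
        output_lower = output.lower()
        for src_term, tgt_term in dictionary.items():
            in_source = src_term in source
            in_output = tgt_term.lower() in output_lower
            if in_source and in_output:
                good += 1
            elif in_source or in_output:
                bad += 1
            # Pairs containing neither side of a term do not count for it.
    return good, bad


# Illustrative data only: a toy dictionary and a made-up translation sample.
dictionary = {"舍利弗": "Śāriputra", "王舍城": "Rājagṛha"}
sample = [
    ("爾時舍利弗白佛言", "At that time Śāriputra said to the Buddha"),
    ("佛在王舍城", "The Buddha was in the royal city"),  # term not rendered
    ("如是我聞", "Thus have I heard"),                    # no dictionary term
]
good, bad = term_score(sample, dictionary)
print(f"good={good} bad={bad} net={good - bad}")
```

Running the same function over the output of an earlier and a later model on the same sample gives the two net scores to compare.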
Hi John,
Thank you very much for all this input! I am a bit overwhelmed (to be honest), but I will see if we can digest this in the coming months.
Best, Sebastian
Hi Sebastian,
It's a tricky problem, I wish you luck with it. By a strange coincidence I'm now grappling with a similar problem, and am feeling a bit overwhelmed myself, despite the research I did a few weeks ago.
-- John