---
license: apache-2.0
arxiv: 2001.00059
pipeline_tag: fill-mask
tags:
- code
- cubert
---

# CuBERT: Learning and Evaluating Contextual Embedding of Source Code

## Overview

This model is the unofficial HuggingFace version of "[CuBERT](https://github.com/google-research/google-research/tree/master/cubert)". In particular, this version comes from [gs://cubert/20210711_Python/pre_trained_model_epochs_2__length_512](https://console.cloud.google.com/storage/browser/cubert/20210711_Python/pre_trained_model_epochs_2__length_512). It was trained on 2021-07-11 for 2 epochs with a 512-token context window on the Python BigQuery dataset.

I manually converted the TensorFlow checkpoint to PyTorch and have uploaded it here. The [tokenizer](https://github.com/google-research/google-research/blob/master/cubert/python_tokenizer.py) has not been converted yet.

All credit goes to Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi.

Citation:

```bibtex
@inproceedings{cubert,
  author    = {Aditya Kanade and Petros Maniatis and Gogul Balakrishnan and Kensen Shi},
  title     = {Learning and evaluating contextual embedding of source code},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning, {ICML} 2020, 12-18 July 2020},
  series    = {Proceedings of Machine Learning Research},
  publisher = {{PMLR}},
  year      = {2020},
}
```
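
Since the tokenizer has not been converted, inputs have to be tokenized with the original CuBERT tokenizer and then fitted to the 512-token context window by hand. The following is a minimal sketch of that last step, not part of the original release. It assumes BERT-style inputs (a leading and a trailing special token, right padding, and an attention mask); the function name `fit_to_window` and the special-token ids are hypothetical placeholders, and the real ids should be taken from the CuBERT vocabulary file.

```python
from typing import List, Tuple

MAX_LEN = 512  # context window stated in this model card


def fit_to_window(
    token_ids: List[int],
    cls_id: int,
    sep_id: int,
    pad_id: int,
    max_len: int = MAX_LEN,
) -> Tuple[List[int], List[int]]:
    """Wrap, truncate, and right-pad token ids to exactly max_len positions."""
    if max_len < 2:
        raise ValueError("max_len must leave room for the two special tokens")

    # Reserve two positions for the leading and trailing special tokens.
    body = token_ids[: max_len - 2]
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)

    # Pad on the right; padded positions get attention mask 0.
    padding = max_len - len(input_ids)
    input_ids += [pad_id] * padding
    attention_mask += [0] * padding
    return input_ids, attention_mask


# Hypothetical ids, for illustration only; they are NOT read from the
# actual CuBERT vocabulary.
CLS_ID, SEP_ID, PAD_ID = 2, 3, 0

ids, mask = fit_to_window([10, 11, 12], CLS_ID, SEP_ID, PAD_ID)
print(len(ids), ids[:6], sum(mask))
```

The two returned lists can then be wrapped in tensors and passed as `input_ids` and `attention_mask` to whichever masked-language-model class is used to load the converted checkpoint.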