---
license: mit
language:
  - en
base_model:
  - microsoft/codebert-base-mlm
pipeline_tag: sentence-similarity
tags:
  - smart-contract
  - web3
  - software-engineering
  - embedding
  - codebert
---

# SmartBERT V3 CodeBERT

![SmartBERT](./framework.png)

## Overview

**SmartBERT V3** is a pre-trained programming language model, initialized with **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**. It continues training from [SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2) on an additional **64,000** smart contracts to make its representations of smart contract code more robust at the _function_ level.

- **Training Data:** **80,000** smart contracts in total: **16,000** from [SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2) plus **64,000** new contracts (starting from 30001).
- **Hardware:** 2 NVIDIA A100 80G GPUs.
- **Training Duration:** Over 30 hours.
- **Evaluation Data:** **1,500** smart contracts (starting from 96425).

## Preprocessing

All newline (`\n`) and tab (`\t`) characters in the _function_ code were replaced with a single space to keep the input format consistent.

## Base Model

- **Original Model**: [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)

## Training Setup

```python
from transformers import TrainingArguments

# OUTPUT_DIR and checkpoint are defined elsewhere in the training script.
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=64,
    save_steps=10000,            # write a checkpoint every 10,000 steps
    save_total_limit=2,          # keep only the two most recent checkpoints
    evaluation_strategy="steps",
    eval_steps=10000,            # evaluate every 10,000 steps
    resume_from_checkpoint=checkpoint
)
```

## How to Use

To train SmartBERT V3 and deploy it as a Web API service, see our GitHub repository: [web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT).

## Contributors

- [Youwei Huang](https://www.devil.ren)
- [Sen Fang](https://github.com/TomasAndersonFang)

## Sponsors

- [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
- CAS Mino (中科劢诺)
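
## Preprocessing Example

The preprocessing rule described in this card (each newline and tab in a function's code becomes a single space) can be sketched as follows. This is a minimal illustration, not the authors' actual script: the helper name `flatten_function_code` and the sample Solidity function are hypothetical, and the sketch assumes a one-for-one character replacement rather than collapsing runs of whitespace.

```python
def flatten_function_code(code: str) -> str:
    """Replace every newline and tab character with a single space.

    Hypothetical helper: assumes one-for-one replacement, so two
    adjacent whitespace characters become two spaces.
    """
    return code.replace("\n", " ").replace("\t", " ")


# Hypothetical sample input: a small Solidity function.
example = (
    "function add(uint a, uint b) public pure returns (uint) {\n"
    "\treturn a + b;\n"
    "}"
)

flat = flatten_function_code(example)
print(flat)
```

Because the model was trained on code flattened this way, applying the same step to functions before embedding them should keep inference inputs consistent with the training format.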