CALM: Collaborative Arabic Language Model

The CALM project is a joint effort led by NCAI in collaboration with Yandex and HuggingFace to train an Arabic language model with volunteers from around the globe. The project is an adaptation of the framework proposed at the NeurIPS 2021 demonstration: Training Transformers Together. That demo trained a model similar to OpenAI DALL-E, a Transformer "language model" that generates images from text descriptions. Training happened collaboratively: volunteers from all over the Internet contributed using the hardware available to them. The demo used LAION-400M, at the time the world's largest openly available image-text-pair dataset with 400 million samples, and its model was based on the dalle‑pytorch implementation by Phil Wang with a few tweaks to make it communication-efficient.

See details about how to join and how it works on our website.

This organization gathers people participating in the collaborative training and provides links to the necessary resources:

👉 Starter kits for Google Colab and Kaggle (easy way to join the training)
👉 Dashboard (the current training state: loss, number of peers, etc.)
👉 Colab notebook for running inference
👉 Model weights (the latest checkpoint)
👉 Weights & Biases plots for aux peers (aggregating the metrics) and actual trainers (contributing with their GPUs)
👉 Code
👉 Dataset

Feel free to reach us on Discord if you have any questions 🙂

One of the main obstacles facing many researchers in the Arabic NLP community is the lack of the computing resources needed to train large models. Models with leading performance on Arabic NLP tasks, such as AraBERT, CamelBERT, AraELECTRA, and QARiB, took days to train on TPUs. In the spirit of democratization of AI and community enabling, a core value at NCAI, CALM aims to demonstrate the effectiveness of collaborative training and to form a community of volunteers among Arabic NLP (ANLP) researchers with basic cloud GPUs who wish to train their own models collaboratively.

CALM trains a single BERT model on a dataset that combines MSA data from OSCAR and Arabic Wikipedia with dialectal data for the Gulf region from existing open-source datasets. Each volunteer GPU trains the model locally at its own pace on a portion of the dataset while another portion is streamed in the background to reduce local memory consumption. Computing the gradients and aggregating them is performed in a distributed manner, based on the computing abilities of each participating volunteer. Details of the distributed training process are further described in the paper Deep Learning in Open Collaborations.
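To make the aggregation step more concrete, here is a toy sketch of how gradients from peers that work at different speeds could be combined. It is an illustration only, not the code CALM runs: the function name `aggregate_gradients`, the peer names, and the choice to weight each peer by the number of samples it processed are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


def aggregate_gradients(
    contributions: Dict[str, Tuple[List[float], int]]
) -> List[float]:
    """Average per-peer mean gradients, weighting each peer by its sample count.

    `contributions` maps a (hypothetical) peer name to a pair of
    (mean gradient over the peer's samples, number of samples processed).
    """
    total_samples = sum(n for _, n in contributions.values())
    if total_samples == 0:
        raise ValueError("no samples were processed in this round")

    dim = len(next(iter(contributions.values()))[0])
    averaged = [0.0] * dim
    for grad, n_samples in contributions.values():
        if len(grad) != dim:
            raise ValueError("all gradients must have the same length")
        # A faster peer processed more samples, so it gets a larger weight.
        weight = n_samples / total_samples
        for i, g in enumerate(grad):
            averaged[i] += weight * g
    return averaged


# Hypothetical round: one peer processed 30 samples, a slower one 10.
round_contributions = {
    "kaggle_peer": ([0.5, -1.0], 30),
    "colab_peer": ([1.5, 1.0], 10),
}
avg = aggregate_gradients(round_contributions)
print(avg)
```

Weighting by sample count means the result equals the mean gradient over all samples seen in the round, so slower volunteers still contribute in proportion to the work they completed.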

How to participate in training?

To join the collaborative training, all you have to do is keep a notebook running for at least 15 minutes. You're free to close it after that and join again at another time. There are a few steps to complete before running the notebook:

Create an account on Hugging Face.
Join the NCAI-CALM organization on Hugging Face through the invitation link shared with you by email.
Get your access token, which is required later in the notebook:
1. Go to your HF account
2. Go to Settings ⇒ Access Tokens
3. Generate a new Access Token and enter any name for "what's this token for"
4. Select read role
5. Copy your access token
6. Paste it in the execution prompt in the notebook
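The notebook prompts for the token itself, so no extra code is needed to follow the steps above. For anyone scripting the same step outside the notebook, a minimal sketch might look like the following. The helper names `read_access_token` and `authenticate` and the use of an `HF_TOKEN` environment variable are assumptions for this example; `login` is the function provided by the `huggingface_hub` package.

```python
import os
from getpass import getpass
from typing import Mapping


def read_access_token(
    env: Mapping[str, str] = os.environ, var_name: str = "HF_TOKEN"
) -> str:
    """Return the token from the environment, or prompt for it if unset."""
    token = env.get(var_name, "").strip()
    if not token:
        # getpass hides the token while it is being typed or pasted.
        token = getpass("Hugging Face access token: ").strip()
    if not token:
        raise ValueError("an access token is required")
    return token


def authenticate(token: str) -> None:
    """Log in to the Hugging Face Hub with the given token."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import login

    login(token=token)
```

Calling `authenticate(read_access_token())` would then log in with the read-role token created in the steps above.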

Start training

Pick one of the following methods to run the training code.
NOTE: Kaggle gives you around 40 hrs per week of GPU time, so it's preferred over Colab, unless you have Colab Pro or Colab Pro+.

(recommended)
Running locally
If you have local GPUs available, please visit our Discord channel for instructions on setting them up.

Issues or questions?

We are here to provide any assistance needed. Please make sure to join our