---
license: mit
datasets:
- HuggingFaceTB/cosmopedia
- bigcode/starcoderdata
- shivendrra/consolidated-datasets
language:
- en
tags:
- transformers
- bert
- decoder-only
- encoder-decoder
- mixture of experts
- moe
- MoE
- aiva-500m
- transformer model
- llm
- small scale model
---

# aiva-4x500m

## Model Details

This is a transformer-based model trained on the [cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) and [starcoder](https://huggingface.co/datasets/bigcode/starcoderdata) datasets. It can generate new sequences and classify the emotions and sentiments in speech. It uses a mixture of experts (MoE), like Mistral's 8x7B model, but with four 500-million-parameter models. For now it only has the language models, but I'm working on vision and audio models, which will be uploaded soon.

### Model Description

- **Developed by:** [Shivendra Singh](https://twitter.com/shivendrra_)
- **License:** MIT
- **Train loss:** 0.2035
- **Accuracy:** Not yet determined (for next-token prediction)

### Model Sources

- **Repository:** [github/aiva-4x500m](https://github.com/shivendrra/AIVA-4x500m)
- **Papers:** None

## Uses

For now, the language model can be used for token generation, masked-token prediction and sentiment analysis. In the future, it will be paired with the audio and vision models to make it work like Ava from *Ex Machina*: it could listen to humans, talk to them, and understand sentiments, emotions, and actions using its vision and audio capabilities.

## Training Details

### Training Data

---

Data was drawn from these datasets: [cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia), [shivendrra/consolidated-datasets](https://huggingface.co/datasets/shivendrra/consolidated-datasets), [starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata)

### Training Procedure

---

The transformer-based model was trained for 35k iterations on 3.5 billion tokens, for around 25 hours on Google Colab's T4 GPU. I had access to a lot more data, but I didn't train it further because of budget issues and technical limitations.

#### Functions:

Training used a basic procedure.
`get_batch()` generates batches of data, `estimate_loss()` estimates the train and validation losses, and `train()` (shown below as a plain loop) is the master function: it calls the others on every iteration or at set intervals.

```python
import torch

# `model`, `optimizer`, `train_data`, `val_data`, `device` and the
# hyperparameters (`batch_size`, `block_size`, `eval_iters`, `eval_interval`,
# `max_iters`) are defined elsewhere in the training script.

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    # random start offsets, leaving room for a full block plus one target token
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    # targets are the inputs shifted one position to the right
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    # average the loss over `eval_iters` batches for each split
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

for step in range(max_iters):
    # report train and validation loss at regular intervals and on the last step
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # one optimization step on a fresh training batch
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
```

#### Training Hyperparameters

Configurations are saved in the `base/config.json` file. These values are suited to the 500-million-parameter encoder-decoder model.

```json
{
  "batch_size": 10,
  "block_size": 256,
  "max_iters": 5000,
  "eval_interval": 50,
  "learning_rate": 3e-5,
  "eval_iters": 100,
  "d_model": 512,
  "n_head": 18,
  "n_layer": 12,
  "dropout": 0.2,
  "norm_eps": 1e-5
}
```

### Model Architecture and Objective

There is one trained model uploaded for now: a 536-million-parameter transformer trained for over 35k iterations. It uses RMS norm and has a context size of only 256 tokens.
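
To make the RMS norm mentioned above concrete, here is a minimal sketch of the operation on a single feature vector. It is written in plain Python rather than PyTorch to keep the arithmetic visible, and it is not the code from the repository: the function name `rms_norm` and the `weight` gain vector are assumptions for illustration. The default `eps` matches the `norm_eps` value in the config.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Scale a vector by the inverse of its root mean square, then apply a gain.

    `x` and `weight` are equal-length lists of floats; `weight` stands in for
    the learned per-feature gain (hypothetical name, not from the repo).
    """
    # mean of the squared features; unlike LayerNorm, the mean is not subtracted
    mean_sq = sum(v * v for v in x) / len(x)
    # eps keeps the division stable when the vector is close to zero
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]

normalized = rms_norm([3.0, 4.0], [1.0, 1.0])
```

After normalization with a gain of one, the mean of the squared outputs is approximately 1, whatever the scale of the input.
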
`tiktoken` is used for tokenization, and a tokenizer file configured for the trained model is also included.

The decoder-based model isn't uploaded for now; it's a little hard to train due to its complexity, but it will be uploaded soon.

### Highlights

1. **RMS Normalization & Pre-normalization:** Both models use RMS normalization, as implemented in LLaMA-2, and pre-normalization to keep training stable.
2. **Self-Attention Layer:** The encoder and final attention layers have no masking, and their key, query and value projections include a bias. The decoder attention layer applies a triangular mask and uses no biases. The encoder attention also adds relative positional embeddings to the attention matrix before `softmax`.
3. **FeedForward:** A basic feed-forward network with two linear layers and an expansion factor of 5. GELU is used as the activation function instead of ReLU.
4. **Generation:** The token generation function uses top_k, top_p and beam search along with temperature scaling, but it has a bug and doesn't work as it is supposed to. I'll try to correct it and then upload it again.
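
For reference, the sketch below shows one conventional way to combine temperature scaling, top-k and top-p (nucleus) filtering when sampling a single token. It is not the generation function from the repository and does not attempt to reproduce or fix its bug. It is plain Python over a list of logits, beam search is left out, and the function name `sample_next_token` and its parameters are assumptions for illustration.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token id from raw logits (hypothetical helper, not from the repo)."""
    # temperature scaling: values below 1 sharpen the distribution, above 1 flatten it
    scaled = [l / temperature for l in logits]

    # numerically stable softmax
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # token ids ranked from most to least likely
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top-k: keep only the k most likely tokens (at least one)
    if top_k is not None:
        ranked = ranked[:max(1, top_k)]

    # top-p: keep the smallest prefix whose renormalized cumulative
    # probability reaches top_p
    if top_p is not None:
        kept_mass = sum(probs[i] for i in ranked)
        nucleus, cumulative = [], 0.0
        for i in ranked:
            nucleus.append(i)
            cumulative += probs[i] / kept_mass
            if cumulative >= top_p:
                break
        ranked = nucleus

    # draw from the surviving tokens; random.choices normalizes the weights
    weights = [probs[i] for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]

token_id = sample_next_token([0.0, 5.0, 1.0], temperature=0.8, top_k=2, top_p=0.9)
```

Applying top-k before top-p, and renormalizing in between, is a common ordering but a design choice; other orderings give slightly different candidate sets.
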