---
license: apache-2.0
datasets:
- PleIAs/common_corpus
language:
- en
- fr
- es
- de
- it
- la
- nl
- pl
---

<div style="text-align: center;">
  <img src="https://raw.githubusercontent.com/Pleias/logos/d6152d7943905da32a1e04fdfd7708ed9c7eed5e/PleIAs%201_0%20Full%20Logo%20(Black).png" style="width: 80%; margin: 0 auto; display: inline-block;"/>
</div>

**Pleias-3b-Preview** is an early preview of a 3-billion-parameter base model trained by [Pleias](https://huggingface.co/PleIAs) on [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus).

Like all the base and specialized models from Pleias, Pleias-3b-Preview has been trained exclusively on open data that is either out of copyright (public domain) or released under a permissive license.

## Description
Pleias-3b-Preview is a transformer base model, entirely pretrained from scratch, using an architecture similar to Llama/GPT-NeoX for easier deployment and inference.

It includes the following features, which apply to any responsibly trained variant:
* Trained exclusively on open data under a permissive license and in compliance with the European AI Act. By design, all Pleias models are unable to output copyrighted content.
* Extensive multilingual support for the main European languages.
* A new tokenizer designed for enhanced document processing tasks and better multilingual support.
* Extremely low level of toxicity and problematic content.

Fully supported languages include English, French, Spanish, German, Italian, Dutch, Latin and Portuguese. 

## Recommended use
As a base model, Pleias-3b-Preview only supports continuation prompts: it completes a given text rather than following instructions.

Text generation currently supports a range of creative writing tasks in multiple European languages. For more consistent results, we recommend using a low or zero temperature together with a slight repetition penalty (1.1-1.2).
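
As an illustration, here is a minimal continuation sketch with the Hugging Face `transformers` library using these settings; the repository id and the prompt below are assumptions for the example, not part of the official release.

```python
# Minimal sketch of continuation prompting with the recommended sampling settings.
# The repository id below is an assumption; adjust it to the actual model path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/Pleias-3b-Preview"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "La Révolution française est"  # example continuation prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding (temperature effectively zero) with a slight repetition penalty,
# as recommended above.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```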

Pleias-3b-Preview has been successfully adapted through continuous pretraining and full fine-tuning for document processing tasks such as RAG, translation, or OCR correction. Given the small size of the model, we do not recommend fine-tuning methods based on LoRA.
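
The sketch below shows what a full fine-tuning run could look like with the `transformers` Trainer; the model id, dataset, sequence length, and hyperparameters are placeholders for illustration, not the settings used by Pleias.

```python
# Minimal full fine-tuning sketch with the transformers Trainer.
# Model id, dataset, and hyperparameters are placeholders, not Pleias' actual settings.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "PleIAs/Pleias-3b-Preview"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # dynamic padding needs a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Any text dataset with a "text" column works here; wikitext is only an example.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pleias-3b-finetuned",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard causal language modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```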

## Training
Pleias-3b-Preview was fully pretrained on the Jean Zay supercomputer on 192 H100 GPUs for about 20 days (compute grant n°GC011015451). The training code relied on Nanotron, the open-source pretraining library from Hugging Face. We provide the complete settings as a YAML file as part of our release.

The training schedule comprised 518,000 steps (batch size 1,024) on a filtered and enhanced version of Common Corpus (1,086,324,736,000 tokens).
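
As an arithmetic check, and assuming a training sequence length of 2,048 tokens (not stated explicitly above but implied by the figures), 518,000 steps × 1,024 sequences per step × 2,048 tokens per sequence gives exactly 1,086,324,736,000 tokens.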

## Update
Pleias-3b-Preview is currently released as an early preview.

The model will undergo several more rounds of post-training to enhance reasoning capacities and fine-tunability, as well as in anticipation of a generalist instruct version.