Abstract
Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline that combines data cleaning with data scheduling strategies, a robust optimization method that mitigates training instability, and an effective annealing approach that incorporates targeted data selection and long-context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the data composition for each training phase. Project details are available at https://github.com/RUC-GSAI/YuLan-Mini.
Community
YuLan-Mini is a high-performance base model with 2.4B parameters. We open-source all of its training techniques and its dataset composition, making it possible for the community to train their own top-tier LLMs. The model is pre-trained entirely from scratch by the research team at Renmin University of China (RUC).
To promote the accessibility of high-performance models in the open-source community and facilitate subsequent research on data curriculum design and annealing data selection, we have fully released the following pre-training details:
- Detailed data composition at each curriculum phase
- Open-source and synthetic datasets
- Intermediate checkpoints at each stage
- Optimizer states before annealing
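As a rough illustration of what "data composition at each curriculum phase" can mean in practice, the sketch below represents a curriculum as per-phase mixture weights and draws data sources according to them. The phase names, source names, and proportions are all hypothetical placeholders, not YuLan-Mini's released composition; `CURRICULUM` and `sample_sources` are illustrative names only.

```python
import random

# Hypothetical phases and mixture weights, for illustration only.
# These are NOT the proportions released with YuLan-Mini.
CURRICULUM = {
    "phase_1": {"web": 0.6, "code": 0.2, "math": 0.2},
    "phase_2": {"web": 0.4, "code": 0.3, "math": 0.3},
}


def sample_sources(phase, n, seed=0):
    """Draw n data-source names for a phase, proportional to its weights."""
    weights = CURRICULUM[phase]
    names = list(weights)
    rng = random.Random(seed)  # seeded so the draw is reproducible
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


if __name__ == "__main__":
    batch = sample_sources("phase_1", 10)
    print(batch)
```

Changing the weights from one phase to the next is one simple way to express a data schedule; a real pipeline would sample documents or token budgets rather than source names.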
What you can do with these pre-training resources
- Pre-train your own LLM. You can use our data and curriculum to train a model that’s just as powerful as YuLan-Mini.
- Perform your own learning rate annealing. During the annealing phase, YuLan-Mini’s learning ability is at its peak. You can resume training from the checkpoint before annealing and use your own dataset for learning rate annealing.
- Fine-tune the Instruct version of the LLM. You can use the YuLan-Mini base model to train your own Instruct version.
- Research training dynamics. You can use YuLan-Mini’s intermediate checkpoints to explore internal changes during the pre-training process.
- Synthesize your own data. You can use YuLan-Mini’s data pipeline to clean and generate your own dataset.
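To make the learning rate annealing item above more concrete, here is a minimal sketch of a schedule with a warmup phase, a constant phase, and a final annealing phase. The linear decay and every parameter name are assumptions made for illustration; this section does not specify the shape of YuLan-Mini's actual schedule. Resuming "from the checkpoint before annealing" corresponds to restarting at `anneal_start` with your own data.

```python
def annealed_lr(step, total_steps, peak_lr, warmup_steps, anneal_steps, min_lr=0.0):
    """Learning rate at `step` for a warmup / constant / anneal schedule.

    Assumed shape (not taken from the paper): linear warmup to peak_lr,
    a constant plateau, then linear decay to min_lr over the last
    `anneal_steps` steps.
    """
    if anneal_steps <= 0 or warmup_steps < 0:
        raise ValueError("anneal_steps must be > 0 and warmup_steps >= 0")
    if not 0 <= step <= total_steps:
        raise ValueError("step out of range")

    anneal_start = total_steps - anneal_steps
    if step < warmup_steps:
        # Linear warmup from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    if step < anneal_start:
        # Constant phase: hold the peak learning rate.
        return peak_lr
    # Annealing phase: linear decay from peak_lr down to min_lr.
    progress = (step - anneal_start) / anneal_steps
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    for s in (0, 50, 500, 900, 1000):
        print(s, annealed_lr(s, total_steps=1000, peak_lr=1e-2,
                             warmup_steps=100, anneal_steps=200))
```

In a training loop, a function like this could be wrapped in a framework scheduler; the point here is only that the annealing phase is a separate, final segment that can be rerun from a saved checkpoint with a different dataset.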