---
library_name: transformers
license: apache-2.0
datasets:
- HuggingFaceTB/smoltalk
- HuggingFaceH4/ultrafeedback_binarized
base_model:
- SmallDoge/Doge-60M
language:
- en
pipeline_tag: question-answering
---

# **Doge 60M Instruct**
Doge uses Dynamic Mask Attention for sequence transformation and can use either a Multi-Layer Perceptron or a Cross Domain Mixture of Experts for state transformation. Dynamic Mask Attention allows the Transformer to use self-attention during training and a state-space formulation during inference, and the Cross Domain Mixture of Experts can directly inherit the weights of the Multi-Layer Perceptron for further training. This model was trained by the [SmallDoge](https://huggingface.co/SmallDoge) community. For details of the algorithm and model architecture, please refer to [Wonderful Matrices](https://arxiv.org/abs/2412.11834); all training details and code are publicly available in the [small-doge](https://github.com/SamllDoge/small-doge) repository.

## Uses

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-60M-Instruct")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-60M-Instruct", trust_remote_code=True)

generation_config = GenerationConfig(
    max_new_tokens=100,
    use_cache=True,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.0,
)

# Stream generated tokens to stdout, skipping the prompt itself.
streamer = TextStreamer(
    tokenizer=tokenizer,
    skip_prompt=True,
)

prompt = "Hi, how are you doing today?"
conversation = [
    {"role": "user", "content": prompt},
]

# Apply the chat template, then generate a streamed reply.
inputs = tokenizer.apply_chat_template(
    conversation=conversation,
    tokenize=True,
    return_tensors="pt",
)

outputs = model.generate(
    inputs,
    tokenizer=tokenizer,
    generation_config=generation_config,
    streamer=streamer,
)
```

## Model Details

We built Doge-Instruct by first running SFT on [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) and then DPO on [UltraFeedback Binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized).

> TODO: The larger model is still being trained and will be uploaded soon.
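For a concrete picture of this two-stage recipe, the sketch below shows roughly how it could be reproduced with recent versions of TRL's `SFTTrainer` and `DPOTrainer`. This is a minimal, non-authoritative illustration: the dataset configuration and split names, output paths, and trainer arguments are assumptions made here, not the project's actual training code (which lives in the small-doge repository); the epochs and learning rates are taken from the 60M rows of the tables that follow.

```python
# Minimal sketch of the SFT -> DPO pipeline described above, using TRL.
# Dataset configs, output paths, and most trainer arguments are illustrative
# assumptions; see the small-doge repository for the real training scripts.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer, SFTConfig, SFTTrainer

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-60M")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-60M", trust_remote_code=True)

# Stage 1: supervised fine-tuning on SmolTalk
# (epochs and learning rate from the 60M SFT row in the table below).
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="Doge-60M-Instruct-SFT",
        num_train_epochs=2,
        learning_rate=6e-4,
        bf16=True,
    ),
    train_dataset=load_dataset("HuggingFaceTB/smoltalk", "all", split="train"),
    processing_class=tokenizer,
)
sft_trainer.train()

# Stage 2: DPO on UltraFeedback Binarized, starting from the SFT checkpoint
# (epochs and learning rate from the 60M DPO row in the table below).
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    args=DPOConfig(
        output_dir="Doge-60M-Instruct",
        num_train_epochs=2,
        learning_rate=6e-5,
        bf16=True,
    ),
    train_dataset=load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs"),
    processing_class=tokenizer,
)
dpo_trainer.train()
```

The actual runs also fix the batch sizes and context lengths listed in the tables below; those knobs are omitted here to keep the sketch version-agnostic.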
**SFT**:

| Model | Training Data | Epochs | Context Length | LR | Batch Size | Precision |
|---|---|---|---|---|---|---|
| [Doge-20M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-20M-Instruct-SFT) | [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 8e-4 | 0.25M | bfloat16 |
| [Doge-60M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-60M-Instruct-SFT) | [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 6e-4 | 0.25M | bfloat16 |

**DPO**:

| Model | Training Data | Epochs | Context Length | LR | Batch Size | Precision |
|---|---|---|---|---|---|---|
| [Doge-20M-Instruct](https://huggingface.co/SmallDoge/Doge-20M-Instruct) | [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) | 2 | 1024 | 8e-5 | 0.125M | bfloat16 |
| [Doge-60M-Instruct](https://huggingface.co/SmallDoge/Doge-60M-Instruct) | [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) | 2 | 1024 | 6e-5 | 0.125M | bfloat16 |

**Procedure**:

- **SFT**: [Visualize in Weights & Biases](https://wandb.ai/loser_cheems/huggingface/runs/ckbn4b5m)
- **DPO**: [Visualize in Weights & Biases](https://wandb.ai/loser_cheems/huggingface/runs/3nk7mu5a)

**Environment**:

- Image: nvcr.io/nvidia/pytorch:24.12-py3
- Hardware: 1x NVIDIA RTX 4090
- Software: Transformers, TRL

## Citation

```bibtex
@misc{shi2024wonderfulmatrices,
      title={Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture},
      author={Jingze Shi and Bingheng Wu},
      year={2024},
      eprint={2412.11834},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2412.11834},
}
```