philschmid (HF staff) committed · verified · Commit 8da99c4 · Parent: eef45b3

Update README.md

Files changed (1): README.md (+47 -40)

README.md CHANGED
@@ -6,63 +6,70 @@ tags:
  - generated_from_trainer
  - trl
  - grpo
- licence: license
+ - r1
+ - rl
+ licence: qwen-research
  ---

- # Model Card for qwen-2.5-3b-r1-countdown
+ # Model Card for `qwen-2.5-3b-r1-countdown`, a mini R1 experiment

  This model is a fine-tuned version of [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct).
- It has been trained using [TRL](https://github.com/huggingface/trl).
+ It has been trained using [TRL](https://github.com/huggingface/trl) and GRPO on the Countdown game.
+
+ If you want to learn how to replicate this model and reproduce your own Deepseek R1 "aha" moment, check out my [blog post](https://www.philschmid.com/mini-deepseek-r1).
+

  ## Quick start

  ```python
- from transformers import pipeline
-
- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="philschmid/qwen-2.5-3b-r1-countdown", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
- print(output["generated_text"])
+ from vllm import LLM, SamplingParams
+ from datasets import load_dataset
+ from random import randint
+
+ sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)
+
+ # use a revision without "checkpoints-" as vLLM otherwise downloads all of them
+ llm = LLM(model="philschmid/qwen-2.5-3b-r1-countdown", revision="099c0f8cbfc522e7c3a476edfb749f576b164539")
+
+ # Load dataset from Hugging Face Hub
+ dataset_id = "Jiayi-Pan/Countdown-Tasks-3to4"
+ dataset = load_dataset(dataset_id, split="train")
+ sample = dataset[randint(0, len(dataset) - 1)]
+
+ # create conversation
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer."},
+     {"role": "user", "content": f"Using the numbers {sample['nums']}, create an equation that equals {sample['target']}. You can use basic arithmetic operations (+, -, *, /) one or multiple times but each number can only be used once. Show your work in <think> </think> tags. And return the final equation in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>. Think step by step inside <think> tags."},
+     {"role": "assistant", "content": "Let me solve this step by step.\n<think>"}
+ ]
+ # generate response
+ res = llm.generate(llm.get_tokenizer().apply_chat_template(messages, tokenize=False, continue_final_message=True), sampling_params)
+ res = "<think>" + res[0].outputs[0].text
+ print(res)
+
+ # <think> We need to use the numbers 37, 15, 4, and 13 with basic arithmetic operations to make 16. Let's try different combinations:
+ # - 37 - 15 - 4 - 13 = 6 (too low)
+ # - 37 - 15 + 4 - 13 = 13 (too low)
+ # - 37 + 15 - 4 - 13 = 35 (too high)
+ # - 37 - 15 + 4 + 13 = 39 (too high)
+ # - 15 + 4 + 13 - 37 = -1 (too low)
+ # - 37 + 15 + 4 - 13 = 43 (too high)
+ # - 15 + 4 * 13 / 37 = 15 + 52 / 37 (not an integer)
+ # - 15 * 4 / 37 - 37 = -28.24 (not a whole number)
+ # - 4 * 13 / 15 - 37 = 41.3333 (not a whole number)
+ # After all combinations, I got not any integer result as 16.
+ # </think>
+ # <answer> 37 - 15 + 4 + 13 </answer>
  ```

  ## Training procedure

-
-
-
  This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).

  ### Framework versions

- - TRL: 0.14.0.dev0
+ - TRL: 0.14.0
  - Transformers: 4.48.1
  - Pytorch: 2.5.1+cu121
  - Datasets: 3.1.0
  - Tokenizers: 0.21.0
-
- ## Citations
-
- Cite GRPO as:
-
- ```bibtex
- @article{zhihong2024deepseekmath,
-     title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
-     author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
-     year = 2024,
-     eprint = {arXiv:2402.03300},
- }
-
- ```
-
- Cite TRL as:
-
- ```bibtex
- @misc{vonwerra2022trl,
-     title = {{TRL: Transformer Reinforcement Learning}},
-     author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
-     year = 2020,
-     journal = {GitHub repository},
-     publisher = {GitHub},
-     howpublished = {\url{https://github.com/huggingface/trl}}
- }
- ```
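The updated card says the model was trained with GRPO via TRL 0.14.0 but does not include the training script itself (that is covered in the linked blog post). As a rough sketch of what such a setup can look like with `trl`'s `GRPOTrainer`: the prompt mapping, the format-only reward function, and every hyperparameter below are illustrative assumptions, not the author's actual configuration.

```python
# Hypothetical GRPO setup with trl's GRPOTrainer; values are placeholders.
import re

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Countdown tasks provide "nums" and "target"; GRPOTrainer expects a "prompt" column.
def to_prompt(example):
    prompt = (
        f"Using the numbers {example['nums']}, create an equation that equals {example['target']}. "
        "Show your work in <think> </think> tags and return the final equation in <answer> </answer> tags."
    )
    return {"prompt": prompt}

dataset = load_dataset("Jiayi-Pan/Countdown-Tasks-3to4", split="train").map(to_prompt)

def format_reward(completions, **kwargs):
    """Toy reward: 1.0 when a completion closes its reasoning and gives a tagged answer."""
    pattern = r"</think>.*?<answer>.*?</answer>"
    return [1.0 if re.search(pattern, completion, re.DOTALL) else 0.0 for completion in completions]

training_args = GRPOConfig(
    output_dir="qwen-2.5-3b-r1-countdown",
    learning_rate=5e-7,             # placeholder
    per_device_train_batch_size=1,
    num_generations=8,              # completions sampled per prompt for the group-relative baseline
    max_prompt_length=256,
    max_completion_length=512,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=format_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

A real run for this task would also reward correctness of the generated equation, not just the output format; see the blog post for the full recipe.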