xiangruiy committed on
Commit
f70cab6
·
verified ·
1 Parent(s): fe0df2d

Create README.md

Files changed (1)
  1. README.md +249 -0
README.md ADDED
@@ -0,0 +1,249 @@
1
+ ---
2
+ license: cc-by-4.0
3
+ library_name: nemo
4
+ tags:
5
+ - pytorch
6
+ - NeMo
7
+ ---
8
+
9
+ # Llama2-13b-nemo
10
+
11
+ <style>
12
+ img {
13
+ display: inline;
14
+ }
15
+ </style>
16
+
17
+ **Put a short model description here.**
18
+
19
+ See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/index.html) for complete architecture details.
20
+
21
+
22
+ ## NVIDIA NeMo: Training
23
+
24
+ To train, fine-tune, or experiment with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version.
25
+ ```
26
+ pip install "nemo_toolkit[all]"
27
+ ```
28
+
29
+ ## How to Use this Model
30
+
31
+ The model is available for use in the NeMo toolkit [1], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
32
+
33
+ ### Automatically instantiate the model
34
+
35
+ **NOTE**: Please update the model class below to match the class of the model being uploaded.
36
+
37
+ ```python
38
+ from nemo.core import ModelPT
39
+ model = ModelPT.from_pretrained("pe-nlp/llama2-13b-nemo")
40
+ ```
41
+
42
+ ### NOTE
43
+
44
+ Add some information about how to use the model here. An example is provided for ASR inference below.
45
+
46
+ ### Transcribing using Python
47
+ First, let's get a sample
48
+ ```
49
+ wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
50
+ ```
51
+ Then simply do:
52
+ ```
53
+ asr_model.transcribe(['2086-149220-0033.wav'])
54
+ ```
55
+
56
+ ### Transcribing many audio files
57
+
58
+ ```shell
59
+ python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="pe-nlp/llama2-13b-nemo" audio_dir=""
60
+ ```
61
+
62
+ **Input**
63
+ Models input text only.
64
+
65
+ **Output**
66
+ Models generate text only.
67
+
68
+ **Model Architecture**
69
+ Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety.
70
+
71
+
72
+ ||Training Data|Params|Context Length|GQA|Tokens|LR|
73
+ |---|---|---|---|---|---|---|
74
+ |Llama 2|*A new mix of publicly available online data*|7B|4k|&#10007;|2.0T|3.0 x 10<sup>-4</sup>|
75
+ |Llama 2|*A new mix of publicly available online data*|13B|4k|&#10007;|2.0T|3.0 x 10<sup>-4</sup>|
76
+ |Llama 2|*A new mix of publicly available online data*|70B|4k|&#10004;|2.0T|1.5 x 10<sup>-4</sup>|
77
+
78
+ *Llama 2 family of models.* Token counts refer to pretraining data only. All models are trained with a global batch size of 4M tokens. The largest model (70B) uses Grouped-Query Attention (GQA) for improved inference scalability.
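
The figures in the table and caption above can be turned into two quick back-of-the-envelope calculations. This is a minimal sketch, not part of the original model card: it assumes "2.0T" means 2 × 10^12 tokens and "4M" means 4 × 10^6 tokens per batch, and the layer count, head counts, and head dimension used for the cache comparison are hypothetical values chosen only to illustrate why sharing key/value heads shrinks the inference-time cache.

```python
def optimizer_steps(total_tokens: int, batch_tokens: int) -> int:
    """Number of batches needed to consume total_tokens once."""
    return total_tokens // batch_tokens


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Assumption: 2.0T tokens = 2e12, global batch of 4M tokens = 4e6.
steps = optimizer_steps(2_000_000_000_000, 4_000_000)
print(f"approximate pretraining steps: {steps:,}")

# Hypothetical shapes for illustration only (not taken from this card):
# the same network with one KV head per query head versus 8 shared KV heads.
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096)
print(f"cache without GQA: {mha / 2**30:.2f} GiB")
print(f"cache with GQA:    {gqa / 2**30:.2f} GiB")
print(f"reduction factor:  {mha // gqa}x")
```

Because the cache size is linear in the number of key/value heads, the reduction factor equals the ratio of the two head counts regardless of the other shape parameters.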
79
+
80
+ ## Training
81
+
82
+ **Training Factors** We used custom training libraries, Meta's Research Super Cluster, and production clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute.
83
+
84
+ **Carbon Footprint** Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W). Estimated total emissions were 539 tCO2eq, 100% of which were offset by Meta’s sustainability program.
85
+
86
+ ||Time (GPU hours)|Power Consumption (W)|Carbon Emitted (tCO<sub>2</sub>eq)|
87
+ |---|---|---|---|
88
+ |Llama 2 7B|184320|400|31.22|
89
+ |Llama 2 13B|368640|400|62.44|
90
+ |Llama 2 70B|1720320|400|291.42|
91
+ |Total|3311616||539.00|
92
+
93
+ **CO<sub>2</sub> emissions during pretraining.** Time: total GPU time required for training each model. Power Consumption: peak power capacity per GPU device for the GPUs used adjusted for power usage efficiency. 100% of the emissions are directly offset by Meta's sustainability program, and because we are openly releasing these models, the pretraining costs do not need to be incurred by others.
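
As a sanity check on the table above, the following sketch recomputes the per-row energy and compares the sum of the listed rows with the stated total. It is an illustration under stated assumptions, not the method Meta used: energy is approximated as GPU hours × per-GPU power, and the implied emission intensity is simply emissions divided by that energy. The stated total is larger than the sum of the three listed rows; the remainder presumably belongs to a model size that is not listed in this table, which is an assumption on our part.

```python
# (GPU hours, power in W, emissions in tCO2eq), copied from the table above.
rows = {
    "Llama 2 7B": (184_320, 400, 31.22),
    "Llama 2 13B": (368_640, 400, 62.44),
    "Llama 2 70B": (1_720_320, 400, 291.42),
}
stated_total_hours = 3_311_616
stated_total_emissions = 539.00

for name, (hours, watts, emitted) in rows.items():
    energy_mwh = hours * watts / 1e6   # Wh -> MWh
    intensity = emitted / energy_mwh   # implied tCO2eq per MWh
    print(f"{name}: {energy_mwh:.1f} MWh, {intensity:.3f} tCO2eq/MWh")

listed_hours = sum(hours for hours, _, _ in rows.values())
listed_emissions = round(sum(e for _, _, e in rows.values()), 2)

# Portion of the stated total not explained by the listed rows.
gap_hours = stated_total_hours - listed_hours
gap_emissions = round(stated_total_emissions - listed_emissions, 2)

print(f"listed rows: {listed_hours:,} GPU hours, {listed_emissions} tCO2eq")
print(f"unlisted remainder: {gap_hours:,} GPU hours, {gap_emissions} tCO2eq")
```

The three listed rows account for 2,273,280 GPU hours and 385.08 tCO2eq, leaving 1,038,336 GPU hours and 153.92 tCO2eq of the stated total unattributed in this table.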
94
+
95
+
96
+ ### NOTE
97
+
98
+ An example is provided below for ASR
99
+
100
+ The NeMo toolkit [1] was used to train the models for several hundred epochs. These models are trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml).
101
+
102
+ The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
103
+
104
+
105
+ ### Datasets
106
+
107
+ **Overview**
108
+ Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.
109
+
110
+ **Data Freshness**
111
+ The pretraining data has a cutoff of September 2022, but some tuning data is more recent, up to July 2023.
112
+
113
+ ### NOTE
114
+
115
+ An example for the manifest section is provided below for ASR datasets
116
+
117
+ datasets:
118
+ - librispeech_asr
119
+ - fisher_corpus
120
+ - Switchboard-1
121
+ - WSJ-0
122
+ - WSJ-1
123
+ - National-Singapore-Corpus-Part-1
124
+ - National-Singapore-Corpus-Part-6
125
+ - vctk
126
+ - voxpopuli
127
+ - europarl
128
+ - multilingual_librispeech
129
+ - mozilla-foundation/common_voice_8_0
130
+ - MLCommons/peoples_speech
131
+
132
+ The corresponding text in this section for those datasets is stated below:
133
+
134
+ The model was trained on 64K hours of English speech collected and prepared by NVIDIA NeMo and Suno teams.
135
+
136
+ The training dataset consists of a private subset with 40K hours of English speech plus 24K hours from the following public datasets:
137
+
138
+ - Librispeech 960 hours of English speech
139
+ - Fisher Corpus
140
+ - Switchboard-1 Dataset
141
+ - WSJ-0 and WSJ-1
142
+ - National Speech Corpus (Part 1, Part 6)
143
+ - VCTK
144
+ - VoxPopuli (EN)
145
+ - Europarl-ASR (EN)
146
+ - Multilingual Librispeech (MLS EN) - 2,000 hour subset
147
+ - Mozilla Common Voice (v7.0)
148
+ - People's Speech - 12,000 hour subset
149
+
150
+
151
+ ## Performance
152
+
153
+ In this section, we report the results for the Llama 1 and Llama 2 models on standard academic benchmarks. For all the evaluations, we use our internal evaluations library.
154
+
155
+ |Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math|MMLU|BBH|AGI Eval|
156
+ |---|---|---|---|---|---|---|---|---|---|
157
+ |Llama 1|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|23.9|
158
+ |Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|33.9|
159
+ |Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|41.7|
160
+ |Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|47.6|
161
+ |Llama 2|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|29.3|
162
+ |Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|39.1|
163
+ |Llama 2|70B|**37.5**|**71.9**|**63.6**|**69.4**|**35.2**|**68.9**|**51.2**|**54.2**|
164
+
165
+ **Overall performance on grouped academic benchmarks.** *Code:* We report the average pass@1 scores of our models on HumanEval and MBPP. *Commonsense Reasoning:* We report the average of PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, and CommonsenseQA. We report 7-shot results for CommonsenseQA and 0-shot results for all other benchmarks. *World Knowledge:* We evaluate the 5-shot performance on NaturalQuestions and TriviaQA and report the average. *Reading Comprehension:* We report the 0-shot average on SQuAD, QuAC, and BoolQ. *Math:* We report the average of the GSM8K (8-shot) and MATH (4-shot) benchmarks at top 1.
166
+
167
+ |Model|Size|TruthfulQA|ToxiGen|
168
+ |---|---|---|---|
169
+ |Llama 1|7B|27.42|23.00|
170
+ |Llama 1|13B|41.74|23.08|
171
+ |Llama 1|33B|44.19|22.57|
172
+ |Llama 1|65B|48.71|21.77|
173
+ |Llama 2|7B|33.29|**21.25**|
174
+ |Llama 2|13B|41.86|26.10|
175
+ |Llama 2|70B|**50.18**|24.60|
176
+
177
+ **Evaluation of pretrained LLMs on automatic safety benchmarks.** For TruthfulQA, we present the percentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we present the percentage of toxic generations (the smaller the better).
178
+
179
+
180
+ |Model|Size|TruthfulQA|ToxiGen|
181
+ |---|---|---|---|
182
+ |Llama-2-Chat|7B|57.04|**0.00**|
183
+ |Llama-2-Chat|13B|62.18|**0.00**|
184
+ |Llama-2-Chat|70B|**64.14**|0.01|
185
+
186
+ **Evaluation of fine-tuned LLMs on different safety datasets.** Same metric definitions as above.
187
+
188
+ ### NOTE
189
+
190
+ An example is provided below of an ASR metrics list that can be added to the top of the README:
191
+
192
+ model-index:
193
+ - name: PUT_MODEL_NAME
194
+   results:
195
+   - task:
196
+       name: Automatic Speech Recognition
197
+       type: automatic-speech-recognition
198
+     dataset:
199
+       name: AMI (Meetings test)
200
+       type: edinburghcstr/ami
201
+       config: ihm
202
+       split: test
203
+       args:
204
+         language: en
205
+     metrics:
206
+     - name: Test WER
207
+       type: wer
208
+       value: 17.10
209
+   - task:
210
+       name: Automatic Speech Recognition
211
+       type: automatic-speech-recognition
212
+     dataset:
213
+       name: Earnings-22
214
+       type: revdotcom/earnings22
215
+       split: test
216
+       args:
217
+         language: en
218
+     metrics:
219
+     - name: Test WER
220
+       type: wer
221
+       value: 14.11
222
+
223
+ Provide any caveats about the results at the top of the discussion so that nuance is not lost.
224
+
225
+ It should ideally be in a tabular format (you can use the following website to make your tables in markdown format: https://www.tablesgenerator.com/markdown_tables).
226
+
227
+ ## Limitations
228
+
229
+ Llama 2 is a new technology that carries risks with use. Testing conducted to date has been in English and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 2’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased, or otherwise objectionable responses to user prompts. Therefore, before deploying any applications of Llama 2, developers should perform safety testing and tuning tailored to their specific applications of the model.
230
+
231
+ Please see the Responsible Use Guide available at [https://ai.meta.com/llama/responsible-use-guide/](https://ai.meta.com/llama/responsible-use-guide/).
232
+
233
+
234
+ ### Note
235
+
236
+ An example is provided below
237
+
238
+ Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.
239
+
240
+
241
+ ## License
242
+
243
+ Use of this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license. By downloading the public release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
244
+
245
+ ## References
246
+
247
+ **Provide appropriate references in the markdown link format below. Please order them numerically.**
248
+
249
+ [1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)