tangled-llama-a-128k-base-v0.1
A pretrained language model based on the Llama architecture with about 62.9M parameters. It was trained on 10.6B (10,630,121,844) tokens drawn from more than 31.3M (31,383,840) dataset rows.

This model is not designed for immediate use but rather as a base for Continued Pretraining and Finetuning on a downstream task. While it can handle a context length of up to 128K (131,072) tokens, it was pretrained with sequences of 2K (2048) tokens.

The objective is to keep a lean cognitive/reasoning core while eliminating redundant knowledge from the model.
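As a quick smoke test of a local checkpoint (for example the `out/pretrain/final/` directory that the evaluation commands below point at), a minimal generation sketch using litgpt's Python API might look like the following; the exact API surface can differ between litgpt versions, and the prompt is purely illustrative.

```python
# Minimal sketch, assuming litgpt is installed and a litgpt-format checkpoint directory is on disk.
from litgpt import LLM

llm = LLM.load("out/pretrain/final/")  # same path used by the litgpt evaluate commands below
print(llm.generate("Water boils at", max_new_tokens=32))
```

For the intended use, continued pretraining and finetuning, the same checkpoint directory can be passed to litgpt's `pretrain`/`finetune` commands; see the litgpt documentation for the exact flags, which are not reproduced here.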
Training metrics logged during pretraining: loss / val_loss, val_ppl, epoch, learning_rate.
Pretrain Evaluation
lm-evaluation-harness
```sh
litgpt evaluate --tasks 'hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge' --out_dir 'evaluate-quick/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 0 | acc | ↑ | 0.2176 | ± | 0.0121 |
|  |  | none | 0 | acc_norm | ↑ | 0.2560 | ± | 0.0128 |
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0190 | ± | 0.0038 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.2618 | ± | 0.0044 |
|  |  | none | 0 | acc_norm | ↑ | 0.2592 | ± | 0.0044 |
| mmlu | 2 | none |  | acc | ↑ | 0.2464 | ± | 0.0036 |
| - humanities | 2 | none |  | acc | ↑ | 0.2485 | ± | 0.0063 |
| - formal_logic | 1 | none | 0 | acc | ↑ | 0.3175 | ± | 0.0416 |
| - high_school_european_history | 1 | none | 0 | acc | ↑ | 0.2364 | ± | 0.0332 |
| - high_school_us_history | 1 | none | 0 | acc | ↑ | 0.2402 | ± | 0.0300 |
| - high_school_world_history | 1 | none | 0 | acc | ↑ | 0.2785 | ± | 0.0292 |
| - international_law | 1 | none | 0 | acc | ↑ | 0.2314 | ± | 0.0385 |
| - jurisprudence | 1 | none | 0 | acc | ↑ | 0.2407 | ± | 0.0413 |
| - logical_fallacies | 1 | none | 0 | acc | ↑ | 0.2086 | ± | 0.0319 |
| - moral_disputes | 1 | none | 0 | acc | ↑ | 0.2081 | ± | 0.0219 |
| - moral_scenarios | 1 | none | 0 | acc | ↑ | 0.2693 | ± | 0.0148 |
| - philosophy | 1 | none | 0 | acc | ↑ | 0.1961 | ± | 0.0226 |
| - prehistory | 1 | none | 0 | acc | ↑ | 0.2284 | ± | 0.0234 |
| - professional_law | 1 | none | 0 | acc | ↑ | 0.2529 | ± | 0.0111 |
| - world_religions | 1 | none | 0 | acc | ↑ | 0.2982 | ± | 0.0351 |
| - other | 2 | none |  | acc | ↑ | 0.2536 | ± | 0.0078 |
| - business_ethics | 1 | none | 0 | acc | ↑ | 0.2700 | ± | 0.0446 |
| - clinical_knowledge | 1 | none | 0 | acc | ↑ | 0.2264 | ± | 0.0258 |
| - college_medicine | 1 | none | 0 | acc | ↑ | 0.2312 | ± | 0.0321 |
| - global_facts | 1 | none | 0 | acc | ↑ | 0.1500 | ± | 0.0359 |
| - human_aging | 1 | none | 0 | acc | ↑ | 0.2242 | ± | 0.0280 |
| - management | 1 | none | 0 | acc | ↑ | 0.1942 | ± | 0.0392 |
| - marketing | 1 | none | 0 | acc | ↑ | 0.3034 | ± | 0.0301 |
| - medical_genetics | 1 | none | 0 | acc | ↑ | 0.2200 | ± | 0.0416 |
| - miscellaneous | 1 | none | 0 | acc | ↑ | 0.2401 | ± | 0.0153 |
| - nutrition | 1 | none | 0 | acc | ↑ | 0.2255 | ± | 0.0239 |
| - professional_accounting | 1 | none | 0 | acc | ↑ | 0.2730 | ± | 0.0266 |
| - professional_medicine | 1 | none | 0 | acc | ↑ | 0.4081 | ± | 0.0299 |
| - virology | 1 | none | 0 | acc | ↑ | 0.2289 | ± | 0.0327 |
| - social sciences | 2 | none |  | acc | ↑ | 0.2535 | ± | 0.0079 |
| - econometrics | 1 | none | 0 | acc | ↑ | 0.2368 | ± | 0.0400 |
| - high_school_geography | 1 | none | 0 | acc | ↑ | 0.2323 | ± | 0.0301 |
| - high_school_government_and_politics | 1 | none | 0 | acc | ↑ | 0.2539 | ± | 0.0314 |
| - high_school_macroeconomics | 1 | none | 0 | acc | ↑ | 0.2436 | ± | 0.0218 |
| - high_school_microeconomics | 1 | none | 0 | acc | ↑ | 0.2311 | ± | 0.0274 |
| - high_school_psychology | 1 | none | 0 | acc | ↑ | 0.2550 | ± | 0.0187 |
| - human_sexuality | 1 | none | 0 | acc | ↑ | 0.2824 | ± | 0.0395 |
| - professional_psychology | 1 | none | 0 | acc | ↑ | 0.2484 | ± | 0.0175 |
| - public_relations | 1 | none | 0 | acc | ↑ | 0.2727 | ± | 0.0427 |
| - security_studies | 1 | none | 0 | acc | ↑ | 0.2939 | ± | 0.0292 |
| - sociology | 1 | none | 0 | acc | ↑ | 0.2488 | ± | 0.0306 |
| - us_foreign_policy | 1 | none | 0 | acc | ↑ | 0.2800 | ± | 0.0451 |
| - stem | 2 | none |  | acc | ↑ | 0.2293 | ± | 0.0075 |
| - abstract_algebra | 1 | none | 0 | acc | ↑ | 0.2200 | ± | 0.0416 |
| - anatomy | 1 | none | 0 | acc | ↑ | 0.2519 | ± | 0.0375 |
| - astronomy | 1 | none | 0 | acc | ↑ | 0.2697 | ± | 0.0361 |
| - college_biology | 1 | none | 0 | acc | ↑ | 0.2500 | ± | 0.0362 |
| - college_chemistry | 1 | none | 0 | acc | ↑ | 0.2400 | ± | 0.0429 |
| - college_computer_science | 1 | none | 0 | acc | ↑ | 0.2800 | ± | 0.0451 |
| - college_mathematics | 1 | none | 0 | acc | ↑ | 0.2000 | ± | 0.0402 |
| - college_physics | 1 | none | 0 | acc | ↑ | 0.2647 | ± | 0.0439 |
| - computer_security | 1 | none | 0 | acc | ↑ | 0.1900 | ± | 0.0394 |
| - conceptual_physics | 1 | none | 0 | acc | ↑ | 0.2340 | ± | 0.0277 |
| - electrical_engineering | 1 | none | 0 | acc | ↑ | 0.2414 | ± | 0.0357 |
| - elementary_mathematics | 1 | none | 0 | acc | ↑ | 0.1931 | ± | 0.0203 |
| - high_school_biology | 1 | none | 0 | acc | ↑ | 0.2323 | ± | 0.0240 |
| - high_school_chemistry | 1 | none | 0 | acc | ↑ | 0.2266 | ± | 0.0295 |
| - high_school_computer_science | 1 | none | 0 | acc | ↑ | 0.2400 | ± | 0.0429 |
| - high_school_mathematics | 1 | none | 0 | acc | ↑ | 0.2037 | ± | 0.0246 |
| - high_school_physics | 1 | none | 0 | acc | ↑ | 0.2185 | ± | 0.0337 |
| - high_school_statistics | 1 | none | 0 | acc | ↑ | 0.1898 | ± | 0.0267 |
| - machine_learning | 1 | none | 0 | acc | ↑ | 0.3393 | ± | 0.0449 |
| truthfulqa_mc2 | 2 | none | 0 | acc | ↑ | 0.5061 | ± | 0.0167 |
| winogrande | 1 | none | 0 | acc | ↑ | 0.4933 | ± | 0.0141 |

| Groups | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.2464 | ± | 0.0036 |
| - humanities | 2 | none |  | acc | ↑ | 0.2485 | ± | 0.0063 |
| - other | 2 | none |  | acc | ↑ | 0.2536 | ± | 0.0078 |
| - social sciences | 2 | none |  | acc | ↑ | 0.2535 | ± | 0.0079 |
| - stem | 2 | none |  | acc | ↑ | 0.2293 | ± | 0.0075 |
```sh
litgpt evaluate --tasks 'leaderboard' --out_dir 'evaluate-leaderboard/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard | N/A |  |  |  |  |  |  |  |
| - leaderboard_bbh | N/A |  |  |  |  |  |  |  |
| - leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm | ↑ | 0.4600 | ± | 0.0316 |
| - leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm | ↑ | 0.5134 | ± | 0.0366 |
| - leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm | ↑ | 0.1360 | ± | 0.0217 |
| - leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm | ↑ | 0.2960 | ± | 0.0289 |
| - leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm | ↑ | 0.4760 | ± | 0.0316 |
| - leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm | ↑ | 0.0800 | ± | 0.0172 |
| - leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm | ↑ | 0.5120 | ± | 0.0317 |
| - leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.1760 | ± | 0.0241 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.1320 | ± | 0.0215 |
| - leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3160 | ± | 0.0295 |
| - leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm | ↑ | 0.2480 | ± | 0.0274 |
| - leaderboard_bbh_navigate | 1 | none | 3 | acc_norm | ↑ | 0.4200 | ± | 0.0313 |
| - leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm | ↑ | 0.0360 | ± | 0.0118 |
| - leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm | ↑ | 0.1986 | ± | 0.0331 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm | ↑ | 0.0520 | ± | 0.0141 |
| - leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm | ↑ | 0.2760 | ± | 0.0283 |
| - leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm | ↑ | 0.1400 | ± | 0.0220 |
| - leaderboard_bbh_snarks | 1 | none | 3 | acc_norm | ↑ | 0.4326 | ± | 0.0372 |
| - leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm | ↑ | 0.4600 | ± | 0.0316 |
| - leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm | ↑ | 0.2680 | ± | 0.0281 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.2040 | ± | 0.0255 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.1640 | ± | 0.0235 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3840 | ± | 0.0308 |
| - leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm | ↑ | 0.4880 | ± | 0.0317 |
| - leaderboard_gpqa | N/A |  |  |  |  |  |  |  |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | ↑ | 0.2778 | ± | 0.0319 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | ↑ | 0.2766 | ± | 0.0192 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | ↑ | 0.2031 | ± | 0.0190 |
| - leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | ↑ | 0.1811 | ± | N/A |
|  |  | none | 0 | inst_level_strict_acc | ↑ | 0.1715 | ± | N/A |
|  |  | none | 0 | prompt_level_loose_acc | ↑ | 0.1091 | ± | 0.0134 |
|  |  | none | 0 | prompt_level_strict_acc | ↑ | 0.1035 | ± | 0.0131 |
| - leaderboard_math_hard | N/A |  |  |  |  |  |  |  |
| - leaderboard_math_algebra_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_counting_and_prob_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_geometry_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_intermediate_algebra_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_num_theory_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_prealgebra_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_precalculus_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_mmlu_pro | 0.1 | none | 5 | acc | ↑ | 0.1169 | ± | 0.0029 |
| - leaderboard_musr | N/A |  |  |  |  |  |  |  |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm | ↑ | 0.5080 | ± | 0.0317 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm | ↑ | 0.3008 | ± | 0.0287 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm | ↑ | 0.3760 | ± | 0.0307 |
```sh
litgpt evaluate --tasks 'gsm8k,mathqa' --out_dir 'evaluate-math/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0190 | ± | 0.0038 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| mathqa | 1 | none | 0 | acc | ↑ | 0.2060 | ± | 0.0074 |
|  |  | none | 0 | acc_norm | ↑ | 0.2057 | ± | 0.0074 |
```sh
litgpt evaluate --tasks 'mmlu,mmlu_pro' --out_dir 'evaluate-mmlu/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.2459 | ± | 0.0036 |
| - humanities | 2 | none |  | acc | ↑ | 0.2480 | ± | 0.0063 |
| - formal_logic | 1 | none | 0 | acc | ↑ | 0.3175 | ± | 0.0416 |
| - high_school_european_history | 1 | none | 0 | acc | ↑ | 0.2424 | ± | 0.0335 |
| - high_school_us_history | 1 | none | 0 | acc | ↑ | 0.2402 | ± | 0.0300 |
| - high_school_world_history | 1 | none | 0 | acc | ↑ | 0.2743 | ± | 0.0290 |
| - international_law | 1 | none | 0 | acc | ↑ | 0.2314 | ± | 0.0385 |
| - jurisprudence | 1 | none | 0 | acc | ↑ | 0.2315 | ± | 0.0408 |
| - logical_fallacies | 1 | none | 0 | acc | ↑ | 0.2209 | ± | 0.0326 |
| - moral_disputes | 1 | none | 0 | acc | ↑ | 0.2081 | ± | 0.0219 |
| - moral_scenarios | 1 | none | 0 | acc | ↑ | 0.2670 | ± | 0.0148 |
| - philosophy | 1 | none | 0 | acc | ↑ | 0.2090 | ± | 0.0231 |
| - prehistory | 1 | none | 0 | acc | ↑ | 0.2160 | ± | 0.0229 |
| - professional_law | 1 | none | 0 | acc | ↑ | 0.2516 | ± | 0.0111 |
| - world_religions | 1 | none | 0 | acc | ↑ | 0.3041 | ± | 0.0353 |
| - other | 2 | none |  | acc | ↑ | 0.2549 | ± | 0.0078 |
| - business_ethics | 1 | none | 0 | acc | ↑ | 0.2700 | ± | 0.0446 |
| - clinical_knowledge | 1 | none | 0 | acc | ↑ | 0.2264 | ± | 0.0258 |
| - college_medicine | 1 | none | 0 | acc | ↑ | 0.2428 | ± | 0.0327 |
| - global_facts | 1 | none | 0 | acc | ↑ | 0.1600 | ± | 0.0368 |
| - human_aging | 1 | none | 0 | acc | ↑ | 0.2242 | ± | 0.0280 |
| - management | 1 | none | 0 | acc | ↑ | 0.1845 | ± | 0.0384 |
| - marketing | 1 | none | 0 | acc | ↑ | 0.2949 | ± | 0.0299 |
| - medical_genetics | 1 | none | 0 | acc | ↑ | 0.2200 | ± | 0.0416 |
| - miscellaneous | 1 | none | 0 | acc | ↑ | 0.2478 | ± | 0.0154 |
| - nutrition | 1 | none | 0 | acc | ↑ | 0.2353 | ± | 0.0243 |
| - professional_accounting | 1 | none | 0 | acc | ↑ | 0.2553 | ± | 0.0260 |
| - professional_medicine | 1 | none | 0 | acc | ↑ | 0.4118 | ± | 0.0299 |
| - virology | 1 | none | 0 | acc | ↑ | 0.2229 | ± | 0.0324 |
| - social sciences | 2 | none |  | acc | ↑ | 0.2525 | ± | 0.0078 |
| - econometrics | 1 | none | 0 | acc | ↑ | 0.2368 | ± | 0.0400 |
| - high_school_geography | 1 | none | 0 | acc | ↑ | 0.2172 | ± | 0.0294 |
| - high_school_government_and_politics | 1 | none | 0 | acc | ↑ | 0.2539 | ± | 0.0314 |
| - high_school_macroeconomics | 1 | none | 0 | acc | ↑ | 0.2410 | ± | 0.0217 |
| - high_school_microeconomics | 1 | none | 0 | acc | ↑ | 0.2311 | ± | 0.0274 |
| - high_school_psychology | 1 | none | 0 | acc | ↑ | 0.2495 | ± | 0.0186 |
| - human_sexuality | 1 | none | 0 | acc | ↑ | 0.2824 | ± | 0.0395 |
| - professional_psychology | 1 | none | 0 | acc | ↑ | 0.2565 | ± | 0.0177 |
| - public_relations | 1 | none | 0 | acc | ↑ | 0.2636 | ± | 0.0422 |
| - security_studies | 1 | none | 0 | acc | ↑ | 0.2898 | ± | 0.0290 |
| - sociology | 1 | none | 0 | acc | ↑ | 0.2537 | ± | 0.0308 |
| - us_foreign_policy | 1 | none | 0 | acc | ↑ | 0.2800 | ± | 0.0451 |
| - stem | 2 | none |  | acc | ↑ | 0.2274 | ± | 0.0075 |
| - abstract_algebra | 1 | none | 0 | acc | ↑ | 0.2200 | ± | 0.0416 |
| - anatomy | 1 | none | 0 | acc | ↑ | 0.2444 | ± | 0.0371 |
| - astronomy | 1 | none | 0 | acc | ↑ | 0.2697 | ± | 0.0361 |
| - college_biology | 1 | none | 0 | acc | ↑ | 0.2500 | ± | 0.0362 |
| - college_chemistry | 1 | none | 0 | acc | ↑ | 0.2100 | ± | 0.0409 |
| - college_computer_science | 1 | none | 0 | acc | ↑ | 0.2800 | ± | 0.0451 |
| - college_mathematics | 1 | none | 0 | acc | ↑ | 0.1900 | ± | 0.0394 |
| - college_physics | 1 | none | 0 | acc | ↑ | 0.2549 | ± | 0.0434 |
| - computer_security | 1 | none | 0 | acc | ↑ | 0.1900 | ± | 0.0394 |
| - conceptual_physics | 1 | none | 0 | acc | ↑ | 0.2298 | ± | 0.0275 |
| - electrical_engineering | 1 | none | 0 | acc | ↑ | 0.2483 | ± | 0.0360 |
| - elementary_mathematics | 1 | none | 0 | acc | ↑ | 0.1931 | ± | 0.0203 |
| - high_school_biology | 1 | none | 0 | acc | ↑ | 0.2258 | ± | 0.0238 |
| - high_school_chemistry | 1 | none | 0 | acc | ↑ | 0.2217 | ± | 0.0292 |
| - high_school_computer_science | 1 | none | 0 | acc | ↑ | 0.2400 | ± | 0.0429 |
| - high_school_mathematics | 1 | none | 0 | acc | ↑ | 0.2074 | ± | 0.0247 |
| - high_school_physics | 1 | none | 0 | acc | ↑ | 0.2185 | ± | 0.0337 |
| - high_school_statistics | 1 | none | 0 | acc | ↑ | 0.1991 | ± | 0.0272 |
| - machine_learning | 1 | none | 0 | acc | ↑ | 0.3393 | ± | 0.0449 |
| mmlu_pro | 2 | custom-extract |  | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - biology | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - business | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - chemistry | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - computer_science | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - economics | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - engineering | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - health | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - history | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - law | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - math | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - other | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - philosophy | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - physics | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - psychology | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |

| Groups | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.2459 | ± | 0.0036 |
| - humanities | 2 | none |  | acc | ↑ | 0.2480 | ± | 0.0063 |
| - other | 2 | none |  | acc | ↑ | 0.2549 | ± | 0.0078 |
| - social sciences | 2 | none |  | acc | ↑ | 0.2525 | ± | 0.0078 |
| - stem | 2 | none |  | acc | ↑ | 0.2274 | ± | 0.0075 |
| mmlu_pro | 2 | custom-extract |  | exact_match | ↑ | 0.0000 | ± | 0.0000 |
```sh
litgpt evaluate --tasks 'arc_challenge,boolq,gpqa,hellaswag,openbookqa,piqa,truthfulqa_mc2,winogrande' --out_dir 'evaluate-reasoning/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 0 | acc | ↑ | 0.2176 | ± | 0.0121 |
|  |  | none | 0 | acc_norm | ↑ | 0.2560 | ± | 0.0128 |
| boolq | 2 | none | 0 | acc | ↑ | 0.3783 | ± | 0.0085 |
| gpqa_diamond_cot_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.0051 | ± | 0.0051 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_diamond_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | ↑ | 0.0051 | ± | 0.0051 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_diamond_generative_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.0051 | ± | 0.0051 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_diamond_n_shot | 2 | none | 0 | acc | ↑ | 0.1970 | ± | 0.0283 |
|  |  | none | 0 | acc_norm | ↑ | 0.1970 | ± | 0.0283 |
| gpqa_diamond_zeroshot | 1 | none | 0 | acc | ↑ | 0.2727 | ± | 0.0317 |
|  |  | none | 0 | acc_norm | ↑ | 0.2727 | ± | 0.0317 |
| gpqa_extended_cot_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.0018 | ± | 0.0018 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_extended_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | ↑ | 0.0037 | ± | 0.0026 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_extended_generative_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.0073 | ± | 0.0037 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_extended_n_shot | 2 | none | 0 | acc | ↑ | 0.2564 | ± | 0.0187 |
|  |  | none | 0 | acc_norm | ↑ | 0.2564 | ± | 0.0187 |
| gpqa_extended_zeroshot | 1 | none | 0 | acc | ↑ | 0.2802 | ± | 0.0192 |
|  |  | none | 0 | acc_norm | ↑ | 0.2802 | ± | 0.0192 |
| gpqa_main_cot_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_main_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_main_generative_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.0089 | ± | 0.0044 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_main_n_shot | 2 | none | 0 | acc | ↑ | 0.2478 | ± | 0.0204 |
|  |  | none | 0 | acc_norm | ↑ | 0.2478 | ± | 0.0204 |
| gpqa_main_zeroshot | 1 | none | 0 | acc | ↑ | 0.2143 | ± | 0.0194 |
|  |  | none | 0 | acc_norm | ↑ | 0.2143 | ± | 0.0194 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.2618 | ± | 0.0044 |
|  |  | none | 0 | acc_norm | ↑ | 0.2592 | ± | 0.0044 |
| openbookqa | 1 | none | 0 | acc | ↑ | 0.1340 | ± | 0.0152 |
|  |  | none | 0 | acc_norm | ↑ | 0.2340 | ± | 0.0190 |
| piqa | 1 | none | 0 | acc | ↑ | 0.5201 | ± | 0.0117 |
|  |  | none | 0 | acc_norm | ↑ | 0.5076 | ± | 0.0117 |
| truthfulqa_mc2 | 2 | none | 0 | acc | ↑ | 0.5061 | ± | 0.0167 |
| winogrande | 1 | none | 0 | acc | ↑ | 0.4933 | ± | 0.0141 |
```sh
litgpt evaluate --tasks 'wikitext,qasper' --out_dir 'evaluate-long/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| qasper_bool | 1 | none | 0 | f1 | ↑ | 0.0000 | ± | 0 |
| qasper_freeform | 2 | none | 0 | f1_abstractive | ↑ | 0.0036 | ± | 0.001 |
| wikitext | 2 | none | 0 | bits_per_byte | ↓ | 3.0634 | ± | N/A |
|  |  | none | 0 | byte_perplexity | ↓ | 8.3596 | ± | N/A |
|  |  | none | 0 | word_perplexity | ↓ | 85375.3002 | ± | N/A |
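For reference, the two byte-level wikitext metrics above are tied together by byte_perplexity = 2^bits_per_byte (the standard lm-evaluation-harness definitions), which the reported values satisfy:

```python
# Consistency check of the wikitext metrics reported above (values copied from the table).
bits_per_byte = 3.0634
byte_perplexity = 2 ** bits_per_byte
print(round(byte_perplexity, 2))  # ~8.36, matching the reported byte_perplexity of 8.3596
```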
Continued Pretrain Evaluation
lm-evaluation-harness
```sh
litgpt evaluate --tasks 'hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge' --out_dir 'evaluate-contrain-quick/' --batch_size 4 --dtype 'bfloat16' out/contrain/final/
```

| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 0 | acc | ↑ | 0.2142 | ± | 0.0120 |
|  |  | none | 0 | acc_norm | ↑ | 0.2551 | ± | 0.0127 |
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0136 | ± | 0.0032 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.2626 | ± | 0.0044 |
|  |  | none | 0 | acc_norm | ↑ | 0.2594 | ± | 0.0044 |
| mmlu | 2 | none |  | acc | ↑ | 0.2441 | ± | 0.0036 |
| - humanities | 2 | none |  | acc | ↑ | 0.2417 | ± | 0.0062 |
| - formal_logic | 1 | none | 0 | acc | ↑ | 0.2937 | ± | 0.0407 |
| - high_school_european_history | 1 | none | 0 | acc | ↑ | 0.2182 | ± | 0.0323 |
| - high_school_us_history | 1 | none | 0 | acc | ↑ | 0.2402 | ± | 0.0300 |
| - high_school_world_history | 1 | none | 0 | acc | ↑ | 0.2700 | ± | 0.0289 |
| - international_law | 1 | none | 0 | acc | ↑ | 0.1901 | ± | 0.0358 |
| - jurisprudence | 1 | none | 0 | acc | ↑ | 0.2778 | ± | 0.0433 |
| - logical_fallacies | 1 | none | 0 | acc | ↑ | 0.2086 | ± | 0.0319 |
| - moral_disputes | 1 | none | 0 | acc | ↑ | 0.2110 | ± | 0.0220 |
| - moral_scenarios | 1 | none | 0 | acc | ↑ | 0.2704 | ± | 0.0149 |
| - philosophy | 1 | none | 0 | acc | ↑ | 0.1897 | ± | 0.0223 |
| - prehistory | 1 | none | 0 | acc | ↑ | 0.2130 | ± | 0.0228 |
| - professional_law | 1 | none | 0 | acc | ↑ | 0.2445 | ± | 0.0110 |
| - world_religions | 1 | none | 0 | acc | ↑ | 0.2690 | ± | 0.0340 |
| - other | 2 | none |  | acc | ↑ | 0.2546 | ± | 0.0078 |
| - business_ethics | 1 | none | 0 | acc | ↑ | 0.2600 | ± | 0.0441 |
| - clinical_knowledge | 1 | none | 0 | acc | ↑ | 0.2491 | ± | 0.0266 |
| - college_medicine | 1 | none | 0 | acc | ↑ | 0.2543 | ± | 0.0332 |
| - global_facts | 1 | none | 0 | acc | ↑ | 0.1900 | ± | 0.0394 |
| - human_aging | 1 | none | 0 | acc | ↑ | 0.2287 | ± | 0.0282 |
| - management | 1 | none | 0 | acc | ↑ | 0.2233 | ± | 0.0412 |
| - marketing | 1 | none | 0 | acc | ↑ | 0.2863 | ± | 0.0296 |
| - medical_genetics | 1 | none | 0 | acc | ↑ | 0.2100 | ± | 0.0409 |
| - miscellaneous | 1 | none | 0 | acc | ↑ | 0.2197 | ± | 0.0148 |
| - nutrition | 1 | none | 0 | acc | ↑ | 0.2680 | ± | 0.0254 |
| - professional_accounting | 1 | none | 0 | acc | ↑ | 0.2624 | ± | 0.0262 |
| - professional_medicine | 1 | none | 0 | acc | ↑ | 0.3824 | ± | 0.0295 |
| - virology | 1 | none | 0 | acc | ↑ | 0.2530 | ± | 0.0338 |
| - social sciences | 2 | none |  | acc | ↑ | 0.2428 | ± | 0.0077 |
| - econometrics | 1 | none | 0 | acc | ↑ | 0.2456 | ± | 0.0405 |
| - high_school_geography | 1 | none | 0 | acc | ↑ | 0.2323 | ± | 0.0301 |
| - high_school_government_and_politics | 1 | none | 0 | acc | ↑ | 0.2383 | ± | 0.0307 |
| - high_school_macroeconomics | 1 | none | 0 | acc | ↑ | 0.2385 | ± | 0.0216 |
| - high_school_microeconomics | 1 | none | 0 | acc | ↑ | 0.2017 | ± | 0.0261 |
| - high_school_psychology | 1 | none | 0 | acc | ↑ | 0.2550 | ± | 0.0187 |
| - human_sexuality | 1 | none | 0 | acc | ↑ | 0.2748 | ± | 0.0392 |
| - professional_psychology | 1 | none | 0 | acc | ↑ | 0.2386 | ± | 0.0172 |
| - public_relations | 1 | none | 0 | acc | ↑ | 0.2545 | ± | 0.0417 |
| - security_studies | 1 | none | 0 | acc | ↑ | 0.2531 | ± | 0.0278 |
| - sociology | 1 | none | 0 | acc | ↑ | 0.2587 | ± | 0.0310 |
| - us_foreign_policy | 1 | none | 0 | acc | ↑ | 0.2300 | ± | 0.0423 |
| - stem | 2 | none |  | acc | ↑ | 0.2388 | ± | 0.0076 |
| - abstract_algebra | 1 | none | 0 | acc | ↑ | 0.2200 | ± | 0.0416 |
| - anatomy | 1 | none | 0 | acc | ↑ | 0.2074 | ± | 0.0350 |
| - astronomy | 1 | none | 0 | acc | ↑ | 0.2632 | ± | 0.0358 |
| - college_biology | 1 | none | 0 | acc | ↑ | 0.2361 | ± | 0.0355 |
| - college_chemistry | 1 | none | 0 | acc | ↑ | 0.2500 | ± | 0.0435 |
| - college_computer_science | 1 | none | 0 | acc | ↑ | 0.3300 | ± | 0.0473 |
| - college_mathematics | 1 | none | 0 | acc | ↑ | 0.2100 | ± | 0.0409 |
| - college_physics | 1 | none | 0 | acc | ↑ | 0.3039 | ± | 0.0458 |
| - computer_security | 1 | none | 0 | acc | ↑ | 0.2800 | ± | 0.0451 |
| - conceptual_physics | 1 | none | 0 | acc | ↑ | 0.2681 | ± | 0.0290 |
| - electrical_engineering | 1 | none | 0 | acc | ↑ | 0.2621 | ± | 0.0366 |
| - elementary_mathematics | 1 | none | 0 | acc | ↑ | 0.2196 | ± | 0.0213 |
| - high_school_biology | 1 | none | 0 | acc | ↑ | 0.2484 | ± | 0.0246 |
| - high_school_chemistry | 1 | none | 0 | acc | ↑ | 0.1823 | ± | 0.0272 |
| - high_school_computer_science | 1 | none | 0 | acc | ↑ | 0.2200 | ± | 0.0416 |
| - high_school_mathematics | 1 | none | 0 | acc | ↑ | 0.2111 | ± | 0.0249 |
| - high_school_physics | 1 | none | 0 | acc | ↑ | 0.1987 | ± | 0.0326 |
| - high_school_statistics | 1 | none | 0 | acc | ↑ | 0.2130 | ± | 0.0279 |
| - machine_learning | 1 | none | 0 | acc | ↑ | 0.3393 | ± | 0.0449 |
| truthfulqa_mc2 | 2 | none | 0 | acc | ↑ | 0.5067 | ± | 0.0167 |
| winogrande | 1 | none | 0 | acc | ↑ | 0.4759 | ± | 0.0140 |

| Groups | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.2441 | ± | 0.0036 |
| - humanities | 2 | none |  | acc | ↑ | 0.2417 | ± | 0.0062 |
| - other | 2 | none |  | acc | ↑ | 0.2546 | ± | 0.0078 |
| - social sciences | 2 | none |  | acc | ↑ | 0.2428 | ± | 0.0077 |
| - stem | 2 | none |  | acc | ↑ | 0.2388 | ± | 0.0076 |
```sh
litgpt evaluate --tasks 'gsm8k,mathqa' --out_dir 'evaluate-contrain-math/' --batch_size 4 --dtype 'bfloat16' out/contrain/final/
```

| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0136 | ± | 0.0032 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| mathqa | 1 | none | 0 | acc | ↑ | 0.2023 | ± | 0.0074 |
|  |  | none | 0 | acc_norm | ↑ | 0.1977 | ± | 0.0073 |