tangled-llama-a-128k-base-v0.1
A pretrained language model based on the Llama architecture with about 62.9M parameters. It was trained on 10.6B (10,630,121,844) tokens drawn from more than 31.3M (31,383,840) dataset rows.

This model is not designed for immediate use but rather as a base for Continued Pretraining and Finetuning on a downstream task. While it can handle a context length of up to 128K (131,072) tokens, it was pretrained with sequences of 2K (2048) tokens.

The objective is to keep a lean cognitive/reasoning core while eliminating redundant knowledge from the model.
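As a quick smoke test of a local checkpoint (for example the `out/pretrain/final/` directory that the evaluation commands below point at), a minimal generation sketch using litgpt's Python API might look like the following; the exact API surface can differ between litgpt versions, and the prompt is purely illustrative.

```python
# Minimal sketch, assuming litgpt is installed and a litgpt-format checkpoint directory is on disk.
from litgpt import LLM

llm = LLM.load("out/pretrain/final/")  # same path used by the litgpt evaluate commands below
print(llm.generate("Water boils at", max_new_tokens=32))
```

For the intended use, continued pretraining and finetuning, the same checkpoint directory can be passed to litgpt's `pretrain`/`finetune` commands; see the litgpt documentation for the exact flags, which are not reproduced here.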
Training metrics logged during pretraining: loss / val_loss, val_ppl, epoch, learning_rate.
Pretrain Evaluation
lm-evaluation-harness
```sh
litgpt evaluate --tasks 'hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge' --out_dir 'evaluate-quick/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 0 | acc | ↑ | 0.2176 | ± | 0.0121 |
|  |  | none | 0 | acc_norm | ↑ | 0.2560 | ± | 0.0128 |
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0190 | ± | 0.0038 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.2618 | ± | 0.0044 |
|  |  | none | 0 | acc_norm | ↑ | 0.2592 | ± | 0.0044 |
| mmlu | 2 | none |  | acc | ↑ | 0.2464 | ± | 0.0036 |
| - humanities | 2 | none |  | acc | ↑ | 0.2485 | ± | 0.0063 |
| - formal_logic | 1 | none | 0 | acc | ↑ | 0.3175 | ± | 0.0416 |
| - high_school_european_history | 1 | none | 0 | acc | ↑ | 0.2364 | ± | 0.0332 |
| - high_school_us_history | 1 | none | 0 | acc | ↑ | 0.2402 | ± | 0.0300 |
| - high_school_world_history | 1 | none | 0 | acc | ↑ | 0.2785 | ± | 0.0292 |
| - international_law | 1 | none | 0 | acc | ↑ | 0.2314 | ± | 0.0385 |
| - jurisprudence | 1 | none | 0 | acc | ↑ | 0.2407 | ± | 0.0413 |
| - logical_fallacies | 1 | none | 0 | acc | ↑ | 0.2086 | ± | 0.0319 |
| - moral_disputes | 1 | none | 0 | acc | ↑ | 0.2081 | ± | 0.0219 |
| - moral_scenarios | 1 | none | 0 | acc | ↑ | 0.2693 | ± | 0.0148 |
| - philosophy | 1 | none | 0 | acc | ↑ | 0.1961 | ± | 0.0226 |
| - prehistory | 1 | none | 0 | acc | ↑ | 0.2284 | ± | 0.0234 |
| - professional_law | 1 | none | 0 | acc | ↑ | 0.2529 | ± | 0.0111 |
| - world_religions | 1 | none | 0 | acc | ↑ | 0.2982 | ± | 0.0351 |
| - other | 2 | none |  | acc | ↑ | 0.2536 | ± | 0.0078 |
| - business_ethics | 1 | none | 0 | acc | ↑ | 0.2700 | ± | 0.0446 |
| - clinical_knowledge | 1 | none | 0 | acc | ↑ | 0.2264 | ± | 0.0258 |
| - college_medicine | 1 | none | 0 | acc | ↑ | 0.2312 | ± | 0.0321 |
| - global_facts | 1 | none | 0 | acc | ↑ | 0.1500 | ± | 0.0359 |
| - human_aging | 1 | none | 0 | acc | ↑ | 0.2242 | ± | 0.0280 |
| - management | 1 | none | 0 | acc | ↑ | 0.1942 | ± | 0.0392 |
| - marketing | 1 | none | 0 | acc | ↑ | 0.3034 | ± | 0.0301 |
| - medical_genetics | 1 | none | 0 | acc | ↑ | 0.2200 | ± | 0.0416 |
| - miscellaneous | 1 | none | 0 | acc | ↑ | 0.2401 | ± | 0.0153 |
| - nutrition | 1 | none | 0 | acc | ↑ | 0.2255 | ± | 0.0239 |
| - professional_accounting | 1 | none | 0 | acc | ↑ | 0.2730 | ± | 0.0266 |
| - professional_medicine | 1 | none | 0 | acc | ↑ | 0.4081 | ± | 0.0299 |
| - virology | 1 | none | 0 | acc | ↑ | 0.2289 | ± | 0.0327 |
| - social sciences | 2 | none |  | acc | ↑ | 0.2535 | ± | 0.0079 |
| - econometrics | 1 | none | 0 | acc | ↑ | 0.2368 | ± | 0.0400 |
| - high_school_geography | 1 | none | 0 | acc | ↑ | 0.2323 | ± | 0.0301 |
| - high_school_government_and_politics | 1 | none | 0 | acc | ↑ | 0.2539 | ± | 0.0314 |
| - high_school_macroeconomics | 1 | none | 0 | acc | ↑ | 0.2436 | ± | 0.0218 |
| - high_school_microeconomics | 1 | none | 0 | acc | ↑ | 0.2311 | ± | 0.0274 |
| - high_school_psychology | 1 | none | 0 | acc | ↑ | 0.2550 | ± | 0.0187 |
| - human_sexuality | 1 | none | 0 | acc | ↑ | 0.2824 | ± | 0.0395 |
| - professional_psychology | 1 | none | 0 | acc | ↑ | 0.2484 | ± | 0.0175 |
| - public_relations | 1 | none | 0 | acc | ↑ | 0.2727 | ± | 0.0427 |
| - security_studies | 1 | none | 0 | acc | ↑ | 0.2939 | ± | 0.0292 |
| - sociology | 1 | none | 0 | acc | ↑ | 0.2488 | ± | 0.0306 |
| - us_foreign_policy | 1 | none | 0 | acc | ↑ | 0.2800 | ± | 0.0451 |
| - stem | 2 | none |  | acc | ↑ | 0.2293 | ± | 0.0075 |
| - abstract_algebra | 1 | none | 0 | acc | ↑ | 0.2200 | ± | 0.0416 |
| - anatomy | 1 | none | 0 | acc | ↑ | 0.2519 | ± | 0.0375 |
| - astronomy | 1 | none | 0 | acc | ↑ | 0.2697 | ± | 0.0361 |
| - college_biology | 1 | none | 0 | acc | ↑ | 0.2500 | ± | 0.0362 |
| - college_chemistry | 1 | none | 0 | acc | ↑ | 0.2400 | ± | 0.0429 |
| - college_computer_science | 1 | none | 0 | acc | ↑ | 0.2800 | ± | 0.0451 |
| - college_mathematics | 1 | none | 0 | acc | ↑ | 0.2000 | ± | 0.0402 |
| - college_physics | 1 | none | 0 | acc | ↑ | 0.2647 | ± | 0.0439 |
| - computer_security | 1 | none | 0 | acc | ↑ | 0.1900 | ± | 0.0394 |
| - conceptual_physics | 1 | none | 0 | acc | ↑ | 0.2340 | ± | 0.0277 |
| - electrical_engineering | 1 | none | 0 | acc | ↑ | 0.2414 | ± | 0.0357 |
| - elementary_mathematics | 1 | none | 0 | acc | ↑ | 0.1931 | ± | 0.0203 |
| - high_school_biology | 1 | none | 0 | acc | ↑ | 0.2323 | ± | 0.0240 |
| - high_school_chemistry | 1 | none | 0 | acc | ↑ | 0.2266 | ± | 0.0295 |
| - high_school_computer_science | 1 | none | 0 | acc | ↑ | 0.2400 | ± | 0.0429 |
| - high_school_mathematics | 1 | none | 0 | acc | ↑ | 0.2037 | ± | 0.0246 |
| - high_school_physics | 1 | none | 0 | acc | ↑ | 0.2185 | ± | 0.0337 |
| - high_school_statistics | 1 | none | 0 | acc | ↑ | 0.1898 | ± | 0.0267 |
| - machine_learning | 1 | none | 0 | acc | ↑ | 0.3393 | ± | 0.0449 |
| truthfulqa_mc2 | 2 | none | 0 | acc | ↑ | 0.5061 | ± | 0.0167 |
| winogrande | 1 | none | 0 | acc | ↑ | 0.4933 | ± | 0.0141 |

| Groups | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.2464 | ± | 0.0036 |
| - humanities | 2 | none |  | acc | ↑ | 0.2485 | ± | 0.0063 |
| - other | 2 | none |  | acc | ↑ | 0.2536 | ± | 0.0078 |
| - social sciences | 2 | none |  | acc | ↑ | 0.2535 | ± | 0.0079 |
| - stem | 2 | none |  | acc | ↑ | 0.2293 | ± | 0.0075 |
```sh
litgpt evaluate --tasks 'leaderboard' --out_dir 'evaluate-leaderboard/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard | N/A |  |  |  |  |  |  |  |
| - leaderboard_bbh | N/A |  |  |  |  |  |  |  |
| - leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm | ↑ | 0.4600 | ± | 0.0316 |
| - leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm | ↑ | 0.5134 | ± | 0.0366 |
| - leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm | ↑ | 0.1360 | ± | 0.0217 |
| - leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm | ↑ | 0.2960 | ± | 0.0289 |
| - leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm | ↑ | 0.4760 | ± | 0.0316 |
| - leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm | ↑ | 0.0800 | ± | 0.0172 |
| - leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm | ↑ | 0.5120 | ± | 0.0317 |
| - leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.1760 | ± | 0.0241 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.1320 | ± | 0.0215 |
| - leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3160 | ± | 0.0295 |
| - leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm | ↑ | 0.2480 | ± | 0.0274 |
| - leaderboard_bbh_navigate | 1 | none | 3 | acc_norm | ↑ | 0.4200 | ± | 0.0313 |
| - leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm | ↑ | 0.0360 | ± | 0.0118 |
| - leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm | ↑ | 0.1986 | ± | 0.0331 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm | ↑ | 0.0520 | ± | 0.0141 |
| - leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm | ↑ | 0.2760 | ± | 0.0283 |
| - leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm | ↑ | 0.1400 | ± | 0.0220 |
| - leaderboard_bbh_snarks | 1 | none | 3 | acc_norm | ↑ | 0.4326 | ± | 0.0372 |
| - leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm | ↑ | 0.4600 | ± | 0.0316 |
| - leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm | ↑ | 0.2680 | ± | 0.0281 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.2040 | ± | 0.0255 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.1640 | ± | 0.0235 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3840 | ± | 0.0308 |
| - leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm | ↑ | 0.4880 | ± | 0.0317 |
| - leaderboard_gpqa | N/A |  |  |  |  |  |  |  |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | ↑ | 0.2778 | ± | 0.0319 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | ↑ | 0.2766 | ± | 0.0192 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | ↑ | 0.2031 | ± | 0.0190 |
| - leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | ↑ | 0.1811 | ± | N/A |
|  |  | none | 0 | inst_level_strict_acc | ↑ | 0.1715 | ± | N/A |
|  |  | none | 0 | prompt_level_loose_acc | ↑ | 0.1091 | ± | 0.0134 |
|  |  | none | 0 | prompt_level_strict_acc | ↑ | 0.1035 | ± | 0.0131 |
| - leaderboard_math_hard | N/A |  |  |  |  |  |  |  |
| - leaderboard_math_algebra_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_counting_and_prob_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_geometry_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_intermediate_algebra_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_num_theory_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_prealgebra_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_precalculus_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_mmlu_pro | 0.1 | none | 5 | acc | ↑ | 0.1169 | ± | 0.0029 |
| - leaderboard_musr | N/A |  |  |  |  |  |  |  |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm | ↑ | 0.5080 | ± | 0.0317 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm | ↑ | 0.3008 | ± | 0.0287 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm | ↑ | 0.3760 | ± | 0.0307 |
```sh
litgpt evaluate --tasks 'gsm8k,mathqa' --out_dir 'evaluate-math/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0190 | ± | 0.0038 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| mathqa | 1 | none | 0 | acc | ↑ | 0.2060 | ± | 0.0074 |
|  |  | none | 0 | acc_norm | ↑ | 0.2057 | ± | 0.0074 |
```sh
litgpt evaluate --tasks 'mmlu,mmlu_pro' --out_dir 'evaluate-mmlu/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.2459 | ± | 0.0036 |
| - humanities | 2 | none |  | acc | ↑ | 0.2480 | ± | 0.0063 |
| - formal_logic | 1 | none | 0 | acc | ↑ | 0.3175 | ± | 0.0416 |
| - high_school_european_history | 1 | none | 0 | acc | ↑ | 0.2424 | ± | 0.0335 |
| - high_school_us_history | 1 | none | 0 | acc | ↑ | 0.2402 | ± | 0.0300 |
| - high_school_world_history | 1 | none | 0 | acc | ↑ | 0.2743 | ± | 0.0290 |
| - international_law | 1 | none | 0 | acc | ↑ | 0.2314 | ± | 0.0385 |
| - jurisprudence | 1 | none | 0 | acc | ↑ | 0.2315 | ± | 0.0408 |
| - logical_fallacies | 1 | none | 0 | acc | ↑ | 0.2209 | ± | 0.0326 |
| - moral_disputes | 1 | none | 0 | acc | ↑ | 0.2081 | ± | 0.0219 |
| - moral_scenarios | 1 | none | 0 | acc | ↑ | 0.2670 | ± | 0.0148 |
| - philosophy | 1 | none | 0 | acc | ↑ | 0.2090 | ± | 0.0231 |
| - prehistory | 1 | none | 0 | acc | ↑ | 0.2160 | ± | 0.0229 |
| - professional_law | 1 | none | 0 | acc | ↑ | 0.2516 | ± | 0.0111 |
| - world_religions | 1 | none | 0 | acc | ↑ | 0.3041 | ± | 0.0353 |
| - other | 2 | none |  | acc | ↑ | 0.2549 | ± | 0.0078 |
| - business_ethics | 1 | none | 0 | acc | ↑ | 0.2700 | ± | 0.0446 |
| - clinical_knowledge | 1 | none | 0 | acc | ↑ | 0.2264 | ± | 0.0258 |
| - college_medicine | 1 | none | 0 | acc | ↑ | 0.2428 | ± | 0.0327 |
| - global_facts | 1 | none | 0 | acc | ↑ | 0.1600 | ± | 0.0368 |
| - human_aging | 1 | none | 0 | acc | ↑ | 0.2242 | ± | 0.0280 |
| - management | 1 | none | 0 | acc | ↑ | 0.1845 | ± | 0.0384 |
| - marketing | 1 | none | 0 | acc | ↑ | 0.2949 | ± | 0.0299 |
| - medical_genetics | 1 | none | 0 | acc | ↑ | 0.2200 | ± | 0.0416 |
| - miscellaneous | 1 | none | 0 | acc | ↑ | 0.2478 | ± | 0.0154 |
| - nutrition | 1 | none | 0 | acc | ↑ | 0.2353 | ± | 0.0243 |
| - professional_accounting | 1 | none | 0 | acc | ↑ | 0.2553 | ± | 0.0260 |
| - professional_medicine | 1 | none | 0 | acc | ↑ | 0.4118 | ± | 0.0299 |
| - virology | 1 | none | 0 | acc | ↑ | 0.2229 | ± | 0.0324 |
| - social sciences | 2 | none |  | acc | ↑ | 0.2525 | ± | 0.0078 |
| - econometrics | 1 | none | 0 | acc | ↑ | 0.2368 | ± | 0.0400 |
| - high_school_geography | 1 | none | 0 | acc | ↑ | 0.2172 | ± | 0.0294 |
| - high_school_government_and_politics | 1 | none | 0 | acc | ↑ | 0.2539 | ± | 0.0314 |
| - high_school_macroeconomics | 1 | none | 0 | acc | ↑ | 0.2410 | ± | 0.0217 |
| - high_school_microeconomics | 1 | none | 0 | acc | ↑ | 0.2311 | ± | 0.0274 |
| - high_school_psychology | 1 | none | 0 | acc | ↑ | 0.2495 | ± | 0.0186 |
| - human_sexuality | 1 | none | 0 | acc | ↑ | 0.2824 | ± | 0.0395 |
| - professional_psychology | 1 | none | 0 | acc | ↑ | 0.2565 | ± | 0.0177 |
| - public_relations | 1 | none | 0 | acc | ↑ | 0.2636 | ± | 0.0422 |
| - security_studies | 1 | none | 0 | acc | ↑ | 0.2898 | ± | 0.0290 |
| - sociology | 1 | none | 0 | acc | ↑ | 0.2537 | ± | 0.0308 |
| - us_foreign_policy | 1 | none | 0 | acc | ↑ | 0.2800 | ± | 0.0451 |
| - stem | 2 | none |  | acc | ↑ | 0.2274 | ± | 0.0075 |
| - abstract_algebra | 1 | none | 0 | acc | ↑ | 0.2200 | ± | 0.0416 |
| - anatomy | 1 | none | 0 | acc | ↑ | 0.2444 | ± | 0.0371 |
| - astronomy | 1 | none | 0 | acc | ↑ | 0.2697 | ± | 0.0361 |
| - college_biology | 1 | none | 0 | acc | ↑ | 0.2500 | ± | 0.0362 |
| - college_chemistry | 1 | none | 0 | acc | ↑ | 0.2100 | ± | 0.0409 |
| - college_computer_science | 1 | none | 0 | acc | ↑ | 0.2800 | ± | 0.0451 |
| - college_mathematics | 1 | none | 0 | acc | ↑ | 0.1900 | ± | 0.0394 |
| - college_physics | 1 | none | 0 | acc | ↑ | 0.2549 | ± | 0.0434 |
| - computer_security | 1 | none | 0 | acc | ↑ | 0.1900 | ± | 0.0394 |
| - conceptual_physics | 1 | none | 0 | acc | ↑ | 0.2298 | ± | 0.0275 |
| - electrical_engineering | 1 | none | 0 | acc | ↑ | 0.2483 | ± | 0.0360 |
| - elementary_mathematics | 1 | none | 0 | acc | ↑ | 0.1931 | ± | 0.0203 |
| - high_school_biology | 1 | none | 0 | acc | ↑ | 0.2258 | ± | 0.0238 |
| - high_school_chemistry | 1 | none | 0 | acc | ↑ | 0.2217 | ± | 0.0292 |
| - high_school_computer_science | 1 | none | 0 | acc | ↑ | 0.2400 | ± | 0.0429 |
| - high_school_mathematics | 1 | none | 0 | acc | ↑ | 0.2074 | ± | 0.0247 |
| - high_school_physics | 1 | none | 0 | acc | ↑ | 0.2185 | ± | 0.0337 |
| - high_school_statistics | 1 | none | 0 | acc | ↑ | 0.1991 | ± | 0.0272 |
| - machine_learning | 1 | none | 0 | acc | ↑ | 0.3393 | ± | 0.0449 |
| mmlu_pro | 2 | custom-extract |  | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - biology | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - business | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - chemistry | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - computer_science | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - economics | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - engineering | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - health | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - history | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - law | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - math | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - other | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - philosophy | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - physics | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - psychology | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |

| Groups | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.2459 | ± | 0.0036 |
| - humanities | 2 | none |  | acc | ↑ | 0.2480 | ± | 0.0063 |
| - other | 2 | none |  | acc | ↑ | 0.2549 | ± | 0.0078 |
| - social sciences | 2 | none |  | acc | ↑ | 0.2525 | ± | 0.0078 |
| - stem | 2 | none |  | acc | ↑ | 0.2274 | ± | 0.0075 |
| mmlu_pro | 2 | custom-extract |  | exact_match | ↑ | 0.0000 | ± | 0.0000 |
```sh
litgpt evaluate --tasks 'arc_challenge,boolq,gpqa,hellaswag,openbookqa,piqa,truthfulqa_mc2,winogrande' --out_dir 'evaluate-reasoning/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 0 | acc | ↑ | 0.2176 | ± | 0.0121 |
|  |  | none | 0 | acc_norm | ↑ | 0.2560 | ± | 0.0128 |
| boolq | 2 | none | 0 | acc | ↑ | 0.3783 | ± | 0.0085 |
| gpqa_diamond_cot_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.0051 | ± | 0.0051 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_diamond_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | ↑ | 0.0051 | ± | 0.0051 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_diamond_generative_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.0051 | ± | 0.0051 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_diamond_n_shot | 2 | none | 0 | acc | ↑ | 0.1970 | ± | 0.0283 |
|  |  | none | 0 | acc_norm | ↑ | 0.1970 | ± | 0.0283 |
| gpqa_diamond_zeroshot | 1 | none | 0 | acc | ↑ | 0.2727 | ± | 0.0317 |
|  |  | none | 0 | acc_norm | ↑ | 0.2727 | ± | 0.0317 |
| gpqa_extended_cot_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.0018 | ± | 0.0018 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_extended_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | ↑ | 0.0037 | ± | 0.0026 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_extended_generative_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.0073 | ± | 0.0037 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_extended_n_shot | 2 | none | 0 | acc | ↑ | 0.2564 | ± | 0.0187 |
|  |  | none | 0 | acc_norm | ↑ | 0.2564 | ± | 0.0187 |
| gpqa_extended_zeroshot | 1 | none | 0 | acc | ↑ | 0.2802 | ± | 0.0192 |
|  |  | none | 0 | acc_norm | ↑ | 0.2802 | ± | 0.0192 |
| gpqa_main_cot_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_main_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_main_generative_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.0089 | ± | 0.0044 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_main_n_shot | 2 | none | 0 | acc | ↑ | 0.2478 | ± | 0.0204 |
|  |  | none | 0 | acc_norm | ↑ | 0.2478 | ± | 0.0204 |
| gpqa_main_zeroshot | 1 | none | 0 | acc | ↑ | 0.2143 | ± | 0.0194 |
|  |  | none | 0 | acc_norm | ↑ | 0.2143 | ± | 0.0194 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.2618 | ± | 0.0044 |
|  |  | none | 0 | acc_norm | ↑ | 0.2592 | ± | 0.0044 |
| openbookqa | 1 | none | 0 | acc | ↑ | 0.1340 | ± | 0.0152 |
|  |  | none | 0 | acc_norm | ↑ | 0.2340 | ± | 0.0190 |
| piqa | 1 | none | 0 | acc | ↑ | 0.5201 | ± | 0.0117 |
|  |  | none | 0 | acc_norm | ↑ | 0.5076 | ± | 0.0117 |
| truthfulqa_mc2 | 2 | none | 0 | acc | ↑ | 0.5061 | ± | 0.0167 |
| winogrande | 1 | none | 0 | acc | ↑ | 0.4933 | ± | 0.0141 |
```sh
litgpt evaluate --tasks 'wikitext,qasper' --out_dir 'evaluate-long/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| qasper_bool | 1 | none | 0 | f1 | ↑ | 0.0000 | ± | 0 |
| qasper_freeform | 2 | none | 0 | f1_abstractive | ↑ | 0.0036 | ± | 0.001 |
| wikitext | 2 | none | 0 | bits_per_byte | ↓ | 3.0634 | ± | N/A |
|  |  | none | 0 | byte_perplexity | ↓ | 8.3596 | ± | N/A |
|  |  | none | 0 | word_perplexity | ↓ | 85375.3002 | ± | N/A |
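For reference, the two byte-level wikitext metrics above are tied together by byte_perplexity = 2^bits_per_byte (the standard lm-evaluation-harness definitions), which the reported values satisfy:

```python
# Consistency check of the wikitext metrics reported above (values copied from the table).
bits_per_byte = 3.0634
byte_perplexity = 2 ** bits_per_byte
print(round(byte_perplexity, 2))  # ~8.36, matching the reported byte_perplexity of 8.3596
```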
Continued Pretrain Evaluation
lm-evaluation-harness
```sh
litgpt evaluate --tasks 'hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge' --out_dir 'evaluate-contrain-quick/' --batch_size 4 --dtype 'bfloat16' out/contrain/final/
```

| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 0 | acc | ↑ | 0.2142 | ± | 0.0120 |
|  |  | none | 0 | acc_norm | ↑ | 0.2551 | ± | 0.0127 |
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0136 | ± | 0.0032 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.2626 | ± | 0.0044 |
|  |  | none | 0 | acc_norm | ↑ | 0.2594 | ± | 0.0044 |
| mmlu | 2 | none |  | acc | ↑ | 0.2441 | ± | 0.0036 |
| - humanities | 2 | none |  | acc | ↑ | 0.2417 | ± | 0.0062 |
| - formal_logic | 1 | none | 0 | acc | ↑ | 0.2937 | ± | 0.0407 |
| - high_school_european_history | 1 | none | 0 | acc | ↑ | 0.2182 | ± | 0.0323 |
| - high_school_us_history | 1 | none | 0 | acc | ↑ | 0.2402 | ± | 0.0300 |
| - high_school_world_history | 1 | none | 0 | acc | ↑ | 0.2700 | ± | 0.0289 |
| - international_law | 1 | none | 0 | acc | ↑ | 0.1901 | ± | 0.0358 |
| - jurisprudence | 1 | none | 0 | acc | ↑ | 0.2778 | ± | 0.0433 |
| - logical_fallacies | 1 | none | 0 | acc | ↑ | 0.2086 | ± | 0.0319 |
| - moral_disputes | 1 | none | 0 | acc | ↑ | 0.2110 | ± | 0.0220 |
| - moral_scenarios | 1 | none | 0 | acc | ↑ | 0.2704 | ± | 0.0149 |
| - philosophy | 1 | none | 0 | acc | ↑ | 0.1897 | ± | 0.0223 |
| - prehistory | 1 | none | 0 | acc | ↑ | 0.2130 | ± | 0.0228 |
| - professional_law | 1 | none | 0 | acc | ↑ | 0.2445 | ± | 0.0110 |
| - world_religions | 1 | none | 0 | acc | ↑ | 0.2690 | ± | 0.0340 |
| - other | 2 | none |  | acc | ↑ | 0.2546 | ± | 0.0078 |
| - business_ethics | 1 | none | 0 | acc | ↑ | 0.2600 | ± | 0.0441 |
| - clinical_knowledge | 1 | none | 0 | acc | ↑ | 0.2491 | ± | 0.0266 |
| - college_medicine | 1 | none | 0 | acc | ↑ | 0.2543 | ± | 0.0332 |
| - global_facts | 1 | none | 0 | acc | ↑ | 0.1900 | ± | 0.0394 |
| - human_aging | 1 | none | 0 | acc | ↑ | 0.2287 | ± | 0.0282 |
| - management | 1 | none | 0 | acc | ↑ | 0.2233 | ± | 0.0412 |
| - marketing | 1 | none | 0 | acc | ↑ | 0.2863 | ± | 0.0296 |
| - medical_genetics | 1 | none | 0 | acc | ↑ | 0.2100 | ± | 0.0409 |
| - miscellaneous | 1 | none | 0 | acc | ↑ | 0.2197 | ± | 0.0148 |
| - nutrition | 1 | none | 0 | acc | ↑ | 0.2680 | ± | 0.0254 |
| - professional_accounting | 1 | none | 0 | acc | ↑ | 0.2624 | ± | 0.0262 |
| - professional_medicine | 1 | none | 0 | acc | ↑ | 0.3824 | ± | 0.0295 |
| - virology | 1 | none | 0 | acc | ↑ | 0.2530 | ± | 0.0338 |
| - social sciences | 2 | none |  | acc | ↑ | 0.2428 | ± | 0.0077 |
| - econometrics | 1 | none | 0 | acc | ↑ | 0.2456 | ± | 0.0405 |
| - high_school_geography | 1 | none | 0 | acc | ↑ | 0.2323 | ± | 0.0301 |
| - high_school_government_and_politics | 1 | none | 0 | acc | ↑ | 0.2383 | ± | 0.0307 |
| - high_school_macroeconomics | 1 | none | 0 | acc | ↑ | 0.2385 | ± | 0.0216 |
| - high_school_microeconomics | 1 | none | 0 | acc | ↑ | 0.2017 | ± | 0.0261 |
| - high_school_psychology | 1 | none | 0 | acc | ↑ | 0.2550 | ± | 0.0187 |
| - human_sexuality | 1 | none | 0 | acc | ↑ | 0.2748 | ± | 0.0392 |
| - professional_psychology | 1 | none | 0 | acc | ↑ | 0.2386 | ± | 0.0172 |
| - public_relations | 1 | none | 0 | acc | ↑ | 0.2545 | ± | 0.0417 |
| - security_studies | 1 | none | 0 | acc | ↑ | 0.2531 | ± | 0.0278 |
| - sociology | 1 | none | 0 | acc | ↑ | 0.2587 | ± | 0.0310 |
| - us_foreign_policy | 1 | none | 0 | acc | ↑ | 0.2300 | ± | 0.0423 |
| - stem | 2 | none |  | acc | ↑ | 0.2388 | ± | 0.0076 |
| - abstract_algebra | 1 | none | 0 | acc | ↑ | 0.2200 | ± | 0.0416 |
| - anatomy | 1 | none | 0 | acc | ↑ | 0.2074 | ± | 0.0350 |
| - astronomy | 1 | none | 0 | acc | ↑ | 0.2632 | ± | 0.0358 |
| - college_biology | 1 | none | 0 | acc | ↑ | 0.2361 | ± | 0.0355 |
| - college_chemistry | 1 | none | 0 | acc | ↑ | 0.2500 | ± | 0.0435 |
| - college_computer_science | 1 | none | 0 | acc | ↑ | 0.3300 | ± | 0.0473 |
| - college_mathematics | 1 | none | 0 | acc | ↑ | 0.2100 | ± | 0.0409 |
| - college_physics | 1 | none | 0 | acc | ↑ | 0.3039 | ± | 0.0458 |
| - computer_security | 1 | none | 0 | acc | ↑ | 0.2800 | ± | 0.0451 |
| - conceptual_physics | 1 | none | 0 | acc | ↑ | 0.2681 | ± | 0.0290 |
| - electrical_engineering | 1 | none | 0 | acc | ↑ | 0.2621 | ± | 0.0366 |
| - elementary_mathematics | 1 | none | 0 | acc | ↑ | 0.2196 | ± | 0.0213 |
| - high_school_biology | 1 | none | 0 | acc | ↑ | 0.2484 | ± | 0.0246 |
| - high_school_chemistry | 1 | none | 0 | acc | ↑ | 0.1823 | ± | 0.0272 |
| - high_school_computer_science | 1 | none | 0 | acc | ↑ | 0.2200 | ± | 0.0416 |
| - high_school_mathematics | 1 | none | 0 | acc | ↑ | 0.2111 | ± | 0.0249 |
| - high_school_physics | 1 | none | 0 | acc | ↑ | 0.1987 | ± | 0.0326 |
| - high_school_statistics | 1 | none | 0 | acc | ↑ | 0.2130 | ± | 0.0279 |
| - machine_learning | 1 | none | 0 | acc | ↑ | 0.3393 | ± | 0.0449 |
| truthfulqa_mc2 | 2 | none | 0 | acc | ↑ | 0.5067 | ± | 0.0167 |
| winogrande | 1 | none | 0 | acc | ↑ | 0.4759 | ± | 0.0140 |

| Groups | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.2441 | ± | 0.0036 |
| - humanities | 2 | none |  | acc | ↑ | 0.2417 | ± | 0.0062 |
| - other | 2 | none |  | acc | ↑ | 0.2546 | ± | 0.0078 |
| - social sciences | 2 | none |  | acc | ↑ | 0.2428 | ± | 0.0077 |
| - stem | 2 | none |  | acc | ↑ | 0.2388 | ± | 0.0076 |
```sh
litgpt evaluate --tasks 'gsm8k,mathqa' --out_dir 'evaluate-contrain-math/' --batch_size 4 --dtype 'bfloat16' out/contrain/final/
```

| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0136 | ± | 0.0032 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| mathqa | 1 | none | 0 | acc | ↑ | 0.2023 | ± | 0.0074 |
|  |  | none | 0 | acc_norm | ↑ | 0.1977 | ± | 0.0073 |