open-llm-leaderboard/open_llm_leaderboard · 14B model detected as 7B

Dec 23, 2024

I've been working on merging a 14 billion parameter model recently, but when it comes time to evaluate the model, the system indicates that the model has only 7 billion parameter instead of the expected 14 billion. It's funny when the top 7 billion model is actually 14 billion

alozowski

Open LLM Leaderboard org Dec 23, 2024

Hi @djuna ,

Could you please provide the request file for the model you submitted so we will be able to check the number of parameters?

djuna

Dec 23, 2024

@alozowski this one
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/djuna/Q2.5-Veltha-14B-0.5_eval_request_False_bfloat16_Original.json

djuna

Dec 23, 2024

When you filter 7-8B size on the spaces, more than 10+ model is actually 14B

brankor-mcom

Dec 23, 2024

There are quite a few models in the leaderboard where the indicated size is half the actual size:

maldv/Qwentile2.5-32B-Instruct
CultriX/Qwen2.5-14B-Wernickev3

...and many others, most of them Qwen-derived.

clefourrier

Open LLM Leaderboard org Dec 26, 2024

Hi! Thanks for the report!
We extract the number of parameters from the safetensors files automatically, in theory - @alozowski will be able to investigate why there is a mismatch when she comes back from vacations

sometimesanotion

Dec 31, 2024

Interestingly, I'm seeing the same thing. Djuna's models and my own are merging similar base models. Also, the scores for my models on the leaderboard vary greatly from what the comparator shows for them. I believe the comparator is accurate.

clefourrier

Open LLM Leaderboard org Dec 31, 2024

For the difference between the comparator and leaderboard, make sure you compare either raw or normalised scores on both (we have 2 ways to compute scores, it should be explained in the FAQ)

CombinHorizon

Jan 12

•

edited Jan 12

there are some models by sometimesanotion
all of those models are deleted/unavailable

some of the request files:
Qwen2.5-14B-Vimarckoso-v3-model_stock
Lamarck-14B-v0.6-model_stock
Qwen2.5-14B-Vimarckoso-v3-Prose01
Qwentinuum-14B-v5

and there are a dozen or so more similar

Would a way to: automatically flag a model for closer/manual inspection, when its model name and auto-detected size differ significantly - would that help much?

alozowski

Open LLM Leaderboard org Jan 13

Hi everyone!

Thank you all for bringing this to our attention. The parameter numbers issue was fixed last week. If a model was accessible on Hub, its parameters were recalculated to the correct value, and I hope there are no more such errors

I'm closing this issue, but please feel free to ping me here if you have any other questions or open a new one!

alozowski changed discussion status to closed Jan 13

brankor-mcom

Jan 13

@alozowski , unfortunately the following models (none of them available on the hub), still have the wrong size:

 [1] "sometimesanotion/Lamarck-14B-v0.6-002-model_stock"      "sometimesanotion/Lamarck-14B-v0.6-model_stock"         
 [3] "sometimesanotion/Qwen-14B-ProseStock-v4"                "sometimesanotion/Qwen2.5-14B-Vimarckoso-v2"            
 [5] "sometimesanotion/Qwen2.5-14B-Vimarckoso-v3-IF-Variant"  "sometimesanotion/Qwen2.5-14B-Vimarckoso-v3-Prose01"    
 [7] "sometimesanotion/Qwen2.5-14B-Vimarckoso-v3-model_stock" "sometimesanotion/Qwentinuum-14B-v013"                  
 [9] "sometimesanotion/Qwentinuum-14B-v1"                     "sometimesanotion/Qwentinuum-14B-v2"                    
[11] "sometimesanotion/Qwentinuum-14B-v3"                     "sometimesanotion/Qwentinuum-14B-v5"                    
[13] "sometimesanotion/Qwentinuum-14B-v6"                     "sometimesanotion/Qwentinuum-14B-v6-Prose"              
[15] "sometimesanotion/Qwentinuum-14B-v7"                     "sometimesanotion/Qwentinuum-14B-v8"                    
[17] "sometimesanotion/Qwentinuum-14B-v9"                     "sometimesanotion/Qwenvergence-14B-qv256"               
[19] "sometimesanotion/Qwenvergence-14B-v0.6-004-model_stock" "sometimesanotion/Qwenvergence-14B-v3"                  
[21] "sometimesanotion/Qwenvergence-14B-v3-Reason"            "sometimesanotion/Qwenvergence-14B-v3-Reason"           
[23] "sometimesanotion/Qwenvergence-14B-v6-Prose"

Also this one: sometimesanotion/IF-reasoning-experiment-40, so at least 24 in total, all 14B indicated as 7b. Looks like unsloth/phi-4-unsloth-bnb-4bit too, I don't think phi-4 has 8b params.

When the leaderboard is filtered by size (-1 to 10: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?params=-1%2C10), these swamp the top places which is rather unfortunate, as it makes the rankings unusable. (The actual best model with <10b params is tiiuae/Falcon3-7B-Instruct, and it's not in the top 20.) For the integrity of the list, would it be possible to manually fix these at least to the approximately correct values?

alozowski

Open LLM Leaderboard org Jan 14

Hi @brankor-mcom ,

Thanks for the list of models! I corrected them manually, and we will check the parameter correctness for the inaccessible models.

For unsloth/phi-4-unsloth-bnb-4bit – everythings is correct