Wrong results, or am I misunderstanding something?
Hi,
I am currently looking at some results from the new leaderboard, and there are parts of them I do not understand.
For example, I looked at the following results:
https://huggingface.co/datasets/open-llm-leaderboard/mistralai__Mistral-7B-v0.3-details/blob/main/mistralai__Mistral-7B-v0.3/samples_leaderboard_musr_object_placements_2024-06-16T16-59-40.129004.json
In there I saw, for example, the following doc:
{
  "doc_id": 100,
  "doc": {
    "narrative": "Today was an important day for Steve, a chance for a major advancement in his career hinged...",
    "question": "Which location is the most likely place Amy would look to find the laptop given the story?",
    "choices": "[\"Amy's bag\", \"Steve's desk\", 'meeting room', 'storage room']",
    "answer_index": 1,
    "answer_choice": "Steve's desk"
  },
  "target": "Steve's desk",
  "arguments": {
    "gen_args_0": {
      "arg_0": "Today was an important day for Steve, a chance for a major advancement in his career hinged...",
      "arg_1": " Amy's bag"
    },
    "gen_args_1": {
      "arg_0": "Today was an important day for Steve, a chance for a major advancement in his career hinged...",
      "arg_1": " Steve's desk"
    },
    "gen_args_2": {
      "arg_0": "Today was an important day for Steve, a chance for a major advancement in his career hinged...",
      "arg_1": " meeting room"
    },
    "gen_args_3": {
      "arg_0": "Today was an important day for Steve, a chance for a major advancement in his career hinged...",
      "arg_1": " storage room"
    }
  },
  "resps": [
    [["-4.742625713348389", "False"]],
    [["-5.4493818283081055", "False"]],
    [["-8.597461700439453", "False"]],
    [["-7.260561466217041", "False"]]
  ],
  "filtered_resps": [
    ["-4.742625713348389", "False"],
    ["-5.4493818283081055", "False"],
    ["-8.597461700439453", "False"],
    ["-7.260561466217041", "False"]
  ],
  "doc_hash": "1ff558c6b1587662ad68ccd0f38192193dddd6376f35f96cee6f445c4b4c26a6",
  "prompt_hash": "30efa381b5d2f7a8d22c15caeb5cb192cf7ca3b0481be33bf0e50f7a474c8668",
  "target_hash": "78fb1574492fbfb833e674311451bfff46b7377e0c1824b94bf4b1ddc84d0039",
  "acc_norm": 1.0
}
The answer index is 1.
Looking at the resps values, the highest value is "-4.742625713348389", which is at index 0.
I thought that the highest value determines the model's answer?
So I would have expected the model's answer to be index 0.
Why is acc_norm then 1.0?
There are two types of indexing: zero-based, starting from 0, and one-based, starting from 1. I think this might just be using a zero-based index when you were expecting it to start from one. I might be wrong though, so let's wait for an answer from one of the maintainers.
I do not think that this is the explanation, but yes, you are totally right, let's wait for an official answer.
Here is another example, from https://huggingface.co/datasets/open-llm-leaderboard/meta-llama__Meta-Llama-3-70B-Instruct-details/blob/main/meta-llama__Meta-Llama-3-70B-Instruct/samples_leaderboard_musr_murder_mysteries_2024-06-19T08-22-58.348428.json, which does not make sense to me.
{
  "doc_id": 0,
  "doc": {
    "narrative": "In an adrenaline inducing bungee jumping site, ...",
    "question": "Who is the most likely murderer?",
    "choices": "['Mackenzie', 'Ana']",
    "answer_index": 0,
    "answer_choice": "Mackenzie"
  },
  "target": "Mackenzie",
  "arguments": {
    "gen_args_0": {
      "arg_0": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nIn an adrenaline inducing bungee jumping site, ...",
      "arg_1": "Mackenzie"
    },
    "gen_args_1": {
      "arg_0": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nIn an adrenaline inducing bungee jumping site, ...",
      "arg_1": "Ana"
    }
  },
  "resps": [
    [["-18.321674346923828", "False"]],
    [["-17.761363983154297", "False"]]
  ],
  "filtered_resps": [
    ["-18.321674346923828", "False"],
    ["-17.761363983154297", "False"]
  ],
  "doc_hash": "5f1aa1c93592052d09fd5c2269624f7f6502e7a0a449eaedade303f15e4f9a7e",
  "prompt_hash": "eba5abc36b0f013ee9ad59846c5732be9d46917d24400f110b41e7dcdf3c34b4",
  "target_hash": "5e0a11a6f7067982f903e924b45692ab48c7224f794799148ed9bc3b6fc1e340",
  "acc_norm": 1.0
}
The answer index is 0. The higher value in resps is "-17.761363983154297", which is at index 1. So I would think that the model predicts index 1. But acc_norm is again 1.0.
Also, in the new visualization on https://huggingface.co/spaces/open-llm-leaderboard/blog, the results do not seem to match.
Iirc, the logprobs that you see displayed here are the sum of the logprobs over the choice tokens, but not yet normalized (by the number of tokens of the choice). (They would correspond to the acc score.)
For the acc_norm score, you normalize by the number of tokens. When you do so, Mackenzie is longer than Ana, for example, so its summed logprob is divided by a larger number and you end up with a normalised logprob that is smaller in magnitude (closer to zero), which is why it ranks first.
I agree it's not super legible though.
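To make the explanation above concrete, here is a minimal sketch that recomputes the prediction for both samples quoted in this thread, once from the raw summed logprobs and once after length normalization. It is not the leaderboard's actual code: `pick_choice` is a hypothetical helper, and it divides by the character length of each choice as a stand-in for its token count, since the tokenizer is not available here. Which unit the evaluation harness really uses is an assumption.

```python
def pick_choice(logprobs, choices, normalize=False):
    """Return the index of the best-scoring choice.

    Hypothetical helper for illustration only. With normalize=True each
    summed logprob is divided by the character length of its choice
    (an assumed stand-in for the token count mentioned in the thread).
    """
    scores = [
        lp / len(choice) if normalize else lp
        for lp, choice in zip(logprobs, choices)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Values copied from the two samples quoted in this thread.
placement_choices = ["Amy's bag", "Steve's desk", "meeting room", "storage room"]
placement_logprobs = [
    -4.742625713348389,
    -5.4493818283081055,
    -8.597461700439453,
    -7.260561466217041,
]

murder_choices = ["Mackenzie", "Ana"]
murder_logprobs = [-18.321674346923828, -17.761363983154297]

for name, logprobs, choices in [
    ("object_placements", placement_logprobs, placement_choices),
    ("murder_mysteries", murder_logprobs, murder_choices),
]:
    raw = pick_choice(logprobs, choices)
    norm = pick_choice(logprobs, choices, normalize=True)
    print(f"{name}: raw argmax = {raw}, normalized argmax = {norm}")
```

With the raw sums, index 0 wins in the object-placement sample and index 1 wins in the murder-mystery sample, which is what the question observed. After normalization the winners become index 1 and index 0 respectively, matching answer_index in both samples and consistent with acc_norm being 1.0.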
@clefourrier Thank you for the response :) Now I understand it! I am new to the leaderboard and did not have this information. Is there any documentation on this topic from which I could have learned this?
We're working on improving our doc, I don't think it's there yet - @alozowski do you think it would make sense to add to the FAQ?
Hi everyone!
Here is the new Scores Normalization page in our documentation, please check it out.
Hi @alozowski :) Thank you for your comment and the pointer to the new page :) It is a very helpful page, but it does not directly cover the topic of this discussion. The confusion here was due to the normalization of the logprobs by the number of tokens of the choice.
Yes, I'll add more information there on this topic. You can check this documentation page from time to time to follow the updates!
I think I can close this discussion for now. Please feel free to open a new one if you have any other questions :)