Jaahas

jaahas

AI & ML interests

None yet

Organizations

None yet

jaahas's activity

New activity in barisaydin/ollama 4 months ago

Create scripts/

#1 opened 4 months ago by jaahas

reacted to clefourrier's post with 🤯 10 months ago
Fun fact about evaluation, part 2!

How much do scores change depending on prompt format choice?

Using different prompt formats (all present in the literature), ranging from just the question ("Prompt question?") to a full template ("Question: prompt question?\nChoices: enumeration of all choices\nAnswer:"), we get a score range of...

10 points for a single model!
Keep in mind that we only changed the prompt, not the evaluation subsets, etc.
Again, this confirms that evaluation results reported without their details are basically bullshit.

(Chart: prompt format on the x axis. All these evals look at the logprob of either "choice A"/"choice B"/... or "A"/"B"/...)

Incidentally, it also changes model rankings - so a "best" model might only be best on one type of prompt...
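The mechanics are easy to reproduce. Below is a minimal sketch (not the author's actual harness or benchmark data) of how such an eval scores a multiple-choice item: build the prompt in two of the formats mentioned above and compare the log-probability the model assigns to each choice letter as the next token. The model name ("gpt2") and the example question are placeholder assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any small causal LM illustrates the point; "gpt2" is a placeholder.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Hypothetical example item, not taken from any real benchmark.
question = "What is the capital of France?"
choices = {"A": "Paris", "B": "Berlin", "C": "Madrid"}

# Two prompt formats from the range described in the post: a bare question vs. a
# fully enumerated "Question: ... Choices: ... Answer:" template.
prompts = {
    "bare": f"{question}\nAnswer:",
    "enumerated": (
        f"Question: {question}\nChoices:\n"
        + "\n".join(f"{letter}. {text}" for letter, text in choices.items())
        + "\nAnswer:"
    ),
}

def score_choices(prompt):
    # Log-probability of each choice letter as the next token after the prompt.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    logprobs = torch.log_softmax(next_token_logits, dim=-1)
    # Leading space so " A", " B", ... map to single tokens in GPT-2-style vocabs.
    return {letter: logprobs[tokenizer.encode(" " + letter)[0]].item()
            for letter in choices}

for name, prompt in prompts.items():
    scores = score_choices(prompt)
    print(name, "->", max(scores, key=scores.get), scores)
```

Running the same comparison over a full eval set (rather than one toy item) is what surfaces the multi-point score swings and ranking changes described above.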