High ARC and TruthfulQA scores
Hello, I noticed that tigerbot-70b-chat-v2 had much higher ARC and TruthfulQA scores than other llama2-70b finetunes. I was wondering if there could have been a contamination issue with this version?
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
Hi Christopher,
Thanks for your interest and comments. We did not use any test data in our training pipeline: the model serves as our official chat product, and the benchmark data distribution is usually very different from real user behavior. During training, however, we do use a sample of the train and validation splits for fast evaluation, and we use annotated real user prompts for iterative alignment. Our initial check did not find data leakage. We will investigate further, and you raise a good point: in later versions we will probably remove the fast evaluation and check the real user data for leakage.
For reference, this version is not a direct fine-tune of llama-2. Instead, it is a fine-tuned and RLHF-trained chat model built on our own base model, tigerbot-70b-base-v1. That base model was continually pretrained from llama-2 on 300B tokens of data for three months on a 500x A100 cluster. The pretraining data is unsupervised text, and the base model already performs quite well (avg 65 on the HF leaderboard). We also checked its evaluation results on OpenCompass (https://opencompass.org.cn/model-detail/TigerBot-70B-Base-V1), where the ARC score is already 82-87. The difference from the HF numbers is likely due to differences in how the evaluation scripts prompt the model, but it suggests that data leakage during fine-tuning is less likely.
As a side note, our model was also evaluated on OpenCompass: https://opencompass.org.cn/leaderboard-llm (run by a Chinese national lab that evaluates on about 40 benchmarks) and clib: https://github.com/jeinlee1991/chinese-llm-benchmark#%E5%8E%9F%E5%A7%8B%E8%AF%84%E6%B5%8B%E6%95%B0%E6%8D%AE (crowdsourced), and performed quite well.
Nevertheless, thanks for your observations and for raising the issue. We will certainly put more effort into avoiding data contamination.
It's not ARC or TruthfulQA, but I noticed that the SFT dataset (https://huggingface.co/datasets/TigerResearch/sft_en) shared by the developers of this model contains GSM8K train data as well.
If the currently shared SFT dataset was used during the fine-tuning stage of the released TigerBot chat models, wouldn't they be contaminated on the GSM8K task?
Hi killawhale2,
Thanks for raising this. That data was uploaded five months ago as an example, and our latest iterations have moved away from it. Also, our production process has a dedup step that filters out any overlap with evaluation data to avoid overfitting. In any case, we will clean up that upload and maybe upload a fresher sample. Thanks.
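For readers wondering what a dedup step of this kind might look like, here is a minimal sketch, assuming a word-level n-gram overlap check between training samples and evaluation samples. It is not our actual production pipeline: the function names, the normalization rule, and the n=8 window are all illustrative assumptions.

```python
import re


def normalize(text):
    """Lowercase and collapse every non-alphanumeric run into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ngrams(text, n=8):
    """Return the set of word-level n-grams of the normalized text.

    Texts shorter than n words yield a single tuple holding all their words,
    so very short samples are only flagged on an exact match.
    """
    tokens = normalize(text).split()
    if not tokens:
        return set()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_texts, n=8):
    """Collect every n-gram that appears in any evaluation sample."""
    index = set()
    for text in eval_texts:
        index |= ngrams(text, n)
    return index


def filter_overlap(train_texts, eval_index, n=8):
    """Split training samples into (kept, dropped) by n-gram overlap."""
    kept, dropped = [], []
    for text in train_texts:
        if ngrams(text, n) & eval_index:
            dropped.append(text)  # shares at least one n-gram with eval data
        else:
            kept.append(text)
    return kept, dropped


if __name__ == "__main__":
    # Made-up samples for illustration only, not taken from any benchmark.
    eval_texts = [
        "A farmer has twelve sheep and buys seven more sheep at the market on Tuesday morning."
    ]
    train_texts = [
        "Q: A farmer has twelve sheep and buys seven more sheep at the market on Tuesday morning. How many?",
        "Write a short poem about autumn leaves falling in a quiet park.",
    ]
    kept, dropped = filter_overlap(train_texts, build_eval_index(eval_texts))
    print(len(kept), "kept;", len(dropped), "dropped")
```

A check like this only catches near-verbatim overlap. Paraphrased or translated benchmark items would pass through, so it lowers the contamination risk rather than ruling it out.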