Yi Cui PRO

onekq

https://onekq.ai

AI & ML interests

Benchmark, Code Generation Model

Recent Activity

posted an update 1 day ago

🐋 DeepSeek 🐋 v3 achieves a solid 7-point jump over v2.5, surpassing GPT-4o, but is still behind 🍓 o1 🍓 and Claude 3.5. https://huggingface.co/spaces/onekq-ai/WebApp1K-models-leaderboard

updated a Space 1 day ago

onekq-ai/WebApp1K-models-leaderboard

Articles

Does Daily Software Engineering Work Need Reasoning Models?

Sep 24

• 5

All LLMs Write Great Code, But Some Make (A Lot) Fewer Mistakes

Sep 12

• 4

Organizations

Posts 7

Post

1588

🐋 DeepSeek 🐋 v3 achieves a solid 7-point jump over v2.5, surpassing GPT-4o, but is still behind 🍓 o1 🍓 and Claude 3.5.

onekq-ai/WebApp1K-models-leaderboard

Post

574

The October version of Claude 3.5 lifts the SOTA (set by its June version) by 7 points.
onekq-ai/WebApp1K-models-leaderboard

Closed-source models are widening the gap again.

Note: Our frontier leaderboard now uses double test scenarios because the single-scenario test suite has been saturated.

Papers 3

arxiv:2409.13773

arxiv:2409.05177

arxiv:2408.00019

models

None public yet

datasets

None public yet