The paper has a lot of experiments (they trained 84 models!) on what makes video LMs work ⏯️
Try the demo for the best setup here https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B They evaluate sampling strategies, scaling laws for models and datasets, video representation and more!
> The authors find that whatever design decision works on small models also scales properly when the model and dataset are scaled up 📈 scaling the dataset has diminishing returns for smaller models
> They evaluate frame sampling strategies and find that FPS sampling is better than uniform sampling, with 8-32 tokens per frame being optimal (a small sketch of the two strategies follows this list)
> They also compare image encoders, trying a variety of models from shape-optimized SigLIP to DINOv2, and find google/siglip-so400m-patch14-384 to be the most powerful 🔥
> They also compare freezing different parts of the model: training all stages with some parts frozen gives the best yield
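To make the FPS vs. uniform sampling comparison concrete, here is a minimal Python sketch of the two strategies; the function names are illustrative and not from the Apollo codebase:

```python
# Uniform sampling picks N evenly spaced frames regardless of clip length,
# while FPS sampling keeps a fixed temporal rate, so longer clips yield
# more frames. (Illustrative sketch, not the paper's implementation.)
import numpy as np

def uniform_sample(num_frames: int, n: int) -> np.ndarray:
    """Pick n evenly spaced frame indices, whatever the clip duration."""
    return np.linspace(0, num_frames - 1, n).round().astype(int)

def fps_sample(num_frames: int, video_fps: float, target_fps: float) -> np.ndarray:
    """Keep a constant rate of target_fps frames per second of video."""
    step = video_fps / target_fps
    return np.arange(0, num_frames, step).round().astype(int)

# A 10 s clip at 30 FPS: uniform sampling always returns 8 frames,
# while sampling at 2 FPS returns 20, preserving temporal density.
print(uniform_sample(300, 8))
print(fps_sample(300, video_fps=30, target_fps=2))
```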
They eventually release three models, where Apollo-3B outperforms most 7B models and Apollo-7B outperforms 30B models 🔥
Multimodal 🖼️
> Google shipped PaliGemma 2, a new iteration of PaliGemma with more sizes: 3B, 10B and 28B, with pre-trained and captioning variants 👏
> OpenGVLab released InternVL 2.5, seven new vision LMs in different sizes, with state-of-the-art checkpoints under MIT license ✨
> The Qwen team at Alibaba released the base models of Qwen2VL with 2B, 7B and 72B checkpoints
LLMs 💬
> Meta released a new iteration of Llama 70B, Llama-3.3-70B, trained further
> EuroLLM-9B-Instruct is a new multilingual LLM for European languages with Apache 2.0 license 🔥
> Dataset: CohereForAI released GlobalMMLU, a multilingual version of MMLU covering 42 languages, with Apache 2.0 license
> Dataset: QwQ-LongCoT-130K is a new dataset to train reasoning models
> Dataset: FineWeb2 just landed with a multilinguality update! 🔥 nearly 8TB of pretraining data in many languages!
Image/Video Generation 🖼️
> Tencent released HunyuanVideo, a new photorealistic video generation model
> OminiControl is a new editing/control framework for image generation models like Flux
Audio 🔊
> Indic-Parler-TTS is a new text-to-speech model made by the community
New InternVL drop with a state-of-the-art 78B vision language model with MIT license 🔥 https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c The release comes with seven new vision LMs in different sizes, based on InternViT 300M/6B paired with Qwen2.5 (0.5B, 3B, 32B, 72B) and InternLM2.5 (1.8B, 7B, 20B). The 78B model combines InternViT-6B with Qwen2.5-72B-Instruct and can accomplish a variety of tasks 👏 Try it here: OpenGVLab/InternVL
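For a quick local test of one of the smaller checkpoints, a minimal sketch using transformers' trust_remote_code path might look like the following; the `load_image` dynamic-tiling helper is defined in the model card (assumed here, not imported from a library), and the generation parameters are illustrative:

```python
# Minimal sketch (not the official recipe) for chatting with an
# InternVL 2.5 checkpoint; InternVL ships its own modeling code,
# hence trust_remote_code=True.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2_5-8B"  # smaller sibling of the 78B flagship
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# pixel_values: (num_tiles, 3, 448, 448), produced by the model card's
# load_image() preprocessing helper (assumed available, see the repo).
pixel_values = load_image("example.jpg").to(torch.bfloat16).cuda()

question = "<image>\nDescribe this image in detail."
response = model.chat(
    tokenizer, pixel_values, question,
    generation_config=dict(max_new_tokens=256, do_sample=False),
)
print(response)
```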
small but mighty 🔥 you can fine-tune SmolVLM on an L4 with a batch size of 4 and it will only take 16.4 GB VRAM 🫰🏻 with gradient accumulation, the simulated batch size is 16 ✨ I made a notebook that includes all the goodies: QLoRA, gradient accumulation and gradient checkpointing, with explanations of how they work 💝 https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
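If you just want the memory-saving skeleton without opening the notebook, a minimal sketch of how the three techniques combine might look like this (model id, LoRA targets and hyperparameters are illustrative, not the notebook's exact values):

```python
# Sketch of combining QLoRA (4-bit base weights + LoRA adapters),
# gradient accumulation and gradient checkpointing for SmolVLM.
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                  # quantize frozen base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct", quantization_config=bnb_config
)
# Train only small low-rank adapters on top of the quantized base model.
model = get_peft_model(model, LoraConfig(r=8, target_modules=["q_proj", "v_proj"]))

args = TrainingArguments(
    output_dir="smolvlm-ft",
    per_device_train_batch_size=4,      # fits in ~16.4 GB on an L4
    gradient_accumulation_steps=4,      # 4 x 4 = simulated batch size of 16
    gradient_checkpointing=True,        # recompute activations to save VRAM
    bf16=True,
)
```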
Six predictions for AI in 2025 (and a review of how my 2024 predictions turned out):
- There will be the first major public protest related to AI
- A big company will see its market cap divided by two or more because of AI
- At least 100,000 personal AI robots will be pre-ordered
- China will start to lead the AI race (as a consequence of leading the open-source AI race)
- There will be big breakthroughs in AI for biology and chemistry
- We will begin to see the economic and employment growth potential of AI, with 15M AI builders on Hugging Face
How my predictions for 2024 turned out:
- A hyped AI company will go bankrupt or get acquired for a ridiculously low price ✅ (Inflection, Adept AI, ...)
- Open-source LLMs will reach the level of the best closed-source LLMs ✅ with QwQ and dozens of others
- Big breakthroughs in AI for video, time-series, biology and chemistry ✅ for video 🔴 for time-series, biology and chemistry
- We will talk much more about the cost (monetary and environmental) of AI ✅ monetary 🔴 environmental (😢)
- A popular media product will be mostly AI-generated ✅ with NotebookLM by Google
- 10 million AI builders on Hugging Face leading to no increase in unemployment 🔜 currently 7M AI builders on Hugging Face
🖼️ Multimodal
> At Hugging Face we released SmolVLM, a performant and efficient smol vision language model 💗
> Show Lab released ShowUI-2B: a new vision-language-action model to build GUI/web automation agents 🤖
> Rhymes AI released the base models of Aria: Aria-Base-64K and Aria-Base-8K, with their respective context lengths
> ViDoRe team released ColSmolVLM: a new ColPali-like retrieval model based on SmolVLM
> Dataset: Llava-CoT-o1-Instruct: a new dataset labelled using the Llava-CoT multimodal reasoning model 📖
> Dataset: LLaVA-CoT-100k, the dataset used to train Llava-CoT, released by the creators of Llava-CoT 📕
💬 LLMs
> Qwen team released QwQ-32B-Preview, a state-of-the-art open-source reasoning model that broke the internet 🔥
> Alibaba released Marco-o1, a new open-source reasoning model 💥
> NVIDIA released Hymba 1.5B Base and Instruct, new state-of-the-art SLMs with a hybrid architecture (Mamba + transformer)
⏯️ Image/Video Generation
> Qwen2VL-Flux: a new image generation model based on the Qwen2VL image encoder, T5 and Flux for generation
> Lightricks released LTX-Video, a new DiT-based video generation model that can generate 24 FPS videos at 768x512 resolution ⏯️ (a minimal diffusers sketch follows this list)
> Dataset: Image Preferences is a new image generation preference dataset made with the DIBT community effort by Argilla 🏷️
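Since LTX-Video has diffusers support, a minimal generation sketch might look like the following; the prompt, frame count and step count here are illustrative, so check the model card for recommended settings:

```python
# Hedged sketch of text-to-video generation with LTX-Video via diffusers.
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")

video = pipe(
    prompt="A woman walks along a beach at sunset",
    width=768,               # the 768x512 resolution mentioned above
    height=512,
    num_frames=121,          # roughly 5 s at 24 FPS
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```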
🔊 Audio
> OuteAI released OuteTTS-0.2-500M, a new multilingual text-to-speech model based on Qwen-2.5-0.5B and trained on 5B audio prompt tokens