🚀 Supercharge your LLM apps with Langfuse on Hugging Face Spaces!
Langfuse brings end-to-end observability and tooling to accelerate your dev workflow from experiments through production
Now available as a Docker Space directly on the HF Hub! 🤗
🔍 Trace everything: monitor LLM calls, retrieval, and agent actions with popular frameworks 1⃣ One-click deployment: on Spaces with persistent storage and integrated OAuth 🛠 Simple Prompt Management: Version, edit, and update without redeployment ✅ Intuitive Evals: Collect user feedback, run model/prompt evaluations, and improve quality 📊 Dataset Creation: Build datasets directly from production data to enhance future performance
Kudos to the Langfuse team for this collab and the awesome, open-first product they’re building! 👏 @marcklingen@Clemo@MJannik
After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub
TL;DR: - public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible - private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)
Trace LLM calls with Arize AI's Phoenix observability dashboards on Hugging Face Spaces! 🚀
✨ I just added a new recipe to the Open-Source AI Cookbook that shows you how to: 1️⃣ Deploy Phoenix on HF Spaces with persistent storage in a few clicks 2️⃣ Configure LLM tracing with the 𝗦𝗲𝗿𝘃𝗲𝗿𝗹𝗲𝘀𝘀 𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗔𝗣𝗜 3️⃣ Observe multi-agent application runs with the CrewAI integration
𝗢𝗯𝘀𝗲𝗿𝘃𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗶𝘀 𝗰𝗿𝘂𝗰𝗶𝗮𝗹 for building robust LLM apps.
Phoenix makes it easy to visualize trace data, evaluate performance, and track down issues. Give it a try!
The outcome is quite sad, as a Frenchman and European.
The top 10 is exclusively US 🇺🇸 and Chinese 🇨🇳 companies (after great Chinese LLM releases recently, like the Qwen2.5 series), with the notable exception of Mistral AI 🇫🇷.
American companies are making fast progress, Chinese ones even faster. Europe is at risk of being left behind. And the EU AI Act hasn't even come into force yet to slow down the EU market. We need to wake up 😬
⚠️ Caution: This Chatbot Arena ELO ranking is not the most accurate, especially at high scores like this, because LLM makers can game it to some extent.
When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That’s where our chunk-based approach comes in.
Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:
⏩ Only upload the chunks that changed. 🚀 Download just the updates, not the whole file. 🧠 We store your file as deduplicated chunks
In our benchmarks, we found that using CDC to store iterative model and dataset version led to transfer speedups of ~2x, but this isn’t just a performance boost. It’s a rethinking of how we manage models and datasets on the Hub.
We're planning on our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?
This is no Woodstock AI but will be fun nonetheless haha. I’ll be hosting a live workshop with team members next week about the Enterprise Hugging Face hub.
1,000 spots available first-come first serve with some surprises during the stream!
TL;DR Make your model write "margin notes" as you chunk prefill the KV cache. Then ask it reread all notes before it speaks up. Works with humans, works with AI 🤖
WiM leverages the chunked prefill of the key-value cache, which concurrently generates query-based extractive summaries at each step of the prefill that are subsequently reintegrated at the end of the computation. We term these intermediate outputs “margins”, drawing inspiration from the practice of making margin notes for improved comprehension of long contexts in human reading. We show that this technique, which adds only minimal additional computation, significantly improves LLMs long context reasoning capabilities.
Think: Every chunk has a chance to be attended to/ be at the end of the context at least once. 🎉
📊 Results: - An average accuracy boost of 7.5% in multi-hop reasoning tasks like HotpotQA and MultiHop-RAG. - Even a 30% increase in F1-score for summarisation-like tasks (CWE).
Plus, WiM fits seamlessly into interactive applications (think: progress bar!). It can provide real-time progress updates during data retrieval and integration, making it user-friendly and transparent - a stark contrast to feeding 1mln tokens to an LLMs and waiting 6 min for the first token. 🤯
Given the impressive benchmarks published my Meta for their Llama-3.1 models, I was curious to see how these models would compare to top proprietary models on Chatbot Arena.
Now we've got the results! LMSys released the ELO derived from thousands of user votes for the new models, and here are the rankings:
💥 405B Model ranks 5th overall, in front of GPT-4-turbo! But behind GPT-4o, Claude-3.5 Sonnet and Gemini-advanced. 👏 70B Model climbs up to 9th rank ! From 1206 ➡️ 1244. 👍 8B Model improves from 1152 ➡️ 1170.
✅ This confirms that Llama-3.1 is a good contender for any task: any of its 3 model size is much cheaper to run than equivalent proprietary models!
For instance, here are the inference prices for the top models; ➤ GPT-4-Turbo inference price from OpenAI: $5/M input tokens, $15/M output tokens ➤ Llama-3.1-405B from HF API (for testing only): 3$/M for input or output tokens (Source linked in the first comment) ➤ Llama-3.1-405B from HF API (for testing only): free ✨
Today is a huge day in Argilla’s history. We couldn’t be more excited to share this with the community: we’re joining Hugging Face!
We’re embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.
Over the past year, we’ve been collaborating with Hugging Face on countless projects: launching partner of Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr’s learnings, the Data is Better Together initiative with hundreds of community contributors, or releasing argilla/OpenHermesPreferences, one of the largest open preference tuning datasets
After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we’re now the same team.
To those of you who’ve been following us, this won’t be a huge surprise, but it will be a big deal in the coming months. This acquisition means we’ll double down on empowering the community to build and collaborate on high quality datasets, we’ll bring full support for multimodal datasets, and we’ll be in a better place to collaborate with the Open Source AI community. For enterprises, this means that the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.
As a founder, I am proud of the Argilla team. We're now part of something bigger and a larger team but with the same values, culture, and goals. Grateful to have shared this journey with my beloved co-founders Paco and Amélie.
Finally, huge thanks to the Chief Llama Officer @osanseviero for sparking this and being such a great partner during the acquisition process.
Would love to answer any questions you have so feel free to add them below!
NuMind has just released 3 new state-of-the-art GLiNER models for Named Entity Recognition/Information Extraction. These GLiNER models allow you to specify any label that you want, and it'll find spans in the text corresponding to your label. It's been shown to work quite well on unusual domains, e.g. celestial entities in my picture.
There are 3 models released: - numind/NuNER_Zero: The primary model, SOTA & can detect really long entities. - numind/NuNER_Zero-span: Slightly better performance than NuNER Zero, but can't detect entities longer than 12 tokens. - numind/NuNER_Zero-4k: Slightly worse than NuNER Zero, but has a context length of 4k tokens.
Some more details about these models in general: - They are *really* small, orders of magnitude smaller than LLMs, which don't reach this level of performance. - Because they're small - they're fast: <1s per sentence on free GPUs. - They have an MIT license: free commercial usage.