-
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
Paper • 2409.02095 • Published • 36 -
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Paper • 2409.01704 • Published • 83 -
CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation
Paper • 2409.03643 • Published • 19 -
UniDet3D: Multi-dataset Indoor 3D Object Detection
Paper • 2409.04234 • Published • 8
Collections
Discover the best community collections!
Collections including paper arxiv:2409.01704
-
EVLM: An Efficient Vision-Language Model for Visual Understanding
Paper • 2407.14177 • Published • 43 -
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
Paper • 2407.04172 • Published • 22 -
facebook/chameleon-7b
Image-Text-to-Text • Updated • 11.4k • 172 -
vidore/colpali
Updated • 28.1k • 405
-
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper • 2406.16860 • Published • 59 -
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper • 2407.02477 • Published • 21 -
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper • 2408.10188 • Published • 51 -
Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 124
-
Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset
Paper • 2403.09029 • Published • 54 -
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
Paper • 2403.12968 • Published • 24 -
RAFT: Adapting Language Model to Domain Specific RAG
Paper • 2403.10131 • Published • 67 -
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
Paper • 2403.09629 • Published • 75
-
LocalMamba: Visual State Space Model with Windowed Selective Scan
Paper • 2403.09338 • Published • 7 -
GiT: Towards Generalist Vision Transformer through Universal Language Interface
Paper • 2403.09394 • Published • 25 -
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Paper • 2402.19479 • Published • 32 -
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Paper • 2405.10300 • Published • 27
-
Speculative Streaming: Fast LLM Inference without Auxiliary Models
Paper • 2402.11131 • Published • 42 -
Generative Representational Instruction Tuning
Paper • 2402.09906 • Published • 53 -
Chain-of-Thought Reasoning Without Prompting
Paper • 2402.10200 • Published • 104 -
BitDelta: Your Fine-Tune May Only Be Worth One Bit
Paper • 2402.10193 • Published • 19
-
Specialized Language Models with Cheap Inference from Limited Domain Data
Paper • 2402.01093 • Published • 45 -
Attention Heads of Large Language Models: A Survey
Paper • 2409.03752 • Published • 89 -
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Paper • 2409.01704 • Published • 83 -
jina-embeddings-v3: Multilingual Embeddings With Task LoRA
Paper • 2409.10173 • Published • 29