Models
Datasets
Spaces
Posts
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2407.16198

about 16 hours ago

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Paper • 2405.15223 • Published May 24, 2024 • 12
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24, 2024 • 53
An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27, 2024 • 87
Matryoshka Multimodal Models

Paper • 2405.17430 • Published May 27, 2024 • 31

Interesting papers

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

Paper • 2407.16198 • Published Jul 23, 2024 • 13
LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs

Paper • 2408.13467 • Published Aug 24, 2024 • 25
Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Paper • 2409.09214 • Published Sep 13, 2024 • 50

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

Paper • 2407.08303 • Published Jul 11, 2024 • 17
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Paper • 2407.07053 • Published Jul 9, 2024 • 43
PaliGemma: A versatile 3B VLM for transfer

Paper • 2407.07726 • Published Jul 10, 2024 • 68
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Paper • 2407.07895 • Published Jul 10, 2024 • 40

Multi-modality LVM

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Paper • 2406.12275 • Published Jun 18, 2024 • 29
TroL: Traversal of Layers for Large Language and Vision Models

Paper • 2406.12246 • Published Jun 18, 2024 • 34
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Paper • 2406.15334 • Published Jun 21, 2024 • 9
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

Paper • 2406.12742 • Published Jun 18, 2024 • 14

Localize Viusal Understanding

GLaMM: Pixel Grounding Large Multimodal Model

Paper • 2311.03356 • Published Nov 6, 2023 • 33
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Paper • 2311.07575 • Published Nov 13, 2023 • 13
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

Paper • 2311.03354 • Published Nov 6, 2023 • 4
Language-Informed Visual Concept Learning

Paper • 2312.03587 • Published Dec 6, 2023 • 5

An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Paper • 2406.09415 • Published Jun 13, 2024 • 50
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Paper • 2406.09406 • Published Jun 13, 2024 • 14
VideoGUI: A Benchmark for GUI Automation from Instructional Videos

Paper • 2406.10227 • Published Jun 14, 2024 • 9
What If We Recaption Billions of Web Images with LLaMA-3?

Paper • 2406.08478 • Published Jun 12, 2024 • 39

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 25
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 12
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 40
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 20

about 6 hours ago

DocLLM: A layout-aware generative language model for multimodal document understanding

Paper • 2401.00908 • Published Dec 31, 2023 • 181
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Paper • 2401.00849 • Published Jan 1, 2024 • 15
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 48
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Paper • 2311.00571 • Published Nov 1, 2023 • 41

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs