Collections

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Paper • 2405.15223 • Published May 24, 2024 • 12

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24, 2024 • 53

An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27, 2024 • 87

Matryoshka Multimodal Models

Paper • 2405.17430 • Published May 27, 2024 • 31

Octo-planner: On-device Language Model for Planner-Action Agents

Paper • 2406.18082 • Published Jun 26, 2024 • 47

Adaptable Logical Control for Large Language Models

Paper • 2406.13892 • Published Jun 19, 2024 • 1

SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation

Paper • 2406.19215 • Published Jun 27, 2024 • 29

HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models

Paper • 2405.14831 • Published May 23, 2024 • 3

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Transformers meet Neural Algorithmic Reasoners

Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

The Prompt Report: A Systematic Survey of Prompting Techniques

CRAG -- Comprehensive RAG Benchmark

MotionLLM: Understanding Human Behaviors from Human Motions and Videos

Spectrally Pruned Gaussian Fields with Neural Compensation

Paint by Inpaint: Learning to Add Image Objects by Removing Them First

LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report

Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

MoAI: Mixture of All Intelligence for Large Language and Vision Models

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

BLINK: Multimodal Large Language Models Can See but Not Perceive

TextSquare: Scaling up Text-Centric Visual Instruction Tuning

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

DocLLM: A layout-aware generative language model for multimodal document understanding

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations