xmxx's Collections
Daily papers that are inspiring (the abstract is enough)
World Model on Million-Length Video And Language With RingAttention
Paper • arXiv:2402.08268 • Published • 37

Improving Text Embeddings with Large Language Models
Paper • arXiv:2401.00368 • Published • 79

Chain-of-Thought Reasoning Without Prompting
Paper • arXiv:2402.10200 • Published • 104

FiT: Flexible Vision Transformer for Diffusion Model
Paper • arXiv:2402.12376 • Published • 48

Paper • arXiv:2402.13144 • Published • 95

Aria Everyday Activities Dataset
Paper • arXiv:2402.13349 • Published • 30
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Paper • arXiv:2403.10517 • Published • 32

Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Paper • arXiv:2404.01197 • Published • 30

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Paper • arXiv:2404.06512 • Published • 29

Adapting LLaMA Decoder to Vision Transformer
Paper • arXiv:2404.06773 • Published • 17

Rho-1: Not All Tokens Are What You Need
Paper • arXiv:2404.07965 • Published • 88

Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies
Paper • arXiv:2404.08197 • Published • 27

LoRA Learns Less and Forgets Less
Paper • arXiv:2405.09673 • Published • 87
Many-Shot In-Context Learning in Multimodal Foundation Models
Paper • arXiv:2405.09798 • Published • 26

MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
Paper • arXiv:2405.12130 • Published • 46

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Paper • arXiv:2405.11143 • Published • 34

Octo: An Open-Source Generalist Robot Policy
Paper • arXiv:2405.12213 • Published • 24

ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
Paper • arXiv:2405.15738 • Published • 43

LLMs achieve adult human performance on higher-order theory of mind tasks
Paper • arXiv:2405.18870 • Published • 17

Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step
Paper • arXiv:2406.04314 • Published • 27
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Paper • arXiv:2406.06525 • Published • 65

Vript: A Video Is Worth Thousands of Words
Paper • arXiv:2406.06040 • Published • 25

Mixture-of-Agents Enhances Large Language Model Capabilities
Paper • arXiv:2406.04692 • Published • 55

GenAI Arena: An Open Evaluation Platform for Generative Models
Paper • arXiv:2406.04485 • Published • 20

What If We Recaption Billions of Web Images with LLaMA-3?
Paper • arXiv:2406.08478 • Published • 39

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Paper • arXiv:2406.07476 • Published • 32

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Paper • arXiv:2406.08418 • Published • 28
DataComp-LM: In search of the next generation of training sets for language models
Paper • arXiv:2406.11794 • Published • 50

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
Paper • arXiv:2406.11271 • Published • 20

Instruction Pre-Training: Language Models are Supervised Multitask Learners
Paper • arXiv:2406.14491 • Published • 86

∇²DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials
Paper • arXiv:2406.14347 • Published • 98

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
Paper • arXiv:2406.14515 • Published • 32

Video-Infinity: Distributed Long Video Generation
Paper • arXiv:2406.16260 • Published • 28

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Paper • arXiv:2406.17557 • Published • 87
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Paper • arXiv:2406.18629 • Published • 41

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
Paper • arXiv:2406.19280 • Published • 61

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Paper • arXiv:2406.20095 • Published • 17

LiteSearch: Efficacious Tree Search for LLM
Paper • arXiv:2407.00320 • Published • 37

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
Paper • arXiv:2407.02371 • Published • 51

Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
Paper • arXiv:2407.01906 • Published • 34

Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
Paper • arXiv:2407.06189 • Published • 26

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
Paper • arXiv:2407.13623 • Published • 53