zzfive's Collections
datasets
updated
MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels
Paper • 2405.07526 • Published • 18
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
Paper • 2405.15613 • Published • 13
A Touch, Vision, and Language Dataset for Multimodal Alignment
Paper • 2402.13232 • Published • 14
How Do Large Language Models Acquire Factual Knowledge During Pretraining?
Paper • 2406.11813 • Published • 30
DataComp-LM: In search of the next generation of training sets for language models
Paper • 2406.11794 • Published • 50
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
Paper • 2406.11833 • Published • 61
From Pixels to Prose: A Large Dataset of Dense Image Captions
Paper • 2406.10328 • Published • 17
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
Paper • 2406.11271 • Published • 20
StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images
Paper • 2406.13735 • Published • 5
Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models
Paper • 2406.14599 • Published • 16
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Paper • 2406.20094 • Published • 96
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
Paper • 2406.17720 • Published • 7
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
Paper • 2407.02371 • Published • 51
TabReD: A Benchmark of Tabular Machine Learning in-the-Wild
Paper • 2406.19380 • Published • 47
Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge
Paper • 2407.03958 • Published • 18
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions
Paper • 2407.06358 • Published • 19
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
Paper • 2407.10957 • Published • 23
YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus
Paper • 2407.11144 • Published • 8
Visual Text Generation in the Wild
Paper • 2407.14138 • Published • 9
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks
Paper • 2407.19795 • Published • 11
Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation
Paper • 2408.00205 • Published • 4
VidGen-1M: A Large-Scale Dataset for Text-to-video Generation
Paper • 2408.02629 • Published • 13
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine
Paper • 2408.02900 • Published • 25
Diffusion Models as Data Mining Tools
Paper • 2408.02752 • Published • 13
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond
Paper • 2408.03900 • Published • 9
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
Paper • 2408.04594 • Published • 14
VGGHeads: A Large-Scale Synthetic Dataset for 3D Human Heads
Paper • 2407.18245 • Published • 8
MovieSum: An Abstractive Summarization Dataset for Movie Screenplays
Paper • 2408.06281 • Published • 9
InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning
Paper • 2408.07089 • Published • 13
Paper • 2408.05366 • Published • 11
D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning
Paper • 2408.08441 • Published • 7
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images
Paper • 2408.16176 • Published • 7
ClimDetect: A Benchmark Dataset for Climate Change Detection and Attribution
Paper • 2408.15993 • Published • 7
Kvasir-VQA: A Text-Image Pair GI Tract Dataset
Paper • 2409.01437 • Published • 70
The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts
Paper • 2409.00447 • Published • 2
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation
Paper • 2407.17438 • Published • 23
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
Paper • 2411.04709 • Published • 25
Improving the detection of technical debt in Java source code with an enriched dataset
Paper • 2411.05457 • Published • 2
GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models
Paper • 2411.05830 • Published • 20
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions
Paper • 2411.07461 • Published • 21
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
Paper • 2411.08380 • Published • 25
RedPajama: an Open Dataset for Training Large Language Models
Paper • 2411.12372 • Published • 47
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
Paper • 2411.14794 • Published • 12
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
Paper • 2412.00927 • Published • 26
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information
Paper • 2412.00947 • Published • 7
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
Paper • 2412.03304 • Published • 17
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing
Paper • 2412.04280 • Published • 13
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
Paper • 2412.05237 • Published • 46
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Paper • 2412.04626 • Published • 11
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
Paper • 2412.08580 • Published • 45
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation
Paper • 2412.07147 • Published • 5
VisionArena: 230K Real World User-VLM Conversations with Preference Labels
Paper • 2412.08687 • Published • 13