Article Fine-tune ModernBERT for text classification using synthetic data By davidberenstein1957 • 5 days ago • 17
NeMo Curator - Classifier Models Collection Classifier models that can be used in NeMo Curator for labelling and filtering datasets. • 9 items • Updated 21 days ago • 10
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in the Financial Domain Paper • 2412.13018 • Published 17 days ago • 41
🔱 Sailor2 Language Models Collection Sailing in South-East Asia with inclusive multilingual LLMs • 9 items • Updated Dec 3, 2024 • 22
DCLM Pools Collection Raw data pools for use in the DCLM competition • 5 items • Updated Jul 17, 2024 • 1
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing Paper • 2406.08464 • Published Jun 12, 2024 • 65
MagpieLM Collection Aligning LMs with a fully open recipe + synthetic data generated from open-source LMs • 9 items • Updated about 8 hours ago • 15
Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch Paper • 2410.18693 • Published Oct 24, 2024 • 40
ScaleQuest Collection We introduce ScaleQuest, a novel and scalable data synthesis method. Project page: https://scalequest.github.io/ • 8 items • Updated Oct 25, 2024 • 5
C4AI Aya Expanse Collection Aya Expanse is an open-weight research release of a model with highly advanced multilingual capabilities. • 3 items • Updated 18 days ago • 30
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling Paper • 2401.16380 • Published Jan 29, 2024 • 48
Article Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20, 2024 • 71
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates Paper • 2410.07137 • Published Oct 9, 2024 • 7
Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale Paper • 2409.17115 • Published Sep 25, 2024 • 60
Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler Paper • 2408.13359 • Published Aug 23, 2024 • 22
Power-LM Collection Dense & MoE LLMs trained with power learning rate scheduler. • 4 items • Updated Oct 17, 2024 • 15