Yedson54's Collections
Alignment and Unlearning
Learn Your Reference Model for Real Good Alignment
Paper • 2404.09656 • Published • 82
Aligning Teacher with Student Preferences for Tailored Training Data Generation
Paper • 2406.19227 • Published • 24
Self-Play Preference Optimization for Language Model Alignment
Paper • 2405.00675 • Published • 25
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues
Paper • 2404.03820 • Published • 24
Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
Paper • 2407.00617 • Published • 7
UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI
Paper • 2407.00106 • Published • 5
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Paper • 2406.12624 • Published • 36
Simulating Classroom Education with LLM-Empowered Agents
Paper • 2406.19226 • Published • 30
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Paper • 2406.18495 • Published • 12
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Paper • 2406.18510 • Published • 8
Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces
Paper • 2406.11614 • Published • 4
Large Language Model Unlearning via Embedding-Corrupted Prompts
Paper • 2406.07933 • Published • 7
Deep Bayesian Active Learning for Preference Modeling in Large Language Models
Paper • 2406.10023 • Published • 2
Transforming and Combining Rewards for Aligning Large Language Models
Paper • 2402.00742 • Published • 11
LongAlign: A Recipe for Long Context Alignment of Large Language Models
Paper • 2401.18058 • Published • 20
Learning to Refuse: Towards Mitigating Privacy Risks in LLMs
Paper • 2407.10058 • Published • 29
To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models
Paper • 2407.01920 • Published • 13
Rethinking Entity-level Unlearning for Large Language Models
Paper • 2406.15796 • Published
The Art of Saying No: Contextual Noncompliance in Language Models
Paper • 2407.12043 • Published • 4
Instruction Following without Instruction Tuning
Paper • 2409.14254 • Published • 27
Toward General Instruction-Following Alignment for Retrieval-Augmented Generation
Paper • 2410.09584 • Published • 47