Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Paper • 2404.03715
A batched on-policy algorithm that performs iterative self-improvement via contrastive learning.
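
As a rough illustration of what a contrastive update on on-policy samples could look like, the sketch below scores a batch of sampled responses, picks the best and worst as a pair, and applies a pairwise logistic loss. It assumes a DPO-style loss against a frozen reference policy. The function names, the `beta` parameter, and the best-versus-worst pair selection are hypothetical choices made for illustration, not details taken from the summary above.

```python
import math


def contrastive_pair_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Pairwise logistic (DPO-style) loss for one preferred/dispreferred pair.

    Assumption: inputs are sequence log-probabilities under the current
    policy (logp_*) and a frozen reference policy (ref_logp_*).
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def select_pair(responses, scores):
    """Pick the highest- and lowest-scored samples as the contrastive pair.

    Assumption: `scores` come from some preference oracle or annotator;
    here they are plain floats, one per response.
    """
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return responses[best], responses[worst]


if __name__ == "__main__":
    # Toy batch: three on-policy samples for one prompt, with made-up scores.
    winner, loser = select_pair(["a", "b", "c"], [0.2, 0.9, 0.1])
    print(winner, loser)
    # The loss falls below log(2) once the policy favors the winner more than the reference does.
    print(contrastive_pair_loss(-1.0, -3.0, -2.0, -2.0))
```

In an iterative setup, each round would sample fresh responses from the current policy, rebuild the pairs, and minimize this loss over the batch before the next round begins.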