The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
Abstract
The alignment tuning process of large language models (LLMs) typically involves instruction learning through supervised fine-tuning (SFT) and preference tuning via reinforcement learning from human feedback (RLHF). A recent study, LIMA (Zhou et al., 2023), shows that using merely 1K examples for SFT can achieve significant alignment performance as well, suggesting that the effect of alignment tuning might be "superficial." This raises questions about how exactly alignment tuning transforms a base LLM. We analyze the effect of alignment tuning by examining the token distribution shift between base LLMs and their aligned counterparts. Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions. Most distribution shifts occur with stylistic tokens. This direct evidence strongly supports the Superficial Alignment Hypothesis suggested by LIMA. Based on these findings, we rethink the alignment of LLMs by posing the research question: how effectively can we align base LLMs without SFT or RLHF? To address this, we introduce a simple, tuning-free alignment method, URIAL. URIAL achieves effective alignment purely through in-context learning (ICL) with base LLMs, requiring as few as three constant stylistic examples and a system prompt. We conduct a fine-grained and interpretable evaluation on a diverse set of examples, named JUST-EVAL-INSTRUCT. Results demonstrate that base LLMs with URIAL can match or even surpass the performance of LLMs aligned with SFT or SFT+RLHF. We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting and ICL. Our findings on the superficial nature of alignment tuning and our results with URIAL suggest that deeper analysis and theoretical understanding of alignment are crucial to future LLM research.
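To make the token distribution shift analysis concrete, here is a minimal sketch of one way to compare a base model and its aligned counterpart position by position. It is an illustration under stated assumptions, not the authors' code: the next-token distributions are hand-written toy dictionaries rather than real model outputs, the names `rank_in` and `classify_positions` are hypothetical, and the rank cutoff `marginal_max_rank` is an assumed parameter. The idea is to take the aligned model's top token at each position and check where that token ranks in the base model's distribution for the same context.

```python
from collections import Counter
from typing import Dict, List

# One next-token distribution: token -> probability (toy data, not model output).
Distribution = Dict[str, float]


def rank_in(dist: Distribution, token: str) -> int:
    """Return the 1-based rank of `token` in `dist` by descending probability.

    A token missing from `dist` is ranked one past the last entry.
    """
    ordered = sorted(dist, key=dist.get, reverse=True)
    return ordered.index(token) + 1 if token in dist else len(ordered) + 1


def classify_positions(
    base: List[Distribution],
    aligned: List[Distribution],
    marginal_max_rank: int = 3,  # assumed cutoff, chosen for illustration
) -> List[str]:
    """Label each position by how the base model ranks the aligned model's top token."""
    labels = []
    for base_dist, aligned_dist in zip(base, aligned):
        aligned_top = max(aligned_dist, key=aligned_dist.get)
        rank = rank_in(base_dist, aligned_top)
        if rank == 1:
            labels.append("unshifted")  # both models would decode the same token
        elif rank <= marginal_max_rank:
            labels.append("marginal")   # base model still ranks the token highly
        else:
            labels.append("shifted")    # base model considers the token unlikely
    return labels


# Hypothetical distributions for four positions of one response.
base_dists = [
    {"Hello": 0.5, "Sure": 0.4, "The": 0.1},
    {"capital": 0.7, "city": 0.2, "answer": 0.1},
    {"is": 0.9, "was": 0.1},
    {".": 0.5, "?": 0.2, ",": 0.15, "!": 0.1, "Thank": 0.05},
]
aligned_dists = [
    {"Sure": 0.7, "Hello": 0.2, "The": 0.1},
    {"capital": 0.8, "city": 0.15, "answer": 0.05},
    {"is": 0.95, "was": 0.05},
    {"Thank": 0.6, ".": 0.4},
]

labels = classify_positions(base_dists, aligned_dists)
summary = Counter(labels)
print(labels)
print(dict(summary))
```

With real models, the per-position distributions would come from running both models on the same prefix; the toy dictionaries only stand in for that step.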
Community
Thanks for sharing! Our project website: https://allenai.github.io/re-align/ (more updates are coming soon, please stay tuned!) 🤗
I f***ing knew it, when you read LIMA it is already obvious all alignment methods are overengineered...
Too bad the LIMA guys did not also try using just 10 training examples; that way, comparing with URIAL would be more fair.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Black-Box Prompt Optimization: Aligning Large Language Models without Model Training (2023)
- Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis (2023)
- Zephyr: Direct Distillation of LM Alignment (2023)
- CycleAlign: Iterative Distillation from Black-box LLM to White-box Models for Better Human Alignment (2023)
- ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up? (2023)
Really interesting paper. I'd be curious to see how fine-tuning on domain-specific data (e.g., finance) compares, and whether all of this still holds. I love that this is shedding light on our understanding of LLMs and how they work!