Our approach to alignment research

Reinforcement learning (RL) from human feedback is our main technique for aligning our deployed language models today. We train a class of models called InstructGPT, derived from pretrained language models such as GPT‑3. These models are trained to follow human intent: both explicit intent given by an instruction and implicit intent such as truthfulness, fairness, and safety.
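
As a rough illustration of how human preference comparisons can become a training signal, here is a minimal sketch of a pairwise preference loss of the kind commonly used to train reward models. It is not a description of the exact objective used for InstructGPT; the function name `pairwise_preference_loss` and the scalar reward scores are hypothetical and chosen only for this example.

```python
import math


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred response wins.

    Assumes a Bradley-Terry style model: the probability that the preferred
    response beats the rejected one is sigmoid(reward_preferred - reward_rejected).
    Both reward values are hypothetical scalar scores from a reward model.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch on the sign of
    # the margin so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # A labeler preferred response A over response B.
    print(pairwise_preference_loss(reward_preferred=1.5, reward_rejected=0.2))
    # The same comparison with the scores reversed is penalized more heavily.
    print(pairwise_preference_loss(reward_preferred=0.2, reward_rejected=1.5))
```

The loss is close to zero when the preferred response is scored well above the rejected one, and it grows roughly linearly with the gap when the ordering is reversed.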

Our results show that there is a lot of low-hanging fruit in alignment-focused fine-tuning right now: humans prefer InstructGPT over a 100x larger pretrained model, while its fine-tuning costs less than 2% of GPT‑3’s pretraining compute plus about 20,000 hours of human feedback. We hope that our work inspires others in the industry to increase their investment in alignment of large language models and that it raises the bar on users’ expectations about the safety of deployed models.

Our natural language API is a very useful environment for our alignment research: it provides us with a rich feedback loop about how well our alignment techniques actually work in the real world, grounded in a very diverse set of tasks that our customers are willing to pay money for. On average, our customers already prefer to use InstructGPT over our pretrained models.

Yet today’s versions of InstructGPT are quite far from fully aligned: they sometimes fail to follow simple instructions, aren’t always truthful, don’t reliably refuse harmful tasks, and sometimes give biased or toxic responses. Some customers find InstructGPT’s responses significantly less creative than the pretrained models’, something we hadn’t realized from running InstructGPT on publicly available benchmarks. We are also working on developing a more detailed scientific understanding of RL from human feedback and how to improve the quality of human feedback.

Aligning our API is much easier than aligning AGI since most tasks on our API aren’t very hard for humans to supervise and our deployed language models aren’t smarter than humans. We don’t expect RL from human feedback to be sufficient to align AGI, but it is a core building block for the scalable alignment proposals that we’re most excited about, and so it’s valuable to perfect this methodology.

There is currently no known indefinitely scalable solution to the alignment problem. As AI progress continues, we expect to encounter a number of new alignment problems that we don’t observe yet in current systems. Some of these problems we anticipate now and some of them will be entirely new.

We believe that finding an indefinitely scalable solution is likely very difficult. Instead, we aim for a more pragmatic approach: building and aligning a system that can make faster and better alignment research progress than humans can.

As we make progress on this, our AI systems can take over more and more of our alignment work and ultimately conceive, implement, study, and develop better alignment techniques than we have now. They will work together with humans to ensure that their own successors are more aligned with humans.

We believe that evaluating alignment research is substantially easier than producing it, especially when provided with evaluation assistance. Therefore, human researchers will focus more and more of their effort on reviewing alignment research done by AI systems instead of generating this research themselves. Our goal is to train models to be so aligned that we can off-load almost all of the cognitive labor required for alignment research.

Importantly, we only need “narrower” AI systems that have human-level capabilities in the relevant domains to do as well as humans on alignment research. We expect these AI systems to be easier to align than general-purpose systems or systems much smarter than humans.

Language models are particularly well-suited for automating alignment research because they come “preloaded” with a lot of knowledge and information about human values from reading the internet. Out of the box, they aren’t independent agents and thus don’t pursue their own goals in the world. To do alignment research they don’t need unrestricted access to the internet. Yet a lot of alignment research tasks can be phrased as natural language or coding tasks.

Future versions of WebGPT, InstructGPT, and Codex can provide a foundation as alignment research assistants, but they aren’t sufficiently capable yet. While we don’t know when our models will be capable enough to meaningfully contribute to alignment research, we think it’s important to get started ahead of time. Once we train a model that could be useful, we plan to make it accessible to the external alignment research community.