We conduct fundamental research into the science of scheming and its potential mitigations. We also develop and run pre-deployment evaluations of frontier AI systems.
Stress Testing Deliberative Alignment for Anti-Scheming Training
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Frontier Models are Capable of In-Context Scheming
Towards Safety Cases For AI Scheming
Large Language Models can Strategically Deceive their Users when Put Under Pressure
A Causal Framework for AI Regulation and Auditing
All posts
Filter
- All
- Science of Scheming
- Evaluations
- Interpretability
- Notes
We Need A Science of Scheming
Advanced AI systems face strong incentives to scheme. Apollo Research is building a science of scheming to predict and prevent this risk.
Science of Scheming
Metagaming matters for training, evaluation, and oversight
Science of Scheming
Stress Testing Deliberative Alignment for Anti-Scheming Training
Evaluations
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Evaluations
Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals
Evaluations
More Capable Models Are Better At In-Context Scheming
Evaluations
Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations
Evaluations
Forecasting Frontier Language Model Agent Capabilities
Interpretability
Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition
Interpretability
Detecting Strategic Deception Using Linear Probes
Evaluations
Demo Example – Scheming Reasoning Evaluations
Evaluations
Frontier Models are Capable of In-Context Scheming
Evaluations
The Evals Gap
Evaluations
Towards Safety Cases For AI Scheming
Evaluations
An Opinionated Evals Reading List
Interpretability
Identifying functionally important features with end-to-end sparse dictionary learning
Interpretability
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
Evaluations
Black-Box Access is Insufficient for Rigorous AI Audits
Evaluations
We Need A ‘Science of Evals’
Evaluations
A Starter Guide For Evals
Evaluations
Large Language Models can Strategically Deceive their Users when Put Under Pressure
Evaluations
A Causal Framework for AI Regulation and Auditing
Evaluations
Our research on strategic deception presented at the UK’s AI Safety Summit
Evaluations