Science – Apollo Research

2 min read Original article ↗

We conduct fundamental research into the science of scheming and its potential mitigations. We also develop and run pre-deployment evaluations of frontier AI systems.

Stress Testing Deliberative Alignment for Anti-Scheming Training

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Frontier Models are Capable of In-Context Scheming

Towards Safety Cases For AI Scheming

Large Language Models can Strategically Deceive their Users when Put Under Pressure

A Causal Framework for AI Regulation and Auditing

All posts

Filter

  • All
  • Science of Scheming
  • Evaluations
  • Interpretability
  • Notes

We Need A Science of Scheming

Advanced AI systems face strong incentives to scheme. Apollo Research is building a science of scheming to predict and prevent this risk.

Science of Scheming

Metagaming matters for training, evaluation, and oversight

Science of Scheming

Stress Testing Deliberative Alignment for Anti-Scheming Training

Evaluations

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Evaluations

Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals

Evaluations

More Capable Models Are Better At In-Context Scheming

Evaluations

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Evaluations

Forecasting Frontier Language Model Agent Capabilities

Interpretability

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition

Interpretability

Detecting Strategic Deception Using Linear Probes

Evaluations

Demo Example – Scheming Reasoning Evaluations

Evaluations

Frontier Models are Capable of In-Context Scheming

Evaluations

The Evals Gap

Evaluations

Towards Safety Cases For AI Scheming

Evaluations

An Opinionated Evals Reading List

Interpretability

Identifying functionally important features with end-to-end sparse dictionary learning

Interpretability

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Evaluations

Black-Box Access is Insufficient for Rigorous AI Audits

Evaluations

We Need A ‘Science of Evals’

Evaluations

A Starter Guide For Evals

Evaluations

Large Language Models can Strategically Deceive their Users when Put Under Pressure

Evaluations

A Causal Framework for AI Regulation and Auditing

Evaluations

Our research on strategic deception presented at the UK’s AI Safety Summit

Evaluations

Understanding strategic deception and deceptive alignment