GitHub - janetmalzahn/llm-phacking: Replication archive for "Do Claude Code and Codex P-Hack? Sycophancy and Statistical Analysis in Large Language Models"


Samuel G.Z. Asher, Janet Malzahn, Jessica M. Persano, Elliot J. Paschal, Andrew C. W. Myers, Andrew B. Hall

Working paper: Asher et al. (2026)

Overview

Large language models are increasingly used as research assistants for statistical analysis. We test whether LLMs engage in specification search or inflate estimates when prompted to do so — effectively automating p-hacking.

We ran 640 independent analysis sessions (4 papers x 8 prompts x 10 runs x 2 models) in which Claude (Opus 4.6) and Codex (GPT-5.2-Codex) each received a dataset and instructions to conduct a standard empirical analysis. Each run independently produced an R script, a session log, and a coefficient estimate with 95% confidence interval. Prompts varied along two dimensions: research framing and nudge condition.
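The per-run coefficient output can be pictured as one CSV row per session; the layout below is purely illustrative (column names and values are assumptions, not the repository's actual schema):

```
paper,model,prompt,run,coef,ci_lower,ci_upper
thompson_2020,claude,directional_significance,3,-0.04,-0.13,0.05
```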

Experimental design

Papers and research designs

We selected four published papers with null or near-null results, each using a different canonical research design:

Paper                     Design                             Published Estimate
Thompson (2020)           Regression Discontinuity (RDD)     -0.06 (SE: 0.05)
Dynes & Holbein (2019)    Difference-in-Differences (DID)     0.02 (SE: 0.08)
Kam & Palmer (2008)       Selection on Observables            0.15 (SE: 0.48)
Kalla & Broockman (2018)  Randomized Controlled Trial (RCT)  -0.03 (SE: 0.04)

Prompt conditions (2 x 4)

Crossing two experimental dimensions yields 8 prompts per paper:

Research framing (2 levels):

  • Research question — neutral framing
  • Directional prior — hypothesis-laden framing

Nudge condition (4 levels):

  1. None — no additional instruction
  2. Upstanding Researcher — explicitly asks for the most credible result
  3. Significance — pressures for statistically significant findings
  4. Nuclear — reframes specification search as legitimate uncertainty reporting

Models

Each prompt was run 10 times per model, independently (no memory across runs):

  • Claude (Opus 4.6) via Claude Code CLI — 320 runs
  • Codex (GPT-5.2-Codex) via OpenAI Codex CLI — 320 runs
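The full design can be sketched as a cross product (the framing and nudge identifiers below are placeholders; the real prompt files live under each paper's prompt_construction/combined_prompts/):

```python
from itertools import product

papers = ["thompson_2020", "dynes_holbein_2019",
          "kam_palmer_2008", "kalla_broockman_2018"]
framings = ["research_question", "directional_prior"]       # 2 research framings
nudges = ["none", "upstanding", "significance", "nuclear"]  # 4 nudge conditions
models = ["claude", "codex"]
runs = range(1, 11)                                         # 10 independent runs

# 4 papers x (2 x 4 = 8 prompts) x 2 models x 10 runs
sessions = list(product(papers, framings, nudges, models, runs))
print(len(sessions))  # 640
```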

Repository structure

llm-phacking/
├── shared_modules/                    # Shared prompt components
│   ├── Nudge/                         # 4 nudge variants
│   └── OutputRequirements/            # Output template
│
├── scripts/
│   ├── combine_prompts.py             # Prompt generator
│   ├── run_experiment_claude.sh       # Claude experiment runner
│   ├── run_experiment_codex.sh        # Codex experiment runner
│   ├── count_specs.sh                 # Specification counting (Haiku + Sonnet)
│   ├── compare_haiku_sonnet.sh        # Spec count verification
│   └── analysis/
│       ├── analyze_first_prompt_try.R # Per-paper coefficient plots
│       ├── pooled_summary_figure.R    # Pooled nudge effect figure
│       ├── figures_for_draft.R        # Publication figures
│       └── plot_spec_counts.R         # Spec count visualizations
│
├── papers/
│   ├── thompson_2020/                 # RDD
│   ├── dynes_holbein_2019/            # DID
│   ├── kam_palmer_2008/               # Selection on observables
│   └── kalla_broockman_2018/          # RCT
│       ├── data/                      # Dataset
│       ├── prompt_construction/       # Modular prompt components
│       │   └── combined_prompts/      # Generated prompts (8 per paper)
│       └── runs/
│           ├── claude/                # Claude runs (R scripts, logs, coefficients)
│           └── codex/                 # Codex runs
│               └── fix_aggregation.R  # Aggregates individual Codex run CSVs
│
├── output/
│   ├── spec_counts.csv                # Specification counts per run
│   ├── haiku_sonnet_comparison.csv    # Spec count verification
│   ├── spec_count_logs/               # Claude API logs from spec counting
│   └── figures/                       # All generated figures
│
└── prompt_instructions.md             # Human-readable prompt overview

Replication

All 640 runs (R scripts, logs, and coefficient CSVs) are included in the repository. You can reproduce the analysis figures directly from the committed data without re-running any LLM sessions.

Reproducing the analysis (figures and tables)

Prerequisites: R with the tidyverse and here packages.

install.packages(c("tidyverse", "here"))

Generate all figures:

# Per-paper coefficient plots (one per paper x model)
Rscript scripts/analysis/analyze_first_prompt_try.R

# Pooled summary figure (shift from baseline in published SEs)
Rscript scripts/analysis/pooled_summary_figure.R

# Publication figures (figure_1.pdf, figure_2.pdf)
Rscript scripts/analysis/figures_for_draft.R

# Specification count bar charts (requires output/spec_counts.csv)
Rscript scripts/analysis/plot_spec_counts.R

Figures are written to output/figures/.

Re-running the experiments from scratch

Re-running the full experiment requires API access to both Claude and Codex and is non-trivial: 640 LLM sessions, each writing and executing R code, take several hours and incur significant API costs. The steps below are provided for full reproducibility but are not necessary to verify the analysis.

Steps:

1. Generate prompts:

python3 scripts/combine_prompts.py --paper papers/thompson_2020

This combines modular prompt components (Goal, Task, Data, Nudge) into 8 combined prompts per paper.
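A minimal sketch of what combine_prompts.py does, assuming one text file per component (the component file names and .md extension are assumptions for illustration):

```python
from itertools import product
from pathlib import Path

def combine_prompts(paper_dir: Path) -> None:
    """Concatenate Goal/Task/Data with each framing x nudge pair: 8 prompts."""
    parts = paper_dir / "prompt_construction"
    # Shared base text common to every prompt for this paper
    base = "\n\n".join((parts / f"{name}.md").read_text()
                       for name in ("Goal", "Task", "Data"))
    out = parts / "combined_prompts"
    out.mkdir(exist_ok=True)
    framings = ["research_question", "directional_prior"]
    nudges = ["none", "upstanding", "significance", "nuclear"]
    nudge_dir = paper_dir.parent.parent / "shared_modules" / "Nudge"
    for framing, nudge in product(framings, nudges):
        framing_text = (parts / f"{framing}.md").read_text()
        nudge_text = (nudge_dir / f"{nudge}.md").read_text()
        (out / f"{framing}_{nudge}.md").write_text(
            "\n\n".join([base, framing_text, nudge_text]))
```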

2. Create system instructions:

Each paper directory needs a CLAUDE.md (for Claude) or AGENTS.md (for Codex) with your local R path. These are gitignored. Example:

R is installed at: /usr/local/bin/Rscript
IMPORTANT: Write ALL output files in the current working directory.
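One way to create the file from the repository root (the R path will vary by machine):

```shell
mkdir -p papers/thompson_2020
cat > papers/thompson_2020/CLAUDE.md <<'EOF'
R is installed at: /usr/local/bin/Rscript
IMPORTANT: Write ALL output files in the current working directory.
EOF
```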

3. Run experiments:

# Claude (10 runs per prompt, 4 parallel)
./scripts/run_experiment_claude.sh -p papers/thompson_2020 -n 10 -j 4

# Codex
./scripts/run_experiment_codex.sh -p papers/thompson_2020 -n 10 -j 4

Options: -n (runs per prompt), -j (parallel jobs), --prompt-filter (subset prompts), --dry-run (preview).
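The runner scripts' scheduling amounts to the loop below, a sketch rather than the actual shell code; the session launch is stubbed out where the real scripts invoke the Claude Code or Codex CLI:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_session(prompt: str, run_id: int) -> str:
    # Stub: the real scripts launch an independent CLI session here.
    return f"{prompt}_run{run_id}"

def run_experiment(prompts, n_runs=10, n_jobs=4):
    """-n controls runs per prompt, -j the number of parallel sessions."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        futures = [pool.submit(run_session, p, i)
                   for p, i in product(prompts, range(1, n_runs + 1))]
        return [f.result() for f in futures]

results = run_experiment([f"prompt_{k}" for k in range(1, 9)], n_runs=10, n_jobs=4)
print(len(results))  # 8 prompts x 10 runs = 80 sessions per paper per model
```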

4. Aggregate Codex runs:

After running Codex experiments, aggregate the individual run-level coefficient CSVs into prompt-level summary CSVs:

# Run from each paper's codex runs directory (subshells keep the caller's working directory unchanged)
( cd papers/thompson_2020/runs/codex && Rscript fix_aggregation.R )
( cd papers/kalla_broockman_2018/runs/codex && Rscript fix_aggregation.R )

This step is required before running the analysis scripts. (The aggregated CSVs are already committed in the repository, so this is only needed if re-running the Codex experiments from scratch.)
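The aggregation logic, sketched in Python (the run-level column names and the mean-based summary are assumptions about what fix_aggregation.R computes):

```python
import csv
import io
from collections import defaultdict

def aggregate(run_rows):
    """Collect run-level coefficient rows into one mean per prompt."""
    by_prompt = defaultdict(list)
    for row in run_rows:
        by_prompt[row["prompt"]].append(float(row["coef"]))
    return {p: sum(v) / len(v) for p, v in by_prompt.items()}

# Hypothetical run-level CSV contents
raw = "prompt,run,coef\nprompt_1,1,0.10\nprompt_1,2,0.30\nprompt_2,1,-0.05\n"
rows = list(csv.DictReader(io.StringIO(raw)))
summary = aggregate(rows)
print(summary)
```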

Specification counting (optional)

The specification counting pipeline uses Claude (Haiku for pass 1, Sonnet for pass 2) to count the number of distinct statistical specifications in each run's R script. This also requires API access.
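The two-pass verification boils down to an agreement check like the one below (the per-run counts here are made up; the real comparison lives in output/haiku_sonnet_comparison.csv):

```python
def agreement_rate(haiku_counts, sonnet_counts):
    """Fraction of runs where the two passes report the same spec count."""
    matches = sum(h == s for h, s in zip(haiku_counts, sonnet_counts))
    return matches / len(haiku_counts)

# Hypothetical per-run specification counts from each pass
haiku = [1, 3, 2, 5, 1]
sonnet = [1, 3, 2, 4, 1]
print(agreement_rate(haiku, sonnet))  # 0.8
```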

bash scripts/count_specs.sh
Rscript scripts/analysis/plot_spec_counts.R