Governed Reward Experiment
Minimal runnable experiment for the thesis:
Reward is not reinforcement until admitted.
The experiment uses a ranking-only setup rather than model fine-tuning. Each synthetic coding task receives several candidate patch outcomes. A raw selector chooses the patch with the highest raw reward, while a governed selector chooses the patch with the highest admitted reward after invariant, exploit, causal, hidden-test, and delayed-regression checks.
Run
python3 governed_reward_experiment.py
Optional parameters:
python3 governed_reward_experiment.py --tasks 100 --candidates 7 --seed 11
Run the multi-seed selector and ablation suite:
python3 governed_reward_experiment.py \ --suite \ --tasks 100 \ --seed-start 10 \ --seed-end 30 \ --candidate-grid 3,5,7,10 \ --out results/governed_reward_suite.json \ --report results/governed_reward_suite_report.md
Run the executable real-code benchmark:
python3 real_codebase_experiment.py
Run the real-code benchmark with DeepSeek-generated patches:
read -rs DEEPSEEK_API_KEY export DEEPSEEK_API_KEY python3 real_codebase_experiment.py \ --generator deepseek \ --model deepseek-chat \ --llm-candidates 12 \ --out results/deepseek_real_codebase_results_12.json \ --report results/deepseek_real_codebase_report_12.md
Outputs
results/governed_reward_results.json: full task, candidate, selector, and audit-trace data.results/governed_reward_report.md: compact metrics report.results/governed_reward_suite.json: multi-seed selector and ablation data.results/governed_reward_suite_report.md: mean, standard deviation, 95% confidence interval, and paired effect-size summary.results/real_codebase_results.json: executable benchmark results from applying candidate patches to temporary Python packages and running tests.results/real_codebase_report.md: compact real-code benchmark report.results/deepseek_real_codebase_results_12.json: real-code benchmark using DeepSeek-generated candidate patches.results/deepseek_real_codebase_report_12.md: compact report for the DeepSeek-generated patch run.
Compared Selectors
RawSelector = argmax(raw_reward)WeightedScalarSelector = argmax(weighted_scalar_reward)GovernedSelector = argmax(admitted_reward)OracleSelector = argmax(known_ground_truth_quality)LargerDiffSelector = argmax(diff_size)
Raw reward uses visible tests, typecheck, lint, and build.
Governed reward applies hard overrides for test weakening, hardcoding, security violations, secrets, architecture violations, and public API breaks. Soft checks discount hidden-test failure, large diffs, weak causal attribution, unnecessary dependencies, and delayed regressions.
Headline Metrics
The report compares:
- visible test pass rate
- hidden test pass rate
- reward hacking rate
- hard invariant violation rate
- architecture violation rate
- delayed regression rate
- security regression rate
- public API break rate
- diff size
- maintainability
- causal score
- raw and admitted reward
- hidden-test pass per changed line
- robustness per diff size
Ablations
The suite compares full governance against removing:
- hard invariant filter
- exploit detector
- hidden test evidence
- delayed regression evidence
- causal attribution
- maintainability discount
