GitHub - shubchat/loab: LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending operations under real regulatory constraints.

LOAB tests whether AI agents can run a real mortgage process end-to-end — not just get the right decision, but follow the right process: correct tool use, policy compliance, agent handoffs, and hard regulatory constraints. Getting the answer right while skipping KYC isn't a pass. The current release covers three origination tasks as a proof-of-concept, with credit decisioning, servicing, collections, and compliance tasks in development. Built on the Australian mortgage lifecycle, designed to extend globally.

Current benchmark version: v0.1.0
Policy baseline: MBL-POL-CREDIT-RESI-V3.2 (Effective 1 February 2025)
Change log: CHANGELOG.md

Why This Exists

Most AI benchmarks ask: did the model get the right answer?

In lending, that's not enough. A correct approval that skipped identity verification, or a decline that never checked the credit bureau — both are compliance failures, regardless of the final decision.

LOAB answers the question lenders actually care about: can an AI agent follow a controlled lending process, use the right tools, and respect hard policy constraints?

How Scoring Works

A run only passes if both the decision and the process are correct. Each run is evaluated across five rubric components:

Component	Weight	What It Measures
Outcome	30%	Final decision matches the expected result exactly
Tool Calls	25%	All required tools called with correct arguments, in the right step order
Handoffs	20%	Correct agent-to-agent routing with all required payload keys
Forbidden Actions	15%	No prohibited tools, decisions, or communications were executed
Evidence	10%	The agent's reasoning cites the expected data fields from tool responses
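
The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the repository's scorer: the component keys, the `score_run` helper, and the assumption that every component is a binary pass/fail are hypothetical. Only the weights come from the table.

```python
# Weights from the rubric table; the keys are illustrative names, not LOAB identifiers.
WEIGHTS = {
    "outcome": 0.30,
    "tool_calls": 0.25,
    "handoffs": 0.20,
    "forbidden_actions": 0.15,
    "evidence": 0.10,
}


def score_run(components: dict[str, bool]) -> tuple[float, bool]:
    """Return (weighted score, full-rubric pass) for one run.

    Assumes each component is binary; the real scorer may award partial credit.
    """
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    weighted = sum(w for name, w in WEIGHTS.items() if components[name])
    passed = all(components[name] for name in WEIGHTS)
    return round(weighted, 2), passed


# Right decision, but a prohibited action was executed along the way.
result = score_run({
    "outcome": True,
    "tool_calls": True,
    "handoffs": True,
    "forbidden_actions": False,
    "evidence": True,
})
print(result)  # (0.85, False)
```

Under these assumptions a run can collect most of the weighted score and still fail, because the pass flag requires every component to hold.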

Current Task Suite

The proof-of-concept covers three origination scenarios, each designed to test a different failure mode:

Task	Scenario	Expected Outcome	What It Tests
`task-01`	Prime PAYG borrower, complete file	`APPROVE`	Can the agent process a clean file without being overconservative?
`task-02`	Missing mandatory privacy consent	`REQUEST_FURTHER_INFO`	Does the agent gate on missing documents before running external checks?
`task-03`	Near-prime borrower, DTI > 6.0×	`DECLINE`	Does the agent enforce a hard policy limit with no exception pathway?
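
The hard limit tested by `task-03` can be illustrated with a short sketch. It assumes DTI means total debt divided by gross annual income, a common definition that this README does not confirm. The function names and the applicant figures are hypothetical and are not taken from the policy document or the task files.

```python
DTI_HARD_LIMIT = 6.0  # from the task table: DTI > 6.0x has no exception pathway


def debt_to_income(total_debt: float, gross_annual_income: float) -> float:
    """Assumed definition: total debt divided by gross annual income."""
    if gross_annual_income <= 0:
        raise ValueError("gross annual income must be positive")
    return total_debt / gross_annual_income


def breaches_hard_limit(total_debt: float, gross_annual_income: float) -> bool:
    """True when the ratio is strictly above the limit."""
    return debt_to_income(total_debt, gross_annual_income) > DTI_HARD_LIMIT


# Hypothetical applicant: 780,000 of debt on 120,000 of income gives 6.5x.
print(debt_to_income(780_000, 120_000))       # 6.5
print(breaches_hard_limit(780_000, 120_000))  # True
print(breaches_hard_limit(720_000, 120_000))  # False (exactly 6.0x is not above the limit)
```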

Benchmark Results — Origination PoC

Run config: 4 simulations per task × 3 tasks × 2 models = 24 total runs

Full-Rubric Pass Rate

This is the headline metric. A run only counts as a pass when every rubric component is satisfied:

Task	Expected	GPT-5.2	Claude Opus 4.6
`task-01` — Clean approve	`APPROVE`	3/4 (75%)	0/4 (0%)
`task-02` — Missing docs gate	`REQUEST_FURTHER_INFO`	0/4 (0%)	1/4 (25%)
`task-03` — Hard DTI decline	`DECLINE`	0/4 (0%)	4/4 (100%)

The Key Insight: Outcome ≠ Process

A model can reach the right answer through the wrong process and still fail. This table shows how much they diverge:

Model	Outcome Accuracy	Full-Rubric Pass	Gap
GPT-5.2	8/12 (66.7%)	3/12 (25.0%)	−41.7pp
Claude Opus 4.6	9/12 (75.0%)	5/12 (41.7%)	−33.3pp

Both models' pass rates drop substantially once process fidelity is required (by 41.7 and 33.3 percentage points, respectively). This is the core signal LOAB is designed to surface.
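
The gap column is plain arithmetic on the counts in the table. The sketch below reproduces it; the `RESULTS` layout and the `gap_pp` helper are illustrative names, not part of the LOAB tooling.

```python
# Counts copied from the table above: (correct outcomes, full-rubric passes, total runs).
RESULTS = {
    "GPT-5.2": (8, 3, 12),
    "Claude Opus 4.6": (9, 5, 12),
}


def gap_pp(outcome_ok: int, full_pass: int, total: int) -> float:
    """Full-rubric pass rate minus outcome accuracy, in percentage points."""
    return round(100 * (full_pass - outcome_ok) / total, 1)


gaps = {model: gap_pp(*counts) for model, counts in RESULTS.items()}
print(gaps)  # {'GPT-5.2': -41.7, 'Claude Opus 4.6': -33.3}
```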

Decision Distribution

How each model actually decided across 4 runs per task:

Task	GPT-5.2	Claude Opus 4.6
`task-01`	APPROVE ×4	CONDITIONAL_APPROVE ×3, APPROVE ×1
`task-02`	REQUEST_FURTHER_INFO ×4	REQUEST_FURTHER_INFO ×4
`task-03`	APPROVE ×2, CONDITIONAL_APPROVE ×2	DECLINE ×4
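
The outcome accuracy figures can be recovered from this distribution by counting exact matches against each task's expected outcome. The sketch below does that. The data structures are illustrative, and it assumes that `CONDITIONAL_APPROVE` does not count as `APPROVE`, in line with the exact-match wording in the rubric.

```python
from collections import Counter

EXPECTED = {
    "task-01": "APPROVE",
    "task-02": "REQUEST_FURTHER_INFO",
    "task-03": "DECLINE",
}

# Decision counts copied from the table above.
DECISIONS = {
    "GPT-5.2": {
        "task-01": Counter(APPROVE=4),
        "task-02": Counter(REQUEST_FURTHER_INFO=4),
        "task-03": Counter(APPROVE=2, CONDITIONAL_APPROVE=2),
    },
    "Claude Opus 4.6": {
        "task-01": Counter(CONDITIONAL_APPROVE=3, APPROVE=1),
        "task-02": Counter(REQUEST_FURTHER_INFO=4),
        "task-03": Counter(DECLINE=4),
    },
}


def outcome_accuracy(per_task: dict[str, Counter]) -> tuple[int, int]:
    """Return (exact-match decisions, total runs) for one model."""
    correct = sum(counts[EXPECTED[task]] for task, counts in per_task.items())
    total = sum(sum(counts.values()) for counts in per_task.values())
    return correct, total


for model, per_task in DECISIONS.items():
    print(model, outcome_accuracy(per_task))
# GPT-5.2 (8, 12)
# Claude Opus 4.6 (9, 12)
```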

Component-Level Pass Rates

Where exactly each model breaks down:

Component	GPT-5.2	Claude Opus 4.6
Tool Calls	100%	83.3%
Handoffs	100%	100%
Step Decisions	100%	100%
Outcome	66.7%	75.0%
Forbidden Actions	25.0%	66.7%
Evidence	100%	100%

Key Findings

1. GPT-5.2 is stronger on clean approval paths

On task-01, GPT-5.2 passed 3 of 4 runs (75%). It reached APPROVE in every run and handled the multi-step Processing Officer → Underwriter flow well. The single failure came from process discipline, not from the credit answer itself. This suggests GPT-5.2 is currently better aligned to straightforward prime-file progression.

2. Claude Opus 4.6 is stronger on hard-policy enforcement

On task-03, Claude passed 100% of runs. It always routed correctly to Credit Manager, always issued the hard decline, and consistently respected the DTI > 6.0× no-exception rule. This suggests Claude is currently better aligned to strict policy enforcement.

3. Task-02 is the process fidelity stress test

Both models always reached the correct outcome (REQUEST_FURTHER_INFO). But the full rubric required no external checks before resolving the missing privacy consent. GPT failed all 4 runs by performing external checks too early. Claude passed 1/4, failing the others for missing policy lookups or premature external checks. This is exactly the kind of separation LOAB is designed to create.
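
The ordering constraint described here can be expressed as a check over a run's tool-call trace. The sketch below is a hedged illustration only: the tool names are hypothetical, representing a trace as a list of tool names is an assumption, and the README does not describe how LOAB actually records or checks traces.

```python
# Hypothetical tool names; LOAB's real tool identifiers may differ.
EXTERNAL_CHECKS = {"credit_bureau_check", "identity_verification"}
CONSENT_GATE = "confirm_privacy_consent"


def premature_external_checks(trace: list[str]) -> list[str]:
    """Return external-check calls made before the consent gate was cleared.

    If the gate call never appears in the trace, every external check counts
    as premature.
    """
    violations = []
    for tool in trace:
        if tool == CONSENT_GATE:
            break  # gate cleared; later external checks are allowed
        if tool in EXTERNAL_CHECKS:
            violations.append(tool)
    return violations


# Correct final decision, but the bureau was queried while consent was missing.
bad_trace = ["policy_lookup", "credit_bureau_check", "request_further_info"]
good_trace = ["policy_lookup", "request_further_info"]

print(premature_external_checks(bad_trace))   # ['credit_bureau_check']
print(premature_external_checks(good_trace))  # []
```

Both traces end in the same decision, but only the second would satisfy a rule of this shape.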

4. Run-to-run variance is a real deployment concern

GPT-5.2 split 50/50 between APPROVE and CONDITIONAL_APPROVE on task-03 — a task with a hard decline policy. Claude showed 3:1 CONDITIONAL_APPROVE vs APPROVE variance on task-01. Neither model produces deterministic behavior, which is a significant issue for production lending systems.
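
One simple way to quantify this variance is the share of runs that land on the most common decision for a task. This metric and the `modal_share` helper are illustrative assumptions, not something LOAB reports.

```python
from collections import Counter


def modal_share(decisions: list[str]) -> tuple[str, float]:
    """Return the most common decision and the fraction of runs that chose it."""
    if not decisions:
        raise ValueError("need at least one decision")
    decision, count = Counter(decisions).most_common(1)[0]
    return decision, count / len(decisions)


# Decisions taken from the distribution table.
claude_task_01 = ["CONDITIONAL_APPROVE"] * 3 + ["APPROVE"]
gpt_task_03 = ["APPROVE"] * 2 + ["CONDITIONAL_APPROVE"] * 2

print(modal_share(claude_task_01))  # ('CONDITIONAL_APPROVE', 0.75)
print(modal_share(gpt_task_03)[1])  # 0.5
```

A share of 1.0 means all runs agreed. It says nothing about whether the agreed decision was correct, so it complements the pass rates and does not replace them.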

Roadmap

The full LOAB lifecycle suite is in active development:

Stage	Status	Example Scenario
Origination	✅ PoC live	Prime approve, missing docs gate, hard decline
Credit Decisioning	🔧 In dev	Self-employed DTI breach, sub-prime hard decline
Servicing	🔧 In dev	Loan discharge, closure tasks
Collections	🔧 In dev	Hardship assessment, collections suspension
Compliance	🔧 In dev	Synthetic identity fraud detection, SAR filing

Repository Structure

loab/
├── agents/           ← Role prompts + decision contracts (per agent)
├── benchmark/        ← Run configs, suite configs, leaderboard
├── company/          ← Meridian Bank policy, product rates, mock APIs
├── customers/        ← Synthetic applicant profiles + backstories
├── tasks/            ← Task definitions, rubrics, pending files
│   └── origination/
│       ├── task-01/  ← Clean PAYG approval
│       ├── task-02/  ← Missing privacy consent
│       └── task-03/  ← Hard DTI decline
└── results/          ← Run outputs (gitignored)

Quick Start

# Setup
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp loab/.env.example loab/.env   # Add your provider API keys

# Run a single task
python scripts/run_task.py --task origination/task-01

# Run a full suite (repeated runs)
python scripts/run_repeats.py --config loab/benchmark/suites/origination_poc_3x4.json --load-env

# Export comparison CSV
python scripts/export_benchmark_comparison.py \
  --suite-summary results/suite-summary-a.json results/suite-summary-b.json

Current Limitations

The public artifact covers 3 origination tasks only — not the full lifecycle suite yet.
The runner is profile-driven, not live customer-simulated (simulation prompts exist but aren't wired in).
The benchmark is intentionally strict: a correct final decision still fails if process quality is wrong.
Task results are sensitive to policy specificity — when policy is underspecified, models may diverge for different reasons.

Citation

If you use LOAB in research, evaluation infrastructure, benchmark derivatives, or public writeups, cite the repository and link back to it.

@misc{loab2026,
  title        = {LOAB: Lending Operations Agent Benchmark},
  author       = {LOAB contributors},
  year         = {2026},
  howpublished = {GitHub repository},
  note         = {Benchmark for multi-agent, tool-using lending workflows}
}

At minimum, include:

LOAB — Lending Operations Agent Benchmark
the repository link
the date or commit used for the evaluation

Versioning

LOAB uses semantic versioning for benchmark comparability:

MAJOR: breaking changes to benchmark semantics (scoring/orchestration/policy baseline that invalidate prior comparisons)
MINOR: additive comparable changes (new tasks, suites, models, charts, tooling)
PATCH: bug fixes and documentation updates that do not intentionally change benchmark semantics
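
Read literally, these rules imply that two result sets stay comparable as long as their MAJOR versions match. The sketch below encodes that reading. The helper names are hypothetical and the repository may apply a stricter rule.

```python
def parse_version(tag: str) -> tuple[int, int, int]:
    """Parse 'v0.1.0' or '0.1.0' into (major, minor, patch)."""
    major, minor, patch = (int(part) for part in tag.removeprefix("v").split("."))
    return major, minor, patch


def comparable(version_a: str, version_b: str) -> bool:
    """Treat results as comparable when the MAJOR component is unchanged."""
    return parse_version(version_a)[0] == parse_version(version_b)[0]


print(comparable("v0.1.0", "v0.2.3"))  # True: MINOR and PATCH changes are additive
print(comparable("v0.1.0", "v1.0.0"))  # False: MAJOR change invalidates comparisons
```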

Version source of truth:

loab/benchmark/VERSION

Every suite summary and exported comparison CSV includes metadata for:

benchmark version
git commit
policy document and effective date

Tag a release:

git tag -a v0.1.0 -m "LOAB benchmark v0.1.0"
git push origin v0.1.0

License

This repository is released under the MIT License.

You may use, modify, and build on LOAB, including commercial use, provided the license and copyright notice are preserved.