Using the Autoresearch Project to Build the Fastest Java Decompiler


Most people first see Andrej's Autoresearch project as an ML autotuning setup: an agent edits one file (train.py), runs short experiments, and keeps only measurable improvements. Under the hood, though, the real value is not "LLM training." The real value is the architecture: a closed-loop research system with explicit goals, constrained change scope, objective evaluation, and hard keep/revert rules.

That pattern transfers cleanly to systems work, including decompiler optimization, which is the pivotal part of Jar.Tools. I called my decompilation engine IPND, and I wanted it to be the fastest way to decompile a Java class into readable Java source code.

The Core Architecture (Domain-Agnostic)

At a high level, this project separates policy from execution:

  • Policy lives in program.md (what to optimize, what constraints matter, what constitutes a win).
  • Execution lives in code and tooling (train.py, test harnesses, benchmarks, profiling tools).
  • Decisions are recorded as structured experiment outcomes (baseline vs current).

flowchart LR
    A[Human defines objective + constraints] --> B[Program spec / playbook]
    B --> C[Agent proposes code change]
    C --> D[Run harness]
    D --> E[Collect metrics]
    E --> F{Beats baseline?}
    F -- yes --> G[Keep change]
    F -- no --> H[Discard/Revert]
    G --> I[Update baseline + log]
    H --> I
    I --> C

Three design choices make this robust across domains:

  1. Fixed evaluation protocol: same benchmark shape each iteration, so comparisons stay valid.
  2. Explicit baseline: every candidate is judged relative to a known reference, not gut feeling.
  3. Tight loop latency: faster iteration means more hypotheses tested per hour.

In ML, the metric is validation bits-per-byte. In decompiler work, the metric can be latency, memory, correctness parity, or all three.
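The keep/revert gate at the heart of that loop is small enough to sketch directly. The `Run` struct and `should_keep` function below are illustrative names of mine, not the project's actual code; the rule they encode is the one the article describes: a candidate is kept only when the tracked metric improves and every guardrail still holds.

```rust
/// One benchmark run's results (hypothetical shape for illustration).
#[derive(Debug)]
struct Run {
    latency_ms: f64,   // optimization target
    peak_rss_kb: u64,  // memory guardrail
    tests_pass: bool,  // correctness guardrail
}

/// Keep a candidate only if it beats the baseline on the target metric
/// without regressing any guardrail.
fn should_keep(baseline: &Run, candidate: &Run) -> bool {
    candidate.tests_pass
        && candidate.latency_ms < baseline.latency_ms
        && candidate.peak_rss_kb <= baseline.peak_rss_kb
}

fn main() {
    let baseline = Run { latency_ms: 409.2, peak_rss_kb: 120_000, tests_pass: true };
    let faster = Run { latency_ms: 352.1, peak_rss_kb: 118_000, tests_pass: true };
    let broken = Run { latency_ms: 300.0, peak_rss_kb: 100_000, tests_pass: false };
    assert!(should_keep(&baseline, &faster));
    // A faster but incorrect candidate is always reverted.
    assert!(!should_keep(&baseline, &broken));
    println!("gate works as expected");
}
```

The point of making this a single pure function is that the decision runs identically every iteration, which is what keeps the loop objective.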

Mapping the Architecture to the Decompiler Project

For the decompiler, I used the same loop but swapped in system-level targets:

  • Performance targets: class and jar decompilation latency.
  • Memory targets: RSS/HWM behavior under realistic workloads.
  • Correctness guardrails: API tests, decompiler tests, parity checks.
  • Output contract: stable API behavior and artifact format.

The implementation already has natural seams for this:

  • crates/core: parsing, decompilation, emit logic.
  • crates/api: HTTP contract, job orchestration, artifact generation.
  • crates/cli and crates/ffi: integration surfaces.

flowchart TD
    C1[Client: class/jar upload] --> A1[API routes]
    A1 --> A2[Auth + input normalization]
    A2 --> A3{Class sync path or Jar async job}
    A3 -->|Class| C2[Parse classfile + emit source]
    A3 -->|Jar| J1[Create job + persist upload]
    J1 --> J2[Worker decompile loop]
    J2 --> K1[Core parser/decompiler/emitter]
    K1 --> Z1[Artifact ZIP + SUMMARY.md]
    Z1 --> R1[Status + download endpoints]

This is exactly what makes the architecture reusable: once a system has deterministic entry points and measurable outputs, it can be optimized by the same research loop regardless of domain.

How I Used It for Decompiler Logic Improvements

The practical cycle looked like this:

  1. Establish baseline with fixed corpus and repeat count.
  2. Profile CPU and memory hotspots (perf, heaptrack, runtime summaries).
  3. Hypothesize a change (for example: zip writer mode, decompile path behavior, branch coverage for edge cases).
  4. Patch and validate with tests and coverage gates.
  5. Re-benchmark and compare against baseline.
  6. Keep only measurable wins.

flowchart LR
    B[Baseline run] --> P[CPU + memory profiling]
    P --> H[Hotspot hypothesis]
    H --> X[Code change]
    X --> T[Tests + coverage]
    T --> R[Benchmark rerun]
    R --> D{Latency/memory better and correctness intact?}
    D -- yes --> K[Keep + document delta]
    D -- no --> N[Drop/iterate]
    K --> B
    N --> H

This gave us concrete, decision-ready metrics instead of anecdotal “feels faster” claims. Example outcomes from the class decompilation track:

  • Mean latency improved versus baseline in repeated top-class comparisons.
  • Memory improved in the same benchmark family, with process-level peak RSS/HWM reductions.
  • Coverage was raised and enforced with cargo llvm-cov --fail-under-lines 90 on the API package, so optimization work did not erode reliability.

Notable Speed Changes With Code Examples

Below are concrete code-level changes that helped performance in the decompiler path.

1) Parallelize method decompilation only when class size justifies it

In crates/core/src/emit/mod.rs, method bodies are decompiled in parallel only for sufficiently large classes. Small classes stay serial to avoid scheduler overhead.

fn should_parallelize_method_decompile(coded_method_count: usize, total_code_bytes: usize) -> bool {
    coded_method_count >= 24 && total_code_bytes >= 12_000 && method_decompile_parallelism() > 1
}

if should_parallelize_method_decompile(coded_methods.len(), total_code_bytes) {
    let results = coded_methods
        .par_iter()
        .map(|(method_index, method)| {
            (*method_index, crate::decompile::decompile_method_v1(class, method, *method_index, decompile_opts))
        })
        .collect::<Vec<_>>();
    // write back results...
}

Why it matters:

  • The threshold gate prevents regressions on tiny classes.
  • For big classes (many coded methods), this unlocks multicore throughput where most time is spent.
  • Thread pool reuse (OnceLock<ThreadPool>) avoids rebuilding pools per class.
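The pool-reuse point deserves a concrete shape. A minimal sketch of the compute-once pattern, using only the standard library: `method_decompile_parallelism` here caches the detected hardware parallelism in a `OnceLock` so repeated calls on the hot path cost a single atomic load (the real project caches a rayon `ThreadPool` the same way; this simplified helper is my illustration, not the actual code).

```rust
use std::sync::OnceLock;
use std::thread;

/// Detect the available parallelism once and reuse the cached value.
/// Illustrative stand-in for the OnceLock<ThreadPool> reuse described above.
fn method_decompile_parallelism() -> usize {
    static PARALLELISM: OnceLock<usize> = OnceLock::new();
    *PARALLELISM.get_or_init(|| {
        thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
    })
}

fn main() {
    let p = method_decompile_parallelism();
    // First call initializes; later calls hit the cached value.
    assert!(p >= 1);
    assert_eq!(p, method_decompile_parallelism());
    println!("parallelism = {p}");
}
```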

2) Replace map-heavy method body storage with indexed slots

The emitter path uses vector-indexed storage for method bodies and moves values out with take(), reducing lookup and clone overhead.

let mut method_bodies: Vec<Option<crate::decompile::MethodBody>> = vec![None; class.methods.len()];
// fill method_bodies[method_index] = Some(body)

let body = method_bodies
    .get_mut(method_index)
    .and_then(|slot| slot.take());

Why it matters:

  • Indexing by method index is O(1) and cache-friendly.
  • Option::take() avoids repeated cloning of large body structures.
  • This directly improves the hot emit path for classes with many methods.

3) Add no-allocation fast paths in identifier rewriting

String-rewrite utilities now bail out immediately when there is nothing to replace, instead of always allocating an output string.

fn replace_identifier_all_if_needed(source: &str, from: &str, to: &str) -> Option<String> {
    if from.is_empty() || from == to { return None; }
    if !source.contains(from) { return None; }
    // rewrite only if needed...
    Some(out)
}

Why it matters:

  • Large decompiled method bodies often do not need renaming rewrites.
  • Avoiding unnecessary allocations cuts CPU and memory churn in post-processing.
  • This is a classic hot-path optimization with low risk and high frequency payoff.
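For readers who want the elided rewrite step spelled out, here is one self-contained way the slow path could look. This is a hedged sketch, not the project's implementation: it replaces only whole identifiers (occurrences not adjacent to identifier characters) and keeps the same allocation-free fast path as the original snippet.

```rust
/// Characters that can appear in a Java identifier (simplified to ASCII).
fn is_ident_char(c: u8) -> bool {
    c == b'_' || c == b'$' || c.is_ascii_alphanumeric()
}

/// Replace whole-identifier occurrences of `from` with `to`.
/// Returns None (no allocation kept) when nothing actually changed.
fn replace_identifier_all_if_needed(source: &str, from: &str, to: &str) -> Option<String> {
    if from.is_empty() || from == to || !source.contains(from) {
        return None;
    }
    let bytes = source.as_bytes();
    let mut out = String::with_capacity(source.len());
    let mut i = 0;
    let mut changed = false;
    while let Some(pos) = source[i..].find(from) {
        let start = i + pos;
        let end = start + from.len();
        // Only rewrite if the match is not embedded in a longer identifier.
        let left_ok = start == 0 || !is_ident_char(bytes[start - 1]);
        let right_ok = end == bytes.len() || !is_ident_char(bytes[end]);
        out.push_str(&source[i..start]);
        if left_ok && right_ok {
            out.push_str(to);
            changed = true;
        } else {
            out.push_str(from);
        }
        i = end;
    }
    out.push_str(&source[i..]);
    changed.then_some(out)
}

fn main() {
    assert_eq!(
        replace_identifier_all_if_needed("foo(bar)", "bar", "baz").as_deref(),
        Some("foo(baz)")
    );
    // "bar" inside "foobar" is not a whole identifier, so nothing changes.
    assert_eq!(replace_identifier_all_if_needed("foobar", "bar", "baz"), None);
    println!("identifier rewrite sketch ok");
}
```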

4) Optimize artifact ZIP write path for throughput

For output packaging, I moved to low-cost compression by default and made “stored” mode configurable for memory-sensitive runs.

let file_options = if use_stored_artifact_entries() {
    SimpleFileOptions::default().compression_method(CompressionMethod::Stored)
} else {
    SimpleFileOptions::default()
        .compression_method(CompressionMethod::Deflated)
        .compression_level(Some(1))
};

Why it matters:

  • Deflate level 1 significantly reduces compression CPU cost versus higher levels.
  • Stored mode can further reduce transient memory pressure when compression is not needed.
  • This affects jar decompile end-to-end latency because artifact creation is on the critical path.
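The snippet above leaves `use_stored_artifact_entries` undefined; a plausible shape is an environment-driven toggle. Everything in this sketch is an assumption on my part, including the variable name `IPND_STORED_ARTIFACTS`, which is invented for illustration; the parsing is split into a pure helper so the logic is testable without touching process state.

```rust
/// Pure check: does a raw env value opt into stored (uncompressed) entries?
fn stored_entries_enabled(raw: Option<&str>) -> bool {
    matches!(raw, Some(v) if v == "1" || v.eq_ignore_ascii_case("true"))
}

/// Hypothetical toggle, reading an invented IPND_STORED_ARTIFACTS variable.
fn use_stored_artifact_entries() -> bool {
    stored_entries_enabled(std::env::var("IPND_STORED_ARTIFACTS").ok().as_deref())
}

fn main() {
    println!("stored entries enabled: {}", use_stored_artifact_entries());
}
```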

What was measured

On our class benchmark track (largest-class/top-N comparisons), current vs baseline showed sustained improvements:

  • Top-10 class retest: mean latency delta -18.44% and weighted delta -24.67%.
  • Randomized top-10 sample from top-50 classes (two-run aggregate): mean delta about -16%, weighted delta about -27%.
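For clarity on how the two delta flavors above differ, here is a small sketch under my reading of the terms: "mean delta" averages each class's relative change, while "weighted delta" weights classes by their baseline time (equivalently, compares total time). These definitions are assumptions; the article does not spell them out.

```rust
/// Average of per-class relative changes, as a percentage.
fn mean_delta_pct(baseline: &[f64], current: &[f64]) -> f64 {
    let sum: f64 = baseline.iter().zip(current).map(|(b, c)| (c - b) / b).sum();
    sum / baseline.len() as f64 * 100.0
}

/// Relative change in total time, as a percentage. Slow classes dominate.
fn weighted_delta_pct(baseline: &[f64], current: &[f64]) -> f64 {
    let b: f64 = baseline.iter().sum();
    let c: f64 = current.iter().sum();
    (c - b) / b * 100.0
}

fn main() {
    // Toy numbers: a small class improves 10%, a big class improves 25%.
    let baseline = [100.0, 200.0];
    let current = [90.0, 150.0];
    // Mean delta: (-10% + -25%) / 2 = -17.5%
    assert!((mean_delta_pct(&baseline, &current) + 17.5).abs() < 1e-9);
    // Weighted delta: (240 - 300) / 300 = -20%
    assert!((weighted_delta_pct(&baseline, &current) + 20.0).abs() < 1e-9);
    println!("delta definitions check out");
}
```

This is why the weighted delta in the results is larger in magnitude than the mean delta: the biggest wins landed on the most expensive classes.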

These changes were only kept when they held against baseline under the same harness and passed the regression tests.

Current Full-Jar Decompiler vs CFR Numbers (Current Checkout)

To compare full jar decompilation (not per-class microbenchmarks), I used the same input jar for both tools:

  • Jar: external_jars/commons-lang3-3.14.0.jar (404 classes)
  • Passes: 3
  • Host CPU: Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz
  • Java: openjdk version "21.0.9" 2025-10-21

Commands used:

# IPND full-jar decompile (API worker path)
IPND_PERF_PASSES=3 IPND_PERF_BUDGET_MS=50 \
cargo test -p ipdn perf_run_decompile_job_common_jar -- --ignored --nocapture

# CFR full-jar decompile (whole-jar invocation, 3 passes)
java -jar dist/tools/cfr-0.152.jar external_jars/commons-lang3-3.14.0.jar --outputdir <tmp> --silent true

Aggregate latency results (full jar)

Slice                       | IPND mean (ms) | CFR mean (ms) | CFR/IPND ratio
----------------------------|----------------|---------------|---------------
Overall (all passes)        | 369.440        | 5998.479      | 16.237x
Cold pass only (pass 1)     | 409.150        | 6397.316      | 15.636x
Warm passes only (pass 2-3) | 349.585        | 5799.060      | 16.588x

Supporting percentiles from the same run set:

  • Overall p50: IPND 352.106 ms, CFR 5913.400 ms
  • Overall p95: IPND 409.150 ms, CFR 6397.316 ms

Output artifact context:

  • IPND artifact zip size: 416735 bytes
  • CFR extracted source size: 1173989 bytes across 248 files

Interpretation:

  • On this full-jar workload, current IPND is about 16x faster than CFR on the same machine.
  • Unlike class microbenchmarks, this comparison includes full pipeline cost (jar indexing, decompile loop, output materialization).
  • The per-class CFR benchmark remains useful for instruction-level tuning, but jar-level numbers are the right KPI for end-user jar decompilation throughput.

Why This Architecture Scales Beyond ML and Decompilers

The pattern works anywhere you can define:

  • a controllable change surface,
  • a repeatable run harness,
  • and a strict scoring function.

That includes compilers, API backends, data pipelines, search ranking services, and frontend rendering performance.

The transferable blueprint is:

  1. Define objective as a metric, not a story.
  2. Lock evaluation protocol.
  3. Automate measurement and diffing.
  4. Require objective keep/revert decisions.
  5. Track baseline drift explicitly.

If you do just these five things, “autonomous research” stops being an ML novelty and becomes a general engineering operating model.

Under-the-Hood Components That Matter Most

A lot of teams underestimate this part. The architecture only works when each component is explicit and stable:

  • Objective layer: a single optimization target plus guardrail targets.
  • Mutation layer: where code changes are allowed (for ML here it was primarily train.py; for decompiler work I intentionally touched scoped Rust modules).
  • Execution layer: deterministic commands for tests, perf runs, and memory runs.
  • Evaluation layer: scripts and logs that produce machine-readable deltas (summary|..., memory|...).
  • Decision layer: keep/drop rules that run the same way every iteration.
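To make the evaluation layer concrete: machine-readable output lines in a `kind|key=value|...` shape are trivial to parse and diff. The field names in this sketch (`mean_ms`, `p95_ms`) are my assumption; the article only shows the `summary|...` and `memory|...` prefixes.

```rust
/// Parse one pipe-delimited metric line such as
/// "summary|mean_ms=352.106|p95_ms=409.150" into (kind, key/value pairs).
fn parse_metric_line(line: &str) -> Option<(String, Vec<(String, f64)>)> {
    let mut parts = line.split('|');
    let kind = parts.next()?.to_string();
    let mut fields = Vec::new();
    for part in parts {
        let (key, value) = part.split_once('=')?;
        fields.push((key.to_string(), value.parse().ok()?));
    }
    Some((kind, fields))
}

fn main() {
    let line = "summary|mean_ms=352.106|p95_ms=409.150";
    let (kind, fields) = parse_metric_line(line).expect("well-formed metric line");
    assert_eq!(kind, "summary");
    assert_eq!(fields.len(), 2);
    println!("{kind}: {fields:?}");
}
```

Once both baseline and current runs emit this format, the decision layer reduces to parsing two files and applying the keep/drop rule to the numeric deltas.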

Without this separation, optimization efforts drift into ad-hoc debugging. With it, every iteration contributes to a cumulative research trajectory.

A Practical Template for Other Projects

If you want to adapt this architecture to a new project, start with a minimal contract:

  1. Pick one benchmark corpus that matches production pressure.
  2. Record one immutable baseline run.
  3. Define pass/fail thresholds for regressions.
  4. Automate one command that prints current vs baseline.
  5. Gate merges on those numbers.

Once that is in place, you can scale out to multi-objective optimization (speed, memory, reliability, quality) without losing control of experiment integrity.

Closing

What started as an ML experiment loop is really a system for disciplined optimization under uncertainty. In our decompiler work, that architecture let us improve speed and memory without sacrificing correctness or API stability. The key was not domain-specific tricks; it was the loop design itself: baseline, profile, patch, verify, compare, repeat.