Using the Autoresearch Project to Build the Fastest Java Decompiler


Most people first see Andrej's Autoresearch project as an ML autotuning setup: an agent edits one file (train.py), runs short experiments, and keeps only measurable improvements. Under the hood, though, the real value is not "LLM training." The real value is the architecture: a closed-loop research system with explicit goals, constrained change scope, objective evaluation, and hard keep/revert rules.

That pattern transfers cleanly to systems work, including decompiler optimization, which is the pivotal part of Jar.Tools. I called my decompilation engine IPND, and I wanted it to be the fastest way to decompile a Java class into readable Java source code.

The Core Architecture (Domain-Agnostic)

At a high level, this project separates policy from execution:

  • Policy lives in program.md (what to optimize, what constraints matter, what constitutes a win).
  • Execution lives in code and tooling (train.py, test harnesses, benchmarks, profiling tools).
  • Decisions are recorded as structured experiment outcomes (baseline vs current).

flowchart LR
    A[Human defines objective + constraints] --> B[Program spec / playbook]
    B --> C[Agent proposes code change]
    C --> D[Run harness]
    D --> E[Collect metrics]
    E --> F{Beats baseline?}
    F -- yes --> G[Keep change]
    F -- no --> H[Discard/Revert]
    G --> I[Update baseline + log]
    H --> I
    I --> C

Three design choices make this robust across domains:

  1. Fixed evaluation protocol: same benchmark shape each iteration, so comparisons stay valid.
  2. Explicit baseline: every candidate is judged relative to a known reference, not gut feeling.
  3. Tight loop latency: faster iteration means more hypotheses tested per hour.

In ML, the metric is validation bits-per-byte. In decompiler work, the metric can be latency, memory, correctness parity, or all three.
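The keep/revert gate at the heart of that loop is small enough to sketch directly. The `Run` struct and `should_keep` function below are illustrative names of mine, not the project's actual code; the rule they encode is the one the article describes: a candidate is kept only when the tracked metric improves and every guardrail still holds.

```rust
/// One benchmark run's results (hypothetical shape for illustration).
#[derive(Debug)]
struct Run {
    latency_ms: f64,   // optimization target
    peak_rss_kb: u64,  // memory guardrail
    tests_pass: bool,  // correctness guardrail
}

/// Keep a candidate only if it beats the baseline on the target metric
/// without regressing any guardrail.
fn should_keep(baseline: &Run, candidate: &Run) -> bool {
    candidate.tests_pass
        && candidate.latency_ms < baseline.latency_ms
        && candidate.peak_rss_kb <= baseline.peak_rss_kb
}

fn main() {
    let baseline = Run { latency_ms: 409.2, peak_rss_kb: 120_000, tests_pass: true };
    let faster = Run { latency_ms: 352.1, peak_rss_kb: 118_000, tests_pass: true };
    let broken = Run { latency_ms: 300.0, peak_rss_kb: 100_000, tests_pass: false };
    assert!(should_keep(&baseline, &faster));
    // A faster but incorrect candidate is always reverted.
    assert!(!should_keep(&baseline, &broken));
    println!("gate works as expected");
}
```

The point of making this a single pure function is that the decision runs identically every iteration, which is what keeps the loop objective.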

Mapping the Architecture to the Decompiler Project

For the decompiler, I used the same loop but swapped in system-level targets:

  • Performance targets: class and jar decompilation latency.
  • Memory targets: RSS/HWM behavior under realistic workloads.
  • Correctness guardrails: API tests, decompiler tests, parity checks.
  • Output contract: stable API behavior and artifact format.

The implementation already has natural seams for this:

  • crates/core: parsing, decompilation, emit logic.
  • crates/api: HTTP contract, job orchestration, artifact generation.
  • crates/cli and crates/ffi: integration surfaces.

flowchart TD
    C1[Client: class/jar upload] --> A1[API routes]
    A1 --> A2[Auth + input normalization]
    A2 --> A3{Class sync path or Jar async job}
    A3 -->|Class| C2[Parse classfile + emit source]
    A3 -->|Jar| J1[Create job + persist upload]
    J1 --> J2[Worker decompile loop]
    J2 --> K1[Core parser/decompiler/emitter]
    K1 --> Z1[Artifact ZIP + SUMMARY.md]
    Z1 --> R1[Status + download endpoints]

This is exactly what makes the architecture reusable: once a system has deterministic entry points and measurable outputs, it can be optimized by the same research loop regardless of domain.

How I Used It for Decompiler Logic Improvements

The practical cycle looked like this:

  1. Establish baseline with fixed corpus and repeat count.
  2. Profile CPU and memory hotspots (perf, heaptrack, runtime summaries).
  3. Hypothesize a change (for example: zip writer mode, decompile path behavior, branch coverage for edge cases).
  4. Patch and validate with tests and coverage gates.
  5. Re-benchmark and compare against baseline.
  6. Keep only measurable wins.

flowchart LR
    B[Baseline run] --> P[CPU + memory profiling]
    P --> H[Hotspot hypothesis]
    H --> X[Code change]
    X --> T[Tests + coverage]
    T --> R[Benchmark rerun]
    R --> D{Latency/memory better and correctness intact?}
    D -- yes --> K[Keep + document delta]
    D -- no --> N[Drop/iterate]
    K --> B
    N --> H

This gave us concrete, decision-ready metrics instead of anecdotal “feels faster” claims. Example outcomes from the class decompilation track:

  • Mean latency improved versus baseline in repeated top-class comparisons.
  • Memory improved in the same benchmark family, with process-level peak RSS/HWM reductions.
  • Coverage was raised and enforced with cargo llvm-cov --fail-under-lines 90 on the API package, so optimization work did not erode reliability.

Notable Speed Changes With Code Examples

Below are concrete code-level changes that helped performance in the decompiler path.

1) Parallelize method decompilation only when class size justifies it

In crates/core/src/emit/mod.rs, method bodies are decompiled in parallel only for sufficiently large classes. Small classes stay serial to avoid scheduler overhead.

fn should_parallelize_method_decompile(coded_method_count: usize, total_code_bytes: usize) -> bool {
    coded_method_count >= 24 && total_code_bytes >= 12_000 && method_decompile_parallelism() > 1
}

if should_parallelize_method_decompile(coded_methods.len(), total_code_bytes) {
    let results = coded_methods
        .par_iter()
        .map(|(method_index, method)| {
            (*method_index, crate::decompile::decompile_method_v1(class, method, *method_index, decompile_opts))
        })
        .collect::<Vec<_>>();
    // write back results...
}

Why it matters:

  • The threshold gate prevents regressions on tiny classes.
  • For big classes (many coded methods), this unlocks multicore throughput where most time is spent.
  • Thread pool reuse (OnceLock<ThreadPool>) avoids rebuilding pools per class.
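The pool-reuse point deserves a concrete shape. A minimal sketch of the compute-once pattern, using only the standard library: `method_decompile_parallelism` here caches the detected hardware parallelism in a `OnceLock` so repeated calls on the hot path cost a single atomic load (the real project caches a rayon `ThreadPool` the same way; this simplified helper is my illustration, not the actual code).

```rust
use std::sync::OnceLock;
use std::thread;

/// Detect the available parallelism once and reuse the cached value.
/// Illustrative stand-in for the OnceLock<ThreadPool> reuse described above.
fn method_decompile_parallelism() -> usize {
    static PARALLELISM: OnceLock<usize> = OnceLock::new();
    *PARALLELISM.get_or_init(|| {
        thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
    })
}

fn main() {
    let p = method_decompile_parallelism();
    // First call initializes; later calls hit the cached value.
    assert!(p >= 1);
    assert_eq!(p, method_decompile_parallelism());
    println!("parallelism = {p}");
}
```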

2) Replace map-heavy method body storage with indexed slots

The emitter path uses vector-indexed storage for method bodies and moves values out with take(), reducing lookup and clone overhead.

let mut method_bodies: Vec<Option<crate::decompile::MethodBody>> = vec![None; class.methods.len()];
// fill method_bodies[method_index] = Some(body)

let body = method_bodies
    .get_mut(method_index)
    .and_then(|slot| slot.take());

Why it matters:

  • Indexing by method index is O(1) and cache-friendly.
  • Option::take() avoids repeated cloning of large body structures.
  • This directly improves the hot emit path for classes with many methods.

3) Add no-allocation fast paths in identifier rewriting

String-rewrite utilities now bail out immediately when there is nothing to replace, instead of always allocating an output string.

fn replace_identifier_all_if_needed(source: &str, from: &str, to: &str) -> Option<String> {
    if from.is_empty() || from == to { return None; }
    if !source.contains(from) { return None; }
    // rewrite only if needed...
    Some(out)
}

Why it matters:

  • Large decompiled method bodies often do not need renaming rewrites.
  • Avoiding unnecessary allocations cuts CPU and memory churn in post-processing.
  • This is a classic hot-path optimization with low risk and high frequency payoff.
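For readers who want the elided rewrite step spelled out, here is one self-contained way the slow path could look. This is a hedged sketch, not the project's implementation: it replaces only whole identifiers (occurrences not adjacent to identifier characters) and keeps the same allocation-free fast path as the original snippet.

```rust
/// Characters that can appear in a Java identifier (simplified to ASCII).
fn is_ident_char(c: u8) -> bool {
    c == b'_' || c == b'$' || c.is_ascii_alphanumeric()
}

/// Replace whole-identifier occurrences of `from` with `to`.
/// Returns None (no allocation kept) when nothing actually changed.
fn replace_identifier_all_if_needed(source: &str, from: &str, to: &str) -> Option<String> {
    if from.is_empty() || from == to || !source.contains(from) {
        return None;
    }
    let bytes = source.as_bytes();
    let mut out = String::with_capacity(source.len());
    let mut i = 0;
    let mut changed = false;
    while let Some(pos) = source[i..].find(from) {
        let start = i + pos;
        let end = start + from.len();
        // Only rewrite if the match is not embedded in a longer identifier.
        let left_ok = start == 0 || !is_ident_char(bytes[start - 1]);
        let right_ok = end == bytes.len() || !is_ident_char(bytes[end]);
        out.push_str(&source[i..start]);
        if left_ok && right_ok {
            out.push_str(to);
            changed = true;
        } else {
            out.push_str(from);
        }
        i = end;
    }
    out.push_str(&source[i..]);
    changed.then_some(out)
}

fn main() {
    assert_eq!(
        replace_identifier_all_if_needed("foo(bar)", "bar", "baz").as_deref(),
        Some("foo(baz)")
    );
    // "bar" inside "foobar" is not a whole identifier, so nothing changes.
    assert_eq!(replace_identifier_all_if_needed("foobar", "bar", "baz"), None);
    println!("identifier rewrite sketch ok");
}
```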

4) Optimize artifact ZIP write path for throughput

For output packaging, I moved to low-cost compression by default and made “stored” mode configurable for memory-sensitive runs.

let file_options = if use_stored_artifact_entries() {
    SimpleFileOptions::default().compression_method(CompressionMethod::Stored)
} else {
    SimpleFileOptions::default()
        .compression_method(CompressionMethod::Deflated)
        .compression_level(Some(1))
};

Why it matters:

  • Deflate level 1 significantly reduces compression CPU cost versus higher levels.
  • Stored mode can further reduce transient memory pressure when compression is not needed.
  • This affects jar decompile end-to-end latency because artifact creation is on the critical path.
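The snippet above leaves `use_stored_artifact_entries` undefined; a plausible shape is an environment-driven toggle. Everything in this sketch is an assumption on my part, including the variable name `IPND_STORED_ARTIFACTS`, which is invented for illustration; the parsing is split into a pure helper so the logic is testable without touching process state.

```rust
/// Pure check: does a raw env value opt into stored (uncompressed) entries?
fn stored_entries_enabled(raw: Option<&str>) -> bool {
    matches!(raw, Some(v) if v == "1" || v.eq_ignore_ascii_case("true"))
}

/// Hypothetical toggle, reading an invented IPND_STORED_ARTIFACTS variable.
fn use_stored_artifact_entries() -> bool {
    stored_entries_enabled(std::env::var("IPND_STORED_ARTIFACTS").ok().as_deref())
}

fn main() {
    println!("stored entries enabled: {}", use_stored_artifact_entries());
}
```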

What was measured

On our class benchmark track (largest-class/top-N comparisons), current vs baseline showed sustained improvements:

  • Top-10 class retest: mean latency delta -18.44% and weighted delta -24.67%.
  • Randomized top-10 sample from top-50 classes (two-run aggregate): mean delta about -16%, weighted delta about -27%.
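For clarity on how the two delta flavors above differ, here is a small sketch under my reading of the terms: "mean delta" averages each class's relative change, while "weighted delta" weights classes by their baseline time (equivalently, compares total time). These definitions are assumptions; the article does not spell them out.

```rust
/// Average of per-class relative changes, as a percentage.
fn mean_delta_pct(baseline: &[f64], current: &[f64]) -> f64 {
    let sum: f64 = baseline.iter().zip(current).map(|(b, c)| (c - b) / b).sum();
    sum / baseline.len() as f64 * 100.0
}

/// Relative change in total time, as a percentage. Slow classes dominate.
fn weighted_delta_pct(baseline: &[f64], current: &[f64]) -> f64 {
    let b: f64 = baseline.iter().sum();
    let c: f64 = current.iter().sum();
    (c - b) / b * 100.0
}

fn main() {
    // Toy numbers: a small class improves 10%, a big class improves 25%.
    let baseline = [100.0, 200.0];
    let current = [90.0, 150.0];
    // Mean delta: (-10% + -25%) / 2 = -17.5%
    assert!((mean_delta_pct(&baseline, &current) + 17.5).abs() < 1e-9);
    // Weighted delta: (240 - 300) / 300 = -20%
    assert!((weighted_delta_pct(&baseline, &current) + 20.0).abs() < 1e-9);
    println!("delta definitions check out");
}
```

This is why the weighted delta in the results is larger in magnitude than the mean delta: the biggest wins landed on the most expensive classes.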

These changes were only kept when they held against baseline under the same harness and passed the regression tests.

Current Full-Jar Decompiler vs CFR Numbers (Current Checkout)

To compare full jar decompilation (not per-class microbenchmarks), I used the same input jar for both tools:

  • Jar: external_jars/commons-lang3-3.14.0.jar (404 classes)
  • Passes: 3
  • Host CPU: Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz
  • Java: openjdk version "21.0.9" 2025-10-21

Commands used:

# IPND full-jar decompile (API worker path)
IPND_PERF_PASSES=3 IPND_PERF_BUDGET_MS=50 \
cargo test -p ipdn perf_run_decompile_job_common_jar -- --ignored --nocapture

# CFR full-jar decompile (whole-jar invocation, 3 passes)
java -jar dist/tools/cfr-0.152.jar external_jars/commons-lang3-3.14.0.jar --outputdir <tmp> --silent true

Aggregate latency results (full jar)

Slice                       | IPND mean (ms) | CFR mean (ms) | CFR/IPND ratio
----------------------------|----------------|---------------|---------------
Overall (all passes)        | 369.440        | 5998.479      | 16.237x
Cold pass only (pass 1)     | 409.150        | 6397.316      | 15.636x
Warm passes only (pass 2-3) | 349.585        | 5799.060      | 16.588x

Supporting percentiles from the same run set:

  • Overall p50: IPND 352.106 ms, CFR 5913.400 ms
  • Overall p95: IPND 409.150 ms, CFR 6397.316 ms

Output artifact context:

  • IPND artifact zip size: 416735 bytes
  • CFR extracted source size: 1173989 bytes across 248 files

Interpretation:

  • On this full-jar workload, current IPND is about 16x faster than CFR on the same machine.
  • Unlike class microbenchmarks, this comparison includes full pipeline cost (jar indexing, decompile loop, output materialization).
  • The per-class CFR benchmark remains useful for instruction-level tuning, but jar-level numbers are the right KPI for end-user jar decompilation throughput.

Why This Architecture Scales Beyond ML and Decompilers

The pattern works anywhere you can define:

  • a controllable change surface,
  • a repeatable run harness,
  • and a strict scoring function.

That includes compilers, API backends, data pipelines, search ranking services, and frontend rendering performance.

The transferable blueprint is:

  1. Define objective as a metric, not a story.
  2. Lock evaluation protocol.
  3. Automate measurement and diffing.
  4. Require objective keep/revert decisions.
  5. Track baseline drift explicitly.

If you do just these five things, “autonomous research” stops being an ML novelty and becomes a general engineering operating model.

Under-the-Hood Components That Matter Most

A lot of teams underestimate this part. The architecture only works when each component is explicit and stable:

  • Objective layer: a single optimization target plus guardrail targets.
  • Mutation layer: where code changes are allowed (for ML here it was primarily train.py; for decompiler work I intentionally touched scoped Rust modules).
  • Execution layer: deterministic commands for tests, perf runs, and memory runs.
  • Evaluation layer: scripts and logs that produce machine-readable deltas (summary|..., memory|...).
  • Decision layer: keep/drop rules that run the same way every iteration.
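To make the evaluation layer concrete: machine-readable output lines in a `kind|key=value|...` shape are trivial to parse and diff. The field names in this sketch (`mean_ms`, `p95_ms`) are my assumption; the article only shows the `summary|...` and `memory|...` prefixes.

```rust
/// Parse one pipe-delimited metric line such as
/// "summary|mean_ms=352.106|p95_ms=409.150" into (kind, key/value pairs).
fn parse_metric_line(line: &str) -> Option<(String, Vec<(String, f64)>)> {
    let mut parts = line.split('|');
    let kind = parts.next()?.to_string();
    let mut fields = Vec::new();
    for part in parts {
        let (key, value) = part.split_once('=')?;
        fields.push((key.to_string(), value.parse().ok()?));
    }
    Some((kind, fields))
}

fn main() {
    let line = "summary|mean_ms=352.106|p95_ms=409.150";
    let (kind, fields) = parse_metric_line(line).expect("well-formed metric line");
    assert_eq!(kind, "summary");
    assert_eq!(fields.len(), 2);
    println!("{kind}: {fields:?}");
}
```

Once both baseline and current runs emit this format, the decision layer reduces to parsing two files and applying the keep/drop rule to the numeric deltas.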

Without this separation, optimization efforts drift into ad-hoc debugging. With it, every iteration contributes to a cumulative research trajectory.

A Practical Template for Other Projects

If you want to adapt this architecture to a new project, start with a minimal contract:

  1. Pick one benchmark corpus that matches production pressure.
  2. Record one immutable baseline run.
  3. Define pass/fail thresholds for regressions.
  4. Automate one command that prints current vs baseline.
  5. Gate merges on those numbers.

Once that is in place, you can scale out to multi-objective optimization (speed, memory, reliability, quality) without losing control of experiment integrity.

Closing

What started as an ML experiment loop is really a system for disciplined optimization under uncertainty. In our decompiler work, that architecture let us improve speed and memory without sacrificing correctness or API stability. The key was not domain-specific tricks; it was the loop design itself: baseline, profile, patch, verify, compare, repeat.