Now I am spending more time with Python, and one thing I missed was the microbenchmarking tool I had in Ruby: benchmark-ips. So I decided to ask Codex to port it from Ruby to Python, keeping the same user experience. I focused on giving Codex the right map: the repo structure, the Ruby behavior to preserve, and the expectations for the CLI and reports. The result feels like the same instrument in a new language, and it's surprisingly faithful.
Ruby → Python: The Same API
Ruby original:
require 'benchmark/ips'

world = "world"

Benchmark.ips do |x|
  x.config(time: 5, warmup: 2)
  x.report("string concat") { "hello" + "world" }
  x.report("interpolation") { "hello#{world}" }
  x.compare!
end
Python port:
from benchmark_ips import benchmark

world = "world"

bm = benchmark.Benchmark()
bm.config(time=5, warmup=2)
bm.report("string concat", lambda: "hello" + "world")
bm.report("interpolation", lambda: f"hello{world}")
bm.compare()
Output (identical format):
Warming up --------------------------------------
string concat 200.000k i/100ms
interpolation 180.000k i/100ms
Calculating -------------------------------------
string concat 2.500M (± 2.0%) i/s
interpolation 2.200M (± 1.5%) i/s
Comparison:
string concat: 2500000.0 i/s
interpolation: 2200000.0 i/s - 1.14x slower
The Ruby version's flow—warmup, timed iterations, and steady-state measurement—was treated as a contract. In Python, that contract lives in job.py, job_entry.py, and timing.py. High-resolution timing comes from time.perf_counter(), a monotonic clock, and the loop structure mirrors Ruby's cadence so the numbers read the same way. I also kept the API surface aligned with the DSL shape: configuration hooks, reporting callbacks, and comparison mode behave in the same spirit, even if the syntax is more explicit.
Timing Engine: Ruby vs Python
Ruby:
# lib/benchmark/ips/timing.rb
def measure
  before = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  after = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  after - before
end
Python:
# timing.py
import time

def measure(func):
    before = time.perf_counter()
    func()
    after = time.perf_counter()
    return after - before
Both use monotonic clocks—no wall-clock drift, no NTP adjustments.
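To see why that matters, here is a tiny standalone snippet (not part of the port): time.time() follows the wall clock, which the OS or NTP can adjust mid-measurement, while time.perf_counter() only ever moves forward.

import time

wall_start = time.time()          # wall clock: can jump if the system time is adjusted
mono_start = time.perf_counter()  # monotonic: unaffected by clock adjustments

time.sleep(0.1)

print(f"wall clock elapsed: {time.time() - wall_start:.4f}s")
print(f"monotonic elapsed:  {time.perf_counter() - mono_start:.4f}s")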
Job Execution: Warmup → Benchmark
Ruby:
# lib/benchmark/ips/job.rb
class Job
  def run_warmup
    cycles = 100_000
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    while Process.clock_gettime(Process::CLOCK_MONOTONIC) - start < @warmup
      cycles.times { @block.call }
    end
  end

  def run_benchmark
    iterations = 0
    elapsed = 0.0
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    while elapsed < @time
      @block.call
      iterations += 1
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    end
    @ips = iterations / elapsed
  end
end
Python:
# job.py
import time

class Job:
    def run_warmup(self):
        cycles = 100_000
        start = time.perf_counter()
        while time.perf_counter() - start < self.warmup:
            for _ in range(cycles):
                self.block()

    def run_benchmark(self):
        iterations = 0
        start = time.perf_counter()
        while time.perf_counter() - start < self.time:
            self.block()
            iterations += 1
        elapsed = time.perf_counter() - start
        self.ips = iterations / elapsed
Keeping the CLI running was the core priority. I preserved the CLI‑style entry points and made sure the "quiet" mode and JSON output were first‑class. That way the tool can be dropped into CI and scripted runs without noisy output. The tests in tests/test_benchmark_ips.py intentionally exercise CLI‑like usage, output formatting, and JSON writing so I could keep the interface stable while the internals shifted.
CLI: Scripted Runs
Ruby:
$ benchmark-ips --warmup 2 --time 5 --quiet my_benchmark.rb
Python:
$ python -m benchmark_ips --warmup 2 --time 5 --quiet my_benchmark.py
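The tests in tests/test_benchmark_ips.py exercise this kind of invocation. As an illustration only (not the project's actual test code, and the benchmark file content is a placeholder), a quiet-mode smoke test could look like this:

# Illustrative CLI smoke test -- a sketch, not the repo's actual tests.
import subprocess
import sys
import textwrap

def test_cli_quiet_run(tmp_path):
    bench_file = tmp_path / "bench.py"   # placeholder benchmark script
    bench_file.write_text(textwrap.dedent("""\
        from benchmark_ips import benchmark

        bm = benchmark.Benchmark()
        bm.report("noop", lambda: None)
        bm.run()
    """))
    result = subprocess.run(
        [sys.executable, "-m", "benchmark_ips",
         "--warmup", "1", "--time", "1", "--quiet", str(bench_file)],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0  # a quiet run should complete without error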
JSON Output for CI
Ruby:
Benchmark.ips do |x|
  x.config(format: :json)
  x.report("task") { expensive_operation }
end
Python:
bm = benchmark.Benchmark()
bm.config(format='json')
bm.report("task", expensive_operation)
bm.run()
Output (identical structure):
{ "results": [ { "label": "task", "ips": 12500.5, "stddev": 250.3, "error_percentage": 2.0 } ] }
Verification was a layered check. First, I ran the existing Python tests to validate output and JSON handling. Next, I compared example runs with the Ruby version using the same benchmark snippets, looking for similar ordering and relative performance—no exact numeric match expected, just the same story. I also used short warmups and small time windows to keep the suite fast while still showing stable behavior. Finally, I added spot checks for reporting deltas and standard deviation formatting to catch subtle regressions.
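A fast sanity run with small windows looks something like this (a sketch built from the API shown earlier; the labels and workloads are placeholders):

from benchmark_ips import benchmark

# Small windows: enough to confirm warmup, measurement, and comparison all work,
# not enough for precise numbers.
bm = benchmark.Benchmark()
bm.config(warmup=1, time=1)
bm.report("dict lookup", lambda: {"a": 1}["a"])
bm.report("tuple index", lambda: (1, 2, 3)[1])
bm.compare()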
Real Benchmark: Data Structures
Ruby:
require 'benchmark/ips'
require 'set'

data = (1..1000).to_a
data_set = Set.new(data)

Benchmark.ips do |x|
  x.report("Array#include?") { data.include?(500) }
  x.report("Set#include?") { data_set.include?(500) }
  x.compare!
end
Python:
data = list(range(1, 1001))
data_set = set(data)

bm = benchmark.Benchmark()
bm.report("list in", lambda: 500 in data)
bm.report("set in", lambda: 500 in data_set)
bm.compare()
Both show the same story: with the set built once up front, membership checks are dramatically faster than scanning the list.
The end result is not just "ported," it's operational: the CLI keeps running, reports look familiar, and the numbers are trustworthy. Codex did the heavy lifting, but the path was deliberate. It's still benchmark-ips—just speaking Python now.