I ported benchmark-ips from Ruby to Python


Now that I'm spending more time with Python, one thing I missed was the microbenchmarking tool I had in Ruby: benchmark-ips. So I decided to ask Codex to port it from Ruby to Python, keeping the same user experience. I focused on giving Codex the right map: the repo structure, the Ruby behavior to preserve, and the expectations for the CLI and reports. The result feels like the same instrument in a new language, and it's surprisingly faithful.

https://github.com/gogainda/benchmark-ips-python

Ruby → Python: The Same API

Ruby original:

require 'benchmark/ips'

world = "world"

Benchmark.ips do |x|
  x.config(time: 5, warmup: 2)
  
  x.report("string concat") { "hello" + "world" }
  x.report("interpolation") { "hello#{world}" }
  
  x.compare!
end

Python port:

from benchmark_ips import benchmark

world = "world"

bm = benchmark.Benchmark()
bm.config(time=5, warmup=2)

bm.report("string concat", lambda: "hello" + "world")
bm.report("interpolation", lambda: f"hello{world}")

bm.compare()

Output (identical format):

Warming up --------------------------------------
     string concat   200.000k i/100ms
     interpolation   180.000k i/100ms
Calculating -------------------------------------
     string concat     2.500M (± 2.0%) i/s
     interpolation     2.200M (± 1.5%) i/s

Comparison:
     string concat:  2500000.0 i/s
     interpolation:  2200000.0 i/s - 1.14x slower

The Ruby version's flow (warmup, then timed iterations, then steady-state measurement) was treated as a contract. In Python, that contract lives in job.py, job_entry.py, and timing.py. High-resolution timing comes from time.perf_counter(), which is monotonic, and the loop structure mirrors Ruby's cadence so the numbers read the same way. I also kept the API surface aligned with the DSL shape: configuration hooks, reporting callbacks, and comparison mode behave in the same spirit, even if the syntax is more explicit.
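
To make that layout concrete, here is a minimal sketch of the kind of record job_entry.py could hold, pairing a label with its callable and the stats filled in after a run; the field names are illustrative, not the repo's actual code.

# Hypothetical shape of a job entry; real field names may differ.
from dataclasses import dataclass
from typing import Callable

@dataclass
class JobEntry:
    label: str                     # name shown in the report
    block: Callable[[], object]    # the code under test
    ips: float = 0.0               # iterations per second, filled in after the run
    stddev: float = 0.0            # standard deviation, filled in after the run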

Timing Engine: Ruby vs Python

Ruby:

# lib/benchmark/ips/timing.rb
def measure
  before = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  after = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  after - before
end

Python:

# timing.py
import time

def measure(func):
    before = time.perf_counter()
    func()
    after = time.perf_counter()
    return after - before

Both use monotonic clocks—no wall-clock drift, no NTP adjustments.
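
For a sense of how the Python helper is called, here is a small usage sketch; the import path is an assumption about the package layout.

# Time a single call of a small workload with the measure() helper above.
from benchmark_ips.timing import measure  # import path is an assumption

elapsed = measure(lambda: sum(range(10_000)))
print(f"one call took {elapsed * 1e6:.1f} microseconds")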

Job Execution: Warmup → Benchmark

Ruby:

# lib/benchmark/ips/job.rb (simplified)
class Job
  def run_warmup
    cycles = 100_000
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    while Process.clock_gettime(Process::CLOCK_MONOTONIC) - start < @warmup
      cycles.times { @block.call }
    end
  end

  def run_benchmark
    iterations = 0
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    while Process.clock_gettime(Process::CLOCK_MONOTONIC) - start < @time
      @block.call
      iterations += 1
    end
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    @ips = iterations / elapsed
  end
end

Python:

# job.py
import time
class Job:
    def run_warmup(self):
        cycles = 100_000
        start = time.perf_counter()
        while time.perf_counter() - start < self.warmup:
            for _ in range(cycles):
                self.block()
    
    def run_benchmark(self):
        iterations = 0
        start = time.perf_counter()
        while time.perf_counter() - start < self.time:
            self.block()
            iterations += 1
        
        elapsed = time.perf_counter() - start
        self.ips = iterations / elapsed

Keeping the CLI workflow intact was a core priority. I preserved the CLI-style entry points and made sure quiet mode and JSON output were first-class, so the tool can be dropped into CI and scripted runs without noisy output. The tests in tests/test_benchmark_ips.py intentionally exercise CLI-like usage, output formatting, and JSON writing, which let me keep the interface stable while the internals shifted.
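
To show the flavor of those tests, here is a sketch in the spirit of tests/test_benchmark_ips.py: run a tiny benchmark quietly and check the JSON report. The quiet config key and the assumption that JSON mode prints to stdout are mine, used only to illustrate the shape of the test.

# Sketch of a JSON-output test; config keys beyond time/warmup are assumptions.
import json
from benchmark_ips import benchmark

def test_json_output(capsys):
    bm = benchmark.Benchmark()
    bm.config(time=1, warmup=1, quiet=True, format="json")  # short windows keep it fast
    bm.report("noop", lambda: None)
    bm.run()

    # Assumes the JSON report is written to stdout in this mode.
    data = json.loads(capsys.readouterr().out)
    assert data["results"][0]["label"] == "noop"
    assert data["results"][0]["ips"] > 0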

CLI: Scripted Runs

Ruby:

$ benchmark-ips --warmup 2 --time 5 --quiet my_benchmark.rb

Python:

$ python -m benchmark_ips --warmup 2 --time 5 --quiet my_benchmark.py
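
Under the hood, the module entry point only needs to parse those flags and run the target script. A rough sketch, not the repo's actual __main__.py:

# __main__.py sketch: parse the shared flags, then execute the benchmark script.
import argparse
import runpy

def main():
    parser = argparse.ArgumentParser(prog="benchmark_ips")
    parser.add_argument("--warmup", type=float, default=2.0)
    parser.add_argument("--time", type=float, default=5.0)
    parser.add_argument("--quiet", action="store_true")
    parser.add_argument("script")
    args = parser.parse_args()

    # Run the benchmark script as __main__; wiring the parsed flags into the
    # Benchmark defaults is omitted in this sketch.
    runpy.run_path(args.script, run_name="__main__")

if __name__ == "__main__":
    main()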

JSON Output for CI

Ruby:

Benchmark.ips do |x|
  x.config(format: :json)
  x.report("task") { expensive_operation }
end

Python:

bm = benchmark.Benchmark()
bm.config(format='json')
bm.report("task", expensive_operation)
bm.run()

Output (identical structure):

{
  "results": [
    {
      "label": "task",
      "ips": 12500.5,
      "stddev": 250.3,
      "error_percentage": 2.0
    }
  ]
}
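
Because the structure is stable, a CI job can consume it directly. A small sketch of a gate script; the file name and threshold are illustrative:

# Fail the build when a labeled benchmark drops below a floor.
import json
import sys

MIN_IPS = {"task": 10_000}  # illustrative threshold

with open("results.json") as f:
    results = json.load(f)["results"]

for entry in results:
    floor = MIN_IPS.get(entry["label"])
    if floor is not None and entry["ips"] < floor:
        sys.exit(f"{entry['label']}: {entry['ips']:.1f} i/s is below the {floor} i/s floor")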

Verification was a layered check. First, I ran the existing Python tests to validate output and JSON handling. Next, I compared example runs with the Ruby version using the same benchmark snippets, looking for similar ordering and relative performance—no exact numeric match expected, just the same story. I also used short warmups and small time windows to keep the suite fast while still showing stable behavior. Finally, I added spot checks for reporting deltas and standard deviation formatting to catch subtle regressions.

Real Benchmark: Data Structures

Ruby:

require 'set'

data = (1..1000).to_a
set  = Set.new(data)

Benchmark.ips do |x|
  x.report("Array#include?") { data.include?(500) }
  x.report("Set#include?") { set.include?(500) }
  x.compare!
end

Python:

data = list(range(1, 1001))
data_set = set(data)

bm = benchmark.Benchmark()
bm.report("list in", lambda: 500 in data)
bm.report("set in", lambda: 500 in data_set)
bm.compare()

Both tell the same story: set membership is dramatically faster than a linear scan of the list.

The end result is not just "ported", it's operational: the CLI keeps working, the reports look familiar, and the numbers are trustworthy. Codex did the heavy lifting, but the path was deliberate. It's still benchmark-ips, just speaking Python now.