Debugging 100% CPU usage in production Ruby on Rails systems

2 min read Original article ↗

Today I noticed one of our customer containers was running really high on CPU.

# top -bn1 

190 discour+  20   0 2266292 205128  15656 S  86.7  0.3   9377:35 ruby

# ps aux

discour+   190 19.4  0.3 2266292 207096 ?      Sl    2017 9364:38 sidekiq 5.0.5 discourse [1 of 5 busy]

Looks like sidekiq is stuck on a job.

Where is it stuck?

Usually, this is where the story ends and another series of questions start

  • Can we reproduce this on staging or development?

  • What code changed recently?

  • Why is perf trace not giving me anything I can work with?

  • How awesome is my Sidekiq logging?

  • Where is my divining rod?

Julia Evans is working on a fantastic profiler that will allow us to answer this kind of question real quick. But in the meantime, is there anything we can do?

rbtrace + stackprof

At Discourse we include rbtrace and stackprof in our Gemfile.

gem 'rbtrace'
gem 'stackprof', require: false

We always load up rbtrace in production, it allows us a large variety of production level debugging. stackprof is loaded on-demand.

In this particular case I simply run:

# rbtrace -p 190 -e 'Thread.new{ require "stackprof"; StackProf.start(mode: :cpu); sleep 2; StackProf.stop; StackProf.results("/tmp/perf"); }'

This injects a new thread that enables stackprof globally and finally writes the performance data to /tmp/perf

This dump can easily be analyzed:

# stackprof /tmp/perf
==================================
  Mode: cpu(1000)
  Samples: 475 (0.63% miss rate)
  GC: 0 (0.00%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
       473  (99.6%)         473  (99.6%)     #<Class:0x00007fac08218020>::Typography.to_html_french
         4   (0.8%)           1   (0.2%)     #<Module:0x00007fac080c3620>.reap_connections

In fact, we can even collect backtraces and generate flamegraphs in stackprof, so I strongly recommend reading through the readme.

So there you have it, my culprit is:

text.gsub(/(\s|)+([!?;]+(\s|\z))/, '&thinsp;\2\3')

A regular expression consuming 100% CPU in production, this has got to be a first :mage: