Introduction
When I first started in software development, I had a very ad-hoc approach to debugging. I’d basically either (1) try to reproduce the issue locally and add print statements or (2) Google it.
It’s no surprise this approach didn’t work for previously unsolved bugs, unreproducible bugs, or non-code issues (like library dependency conflicts and Docker build failures).
Eventually, I developed a systematic approach that has helped me considerably. Debugging is a fact of life as a software engineer, but if you get good at it, you not only save yourself countless hours; it can become quite enjoyable as well.
1) Understand the problem thoroughly
Some things to consider or do:
- Read the error message in depth
- What’s the scope of the problem? Who is and isn’t affected?
- What’s the sequence of steps leading up to the problem?
- When did the problem start happening?
2) Apply heuristics
Let’s say you’re getting a “connection refused” error when trying to connect from your backend API to your Postgres database in production in AWS.
Some heuristics you could try:
- Google the error message (or check Stack Overflow or ask ChatGPT)
- Temporarily remove the firewall rules for the Postgres database to see if that fixes it
- Connect to the database locally or from another machine
- Connect using a different database client
Heuristics are really just educated guesses, guided by your intuition. Often there’s no logic to them: why would restarting the API fix this issue? But often it does. Most issues are quite simple, or common, so heuristics are sufficient.
Some more heuristics that may help:
- What changed recently (or since it last worked)? See if any recent code, configuration, or infrastructure changes may have caused this issue. Running git bisect can help.
- Restart or re-create the component, like the server or the database
3) Reproduce locally and simplify
Reproducing the issue locally is often 80% of the work. Some techniques:
- Record the failing curl request: For APIs, recording the failing curl request captures all aspects of the HTTP request and makes it really easy to reproduce (in your terminal or by importing into Postman)
- Connect to real dependencies: Obviously, this is difficult because you probably won’t have access to production databases, message brokers, etc. and even if you do, they may not be reachable from your local machine
- Use the same code version, data, and configuration locally as in the failing request. This is also easier said than done.
Once you’ve reproduced, it’s helpful to simplify the failing input and code as much as possible. Some techniques:
- Writing a failing unit test case: It’s much easier to run a debugger on a unit test case than on an HTTP server, and the feedback loop is faster as well (see the sketch after this list)
- Isolating the failing code: You can copy the failing method into a new executable program. The issue might disappear altogether. If not, your feedback loop will be faster (and running your debugger will be easier)
- Make calls using a different tool: As an example, if your code that attempts to connect to the database is failing, try to connect to the database using the Postgres client. This isolates the call from your code.
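To make the unit test idea concrete, here’s a minimal sketch in Go. Both parseUserID and its buggy behavior are hypothetical stand-ins for whatever code and input are actually failing:

```go
package user

import (
	"strconv"
	"testing"
)

// parseUserID stands in for the real code under test. This version has the
// (hypothetical) bug: it doesn't handle the "user-" prefix seen in production.
func parseUserID(s string) (int, error) {
	return strconv.Atoi(s)
}

// The failing test captures the exact input observed in the failing request,
// so the bug can be reproduced with `go test` and a debugger instead of a
// running HTTP server.
func TestParseUserIDWithPrefix(t *testing.T) {
	got, err := parseUserID("user-42") // input copied from the failing request
	if err != nil {
		t.Fatalf("parseUserID(%q) returned an error: %v", "user-42", err)
	}
	if got != 42 {
		t.Errorf("parseUserID(%q) = %d, want 42", "user-42", got)
	}
}
```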
4) Understand the system
The goal here is to understand the different relevant components involved in the system.
For instance, for a login issue, the components might be a SPA frontend, a load balancer, a backend service, and a NoSQL database (and the network links between them).
From there you would still need to understand how the components interact with each other, what data they send and which interfaces they call.
Some techniques that can help:
- Read the documentation
- Look at the interfaces for any components you identify
- Read the relevant source code
- Read any relevant source code of third party dependencies
5) Instrument and rule out one component at a time
We can then proceed to rule out one component at a time. So for instance, we might rule out the frontend because it’s making the right call to the API, and the load balancer because it’s forwarding the request correctly to the API.
Once we identify that it’s a backend API issue, we can start repeating steps 4 and 5 at the API service layer: understand the API’s components, then rule out one module, then class, then method, then line of code, until we get to the lines responsible for the issue.
Ideally, we want to bisect, so that checking one hypothesis rules out half of the remaining components at a time. In other words, we’re not really checking one component at a time; we’re repeatedly checking the component halfway through the request flow, as a form of binary search.
How can we instrument? It of course depends on the component at hand.
- If we want to instrument a Kafka topic, we could add a consumer service which logs each message sent to the topic (see the sketch after this list).
- If we want to instrument network traffic, we could use Wireshark or ngrep (layer 4) or an HTTP proxy (layer 7).
- If we want to instrument our code, we could add logs or assert statements.
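As an example of the Kafka approach, here’s a rough sketch of a throwaway logging consumer using the segmentio/kafka-go library; the broker address, topic name, and group ID are placeholders:

```go
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	// A throwaway consumer group whose only job is to log every message,
	// so we can see exactly what is flowing through the topic.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"}, // placeholder broker address
		Topic:   "orders",                   // placeholder topic name
		GroupID: "debug-logger",             // placeholder group ID
	})
	defer r.Close()

	for {
		m, err := r.ReadMessage(context.Background())
		if err != nil {
			log.Fatalf("read failed: %v", err)
		}
		log.Printf("partition=%d offset=%d key=%s value=%s",
			m.Partition, m.Offset, m.Key, m.Value)
	}
}
```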
See the section at the end on Debugging Tools for more on finding the right tool to use to gather data from your component.
6) Fix the problem
If you get to this point, fixing the problem should be quite easy. Unfortunately, I don’t have any other tips to offer you on this at the moment.
Examples
Example 1: API -> database connection failure
Let’s illustrate using the example above.
- Understand the problem: We’ve read the error message, which is a timeout error from our database client library. This suggests our request is not reaching the database at all, but we don’t know for sure.
- Reproduce locally and simplify: Unfortunately it’s not possible for us to reproduce locally in this case as our production database is not exposed to the Internet.
- Understand the system: We can draw out a diagram with our server, our database, the AZs and regions they run in, their VPCs, any reverse proxies, etc. Of course, the more detail the better. This may require researching AWS.
- Rule out one component at a time: Our bisecting hypothesis might be “our database connection is being blocked by the database’s security group firewall”. We can then look at the API’s logs to confirm the request is being made. We can look at the database’s logs to confirm the request is not reaching the database. We can set up VPC Flow Logs to see if the request is reaching the security group firewall.
We can repeat this last step until we drill down on the issue.
For instance, say the API is attempting the connection, but it is not reaching the security group firewall. We can then bisect, hypothesize that the request is not leaving the API’s VPC, and repeat.
Example 2: Library dependency conflicts
Let’s say you’re installing a new library in your Go project, but it’s conflicting with another library, and the error message is cryptic.
Some heuristics you could apply are Googling the error message, clearing your local Go cache, seeing if you can remove one of the two dependencies, or running go mod tidy.
If none of this works, we can:
- Understand the problem: Re-read the error message, read online about what causes it, or search for the error in the Go modules source. Learn how Go modules work internally. Aim to be able to precisely summarize what’s causing the error in this specific case in a few sentences.
- Reproduce locally and simplify: If helpful, you could create a minimal program with just the conflicting dependencies (see the sketch after this list)
- Understand the system: Most of this was already done in Step #1
- Rule out one component at a time: Run tools like go mod graph, go mod why, and goda to inspect the state of your project’s packages. You can also look at how Go modules are stored on disk.
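For the minimal reproduction mentioned above, the program can be as small as a fresh module that imports nothing but the two conflicting libraries; the package paths below are hypothetical placeholders:

```go
// main.go in a fresh module: run `go mod init repro` and `go mod tidy`
// in an empty directory containing only this file.
package main

import (
	// Hypothetical placeholders for the two conflicting libraries.
	_ "example.com/newlib"
	_ "example.com/oldlib"
)

func main() {}
```

If go mod tidy fails in this throwaway module too, you’ve confirmed the conflict is independent of the rest of your project.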
Get organized
It can be difficult to keep all your thoughts and ideas organized when debugging something. Some techniques:
- Draw out a diagram
- Keep a log: what did you try and what was the outcome
- Write out the problem, as if you’re explaining it to someone to get their help. In other words, rubber duck debugging.
Ask for help
Ultimately, if you’re not able to debug an issue after a long enough time, you’d be better off asking the right person for help, whether inside or outside of your organization. You could use git blame to find the relevant people for a given piece of code.
I don’t have any further suggestions on this at the moment either, but it shouldn’t be too hard.
Debugging tools
Finding the right debugging tool can be the difference between spending hours and minutes debugging.
That’s why I strongly believe that it’s worthwhile to look on Google or GitHub for the top specialized debugging tools for whatever you need to debug.
For instance, I recently found docker-debug and buildg, which are great tools for debugging Docker containers and Dockerfiles, respectively.
Some types of debugging tools I’ve found especially helpful are:
- REPLs — in Python, you can drop import pdb; pdb.set_trace() anywhere in your code to get an interactive interpreter
- Debuggers — especially their more advanced features, like logpoints, watchpoints, and conditional breakpoints
- git bisect — to narrow down which commit introduced a bug
Debugging tools I’ve been looking into:
- API proxies — which let you view and rewrite HTTP requests in flight
- Database query plan analyzers — which help you understand why some queries are slow. Example: pganalyze
- Live debugging tools, like Lightrun — which let you add logs to and view variable values in your code running in production
- Remote debugging — connecting your debugger to a running code instance. Especially useful for HTTP servers
- Time travel debugging — using rr
Design for debuggability
I’ve found that designing your program to be easily debuggable is invaluable, yet it’s rarely discussed. Often the discussion stops at adding logs, but there’s much more to it.
The fact is, as developers, we will often have minimal access to the production environment, so it’s incumbent on us to expose any information we need for effective debugging via endpoints or logs. The alternative is wasting a lot of time going back and forth with the cloud team.
The way I think about it is that there are four main things that vary between a local and a production environment. We can instrument our code to give us visibility into each.
- Code: We can add an endpoint to get the running code’s Git commit hash (see the sketch after this list).
- Configuration: We could log the configuration upon startup (minus any secret fields).
- Infrastructure: For databases in particular, we could log the applied database migrations
- Data: There’s not much we can do for this, as far as I know
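As a sketch of the first item: in Go, the commit hash can be injected at build time and exposed on a debug endpoint. The variable name, route, and port here are only examples:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// commit is injected at build time, e.g.:
//   go build -ldflags "-X main.commit=$(git rev-parse HEAD)"
var commit = "unknown"

func main() {
	http.HandleFunc("/debug/version", func(w http.ResponseWriter, r *http.Request) {
		// Returns the Git commit the running binary was built from.
		fmt.Fprintln(w, commit)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```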
I also try to snapshot as much of the program’s state as I can when it runs into an issue. For instance, in a web scraping job, I would screenshot web pages that we failed to scrape successfully. That way, even if the page changes later on, I can identify why the job failed.
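Here’s a minimal sketch of that idea in Go, saving the raw HTML instead of a screenshot (a screenshot would require a headless browser); parsePage is a stand-in for the real scraping logic:

```go
package scraper

import (
	"fmt"
	"log"
	"os"
	"time"
)

// parsePage is a stub standing in for the real parsing/scraping logic.
func parsePage(body []byte) error {
	// ... real logic goes here ...
	return nil
}

// scrape keeps a snapshot of the raw page whenever parsing fails, so the
// failure can still be investigated after the live page has changed.
func scrape(url string, body []byte) error {
	if err := parsePage(body); err != nil {
		name := fmt.Sprintf("failed-%d.html", time.Now().Unix())
		if writeErr := os.WriteFile(name, body, 0o644); writeErr != nil {
			log.Printf("could not save snapshot for %s: %v", url, writeErr)
		} else {
			log.Printf("scrape of %s failed: %v; snapshot saved to %s", url, err, name)
		}
		return err
	}
	return nil
}
```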
Tactical tips
- Compare against working code. This might include finding a working solution off of GitHub or developing one in a different language and comparing the code.
- Make sure your modified code is running. Oftentimes the print statements you add are not triggered because the new version of your code is not actually running
- Avoid guessing or reasoning purely in your head; instead, add logs and verify
- Debugging tools that are efficient to use make a big difference. One example I’ve noticed is using Prisma Studio instead of PgAdmin — it’s much faster for browsing the data in your database
- Add sufficient logs so you can follow the code’s flow for any request. It’s also worthwhile to log the inputs and outputs of any external calls you make (see the sketch after this list).
- Inspect state, like in-memory or on-disk data structures, databases, or caches — do they contain the right data?
- Similarly, if your program is reading data from somewhere else, look at the source-of-truth data store. An example would be looking at the config values in Vault, then inspecting the environment variables set in the container, then logging the config read by the program.
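As a sketch of the tip about logging external calls, here’s one way to wrap an outgoing HTTP request in Go; the function name and the fields logged are just illustrative:

```go
package client

import (
	"context"
	"io"
	"log"
	"net/http"
)

// fetch logs the input (URL) and the output (status and body size) of an
// external HTTP call, so the call can be followed in the logs when debugging.
func fetch(ctx context.Context, httpClient *http.Client, url string) ([]byte, error) {
	log.Printf("GET %s", url)

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := httpClient.Do(req)
	if err != nil {
		log.Printf("GET %s failed: %v", url, err)
		return nil, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	log.Printf("GET %s returned status=%d bytes=%d", url, resp.StatusCode, len(body))
	return body, err
}
```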
Acknowledgements
- Julia Evans — The Pocket Guide to Debugging — I learned many of the techniques above from this book. Highly recommend.
- A systematic approach to debugging — I re-read this after writing this post, and found it aligned closely with mine.
- How to become a debugging master and fix issues faster (Taro)