Tests Should Build Confidence


In automated software testing, the default approach among developers is bottom-up, and the aims are high coverage and working software.

This approach misunderstands the goal of testing and often fails to deliver on the goal of reliably shipping working software.

The idea of the bottom-up approach is that you use tests to show that individual parts are correct, that they integrate, and that your system is correct as a result.

This approach leads to the typical test pyramid with many tests for small parts at the bottom (unit), fewer tests of larger parts in the middle (integration) and even fewer broad tests at the top (end-to-end). The tests at the bottom are small and quick and the tests at the top are broad and slow. 1

The goal of this approach is “working software”, implying focus on correctness. To be successful, the test coverage has to be nearly complete. 2 We evaluate each test by whether it increases coverage and adds to the evidence of correctness.

However, if we consider the purpose of tests with nuance, we can find a better approach.

The goal of a working software engineer isn’t to show that the system is correct; it is to be confident that the system is working as expected. 3 That’s why we write tests.

We don’t just want to produce working software, we want to be confident that the software is working. It’s not enough for the software to work, we have to know it. That confidence is what lets us evolve our software to meet the needs of its users without unnecessary stress and delays, and confidence should be the goal.

With that goal in mind, the way to evaluate a test is by how much it increases our confidence that the system is working.


Consider this thought experiment: we have a piece of software with no tests, and our goal is to convince ourselves that it works.

The first thing to do — the thing that would increase our confidence the most — would be to run it and check that it doesn’t crash.

Next, we would supply different inputs and check the outputs. If we have a UI, click around and take some actions. After a while, we might not have convinced ourselves that the system behaves correctly in all edge cases, but we know that it works.

Contrast that with the traditional pyramid approach.

We would start by adding unit tests for individual functions and modules, building up a comprehensive test suite from the bottom up. After a while, we know that individual modules work correctly in all edge cases, but we don’t know whether the program crashes when we run it.


While the example is contrived, the reality of many teams isn’t that different. They write many tests in the name of following best practices, but when it comes time to deploy something, they aren’t confident.

When we look at their test suites, we find common problems.

As the pyramid prescribes, they have good unit test coverage. The trouble begins with the integration tests.

Because good practice is to test each unit in isolation, internal dependencies are mocked. Because external dependencies like databases and APIs are hard to test against and make the tests slow, they are mocked or skipped. 4
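
To make the failure mode concrete, here is a minimal sketch of what such a fully mocked test often looks like. It is written as a Python/pytest-style test, and OrderService and its collaborators are hypothetical stand-ins, not anything from a specific codebase.

```python
# A sketch of the heavily-mocked style described above.
from unittest import mock


class OrderService:
    """Hypothetical service with an internal repository and an external payment API."""

    def __init__(self, repo, payments):
        self.repo = repo
        self.payments = payments

    def create_order(self, customer_id, amount):
        self.payments.charge(customer_id=customer_id, amount=amount)
        self.repo.save({"customer_id": customer_id, "amount": amount})


def test_create_order_with_everything_mocked():
    # Both the internal repository and the external payment gateway are mocked,
    # so the test never touches a real database or the network.
    repo = mock.Mock()
    payments = mock.Mock()
    payments.charge.return_value = {"status": "ok"}

    OrderService(repo=repo, payments=payments).create_order(customer_id=1, amount=100)

    # The assertions only prove that the mocks were called as expected, not that
    # a real database or payment gateway would have accepted those calls.
    repo.save.assert_called_once()
    payments.charge.assert_called_once_with(customer_id=1, amount=100)
```

A test like this passes whether or not the schema matches, the queries are valid or the payment API behaves the way the mock pretends it does.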

Because running end-to-end tests against a moderately complex service that uses a database, calls other services and uses external APIs is hard, it is not done. And admittedly it is hard: working with external dependencies makes tests hard to set up, makes them slow and makes them flaky. 5

The cost of this approach is high. When the tests pass, no one can be sure that the system won’t crash on startup in production or that it won’t throw errors when accessing the database. They get around this by running the system locally or doing manual tests in staging, but that is a red flag.

Another red flag is writing tests that feel tedious or pointless. Viewed through the lens of building confidence, if a test passing doesn’t make you any more confident about the final product working, there’s no reason to add it. 6

Following the confidence-building approach, a typical test suite looks like an hourglass, not a pyramid.

We still have many small unit tests. We want to be confident that we’ve covered all the edge cases in our logic, and testing those is easiest closest to the source. 7 In this layer we also include tests that call the database, because we want to test that integration thoroughly. 8
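
As a sketch of what a test in this layer can look like when it talks to a real database: the example below uses pytest and, for brevity, an in-memory SQLite database; in a real suite the fixture would point at the same database engine you run in production, for example Postgres in a local container. The add_user function is a hypothetical piece of application logic.

```python
# A sketch of a small test that exercises real SQL instead of a mocked repository.
import sqlite3

import pytest


@pytest.fixture
def db():
    # In-memory SQLite keeps the example self-contained; swap in your real engine.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
    yield conn
    conn.close()


def add_user(conn, email):
    # Hypothetical function under test: inserts a user and returns its id.
    cur = conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
    conn.commit()
    return cur.lastrowid


def test_user_round_trips_through_the_database(db):
    user_id = add_user(db, "a@example.com")
    row = db.execute("SELECT email FROM users WHERE id = ?", (user_id,)).fetchone()
    assert row == ("a@example.com",)


def test_duplicate_email_is_rejected_by_the_schema(db):
    # The constraint lives in the database, so only a real database can verify it.
    add_user(db, "a@example.com")
    with pytest.raises(sqlite3.IntegrityError):
        add_user(db, "a@example.com")
```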

On the other end, we have many end-to-end tests.

For a web service, those would be tests that use a browser to access the service as a user would. For an API, they would call it as a client. We use live dependencies for everything unless technically impossible (like a payment gateway that doesn’t have a test mode; in that case, you should find a better one).

The end-to-end tests should aim to exercise every feature at least once. In dynamic languages, you should be running all your code to make sure there aren’t any crashes. If you have a UI, it has to be tested.
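
As an illustration, an API-level end-to-end test might look roughly like the sketch below. The base URL, endpoints and payloads are hypothetical; the point is that the tests run against a deployed instance with live dependencies, the same way a client would use it.

```python
# A sketch of end-to-end tests that call a running service over HTTP.
import os

import requests

# Hypothetical: point the suite at a deployed test environment.
BASE_URL = os.environ.get("E2E_BASE_URL", "http://localhost:8000")


def test_service_is_up_and_responding():
    # The cheapest confidence boost: the service started and isn't crashing.
    resp = requests.get(f"{BASE_URL}/health", timeout=5)
    assert resp.status_code == 200


def test_create_and_fetch_an_order():
    # Exercise a whole feature the way a client would: write, then read back.
    created = requests.post(
        f"{BASE_URL}/orders",
        json={"customer_id": 1, "amount": 100},
        timeout=10,
    )
    assert created.status_code == 201
    order_id = created.json()["id"]

    fetched = requests.get(f"{BASE_URL}/orders/{order_id}", timeout=10)
    assert fetched.status_code == 200
    assert fetched.json()["amount"] == 100
```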

What’s largely absent is the middle layer. In my experience, the ultimate test of modules “integrating” is that they work in an end-to-end scenario. Testing them beyond that doesn’t provide additional confidence.

One downside of this approach is that confidence isn’t easy to measure, but there are questions we can ask. Would you be nervous about your system getting deployed or shipped when your tests pass? Are there any (implicit) manual testing steps? Do you have to watch for problems after every deploy?

If you answered yes to any of these, you aren’t confident. Think about what tests would increase your confidence: what would you check manually? Automate it.
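
For a UI, the manual click-through you would otherwise do in staging can usually be captured directly. The sketch below uses Playwright’s Python API; the URL, credentials and selectors are hypothetical.

```python
# A sketch of automating a manual smoke check through the browser.
from playwright.sync_api import sync_playwright


def test_user_can_log_in_and_see_the_dashboard():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # The same steps you would click through by hand before a deploy.
        page.goto("https://staging.example.com/login")
        page.fill("input[name=email]", "smoke-test@example.com")
        page.fill("input[name=password]", "not-a-real-password")
        page.click("button[type=submit]")

        # If this appears, the app started, served the page, authenticated
        # against its real backend and rendered the user's data.
        page.wait_for_selector("text=Dashboard")

        browser.close()
```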

What is the right level of confidence? This is my favourite test: would your tests make you confident enough to deploy at 5pm on Friday, close your laptop and go home?