Leveraging OpenTelemetry to Fix Flakiness in Integration Tests

At Lumigo, we heavily depend on a set of tests to deploy code changes fast. For every pull request opened, we bootstrap our whole application backend and run a set of async parallel checks mimicking users’ use cases.

We call them integration tests. These integration tests are how we ensure:

- New changes we want to deploy actually work
- Changes aren’t breaking any existing functionality

Recently, we replaced the “traditional log traversing” of our integration tests with *amazing* OpenTelemetry trace graphs like the ones below. In this blog post, we’ll discuss why and how.

A typical integration test could look like this:

1. Sign up for the product
(via the API)

2. Create temporary resources
(temporary Lambda)

3. Do something or trigger an event, as a user would
(invoke the temporary Lambda and make it crash)

4. Expect results on the other end
(wait for the insights to be produced)
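
The four steps above can be sketched in Python roughly as follows. This is only an illustration: `FakeBackend`, its methods, and the `wait_for` helper are hypothetical names invented for the example, and the in-memory backend stands in for the real API so that the sketch runs on its own. The final step polls with a timeout because results arrive asynchronously.

```python
import time
import uuid


class FakeBackend:
    """Hypothetical in-memory stand-in for the product API (an assumption, not the real API)."""

    def __init__(self, pipeline_delay=0.05):
        self.pipeline_delay = pipeline_delay  # simulated data pipeline latency, in seconds
        self._ready_at = {}

    def signup(self):
        # Step 1: sign up via the API and get an account id back.
        return f"account-{uuid.uuid4().hex[:8]}"

    def create_temp_lambda(self, account_id):
        # Step 2: create a temporary resource owned by the test.
        return f"{account_id}-temp-lambda"

    def invoke_and_crash(self, function_name):
        # Step 3: trigger the event; the insight becomes visible only after a delay.
        self._ready_at[function_name] = time.monotonic() + self.pipeline_delay

    def get_insights(self, function_name):
        ready_at = self._ready_at.get(function_name)
        if ready_at is not None and time.monotonic() >= ready_at:
            return [{"function": function_name, "type": "crash"}]
        return []


def wait_for(fetch, timeout=2.0, interval=0.01):
    """Poll fetch() until it returns a non-empty result or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("no result before the timeout expired")


def run_integration_test(backend):
    account_id = backend.signup()
    function_name = backend.create_temp_lambda(account_id)
    backend.invoke_and_crash(function_name)
    # Step 4: wait for the results on the other end.
    return wait_for(lambda: backend.get_insights(function_name))
```

Calling `run_integration_test(FakeBackend())` returns the simulated crash insight once the fake pipeline delay has passed. If the insight never shows up, `wait_for` raises `TimeoutError`.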

The Problem

Sometimes, the integration tests fail, and not because of the application code (the backend code), but because:

- The test code isn’t robust enough (e.g. API schemas aren’t extendable)
- Caching
- “Race conditions” (the test checks for data before it reaches our data pipelines)
- A dependent resource (e.g. a temporary Slack account) is flaky:
  - Slow response times cause timeouts
  - It occasionally returns an unexpected response
  - Tokens were revoked

Whenever a test failed, we would look up the logs and try to find the problem. When the test logs weren’t informative enough, we checked the logs across all related components to search for the root cause that made our test fail.

I’m glad to say we found a better solution to the problem.

OpenTelemetry to the rescue

“OpenTelemetry provides a standard set of APIs, libraries, and SDKs that can be used to instrument your application code and collect data about its behavior”

https://opentelemetry.io/

OpenTelemetry (also known as OTel) is a Lego-like set of SDKs that gives users the flexibility to trace their app end to end. It starts in the code with `instrumentors` and `exporters`, and continues in the “backend” with a `collector`, an `ingestor`, and a database for querying the collected traces.