Speeding up our Rails CI with Github actions and parallel_tests

Workable Tech Blog

Summary

This article outlines how we used parallel_tests and Github Actions to bring our CI times down from 25 to 10 minutes. It starts with an overview of our previous setup and its challenges, then moves on to the requirements we set, the research we conducted, and the proof of concepts we tried out. We then discuss the implementation phase, highlighting some decisions we took along the way. We hope the experiences and strategies shared in this article offer useful ideas for enhancing your own CI process.

by Markos Fragkakis, Staff Engineer @ Workable

Photo by logojackmowo Yao on Unsplash

Background

Let us share some background before we dive into the actual story.

Our repository

The Applicant Tracking System, or ATS for short, is the main software Workable builds, and it enables thousands of teams around the world to do their hiring.

ATS was the only repository in Workable for years, until we started carving out parts of it as other services. Despite “losing” parts of the business logic, ATS kept growing in lines of code and number of tests. Below you can see this increase over the years.

[Chart: growth of ATS lines of code and number of tests over the years]

Some readers may be wondering why we haven’t broken up our monolith further. Rest assured we are working on some significant extractions that will make a visible difference. In any case, it is always beneficial for a CI pipeline to be able to scale. If you are interested in knowing more about our architecture, you can watch the great talk our VPs of Engineering gave at Voxxed Days.

Our test suite

In Workable we follow the test pyramid approach, which was introduced by Mike Cohn in his book Succeeding with Agile (2009).

[Diagram: the test pyramid]

Following the Rails way, the ATS tests reach the DB, which is what a non-Rails developer would call integration tests. This post doesn’t aim to advocate for any particular side, so the specific names of the pyramid’s layers aren’t the focus. Instead, it’s about the fundamental concept of the pyramid approach, which involves having a smaller number of costly, slower, high-level tests at the top, and more layers of numerous, less expensive, faster tests beneath.

The tests in our repository belong to the lower layers. To be specific, our suite contains:

  • ~25K unit tests
  • ~100 Rspec tests
  • ~100 Elasticsearch integration tests

The above have the following dependencies:

  • Most of them need the DB
  • ~200 need Redis
  • ~100 need Elasticsearch

Currently it takes about 40 minutes to run all the tests sequentially on an M1 pro laptop using a single process (>95% of the time is spent on unit tests). So running all the tests locally is not something a developer would typically do.

Our previous CI pipeline

Our previous CI pipeline was built on Jenkins and Kubernetes several years ago.

Infrastructure-wise, we used the kubernetes plugin to run jobs on agents in a Kubernetes cluster. The agent requested a lot of resources (10 CPUs, 10GB RAM) and shared them among the containers it started:

  • Our base image
  • Postgres
  • Redis
  • Elasticsearch

CI-wise, the pipeline ran all the tests on every Pull Request and on every merge to our main branch. In the case of Pull Requests, it also linted the code using Danger and Rubocop, annotating the PR with useful warnings. Finally, it posted a message to Slack with links to the test results:

[Screenshot: Slack message with links to the test results]

In the beginning, when the codebase was not too big, our CI was reasonably fast (~10 minutes). Over time, it grew in maturity and rarely needed maintenance. However, as the team grew, so did the number of tests and the time it needed to run, exceeding 25 minutes. Furthermore, the number of times it was executed grew to over 1000 per month.

Over time we made several, often successful, efforts to optimize our CI. We checked out our code with shallow clones. We cached our gems in an NFS directory. We parallelized the execution of tests using the parallel_tests gem (more on that later). We also scaled up our agent and made other improvements here and there.

But at the end of the day, all the tests were executed on a single agent, and the average time to run CI kept increasing, along with the number of tests.

The obvious problem with long CI times is the long feedback loop between the time a developer commits and the time they get the test results. But we had other problems with our CI setup too. Parts of the logic in our pipelines were hidden in shared libraries maintained by other teams. Also, the Jenkins documentation is not ideal. As a result, tweaking or troubleshooting our CI pipeline became a chore.

So we decided to rethink our CI.

Phase 1: Requirements

We started out by setting some requirements for our new CI.

Production parity: It should run on the same OS as production to avoid failures in production while the CI passes.

Speed: It should be fast so that we have a short feedback loop.

Cost: It should have reasonable cost.

Development experience: It should be easy for developers to understand and maintain.

Scalability: It should be straightforward to scale to accommodate a larger number of tests.

This is how our previous CI did against the requirements (not very well):

Production parity: ✅ The Jenkins agent used the same base image as our production image, so the tests ran in a system that was almost identical with production.

Speed: ❌ It took 25 minutes to run the tests.

Cost: ✅ ~$650 per month

Development experience: ❌ The pipeline was difficult to maintain and tweak.

Scalability: ❌ The pipeline was designed to scale up rather than out. There was parallelization with the parallel_tests gem, but everything ran on a single Jenkins agent and was bound by its resources.

Phase 2: Research

In this phase we assessed our current toolbelt, as well as approaches we could use for our new CI.

Parallel_tests

As mentioned before, we were already using the parallel_tests gem in our previous CI pipeline. Let’s take a closer look at what the gem does.

The basic offering is that it breaks the test suite into a number of “equal” groups and runs them in parallel, using different processes. Each test will end up in a single group, and each group will run in a separate process, with its own database. This makes the processes completely independent of each other.

You can choose the criterion to create “equal” groups with. The default criterion, and the best one for our case, is runtime. To do this, the gem creates a runtime log of all the tests, which you must store and use as input in subsequent runs. For example, if your tests take 20 minutes in total, you could use the gem to split them in 4 groups of 5 minutes runtime (approximately) and run them in parallel.
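
To give an idea of what that log looks like: it is a plain-text file with one line per test file and its accumulated runtime in seconds (the file paths below are made up for illustration):

test/models/candidate_test.rb:34.21
test/models/job_test.rb:12.87
test/controllers/candidates_controller_test.rb:58.02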

These are the commands we use with the gem (N is the number of groups we want). They correspond to the standard Rails commands for DB creation, DB setup and execution of the tests:

bundle exec rails parallel:create[N]
bundle exec rails parallel:setup[N]
parallel_test test -n N --group-by runtime --runtime-log tmp/parallel_runtime_test.log

Github actions

Being on Github, and since other teams had already been using Github Actions (Github’s offering for automation), it only made sense for us to consider it too. Github Actions lets you define workflows, which consist of one or more jobs, each executed by a runner. Each job has one or more steps, for which you can use one of the pre-built actions or define your own.
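
If you haven’t seen one before, a minimal workflow file looks roughly like this (the names and the command are only illustrative):

# .github/workflows/example.yml
name: Example
on: [push]                          # run the workflow on every push
jobs:
  test:                             # a job, executed by a runner
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # a pre-built action from the marketplace
      - name: Say hello             # a step we define ourselves
        run: echo "hello"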

Github actions quickly scored several points, like having great online documentation and lots of literature about it. They are widely adopted, and they boast a big ecosystem, including a rich marketplace of ready to use actions.

The workers in the context of Github actions are called runners. Currently there are the following types:

Github-hosted runners

Github-hosted runners are machines hosted by Github. They have fixed hardware (currently 2 CPU cores, 7GB RAM), and the only Linux flavor they support is Ubuntu. You are charged by the minute of usage. If you have an idea of the computing time you will need, you can use the Github pricing calculator to estimate the cost.

Recently Github introduced larger runners with additional hardware setups (and different pricing), but they were not available at the time of this exercise.

Self-hosted runners

Github also allows you to use self-hosted runners, in which case you maintain the runners and Github only triggers the workflows. With self-hosted runners, there are more options for the underlying platform, although not every platform is supported (e.g. Alpine currently isn’t). There are tools for autoscaling self-hosted runners on EC2 and k8s. With self-hosted runners, Github doesn’t charge anything; you pay for your runners, wherever you host them.

Actions

With Github actions you get to use several of the pre-built actions that are available on the marketplace. The most important ones are maintained by Github, and they are optimized for their environment. Below is a comparison of how much time some standard steps took in our previous CI pipeline vs Github actions.

[Table: duration of standard CI steps in our previous Jenkins pipeline vs Github Actions]

Having spent significant time to achieve that 3m 35s, we were seriously impressed with the 21s. Well done, Github Actions, you have our attention.

The matrix strategy

From the beginning we wanted a solution that could scale out, instead of scaling up. That would allow us to throw more computing power at the problem if the test suite grew further, keeping the total runtime stable.

Knowing how the parallel_tests gem works, breaking the suite into “equal” groups, we could use a number of runners, and assign a number of groups to each. Since the groups are “equal”, all the runners should finish at roughly the same time. We can then combine their results into a unified test report.

This is doable in both Github Actions (link) and Jenkins (link), using the matrix strategy. The matrix lets you use variables from a job to automatically create multiple job runs that are based on the combinations of the variables. For example, you can use a matrix strategy to test your code in multiple versions of a language or on multiple operating systems. In our case, the matrix would be used to assign a number of test groups to a runner.
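
As a simplified sketch (our real values appear later in the article), a matrix like the following spawns one job per entry, and each job receives its entry through ${{ matrix.groups }}:

strategy:
  matrix:
    groups: [ "[1,2,3]", "[4,5,6]" ]   # two runners, each handling three test groups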

This would change the pipeline from using a single big runner (left) to multiple smaller runners (right), which we could increase at will. The diagram below shows an example, where we would have N test groups, and we would assign 3 test groups to each runner:

[Diagram: N test groups distributed across runners, 3 groups per runner]

Phase 3: Proof of concepts

We then went on to implement proof of concepts using both a single large runner, and multiple runners, on both Jenkins and Github actions. In the case of multiple runners, we used the matrix strategy approach to assign a number of groups to multiple workers.

For our experiments, we decided to only run the unit tests, Elasticsearch and Rspec tests, since this is the part of CI we wanted to benchmark. The creation of a test report and linting were left out for now.

Github Actions with Github-hosted runners

Since the standard Github-hosted runners are small, we didn’t bother running all the tests on a single runner. We only tried the matrix strategy on multiple runners. This is what the workflow looked like:

[Diagram: workflow with multiple unit test runners, a separate integration test runner, and a final runner that combines the results]

You will notice there is a separate runner for integration tests. This is because the Elasticsearch container took ~30s to start, so we quickly decided to run those tests in a separate runner, and only start Elasticsearch on that one. The RSpec tests were executed there too. The matrix strategy shares the unit tests among the runners called Unit test runners. When all the runners finish, we start a runner to combine all the partial results.

The variables of the experiment were the number of unit test runners and the number of groups assigned to each runner. The table below contains our measurements. The runtime column reports the time it took for the suite to finish. This is not what Github will charge us for, since we are using multiple runners in parallel. The time we will be charged for is in the Billing Time column. The lower both times are, the better.
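
To illustrate with made-up numbers: if 4 unit test runners each stay busy for roughly 8 minutes, the runtime is roughly 8 minutes, while the billing time is roughly 4 × 8 = 32 runner-minutes.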

[Table: runtime and billing time for different combinations of unit test runners and groups per runner]

Github Actions with self-hosted runners (EC2)

For this experiment we used EC2 to host a pool of runners. Since with self-hosted runners you can specify your hardware, we tried both big and small runners.

Multiple, small runners

We first ran the same workflow as before, this time on self-hosted runners, approximating the resources of the standard Github-hosted runners. The table below contains our measurements. Notice that the Billing time column is missing, since it is EC2 that bills us for the resources:

[Table: runtime with multiple small self-hosted runners]

Single, big runner

Then we tried considerably larger runners, and ran all the parallel_tests groups on a single runner, much like our previous CI implementation.

We tried it with two different numbers of groups. The table below contains our measurements:

[Table: runtime on a single large self-hosted runner with two different group counts]

Jenkins with matrix strategy

To do the experiment on Jenkins with multiple runners, we modified our previous CI pipeline to use the matrix strategy. This is what the workflow looked like now:

[Diagram: the Jenkins pipeline using the matrix strategy]

You will notice that now we needed an initial agent to start the others. Also, Jenkins didn’t need any agent to combine the results; it was able to pick up the partial test results from all the agents and present them.

Since Jenkins allows for customization of resources at the container level, we experimented with those settings too. We won’t dive into a lot of detail here. The best time we achieved was 15m 51s, with a setup of 4 agents for unit tests with 3 groups each. Our observation was that no matter how fast the tests were executed, the pipeline was slow because of the slow checkout and dependency installation (gems and packages).

Phase 4: Assessment

Let’s see how each proof of concept fared against our initial requirements.

Github Actions with Github-hosted runners

Production parity: ✅ The runner we used was Ubuntu-based, same as our production image.

Speed: ✅ <11 minutes without optimizations

Cost: ✅ Using the pricing calculator, we estimated the cost to be the same as our previous CI

Development experience: ✅ Very good

Scalability: ✅ This is where the matrix strategy shines. Regardless of how much the number of tests increases, we will always be able to break the suite into more groups, keeping the total time stable. This, combined with Github Actions’ optimized checkout and dependency management, works like a charm.

Github Actions with self-hosted runners (EC2)

The same workflows that run on Github-hosted runners can also be executed on self-hosted runners with minimal changes. However, there are some important gotchas we found (and probably more we haven’t found):

  • The ruby/setup-ruby action doesn’t guarantee working on self-hosted runners (link). There are best practices for making it work, but the documentation doesn’t guarantee that it will. In our experiments below it did work, but it’s not officially supported. The same may be the case with other actions too. We considered this an important red flag against self-hosted runners altogether.
  • Another thing to note is that in our experiments our SRE team provided some runners on EC2. The filesystem of these runners was persistent, filling up the disk after a few runs, so we had to add cleanup steps (this probably wouldn’t be a problem if we used Kubernetes with its ephemeral filesystem). A sketch of such a cleanup step follows this list.
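
As a rough sketch of the kind of cleanup step we mean (the exact paths depend on your runner setup, so treat this as illustrative only):

- name: Clean up workspace
  if: always()                       # clean up even when earlier steps failed
  run: |
    df -h .                          # log the remaining disk space for troubleshooting
    rm -rf tmp/* log/* test/reports  # remove leftovers that would otherwise accumulate between runs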

In both multiple small runners and single large runner, the results were the same:

Production parity: ✅ The runner we used was Ubuntu-based, same as our production image.

Speed: ❌ ~20 minutes

Cost: ✅ We calculated the cost to be the same as our previous CI

Development experience: ❌ Very good in itself, BUT some official actions are not guaranteed to work on self-hosted runners, which may turn paradise into hell in the future.

Scalability: ❌ Neither approach was scalable. Multiple, small runners had poorer performance compared to Github-hosted runners. The single, big runner could only scale up.

Jenkins with matrix strategy

Production parity: ✅ The agent used the same base image as our production image

Speed: ❌ >15 minutes

Cost: ✅ We calculated the cost to be the same as our previous CI

Development experience: ❌ Not good. Jenkins documentation and cross-team standard library turned CI into a chore.

Scalability: ❌ The matrix strategy itself shines here: regardless of how much the number of tests increases, we can always break the suite into more groups and keep the total time stable. The problem with Jenkins is that, no matter how many agents we use, a lot of time is lost to the slow operations (checkout and dependency management).

Conclusion

Taking all the above into account, we decided to migrate our CI pipeline to Github Actions on github-hosted runners.

Phase 5: Implementation

The purpose of the proof of concepts was to run the slow part, the tests. But to evolve to become our new CI pipeline, we needed at least our previous pipeline’s functionality: a test report, code linting on PRs and posting to Slack. In this section we’ll take a look at our new pipeline.

Below you can see the top-level file for our workflow. As in our experimentation phase, we found that the ideal setup for us was 4 runners for unit tests (3 groups each), 1 runner for integration tests, and a runner to combine the results and report.

# .github/workflows/ci.yml
name: Continuous Integration (Github-hosted)
on:
  pull_request:
    branches:
      - master
  push:
    branches:
      - master
concurrency:
  # head_ref for PRs is the branch name, for master it is an empty string. run_id is unique for each run.
  # This way we will cancel ongoing runs for PRs (to save $), but we will perform all runs for master.
  group: ${{ github.head_ref || github.run_id }}
  cancel-in-progress: true
jobs:
  run_unit_tests_chunk:
    # this is the job for the unit tests (non-integration tests)
    strategy:
      matrix:
        groups: [ "[1,2,3]", "[4,5,6]", "[7,8,9]", "[10,11,12]" ]
    uses: ./.github/workflows/ci_unit_tests_chunk.yml
    secrets: inherit
    with:
      groups: ${{ matrix.groups }}
      group_count: 12 # the total number of test groups, must match the groups listed in matrix.groups
      parallel_processes_count: 3 # the number of parallel processes to run tests in each worker,
                                  # must match the size of the inner arrays in matrix.groups
  run_integration_tests:
    # this is the job for the integration tests (ie rspec, elasticsearch tests etc)
    uses: ./.github/workflows/ci_integration_tests.yml
    secrets: inherit
  combine_and_report:
    # this is the job that combines the results of the previous two jobs and reports them
    uses: ./.github/workflows/ci_combine_and_report.yml
    needs: [run_unit_tests_chunk, run_integration_tests]
    secrets: inherit

If in the future we decide to use a different number of runners or groups, we would adjust the `groups` matrix, together with `group_count` and `parallel_processes_count`.
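
For example, a hypothetical move to 6 runners with 2 groups each (still 12 groups in total) would look roughly like this:

strategy:
  matrix:
    groups: [ "[1,2]", "[3,4]", "[5,6]", "[7,8]", "[9,10]", "[11,12]" ]
...
with:
  groups: ${{ matrix.groups }}
  group_count: 12               # still 12 groups in total
  parallel_processes_count: 2   # now 2 test groups (and test processes) per runner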

Another nice highlight is the elegant, declarative way to cancel any ongoing runs of the CI pipeline on subsequent commits to the PR. This is achieved with the concurrency setting.

Below is the job that handles the unit tests (parts were left out for brevity). Note the step that downloads from S3 the runtime log of previous runs, which parallel_tests uses to break the tests into even groups.

# .github/workflows/ci_unit_tests_chunk.yml
on:
  workflow_call:
    inputs:
      groups:
        required: true
        type: string
      group_count:
        required: true
        type: number
      parallel_processes_count:
        required: true
        type: number
env:
  GROUPS_COMMA: ${{ join(fromJSON(inputs.groups), ',') }}
  GROUPS_UNDERSCORE: ${{ join(fromJSON(inputs.groups), '_') }}
  DISABLE_SPRING: 1
jobs:
  checkout_setup_prepare_and_run:
    runs-on: ubuntu-latest
    services:
      db:
        image: postgres
        ...
      redis:
        image: redis
        ...
    steps:
      - name: Checkout Project
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha || github.event.after }}
      - name: Install native dependencies for gems
        run: |
          sudo apt-get update
          sudo apt-get install -y \
            build-essential \
            ...
      - name: Setup Ruby
        uses: ruby/setup-ruby@v1
        with:
          bundler-cache: true # runs 'bundle install' and caches installed gems automatically
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Download parallel runtime log from S3
        run: |
          aws s3 cp s3://our-dev-bucket/uploads/ats-ci-log/parallel_runtime_test_github.log tmp/old_parallel_runtime.log
      - name: Create DBs
        env:
          RAILS_ENV: test
          RAILS_MASTER_KEY: ${{ secrets.RAILS_MASTER_KEY }}
          PARALLEL_TEST_PROCESSORS: ${{ inputs.parallel_processes_count }}
        run: |
          bundle exec rails parallel:create parallel:setup
      - name: Run tests
        env:
          RAILS_ENV: test
          RAILS_MASTER_KEY: ${{ secrets.RAILS_MASTER_KEY }}
          MINITEST_JUNIT_REPORTER: yes
        run: |
          bundle exec parallel_test test \
            -n ${{ inputs.group_count }} \
            --only-group ${{ env.GROUPS_COMMA }} \
            --serialize-stdout \
            --group-by runtime \
            --runtime-log tmp/old_parallel_runtime.log \
            --combine-stderr \
            --verbose \
            --exclude-pattern '.*elasticsearch_integration_test.rb' || true
      - name: Compress artifacts
        # the `upload-artifact` action is slow for many, small files. So we zip the junit files and upload them as a single file
        run: |
          zip -r test_reports_${{ env.GROUPS_UNDERSCORE }}.zip test/reports
          mv tmp/parallel_runtime_test.log parallel_runtime_test_${{ env.GROUPS_UNDERSCORE }}.log
      - name: Upload chunk results
        uses: actions/upload-artifact@v3
        with:
          name: test_reports_${{ env.GROUPS_UNDERSCORE }}.zip
          path: test_reports_${{ env.GROUPS_UNDERSCORE }}.zip
      - name: Upload parallel tests runtime log
        # We upload the runtime log produced by this chunk as an artifact; the combine job merges the parts and stores the result on S3 for future runs
        uses: actions/upload-artifact@v3
        with:
          name: parallel_runtime_test_${{ env.GROUPS_UNDERSCORE }}.log
          path: parallel_runtime_test_${{ env.GROUPS_UNDERSCORE }}.log

Below is the job running the Elasticsearch and Rspec tests.

# .github/workflows/ci_integration_tests.yml
on:
  workflow_call:
env:
  DISABLE_SPRING: 1
jobs:
  run_integration_tests:
    runs-on: ubuntu-latest
    services:
      db:
        image: postgres
        ...
      redis:
        image: redis
        ...
      elasticsearch:
        image: elasticsearch
        ...
    steps:
      - name: Checkout Project
        ...
      - name: Install native dependencies for gems
        ...
      - name: Setup Ruby
        ...
      - name: Create DB
        env:
          RAILS_ENV: test
          RAILS_MASTER_KEY: ${{ secrets.RAILS_MASTER_KEY }}
        run: |
          bundle exec rails db:create db:setup
      - name: Populate DB with fixtures and load into elasticsearch
        env:
          RAILS_ENV: test
          RAILS_MASTER_KEY: ${{ secrets.RAILS_MASTER_KEY }}
        run: |
          bundle exec rails db:fixtures:load test:recreate_es_index
      - name: Run Elasticsearch integration tests
        env:
          RAILS_ENV: test
          RAILS_MASTER_KEY: ${{ secrets.RAILS_MASTER_KEY }}
          MINITEST_JUNIT_REPORTER: yes
        run: |
          bundle exec parallel_test test \
            -n 1 \
            --serialize-stdout \
            --prefix-output-with-test-env-number \
            --combine-stderr \
            --verbose \
            --pattern '.*elasticsearch_integration_test.rb' || true
      - name: Run Rspec tests
        env:
          RAILS_ENV: test
          RAILS_MASTER_KEY: ${{ secrets.RAILS_MASTER_KEY }}
          MINITEST_JUNIT_REPORTER: yes
        run: |
          RAILS_ENV=test bundle exec rspec spec \
            --format RspecJunitFormatter \
            --out rspec/reports/TEST-rspec.xml || true
      - name: Upload Elasticsearch results
        uses: actions/upload-artifact@v3
        with:
          name: test_reports_elasticsearch
          path: test/reports
      - name: Upload Rspec results
        uses: actions/upload-artifact@v3
        with:
          name: test_reports_rspec
          path: rspec/reports

Below is the last job of the workflow, the one that combines the results from the previous jobs, generates the test report, runs the Danger gem, and posts a Slack message with the results. This is the most complex part of the workflow, but most of the complexity has to do with the construction of the Slack message, so if you are following our approach, yours could be simpler.

There are 3 points we spent a lot of time on here:

The first is the construction of the message to send to the Slack API. The message is a multiline string, and Github Actions supports multiline strings in workflow outputs (docs), but the syntax is a bit tricky (indentation too), especially if you have conditional logic like we do. So if you are building a Slack message similar to ours (see its final form below), feel free to grab our code and modify it.
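
For reference, this is the documented pattern for writing a multiline value to GITHUB_OUTPUT, stripped of our Slack specifics:

{
  echo "my_message<<EOF"
  echo "first line"
  echo "second line"
  echo "EOF"
} >> "$GITHUB_OUTPUT"

In the full workflow below we additionally generate a random delimiter, so that the message body can never collide with it.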

[Screenshot: the Slack message posted by the CI workflow]

One of the pieces of information in the above Slack message is the elapsed time. At the time of our experiment, there was no available action to get it, so we implemented it ourselves by calling the Github Actions API from within the workflow, for the current run. Check the `Compute elapsed time` step for that.

Finally, the `Failed tests` section lists the first 10 failed tests. Again, we implemented this ourselves with the following script, which is called from the `Getting failed tests` step:

require "hashie"
require "active_support/all"
require "rexml/document"

# Defining class so that zeitwerk doesn’t complain
module Ci
class WriteFailedTestsToFile
end
end

# To be able to parse larger XML files
REXML::Document.entity_expansion_text_limit = 1_024_000

failures = []
Dir.glob("test_reports/test/reports/*.xml").each do |f|
h = Hash.from_xml(File.read(f))
test_case = h.dig("testsuites", "testsuite", "testcase")
next if test_case.blank?
test_cases = test_case.is_a?(Array) ? test_case : [test_case]

failures += test_cases.select { |tc| tc["failure"].present? || tc["error"].present? }
&.select { |tc| tc["classname"].present? && tc["name"].present? }
&.map { |tc| "#{tc["classname"]}##{tc["name"]}" }
end

File.open("failed_tests.txt", "w") do |f|
failures.each_with_index do |failure, i|
break unless i < 10
f << "#{failure}\\n"
end
end

This is the final piece of the workflow:

# .github/workflows/ci_combine_and_report.yml
on:
  workflow_call:
jobs:
  combine_results:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Project
        uses: actions/checkout@v4
        with:
          # 100 for Danger to work
          fetch-depth: 100
          ref: ${{ github.event.pull_request.head.sha || github.event.after }}
      - name: Setup Ruby
        ...
      - name: Download artifacts
        uses: actions/download-artifact@v3
        with:
          path: artifacts
      - name: Decompress chunk test reports
        run: |
          find artifacts -name "test_reports*.zip" -exec unzip -d test_reports {} \;
          find test_reports -name "**/test_reports*.zip" -exec unzip -d test_reports {} \;
      - name: Configure AWS credentials
        ...
      - name: Merge parallel runtime log parts
        run: |
          cat artifacts/**/parallel_runtime_test*.log > parallel_runtime.log
      - name: Upload merged parallel tests runtime log to S3
        run: |
          aws s3 cp parallel_runtime.log s3://our-dev-bucket/uploads/ats-ci-log/parallel_runtime_test_github.log
      - name: Test Summary
        id: test_summary
        uses: test-summary/action@v2
        with:
          paths: |
            test_reports/**/TEST*.xml
            artifacts/test_reports_elasticsearch/*.xml
            artifacts/test_reports_rspec/*.xml
        if: always()
      - name: Get Commit Info
        id: commit_info_step
        run: |
          COMMIT_MESSAGE=$(git log --format=%B -n 1 ${{ github.event.after }} | head -1 )
          echo "commit_message=$COMMIT_MESSAGE" >> "$GITHUB_OUTPUT"
          SHORT_COMMIT_SHA=$(echo ${{ github.event.after }} | cut -c -8)
          echo "short_commit_sha=$SHORT_COMMIT_SHA" >> "$GITHUB_OUTPUT"
      - name: Getting failed tests
        # custom script to traverse the Junit XML files and find the failing tests,
        # constructing a file with their names to use them for our Slack message
        id: failed_tests_step
        run: |
          gem install hashie activesupport
          ruby lib/ci/write_failed_tests_to_file.rb
          cat failed_tests.txt

          FAILED_TESTS=$(cat failed_tests.txt)
          echo "failed_tests=$FAILED_TESTS" >> "$GITHUB_OUTPUT"
      - name: Set job status
        # In this step we set the status of the job. Normally in case of failures, the next steps fail, so we have to
        # use `if: always()` to make sure the next steps run.
        if: ${{ steps.test_summary.outputs.failed > 0 }}
        uses: actions/github-script@v6
        with:
          script: |
            core.setFailed('There are test failures')
      - name: Run Danger
        if: always() # run this step even if previous step failed
        run: bundle exec danger --dangerfile=danger/Dangerfile
        env:
          DANGER_GITHUB_API_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      - name: Compute elapsed time
        # Unfortunately, GitHub Actions does not provide a way to get the start time of a workflow run. So we have to get
        # the start time of the current job and calculate the difference. This is not 100% accurate, but it's close enough.
        id: elapsed_time_step
        if: always() # run this step even if previous step failed
        run: |
          start_time="$(curl -L -H "Accept: application/vnd.github+json" -H "Authorization: Bearer ${{ secrets.GITHUB_TOKEN }}" -H "X-GitHub-Api-Version: 2022-11-28" https://api.github.com/repos/workable/workable/actions/runs/${{ github.run_id }} | jq -r '.["run_started_at"]')"
          current_time="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

          start_time_epoch=$(date -d "${start_time}" +%s)
          end_time_epoch=$(date -d "${current_time}" +%s)
          elapsed_time_seconds="$((end_time_epoch-start_time_epoch))"
          elapsed_time_human=$(date -d @${elapsed_time_seconds} +"%M:%S" -u)

          echo "Diff in seconds: $elapsed_time_seconds"
          echo "Diff in human-readable: $elapsed_time_human"
          echo "elapsed_time_human=$elapsed_time_human" >> "$GITHUB_OUTPUT"
      - name: Prepare slack message
        if: always() # run this step even if previous step failed
        id: slack_message_step
        # Preparing our Slack message as multiline string
        # https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#multiline-strings
        run: |
          SLACK_MESSAGE="{"

          if ${{ github.event_name == 'pull_request' }}
          then
          SLACK_MESSAGE+="\"text\": \"ATS CI for PR <${{ github.event.pull_request.html_url }}|#${{ github.event.pull_request.number }}>: *${{ steps.test_summary.outputs.failed > 0 && 'Failure' || 'Success' }}*\","
          else
          SLACK_MESSAGE+="\"text\": \"ATS CI for \`master\`: *${{ steps.test_summary.outputs.failed > 0 && 'Failure' || 'Success' }}*\","
          fi

          SLACK_MESSAGE+=$(
          cat <<EOF
          "attachments": [
          {
          "mrkdwn_in": ["text"],
          "color": "${{steps.test_summary.outputs.failed == 0 && '#36a64f' || '#ecb22e'}}",
          "fields": [
          {
          "title": "Time",
          "value": "${{ steps.elapsed_time_step.outputs.elapsed_time_human }}",
          "short": true
          },
          {
          "title": "Branch",
          "value": "${{ github.event.pull_request.head.ref || 'master' }}",
          "short": true
          },
          {
          "title": "Commit",
          "value": "<https://github.com/Workable/workable/commit/${{ github.event.after }}|${{ steps.commit_info_step.outputs.short_commit_sha }}>",
          "short": true
          },
          {
          "title": "Author",
          "value": "${{ github.actor }}",
          "short": true
          },
          {
          "title": "Commit message",
          "value": "${{ steps.commit_info_step.outputs.commit_message }}",
          "short": false
          },
          {
          "title": "Summary",
          "value": "${{ steps.test_summary.outputs.failed }} of ${{ steps.test_summary.outputs.total }} tests failed (${{ steps.test_summary.outputs.passed }} passed, ${{ steps.test_summary.outputs.skipped }} skipped) - <https://github.com/Workable/workable/actions/runs/${{ github.run_id }}|Run>",
          "short": false
          }
          EOF
          )
          if ${{ steps.test_summary.outputs.failed > 0 }}
          then
          SLACK_MESSAGE+=$(
          cat <<EOF
          ,
          {
          "title": "Failed tests",
          "value": "\`\`\`${{ steps.failed_tests_step.outputs.failed_tests }}\`\`\`",
          "short": false
          }
          EOF
          )
          fi

          SLACK_MESSAGE+=$(
          cat <<EOF
          ]
          }
          ]
          }
          EOF
          )

          EOF=$(dd if=/dev/urandom bs=15 count=1 status=none | base64)
          echo "SLACK_MESSAGE<<$EOF" >> "$GITHUB_OUTPUT"
          echo "$SLACK_MESSAGE" >> "$GITHUB_OUTPUT"
          echo "$EOF" >> "$GITHUB_OUTPUT"
      - name: Notify Slack
        uses: slackapi/slack-github-action@v1.24.0
        if: ${{ always() }} # run this step even if previous step failed
        with:
          # Slack channel id, channel name, or user id to post message.
          # See also: https://api.slack.com/methods/chat.postMessage#channels
          # You can pass in multiple channels to post to by providing a comma-delimited list of channel IDs.
          channel-id: "${{ github.event_name == 'pull_request' && '#ats-bots-prs' || '#ats-bots-ci' }}"
          # For posting a simple plain text message
          payload: ${{ steps.slack_message_step.outputs.slack_message }}
        env:
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}

This is how it looks on Github:

[Screenshot: the workflow run as shown on Github]

Things to point out / discuss

Some interesting things we would like to point out:

  • The actions actions/upload-artifact@v3 and actions/download-artifact@v3 are known to be slow for many small files. We ended up zipping the partial test results and uploading the archive as a single artifact to speed things up.
  • We didn’t find a test report action as good as the report in Jenkins, which sorts your tests by duration and other criteria. We ended up using test-summary/action@v2.
  • For Danger to work correctly, a shallow clone of depth 1 won’t do; the open issues mostly suggest a depth of 100. This isn’t Github Actions specific; it was a problem we had in Jenkins too.
  • Rails now offers support for parallel testing, which would allow us to get rid of a dependency. However, we opted to continue using parallel_tests, because it has clear semantics for grouping by runtime and for running specific groups. More importantly, you may have noticed that the calculation of the groups does not take place once, at the beginning; it takes place in each of the unit test runners. This means that the calculation of the groups must be reproducible, so that every test appears in exactly one group. The algorithm parallel_tests implements meets this criterion (see the sketch after this list).
  • Parallel_tests doesn’t always form completely equal groups. When this happens, the longer-running groups end up delaying the whole pipeline.
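
To make the reproducible-grouping point above concrete, here is roughly what two of the unit test runners end up executing (flags trimmed for brevity); both compute the same 12 groups from the same runtime log, and each one runs only its own slice:

# runner 1
bundle exec parallel_test test -n 12 --group-by runtime \
  --runtime-log tmp/old_parallel_runtime.log --only-group 1,2,3

# runner 2
bundle exec parallel_test test -n 12 --group-by runtime \
  --runtime-log tmp/old_parallel_runtime.log --only-group 4,5,6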

Future

We are planning the following improvements for the future:

  • Some parts of our workflow are reusable, and we are considering releasing them as (open source) actions on the Actions Marketplace. In particular, the calculation of elapsed time and extraction of commit metadata.
  • The jobs in our workflow have several common and repeated parts. We will work on making them more DRY.

Epilogue

That’s it.

Despite the differences our CI may have compared to yours, we hope that this article has served as food for thought. Feel free to share and drop us a comment if you feel like it.