At any given time we have tens of experiments running on our site, constantly trying to build a better customer experience and focusing on differentiation. Running AB tests at scale while staying performant is key: nobody wants the flash of white that client-side JS experiment platforms produce before they evaluate a test.
Changes and new features to our site are usually deployed behind an AB test experiment. These produce a variant version of the page, allowing us to verify that real users appreciate the change and to quickly roll back if they don't.
Our experimentation platform has evolved from JavaScript executed in the browser to additionally using Rust on Fastly's Compute@Edge. We want to share with you the journey of building a service that assigns experiments with p95 response times under 1ms and lets us test at scale.
We built a Rust service called Kraken, running on Fastly's Compute@Edge distributed platform, to compute each user's experiment assignments. It is important to us that it adds no more than a negligible delay to page load time. In this blog post we explain how we reduced Kraken's p95 request time to under 1ms.
We treat experiments as a type of feature flag. Feature flags can also be set based on the domain, and overridden by a URL parameter.
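To make that concrete, here is a minimal sketch in Rust of how such a flag could be resolved. The precedence shown (URL parameter first, then experiment assignment, then domain default) is an assumption for illustration, not Kraken's exact logic.

```rust
/// A minimal sketch of feature flag resolution. The precedence
/// (URL override, then experiment assignment, then domain default)
/// is assumed for illustration.
fn resolve_flag(
    url_override: Option<bool>,
    experiment_assignment: Option<bool>,
    domain_default: Option<bool>,
) -> bool {
    url_override
        .or(experiment_assignment)
        .or(domain_default)
        .unwrap_or(false)
}

fn main() {
    // A URL parameter overrides everything else.
    assert!(resolve_flag(Some(true), Some(false), Some(false)));
    // Otherwise the experiment assignment applies.
    assert!(resolve_flag(None, Some(true), None));
}
```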
The problem with landing pages
loveholidays’ website is routed through Fastly’s edge distributed network. Landing pages are cached by Fastly for a period of time. Fastly’s quick response time improves the customer’s experience and improves our search engine ranking. The caching also reduces load on our website and saves us money on infrastructure costs.
Landing pages were previously cached using the URL as the key. Since everyone visiting the page received the same copy of the page, we were unable to run experiments on the landing pages.
We could have built an experiment aware caching service to send the correct cached versions of the landing page to users, but we’d be losing performance if this wasn’t running at the edge.
Enter Kraken and Compute@Edge
Today, we use the URL and the user's feature flag assignments as the cache key for landing pages. Kraken is our Compute@Edge microservice responsible for computing the assignments. The assignments key is a list of feature flags and their values for that user. Multiple users will be bucketed with the same assignments key, so new visitors still get the benefit of caching.
We went with Compute@Edge because our Varnish service runs in Fastly's network and needs to call Kraken without incurring large network overheads. We wrote Kraken in Rust because it is performant and was the most practical language available on Compute@Edge. We are a JavaScript (TypeScript) team, but we didn't consider JavaScript or evaluate its performance, as it wasn't available on the platform when Kraken was written.
We use the assignments key as part of the cache key, and we send it to the web server to guarantee the page is generated with the expected feature flags.
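For illustration, an assignments key can be thought of as a deterministic serialisation of flag/value pairs, so that users with identical assignments share a cache entry. The format below is hypothetical; Kraken's exact encoding isn't shown here.

```rust
/// A hypothetical assignments key: sorted flag=value pairs joined
/// into one string, so identical assignments produce identical keys.
fn assignments_key(mut assignments: Vec<(&str, &str)>) -> String {
    assignments.sort(); // deterministic regardless of input order
    assignments
        .iter()
        .map(|(flag, value)| format!("{flag}={value}"))
        .collect::<Vec<_>>()
        .join(",")
}

fn main() {
    let key = assignments_key(vec![
        ("new-header", "variant-b"),
        ("dark-mode", "control"),
    ]);
    assert_eq!(key, "dark-mode=control,new-header=variant-b");
}
```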
Measurements
In order to discuss speed we need to define what we’re measuring.
We start a timer at the beginning of Kraken’s main function, and we log it at the end.
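A minimal sketch of the approach using the Fastly Rust SDK; the handler and the log line are stand-ins (in Kraken the duration is shipped to our log pipeline rather than printed).

```rust
use std::time::Instant;
use fastly::{Error, Request, Response};

#[fastly::main]
fn main(req: Request) -> Result<Response, Error> {
    let start = Instant::now();
    let response = handle(req);
    // Stand-in for shipping the duration to our log pipeline.
    println!("duration_us={}", start.elapsed().as_micros());
    response
}

// Stand-in for Kraken's real assignment logic.
fn handle(_req: Request) -> Result<Response, Error> {
    Ok(Response::from_body("assignments"))
}
```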
Our times do not include the network overheads incurred when making an HTTPS request to Kraken. Nor do they include any overhead incurred before the main function itself is reached.
We capture the logs in Grafana Loki.
Timings
[Image: graphs of Kraken's response times and request counts]
The above graphs show a three-hour period in the last few days. The values are averaged over a 10 minute window. The top graph shows the response times of successful requests - we discard requests with missing data as they would skew the response times. The p95 is under 1ms. The p50 is about 0.5ms (~500μs). We also record the worst case outliers, the slowest of which was over 8ms. The second graph shows us receiving about 2,000 requests to the landing pages every 10 minutes.
We cannot explain our worst-case response times. We do not believe there is anything about Kraken that would explain some requests being an order of magnitude slower than the p95. But whilst curious, the worst case is still fast and affects only a tiny fraction of our requests.
The initial version of Kraken, the proof of concept, had a p50 of about 20ms. It wasn't running production workloads, so it is harder to get a stable graph or to be confident in a number. Slowing requests by 20ms was not acceptable. The p95 was even slower, and we were seeing timeouts against our self-imposed limits.
This is how we made it fast
Embed config into the application
In the beginning, Kraken would request two JSON files describing our feature flags from Google Cloud Storage on every request. GCS is fast, and these responses were cached by Fastly. Even so, these network requests were responsible for the majority of the run time.
Rust has a handy macro, include_str! (and the related macro include_bytes!). These embed files within your compiled application, which comes with a lot of advantages:
- Requesting a file at runtime can potentially error. By including it at compile time you remove one of the ways your application can fail.
- The file is instantly available. The file-reading code doesn't block or need to be async (not applicable to Compute@Edge).
- The Rust memory lifetime becomes 'static (the easiest lifetime to pass around).
- We can optimise the config that is built into the app (discussed more later).
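For example (the file paths here are illustrative):

```rust
// Embedded into the binary at compile time: no runtime fetch,
// no I/O failure mode, and a 'static lifetime.
static SITE_FEATURES_JSON: &str = include_str!("../config/site_features.json");
static EXPERIMENTS_JSON: &str = include_str!("../config/experiments.json");

fn main() {
    println!(
        "embedded {} bytes of experiment config",
        EXPERIMENTS_JSON.len()
    );
}
```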
The downside is that whenever the config changes we need to rebuild and redeploy Kraken. Fortunately, our CI pipeline automates this: a fetch-features GitHub Action runs every half hour and commits (if necessary) the latest versions of the JSON to the main branch. Pushing to main then triggers a second GitHub Action which runs sanity checks on the config before building and finally deploying the new version.
In the future, we may improve this further by using a webhook to trigger the workflow immediately after the config is updated.
Shortening our config
Profiling was telling us that parsing JSON was one of the slowest parts of Kraken. We use Serde to parse our config into Rust structs and enums. Serde isn't the fastest JSON parser, but it is by no means slow.
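For a flavour of what this looks like, here is a pared-down, hypothetical version of such structs; Kraken's real field names and shapes differ.

```rust
use serde::Deserialize;

// A hypothetical, pared-down shape of the experiments config.
#[derive(Debug, Deserialize)]
struct Experiment {
    id: String,
    status: Status,
    page_types: Vec<String>,
}

#[derive(Debug, Deserialize)]
#[serde(rename_all = "lowercase")]
enum Status {
    Live,
    Paused,
}

fn main() {
    let json = r#"[{ "id": "exp-1", "status": "live", "page_types": ["landing"] }]"#;
    let experiments: Vec<Experiment> = serde_json::from_str(json).unwrap();
    println!("{experiments:?}");
}
```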
Much of the config in the JSON files is historic and not applicable to computing landing page assignments. We changed the fetch-features GitHub Action to remove the irrelevant portions. We initially used the jq command to prune out the irrelevant config - but this quickly became too complicated. We switched instead to TypeScript and ts-node to do the pruning. We were tempted to use Deno because it’s great for scripts (native access to the fetch API!) - but decided against it based on team familiarity.
The speed up in the JSON parsing sections was approximately proportional to the reduction in the amount of data. Unfortunately, it goes both ways - when additional experiments are put live we see the runtime go up.
Today, 192 of the 223 experiments in the source config are irrelevant to Kraken, so we only include the 31 useful experiments.
Tightening our config
We removed from our config a lot of experiments and feature flags that were instantly skipped over. But we also had fields that were skipped over - for example, the created-at date of an experiment is skipped because we don't need it to compute experiment assignments.
The TypeScript scripts which add the config to the repo were modified to remove these fields.
In our case, we didn't measure a significant improvement from this change. The improvement was small enough that it could have been noise. It's possible that with a greater proportion removed, or with a different JSON parser, there would be a more noticeable improvement.
There are 5,363 lines of JSON in the experiments' source config, and only 802 in the version inside Kraken (though most of this reduction comes from dropping experiments rather than fields).
Literalizing our config
Since parsing of JSON stood out in the run time profile, we moved the parsing of the site feature flags to build time. We switched the site feature flags from an embedded JSON string to native Rust literals.
We can generate Rust source code using a script written in another language (TypeScript in Kraken's case). Kraken's generated Rust code is more efficient than the code it replaced because it does less work.
This technique may not work so well in interpreted languages, as interpreting the source code may be about as slow as parsing JSON.
Rust has build.rs scripts and macros for generating code. We chose not to use either because we wanted to have the generated code committed to the repo to make it easier to read, easier to debug and so that it can be tracked over time.
Our workflow followed this pattern:
- Write enough of the Rust code manually so we know what we're aiming for. Typically this would be the full functionality, but with only one or two cases.
- Write a script that:
  - Reads the input data.
  - Manipulates it as required.
  - Has functions to safely write dynamic data into Rust strings. The Rust reference has a useful page to help identify which characters need to be escaped in which situations.
  - Creates a giant string of the Rust code and writes it out to a file or stdout.
- Immediately format the new file. cargo fmt will format it (along with all the other files), or you can pipe it through rustfmt.
For Kraken’s site feature flags, we did more than convert from JSON syntax to Rust syntax. We identified that there are only 6 different possible inputs (including the default catch-all case) to the site-features function, and we could pre-compute the outputs at build time. This allowed us to replace all the runtime logic with a simple match (Rust switch expression) based lookup function.
Rather than partially sharing Kraken’s source code, we’ll illustrate with a similar change Chris made to a project he co-maintains. Within this PR, the structured data variables_list.txt is read and transformed by build_variables.py to produce the Rust code src/variable.rs.
Removing Regular Expressions
Regular expressions are a second programming language within your source code. In every language/library we're aware of, they're not compiled until runtime. In a typical web service this would be done once on first usage or when initialising the app, and subsequent uses would then be fast. Serverless functions do not have the option of amortising a one-off cost.
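For context, the usual amortisation pattern in a long-lived server looks something like this sketch (not Kraken's code). On Compute@Edge every request gets a fresh instance, so the Lazy initialisation, and with it the regex compilation, runs on every request.

```rust
use once_cell::sync::Lazy;
use regex::Regex;

// In a long-lived server this compiles once and is reused forever.
// On Compute@Edge the cost is paid on every request. The pattern
// itself is hypothetical.
static PAGE_TYPE: Lazy<Regex> =
    Lazy::new(|| Regex::new(r"^/(holidays|destinations)/").unwrap());

fn page_type(path: &str) -> Option<&str> {
    PAGE_TYPE
        .captures(path)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str())
}

fn main() {
    assert_eq!(page_type("/holidays/spain"), Some("holidays"));
}
```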
We removed two regular expressions and saw a performance improvement in both cases.
In the first case, we had a regular expression to work out the page type from the URL path. This was moved to Fastly's Varnish service and passed in as a request header (Varnish needed the value and had to do the work anyway).
In the second case we were using a regular expression to normalise host names (remove www and staging prefixes). This was trivial to rewrite using trim_start_matches.
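The replacement looks roughly like this (a sketch; the real prefix list and order may differ):

```rust
/// Normalise a host name by stripping known prefixes.
fn normalise_host(host: &str) -> &str {
    host.trim_start_matches("www.")
        .trim_start_matches("staging.")
}

fn main() {
    assert_eq!(normalise_host("www.loveholidays.com"), "loveholidays.com");
    assert_eq!(normalise_host("staging.loveholidays.com"), "loveholidays.com");
    assert_eq!(normalise_host("loveholidays.com"), "loveholidays.com");
}
```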
In both instances, we saw a measurable decrease in runtime.
Simplicity
Investing in simplicity is one of our core values. All the changes we’ve made to Kraken to improve performance have also simplified it. Complexity has been moved from the core app into GitHub Actions, which is simpler overall because the GitHub Actions commit their output to the codebase.
Additional Tips
Have performance metrics. They are useful for confirming changes are neutral or an improvement.
Have alerts for performance regressions. They are useful for spotting problems once you’re no longer regularly looking.
Have a technique for profiling locally. My preference is to locally log out times at various points and thus identify the slowest sections.
Have a goal in mind. Know when you're fast enough. Kraken's 95th percentile runtime is now under 1ms. Improving it further can shave at most 1ms off a typical page load, which is below human perception. However, our slowest requests still have room for improvement. Optimising runtime can also reduce the amount of hardware required or the infrastructure you're charged for.
Failed Improvements
Experiments JSON
We wanted to pre-parse the experiments into Rust literal objects, to avoid the runtime cost of parsing JSON. We’ve seen benefit from this technique with the site feature flags.
Our experiments will likely remain parsed from JSON. The experiments are processed using a library we share with a NodeJS app (Rust crates can be compiled to NPM modules using WASM), which passes the data in as JSON. This prevents changes that wouldn’t work in the NodeJS app.
We know that parsing the JSON is disproportionately slow, but we cannot easily profile further to see what aspects within the JSON are slow. We tried a couple of ideas (listed below), but they didn’t provide any speed improvement.
Using different data structures
Each experiment within our JSON file has a list of page types it is allowed to run on (so that our landing pages do not have too many running experiments, which would make them uncacheable). Typically this list contains a couple of entries; it could at most contain a dozen. We check it once per request to see if the current page type matches.
We currently use a BTreeSet, but we thought a Vec (array) may be more appropriate. A BTreeSet has O(log(n)) lookup time, which is better than a Vec's O(n) lookup time. But that doesn't help here: a BTreeSet has worse insertion time, we only do the lookup once per request, and big O notation isn't helpful at describing runtime involving small amounts of data.
We experimented with Vecs and HashSets to see if they would improve performance. If there was an improvement, it was so slight that we couldn't detect it in our benchmarks.
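A standalone sketch of why the choice barely matters at this scale:

```rust
use std::collections::BTreeSet;

fn main() {
    // A handful of page types per experiment: big O is irrelevant here.
    let as_set: BTreeSet<&str> = ["landing", "search"].into();
    let as_vec: Vec<&str> = vec!["landing", "search"];

    // Both lookups are effectively instant for collections this small,
    // and each is done only once per request.
    assert!(as_set.contains("landing"));
    assert!(as_vec.contains(&"landing"));
}
```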
Removing some heap allocations
We tried parsing the experiments into a slightly different data structure that would eliminate the need to allocate heap memory for strings in many cases. We weren’t able to measure any benefit from this change.
The following Rust program illustrates the concept being discussed. Beware that this section includes the concept of borrowing memory which may be unfamiliar to developers who haven’t used systems languages.
We represent our JSON using UTF-8, which conveniently is the same text encoding used by Rust strings. Serde is able to point a &str at a sub-slice within the JSON string. This prevents the contents of the string from being duplicated in memory. If we switch our Experiment structs to represent strings as &str instead of String then we prevent lots of memory allocations.
JSON strings use backslash as an escape character, whereas Rust strings do not use escape sequences in memory (though they do in source code). This prevents Serde from pointing a &str at a JSON string that uses escape sequences; trying to parse such a JSON string into a &str produces an error.
Cow (Copy On Write) can be our escape hatch, giving us good performance in most cases, but allocating memory when required.
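A minimal sketch of the idea: borrow a &str from the JSON where possible, and fall back to an owned String via Cow only when escape sequences force it. The struct and field names are illustrative; Kraken's real Experiment struct has more fields.

```rust
use serde::Deserialize;
use std::borrow::Cow;

#[derive(Deserialize)]
struct Experiment<'a> {
    // Borrows from the JSON input when the string has no escapes;
    // allocates an owned String only when it does.
    #[serde(borrow)]
    id: Cow<'a, str>,
    #[serde(borrow)]
    name: Cow<'a, str>,
}

fn main() {
    let json = r#"{ "id": "exp-123", "name": "A \"quoted\" name" }"#;
    let experiment: Experiment = serde_json::from_str(json).unwrap();
    // "exp-123" is borrowed from `json`; the escaped name is owned.
    assert!(matches!(experiment.id, Cow::Borrowed(_)));
    assert!(matches!(experiment.name, Cow::Owned(_)));
}
```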
Changing our strings in our experiment data structures to Cow<'a, str> didn’t give a measurable improvement in the benchmarks so we didn’t push ahead with the change.
What’s still to be done
Measure in CI
We should record and test speed as part of our CI runs. The fastly compute serve command was added in July, which could allow us to measure and record performance improvements using GitHub Actions before going live.
Conclusion
In a serverless function where instances are not reused between requests, everything that can be done at build time should be done at build time to avoid slowing requests. This requires taking approaches that would be unusual in a typical server application.
Using these approaches allowed us to achieve our goal of computing experiment assignments in under 1 millisecond.
See also
David Annez presenting how loveholidays uses Fastly, with Compute@Edge and Kraken covered specifically from 23:18.