Moving from DynamoDB to tiered storage with MySQL+S3

Shane Hender

Originally we implemented a feature to persist an event stream into DynamoDB so that customers could retrieve those events later. This proved effective, since it’s a strong use case for key/value storage, but the drawback was its high cost. Moving to provisioned billing mode reduced the cost by ~50%, yet even that was not going to be sustainable as we scaled to more customers. We also kept multiplying the cost each time we wanted to allow querying on another field in the logs, since each one needed its own Global Secondary Index.

The architecture is in the diagram below… fairly straightforward :)

[Figure: Data flow with DynamoDB]

Approach

Eventually we decided on a hybrid approach to get the best of both worlds: write the logs from Kafka into MySQL, then periodically upload that data to S3 so it can be deleted from MySQL before the DB is overwhelmed. The advantages of this solution were:

  • Cheaper than DynamoDB (S3 is dirt cheap and Aurora MySQL costs don’t scale as badly as DynamoDB with data growth)
  • Easy to query the data (really easy in MySQL, but S3-Select is good too)
  • Reasonably performant to serve up API requests for data

For a bit more of an in-depth description, here’s the workflow (a code sketch of the upload job follows this list):

  • We pick up the logs from Kafka in JSON format.
  • These are written to the buffer table in MySQL.
  • Every hour a background job kicks off and reads from the metadata table to find the last row we uploaded to S3.
  • This job reads batches of 10,000 logs from the buffer table and uploads each batch to an S3 file.
  • The job also writes an entry back to the metadata table for each S3 file uploaded, storing the S3 path, the number of logs in the file, its last log ID and the last timestamp stored in that file.
CREATE TABLE `metadata` (
  `id` binary(16) NOT NULL,
  `end_time_ms` bigint(20) DEFAULT NULL,  # last timestamp this file contains
  `file_path` varchar(255) DEFAULT NULL,  # path of S3 file
  `log_count` int(11) DEFAULT NULL,       # count of the logs this S3 file contains
  `last_log_id` binary(16) NOT NULL       # ULID of last log we processed
);
  • This continues until all rows in the buffer table have been uploaded, except the last five minutes of data, to prevent late Kafka deliveries from causing ordering issues.
  • Another background job also kicks off every hour and deletes data in the DB that’s older than 4 hours to keep the buffer table’s size under control.
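
To make the moving parts concrete, here’s a minimal sketch of what the hourly upload job could look like. This is illustrative rather than our actual implementation: the buffer table columns (id, timestamp_ms, payload), the bucket layout and the thin db helper are all assumptions.

import uuid
import boto3

BATCH_SIZE = 10_000
SAFETY_WINDOW_MS = 5 * 60 * 1000  # leave the trailing five minutes in the buffer

s3 = boto3.client("s3")

def run_upload_job(db, bucket, now_ms):
    # db is an assumed thin MySQL wrapper exposing fetch_one/fetch_all/execute.
    row = db.fetch_one("SELECT last_log_id FROM metadata ORDER BY end_time_ms DESC LIMIT 1")
    last_log_id = row["last_log_id"] if row else b"\x00" * 16  # ULIDs sort lexicographically

    while True:
        logs = db.fetch_all(
            "SELECT id, timestamp_ms, payload FROM buffer"
            " WHERE id > %s AND timestamp_ms < %s ORDER BY id LIMIT %s",
            (last_log_id, now_ms - SAFETY_WINDOW_MS, BATCH_SIZE),
        )
        if not logs:
            break  # everything old enough has been uploaded

        # One JSON-lines object per batch of up to 10,000 logs; payload already holds the JSON log.
        key = f"logs/{logs[-1]['id'].hex()}.json"
        body = "\n".join(log["payload"] for log in logs)
        s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))

        # Record the file so the query path can find it later.
        db.execute(
            "INSERT INTO metadata (id, end_time_ms, file_path, log_count, last_log_id)"
            " VALUES (%s, %s, %s, %s, %s)",
            (uuid.uuid4().bytes, logs[-1]["timestamp_ms"], key, len(logs), logs[-1]["id"]),
        )
        last_log_id = logs[-1]["id"]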

Querying

So the data pipeline is all set up, but does it actually give us the data back in a timely fashion when the client makes an API request for it?

[Graph: query latency when serving API requests, Aurora (yellow) vs S3-Select (purple)]

The above graph shows the relative performance of Aurora (yellow) vs S3-Select (purple) when querying data. S3-Select can be surprisingly fast (keep in mind we batched by only 10,000 items per file), and extremely cheap!

To give a clearer picture of the query path (there’s a sketch of the S3-Select call after this list):

  • Based on the time window specified, we find the relevant S3 files from our metadata entries in the MySQL DB.
  • If no other filters are specified (e.g. user-id, url, …), then we just select enough S3 files to satisfy the number of results we want by totaling the log_count column of the metadata rows.
  • After determining which S3 files we need to query, we create an S3-Select query like the below, where timestamp is just a field in the stored JSON in the file:
SELECT * FROM S3Object s WHERE s."timestamp" >= '%s' AND s."timestamp" < '%s' LIMIT 100
  • Those S3-Select queries are done in parallel.
  • If we can’t get enough results from S3, we fall back to reading more logs from our MySQL buffer table.
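
As a rough sketch of what one of those S3-Select calls looks like with boto3 (assuming the files are stored as JSON lines; the bucket and key come from the metadata row):

import boto3

s3 = boto3.client("s3")

def select_logs(bucket, key, start_ts, end_ts, limit=100):
    # The same time-window query as above, run against a single S3 file.
    expression = (
        "SELECT * FROM S3Object s "
        f"WHERE s.\"timestamp\" >= '{start_ts}' AND s.\"timestamp\" < '{end_ts}' "
        f"LIMIT {limit}"
    )
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        InputSerialization={"JSON": {"Type": "LINES"}},
        OutputSerialization={"JSON": {}},
    )
    # Results stream back as events; collect the record payloads.
    chunks = []
    for event in resp["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"].decode("utf-8"))
    return "".join(chunks).splitlines()

Each file’s query is independent, so fanning these calls out with a thread pool (or your language’s equivalent) gives the parallelism mentioned above.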

Optimizing

Once we had all of the above working, we encountered a performance problem when the client wanted to filter results by any field besides the timestamp. For example, if the client wanted logs for a specific user-id, then in the worst case we’d have to scan ALL the S3 data within the time range to find any relevant logs, which also made it hard to figure out which queries to run in parallel.

[Figure: Having to potentially search all S3 files is painful]

So given that we can’t hold up an API response for hours, we attempted to reduce the number of S3 files that we need to scan.

Our first iteration was to simply duplicate the data in S3, given that storage was so cheap. We constrained which fields the client could filter by, then duplicated data and metadata for each of them, e.g. for the user-id field (a short sketch of the grouping step follows this list):

  • Create the normal S3 file that has all 10,000 logs from the buffer table batch.
  • Then split those same logs into their own S3 files based on user-id, i.e. one file for user-id 1, another for user-id 2, …
  • Create a metadata record for each of these files, with a file_type column indicating whether it holds logs for a specific field value or all logs.
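
For illustration, the per-field duplication amounted to a grouping step like this before upload (split_by_field, upload_logs_to_s3 and batch_id are hypothetical names, not our actual code):

from collections import defaultdict

def split_by_field(logs, field):
    # Group a batch of logs into one list per distinct value of the given field.
    groups = defaultdict(list)
    for log in logs:
        groups[log.get(field)].append(log)
    return groups

# e.g. one extra S3 file per user-id seen in the batch:
# for user_id, user_logs in split_by_field(batch, "user_id").items():
#     upload_logs_to_s3(f"logs/by-user-id/{user_id}/{batch_id}.json", user_logs)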

This quickly became cumbersome, as we would have to create an S3 file not only for each filter-able field, but also for every combination of filters, e.g.:

  • 1 file for user-id specific logs
  • 1 file for url specific logs
  • 1 file for combination of user-id and url logs

Enter Bloom Filters. Bloom filters have the nice property that we can ask the bloom filter attached to a metadata record, “is this <user-id> definitely not in this S3 file?” If the answer is yes, we know for sure that we can skip scanning that file. If the answer is no, then due to the nature of bloom filters the user-id may or may not be in the file, so we still have to scan it.
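
To make that property concrete, here’s a tiny hand-rolled bloom filter (illustrative only, not the implementation we used):

import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from independent hashes of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False -> item is definitely NOT in the file: safe to skip scanning it.
        # True  -> item may be in the file: we still have to scan it.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))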

So this was great: we had a way to cut down the S3 files to scan, saving both money and time. But we still had the problem of how many files to scan. If the client doesn’t specify a filter, we have the simple task of scanning just enough S3 files, found by totaling the log_count column. But if 100 S3 files each contained some logs for the specified user-id, we had to scan them all, just in case each file held only one relevant log, to fulfill a request for 100 results.

Enter Count-Min Sketch. This data structure behaves much like the bloom filters above, but instead of answering the binary question of whether an item exists, it gives an estimated count of how many times that item occurred.
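
Conceptually it’s just a few rows of counters indexed by independent hashes; the estimate is the minimum across rows, which can over-count on collisions but never under-counts (again a hand-rolled illustration, not our production code):

import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # The minimum across rows is the least-inflated counter for this item.
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))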

So with these two data structures combined, we can now support queries for user-id with the following workflow (sketched in code after the list):

  • Read the relevant metadata rows that match the specified time range.
  • For each metadata row, read the corresponding bloom filters and count-min sketches that match the filters, e.g. user-id, based on the schema below.
CREATE TABLE `metadata_bloom_filters` (
  `id` binary(16) NOT NULL,
  `metadata_id` binary(16) NOT NULL,
  `name` varchar(255) DEFAULT NULL,  # name of field in log, e.g. "user_id"
  `bloom_filter` text,               # serialized bloom filter
  `count_min_sketch` text            # serialized count-min sketch
);
  • If the bloom filter says that the user-id might be in the S3 file, we add it to the list of files to scan.
  • We also keep a running estimate of matching logs from the count-min sketch, so we can stop adding files once we expect to fill the response page.
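
Putting the two structures together, the file-selection step looks roughly like this (the db wrapper and the deserialize helpers are assumptions; BloomFilter and CountMinSketch are the structures sketched earlier):

def files_to_scan(db, start_ms, end_ms, field, value, page_size=100):
    # Pick the smallest set of S3 files that might satisfy the request.
    rows = db.fetch_all(
        "SELECT m.file_path, b.bloom_filter, b.count_min_sketch"
        " FROM metadata m JOIN metadata_bloom_filters b ON b.metadata_id = m.id"
        " WHERE m.end_time_ms BETWEEN %s AND %s AND b.name = %s"
        " ORDER BY m.end_time_ms",
        (start_ms, end_ms, field),
    )

    selected, estimated_hits = [], 0
    for row in rows:
        bloom = deserialize_bloom(row["bloom_filter"])        # assumed helper
        sketch = deserialize_sketch(row["count_min_sketch"])  # assumed helper
        if not bloom.might_contain(value):
            continue  # definitely not in this file, skip the S3-Select entirely
        selected.append(row["file_path"])
        estimated_hits += sketch.estimate(value)
        if estimated_hits >= page_size:
            break  # probably enough matches to fill the page
    return selected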

We can also now have 1 bloom/count-min sketch structure for each field, and when we want to filter by multiple fields we just have to check existence in their corresponding bloom filters.

Now we have a best-effort way to minimize the number of S3 files to scan. However, in the worst case we may get unlucky, hit false positives on every bloom filter, and end up returning an empty result to the client.

But since we use Cursor-Based Pagination for APIs at Zendesk, we give the caller a has_more field in the response to indicate that they should request the next page (with a cursor based on the last ULID in the last S3 file scanned) in case there are more hits in subsequent S3 files. We eventually set has_more=false when there are no more S3 files to scan in the time range.
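
In response terms that looks something like the envelope below (the shape is illustrative, not our exact API):

def build_page(results, last_log_ulid, more_files_remaining):
    # Cursor-based pagination envelope: the caller passes after_cursor back to get the next page.
    return {
        "logs": results,
        "meta": {
            "has_more": more_files_remaining,  # true while unscanned files remain in the window
            "after_cursor": last_log_ulid,     # ULID of the last log in the last scanned S3 file
        },
    }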

And there you have it. I’m personally impressed by the response time of S3-Select, which I originally expected to be in the seconds. Most of the time it’s around 200–500ms, with some spikes up into the seconds that we have yet to optimize (possibly by playing with the number of logs per S3 file).

When we first migrated from Dynamo, our costs were <20% of the provisioned Dynamo costs, with less than 10% of that going to the combined usage of S3 storage and S3-Select and the remaining 90+% to Aurora. A huge win against the initial per-customer costs with DynamoDB.

Thanks to

for working on this project with me! This was definitely a journey 😅.

Also thanks to

and for helping edit this post! 🙇