I built a small project, cloudprice.net, more than five years ago as a tool to help myself while I worked at Microsoft. It was a simple cheat-sheet page listing all the virtual machines available on Azure, with the ability to quickly find a machine via search or filters. Five years ago the Azure calculator and the Azure website weren't great at that task, and to be honest they still aren't.
Below I describe how, a few weeks ago, I optimized the project's main database by migrating from CosmosDB to ClickHouse: queries got around 7x faster, cost dropped 6x, and storage shrank 44x, all without any schema changes (crazy, right?).
With this article I want to highlight the importance of using the right technology for the right workload: evaluate technologies on their merits, not because something is fancy and hyped, and not just because it's what you are used to. Try to analyze how your project will evolve in 1–2 years and what you will need from the technology then.
Idea behind cloudprice.net
The initial idea was quite straightforward. The project was super simple: show specs for each virtual machine. At that time, there were only 50 different VM types on Azure.
The project was written in C#/.NET, hosted on Azure, and stored all its data in Azure CosmosDB, which had just been introduced, roughly five years ago, as a competitor to AWS DynamoDB. Yep, there were better choices for the database, especially given how the project evolved over those five years.
Currently, the website is more of an analytical tool that helps users choose the right VM size, compare technical specifications, analyze pricing across regions, get machine-learning-based recommendations, and so on.
It's a simple data pipeline based on Azure Functions that runs each day and re-uploads all the data, including pricing and specifications, from the official Microsoft Azure API.
I join a few thousand JSON objects together and save the result as JSON documents in Azure CosmosDB in a denormalized way. Each document is a mix of pricing data and VM specifications, which gave me relatively fast reads because I didn't need to run any joins between pricing data and VM specifications.
This is suboptimal from a storage perspective, because the same VM specification properties are stored for every pricing point, but quite efficient from a speed perspective.
The main idea was to provide the fastest tool for viewing this information, and that goal is still valid today.
Speed and Cost start hurting
But time flies, and soon the 50 VM types became 608, and the 10 properties per VM grew to 30.
Users started asking for more functionality and aggregated analytics: the cheapest price across all regions for a particular VM, average price comparisons between regions, price history.
All of these things are pretty hard to implement on top of a document database, so in most cases I created additional Azure Functions that pre-calculated the results daily.
A lot of that code spent its time connecting to the database, fetching data, serializing/deserializing documents, calculating metrics, and storing the results in S3. Such a waste of energy and of my time, first to write it and then to keep supporting it down the line.
When you run a project and pay for it out of your own pocket, you notice unoptimized costs immediately, everywhere. Your inner voice starts shouting at you every time you see them.
Azure CosmosDB is a serverless database that charges based on how much data you store and how much you insert and query.
The last two components grew linearly with the popularity of my website, which went from 2k to 30k users per month. But most of the cost came from inserts/upserts, because I upload around 4.5M rows daily, which cost about $20–30 per upload.
The monthly bill rose from $50 to $300.
Earning around $50–100 of profit per month at that time, I started to worry more and more.
CosmosDB has a baseline throughput option, but it caps the throughput. I tried scaling it up automatically before each upload and back down afterwards, but it was still a pain, and I got a lot of errors from hitting the throughput limit.
It became clear that I needed to change the database to something more appropriate for my workload.
A New Hope
Around a year ago I came across ClickHouse, a column-oriented database whose main goal is speed, the same goal my project has.
It was initially developed at Yandex, open-sourced a few years ago, and has around 25k+ ⭐ on GitHub with strong community support and 100k+ commits.
A few months ago, I spent around half a day analyzing how my project could use ClickHouse as its main database. My main requirements were:
- Fast queries to get the list of virtual machines for each Azure region (500–600 documents) and to calculate aggregates over the whole dataset, regardless of its size.
- Stop the cost growth I was seeing with CosmosDB, or ideally reduce the cost.
- Easy deployment and the ability to work with .NET Core.
I deployed ClickHouse dev and test environments at https://double.cloud/ for our small team to see how it would handle our workload and queries, to avoid spending time on configuring SSL, backups, and security, and to get out-of-the-box monitoring tools.
I took the smallest instance of 2 vCPUs and 4GB of RAM with a 32GB gp2 disk just to check how it would work.
I uploaded the data, around 4.5M rows, keeping almost the same denormalized schema.
The only two things I needed to solve were adjusting the data types and finding a good .NET Core library to work with ClickHouse.
The schema is super simple, and it was pretty straightforward to pick column types. I just applied LowCardinality to some properties, which stores them as dictionary-encoded values and speeds up queries.
CREATE TABLE default.price
(
`name` LowCardinality(String),
`CPUdesc` LowCardinality(String),
`CpuArchitecture` LowCardinality(String),
….
`bestPriceRegion` String,
`bestSecondPriceRegion` String,
`modifiedDate` DateTime,
`_upload` Int32
)
ENGINE = MergeTree
PARTITION BY _upload
ORDER BY (regionId, Currency, name, tier, paymentType)

Regarding the .NET Core library, I found the most popular ones:
- ClickHouse.Ado ⭐210, Nuget downloads: 496.3K
- ClickHouse.Client ⭐175, Nuget downloads: 272.0K
- ClickHouse.Net ⭐54, Nuget downloads: 121.8K
Finally, I chose ClickHouse.Client because it has a more straightforward API for bulk upload.
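For context, and independently of the client library, a bulk upload in ClickHouse boils down to sending batches of rows in one of its input formats. A hand-written sketch of a single row, using only the columns shown in the schema above and made-up values, would look like this (omitted columns simply fall back to their defaults):

INSERT INTO default.price FORMAT JSONEachRow
{"name": "Standard_D2s_v3", "CPUdesc": "Intel Xeon Platinum 8272CL", "CpuArchitecture": "x64", "bestPriceRegion": "westus2", "bestSecondPriceRegion": "northeurope", "modifiedDate": "2023-01-01 00:00:00", "_upload": 20230101}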
Benchmark results
Storage
First of all, I noticed the compression; it's a real hidden gem of ClickHouse. My CosmosDB dataset with one day's upload of data was 3.57 GB.
In ClickHouse, the same number of documents/rows takes 80 MB on disk; uncompressed it is around 640 MB, which is an 8x compression ratio.
It looks like CosmosDB keeps several copies of the data under the hood and indexes every field, which is what leads to the 3.5 GB. Compared to ClickHouse, the difference is 44x.
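If you want to verify the on-disk footprint yourself, ClickHouse exposes the numbers in the system.parts table; a query along these lines (for my price table) shows the compressed size, the uncompressed size, and the ratio:

SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts
WHERE active AND table = 'price'
GROUP BY table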
It's certainly not fair to compare highly available CosmosDB storage to a single-node EBS disk. However, even if I ran ClickHouse in a high-availability configuration (3 nodes), the difference would still be around 15x.
And I haven't even mentioned ClickHouse's hybrid S3 storage mode, which allows zero-copy data storage.
Speed of Queries
I’m storing each upload of data in a dedicated partition, which gives me a few advantages in speed and de-duplication. I’m planning to cover these things in Part 2 of the article.
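To give a taste of why a partition per upload is handy before Part 2 arrives: an entire upload can be dropped or replaced as one cheap metadata operation instead of row-by-row deletes. A minimal example, where the partition value is just an illustrative _upload id:

-- remove one whole upload (a single partition) in one metadata operation,
-- e.g. before re-ingesting a failed day; 20230101 is a made-up _upload value
ALTER TABLE default.price DROP PARTITION 20230101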
Most of my queries look like this:
SELECT
*
FROM
price c
WHERE
c.Currency = 'EUR'
AND c.regionId = 'westus2'
AND c.tier = '0'
AND c.paymentType = '0'

I just need filters in WHERE statements and sometimes a GROUP BY. That's almost the same query I used in CosmosDB, and the speed difference was practically the same as for other similar queries.
Note:
- CosmosDB has a maximum number of items per page, and in my scenario the documents did not fit on one page, which roughly doubled the time of a single query.
- I did not compare queries where I need to calculate averages or do heavier computations on rows, because CosmosDB is not a good fit for such workloads.
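In ClickHouse those aggregations become ordinary single-pass queries. As an illustration rather than my exact production query, the cheapest region per VM could be computed like this; the price column and the _upload value are hypothetical, since they are not part of the schema excerpt above:

-- 'price' column and the _upload value are hypothetical, for illustration only
SELECT
    name,
    argMin(regionId, price) AS cheapestRegion, -- region where the minimal price occurs
    min(price) AS cheapestPrice
FROM default.price
WHERE Currency = 'EUR'
  AND tier = '0'
  AND paymentType = '0'
  AND _upload = 20230101 -- limit to a single upload/partition
GROUP BY name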
Bonus: ARM64 vs AMD benchmark
I also ran tests comparing an ARM64 instance, Standard D2pls v5 (2 vCPUs, 4 GiB memory), against an AMD instance, c5a.large (2 vCPUs, 4 GiB memory), i.e. the same vCPU and RAM characteristics.
Summary
The migration took me around three full days of work, including learning ClickHouse specifics and column types and migrating to the new database's SDK.
What perks did I get?
- No more constant cost growth: my current 2-vCPU instance handles the workload, and even during heavy ingestion it uses only around 5–10% of CPU, and similarly little memory. The database cost shrank from $300 to $50 per month, 6x!
- I unlocked, and have already implemented, price history: I started storing it because I no longer have limitations from a cost or speed perspective.
- The API backend got faster, from 200–400 ms to around 50 ms, roughly 7x.
I hope others will learn from my mistakes and start using the right technologies for the right workloads.
P.S. Part 2 of this article is now published; it covers how I battled duplicates in ClickHouse and some quirks and rough edges I ran into.