A real-time logging solution that allows you to capture all the Fastly CDN logs for less than a cup of coffee a day? Yes, including a super hipster oat milk flat white (or whatever hipsters drink these days) in a super hipster coffee house. Given the amount of Fastly logs we process (more than 30 gigabytes per day), it is hard to believe, but let me introduce you to Grafana Loki. Before we get to the implementation, though, I would like to tell the story of our logging journey.
The story
What is the challenge?
Fastly is a content delivery network (CDN), and if that does not mean anything to you, you should probably start with the official Fastly documentation. It serves millions of requests for loveholidays daily. Fastly is the first thing our customers talk to, and occasionally it is the only thing they talk to, as entire pages are served straight from the CDN. Fastly therefore receives many more requests per user interaction than our side of the infrastructure does, which means it also produces a lot more logs. Finding infrastructure that can deal with CDN-level logging can be quite challenging, simply because CDNs are extremely busy.
Before Loki era
We have been capturing all the relevant Fastly logs and storing them in BigQuery for a couple of years now, and we have tens of terabytes of logs there. As we use GCP, BigQuery was super easy to integrate, and storing and analysing logs in it is fairly cost-efficient. The most beneficial thing about BigQuery is that it allows us to examine “how are we doing?” retrospectively. Questions like “how many users got a bad response?”, “how many users landed on certain pages?” or “where are these users from?” help us drive business KPIs and grasp specifics such as load times and user device versions. These are questions we might ask the next day, next week or next year, and compare year on year or month on month. Something like Kibana or Loki is not particularly suitable for this purpose, because you need to store and compare a great number of logs; in reality, none of the logging solutions is truly made for it. BigQuery is ideal for this kind of task: it is literally the big data analytical database.
Why is BigQuery not enough?
However, if you are developing services, making changes to them, or dealing with technical problems or outages, BigQuery is not the most useful or convenient tool for figuring out what is actually happening with your systems in real time. To get to the bottom of an issue you have to write complicated SQL queries and re-run them continuously. The most annoying disadvantages of BigQuery in our case are:
- filtering is rather slow once you factor in the effort of writing SQL queries;
- the query output is not visual (you do not get fancy graphs);
- the results of your investigative work are hard to share with other people;
- it can be quite costly (especially if you are not using BigQuery Slots, which we are).
We want to keep using BigQuery, because we want to compare long-term improvements to our infrastructure and find out what sort of experience our users get. However, we needed a true logging solution, one meant for quick and effortless data retrieval, so we can understand which problems we are facing right now; a solution that is good for real-time logging and recent events, when you do not need to compare this month against last year.
Why not ELK?
Seemingly the most common solution is ELK. Doubtless, ELK is well established. But if you are on the infrastructure side of things and do not want to maintain ELK, you might not be so keen on it. ELK needs a lot of CPU and RAM, it is Java based (which is not a fault in itself, rather a matter of opinion), it is not very flexible when it comes to scaling with demand (ironically, it is not very elastic), and it is pricey. We were also worried about sending it a volume of logs as high as our CDN's, which sometimes withstands DDoS attacks and gets absolutely smashed with traffic. We knew that our ELK cluster (24GB of RAM) would not handle it, meaning that other applications, which typically log into it at a much lower scale, would also be impacted. There would always be a risk of simply melting our ELK, leaving us unable to see any logs at all. And we did not want to pay double what we are currently paying for ELK.
Why not SaaS?
The market of SaaS logging solutions is plentiful. These solutions are solid, often expensive and yet another tool in our already extensive stack of monitoring tools. So we asked ourselves whether we could use Grafana Loki for ingesting external HTTP logs. We had already been using Loki for ingesting Kubernetes logs before we were faced with this challenge. We asked the Grafana community for advice and the answer was not definitive: there was no ready-made solution. Although there were some signs that it was possible, the path was fairly unclear. Being very keen on Loki and eager to find another suitable use case for it, we decided to go ahead and give it a try.
So, why Loki?
Loki is written in Golang, it has low memory overhead, it does not maintain in-memory indexes, and it uses GCS for storing data. On top of that, we wanted to reduce the number of monitoring tools we run: Loki complements our existing monitoring solutions, Grafana and Prometheus, so it felt like an organic continuation of what we were already using and aligned with our goal of simplifying our monitoring stack.
What is the actual config?
Our Loki set-up for Kubernetes logs includes Promtail, and that configuration successfully captures all of our clusters' logs. Surprisingly, it did not work in our case with Fastly. We wasted a lot of time attempting to understand why, and eventually decided to step away from Grafana's Promtail and use something proven, something that has been around for a while and is part of GKE: enter Fluent Bit. Luckily for us, Fluent Bit has an input plugin for HTTP, and it is well integrated with Loki. We tried it and it worked really well. Additionally, we were able to add extra labels to the data coming in through Fluent Bit, which is great for the usability and observability of the logs.
Implementation
So, if you are geeky enough to be interested in the technical details and/or you want to try it out, and assuming that you have Loki running already (if not, check out the Grafana docs on how to install it), here are the ingredients of this recipe for success:
1. If you are planning to use an existing Ingress (if not, your first task will be to create one), the first step is to create a new DNS record. A DNS record would not be compulsory if you were using a TCP load balancer, but we like the clarity and simplicity of the HTTP load balancer (Ingress), and that approach requires a DNS record.
2. As we dwell in the realm of GKE, the next step is to create all the necessary K8s resources: a Fluent Bit Deployment, Service and ConfigMap. Following security best practices, GKE users can also add a BackendConfig to configure Cloud Armor and restrict access to Fastly's IP ranges, as sketched below.
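A minimal sketch of what that could look like, assuming a Cloud Armor policy called allow-fastly-ip-ranges already exists (all names here are placeholders, not our exact manifests):

```yaml
# Hypothetical BackendConfig attaching a Cloud Armor policy to the Fluent Bit backend
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: fluent-bit-backendconfig
spec:
  securityPolicy:
    # Cloud Armor policy that only allows Fastly's published IP ranges
    name: allow-fastly-ip-ranges
---
# The Fluent Bit Service references the BackendConfig via an annotation
apiVersion: v1
kind: Service
metadata:
  name: fluent-bit
  annotations:
    cloud.google.com/backend-config: '{"default": "fluent-bit-backendconfig"}'
    cloud.google.com/neg: '{"ingress": true}'  # container-native load balancing
spec:
  selector:
    app: fluent-bit
  ports:
    - name: http
      port: 8080
      targetPort: 8080
```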
At the heart of the set-up is the Fluent Bit ConfigMap. The most important parts of it are the [INPUT] section and the labels: without [INPUT], Fluent Bit cannot accept any HTTP traffic from the Ingress.
Once we send these logs to Loki using Fluent Bit, we want to make them easily discoverable by assigning them a unique label, which is handy when we are searching for the relevant logs in Grafana. In our case, we call it {job="fastly-cdn"}.
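As a rough illustration, a Fluent Bit ConfigMap along these lines accepts JSON over HTTP and pushes it to Loki with that label (the Loki host and port below are assumptions and depend on where your Loki runs):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [INPUT]
        # Accept the JSON log lines that Fastly POSTs via the Ingress
        Name    http
        Listen  0.0.0.0
        Port    8080

    [OUTPUT]
        # Fluent Bit's built-in Loki output plugin
        Name    loki
        Match   *
        Host    loki.monitoring.svc.cluster.local
        Port    3100
        Labels  job=fastly-cdn
```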
3. You have to set up an nginx server for domain ownership verification, which is a Fastly requirement; you can read more about it in Fastly's documentation. For this you will need an nginx Deployment, Service and ConfigMap.
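For illustration, the nginx ConfigMap can be as small as a single location block serving Fastly's challenge path; the digest below is a placeholder for the SHA-256 hash of your Fastly service ID:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-fastly-challenge
data:
  default.conf: |
    server {
      listen 80;

      # Fastly fetches this well-known path to verify domain ownership.
      # The body must list the SHA-256 hex digest of every Fastly service ID
      # allowed to stream logs to this domain (or "*" to allow any service).
      location /.well-known/fastly/logging/challenge {
        default_type text/plain;
        return 200 "<sha256-hash-of-your-service-id>\n";
      }
    }
```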
4. Yet another central piece is the Ingress. We used our already existing one, but you can create a brand new one, as mentioned above in step 1.
The Ingress defines where to send the data from Fastly. There are two paths: we ship the logs to the Fluent Bit Service and send the requests for domain verification to the nginx server:
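A minimal GKE Ingress along these lines could route the two paths; the host and Service names are placeholders matching the earlier sketches:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: fastly-logging
  annotations:
    kubernetes.io/ingress.class: gce
spec:
  rules:
    - host: fastly-logs.example.com  # the DNS record from step 1
      http:
        paths:
          # Fastly's domain ownership challenge goes to nginx
          - path: /.well-known/fastly/logging/challenge
            pathType: Prefix
            backend:
              service:
                name: nginx-fastly-challenge
                port:
                  number: 80
          # Everything else (the actual log stream) goes to Fluent Bit
          - path: /
            pathType: Prefix
            backend:
              service:
                name: fluent-bit
                port:
                  number: 8080
```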
5. Do not forget about a valid log streaming configuration on Fastly's side. We are using Terraform, so ours had to be terraformed, but it is also possible to do it in Fastly's UI.
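As a sketch, with the current Fastly Terraform provider the relevant part might look like the logging_https block below (the service resource, URL and file path are placeholders):

```hcl
resource "fastly_service_vcl" "www" {
  # ... domains, backends and other settings omitted ...

  logging_https {
    name           = "loki-fluent-bit"
    url            = "https://fastly-logs.example.com/"  # the DNS record from step 1
    method         = "POST"
    content_type   = "application/json"
    json_format    = "2"  # newline-delimited JSON records
    format_version = 2
    format         = file("${path.module}/fastly-log-format.json")
  }
}
```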
The url is your DNS record that you created earlier. The trickiest part here is the format: you need to decide in what format you want to see your logs. We are using an external JSON file to define the log format, for consistency between our BigQuery and HTTPS logging.
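The exact fields are up to you; a trimmed-down example of such a format file, following the style of Fastly's documented JSON logging examples, might look like this:

```json
{
  "timestamp": "%{strftime(\{"%Y-%m-%dT%H:%M:%S%z"\}, time.start)}V",
  "client_ip": "%{req.http.Fastly-Client-IP}V",
  "geo_country": "%{client.geo.country_name}V",
  "host": "%{json.escape(req.http.Host)}V",
  "url": "%{json.escape(req.url)}V",
  "request_method": "%{json.escape(req.method)}V",
  "status": "%{resp.status}V",
  "cache_status": "%{fastly_info.state}V",
  "user_agent": "%{json.escape(req.http.User-Agent)}V"
}
```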
6. All that is left is to send some verification requests to nginx to check that the challenge endpoint responds.
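For example, a quick curl against the challenge path (using the placeholder host from the earlier sketches) should return the digest served by nginx:

```sh
# Expect HTTP 200 and the SHA-256 digest(s) configured in the nginx ConfigMap
curl -i https://fastly-logs.example.com/.well-known/fastly/logging/challenge
```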
7. Finally, send some data (logs) to Fluent Bit to test this whole affair.
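A hand-rolled log line posted to the same host is enough to confirm the pipeline end to end; the JSON body below is just a made-up example:

```sh
# POST a fake log record through the Ingress into Fluent Bit's HTTP input
curl -i -X POST "https://fastly-logs.example.com/" \
  -H "Content-Type: application/json" \
  -d '{"url": "/test", "status": 200, "client_ip": "127.0.0.1"}'

# The record should then show up in Grafana under {job="fastly-cdn"}
```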
So what are the benefits?
Using a few megabytes of RAM and a few milli-CPUs, we are getting gigabytes of logs per hour that are immediately available in Grafana. We can embed these logs in dashboards or alert on them. Our log retention period is now infinite, as opposed to 3 weeks in Elasticsearch. Last but not least, we are paying just for compute, a load balancer, some GCS storage and operations; basically less than a cup of coffee a day. What's not to like?
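For the curious: assuming your log format carries a status field, a LogQL query like the one below is all it takes to chart or alert on the rate of 5xx responses straight from the CDN logs (field names depend on your chosen format):

```logql
sum by (status) (
  count_over_time({job="fastly-cdn"} | json | status >= 500 [5m])
)
```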