A real-time logging solution that allows you to capture all the Fastly CDN logs for less than a cup of coffee a day? Yes, including a super hipster oat milk flat white (or whatever hipsters drink these days) in a super hipster coffee house. Given the amount of Fastly logs we process (more than 30 gigabytes per day), it is hard to believe, but let me introduce you to Grafana Loki. Before we get to the implementation, though, I would like to tell the story of our logging journey.
The story
What is the challenge?
Fastly is a content delivery network (CDN), and if that does not mean anything to you, you should probably start with the official Fastly documentation. It serves millions of requests for loveholidays daily. Fastly is the first thing our customers talk to, and occasionally it is the only thing they talk to, as entire pages are served straight from the CDN. Fastly therefore receives many more requests per user interaction than our side of the infrastructure does, which means it also produces a lot more logs. Finding infrastructure that can deal with CDN-level logging can be quite challenging, simply because CDNs are extremely busy.
Before Loki era
We have been capturing all the relevant Fastly logs and storing them in BigQuery for a couple of years now, and we have tens of terabytes of logs there. As we use GCP, BigQuery was super easy to integrate, and storing and analysing logs in it is fairly cost-efficient. The most beneficial thing about BigQuery is that it allows us to examine “how are we doing?” retrospectively. Questions like “how many users got a bad response?”, “how many users landed on certain pages?” or “where are these users from?” help us drive business KPIs and grasp specifics such as load times and user device versions. These are questions we might ask the next day, next week or next year, and compare year on year or month on month. Something like Kibana or Loki is not particularly suitable for this purpose, because you need to store and compare a great number of logs; in reality, none of the logging solutions is truly made for it. BigQuery is ideal for this kind of task: it is literally the big data analytical database.
Why is BigQuery not enough?
However, if you are developing services, making changes to them, or dealing with technical problems or outages, BigQuery is not the most useful or convenient tool for figuring out what is actually happening with your systems in real time. To get to the bottom of an issue you have to write complicated SQL queries and re-run them continuously. The most annoying disadvantages of BigQuery in our case are:
- filtering is rather slow once you factor in the effort of writing SQL queries;
- the query output is not visual (you do not get fancy graphs);
- the results of your investigative work are hard to share with other people;
- it can be quite costly (especially if you are not using BigQuery Slots, which we are).
We want to keep using BigQuery, because we want to compare long-term improvements to our infrastructure and find out what sort of experience our users get. However, we needed a true logging solution, one meant for quick and effortless data retrieval, so we can understand which problems we are facing right now; a solution that is good for real-time logging and recent events, when you do not need to compare this month against last year.
Why not ELK?
Seemingly the most common solution is ELK. Doubtless, ELK is well established. But if you are on the infrastructure side of things and do not want to maintain ELK, you might not be so keen on it. ELK needs a lot of CPU and RAM, it is Java based (which is not a fault in itself, rather a matter of opinion), it is not very flexible when it comes to scaling with demand (ironically, it is not very elastic), and it is pricey. We were also worried about sending it a volume of logs as high as our CDN's, which sometimes withstands DDoS attacks and gets absolutely smashed with traffic. We knew that our ELK cluster (24GB of RAM) would not handle it, meaning that other applications, which typically log into it at a much lower scale, would also be impacted. There would always be a risk of simply melting our ELK, leaving us unable to see any logs at all. And we did not want to pay double what we are currently paying for ELK.
Why not SaaS?
The market of SaaS logging solutions is plentiful. These solutions are solid, often expensive and yet another tool in our already extensive stack of monitoring tools. So we asked ourselves whether we could use Grafana Loki for ingesting external HTTP logs. We had already been using Loki for ingesting Kubernetes logs before we were faced with this challenge. We asked the Grafana community for advice and the answer was not definitive: there was no ready-made solution. Although there were some signs that it was possible, the path was fairly unclear. Being very keen on Loki and eager to find another suitable use case for it, we decided to go ahead and give it a try.
So, why Loki?
Loki is written in Golang, it has low memory overhead, it does not maintain in-memory indexes, and it uses GCS for storing data. On top of that, we wanted to reduce the number of monitoring tools we run: Loki complements our existing monitoring solutions, Grafana and Prometheus, so it felt like an organic continuation of what we were already using and aligned with our goal of simplifying our monitoring stack.
What is the actual config?
Our Loki set-up for Kubernetes logs includes Promtail, and that configuration successfully captures all of our clusters' logs. Surprisingly, it did not work in our case with Fastly. We wasted a lot of time attempting to understand why, and eventually decided to step away from Grafana's Promtail and use something proven, something that has been around for a while and is part of GKE: enter Fluent Bit. Luckily for us, Fluent Bit has an input plugin for HTTP, and it is well integrated with Loki. We tried it and it worked really well. Additionally, we were able to add extra labels to the data coming in through Fluent Bit, which is great for the usability and observability of the logs.
Implementation
So, if you are geeky enough to be interested in the technical details and/or you want to try it out, and assuming that you have Loki running already (if not, check out the Grafana docs on how to install it), here are the ingredients of this recipe for success:
1. If you are planning to use an existing Ingress (if not, your first task will be to create one), the first step is to create a new DNS record. A DNS record would not be compulsory if you were using a TCP load balancer, but we like the clarity and simplicity of the HTTP load balancer (Ingress), and that approach requires a DNS record.
2. As we dwell in the realm of GKE, the next step is to create all the necessary K8s resources: a Fluent Bit Deployment, Service and ConfigMap. Following security best practices, GKE users can also add a BackendConfig to configure Cloud Armor and restrict access to Fastly's IP ranges, as sketched below.
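A minimal sketch of what that could look like, assuming a Cloud Armor policy called allow-fastly-ip-ranges already exists (all names here are placeholders, not our exact manifests):

```yaml
# Hypothetical BackendConfig attaching a Cloud Armor policy to the Fluent Bit backend
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: fluent-bit-backendconfig
spec:
  securityPolicy:
    # Cloud Armor policy that only allows Fastly's published IP ranges
    name: allow-fastly-ip-ranges
---
# The Fluent Bit Service references the BackendConfig via an annotation
apiVersion: v1
kind: Service
metadata:
  name: fluent-bit
  annotations:
    cloud.google.com/backend-config: '{"default": "fluent-bit-backendconfig"}'
    cloud.google.com/neg: '{"ingress": true}'  # container-native load balancing
spec:
  selector:
    app: fluent-bit
  ports:
    - name: http
      port: 8080
      targetPort: 8080
```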
At the heart of the set-up is the Fluent Bit ConfigMap. The most important parts of it are the [INPUT] section and the labels: without [INPUT], Fluent Bit cannot accept any HTTP traffic from the Ingress.
Once we send these logs to Loki using Fluent Bit, we want to make them easily discoverable by assigning them a unique label, which is handy when we are searching for the relevant logs in Grafana. In our case, we call it {job="fastly-cdn"}.
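As a rough illustration, a Fluent Bit ConfigMap along these lines accepts JSON over HTTP and pushes it to Loki with that label (the Loki host and port below are assumptions and depend on where your Loki runs):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [INPUT]
        # Accept the JSON log lines that Fastly POSTs via the Ingress
        Name    http
        Listen  0.0.0.0
        Port    8080

    [OUTPUT]
        # Fluent Bit's built-in Loki output plugin
        Name    loki
        Match   *
        Host    loki.monitoring.svc.cluster.local
        Port    3100
        Labels  job=fastly-cdn
```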
3. You have to set up an nginx server for domain ownership verification, which is a Fastly requirement; you can read more about it in Fastly's documentation. For this you will need an nginx Deployment, Service and ConfigMap.
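For illustration, the nginx ConfigMap can be as small as a single location block serving Fastly's challenge path; the digest below is a placeholder for the SHA-256 hash of your Fastly service ID:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-fastly-challenge
data:
  default.conf: |
    server {
      listen 80;

      # Fastly fetches this well-known path to verify domain ownership.
      # The body must list the SHA-256 hex digest of every Fastly service ID
      # allowed to stream logs to this domain (or "*" to allow any service).
      location /.well-known/fastly/logging/challenge {
        default_type text/plain;
        return 200 "<sha256-hash-of-your-service-id>\n";
      }
    }
```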
4. Yet another central piece is the Ingress. We used our already existing one, but you can create a brand new one, as mentioned above in step 1.
The Ingress defines where to send the data from Fastly. There are two paths: we ship the logs to the Fluent Bit Service and send the requests for domain verification to the nginx server:
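A minimal GKE Ingress along these lines could route the two paths; the host and Service names are placeholders matching the earlier sketches:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: fastly-logging
  annotations:
    kubernetes.io/ingress.class: gce
spec:
  rules:
    - host: fastly-logs.example.com  # the DNS record from step 1
      http:
        paths:
          # Fastly's domain ownership challenge goes to nginx
          - path: /.well-known/fastly/logging/challenge
            pathType: Prefix
            backend:
              service:
                name: nginx-fastly-challenge
                port:
                  number: 80
          # Everything else (the actual log stream) goes to Fluent Bit
          - path: /
            pathType: Prefix
            backend:
              service:
                name: fluent-bit
                port:
                  number: 8080
```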
5. Do not forget about a valid log streaming configuration on Fastly's side. We are using Terraform, so ours had to be terraformed, but it is also possible to do it in Fastly's UI.
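As a sketch, with the current Fastly Terraform provider the relevant part might look like the logging_https block below (the service resource, URL and file path are placeholders):

```hcl
resource "fastly_service_vcl" "www" {
  # ... domains, backends and other settings omitted ...

  logging_https {
    name           = "loki-fluent-bit"
    url            = "https://fastly-logs.example.com/"  # the DNS record from step 1
    method         = "POST"
    content_type   = "application/json"
    json_format    = "2"  # newline-delimited JSON records
    format_version = 2
    format         = file("${path.module}/fastly-log-format.json")
  }
}
```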
The url is your DNS record that you created earlier. The trickiest part here is the format: you need to decide in what format you want to see your logs. We are using an external JSON file to define the log format, for consistency between our BigQuery and HTTPS logging.
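The exact fields are up to you; a trimmed-down example of such a format file, following the style of Fastly's documented JSON logging examples, might look like this:

```json
{
  "timestamp": "%{strftime(\{"%Y-%m-%dT%H:%M:%S%z"\}, time.start)}V",
  "client_ip": "%{req.http.Fastly-Client-IP}V",
  "geo_country": "%{client.geo.country_name}V",
  "host": "%{json.escape(req.http.Host)}V",
  "url": "%{json.escape(req.url)}V",
  "request_method": "%{json.escape(req.method)}V",
  "status": "%{resp.status}V",
  "cache_status": "%{fastly_info.state}V",
  "user_agent": "%{json.escape(req.http.User-Agent)}V"
}
```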
6. All that is left is to send some verification requests to nginx to check that the challenge endpoint responds.
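For example, a quick curl against the challenge path (using the placeholder host from the earlier sketches) should return the digest served by nginx:

```sh
# Expect HTTP 200 and the SHA-256 digest(s) configured in the nginx ConfigMap
curl -i https://fastly-logs.example.com/.well-known/fastly/logging/challenge
```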
7. Finally, send some data (logs) to Fluent Bit to test this whole affair.
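A hand-rolled log line posted to the same host is enough to confirm the pipeline end to end; the JSON body below is just a made-up example:

```sh
# POST a fake log record through the Ingress into Fluent Bit's HTTP input
curl -i -X POST "https://fastly-logs.example.com/" \
  -H "Content-Type: application/json" \
  -d '{"url": "/test", "status": 200, "client_ip": "127.0.0.1"}'

# The record should then show up in Grafana under {job="fastly-cdn"}
```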
So what are the benefits?
Using a few megabytes of RAM and a few milli-CPUs, we are getting gigabytes of logs per hour that are immediately available in Grafana. We can embed these logs in dashboards or alert on them. Our log retention period is now infinite, as opposed to 3 weeks in Elasticsearch. Last but not least, we are paying just for compute, a load balancer, some GCS storage and operations; basically less than a cup of coffee a day. What's not to like?
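For the curious: assuming your log format carries a status field, a LogQL query like the one below is all it takes to chart or alert on the rate of 5xx responses straight from the CDN logs (field names depend on your chosen format):

```logql
sum by (status) (
  count_over_time({job="fastly-cdn"} | json | status >= 500 [5m])
)
```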