TL;DR: this post demystifies Kubernetes networking by presenting its key concepts and a recommended architecture that we, the Opta team, call the Golden Path.
HTTP Request Journey
Let's dive in with a typical example: a person clicking on their profile page on a website; we assume the site runs on Kubernetes.
[Diagram: the journey of an HTTP request through the networking stack]
In this diagram, the request gets sent through the Internet to a Load Balancer (1), then to an ingress controller pod running on a Kubernetes cluster (2). Finally, one container running a web app receives the request (3) and is tasked to respond. To build the response, that container reaches out to an API service (4), which connects to a database not hosted in Kubernetes (5). The web application also publishes some metrics to a web analytics service that provides a public API on the Internet (6, 7).
While this example is a widespread use case, it involves many different network connections, each with different network properties. Let's review them in the following table.
Now that we understand the various types of connections, let's list the components used and their functions.
What You Need
- Cloud Provider: A third-party company offering cloud-based platform, infrastructure, application, or storage services; you pay only for the services you use.
- Load Balancer: This is a specialized cloud service to prevent server overload by distributing the traffic across multiple servers.
- Kubernetes: The leading platform for managing containers.
- Ingress Controller: This is another load balancer that distributes the traffic inside the Kubernetes cluster. We will review this one later.
- Kubernetes Service: A logical abstraction for a deployed group of containers.
- Service Mesh: A software architectural pattern used in microservices deployments to enable secure, fast, and reliable service-to-service communication. We will come back to this one as well.
- Other Cloud Services: Most likely, you will need services not hosted in Kubernetes. For example, a database. This is relevant when we look at networking configuration later.
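To make the Kubernetes Service abstraction concrete, here is a minimal sketch of a Service fronting a group of web containers (the names, labels, and ports are illustrative):

```yaml
# Hypothetical Service exposing the "web" pods inside the cluster.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web            # matches the labels on the deployed group of pods
  ports:
    - port: 80          # port the Service listens on
      targetPort: 8080  # port the containers listen on
```

Other pods in the cluster can then reach the group at a stable name (`web`) regardless of which individual pods are running.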
Let’s review some of the most confusing parts of this architecture.
1. What is the difference between the Load Balancer and the Ingress Controller?
These two components serve a similar purpose, distributing the traffic, so why do we need both?
Let’s imagine for a second that we skip the first Load Balancer:
- All public traffic will come directly to the Kubernetes cluster, and the Ingress Controller pod will need to have a public IP, which means your cluster is now public.
- The IP of an Ingress Controller pod can change over time (when the worker node is replaced), meaning that the public DNS would need to be updated when it happens.
- Auto-scaling / High-Availability: By default, a cloud-native Load Balancer will scale according to traffic and use multiple availability zones. While it’s recommended that you configure your Ingress Controller to achieve this, it requires some work.
So by acting as a buffer between the Internet and your Kubernetes cluster, the Load Balancer improves the security of your cluster by keeping it private and provides some advanced features that make it easy to operate, such as public DNS/IP, Auto-scaling, and High-Availability.
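In practice, the cloud-native Load Balancer is typically provisioned by exposing the ingress controller through a Service of type LoadBalancer. A minimal sketch, assuming an Nginx-style controller (names, namespace, and labels are illustrative; managed installs usually create this for you):

```yaml
# Exposing the ingress controller through a cloud-native Load Balancer.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer    # the cloud provider provisions a managed LB for this Service
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443
```

The cloud provider assigns the LB a stable public address, so the ingress controller pods themselves can stay on private IPs.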
2. Do I need a Load Balancer for every exposed service, or one per Kubernetes cluster?
This is another very frequent and confusing question. My take is that cloud-native Load Balancers (LBs) scale very well (it's automatic!), so having one public Load Balancer works for most cases. Also, spawning an LB for each service in a microservices architecture would be expensive. So unless you have a specific infrastructure or security requirement, my recommendation is to keep it simple with one LB per cluster.
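With a single LB per cluster, the fan-out to individual services happens at the Ingress layer instead. A sketch of one Ingress routing two hypothetical hostnames to two services behind the same Load Balancer:

```yaml
# One public LB, many services: host-based routing at the Ingress layer.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: public-routes
spec:
  ingressClassName: nginx
  rules:
    - host: www.example.com          # routed to the web frontend
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
    - host: api.example.com          # routed to the API service
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 80
```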
3. Why do I need a Service Mesh?
Again, I would like to take the angle of what happens if you don’t have a service mesh.
- Each service needs to be coded with logic to manage service-to-service communication, including features like retries and timeouts.
- Providing consistent network metrics for service-to-service traffic might be challenging. What happens inside your cluster from a network perspective is hard to expose if the services manage all the communications directly. You can hack around this by adding HTTP headers, but this is extra work.
- If you want TLS between two services, you are now in charge of generating and deploying the TLS certificates, updating the code to use TLS, and importing the self-signed certificates. And what about rotating the certificates?
These are some of the most common problems that a service mesh solves. Configuring a service mesh is not easy, but it is done in one place; your application code doesn't need to be aware of these networking features. You might be fine without a service mesh if you have only a few services, or uniform ones (one programming language with a standard network library). But sooner or later, you might want something from the long list of features that a service mesh provides.
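To illustrate how little the application has to know: with a mesh like Linkerd, for example, opting a workload into the mesh is a single annotation on the pod template, and the injected sidecar proxy then handles mTLS, retries, and metrics. (The Deployment name and image below are illustrative.)

```yaml
# Hypothetical Deployment opted into the mesh via annotation;
# no application code changes are required.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
      annotations:
        linkerd.io/inject: enabled   # Linkerd injects a sidecar proxy at admission time
    spec:
      containers:
        - name: api
          image: example/api:latest
```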
What is in the Golden Path?
Now that we have covered the basic networking architecture for Kubernetes, and brought some clarity about the different components and their function, let’s make a recommendation. But, again, take this with a grain of salt. Please feel free to pick and choose what works for you.
Let’s define the goals that we want to achieve with this solution:
- First, implement network best practices for security, resilience, and scalability.
- It can be deployed on major clouds.
- Favor use vs. build — to get started fast.
- Favor simplicity over complexity — to lower operational cost.
Ok, so now, let’s see what the Opta team has selected for the Golden Path.
- Cloud Provider: Most engineers can’t pick the cloud provider, and it’s not easy to change. At this time, most cloud providers provide similar services. Thankfully, Kubernetes provides some abstraction over the cloud provider, so we can make a recommendation that works on any major cloud provider. This also means that we won’t recommend a component that would only work on a specific cloud.
- Load Balancer: Don’t reinvent the wheel; use a cloud-native Load Balancer service. You will get all the main features from day one, such as auto-scaling and high availability.
- Kubernetes: Again, for 99% of users, you don't want to manage the Kubernetes cluster yourself; use a managed service like EKS, GKE, or AKS. These services come with some components to control the worker nodes as well. The more you can use these managed services, the less work for yourself. Most of the cost of operating a Kubernetes cluster is the compute cost (VMs); using these managed services is a marginal cost in comparison, and it will save you a lot of time.
- Service Mesh: Here is the first big choice. Let's proceed by elimination. Some service meshes are developed by cloud providers (Open Service Mesh and AWS App Mesh); we eliminated those since we don't want to make a recommendation specific to a cloud provider. As a security best practice, we want mTLS support; therefore, Traefik Mesh was not considered. From this point, we have two categories of service mesh. On one side, Istio and Consul provide many advanced features, which result in more complexity and require more compute resources to operate. On the other side, Linkerd has fewer features but is optimized for performance and usability. Linkerd has some missing parts: it can't mesh non-Kubernetes workloads, and header-based routing and delay injection are not supported (the first two are coming in 2022). These are arguably advanced features that many users won't need, and it does everything else (see this comparison table). Linkerd's design principle of minimizing resource requirements works very well when you consider that a service mesh deploys a proxy alongside every application container. Our pick is Linkerd.
- Ingress Controller: Linkerd doesn't come with an Ingress Controller, but it supports many; see the list here. Our pick is Nginx: it's easy to use and very performant. There is no wrong choice here, but Nginx, being the most popular, made sense for us.
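One integration detail worth noting: Linkerd's ingress documentation suggests annotating Nginx Ingress resources so that Nginx routes to the Service address and the meshed proxies do the per-request load balancing. A sketch (the host and service names are illustrative):

```yaml
# Letting the mesh, rather than Nginx, load-balance across pods:
# the service-upstream annotation makes Nginx target the Service address.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    nginx.ingress.kubernetes.io/service-upstream: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: www.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```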
Other notable choices for the Golden Path:
- Subnets: The subnets are split into public and private. Only the Load Balancer and the Internet Gateway require a public IP; everything else (Kubernetes, databases) is private. In a microservices architecture on Kubernetes, every service will have many pods, and each pod uses an IP address, which translates into a high number of required IPs. Sizing your subnets for growth will prevent them from running out of IP addresses; use a CIDR prefix of /20 or lower (a /20 subnet provides roughly 4,000 usable addresses). (see table)
- High Availability: There are many strategies to achieve HA: multi-cloud, multi-region, multi-zone. Most cloud providers guarantee 99.99% availability in any given region, roughly 5 minutes of downtime per month. This should work for most users, so we went with multi-zone, the easiest and cheapest to achieve. We run Nginx and Linkerd in high-availability (HA) mode. For every Kubernetes workload, we ensure a minimum of 3 pods to cover each zone. We also configure an Ingress Gateway in each zone.
- Infrastructure as Code: Use Terraform or a Terraform-compliant tool. This one is pretty straightforward; the Golden Path should be easily duplicated and distributed. We use Opta, a Terraform-compliant tool, which implements this network architecture natively (see documentation). Opta can provision this infrastructure stack on the cloud provider or generate the Terraform files.
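The "minimum of 3 pods to cover each zone" guideline above can be expressed with replicas plus a topology spread constraint, so the scheduler spreads pods across availability zones. A minimal sketch (the workload name and image are illustrative):

```yaml
# Three replicas spread across availability zones for multi-zone HA.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone  # aim for one pod per zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: example/web:latest
```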
Here is the same diagram we had earlier but it now includes the choices listed in the Golden Path.
[Diagram: the same architecture, annotated with the Golden Path component choices]
The main features are:
- Public and private subnets to ensure isolation.
- High availability to provide resiliency in case of an availability zone outage. 99.99% SLA, guaranteed by the major cloud providers.
- Service mesh features: automatic mTLS, load balancing, retries and timeouts, traffic split, etc.
- Simplicity and Resilience: This solution is pretty straightforward to operate. The cloud provider manages the Load Balancer, the Database, and the Kubernetes cluster. Both Nginx and Linkerd are optimized for performance and usability.
What’s missing?
You might also consider a few add-ons such as:
- A content delivery network (CDN) to serve the assets needed for loading Internet content, including HTML pages, JavaScript files, stylesheets, images, and videos. Each cloud provider offers such a service, and it is cheaper than serving these static assets from your application.
- Configure Network Policies within the Kubernetes cluster: By default, Linkerd allows all traffic to transit in the mesh, but this can also be configured; see this guide.
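As a starting point for network policies, a common pattern is a default-deny policy per namespace, after which only the required flows are explicitly allowed. A sketch (the namespace name is illustrative):

```yaml
# Default-deny all ingress traffic to pods in this namespace;
# individual policies then allow only the required flows.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}      # an empty selector applies to every pod in the namespace
  policyTypes:
    - Ingress
```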
Finally, what would you have done differently?
Now that we have presented the Golden Path for Kubernetes networking, what else do you think should be considered? Do you have some experience that you would like to share?
Please add your comments below. In addition, other folks from the Opta team or I will be happy to discuss these topics.