#Kubernetes
#Rant
#AWS
Intro
When learning something, I need to experiment a lot. All of this happens in labs, mostly provided by my employer, where resources aren't really something I have to think about. But recently I had to pay for my own labs out of pocket, and oh boy, did I feel it.
Well, not really. I wasn't going broke paying for labs, but every time I spun one up there was that anxiety in the back of my mind: how much does this cost? Did I turn it off, or have I left something running? Can I delete this now, or will I need it later? Oh, I need to test more, but I've already removed the lab and don't want to bring it back, and so on. It was super annoying, so I had to fix it, which I did.
So this is my take on how to build a lab that will cost so little you won’t have to worry about it.
I do have to clarify something: this post is not for a person who has no money, such as a student just starting to study for a job. There are a lot of options to study for free, including lab credits, which both vendors and third-party companies provide.
I am not a broke person who found a loophole for free compute; what I am is an obsessive, anxious engineer with a knack for overcomplication. I look at something and overengineer it to the point of obsessiveness.
Regardless, this post is for anyone who wants to study systems and how they are engineered; it is more about that than saving a buck.
Target labs
The nature of the labs and what we want to learn from them are a major contributor to both the cost and the study path. As stated, this post is not a way to acquire credits or funding. I am not going to cover labs that heavily rely on specific SaaS services; such services are priced by providers as they see fit, with dedicated teams for promoting and training their clientele. If a targeted lab heavily relies on such offerings, please consider contacting the provider for credits, study paths, and other learning resources.
What I discuss in this blog is how to utilize compute resources in AWS to spin up self-hosted services.
Why self-hosted? Self-hosting something is a sure way to learn about it. All of this lab talk comes from me learning something, this is probably the case for others as well. We are creating labs to experience our theoretical knowledge in practice. This is the most important part of the blog, while the cost side is simply that, a cost we pay for it.
Why AWS? They provide an API for everything, and cost is calculated based on usage. This makes labs on AWS very, very cheap in practice, even compared to offerings that, on paper, cost a tenth of AWS's price for the same compute.
Does that make any sense? Well, it should, but here’s an example.
If AWS has a monthly cost of 4.5 USD for a 1 vCPU / 1 GB RAM compute node, and another platform lists the same node for 0.50 USD, the second offering seems cheaper, right?
Sure, it is, but most of the time such offerings are prepaid or billed per full month. If I spin up 20 of those nodes on AWS for 5 hours, the cost comes out to about 0.6 USD, while the other platform charges me for the whole month (10 USD) even if I remove the nodes after those 5 hours.
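The arithmetic, as a quick sanity check (node prices as in the example above, assuming a 730-hour month):

```shell
# Usage-based: 20 nodes for 5 hours at ~4.5 USD/month per node.
# Flat-rate: 20 nodes billed for the full month at 0.50 USD each.
aws_total=$(awk 'BEGIN { printf "%.2f", (4.5 / 730) * 20 * 5 }')
flat_total=$(awk 'BEGIN { printf "%.2f", 0.50 * 20 }')
echo "usage-based: $aws_total USD, flat-rate: $flat_total USD"
```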
This can be true of other clouds as well; AWS does not hold a singular, unique advantage over them. No, it's more of a whole-package thing:
Infrastructure-as-a-Service APIs; good documentation and a widespread community; complex, mature services with a lot of flexibility; experience that can be referenced and used in an actual job; and so on.
Just make sure, when choosing a cloud service, to choose a serious player mature enough to provision everything automatically. This is in contrast to smaller players who might offer a service for much less but, in trying to automate their offerings, end up manually approving or even provisioning resources, which will prove difficult. I once had a cloud provider give me an IP that duplicated another user's machine; I have no idea how to account for such issues, and I fail to see how the offering would cost less after accounting for unexpected problems, even if I don't count the wasted time.
Cloud Resources
Dynamically managing resources is mandatory to get the cost so low that we don't even feel it; it is truly the key to the equation. We have to provision a bunch of resources and remove them as soon as possible, and I don't mean provisioning something for weeks and then removing it; no, we are talking hours here.
Provisioning something for hours has its own quirks, with one mandatory consideration originating from the nature of the labs themselves. When we provision a lab to study, we usually have to invest time in parts of the lab that have nothing to do with what we are studying just to get to the study part, and redoing that over and over just to save a buck seems annoying.
But in actuality, this is a blessing in disguise. Just think about it this way: we don’t want to do anything manually, especially if we have to repeat it, so we are going to automate it. And to automate it, we have to understand it on such a level that we learn the inner workings of our study target from an outside perspective.
Really, I am very serious about this. Applications are created similarly; there is so much architectural difference out there, and the majority of the time these applications will share behaviors, things like how they store data, cache it, talk with other systems, or whatnot. And if we understand this for a few systems out there, not only will we have much better knowledge of those systems, but we can pretty much adjust the formula for others.
When provisioning a lab to study, we ignore all of this. We open some documentation provided by the vendor on how to spin up a demo environment or some getting started guide, and while they are definitely helpful for getting started fast, they lack a lot when it comes to really understanding the system itself.
Instead of running a single Docker Compose command that downloads a monstrosity of applications and presents a web interface, where we miss a lot and have to imagine magical unicorns doing their bubu baba magic to perform a simple application routine, I offer to be a blockhead, just like me:
analyze the application's inner workings and automate each and every resource along with its dependencies.
Every time we spin up our environment, it will provision everything as we declared and then demolish it all.
This way we both study and save money, and trust me, nothing feels better (in this field?) than wanting to change some part of a system and having your brain basically simulate the whole exchange with the implemented system. There are no magical unicorns, so why give them any power at all?
The tool for dynamically provisioning labs is Terraform. Now, this post really is not about Terraform; I will not explain how it works or argue alternatives, but I have a hard time imagining a scenario where this tool changes. Also, don't worry if you don't know Terraform; it is central to everything in this post, and yes, to use it you have to know it. But trust me on this: there is only one way to learn Terraform, and that is by using it. So why not start here?
Engineers out there who are experienced in automation: sorry, but you are not using the tool you are familiar with. I have heard so much from engineers who already know X and refuse to use Y, butchering X's functionality to technically cover what Y would provide with their precious X. Get out of your comfort zone and get okay with it. Study never stops; it will always have to be done one way or another. And no, there is no such thing as useless knowledge; everything has a use. You can study inefficiently, but you cannot study nothing. And if you cannot tell yourself how to study efficiently, then you are not allowed to refuse to study inefficiently. Be patient, study, and the time will come when you don't have to question or argue your choice.
I am not here to cover Terraform, but I have to create some common ground for the reader to follow. Terraform is an overengineered tool that allows us to define targeted infrastructure declaratively. The declaration is fed to the Terraform CLI, which builds a dependency tree, error-checks it, and predicts the end state of our declared infrastructure; the actual creation of resources is handled via the APIs of the hosting platform. Meaning, we can tell Terraform how many instances we need, where they will sit, and how they will be configured. Terraform will analyze our intent and, if possible, realize it by asking AWS to create each required service in the correct order. After our intent is realized, it manages the state of the created resources and their lifecycle.
This allows us to create very complicated sets of dependencies that can be created within a minute or two and later cleaned up just as fast. This is a very powerful utility to have, especially when modifying declared infrastructure. Where it falls short is in configuring the operating system.
Terraform's bread and butter is calling an API. Creating an instance? Sure, we can call an API that will tell us if creation is possible and, if so, create it for us. But telling the OS what to do? Well, there is no API for that. We can feed commands to the instance, but without the exquisite feedback provided by APIs, we have to keep things very simple. Unfortunately, simple is not enough to build a lab, so here we have to get creative. Normally we use a bunch of different tools to generate configuration files on the host; with our limited options, we instead generate the files outside of the OS and feed them in.
Does that sound scary? Yeah, it does, I know, but it’s not scary at all. It does get complicated, but trust me on this.
Here I have to introduce another tool: CoreOS, an immutable operating system configured using Ignition files.
CoreOS is not actually mandatory for this, but it certainly fits the usage. It all comes down to the way the OS is configured: CoreOS is an immutable system whose configuration must be preloaded via Ignition files, and because of this nature we can, and are forced to, automate services in a simple manner instead of, you know, creating a machine and running a bunch of Ansible/Bash scripts to create a service. No, no, we have to suffer so that what we create (which will never be seen by another person) is a minimalistic marvel that will work years after it's written, even as it collects dust in some private Git repo until we forget our credentials.
CoreOS will act as our operating system. We will configure it by feeding prerendered configuration files. This allows us to create files, services, transfer data, configure access, and so on.
This should raise a question: what if my choice of lab has a binary program that takes control of nodes and configures them appropriately? Here our approach falls short. Running a binary with Terraform is a horrible experience, it can be done, but I don’t recommend it.
But this is a non-issue. While it certainly can be a blocker, most of the time we can and should do what this binary does ourselves. Yeah, I know, crazy concept, right? Give me time; I’ll get there.
Another alternative to Ignition-style configuration is cloud-init, but I prefer Ignition for how simple and punishing it can be: it either works reliably or it simply doesn't.
Making progress
With the introduction and crazy talk out of the way, I want to get in there and demonstrate what I’m on about.
For a demonstration, we will follow a scenario, a scenario to create a very simple Kubernetes cluster. This will allow me to demonstrate my reasoning for why the presented approach is awesome and totally not me coping with the amount of hours I wasted obsessing over stupid stuff. No, this is important and was needed. Everything for education, right? This is what I should be using my time for. No issues here…
Yes, scenario, right. Kubernetes, really a monster of a thing to choose here. I could and should have gone for something simpler, but who cares, I’ll suffer some more.
So let’s do a scenario like this: provisioning a simple three-node cluster, bare bones, no networking, no nothing, just a Kubernetes cluster.
Before tackling the Kubernetes cluster, we should create some servers, just a few where we can go in and run stuff.
The servers will need a network, so let's create that first.
The Virtual Private Cloud (VPC) is Amazon’s take on software-defined networks. It allows us to define subnets, assign addresses, configure routing tables, etc.
VPC allows us to experiment with network-isolated scenarios, which will also come up in our real work where we have to comply with air-gapped environments for government and enterprise entities. But leaving specific scenarios aside, it will also save us money.
resource "aws_vpc" "lab" {
  cidr_block = "10.10.16.0/20"
}

resource "aws_subnet" "instance_subnet" {
  vpc_id     = aws_vpc.lab.id
  cidr_block = "10.10.16.0/24"
}
This gives us the VPC with a defined range of 10.10.16.0-10.10.31.255. Under it, we created an isolated subnet, 10.10.16.0-10.10.16.255, where we will put our instances.
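The address counts behind those prefixes are just powers of two:

```shell
# 2^(32 - prefix) addresses per block
echo "/20: $((2 ** (32 - 20))) addresses"  # the VPC
echo "/24: $((2 ** (32 - 24))) addresses"  # the instance subnet
```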
Let’s put some EC2 instances inside our glorious VPC.
resource "aws_instance" "lab_instance" {
  count         = 2
  ami           = local.ami
  instance_type = "t2.micro"
  key_name      = local.ssh_key

  network_interface {
    network_interface_id = aws_network_interface.instance_int[count.index].id
    device_index         = 0
  }
}

resource "aws_security_group" "instance_sec" {
  name        = "instance_sec"
  description = "sec group for the instance server"
  vpc_id      = aws_vpc.lab.id
}

resource "aws_network_interface" "instance_int" {
  count           = 2
  subnet_id       = aws_subnet.instance_subnet.id
  security_groups = [aws_security_group.instance_sec.id]
}
Now we have two instances, but they only have internal IPs assigned by the VPC. We cannot really interact with them; they can talk to each other, but that’s pretty much it.
So far, the VPC hasn’t cost us anything, we are only paying for the created instances and a small additional cost for traffic (if generated).
Accessing a host can be done in a few ways. For example, we can attach a public IP to the instance, but this adds roughly $3.65 a month per instance ($0.005/hour). Fortunately, we are not stuck with this cost.
The problem of available public IPs should be familiar to system and network engineers, and they are also familiar with workarounds such as IPv6, NAT, and VPN services.
Let's focus on NAT (Network Address Translation), where we route traffic via a single public IP. The NAT gateway, where traffic is terminated, tracks connections and records them in a NAT table. When our internal instance accesses the internet, it hits the NAT gateway, which takes note of the source IP/port, rewrites the source IP to the public IP, and sends the modified packet on to the destination. Later, when a response arrives on the public IP (the previously rewritten source), the NAT gateway looks up its table and handles the destination the same way in reverse, replacing it with the internal IP and port recorded earlier.
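To make the table bookkeeping concrete, here is a toy lookup: just the data-structure logic in awk, not real packet handling, and the IPs and ports are invented for illustration.

```shell
# NAT table: internal source (IP:port) -> port the gateway used when rewriting.
cat > /tmp/nat_table.txt <<'EOF'
10.10.16.11:43210 50001
10.10.16.12:43999 50002
EOF

# A "response" arrives on the public IP at port 50002; look up who it belongs to.
awk -v port=50002 '$2 == port { print "forward to", $1 }' /tmp/nat_table.txt
```

A real NAT gateway does exactly this mapping in both directions, just in kernel space and per packet.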
AWS provides its own implementation of NAT as a VPC feature, but it costs money, around $34/month ($0.045/hour), plus an additional $0.45 ($0.045 per GB) for 10 GB of traffic. Well, that’s too much money for our little lab.
So what can we do as an alternative? Simple: we can create a single instance with a public IP, configure NAT rules on it, and then tell the VPC to route traffic through that instance, which will in turn route traffic to the internet via its own public IP. Left running all month, that comes out to about $8 (t2.nano, $0.0058/hour, plus public IP, $0.005/hour), and only cents for the hours a lab actually runs. Little to no money.
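The math, assuming the gateway runs nonstop for a 730-hour month:

```shell
# t2.nano at 0.0058 USD/hour plus a public IPv4 at 0.005 USD/hour.
awk 'BEGIN { printf "%.2f USD/month\n", (0.0058 + 0.005) * 730 }'
```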
We can put this instance in the same subnet, but I will create another subnet called GW, just to have clear roles.
resource "aws_subnet" "gw_subnet" {
  vpc_id     = aws_vpc.lab.id
  cidr_block = "10.10.31.0/24"
}
We also need a public IP and some routing.
resource "aws_eip" "nat_public_ip" {
  instance = aws_instance.gw_instance.id
  domain   = "vpc"
}

resource "aws_internet_gateway" "internet_gw" {
  vpc_id = aws_vpc.lab.id
}

resource "aws_route_table" "gw_routing" {
  vpc_id = aws_vpc.lab.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.internet_gw.id
  }
}

resource "aws_route_table_association" "gw_rb" {
  subnet_id      = aws_subnet.gw_subnet.id
  route_table_id = aws_route_table.gw_routing.id
}

resource "aws_security_group" "gw_sec" {
  name        = "gw_sec"
  description = "sec group for the gw server"
  vpc_id      = aws_vpc.lab.id
}

resource "aws_network_interface" "gw_int" {
  subnet_id         = aws_subnet.gw_subnet.id
  private_ips       = ["10.10.31.10"]
  source_dest_check = false
  security_groups   = [aws_security_group.gw_sec.id]
}
Now we can create the gateway instance and define access rules.
resource "aws_instance" "gw_instance" {
  ami           = local.ami
  instance_type = "t2.micro"
  key_name      = local.ssh_key

  network_interface {
    network_interface_id = aws_network_interface.gw_int.id
    device_index         = 0
  }
}

resource "aws_vpc_security_group_ingress_rule" "gw_allow_ssh" {
  security_group_id = aws_security_group.gw_sec.id
  cidr_ipv4         = "0.0.0.0/0"
  from_port         = 22
  ip_protocol       = "tcp"
  to_port           = 22
}

resource "aws_vpc_security_group_egress_rule" "gw_allow_outbound" {
  security_group_id = aws_security_group.gw_sec.id
  cidr_ipv4         = "0.0.0.0/0"
  ip_protocol       = "-1"
}

resource "aws_vpc_security_group_ingress_rule" "instance_allow_ssh" {
  security_group_id = aws_security_group.instance_sec.id
  cidr_ipv4         = "10.10.31.0/24"
  from_port         = 22
  ip_protocol       = "tcp"
  to_port           = 22
}

resource "aws_vpc_security_group_egress_rule" "instance_allow_outbound" {
  security_group_id = aws_security_group.instance_sec.id
  cidr_ipv4         = "0.0.0.0/0"
  ip_protocol       = "-1"
}
Okay, cool, now we have an instance that has both a public IP and an internal IP.
We can directly SSH into this machine and access our other instances from there, which is cool, but the other instances still don’t have access to the internet. And honestly, accessing one machine just to use it to SSH into yet another machine is annoying.
Let’s address the internet part first, and while we’re at it, let’s also introduce some drawings to make sure my rambling makes a bit more sense.
Here we have three machines and two subnets: the GW machine, which is part of the GW subnet, and two machines that are part of the instance subnet.
Leaving access lists aside, the subnets are already aware of each other and able to communicate via a router.
Well, not really, this is SDN, not a legacy network. In an SDN scenario, it’s more of a distributed set of edge devices taking action than a router sitting in the middle. But for the sake of keeping the explanation simple, suspend disbelief and focus on the idea itself.
Now let’s introduce the public IP to the drawings. Previously, I mentioned the GW instance hosting two IPs, this is technically true, but in practice, we are bound by VPC rules. In this case, what we actually get is a single IP on the instance, the internal IP assigned by the VPC.
When we attach a public IP to the instance, AWS goes to an entity called the IGW (Internet Gateway) and puts the assigned IP on it. The IGW acts as a NAT gateway by creating a static NAT rule to translate the assigned public IP to the internal IP of the instance, and vice versa.
The IGW only does static translation of IPs, mapping a single public IP to a single internal IP. To keep costs low, we have to do dynamic translation using an IP/Port combination.
This is where our GW host comes into play. We modify the routing table to send traffic to that host, which will perform dynamic translation and then route it to the IGW, where static translation happens.
So, kind of like this.
Now we have to tell Linux not to drop packets with a destination IP that doesn't match its own, and to NAT them instead.
This can be done with two commands.
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -o enX0 -j MASQUERADE
And the Ignition version:
variant: fcos
version: 1.6.0
storage:
  files:
    - path: /etc/sysctl.d/90-ipv4-ip-forward.conf
      mode: 0644
      contents:
        inline: |
          net.ipv4.ip_forward = 1
systemd:
  units:
    - name: nat-rule.service
      enabled: true
      contents: |
        [Unit]
        Description=nat rule
        Wants=network-online.target
        After=network-online.target
        [Service]
        Type=oneshot
        RemainAfterExit=yes
        ExecStart=/usr/sbin/iptables -t nat -A POSTROUTING -o enX0 -j MASQUERADE
        [Install]
        WantedBy=multi-user.target
Okay, now we have a functioning NAT gateway, and our instances have internet access.
Let me also drop a link to the GitHub repository with this specific code: github
| Cost | $0.015/hour |
| Cost under free tier | free (250 hours per month) |
| Time to provision lab | a minute or two |
| Time to destruct lab | a minute or two |
| Commands to init lab | two |
| Commands to create/destruct lab | one |
cost calc: (0.0047 (t3a.nano) × 3) + 0.0007 (public IP)
To test the instances, just SSH into instance-01 or instance-02 and try accessing the internet.
This can be done by using the GW instance as a jump host:
ssh core@<instance 01/02 internal IP> -J core@<gw public ip>
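If this gets tedious, the jump hop can live in your SSH client config instead; the host aliases below are made up for illustration, and the IPs are the placeholders you'd substitute from your own lab:

```text
# ~/.ssh/config -- illustrative aliases, substitute your real addresses
Host lab-gw
    HostName <gw public ip>
    User core

Host lab-instance-01
    HostName <instance 01 internal IP>
    User core
    ProxyJump lab-gw
```

After that, a plain `ssh lab-instance-01` performs both hops for you.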
Improving Experience
Using the GW as a jump host is annoying; doing it for SSH is bearable, but as soon as we introduce web-based applications, it becomes unusable.
To solve this, we have to implement a VPN solution.
A Virtual Private Network allows us to interact with our private network in a secure manner. This is done by introducing a virtual interface in our operating system; that virtual interface forwards traffic to a local application, which in turn handles routing of said packets to our private network over the internet.
We have a wide variety of VPN solutions. AWS itself offers one, but as expected, it costs money, around $73 a month per user ($0.10 an hour). This is too much. Like with the NAT gateway, we should set up our own VPN service.
WireGuard is one of the best options for our scenario. It is easy to set up, has little to no latency, and provides everything we might need.
Basically, install and configure WireGuard on a server, then install and configure the client on your laptop.
We need an instance to set up this service. To keep costs down, we will run it on the same host we used for NAT.
Usually, we have to install the server binary on the host, but CoreOS includes WireGuard binaries out of the box.
The only thing left is configuration.
Here is a simple WireGuard server configuration:
[Interface]
PrivateKey =
Address = 10.70.88.1/24
ListenPort = 443
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT; iptables -t nat -A POSTROUTING -o enX0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; iptables -t nat -D POSTROUTING -o enX0 -j MASQUERADE
[Peer]
PublicKey =
AllowedIPs = 10.70.88.101/32
It contains two sections, Interface and Peer.
- Interface: the main configuration
  - PrivateKey: the server's private key, generated with wg genkey
  - Address: the IP set on the WireGuard interface
  - ListenPort: the port the server binds to
  - PostUp: command run when the service starts
  - PostDown: command run when the service stops
- Peer: the configuration of a peer (client)
  - PublicKey: the peer's public key, derived with wg pubkey
  - AllowedIPs: the IPs allowed for (and routed to) this peer
And the configuration for the client:
[Interface]
PrivateKey =
Address = 10.70.88.101/24
[Peer]
PublicKey =
AllowedIPs = 10.10.16.0/24, 10.70.88.0/24
Endpoint = <Public IP>:443
Again, it contains two sections, Interface and Peer.
- Interface: the main configuration
  - PrivateKey: the client's private key, generated with wg genkey
  - Address: the static IP set on the virtual interface
- Peer: the configuration of the peer (server)
  - PublicKey: the server's public key, derived with wg pubkey
  - AllowedIPs: the list of subnets to be routed via the virtual interface
  - Endpoint: the public IP of the server
Simple, right? Basically the same as the server, but reversed.
Let's put this to work:
variant: fcos
version: 1.6.0
storage:
  files:
    - path: /etc/sysctl.d/90-ipv4-ip-forward.conf
      mode: 0644
      contents:
        inline: |
          net.ipv4.ip_forward = 1
    - path: /etc/wireguard/wg0.conf
      mode: 0600
      contents:
        inline: |
          [Interface]
          PrivateKey =
          Address = 10.70.88.1/24
          ListenPort = 443
          PostUp = iptables -A FORWARD -i wg0 -j ACCEPT; iptables -t nat -A POSTROUTING -o enX0 -j MASQUERADE
          PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; iptables -t nat -D POSTROUTING -o enX0 -j MASQUERADE
          [Peer]
          PublicKey =
          AllowedIPs = 10.70.88.101/32
systemd:
  units:
    - name: wg-quick@wg0.service
      enabled: true
That's it, the server is configured.
Here is the GitHub repository with this specific code: github
| Cost | $0.015/hour |
| Cost under free tier | free (250 hours per month) |
| Time to provision lab | a minute or two |
| Time to destruct lab | a minute or two |
| Commands to init lab | around ten |
| Commands to create/destruct lab | one |
cost calc: (0.0047 (t3a.nano) × 3) + 0.0007 (public IP)
The cost hasn't increased. The number of commands to init the lab went up a lot, but that can easily be fixed.
Now we can directly access the internal IPs of the VPC. Awesome.
Back to Scenario
Now that we have instances and some idea of how to provision the lab, we can finally focus on the Kubernetes part of the scenario. As our scenario is a bare-bones Kubernetes cluster, let’s hit the source of Kubernetes.
Depending on our understanding of Kubernetes, we will either jump straight to setting it up or just analyze how it works on a theoretical level.
I don't want to explain Kubernetes here, but given the complexity and variety bound up in the target (Kubernetes), I'm forced to create a level playing field. Why did I even pick it…
Anyways, Kubernetes was created to address issues that would emerge during the development and operation of applications. Basically, when we want to develop or run an application, we’re setting up a bunch of resources that will be used by that application, for example: DNS records, SSL certificates, compute, operating systems, networking, etc.
Without Kubernetes, we configure all of these separately; most of the time, they have no direct relationship with each other and are controlled by inherently different actors (network, systems, security, etc.).
This caused a lot of issues where the development and distribution of applications slowed to a halt, with plenty of finger-pointing. Kubernetes addresses this by introducing a platform where each such resource is managed by a centralized system that exposes an API to control the process as needed.
How Kubernetes achieves this is through complex distributed systems following an outline created by the Kubernetes team. This outline has changed a lot over the years, but the idea remains the same.
Now, with little to no knowledge, let’s try to spin it up. The documentation recommends the kubeadm utility. This utility runs on a node, takes some parameters, and spins up a cluster for us. Nice, right? Well sure, but before I start my rant, let me actually spin up a cluster this way.
Kubernetes' database (etcd) has a quorum-based architecture, meaning we need an odd number of instances. Let's address that first and, while we're at it, move to variables and loops to reduce duplicated Terraform code.
variable "control-instances" {
  type = map(any)
  default = {
    control-01 = { ip = "10.10.16.11" },
    control-02 = { ip = "10.10.16.12" },
    control-03 = { ip = "10.10.16.13" },
  }
}

resource "aws_instance" "control" {
  for_each      = var.control-instances
  ami           = local.ami
  instance_type = "t2.micro"
  key_name      = local.ssh_key

  network_interface {
    network_interface_id = aws_network_interface.control_int[each.key].id
    device_index         = 0
  }
}

resource "aws_network_interface" "control_int" {
  for_each        = var.control-instances
  subnet_id       = aws_subnet.control_subnet.id
  security_groups = [aws_security_group.control_sec.id]
  private_ips     = [each.value["ip"]]
}
Before running kubeadm, we have a set of prerequisites to handle:
- Swap needs to be off
- kubeadm expects a container runtime to be present
- Network access has to be granted between nodes
- We have to install the Kubernetes binaries: kubelet and kubeadm
- Nodes should be able to resolve each other by hostname
- A load balancer for the API
Lucky for us, CoreOS is an operating system optimized for containers.
Swap is off by default; the container runtime is already present as the containerd package (I will use cri-o); and access is controlled via security groups.
This leaves us with the load balancer, Kubernetes binaries, and name resolution.
We can set up the load balancer like the VPN/NAT service or simply use AWS’s ELB. As the ELB service doesn’t add much cost, I’ll go with that.
resource "aws_security_group" "kube_lb" {
  name        = "kube_lb"
  description = "sec group for the lb server"
  vpc_id      = aws_vpc.lab.id

  tags = {
    Name = "kube_lb"
  }
}

resource "aws_vpc_security_group_ingress_rule" "kube_lb_allow_kapi" {
  security_group_id = aws_security_group.kube_lb.id
  cidr_ipv4         = "10.10.16.0/24"
  from_port         = 6443
  ip_protocol       = "tcp"
  to_port           = 6443
}

resource "aws_vpc_security_group_ingress_rule" "kube_lb_allow_kapi_p" {
  security_group_id = aws_security_group.kube_lb.id
  cidr_ipv4         = "10.10.31.0/24"
  from_port         = 6443
  ip_protocol       = "tcp"
  to_port           = 6443
}

resource "aws_vpc_security_group_egress_rule" "kube_lb_allow_outbound" {
  security_group_id = aws_security_group.kube_lb.id
  cidr_ipv4         = "10.10.16.0/20"
  ip_protocol       = "-1"
}

resource "aws_lb" "kube" {
  name               = "kube"
  load_balancer_type = "network"
  internal           = true
  security_groups    = [aws_security_group.kube_lb.id]

  subnet_mapping {
    subnet_id            = aws_subnet.instance_subnet.id
    private_ipv4_address = "10.10.16.10"
  }
}

resource "aws_lb_target_group" "kube-control" {
  name            = "kube-control"
  port            = 6443
  protocol        = "TCP"
  target_type     = "ip"
  vpc_id          = aws_vpc.lab.id
  ip_address_type = "ipv4"

  health_check {
    port     = 6443
    protocol = "TCP"
  }
}

resource "aws_lb_listener" "kube_api" {
  load_balancer_arn = aws_lb.kube.arn
  port              = 6443
  protocol          = "TCP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.kube-control.arn
  }
}

resource "aws_lb_target_group_attachment" "kube-controlers" {
  for_each         = aws_network_interface.control
  target_group_arn = aws_lb_target_group.kube-control.arn
  target_id        = each.value.private_ip
  port             = 6443
}

For name resolution, we can either spin up a DNS service or use AWS Route 53.
AWS Route 53 is cheap, but it's charged at a flat rate (I think $0.50?) per month once the first 24 hours are up (within the first 24 hours, a zone is free and considered a test). Still, the cost can add up (I think?) considering how many times we'll create and tear the lab down.
This can be addressed in two ways: either have a permanent DNS zone and tell Terraform to simply add or remove records under it, or remove any zones before the 24-hour mark is up.
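For the first option, the Terraform side is only a record per node under a zone that lives outside the lab's state; the zone name below is a made-up placeholder, not something from this lab:

```hcl
# Look up a permanently kept private zone (name is illustrative).
data "aws_route53_zone" "lab" {
  name         = "lab.example.internal."
  private_zone = true
}

# Only the records are created and destroyed with the lab.
resource "aws_route53_record" "control" {
  for_each = var.control-instances
  zone_id  = data.aws_route53_zone.lab.zone_id
  name     = each.key
  type     = "A"
  ttl      = 300
  records  = [each.value["ip"]]
}
```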
Spinning up a DNS service is simple; it can be done much like the NAT/VPN setup. But there's actually a third option: simply generate a hosts file and distribute it. I'll use that one here.
variant: fcos
version: 1.6.0
storage:
  files:
    - path: /etc/hosts
      mode: 0644
      overwrite: true
      contents:
        inline: |
          127.0.0.1 localhost
          10.10.16.10 api.kubelius
          10.10.16.11 control-01
          10.10.16.12 control-02
          10.10.16.13 control-03
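If the node list changes often, the same file can be rendered from the Terraform variables instead of being written by hand; `hosts.tftpl` here is a hypothetical template file, not something from the repository:

```hcl
# hosts.tftpl would contain something like:
#   127.0.0.1 localhost
#   10.10.16.10 api.kubelius
#   %{ for name, node in nodes ~}
#   ${node.ip} ${name}
#   %{ endfor ~}
locals {
  hosts_file = templatefile("${path.module}/hosts.tftpl", {
    nodes = var.control-instances
  })
}
```

The rendered string can then be dropped into the Butane config instead of the hand-written inline block.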
Installing binaries is a weird experience in CoreOS. The operating system is designed as an immutable system; installing things isn’t as simple as running apt or dnf install.
CoreOS recommends not modifying the OS itself and instead running everything as a layer on top; in simpler words, as containers.
But we're talking about low-level tools that will later allow us to manage containers, so we have to install these packages directly. This can be done either by rebuilding the base image (similar to a Dockerfile) or by using rpm-ostree to layer packages onto the deployed image.
The rpm-ostree option is simpler in this case, so let's do that.
I don’t want to run the install command manually, let’s put it in an Ignition config. This is actually kind of tricky, the Butane configs aren’t meant to be used for dynamic automation; they’re more static in nature, like creating files, configuring OS services, and so on.
Installing a package is dynamic, we need to run a command that will dynamically analyze dependencies and pull them from the internet.
As a workaround, we can generate a script with our command in it, then create a systemd oneshot service that calls our install script.
A oneshot service is meant to run a command that exits after its job is done. To make sure it only runs a single time, we can introduce some persistent state that checks before the script is executed and skips it if needed, for example, creating a file after our script finishes installation and telling the oneshot service to run only when that file is absent.
variant: fcos
version: 1.6.0
storage:
files:
- path: /etc/yum.repos.d/kubernetes.repo
mode: 0644
contents:
inline: |
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.33/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.33/rpm/repodata/repomd.xml.key
- path: /etc/hosts
mode: 0644
overwrite: true
contents:
inline: |
127.0.0.1 localhost
10.10.16.10 api.kubelius
10.10.16.11 control-01
10.10.16.12 control-02
10.10.16.13 control-03
- path: /etc/sysctl.d/kube.conf
mode: 0644
contents:
inline: |
net.ipv4.ip_forward=1
systemd:
units:
- name: rpm-ostree-install-kube.service
enabled: true
contents: |
[Unit]
Description=Layer with kubelet and crio
Wants=network-online.target
After=network-online.target
# We run before `zincati.service` to avoid conflicting rpm-ostree
# transactions.
Before=zincati.service
ConditionPathExists=!/var/lib/%N.stamp
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/rpm-ostree install -y --allow-inactive kubelet kubeadm crio
ExecStart=/bin/touch /var/lib/%N.stamp
ExecStart=/bin/systemctl --no-block reboot
[Install]
WantedBy=multi-user.target

Okay, this gives us all the prep, well, kind of. We haven’t told each instance what its hostname is. This is important because the nodes will advertise their hostname as their node name.
We’re providing a static file to all control nodes, meaning the hostname will be the same for all of them, we have to address this.
Well, what I’m about to introduce is kind of complicated, especially for something as simple as setting a hostname, but we’ll need to do similar stuff for other parts of the lab, might as well get it over with.
What am I talking about? Generating Ignition files. To generate an Ignition file, we write a human-readable declaration in YAML format, then feed this YAML file to the Butane tool, which generates a JSON manifest (Ignition).
We cannot ask Terraform to template the YAML, stop, wait for us to manually generate the Ignition files from the templated output, and then continue from there.
We have a few options here. The first should be to check if Terraform has an integration for Butane, and indeed, there are providers for it. Unfortunately for us, they are community providers.
The community provider for Terraform gives me anxiety. Why, you ask?
When we get the option of extending functionality via third-party additions, that option needs to be implemented exceptionally.
Think of it this way: when we use a third-party provider, we have to make sure it isn’t harmful, intentionally or unintentionally. How?
Well, here are a few ways I can think of:
Simply trust the source of the provider. For example, if a big company is behind the provider, like AWS, we can (kind of) trust it.
Or trust the publisher, in this case, HashiCorp, but that depends on how much responsibility the publisher takes for published content.
What do I mean? Well, if HashiCorp simply allows anyone to publish a provider and directs users to use it with caution because they take no responsibility for what’s published, then obviously that option is not viable.
What’s the alternative? Well, let’s look at other community stores. For example, Red Hat hosts the OpenShift Operator Marketplace. The marketplace has tiers of published operators: starting with Community, where Red Hat takes little to no responsibility; then Partners, where Red Hat takes responsibility for verifying the source of operators by partnering with them, probably putting them under contract with restrictions and responsibilities; and finally, the Certified operator, where Red Hat provides guidelines on how the operator needs to be implemented and tested, including review/audit from Red Hat.
We can also trust the engineering, meaning the platform where third-party code runs is engineered in such a way that the code cannot act maliciously without explicit permission.
When we install an application on our mobile device, we’re not worried about its code. Why? Because the application is constrained within the OS environment.
For example, on iOS, if an application wants to record our voice, it has to ask for permission, and when recording starts, iOS provides a visual indication that recording is taking place. The same goes for the camera, file access, and so on.
I have to clarify that I’m not familiar with how HashiCorp addresses this, but I can say this: with the little research and limited information HashiCorp provides on the topic, I’m not willing to use or recommend community providers.
So, what do I recommend? Well, as always, be a blockhead.
Terraform provides the External data source, which allows us to define a wrapper around shell scripts. It calls a local executable file, passes JSON data, and expects JSON data back.
How can we use it here? Simple: we can have Butane installed locally and call it via the External data source, passing the templated Butane config and expecting Ignition data back.
Now, this isn’t as simple as calling Butane and piping data, but it’s still straightforward.
What complicates things is the way the External data source passes and expects data back. It uses JSON, and JSON isn’t the standard format Unix-type tools use to pass data. Unix tools rely more on raw stdin/stdout data streams.
To use this with Butane (or any other program that doesn’t support JSON input/output), we first have to pass that data to something that will convert it to raw data, send it to Butane, capture the output, and return it to the External provider.
Let me demonstrate. First, we have to create a script.
#!/bin/bash
set -e
eval "$(jq -r '@sh "CONFIG64=\(.config64)"')"
BUTANE64=$(echo "$CONFIG64" | base64 -d | butane | base64)
jq -n --arg config "$BUTANE64" '{"base64":$config}'
This script accepts JSON, extracts the config via jq, pipes it to Butane, captures the output, and returns it as single-level JSON: {"base64":"ignition-data"}.
Now we can interact with this script from Terraform via the External data source, like this:
data "external" "ignition_control" {
for_each = var.control-instances
program = ["sh", "scripts/butane.sh"]
query = {
config64 = base64encode(templatefile("templates/control.tftpl", {
name = each.key
}))
}
}
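Before wiring it into Terraform, the contract can be sanity-checked locally. Here’s a sketch where `cat` stands in for butane, so only the JSON/base64 plumbing is exercised (with Butane installed, swap it back in):

```shell
# Simulate the External data source: send {"config64": ...} in, expect
# {"base64": ...} back. `cat` is a stand-in for butane.
payload=$(printf 'example-config' | base64)
result=$(printf '{"config64":"%s"}' "$payload" | {
  eval "$(jq -r '@sh "CONFIG64=\(.config64)"')"
  OUT64=$(echo "$CONFIG64" | base64 -d | cat | base64)
  jq -n --arg config "$OUT64" '{"base64":$config}'
})
echo "$result" | jq -r '.base64' | base64 -d   # prints: example-config
```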
Now we can introduce dynamic data into the Butane config.
variant: fcos
version: 1.6.0
storage:
files:
- path: /etc/hostname
mode: 0644
contents:
inline: |
${name}
And finally, feed it to the instance.
resource "aws_instance" "control" {
for_each = var.control-instances
ami = local.ami
instance_type = "t2.micro"
key_name = local.ssh_key
user_data_base64 = data.external.ignition_control[each.key].result.base64
network_interface {
network_interface_id = aws_network_interface.control_int[each.key].id
device_index = 0
}
}
Now, with the instances prepared, we can start the initialization of the cluster.
kubeadm init --config /etc/kube-cluster.config --ignore-preflight-errors=NumCPU,Mem
This will bootstrap the first node of the cluster. After the initial setup is done, it will output instructions to join the other control nodes.
Something like copying the certificate files to the other nodes and running the kubeadm join command with a token.
ssh core@10.10.16.12 "sudo mkdir -p /etc/kubernetes/pki/etcd/"
ssh core@10.10.16.13 "sudo mkdir -p /etc/kubernetes/pki/etcd/"
files=(
/etc/kubernetes/pki/ca.crt
/etc/kubernetes/pki/ca.key
/etc/kubernetes/pki/sa.key
/etc/kubernetes/pki/sa.pub
/etc/kubernetes/pki/front-proxy-ca.key
/etc/kubernetes/pki/front-proxy-ca.crt
/etc/kubernetes/pki/etcd/ca.key
/etc/kubernetes/pki/etcd/ca.crt
)
mkdir -p /tmp/rsyncc/etc/kubernetes/pki/etcd/
for i in "${files[@]}"; do
rsync --rsync-path="sudo rsync" core@10.10.16.11:$i /tmp/rsyncc$i
rsync --rsync-path="sudo rsync" /tmp/rsyncc$i core@10.10.16.12:$i
rsync --rsync-path="sudo rsync" /tmp/rsyncc$i core@10.10.16.13:$i
done
Now we can run the join command on the other control nodes:
kubeadm join api.kubelius:6443 --token <token> --discovery-token-ca-cert-hash <sha> --control-plane --ignore-preflight-errors=NumCPU,Mem
Retrieve the super-admin kubeconfig:
# add to the local hosts file:
# 10.10.16.10 api.kubelius
rsync --rsync-path="sudo rsync" core@10.10.16.11:/etc/kubernetes/super-admin.conf /tmp/kubelius.conf
export KUBECONFIG=/tmp/kubelius.conf
And voila, we have a Kubernetes cluster! Granted, it’s only the masters, no networking, ingress, UI, etc… but it’s a fully functional Kubernetes cluster.
The other components are additions to Kubernetes, not part of what makes Kubernetes itself.
| Cost | $0.045/hour |
| Cost under free tier | free (180 hours per month) |
| Time to provision lab | a minute or two |
| Time to destroy lab | five minutes |
| Commands to init lab | about ten |
| Commands to create/destroy lab | single |
Cost calc: (0.0047 (t3a.nano) × 4) + 0.0007 (public IP) + 0.025 (ELB)
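For reference, the hourly figure works out as follows (prices are the on-demand assumptions above):

```shell
# Four t3a.nano instances, one public IP, one load balancer; matches the calc above.
awk 'BEGIN { printf "%.4f\n", 0.0047 * 4 + 0.0007 + 0.025 }'   # prints: 0.0445
```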
Rant
Let’s start by addressing the obvious annoyance of running commands on nodes to provision a cluster. It gets annoying so fast, and every time I’ve had a similar setup, I’ve dreaded starting it up.
Of course, we can automate kubeadm commands, but automating commands on Linux is such a hassle. The cluster will become unstable, not provision correctly, or not function consistently, in short, it won’t solve the underlying issue.
Leaving all of that aside, what did we just provision? How does it work? How are nodes talking to each other? Where is the data stored? How are nodes authenticated?
Provisioning the lab certainly didn’t provide any of these details. Creating the infrastructure gave us some information, but to troubleshoot it? To discuss and describe how it’s working? Forget it, when an issue arises, we’ll be left scratching our heads.
This is my main issue with engineers learning systems: we provisioned Kubernetes but have no idea how it works. kubeadm stole precious experience from us.
Now, let’s dwell in blockheadedness. This is optional, but what I like to do first is hit a hard tabletop with my head once or twice, just to make the wall-hitting sessions a bit easier down the line.
Now, with a newly acquired (optional) slight head curve, let’s address the elephant of abstraction: what is Kubernetes, and how does it work?
Kubernetes is an API service that takes requests, stores them in a database, and directs the operating system to complete the requested operation.
For example, if we ask Kubernetes to run an application, it will take that request via the REST API, record it in etcd, and tell the OS to run a container.
How does Kubernetes do this?
By running a set of small applications, each handling its own specific role. Some of these applications are maintained by the Cloud Native Computing Foundation (the Kubernetes team), while others are third-party tools, either directly used by Kubernetes or later standardized by the community as part of the Kubernetes platform.
The list of applications is entirely dependent on us and what we want from it, but for our bare-bones cluster, it comes down to: etcd, kube-apiserver, kubelet, cri-o, scheduler, and controller(s).
All of this can be found in the official documentation.
Now, with this information in mind, let’s talk about provisioning the cluster.
Previously, we used the kubeadm tool. So, what role does kubeadm play in provisioning the cluster?
Well, I could explain it in a few words, but where’s the fun in that? Instead, listen to my unintelligent, borderline deranged rant.
kubeadm automates the provisioning of Kubernetes by running prechecks, configuring the OS, generating certificates, pulling binaries, configuring services, etc.
Why is it needed?
Tricky question here. kubeadm is not needed to create a cluster, but it is needed, and was created (in my opinion), to address the shitty community of engineers.
What am I on about?
Well, let me tell you a tale. A tale of a young technology, freshly reheated by Google and presented to the world as a new hot cake.
When this microwaved youngling was becoming popular, the term “native Kubernetes” also became popularized.
“Native Kubernetes” was used by wannabe engineers who didn’t understand what Kubernetes was but refused to shut up about it.
Think about it this way: I listed a bunch of applications and claimed that they were enough to build a cluster, which is true.
Based on that list, what makes a Kubernetes cluster “native Kubernetes”?
If I swapped etcd with another database, would it still be Kubernetes?
What about the scheduler?
Leaving swapping things aside, what about adding things?
The cluster is bare bones, right? We want networking. What options do we have?
How about Cilium, Calico, Flannel, OVN, Multus…
Which one of these is the so-called “native Kubernetes”?
What about the UI? We can install the Kubernetes Dashboard, Rancher, OpenShift Console, K9s… which one of these is “native”?
Stupid, right? The term “native Kubernetes” was never anything logical, it was just engineers with little to no understanding talking about it.
These days, “native Kubernetes” has slowly lost attention. There are a lot of reasons for this, mainly because the community got better, but the CNCF certainly hasn’t just been sitting around.
kubeadm is one of the pushes CNCF put out.
What am I talking about?
Well, all this “native Kubernetes” talk was hurting Kubernetes.
I had direct experience with this from the standpoint of a company partnered with Red Hat, trying to sell the OpenShift solution.
When we went to a client, they always claimed to prefer “native Kubernetes” instead of OpenShift and didn’t want to get locked in.
Weirdly, their definition of “native Kubernetes” was another distribution, mainly community ones.
This was hurting sales of OpenShift and other enterprise-grade Kubernetes solutions, which in turn hurt the Kubernetes project itself.
You may ask: why would the sales of OpenShift and others hurt Kubernetes?
Well, Kubernetes is a community project maintained by the Cloud Native Computing Foundation (CNCF).
Who are the contributors, and donors?
Companies like Google, Red Hat, Microsoft, IBM, Amazon, etc.
The list can be found here.
How did CNCF address this? Well, I’m sure they took several approaches, but here’s one of them:
They introduced the Certified Kubernetes.
The program basically allowed CNCF to define guidelines for what a distribution should support to qualify as “Certified Kubernetes.”
As a vendor, you have to follow these guidelines and present your distribution for certification.
How did this change the situation?
CNCF basically said that if you hold the “Certified Kubernetes” badge, you’re able to host or migrate any application that runs on the so-called “native Kubernetes.”
This allowed sales teams to reference the program whenever a client brought up “native Kubernetes.”
Another way “native Kubernetes” was hurting the project was through misattribution of effort.
Instead of Kubernetes being discussed for what it was, engineers were talking about community distributions.
This happened because setting up Kubernetes was complicated, and engineers who Googled “install Kubernetes” would find these distributions first instead of Kubernetes itself.
The companies behind these distributions used all that influence for their own benefit.
What do I mean?
Well, go back to the list of contributors.
Do tell, where are these popular community distributions on the list?
They’re not there. I guess they didn’t have the resources to contribute to the project itself.
No, no, they were too busy selling enterprise support or feature “enhancements” as a license or subscription.
Leaving contributions aside, the influence was downright atrocious.
For example: why would clients claim that this community distribution was “native Kubernetes” while its competitors were not?
Why is NGINX mostly referenced as a Kubernetes ingress when NGINX was, and is, a web server?
How about Docker being used as a runtime for a while, even when its maintainers refused to fix architectural issues and outright rejected commits from parties they saw as competition?
It was just a coincidence, right?
kubeadm came into play to solve the issue of creating a Kubernetes cluster with no branding or distribution attached to it, and to bootstrap a cluster that followed the CNCF guidelines, kind of like having your own Certified Kubernetes cluster.
kubeadm is not meant to be used as a tool to set up a production-ready cluster.
Even the CNCF documentation states as much, quoting:
“We expect higher-level and more tailored tooling to be built on top of kubeadm, and ideally, using kubeadm as the basis of all deployments will make it easier to create conformant clusters.”
The certification program and kubeadm solved both issues: misinformation controlling the narrative of what Kubernetes was, and third-party distributors injecting their influence into the community.
Now, you can install Kubernetes directly from source, meaning the Kubernetes project gets more visibility, instead of some random startup that just packaged Kubernetes with its branding.
And if distributions want to be “Certified,” they either have to go through the certification program or use kubeadm as a base and build on top of it, allowing CNCF to influence a wider part of the community, either through the certification program or to kubeadm.
I want to rant more, but I’ve digressed enough as is. Back to provisioning the cluster.
kubeadm is not needed to create a cluster, it just makes creation easier for someone who doesn’t know or doesn’t want to know what Kubernetes is.
But it is definitely not needed to operate the cluster itself.
What we need are the applications I listed previously. Let me provide more details on them.
API Server, maintained by the CNCF.
This application provides the HTTP API for the Kubernetes cluster.
It is a single binary that connects to etcd and exposes an HTTPS interface.
This is the application we talk to on port 6443. It has one job: to handle API requests.
It doesn’t even handle the authentication of the call, it can check if a request is authorized, but it will not generate a token or anything like that.
etcd datastore, a key-value database developed by the CoreOS team and later transferred to the CNCF.
The application itself has nothing to do with Kubernetes; Kubernetes simply uses it as its database.
This is where the API server stores data such as ConfigMaps, Deployments, Secrets, etc.
Kube Controller, a set of “controller” applications.
Each controller tracks its respective Kubernetes objects, either taking action to meet the desired state or simply updating the object’s state.
Controllers take action either by communicating with the API server or by directly calling a third-party service.
For example, EndpointSlice is a controller responsible for updating Pod IPs for a Service. It just sits there, watching and updating the list as needed, all of this is done via the Kubernetes API.
Most controllers do the same.
An example of a controller that talks to a third-party service is the Cluster Autoscaler.
If we are scaling the cluster using a controller, that controller must talk to an external service that provisions compute resources outside Kubernetes.
The CNCF provides core controllers as a single binary connecting to the API server.
Kube Scheduler, another single binary.
It watches for Pods with no assigned nodes, then selects a node based on factors such as available compute and labels.
Kubelet, an agent-type binary running on each Kubernetes node.
Kubelet watches Kubernetes objects and checks if the scheduler has assigned a container to the node it represents. If so, it runs the container on that node.
It also exposes an API service that is used when we request logs from a running Pod or execute a command inside it.
Kubelet runs containers by communicating with the container runtime.
In previous versions, this was done by essentially running Docker commands on the host.
Later, as the solution matured, a standardized way of communication was created, this standardized communication is called the Container Runtime Interface (CRI).
It’s basically a common language for container runtimes to understand.
Because of this, which runtime we use doesn’t matter to Kubelet, they just need to understand CRI calls.
Container Runtime, the application that actually runs containers, such as containerd (the runtime underneath Docker) or CRI-O.
This is enough to run a bare-bones cluster.
Yes, it really is. With this, we can have a Kube API that accepts kubectl calls and can even run containers when requested.
We won’t have any networking, interfaces, authorization, or other advanced features, but it will be a Kubernetes cluster.
Kube, BlockHead Edition
Kubernetes is a set of small applications, let’s run these applications as we would run any other application.
Starting with the database, etcd. We can set up etcd as we wish, the only requirement from Kubernetes standpoint is an up-and-running service for the API server.
We can run it as a systemd service or even outside of Kubernetes. It doesn’t really matter, well, it does, but that depends on what we want to create.
If we’re designing a high-performance etcd cluster, we might even run it on bare metal.
But since we’re doing a lab with an emphasis on cheapness, and we’re using CoreOS as the operating system, the only good option is to run etcd as containers.
But here’s the thing: we’re creating Kubernetes to host containers, do we have to run this outside of Kubernetes?
Well, we can, but we don’t want to. We already have Kubernetes as a single interface to manage our resources, why would we want to introduce an element that won’t be controlled by Kubernetes?
So yes, we do want to have this inside Kubernetes. And why can’t we do it?
Kubernetes doesn’t actually run containers itself, the container runtime does. So, in theory, we should be able to run a container and somehow make Kubernetes see this container.
What I did here is a very good approach to learning a new system.
I analyzed the task, in this case, running etcd in Kubernetes, which, at the end of the day, is just a container running on Linux, like Docker.
Then I reasoned: if Kubernetes itself doesn’t run containers but the container runtime does, I should be able to run the container with the runtime and later make Kubernetes aware of it.
Because we broke Kubernetes down into components, we’re able to imagine how it could work even without specific experience.
Luckily for us, this is not complicated at all.
What we want to do is relatively common in Kubernetes. During a simple search or bit of research, we’re bound to learn how Kubernetes handles this.
Static Pods are a feature of Kubelet that allows it to run a Pod (container) regardless of the Kubernetes cluster’s status, for example, if Kubelet is unable to reach the API or, in our case, if Kubernetes doesn’t yet exist.
To run etcd as a static Pod, we simply have to create a manifest file directly on the node where we want the Pod to run.
This is also how kubeadm provisions etcd.
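For reference, the directory Kubelet watches for static Pod manifests is set by `staticPodPath` in its configuration; a minimal sketch (the path shown is the conventional kubeadm default, assumed here):

```yaml
# Kubelet configuration fragment: any manifest dropped into this
# directory is run as a static Pod, API server or not.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests
```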
This is another good trick to keep in mind when learning new systems: we’re trying to create a cluster from scratch without using provisioning tools, we’re doing this to avoid the tool doing things for us, but we can and should analyze what the tool does.
It already accomplishes the task, we can just observe how it approaches it. This will give us more insight into the inner workings of the application, or we can simply copy its approach if it works for us.
Now that we know how and where to run etcd, let’s talk about the application itself.
We can research each application separately, in this case, etcd.
This might feel awful, but these applications are very common; they’ll keep showing up. Better to learn about them now.
etcd is a key-value datastore. The application forms a cluster with a quorum method, where the cluster is made up of an odd number of members, selecting a leader by majority. Quorum is part of a design to avoid a split-brain scenario, where connection errors between cluster members could cause multiple leaders to be active, thus corrupting data.
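The quorum arithmetic is worth spelling out: a majority of n members is floor(n/2) + 1, which is why odd cluster sizes are preferred (a generic sketch, not tied to etcd itself):

```shell
# Majority quorum per cluster size; an even member count buys no extra
# failure tolerance over the odd size below it.
for n in 1 3 5; do
  awk -v n="$n" 'BEGIN {
    q = int(n / 2) + 1
    printf "%d members: quorum %d, tolerates %d failure(s)\n", n, q, n - q
  }'
done
```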
etcd exposes two interfaces:
- REST API, exposed on 2379; this is the interface where clients (e.g., the kube-apiserver) connect to read/write data.
- Cluster API, exposed on 2380; this is the interface used by cluster members to exchange data and form quorum.
Communication to etcd is encrypted using TLS/SSL certificates. Authentication of calls is also handled using client TLS certificates.
All of this can be found in etcd’s official documentation.
apiVersion: v1
kind: Pod
metadata:
annotations:
kubeadm.kubernetes.io/etcd.advertise-client-urls: https://${ip}:2379
creationTimestamp: null
labels:
component: etcd
tier: control-plane
name: etcd
namespace: kubelius-etcd
spec:
containers:
- command:
- etcd
- --advertise-client-urls=https://${ip}:2379
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --client-cert-auth=true
- --data-dir=/var/lib/etcd
- --experimental-initial-corrupt-check=true
- --experimental-watch-progress-notify-interval=5s
- --initial-advertise-peer-urls=https://${ip}:2380
- --initial-cluster=control-01=https://10.10.16.11:2380,control-02=https://10.10.16.12:2380,control-03=https://10.10.16.13:2380
- --key-file=/etc/kubernetes/pki/etcd/server.key
- --listen-client-urls=https://127.0.0.1:2379,https://${ip}:2379
- --listen-metrics-urls=http://127.0.0.1:2381
- --listen-peer-urls=https://${ip}:2380
- --name=${name}
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-client-cert-auth=true
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
- --snapshot-count=10000
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
image: ${images.etcd-image}
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 8
httpGet:
host: 127.0.0.1
path: /livez
port: 2381
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 15
name: etcd
readinessProbe:
failureThreshold: 3
httpGet:
host: 127.0.0.1
path: /readyz
port: 2381
scheme: HTTP
periodSeconds: 1
timeoutSeconds: 15
resources:
requests:
cpu: 100m
memory: 100Mi
startupProbe:
failureThreshold: 24
httpGet:
host: 127.0.0.1
path: /readyz
port: 2381
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 15
volumeMounts:
- mountPath: /var/lib/etcd
name: etcd-data
- mountPath: /etc/kubernetes/pki/etcd
name: etcd-certs
hostNetwork: true
priority: 2000001000
priorityClassName: system-node-critical
securityContext:
seccompProfile:
type: RuntimeDefault
volumes:
- hostPath:
path: /etc/kubernetes/pki/etcd
type: DirectoryOrCreate
name: etcd-certs
- hostPath:
path: /var/lib/etcd
type: DirectoryOrCreate
name: etcd-data

That’s a lot of parameters, so let me point out a few important ones:
- trusted-ca-file: points to the file containing the trust chain for client certificates; used to verify client certificates presented on port 2379.
- peer-trusted-ca-file: points to the file containing the trust chain for peer client/server certificates; used to verify peer certificates presented on port 2380.
- peer-key-file, peer-cert-file: the peer certificate and key this member presents to authenticate itself when connecting to other peers.
- cert-file, key-file: point to the server certificate and key presented to clients on port 2379.
- mountPath: the folder on the node where etcd’s data will be stored.
Now that we know what etcd requires, let’s talk about how to provide them, mainly certificates. kubeadm generates them for us, but we don’t have it here.
No worries, we can use Terraform to generate our certificates. I am not going to get into the details of certificates, but I’ll try to create a general idea. Also, I would like to redirect interested personnel to my other post that is dedicated to certificates in Kubernetes.
SSL certificates are a pair of private and public keys:
- The public key holds public information on the certificate, such as who the certificate belongs to and who issued it. The public key is also responsible for encrypting information using a one-way function, meaning data encrypted with the public key is not decryptable with the public key, even by the one who encrypted it.
- The private key holds the key capable of decrypting the data encrypted with the public key.
A public/private key pair allows us to connect to a service, retrieve the public key, and encrypt data with it.
For example, we can create a secret, encrypt it with the public key, and send it to the server. Because encrypted data is only decryptable with the private key, we can safely send this over the internet. When the service receives the information, it will decrypt it, read the secret, and now that both parties have a shared secret, they can safely communicate using encryption.
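That exchange can be replayed with openssl; a throwaway key pair in a temp directory (file names here are just for the demo):

```shell
# Generate a private key, derive its public half, then round-trip a secret:
# encrypt with the public key, decrypt with the private key.
tmp=$(mktemp -d)
openssl genrsa -out "$tmp/key.pem" 2048 2>/dev/null
openssl rsa -in "$tmp/key.pem" -pubout -out "$tmp/pub.pem" 2>/dev/null
printf 'shared-secret' | openssl pkeyutl -encrypt -pubin -inkey "$tmp/pub.pem" -out "$tmp/ct.bin"
secret=$(openssl pkeyutl -decrypt -inkey "$tmp/key.pem" -in "$tmp/ct.bin")
echo "$secret"   # prints: shared-secret
rm -rf "$tmp"
```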
A Certificate Authority (CA) comes into play to address the issue of verifying the receiver.
We want to make sure who we are talking to is the entity we think it is.
If we’re logging in to mobile banking, our credentials are encrypted and sent to the service. If the service is not actually our bank, then we’ve just handed our credentials to a third party.
The CA solves this by creating a process to verify the source and sign their public key.
Something along these lines of: entity, call it Bank Corp, generates a private key and, with that private key, generates a Certificate Signing Request (CSR) where it seeks to identify itself as bank.com.
This CSR is sent to a CA, which takes responsibility for verifying the requester and their claim to identity (sending a verification email to bank.com, calling them, etc.). After verification is done, the CA issues a certificate containing the public key that includes the CA’s signature. When we access bank.com, it presents that certificate; we can check the issuer (signer) and verify whether that issuer is a trusted authority for us.
None of this is complicated, we can simply go with a demonstration; it should click right away.
First, we have to generate a CA. We can have a dedicated CA for etcd or share one with Kubernetes (which also requires a CA). Each choice has its own pros/cons. There is also a middle ground where we create a sub-CA called an intermediate.
From now on, I will try to minimize certificate talk. If you require more info on certificates, feel free to check my other post or research them directly, they are very important and show up everywhere.
resource "tls_private_key" "root-ca" {
algorithm = "RSA"
rsa_bits = 4096
}
resource "tls_self_signed_cert" "root-ca" {
private_key_pem = tls_private_key.root-ca.private_key_pem
is_ca_certificate = true
subject {
common_name = "root-ca"
}
validity_period_hours = 43830 #5 Years.
allowed_uses = [
"cert_signing",
"crl_signing",
"digital_signature",
]
}
Now we can generate an intermediate CA for etcd and sign it with the root.
resource "tls_private_key" "etcd-ca" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

resource "tls_cert_request" "etcd-ca" {
  private_key_pem = tls_private_key.etcd-ca.private_key_pem

  subject {
    common_name = "etcd-ca"
  }
}

resource "tls_locally_signed_cert" "etcd-ca" {
  cert_request_pem      = tls_cert_request.etcd-ca.cert_request_pem
  ca_private_key_pem    = tls_private_key.root-ca.private_key_pem
  ca_cert_pem           = tls_self_signed_cert.root-ca.cert_pem
  is_ca_certificate     = true
  validity_period_hours = 26298

  allowed_uses = [
    "cert_signing",
    "crl_signing",
    "digital_signature",
  ]
}
And with the intermediate CA available to us, we can generate etcd’s certificates. Let’s split them in two parts: server and peers.
resource "tls_private_key" "etcd-server" {
  for_each  = var.control-instances
  algorithm = "RSA"
  rsa_bits  = 2048
}

resource "tls_cert_request" "etcd-server" {
  for_each        = var.control-instances
  private_key_pem = tls_private_key.etcd-server[each.key].private_key_pem

  subject {
    common_name  = each.key
    organization = "Kubelius totallyacorp"
  }

  ip_addresses = [each.value["ip"], "127.0.0.1"]
  dns_names    = [each.key]
}

resource "tls_locally_signed_cert" "etcd-server" {
  for_each              = var.control-instances
  cert_request_pem      = tls_cert_request.etcd-server[each.key].cert_request_pem
  ca_private_key_pem    = tls_private_key.etcd-ca.private_key_pem
  ca_cert_pem           = tls_locally_signed_cert.etcd-ca.cert_pem
  validity_period_hours = 8766

  allowed_uses = [
    "digital_signature",
    "server_auth",
    "client_auth",
    "key_encipherment",
  ]
}
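The ip_addresses and dns_names arguments above become the certificate’s Subject Alternative Names, which is what clients and peers actually match against; modern TLS stacks (Go included) ignore the CN for name matching. A quick self-signed stand-in shows the result (control-0 and 10.0.0.10 are a hypothetical node name and IP):

```shell
# A stand-in for the etcd server certificate: CN plus SAN entries.
openssl req -x509 -newkey rsa:2048 -nodes -keyout srv.key -out srv.crt \
  -subj "/CN=control-0" -days 1 \
  -addext "subjectAltName=DNS:control-0,IP:10.0.0.10,IP:127.0.0.1"

# Print the SAN list the TLS client will match against.
openssl x509 -in srv.crt -noout -ext subjectAltName
```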
resource "tls_private_key" "etcd-peer" {
  for_each  = var.control-instances
  algorithm = "RSA"
  rsa_bits  = 2048
}

resource "tls_cert_request" "etcd-peer" {
  for_each        = var.control-instances
  private_key_pem = tls_private_key.etcd-peer[each.key].private_key_pem

  subject {
    common_name = each.key
  }

  ip_addresses = [each.value["ip"]]
  dns_names    = [each.key]
}

resource "tls_locally_signed_cert" "etcd-peer" {
  for_each              = var.control-instances
  cert_request_pem      = tls_cert_request.etcd-peer[each.key].cert_request_pem
  ca_private_key_pem    = tls_private_key.etcd-ca.private_key_pem
  ca_cert_pem           = tls_locally_signed_cert.etcd-ca.cert_pem
  validity_period_hours = 8766

  allowed_uses = [
    "digital_signature",
    "server_auth",
    "client_auth",
    "key_encipherment",
  ]
}
Now we can inject them with Ignition.
%{ for path, value in certs ~}
    - path: ${path}
      mode: 0644
      contents:
        inline: |
          ${indent(10, base64decode(value))}
%{ endfor ~}
data "external" "ignition_control" {
  for_each = var.control-instances
  program  = ["sh", "scripts/butane.sh"]
  query = {
    config64 = base64encode(templatefile("templates/control.tftpl", {
      ip = each.value["ip"], name = each.key
      images = { etcd-image = var.static-pods.etcd-image }
      certs = {
        "/etc/kubernetes/pki/etcd/server.crt" = base64encode(tls_locally_signed_cert.etcd-server[each.key].cert_pem)
        "/etc/kubernetes/pki/etcd/server.key" = base64encode(tls_private_key.etcd-server[each.key].private_key_pem)
        "/etc/kubernetes/pki/etcd/peer.crt"   = base64encode(tls_locally_signed_cert.etcd-peer[each.key].cert_pem)
        "/etc/kubernetes/pki/etcd/peer.key"   = base64encode(tls_private_key.etcd-peer[each.key].private_key_pem)
        "/etc/kubernetes/pki/etcd/ca.crt"     = base64encode(tls_locally_signed_cert.etcd-ca.cert_pem)
      }
    }))
  }
}
variable "static-pods" {
  type = map
  default = {
    etcd-image = "registry.k8s.io/etcd:3.5.21-0",
  }
}
That’s it, we’ve got etcd. It will run on every controller node as a static Pod.
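Before moving on, the chain we just built (root signs the etcd intermediate, which signs node certificates) can be sanity-checked with plain openssl. The file names here are stand-ins for the Terraform resources above, not real outputs:

```shell
# Root CA, self-signed.
openssl req -x509 -newkey rsa:2048 -nodes -keyout root.key -out root.crt \
  -subj "/CN=root-ca" -days 1

# Intermediate: a CSR signed by the root, explicitly marked as a CA.
printf "basicConstraints=critical,CA:TRUE\n" > ca.ext
openssl req -newkey rsa:2048 -nodes -keyout etcd-ca.key -out etcd-ca.csr \
  -subj "/CN=etcd-ca"
openssl x509 -req -in etcd-ca.csr -CA root.crt -CAkey root.key \
  -CAcreateserial -extfile ca.ext -out etcd-ca.crt -days 1

# Leaf: a node certificate signed by the intermediate.
openssl req -newkey rsa:2048 -nodes -keyout node.key -out node.csr \
  -subj "/CN=control-0"
openssl x509 -req -in node.csr -CA etcd-ca.crt -CAkey etcd-ca.key \
  -CAcreateserial -out node.crt -days 1

# Trust only the root; the intermediate rides along as untrusted glue.
openssl verify -CAfile root.crt -untrusted etcd-ca.crt node.crt
# node.crt: OK
```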
Now let’s jump to the next application: the API server.
The API server is a single binary that listens on port 6443 and connects to etcd for data storage.
There are barely any rules we have to follow with the API server: it’s stateless, so there’s no need for clustering. It doesn’t write any data to disk and doesn’t care how or where it’s running.
So, let’s run it similar to etcd: a single replica per controller, running as a static Pod.
It’s basically the same formula as etcd, let me just go through the motions.
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint: ${ip}:6443
  creationTimestamp: null
  labels:
    component: kube-apiserver
    tier: control-plane
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-apiserver
    - --advertise-address=${ip}
    - --allow-privileged=true
    - --authorization-mode=Node,RBAC
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --enable-admission-plugins=NodeRestriction
    - --enable-bootstrap-token-auth=true
    - --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
    - --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
    - --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key
    - --etcd-servers=https://127.0.0.1:2379
    - --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt
    - --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key
    - --kubelet-preferred-address-types=Hostname,InternalDNS,InternalIP
    - --secure-port=6443
    - --service-account-issuer=https://kubernetes.default.svc.cluster.local
    - --service-account-key-file=/etc/kubernetes/pki/sa.pub
    - --service-account-signing-key-file=/etc/kubernetes/pki/sa.key
    - --service-cluster-ip-range=10.96.0.0/16
    - --tls-cert-file=/etc/kubernetes/pki/apiserver.crt
    - --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
    image: ${images.api-image}
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: ${ip}
        path: /livez
        port: 6443
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-apiserver
    readinessProbe:
      failureThreshold: 3
      httpGet:
        host: ${ip}
        path: /readyz
        port: 6443
        scheme: HTTPS
      periodSeconds: 1
      timeoutSeconds: 15
    resources:
      requests:
        cpu: 250m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: ${ip}
        path: /livez
        port: 6443
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ca-certs
      readOnly: true
    - mountPath: /etc/pki/ca-trust
      name: etc-pki-ca-trust
      readOnly: true
    - mountPath: /etc/pki/tls/certs
      name: etc-pki-tls-certs
      readOnly: true
    - mountPath: /etc/kubernetes/pki
      name: k8s-certs
      readOnly: true
  hostNetwork: true
  priority: 2000001000
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/ssl/certs
      type: DirectoryOrCreate
    name: ca-certs
  - hostPath:
      path: /etc/pki/ca-trust
      type: DirectoryOrCreate
    name: etc-pki-ca-trust
  - hostPath:
      path: /etc/pki/tls/certs
      type: DirectoryOrCreate
    name: etc-pki-tls-certs
  - hostPath:
      path: /etc/kubernetes/pki
      type: DirectoryOrCreate
    name: k8s-certs

The certificate-related flags break down as follows:

- --etcd-cafile: verifies etcd as a trusted endpoint by checking its certificate’s issuer against the CA certificate stored in the pointed file.
- --etcd-certfile, --etcd-keyfile: the client certificate used to authenticate with etcd.
- --kubelet-client-certificate, --kubelet-client-key: authenticate the API server’s requests sent to the kubelet endpoint.
- --service-account-key-file, --service-account-issuer, --service-account-signing-key-file: service account token signing and verification, used when Kubernetes generates service account tokens.
- --tls-cert-file, --tls-private-key-file: the certificate used for the API server’s HTTPS (6443) endpoint.
Let me get the intermediate out of the way.
resource "tls_private_key" "kubernetes-ca" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

resource "tls_cert_request" "kubernetes-ca" {
  private_key_pem = tls_private_key.kubernetes-ca.private_key_pem

  subject {
    common_name = "kubernetes-ca"
  }
}

resource "tls_locally_signed_cert" "kubernetes-ca" {
  cert_request_pem      = tls_cert_request.kubernetes-ca.cert_request_pem
  ca_private_key_pem    = tls_private_key.root-ca.private_key_pem
  ca_cert_pem           = tls_self_signed_cert.root-ca.cert_pem
  is_ca_certificate     = true
  validity_period_hours = 26298

  allowed_uses = [
    "cert_signing",
    "crl_signing",
    "digital_signature",
  ]
}
Well, pretty similar.
resource "tls_private_key" "service-account" {
  algorithm = "RSA"
  rsa_bits  = 2048
}
The etcd client certificate is similar to the one we created for etcd, but the only requirement is the client extension, whereas etcd itself required both client and server extensions.
resource "tls_private_key" "kube-apiserver-etcd-client" {
  algorithm = "RSA"
  rsa_bits  = 2048
}

resource "tls_cert_request" "kube-apiserver-etcd-client" {
  private_key_pem = tls_private_key.kube-apiserver-etcd-client.private_key_pem

  subject {
    common_name = "kube-apiserver-etcd-client"
  }
}

resource "tls_locally_signed_cert" "kube-apiserver-etcd-client" {
  cert_request_pem      = tls_cert_request.kube-apiserver-etcd-client.cert_request_pem
  ca_private_key_pem    = tls_private_key.etcd-ca.private_key_pem
  ca_cert_pem           = tls_locally_signed_cert.etcd-ca.cert_pem
  validity_period_hours = 8766

  allowed_uses = [
    "digital_signature",
    "client_auth",
    "key_encipherment",
  ]
}
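The client-only distinction is visible in the issued certificate’s extendedKeyUsage extension. A quick self-signed stand-in (the name is made up, and real issuance goes through etcd-ca above) shows what to look for:

```shell
# A client-only certificate: extendedKeyUsage carries clientAuth,
# but not serverAuth.
openssl req -x509 -newkey rsa:2048 -nodes -keyout client.key -out client.crt \
  -subj "/CN=kube-apiserver-etcd-client" -days 1 \
  -addext "extendedKeyUsage=clientAuth"

# Expect "TLS Web Client Authentication" and nothing about servers.
openssl x509 -in client.crt -noout -ext extendedKeyUsage
```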
And a client certificate for the kubelet.
resource "tls_private_key" "kube-apiserver-kubelet" {
  algorithm = "RSA"
  rsa_bits  = 2048
}

resource "tls_cert_request" "kube-apiserver-kubelet" {
  private_key_pem = tls_private_key.kube-apiserver-kubelet.private_key_pem

  subject {
    common_name  = "kube-apiserver-kubelet"
    organization = "system:masters"
  }
}

resource "tls_locally_signed_cert" "kube-apiserver-kubelet" {
  cert_request_pem      = tls_cert_request.kube-apiserver-kubelet.cert_request_pem
  ca_private_key_pem    = tls_private_key.kubernetes-ca.private_key_pem
  ca_cert_pem           = tls_locally_signed_cert.kubernetes-ca.cert_pem
  validity_period_hours = 8766

  allowed_uses = [
    "digital_signature",
    "client_auth",
    "key_encipherment",
  ]
}
And the Ignition part again.
data "external" "ignition_control" {
  for_each = var.control-instances
  program  = ["sh", "scripts/butane.sh"]
  query = {
    config64 = base64encode(templatefile("templates/control.tftpl", {
      ip = each.value["ip"], name = each.key
      images = { etcd-image = var.static-pods.etcd-image, api-image = var.static-pods.api-image }
      certs = {
        "/etc/kubernetes/pki/etcd/ca.crt"              = base64encode(tls_locally_signed_cert.etcd-ca.cert_pem)
        "/etc/kubernetes/pki/ca.crt"                   = base64encode(tls_locally_signed_cert.kubernetes-ca.cert_pem)
        "/etc/kubernetes/pki/ca.key"                   = base64encode(tls_private_key.kubernetes-ca.private_key_pem)
        "/etc/kubernetes/pki/apiserver-etcd-client.crt" = base64encode(tls_locally_signed_cert.kube-apiserver-etcd-client.cert_pem)
        "/etc/kubernetes/pki/apiserver-etcd-client.key" = base64encode(tls_private_key.kube-apiserver-etcd-client.private_key_pem)
        "/etc/kubernetes/pki/apiserver-kubelet-client.crt" = base64encode(tls_locally_signed_cert.kube-apiserver-kubelet.cert_pem)
        "/etc/kubernetes/pki/apiserver-kubelet-client.key" = base64encode(tls_private_key.kube-apiserver-kubelet.private_key_pem)
        "/etc/kubernetes/pki/sa.key"                   = base64encode(tls_private_key.service-account.private_key_pem)
        "/etc/kubernetes/pki/sa.pub"                   = base64encode(tls_private_key.service-account.public_key_pem)
        "/etc/kubernetes/pki/apiserver.key"            = base64encode(tls_private_key.kube-apiserver[each.key].private_key_pem)
        "/etc/kubernetes/pki/apiserver.crt"            = base64encode(tls_locally_signed_cert.kube-apiserver[each.key].cert_pem)
      }
    }))
  }
}
That’s it, we now have a Kube API and etcd.
Let’s keep going: the Scheduler.
The Scheduler is the simplest component so far. The previous entities had the responsibilities of a server. The Scheduler has no such responsibility, it’s a simple application that watches Kubernetes resources via the Kube API and takes action using the same API.
It’s similar to us running kubectl get with the --watch (-w) option and calling kubectl patch for update actions.
It even uses the same configuration as kubectl, I’m talking about the kubeconfig.
Running the Scheduler amounts to running the binary and passing the kubeconfig to it.
Let me demonstrate, following the same static Pod pattern.
The kubeconfig template:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ${ca64}
    server: https://${ip}:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: ${value.user}
  name: ${value.user}@kubernetes
current-context: ${value.user}@kubernetes
kind: Config
preferences: {}
users:
- name: ${value.user}
  user:
    client-certificate-data: ${value.cert}
    client-key-data: ${value.key}
Simple, right? It only requires a client certificate for authentication. Well, this is actually the interesting part.
The interesting part lies in the question: “What permissions will this scheduler have inside Kubernetes?”
We’re just generating a client certificate, which we’ve already done a few times.
If certificates are generated in the same way, how does Kubernetes distinguish between certificate roles?
Well, it’s actually very simple, let’s generate them first.
resource "tls_private_key" "scheduler" {
  algorithm = "RSA"
  rsa_bits  = 2048
}

resource "tls_cert_request" "scheduler" {
  private_key_pem = tls_private_key.scheduler.private_key_pem

  subject {
    common_name = "system:kube-scheduler"
  }
}
See the subject part? It contains the common_name field, this is how Kubernetes identifies a certificate.
This, along with the second optional field organization, defines identity and group membership.
When Kubernetes receives a client certificate, it looks for these two fields and maps them to common_name as the user and organization as the group.
From there, they fall under the RBAC rules.
If we want the Scheduler to have specific permissions, we can simply create a RoleBinding and target the client certificate as a user or group.
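The mapping is easy to see in the certificate itself: CN becomes the user, and each O entry becomes a group. A throwaway self-signed cert is enough to show the fields (lab:schedulers is a made-up group; in the lab this cert is signed by kubernetes-ca):

```shell
# CN -> Kubernetes user, O -> Kubernetes group.
openssl req -x509 -newkey rsa:2048 -nodes -keyout sched.key -out sched.crt \
  -subj "/O=lab:schedulers/CN=system:kube-scheduler" -days 1

# The subject line carries both identity fields Kubernetes reads.
openssl x509 -in sched.crt -noout -subject
```

A RoleBinding with subject kind User, name system:kube-scheduler (or kind Group, name lab:schedulers) would then attach permissions to this certificate.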
This should also raise another question: we haven’t created any permissions so far, so how would the certificates we’ve defined actually work?
Well, lucky for us, Kubernetes comes with default permissions. They’re just a bunch of Roles and RoleBindings created alongside the cluster.
With the interesting part out of the way, let’s wrap up the Scheduler, the rest of it is the same as the other two.
The Controller is the last one. It’s basically the same as the Scheduler, with the additional responsibility of managing the Certificate Authority.
Nothing special, really, we just hand over the intermediate’s private key.
This allows the Controller to sign additional certificates and rotate existing ones.
I’ll skip the demonstration for the Controller.
| Metric | Value |
| --- | --- |
| Cost | $0.045 (hour) |
| Cost under free tier | free (180 hours per month) |
| Time to provision lab | a minute or two |
| Time to destroy lab | five minutes |
| Commands to init lab | fiveish |
| Commands to create/destroy lab | single |

Cost calc: (0.0047 (t3a.nano) * 4) + 0.0007 (Public IP) + 0.025 (ELB) + 0.0? (S3)
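For anyone double-checking the math, the hourly figure is just that sum (S3 is effectively zero at this scale):

```shell
# Four t3a.nano instances, one public IP, one ELB.
awk 'BEGIN { printf "%.4f\n", 0.0047 * 4 + 0.0007 + 0.025 }'
# 0.0445
```

Which rounds up to the $0.045/hour in the table.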
This is pretty much it, the code above will provision a three-node cluster.
etcd will be there as well, kind of. The namespace is set to kubelius-etcd, which doesn’t exist yet. Just create it with:
kubectl create namespace kubelius-etcd
and etcd will show up under that namespace:
kubectl get pods -n kubelius-etcd
cool,
Time to talk benefits.
Creation time: it went down. Nice, but nothing significant. Granted, this depends on what we’re comparing it to.
If we compare it to the previous lab, then no, I wouldn’t claim that creation time went down; I could make the other lab provision in the same number of commands and in the same timeframe.
Now, if we compare this lab to other means of creation, then of course it’s a huge boon, especially considering how adding things to a lab usually increases both the effort and the time required for provisioning.
With this approach, having total control over changes keeps the manual creation effort and provisioning time to a bare minimum.
Cost, very low. We can make it even lower, but it really doesn’t matter at this point.
Persistency, with the IaC approach, we can basically recreate the same lab years later, with no cost for storing it.
Knowledge, the main advantage, really. Just think about how much understanding you gain when you approach labs this way.
In a kubeadm setup, can we troubleshoot:
- why a node isn’t able to establish a connection to other nodes?
- etcd not starting up, either due to connection or authorization issues?
- Pods being unable to be scheduled because the Scheduler has insufficient permissions?
- being unable to get logs from a Pod?
The list goes on.
Having no underlying understanding of the platform results in never-ending issues.
It’s funny how all the weird issues go out the window once you gain this kind of understanding, how the most devastating problems can be traced back to the silliest things, how you might end up on a late-night call with engineers digging in entirely the wrong place.
Creating this lab took time. I can’t really say how much time it actually took.
Doing labs this way is convoluted by nature, so many times I’ve started working toward something and ended up in a totally different subject.
This is normal and shouldn’t be avoided; this is how real knowledge is accumulated.
We have to understand the subject and map it to something we can specifically comprehend.
We have to answer our own questions over and over until all the questions are gone.
My presentation of this lab mirrors that approach, just asking questions repeatedly.
Eventually, you start breaking down abstractions, and near the end, you realize that no system is truly complex.
Everything is simple, as long as you understand it.
With everything out of the way, I’m wrapping up my writing, rereading and fixing little things while wondering why I put so much effort into ranting on labs.
Not sure. Not at all.
I do wish to help people out, but I neither have the reach nor have I covered anything important that would truly help someone, so that’s not it.
I did enjoy the rant, though; then again, I enjoy them all.
It wasn’t easy, nor was it quick. I haven’t made any major discoveries during research.
Hmm. No clear reason, huh?
Guess that’s it.
I really am broke.