Enough theory, let's actually set up CNI! For the rest of this post, we'll be
working with our original node (mink8s in my case).
First we need to pick an IP range for our pods. I'm going to arbitrarily decide
that my mink8s node is going to use the 10.12.1.0/24 range
(i.e. 10.12.1.0 - 10.12.1.255). That gives us more than enough IPs to work
with for our purposes. (When we go multi-node we can give the other nodes in our
cluster similar ranges.)
The first thing we'll have to do (to save many hours of debugging woes) is to
disable Docker's built-in networking entirely. For boring historical reasons,
Docker does not use CNI, and its built-in solution interferes with the setup
we're going for. Edit /etc/docker/daemon.json to look like this:
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "bridge": "none",
  "iptables": false,
  "ip-masq": false,
  "storage-driver": "overlay2"
}

Most of these settings aren't important for our purposes, but the bridge,
iptables, and ip-masq options are critical. Once you've edited that file,
reboot the machine to clear out old network settings and iptables
rules. (Trust me, this will make your life much easier! It's also probably a
good idea to delete any existing pods you have running to avoid confusion.)
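Once the machine is back up, it's worth a quick sanity check that Docker's networking really is out of the picture (the exact output will vary, so treat this as a rough guide rather than gospel):
$ ip link show docker0         # should complain that the device doesn't exist
$ sudo iptables -t nat -L -n   # shouldn't contain any DOCKER chains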
Now we'll have to get CNI up and running. We're going to use the example plugins
provided by the CNI project; by convention, the binaries live in /opt/cni/bin:
$ curl -L https://github.com/containernetworking/plugins/releases/download/v0.8.6/cni-plugins-linux-amd64-v0.8.6.tgz > cni.tgz
$ sudo mkdir -p /opt/cni/bin
$ sudo tar xzvf cni.tgz -C /opt/cni/bin

Now we'll make a CNI network configuration file that will use the bridge CNI
plugin, which sets up networking according to the basic scheme outlined
earlier. Confusingly, to use CNI we actually need to configure two
plugins: a "main" plugin and an "IPAM" plugin (IPAM stands for IP Address
Management). The IPAM plugin is responsible for allocating IPs for pods while
the main plugin does most of the rest of the configuration. We'll be using the
host-local IPAM plugin, which just allocates IPs from a range and makes sure
there are no overlaps on the host.
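If you want to double-check that both plugin binaries made it onto the machine, a quick look in the directory we untarred into should show bridge and host-local (along with a pile of other plugins we won't be using):
$ ls /opt/cni/bin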
OK enough theory—let's take a first crack at a minimal CNI
configuration. Kubelet will look for CNI configuration files in the
/etc/cni/net.d directory by default. Put the following in
/etc/cni/net.d/mink8s.conf:
{
  "cniVersion": "0.3.1",
  "name": "mink8s",
  "type": "bridge",
  "bridge": "mink8s0",
  "ipam": {
    "type": "host-local",
    "ranges": [
      [{"subnet": "10.12.1.0/24"}]
    ]
  }
}

To dissect that configuration a bit:
- type and ipam.type specify the actual plugin binary names (so it will look for
  /opt/cni/bin/bridge and /opt/cni/bin/host-local for the plugins we're using).
- bridge specifies the name of the network bridge that the bridge plugin will create.
- ipam.ranges specifies the IP ranges to allocate to pods. In our case, we're
  going to allocate IPs in the 10.12.1.0/24 range.
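As an aside, you don't strictly need kubelet to try out this configuration. The sketch below (which I've made no effort to make robust) invokes the plugin by hand against a throwaway network namespace; the environment variables are how callers pass parameters to CNI plugins, and cnitest is just a name I made up:
$ sudo ip netns add cnitest
$ sudo CNI_COMMAND=ADD CNI_CONTAINERID=cnitest CNI_NETNS=/var/run/netns/cnitest CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin /opt/cni/bin/bridge < /etc/cni/net.d/mink8s.conf
$ # ...the plugin prints a JSON result; when you're done poking around, clean up:
$ sudo CNI_COMMAND=DEL CNI_CONTAINERID=cnitest CNI_NETNS=/var/run/netns/cnitest CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin /opt/cni/bin/bridge < /etc/cni/net.d/mink8s.conf
$ sudo ip netns del cnitest
Kubelet will be doing essentially the same dance for every pod it creates.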
Now we'll restart kubelet and pass the network-plugin=cni option:
$ sudo ./kubelet --network-plugin=cni --pod-manifest-path=pods --kubeconfig=kubeconfig.yaml

And then we'll create two "sleeping" pods to see if networking actually works:
$ for i in 1 2; do cat <<EOS | ./kubectl apply -f - ; done
---
apiVersion: v1
kind: Pod
metadata:
  name: sleep${i}
spec:
  containers:
  - image: alpine
    name: alpine
    command: ["sleep", "5000000"]
  nodeName: mink8s
EOS

Some poking around shows that both pods get IP addresses and can ping each other, which is a great first step!
$ ./kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
sleep1 1/1 Running 0 7s 10.12.1.4 mink8s <none> <none>
sleep2 1/1 Running 0 6s 10.12.1.5 mink8s <none> <none>
$ ./kubectl exec sleep1 -- ping 10.12.1.5
PING 10.12.1.5 (10.12.1.5): 56 data bytes
64 bytes from 10.12.1.5: seq=0 ttl=64 time=0.627 ms
64 bytes from 10.12.1.5: seq=1 ttl=64 time=0.075 ms
64 bytes from 10.12.1.5: seq=2 ttl=64 time=0.116 ms

Some more poking around shows that the bridge plugin has indeed created a
bridge named mink8s0 as well as a veth pair for each pod:
$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether fa:16:3e:cf:81:3d brd ff:ff:ff:ff:ff:ff
3: mink8s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 46:ee:b5:e0:67:a4 brd ff:ff:ff:ff:ff:ff
6: veth19e99be3@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master mink8s0 state UP mode DEFAULT group default
link/ether 46:ee:b5:e0:67:a4 brd ff:ff:ff:ff:ff:ff link-netnsid 0
7: veth5947e6fb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master mink8s0 state UP mode DEFAULT group default
link/ether b2:b6:d4:49:fb:b9 brd ff:ff:ff:ff:ff:ff link-netnsid 1

(Annoyingly, kubelet creates the network namespaces in such a way that they
don't show up in ip netns. But the link-netnsid attribute gives a hint that
the veths are indeed connected to veths in other namespaces.)
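If you do want to poke around inside a pod's network namespace directly, one workaround (a sketch that assumes Docker is the container runtime and that nsenter is installed) is to find the container's process ID and enter its network namespace that way:
$ CONTAINER_ID=$(./kubectl get po sleep1 -o jsonpath='{.status.containerStatuses[0].containerID}' | sed 's|docker://||')
$ PID=$(sudo docker inspect --format '{{.State.Pid}}' $CONTAINER_ID)
$ sudo nsenter --target $PID --net ip addr show eth0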
We're still a ways off from implementing our full Kubernetes network model, however. Pinging the pods from the host doesn't work (which you may remember is a requirement of the model), and neither does pinging the host from the pods (which I don't think is a strict requirement in theory but is going to be essential in practice):
$ HOST_IP=10.70.10.228 # set to whatever your host's internal IP address is
$ ping 10.12.1.4
PING 10.12.1.4 (10.12.1.4) 56(84) bytes of data.
^C
--- 10.12.1.4 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2047ms
$ ./kubectl exec sleep1 -- ping $HOST_IP
PING 10.70.10.228 (10.70.10.228): 56 data bytes
ping: sendto: Network unreachable
command terminated with exit code 1

The reason the host and pods can't communicate with each other is that they're
on different network subnets (in my case, 10.12.1.0/24 for the pods and
10.70.0.0/16 for the VM), which means they can't communicate directly over
Ethernet and will need to use IP routing to find each other (for the
networking-jargon-inclined: we need to go from layer 2 to layer 3). Linux
bridges work on layer 2 by default, but can actually handle layer 3 routing just
fine if you assign IP addresses to them. (You can confirm that the bridge
doesn't currently have an IP address with ip addr show dev mink8s0.)
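For intuition: what we're after is roughly equivalent to assigning the gateway address to the bridge by hand, something like the line below (no need to actually run it; we're about to have the plugin do it for us):
$ sudo ip addr add 10.12.1.1/24 dev mink8s0   # roughly what the next config change will automate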
To configure the bridge to use layer 3 routing, we'll set the isGateway option
in our CNI config file. Here's our next attempt at the configuration:
{
  "cniVersion": "0.3.1",
  "name": "mink8s",
  "type": "bridge",
  "bridge": "mink8s0",
  "isGateway": true,
  "ipam": {
    "type": "host-local",
    "ranges": [
      [{"subnet": "10.12.1.0/24"}]
    ]
  }
}

Whenever we change the CNI configuration, we'll want to delete and recreate all our pods, since the networking configuration is only used on pod creation/deletion. Once we do that, we find that the bridge has been given an IP address and we can ping the pods from the host, but pinging the host from the pods still doesn't work:
$ ip addr show dev mink8s0 | grep 10.12
inet 10.12.1.1/24 brd 10.12.1.255 scope global mink8s0
$ ./kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
sleep1 1/1 Running 0 5m56s 10.12.1.8 mink8s <none> <none>
sleep2 1/1 Running 0 5m55s 10.12.1.7 mink8s <none> <none>
$ ping -c 3 10.12.1.8
PING 10.12.1.8 (10.12.1.8) 56(84) bytes of data.
64 bytes from 10.12.1.8: icmp_seq=1 ttl=64 time=0.063 ms
64 bytes from 10.12.1.8: icmp_seq=2 ttl=64 time=0.087 ms
64 bytes from 10.12.1.8: icmp_seq=3 ttl=64 time=0.099 ms
$ ./kubectl exec sleep1 -- ping $HOST_IP
PING 10.70.10.228 (10.70.10.228): 56 data bytes
ping: sendto: Network unreachable
command terminated with exit code 1

The reason it's still not working is that the pod doesn't have a default route
set up (you can confirm this with ./kubectl exec sleep1 -- ip route). We can
fix this by adding a route to 0.0.0.0/0 (i.e. everywhere) to our CNI
configuration, which will become the pod's default route:
{
  "cniVersion": "0.3.1",
  "name": "mink8s",
  "type": "bridge",
  "bridge": "mink8s0",
  "isGateway": true,
  "ipam": {
    "type": "host-local",
    "ranges": [
      [{"subnet": "10.12.1.0/24"}]
    ],
    "routes": [{"dst": "0.0.0.0/0"}]
  }
}

(For reasons I don't entirely understand, setting up routes is the
responsibility of the IPAM plugin instead of the bridge plugin.) Once that's
saved and our pods have been killed and recreated, we see the default route is
set up and pinging the host works fine:
$ ./kubectl exec sleep1 -- ip route
default via 10.12.1.1 dev eth0
10.12.1.0/24 dev eth0 scope link src 10.12.1.9
$ ./kubectl exec sleep1 -- ping -c3 $HOST_IP
PING 10.70.10.228 (10.70.10.228): 56 data bytes
64 bytes from 10.70.10.228: seq=0 ttl=64 time=0.110 ms
64 bytes from 10.70.10.228: seq=1 ttl=64 time=0.269 ms
64 bytes from 10.70.10.228: seq=2 ttl=64 time=0.233 ms

Our pods can now talk to each other (on the same node) and the host and pods can also talk to each other. So technically you could say we've implemented the Kubernetes networking model for one node. But there's still a glaring omission, which we'll see if we try to ping an address outside of our network:
$ ./kubectl exec sleep1 -- ping -c3 1.1.1.1
PING 1.1.1.1 (1.1.1.1): 56 data bytes
--- 1.1.1.1 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
command terminated with exit code 1

Our pods can't reach the Internet! This isn't particularly surprising, since our pods are connected to the bridge network, not the actual Ethernet adapter of the host.
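At this point nothing on the host is rewriting the pods' source addresses, so any packet that does escape still claims to be from 10.12.1.x, an address the outside world has no idea how to route back to. You can get a feel for this by looking at the NAT table before we change anything (output will vary, but there shouldn't be any masquerade rules for our pod subnet yet):
$ sudo iptables -t nat -L POSTROUTING -n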
To get outgoing Internet connectivity working, we'll need to set up NAT using
the IP masquerade feature of iptables. (NAT is necessary in this case because
all of our pods are going to share the external IP address of our host.) The
bridge plugin has us covered with the ipMasq option. Let's save our final
(for this blog) CNI configuration:
{
  "cniVersion": "0.3.1",
  "name": "mink8s",
  "type": "bridge",
  "bridge": "mink8s0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "ranges": [
      [{"subnet": "10.12.1.0/24"}]
    ],
    "routes": [{"dst": "0.0.0.0/0"}]
  }
}

Once that's applied, our pods can reach the Internet:
$ ./kubectl exec sleep1 -- ping -c3 1.1.1.1
PING 1.1.1.1 (1.1.1.1): 56 data bytes
64 bytes from 1.1.1.1: seq=0 ttl=51 time=4.343 ms
64 bytes from 1.1.1.1: seq=1 ttl=51 time=4.189 ms
64 bytes from 1.1.1.1: seq=2 ttl=51 time=4.285 ms

We can see the IP masquerade rules created by the plugin by poking around with
iptables:
$ sudo iptables --list POSTROUTING --numeric --table nat
Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
KUBE-POSTROUTING all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */
CNI-c07db3c8c34133af9e525bf4 all -- 10.12.1.11 0.0.0.0/0 /* name: "mink8s" id: "1793c831da4a054beebc6c4dad02088bb7bd4553d435972b7581cb135d349113" */
CNI-a874815f36fc490c823cf894 all -- 10.12.1.12 0.0.0.0/0 /* name: "mink8s" id: "f98855905b1b070f7aa7387c844308d53fbeeeba65a23a075cfe6f12ea516005" */
$ sudo iptables -L CNI-c07db3c8c34133af9e525bf4 -n -t nat
Chain CNI-c07db3c8c34133af9e525bf4 (1 references)
target prot opt source destination
ACCEPT all -- 0.0.0.0/0 10.12.1.0/24 /* name: "mink8s" id: "1793c831da4a054beebc6c4dad02088bb7bd4553d435972b7581cb135d349113" */
MASQUERADE all -- 0.0.0.0/0 !224.0.0.0/4 /* name: "mink8s" id: "1793c831da4a054beebc6c4dad02088bb7bd4553d435972b7581cb135d349113" */

For those not fluent in iptablesese, here's a rough translation of these rules:
- If a packet comes from a pod IP address, use a special iptables chain for that
  pod (e.g. in this example, 10.12.1.11 uses the CNI-c07db3c8c34133af9e525bf4 chain).
- In that chain, if the packet isn't going to the pod's local network or a
  special multicast address (the 224.0.0.0/4 business), masquerade it.
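The net effect is roughly the same as a single hand-written rule like the one below (an approximation for illustration only; don't actually add it, since the plugin already manages its own per-pod chains):
$ sudo iptables -t nat -A POSTROUTING -s 10.12.1.0/24 ! -d 10.12.1.0/24 -j MASQUERADE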