Deconstructing Kubernetes Networking

Enough theory, let's actually set up CNI! For the rest of this post, we'll be working with our original node (mink8s in my case).

First we need to pick an IP range for our pods. I'm going to arbitrarily decide that my mink8s node is going to use the 10.12.1.0/24 range (i.e. 10.12.1.0 - 10.12.1.255). That gives us more than enough IPs to work with for our purposes. (When we go multi-node we can give the other nodes in our cluster similar ranges.)
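
(A quick, optional sanity check before committing to a range: make sure it doesn't overlap with anything the host is already using.)

$ ip -4 addr show    # addresses the host already has
$ ip -4 route        # subnets the host already knows how to reach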

The first thing we'll have to do (to save many hours of debugging woes) is to disable Docker's built-in networking entirely. For boring historical reasons, Docker does not use CNI, and its built-in solution interferes with the setup we're going for. Edit /etc/docker/daemon.json to look like this:

{
  "exec-opts": ["native.cgroupdriver=cgroupfs"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "bridge": "none",
  "iptables": false,
  "ip-masq": false,
  "storage-driver": "overlay2"
}

Most of these settings aren't important for our purposes, but the bridge, iptables, and ip-masq options are critical: they stop Docker from creating its own docker0 bridge and from installing its own iptables and NAT rules, which would otherwise fight with the CNI setup we're about to build. Once you've edited that file, reboot the machine to clear out old network settings and iptables rules. (Trust me, this will make your life much easier! It's also probably a good idea to delete any existing pods you have running to avoid confusion.)
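
For concreteness, that cleanup might look something like this (a rough sketch; the post-reboot checks are just to reassure yourself that Docker really kept its hands off the network):

$ ./kubectl delete pods --all    # clear out any pods left over from earlier experiments
$ sudo reboot
# ...once the machine is back up:
$ ip link show docker0           # should complain that docker0 doesn't exist
$ sudo iptables -t nat -nL       # shouldn't contain any DOCKER chains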

Now we'll have to get CNI up and running. We're going to use the example plugins provided by the CNI project; by convention, the binaries live in /opt/cni/bin:

$ curl -L https://github.com/containernetworking/plugins/releases/download/v0.8.6/cni-plugins-linux-amd64-v0.8.6.tgz > cni.tgz
$ sudo mkdir -p /opt/cni/bin
$ sudo tar xzvf cni.tgz -C /opt/cni/bin

Now we'll make a CNI network configuration file that will use the bridge CNI plugin, which sets up networking according to the basic scheme outlined earlier. Confusingly, to use CNI we actually need to configure two plugins: a "main" plugin and an "IPAM" plugin (IPAM stands for IP Address Management). The IPAM plugin is responsible for allocating IPs for pods while the main plugin does most of the rest of the configuration. We'll be using the host-local IPAM plugin, which just allocates IPs from a range and makes sure there are no overlaps on the host.

OK enough theory—let's take a first crack at a minimal CNI configuration. Kubelet will look for CNI configuration files in the /etc/cni/net.d directory by default. Put the following in /etc/cni/net.d/mink8s.conf:

{
    "cniVersion": "0.3.1",
    "name": "mink8s",
    "type": "bridge",
    "bridge": "mink8s0",
    "ipam": {
        "type": "host-local",
        "ranges": [
          [{"subnet": "10.12.1.0/24"}]
        ]
    }
}

To dissect that configuration a bit:

  • type and ipam.type specify the actual plugin binary names (so it will look for /opt/cni/bin/bridge and /opt/cni/bin/host-local for the plugins we're using).
  • bridge specifies the name of the network bridge that the bridge plugin will create.
  • ipam.ranges specifies the IP ranges to allocate to pods. In our case, we're going to allocate IPs in the 10.12.1.0/24 range.
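
Before wiring this into kubelet, it can be illuminating to exercise the config by hand: a CNI plugin is just a binary that takes its network config on stdin plus a handful of CNI_* environment variables. Here's a rough sketch of what that looks like (the scratch namespace cnitest is arbitrary and purely for illustration, so tear it all down afterwards):

$ sudo ip netns add cnitest
$ sudo CNI_COMMAND=ADD CNI_CONTAINERID=cnitest CNI_NETNS=/var/run/netns/cnitest \
       CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin \
       /opt/cni/bin/bridge < /etc/cni/net.d/mink8s.conf
$ sudo ip netns exec cnitest ip addr show eth0    # should show a 10.12.1.x address
$ sudo CNI_COMMAND=DEL CNI_CONTAINERID=cnitest CNI_NETNS=/var/run/netns/cnitest \
       CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin \
       /opt/cni/bin/bridge < /etc/cni/net.d/mink8s.conf
$ sudo ip netns del cnitest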

Now we'll restart kubelet and pass the network-plugin=cni option:

$ sudo ./kubelet --network-plugin=cni --pod-manifest-path=pods --kubeconfig=kubeconfig.yaml

And then we'll create two "sleeping" pods to see if networking actually works:

$ for i in 1 2; do cat <<EOS | ./kubectl apply -f - ; done
---
apiVersion: v1
kind: Pod
metadata:
  name: sleep${i}
spec:
  containers:
  - image: alpine
    name: alpine
    command: ["sleep", "5000000"]
  nodeName: mink8s
EOS

Some poking around shows that both pods get IP addresses and can ping each other, which is a great first step!

$ ./kubectl get po -owide
NAME     READY   STATUS    RESTARTS   AGE   IP          NODE     NOMINATED NODE   READINESS GATES
sleep1   1/1     Running   0          7s    10.12.1.4   mink8s   <none>           <none>
sleep2   1/1     Running   0          6s    10.12.1.5   mink8s   <none>           <none>
$ ./kubectl exec sleep1 -- ping 10.12.1.5
PING 10.12.1.5 (10.12.1.5): 56 data bytes
64 bytes from 10.12.1.5: seq=0 ttl=64 time=0.627 ms
64 bytes from 10.12.1.5: seq=1 ttl=64 time=0.075 ms
64 bytes from 10.12.1.5: seq=2 ttl=64 time=0.116 ms

Some more poking around shows that the bridge plugin has indeed created a bridge named mink8s0 as well as a veth pair for each pod:

$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:cf:81:3d brd ff:ff:ff:ff:ff:ff
3: mink8s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 46:ee:b5:e0:67:a4 brd ff:ff:ff:ff:ff:ff
6: veth19e99be3@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master mink8s0 state UP mode DEFAULT group default
    link/ether 46:ee:b5:e0:67:a4 brd ff:ff:ff:ff:ff:ff link-netnsid 0
7: veth5947e6fb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master mink8s0 state UP mode DEFAULT group default
    link/ether b2:b6:d4:49:fb:b9 brd ff:ff:ff:ff:ff:ff link-netnsid 1

(Annoyingly, kubelet creates the network namespaces without registering them under /var/run/netns, so they don't show up in ip netns. But the link-netnsid attribute hints that each veth's peer really does live in another namespace.)
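
(If you do want to poke around inside a pod's network namespace, one workaround is to enter it via the PID of the pod's pause container. This sketch assumes Docker and its usual k8s_POD_ naming convention for pause containers:)

$ PID=$(sudo docker inspect -f '{{.State.Pid}}' $(sudo docker ps -q -f name=k8s_POD_sleep1))
$ sudo nsenter -t $PID -n ip addr show eth0    # run a command inside sleep1's network namespace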

We're still a ways off from implementing our full Kubernetes network model, however. Pinging the pods from the host doesn't work (which you may remember is a requirement of the model), and neither does pinging the host from the pods (which I don't think is a strict requirement in theory but is going to be essential in practice):

$ HOST_IP=10.70.10.228 # set to whatever your host's internal IP address is
$ ping 10.12.1.4
PING 10.12.1.4 (10.12.1.4) 56(84) bytes of data.
^C
--- 10.12.1.4 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2047ms

$ ./kubectl exec sleep1 -- ping $HOST_IP
PING 10.70.10.228 (10.70.10.228): 56 data bytes
ping: sendto: Network unreachable
command terminated with exit code 1

The reason the host and pods can't communicate with each other is that they're on different network subnets (in my case, 10.12.1.0/24 for the pods and 10.70.0.0/16 for the VM), which means they can't communicate directly over Ethernet and will need to use IP routing to find each other (for the networking-jargon-inclined: we need to go from layer 2 to layer 3). Linux bridges work at layer 2 by default, but once you assign the bridge an IP address, the host can route traffic to and from the pod subnet through it just fine. (You can confirm that the bridge doesn't currently have an IP address with ip addr show dev mink8s0.)
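
(If you want to convince yourself of that before touching the CNI config, you can temporarily hand the gateway address to the bridge yourself; this is purely an experiment, and we'll let the plugin do it properly in a moment. The pod IP here is the one from the earlier output, so yours may differ:)

$ sudo ip addr add 10.12.1.1/24 dev mink8s0    # give the bridge an IP on the pod subnet
$ ping -c1 10.12.1.4                           # host -> pod now works
$ sudo ip addr del 10.12.1.1/24 dev mink8s0    # undo the experiment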

To configure the bridge to use layer 3 routing, we'll set the isGateway option in our CNI config file. Here's our next attempt at the configuration:

{
    "cniVersion": "0.3.1",
    "name": "mink8s",
    "type": "bridge",
    "bridge": "mink8s0",
    "isGateway": true,
    "ipam": {
        "type": "host-local",
        "ranges": [
          [{"subnet": "10.12.1.0/24"}]
        ]
    }
}

Whenever we change the CNI configuration, we'll want to delete and recreate all our pods, since the networking configuration is only used on pod creation/deletion. Once we do that, we find that the bridge has been given an IP address and we can ping the pods from the host, but pinging the host from the pods still doesn't work:

$ ip addr show dev mink8s0 | grep 10.12
    inet 10.12.1.1/24 brd 10.12.1.255 scope global mink8s0
$ ./kubectl get po -owide
NAME     READY   STATUS    RESTARTS   AGE     IP          NODE     NOMINATED NODE   READINESS GATES
sleep1   1/1     Running   0          5m56s   10.12.1.8   mink8s   <none>           <none>
sleep2   1/1     Running   0          5m55s   10.12.1.7   mink8s   <none>           <none>
$ ping -c 3 10.12.1.8
PING 10.12.1.8 (10.12.1.8) 56(84) bytes of data.
64 bytes from 10.12.1.8: icmp_seq=1 ttl=64 time=0.063 ms
64 bytes from 10.12.1.8: icmp_seq=2 ttl=64 time=0.087 ms
64 bytes from 10.12.1.8: icmp_seq=3 ttl=64 time=0.099 ms
$ ./kubectl exec sleep1 -- ping $HOST_IP
PING 10.70.10.228 (10.70.10.228): 56 data bytes
ping: sendto: Network unreachable
command terminated with exit code 1

The reason it's still not working is that the pod doesn't have a default route set up (you can confirm this with ./kubectl exec sleep1 -- ip route). We can solve this by adding a route for 0.0.0.0/0 (i.e. everywhere) to our CNI config, which becomes the pod's default route:

{
    "cniVersion": "0.3.1",
    "name": "mink8s",
    "type": "bridge",
    "bridge": "mink8s0",
    "isGateway": true,
    "ipam": {
        "type": "host-local",
        "ranges": [
          [{"subnet": "10.12.1.0/24"}]
        ],
        "routes": [{"dst": "0.0.0.0/0"}]
    }
}

(For reasons I don't entirely understand, setting up routes is the responsibility of the IPAM plugin instead of the bridge plugin.) Once that's saved and our pods have been killed and recreated, we see the default route is set up and pinging the host works fine:

$ ./kubectl exec sleep1 -- ip route
default via 10.12.1.1 dev eth0
10.12.1.0/24 dev eth0 scope link  src 10.12.1.9
$ ./kubectl exec sleep1 -- ping -c3 $HOST_IP
PING 10.70.10.228 (10.70.10.228): 56 data bytes
64 bytes from 10.70.10.228: seq=0 ttl=64 time=0.110 ms
64 bytes from 10.70.10.228: seq=1 ttl=64 time=0.269 ms
64 bytes from 10.70.10.228: seq=2 ttl=64 time=0.233 ms

Our pods can now talk to each other (on the same node) and the host and pods can also talk to each other. So technically you could say we've implemented the Kubernetes networking model for one node. But there's still a glaring omission, which we'll see if we try to ping an address outside of our network:

$ ./kubectl exec sleep1 -- ping -c3 1.1.1.1
PING 1.1.1.1 (1.1.1.1): 56 data bytes

--- 1.1.1.1 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
command terminated with exit code 1

Our pods can't reach the Internet! This isn't particularly surprising, since our pods are connected to the bridge network, not the actual Ethernet adapter of the host.

To get outgoing Internet connectivity working, we'll need to set up NAT using the IP masquerade feature of iptables. (NAT is necessary in this case because all of our pods are going to share the external IP address of our host.) The bridge plugin has us covered with the ipMasq option. Let's save our final (for this blog) CNI configuration:

{
    "cniVersion": "0.3.1",
    "name": "mink8s",
    "type": "bridge",
    "bridge": "mink8s0",
    "isGateway": true,
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "ranges": [
          [{"subnet": "10.12.1.0/24"}]
        ],
        "routes": [{"dst": "0.0.0.0/0"}]
    }
}

Once that's applied, our pods can reach the Internet:

$ ./kubectl exec sleep1 -- ping -c3 1.1.1.1
PING 1.1.1.1 (1.1.1.1): 56 data bytes
64 bytes from 1.1.1.1: seq=0 ttl=51 time=4.343 ms
64 bytes from 1.1.1.1: seq=1 ttl=51 time=4.189 ms
64 bytes from 1.1.1.1: seq=2 ttl=51 time=4.285 ms

We can see the IP masquerade rules created by the plugin by poking around with iptables:

$ sudo iptables --list POSTROUTING --numeric --table nat
Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
CNI-c07db3c8c34133af9e525bf4  all  --  10.12.1.11           0.0.0.0/0            /* name: "mink8s" id: "1793c831da4a054beebc6c4dad02088bb7bd4553d435972b7581cb135d349113" */
CNI-a874815f36fc490c823cf894  all  --  10.12.1.12           0.0.0.0/0            /* name: "mink8s" id: "f98855905b1b070f7aa7387c844308d53fbeeeba65a23a075cfe6f12ea516005" */
$ sudo iptables -L CNI-c07db3c8c34133af9e525bf4 -n -t nat
Chain CNI-c07db3c8c34133af9e525bf4 (1 references)
target     prot opt source               destination
ACCEPT     all  --  0.0.0.0/0            10.12.1.0/24         /* name: "mink8s" id: "1793c831da4a054beebc6c4dad02088bb7bd4553d435972b7581cb135d349113" */
MASQUERADE  all  --  0.0.0.0/0           !224.0.0.0/4          /* name: "mink8s" id: "1793c831da4a054beebc6c4dad02088bb7bd4553d435972b7581cb135d349113" */

For those not fluent in iptablesese, here's a rough translation of these rules:

  • If a packet comes from a pod IP address, use a special iptables chain for that pod (e.g. in this example, 10.12.1.11 uses the CNI-c07db3c8c34133af9e525bf4 chain).
  • In that chain, if the packet isn't going to the pod's local network or a special multicast address (the 224.0.0.0/4 business), masquerade it.
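
A hand-rolled equivalent of those two rules would look roughly like the following (illustrative only; the plugin manages its own per-pod chains, so there's no need to actually add these):

$ sudo iptables -t nat -A POSTROUTING -s 10.12.1.0/24 -d 10.12.1.0/24 -j ACCEPT       # don't NAT pod-to-pod traffic
$ sudo iptables -t nat -A POSTROUTING -s 10.12.1.0/24 ! -d 224.0.0.0/4 -j MASQUERADE  # hide everything else behind the host IP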