This post does not reflect the views of current, past, or future employers. The opinions in this article are my own.
Hurray, XDP is turning ten! It seems like just yesterday our baby was taking its first steps as a fledgling technology with a lot of promise, and yet with a natural apprehension for its future. Ten years later it’s an established major networking feature supported by Linux and Windows (with some ongoing FreeBSD development). It’s used in several commercial products and open source projects including Cilium, Katran, & Cloudflare’s DDoS mitigation pipeline. Doing a quick search for “XDP” on the Netdev mailing list shows it’s mentioned at least four or five times a day. I think it’s safe to say that little XDP has grown up to surpass our expectations!
So have a piece of cake and don a party hat, because today we’re celebrating XDP! This will be part retrospective, part highlight reel, and part looking towards the future…
Humble beginnings
XDP was born on February 11, 2016 in Seville, Spain. Why can I be so specific about the date? Well, it’s kind of a funny story…
February 11, 2016 was the very day that they announced FD.io would join the Linux Foundation. FD.io was a project all about kernel bypass. The idea was to use technologies like DPDK and VPP for userspace networking. Ostensibly, this improves performance over the Linux kernel networking stack by 10x or even 100x.
Coincidentally, that same week we were holding the Netdev 1.1 conference in Seville. This is a long-running conference focused primarily on all things Linux kernel networking. Many of the Linux networking folks were gathered together when we got wind of the FD.io announcement. FD.io was being touted as an alternative to Linux kernel networking, so naturally its joining the Linux Foundation caught our attention ;-).
In retrospect, it wasn’t so much that FD.io was getting into kernel bypass; we already knew about ongoing efforts in DPDK. The thing that annoyed us kernel developers was some of the outlandish performance claims. Sure, there’s overhead going through the kernel, several layers of abstraction, and the kernel is feature-laden with things like IPtables. These slow down processing a bit, but kernel bypass being 10x or 100x better performance? That didn’t make any sense! In any case, that got the wheels turning…
That same night we had our social event. I believe it was a walking tour of Seville, where we got to see some castles and parks (Seville is quite a beautiful town!). To get to the event, we took a bus that followed the Guadalquivir river. It was on that bus ride that Alexei Starovoitov and I hashed out the basics of what would become XDP.
The initial premise was (and still is) really simple. We would add a hook in the low level NIC driver receive path to execute the user’s code. Right after a packet is dequeued, the driver would call the user’s function to process it. This call happens completely in the kernel so there’s no context switch overhead, and it happens before an sk_buff has been allocated or any other core stack processing — XDP is effectively kernel bypass within the kernel.
Enter eBPF
The missing piece of the puzzle was how to make it easy for users to insert their code into the kernel. This is a non-trivial question! The kernel is an unforgiving environment, one mistake and the whole system can crash or critical data can be corrupted. Fortunately, we didn’t have to ponder this long — eBPF was the obvious answer.
eBPF predates XDP by a couple of years, having been invented by Alexei Starovoitov and Daniel Borkmann in 2014. The basic idea is that user programs can run in a privileged context such as the operating system kernel. It’s the successor to the Berkeley Packet Filter (BPF, with the “e” meaning “extended”). eBPF is supported by several operating systems including Linux and Windows, and there are a bunch of use cases including logging, security, performance tuning, profiling, and networking (XDP).
IMO, eBPF is the most significant advancement in operating systems in the past twenty years. I think the design is brilliant for two reasons: 1) it’s easy to program, and 2) it’s safe to run in the kernel.
One of the first things done for eBPF was to add support in LLVM to output eBPF bytecode. An eBPF program is written in Restricted C that is basically C code with limitations on loops, global variables, variadic functions, floating-point, and passing structures as function arguments. Using a familiar and ubiquitous language makes eBPF easy to program — you don’t need to be a kernel guru to write meaningful eBPF code!
eBPF is used to safely and efficiently extend the capabilities of the kernel at runtime without requiring changes to kernel source code or loading kernel modules. Safety is provided through an in-kernel verifier that performs static code analysis and rejects programs which might crash, hang, or otherwise interfere with the kernel negatively. The verifier is an amazing tool. Sometimes programmers may curse it because we need to “appease the verifier”, but when you think about what it’s doing and the functionality being enabled it really is darn impressive!
(Fun fact)
We only decided on the name XDP, i.e. eXpress DataPath, a few weeks later. My original idea was to call it Ludicrous Mode Networking, or LMN, since about that time Tesla was coming up with Ludicrous Mode for their Model S (0 to 60 in 2.3 seconds!). The problem was that LMN sounds a little too much like “lemon”. Fortunately, we decided on XDP, which is a much better name — and ‘X’ is the fastest letter after all!
XDP Highlight Reel
I’m not going to do a deep dive on how to program XDP; there are plenty of introductory guides and tutorials for that. I will, however, point out some of the cool features of XDP.
The XDP function
The core of an XDP program is a function that serves as the entry point into the user’s program. The prototype of the function a user would write as an XDP function in the kernel is:
int xdp_prog_simple(struct xdp_md *ctx);

The argument to an XDP function is an xdp_md (XDP metadata) structure. This contains various metadata, including pointers to the packet data. The xdp_md structure is defined as:
struct xdp_md {
__u32 data;
__u32 data_end;
__u32 data_meta;
/* Below access go through struct xdp_rxq_info */
__u32 ingress_ifindex;
__u32 rx_queue_index;
#ifndef __KERNEL__
__u32 egress_ifindex;
#endif
};

The fields in the struct xdp_md provide essential information for packet manipulation:
- data: An offset (or pointer in some contexts) to the start of the raw packet data within the memory buffer.
- data_end: An offset (or pointer) to the end of the packet data. XDP programs must perform bounds checks to ensure they do not read beyond data_end.
- data_meta: An offset used for storing extra metadata that can be pushed along with the packet.
- ingress_ifindex: The index of the network interface where the packet was received.
- rx_queue_index: The index of the receive queue the packet arrived on.
- egress_ifindex: (Only available when processing AF_XDP packets from a userspace program to redirect back into the kernel stack) The interface index the packet is intended to exit through.
An XDP function returns an XDP Return Code. There are five possible codes:
- XDP_DROP: Drops the packet immediately at the driver level, making it the fastest way to discard traffic (e.g., for DDoS mitigation).
- XDP_PASS: Passes the packet up to the standard Linux networking stack for normal processing (e.g., TCP/IP handling).
- XDP_TX: Bounces the packet back out the same network interface card (NIC) it arrived on. This is commonly used for load balancers or implementing direct server return functionality.
- XDP_REDIRECT: Redirects the packet to another interface, a different CPU core, or a user-space socket (AF_XDP).
- XDP_ABORTED: Indicates an eBPF program error, causing the packet to be dropped and a tracepoint event to be triggered.
The fact that the XDP function takes just a single argument and there are only five return codes amazes me. This is a testament to elegance of design and avoidance of the feature bloat that has plagued other projects.
The wider world of XDP
A big advantage of XDP is that it’s not kernel bypass. XDP is integrated into the kernel stack so an XDP program can leverage kernel facilities. It can also perform preprocessing of packets that are destined for the normal stack or userspace (this is where XDP_DROP comes in handy).
Helper functions
Helper functions are functions defined by the kernel which can be invoked from eBPF programs. These helper functions allow eBPF programs to interact with the kernel as if calling a function. The kernel places restrictions on the usage of these helper functions to prevent misuse.
There are a bunch of helper functions, and more will likely be defined. These include helpers for manipulating metadata, redirecting packets, computing hashes and checksums, general utilities, timers, and more.
eBPF maps
eBPF maps provide a way for eBPF programs to communicate with each other (kernel space) and with userspace.
When both kernel and userspace access the same maps, they need a common understanding of the key and value structures in memory. This works naturally if both programs are written in C and share a header. Otherwise, the userspace program must match the kernel-side key and value layouts byte-for-byte.
Maps are incredibly powerful and enable advanced functionality in XDP. They come in a variety of types, each of which works in a slightly different way. For XDP, maps provide the data-structure lookups one might do in network processing. Some of the more relevant map types are:
- BPF_MAP_TYPE_HASH: provides general purpose hash map storage. Both the key and the value can be structs, allowing for composite keys and values.
- BPF_MAP_TYPE_LPM_TRIE: provides a longest prefix match algorithm that can be used to match IP addresses against a stored set of prefixes. This map type could be used for route lookups.
- BPF_MAP_TYPE_ARRAY_OF_MAPS: an array of maps, where each "tuple" in the array is an inner hash map that stores flow rules sharing the same mask. This type, in conjunction with a hash map, can be used to implement a ternary table by leveraging the Tuple Space Search (TSS) algorithm.
A simple XDP program
No birthday party would be complete without party games! So let’s have some fun and look at an XDP application. Below is a simple XDP program that counts the number of packets received for each EtherType.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

struct {
__uint(type, BPF_MAP_TYPE_HASH);
__type(key, __u16); // EtherType
__type(value, __u64); // Packet count
__uint(max_entries, 1024);
} pkt_stats_map SEC(".maps");
/* SEC("xdp") is a macro that places the function in the XDP section
* of the eBPF object file.
*/
SEC("xdp")
int xdp_prog_simple(struct xdp_md *ctx) {
void *data_end = (void *)(long)ctx->data_end;
void *data = (void *)(long)ctx->data;
struct ethhdr *eth = data; // Pointer to the Ethernet header
long *count;
__u16 proto;
__u64 init_val = 1; // First packet for a new EtherType counts as one
// Basic boundary check to ensure the header is within packet limits
if ((void *)eth + sizeof(*eth) > data_end)
return XDP_ABORTED; // Abort if packet is malformed or truncated
proto = bpf_ntohs(eth->h_proto);
count = bpf_map_lookup_elem(&pkt_stats_map, &proto);
if (count)
__sync_fetch_and_add(count, 1);
else
bpf_map_update_elem(&pkt_stats_map, &proto, &init_val, BPF_ANY);
return XDP_PASS;
}
/*
* The license declaration is required to use most kernel helper functions.
* "GPL" is a common choice for eBPF programs intended for general use.
*/
char _license[] SEC("license") = "GPL";
The program uses a hash map to keep counters for the various EtherTypes. When a packet is received, the counter in the hash entry for its EtherType is incremented.
The program can be compiled using the clang compiler (note: to compile you’ll need to install the right packages, including the kernel headers):
$ clang -O2 -g -Wall -target bpf -c xdp_proto.c -o xdp_proto.o

Next we can use the ip command to attach the XDP program to an interface (for our example we’ll attach to the loopback interface):
$ sudo ip link set dev lo xdp obj xdp_proto.o sec xdp

We can see the program running on the interface using the ip command:
$ sudo ip link show dev lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 xdpgeneric qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
prog/xdp id 60 tag 3fed666785a57e53 jited

At this point every packet received on the interface, the loopback interface in our example, is processed by our XDP program. Let’s generate some packets:
$ ping -c 3 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.085 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.073 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.071 ms

--- 127.0.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2030ms
rtt min/avg/max/mdev = 0.071/0.076/0.085/0.006 ms
$ ping -c 3 ::1
PING ::1 (::1) 56 data bytes
64 bytes from ::1: icmp_seq=1 ttl=64 time=0.067 ms
64 bytes from ::1: icmp_seq=2 ttl=64 time=0.070 ms
64 bytes from ::1: icmp_seq=3 ttl=64 time=0.075 ms
--- ::1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2076ms
rtt min/avg/max/mdev = 0.067/0.070/0.075/0.003 ms
Next we need to get the identifier (ID) of our map. This can be done using bpftool:
$ sudo bpftool map show
1: hash name s_snapd_desktop flags 0x0
key 9B value 1B max_entries 500 memlock 46816B
27: hash name s_firmware_upda flags 0x0
key 9B value 1B max_entries 500 memlock 46816B
44: hash name pkt_stats_map flags 0x0
key 2B value 8B max_entries 1024 memlock 84416B
btf_id 275
46: array name xdp_prot.rodata flags 0x80
key 4B value 21B max_entries 1 memlock 288B
btf_id 275 frozen

This shows that pkt_stats_map has ID 44. We can show the contents of the map:
$ sudo bpftool map dump id 44
[{
"key": 2048,
"value": 38
},{
"key": 34525,
"value": 7
}
]

The entry with key 2048 is IPv4 (0x0800 in hex); we received thirty-eight IPv4 packets on the loopback interface. The entry with key 34525 is IPv6 (0x86DD in hex), so we received seven IPv6 packets.
We can detach the XDP program using the ip command:
sudo ip link set dev lo xdp off

The next ten years
Thinking about eBPF and XDP, I’ve come up with my wish list for where they will go in the next ten years. Some of these might seem overly ambitious, but considering how much progress we made in the first ten years, I believe they are entirely achievable. So without further ado, here are my predictions for the future of XDP/eBPF.
- eBPF performs more and more core processing. Let’s rip out core kernel code and replace it with XDP/eBPF. One of my proposals is to replace the flow dissector with an XDP program.
- Hardware seamlessly becomes part of the kernel. I talk about this a lot and it’s one of the core motivations for XDP2. If we do it right, this solves the kernel offload conundrum and that’s where we might get a true 10x performance improvement!
- No new transport protocols in kernel code. This is already sorta happening with new protocols like QUIC that are designed to run in userspace. But if we implement new protocols in XDP then we can have the flexibility of userspace programming, but still be able to hook directly into internal kernel APIs like the file system and RDMA.
- AI writes a lot of protocol and datapath code. XDP makes it simple for users to program the networking stack, so it’s going to make it easy for AI too. Vibe coding the stack, here we come!
- Obsolete kernel rebases. If you ever had the “joy” of rebasing the kernel on thousands of production machines, then you know why I really want this :-). Kernel rebases are miserable! Anything we can do with XDP or other technologies to mitigate the pain will be much appreciated!