eBPF will help solve service mesh by getting rid of sidecars
Honestly, after I learned that the majority of Kubernetes nodes just proxy traffic between each other using iptables, and that a load balancer can't tell the nodes apart (ones where your app lives vs. ones that will proxy the connection to your app), I got really worried about any kind of persistent connection in k8s land.
Since some number of persistent connections will get force terminated on scale down or node replacement events...
Cilium and eBPF look like a pretty good solution to this, though, since you can then advertise your pods directly on the network and load balance those instead of every node.
> Honestly after I learned that the majority of Kubernetes nodes just proxy traffic between each other using iptables and that a load balancer can't tell the nodes apart (ones where your app lives vs ones that will proxy connection to your app) I got really worried about any kind of persistent connection in k8s land.
There can be a difference, if your LoadBalancer-type service integration is well implemented. The externalTrafficPolicy knob determines whether all nodes should attract traffic from outside or only nodes that contain pods backing this service. For example, metallb (which attracts traffic by /32 BGP announcements to given external peers) will do this correctly.
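For reference, the knob lives on the Service object itself; a minimal sketch (names here are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app            # illustrative name
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
  # "Local" means only nodes that actually host backing pods attract
  # external traffic, and the client source IP is preserved (no second
  # hop through a proxy-only node). The default, "Cluster", lets every
  # node attract traffic and proxy it onward.
  externalTrafficPolicy: Local
```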
Within the cluster itself, only nodes which have pods backing a given service will be part of the iptables/ipvs/... Pod->Service->Pod mesh, so you won't end up with scenic routes anyway. Same for Pod->Pod networking, as these addresses are already clustered by host node.
How do you keep ecmp hashing stable between rollouts?
If you're asking about connection stability in general:
- Ideally, you avoid it in your application design.
- If you need it, you set up SIGTERM handling in the application to wait for all connections to close before the process exits. You also set up "connection draining" at the load balancer to keep existing sessions to terminating Pods open but send new sessions to the new Pods. The tradeoff is that rollouts take much longer: if the session time is unbounded, you may need to enforce a deadline to break connections eventually.
You don't just wait until all connections exit; you first need to withdraw the BGP announcement to the edge router, then start the wait. It's not that simple with metal LBs. On the other hand, it's not that simple with cloud LBs either, because they also break long TCP streams when they please.
We reused the LB as much as possible to avoid the BGP thing. There's a thing called MetalLB designed around that though.
Pretty sure metallb will have same problem when you need to rotate nodes in bgp mode
You don’t :).
To do it properly you want a maglev-style layer that allows for withdrawals/drains of application servers with minimal disruption, thanks to a minimum-disruption consistent hash and draining support. This will allow you to first drain the given application server (continue maintaining existing connections, but send new ones to a secondary for that shard) before fully taking down the instance.
Sounds like Apache's graceful-restart.
Sort of. Processes on the same node (graceful restart) vs processes on different nodes (maglev).
Eh, a signal is a signal even if it's an RPC, but my point was to focus on the "waiting for something to end or empty before restarting" part.
ECMP hashing would be between the edge router and the IP of the LBs advertising VIPs no? The LB would maintain the mappings between the VIPs and the nodePort IPs of worker nodes that have a local service Endpoint for the requested service. I don't think this would be any different than it is without Kubernetes or am I completely misunderstanding your question?
q3k has mentioned metallb+bgp, which is basically in-cluster implementation of LoadBalancer Service type (bgp speakers are running on k8s nodes and announce /32 routes to nodes based on configuration), but it doesn't provide an answer for "stabilizing" ecmp connections when there are changes to backends. There has to be something "behind" metallb[1] that will handle not only stable hashing for connections, but keep forwarding "in-flight" flows (like established tcp sessions) to correct backends, even if packets arrive on different ingress nodes. It seems cilium has some solution for that[2] (by both bundling metallb, and having maglev-based loadbalancer implementation) but I haven't had time to dig into it, so I was curious if someone else has solved it and would be willing to share stories from the front. This is one of those rough edges around kubernetes deployments in bare metal environments and I'd love to see what can be done to make it more robust.
[1] metallb only really announces IPs so that "behind" is probably just CNI that actually handles traffic [2] https://cilium.io/blog/2020/11/10/cilium-19#maglev
Ah OK, I missed that this was MetalLB specific. Interesting that Cilium is using Google's Maglev, which amongst other things handles the issue of ECMP churn when nodes are taken out of service. I remember reading this in the white paper when it came out. I believe Facebook's Katran does similar. Thanks for the link.
That's if you're using a NodePort service, which the documentation explains is for niche use cases such as if you don't have a compatible dedicated load balancer. In most professional setups you do have such a load balancer and can use other types of routing that avoid this.
https://kubernetes.io/docs/concepts/services-networking/serv...
> In most professional setups you do have such a load balancer
May I ask what one might use in an AWS cloud environment to provide that load balancer within a Region?
Does IPv6 address any of these issues? It seems to me that IPv6 is capable of providing every component in the system its own globally routable address, identity (mTLS perhaps) and transparent encryption with no extra sidecars, eBPF pieces, etc.
Ingresses on EKS will set up an ALB that sends traffic directly to pods instead of nodes (basically skips the whole K8s Service/NodePort networking setup). You have to use the `alb.ingress.kubernetes.io/target-type: ip` annotation, I think (see https://docs.aws.amazon.com/eks/latest/userguide/alb-ingress...).
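A sketch of what that looks like on the Ingress (illustrative name and rules; check the linked docs for your controller version):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app            # illustrative
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    # Register pod IPs directly as ALB targets, skipping NodePort hops:
    alb.ingress.kubernetes.io/target-type: ip
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
```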
> May I ask what one might use in an AWS cloud environment to provide that load balancer within a Region?
The AWS cloud controller will automatically set up an ALB for you if you configure a LoadBalancer service in Kubernetes. I've also done custom setups with AWS NLBs.
> Does IPv6 address any of these issues?
It could address some issues: you could conceivably create a CNI plugin which allocates an externally addressable IP to your Pods, although you would probably still want a load balancer for custom routing rules and the improved reliability over DNS round robin.
Are ALB/NLB employed to handle traffic between pods in the same cluster? Or have I misunderstood the whole discussion?
My take on the 'eBPF will help solve service mesh' proposal is that it deals with not only ingress/egress traffic (where ALB/NLB are typically employed) but all traffic, including traffic between pods in a cluster. This is where my interests lie.
> Are ALB/NLB employed to handle traffic between pods in the same cluster?
You can choose to do so, or you can communicate directly via the built-in Kubernetes service discovery and CNI overlay network. There are use cases for both.
Whether the load balancer can or cannot tell the nodes apart depends on the load balancer and the method you use to expose your service to it, as well as what kind of networking setup you use (i.e. is pod networking sensibly exposed to the load balancer or... weirdly).
Each "Service" object provides (by default; this can be disabled) a load-balanced IP address that by default uses kube-proxy as you described, a DNS A record pointing to said address, DNS SRV records pointing to actual direct connections (whether NodePorts or PodIP/port combinations), plus API access to get the same data out.
There are even replacement kube-proxy implementations that route everything through F5 load balancer boxes, but they are less known.
This is a concern only if you have ungraceful node termination, i.e. you suddenly yoink the node. In most cases when you terminate a node, k8s will (attempt to) cordon and drain it, letting the pods gracefully terminate their connections before getting evicted.
If you didn’t have k8s and just used an autoscaling group of VMs you would have the same issue…
So instead of making the applications use a good RPC library, we're going to shove more crap into the kernel? No thanks, from a security context and complexity perspective.
Per https://blog.dave.tf/post/new-kubernetes/ , the way that this was solved in Borg was:
> "Borg solves that complexity by fiat, decreeing that Thou Shalt Use Our Client Libraries For Everything, so there’s an obvious point at which to plug in arbitrarily fancy service discovery and load-balancing. "
Which seems like a better solution, if requiring some reengineering of apps.
The complexity is an issue (but sidecars are plenty complex too), but the security not so much. BPF C is incredibly limiting (you can't even have loops if the verifier can't prove to its satisfaction that the loop has a low static bound). It's nothing at all like writing kernel C.
You don't have to use C.
There are two projects that enable writing eBPF with Rust [1][2]. I'm sure there is an equivalent with nicer wrappers for C++.
It doesn't make any difference which language you use; the security promises are coming from the verifier, which is analyzing the CFG of the compiled program. C is what most people use, since the underlying APIs are in C, and since the verifier is so limiting that most high-level constructions are off the table.
Sure, I was not implying that Rust would have any security benefits for eBPF.
Just that you can even write eBPF code in more convenient languages.
This has come up here a bunch of times (we do a lot of work in Rust). I've been a little skeptical that Rust is a win here, for basically the reason I gave upthread: you can't really do much with Rust in eBPF, because the verifier won't let you; it seems to me like you'd be writing a dialect of Rust-shaped C. But we did a recent work sample challenge for Rust candidates that included an eBPF component, and a couple good submissions used Rust eBPF, so maybe I'm wrong about that.
I'm also biased because I love writing C code (I know, both viscerally and intellectually, that I should virtually never do so; eBPF is the one sane exception!)
I don't think client libraries are the answer. If you only have one technology stack (say, Java and Spring) and only use one application-layer network protocol (say, HTTP), then maybe it's fine.
But once you have more than one language or framework, you need to write more and more of these libraries. And what happens if it's not just HTTP? What if you need to speak Redis, MySQL, or some random binary protocol? Do you write client libraries for those too? Maybe a company like Google has the scale to do this, but most orgs do not. But even then, what if you have to run some vendor-supplied code that you don't even have source for?
I agree with you that shoving more of this into the kernel isn't desirable, but libraries aren't great. Been there, done that, don't feel like doing it again. I'd rather stick with sidecars.
If you are in a position where you can do that then great. Most folks out there are in a position where they need to run arbitrary applications delivered by vendors without an ability to modify them.
The second aspect is that this can get extremely expensive if your applications are written in a wide number of language frameworks. That's obviously different at Google where the number of languages can be restricted and standardized.
But even then, you could also link a TCP library into your app. Why don't you?
I'm not necessarily advocating for the approach described in the article but it wouldn't worry me from a security perspective. The security model of eBPF is pretty impressive. The security issues arising from engineers struggling to keep the entire model in their head would concern me though.
The industry is moving away from the client library approach. This is possible in a place like Google where they force folks to write software in one of four languages (C++, Java, Go, Python) but doesn't scale to a broader ecosystem.
It sure scales; I have yet to work in an organisation where anything goes.
There are a set of sanctioned languages and that is about it.
The subtle aspect of the comment you're replying to is that _they write everything_.
Hard to cram a new library into some closed source vendor app.
Depends how it was written and made extensible.
In a world without (D)COM, I find it's much, much harder to make common base libraries and force people to use them, especially if you can't also force limit the set of toolchains used in the environment.
The network is the base library - that is the shift you are seeing. You make a call out to a network address with a specific protocol.
Also, as an aside, I think WebAssembly has the potential to shift this back. In a world where libraries and programs are compiled to WebAssembly, it doesn't matter what their source language was, and as such, the client library based approach might swing back into vogue.
> The network is the base library
you remind me of the 20+ years ago Sun Microsystems assertion "The Network IS the Computer".
citation: https://www.networkcomputing.com/cloud-infrastructure/networ...
Well they did put the dot in dot-com.
WASM isn't a valid target for many languages, that's one thing.
Two, the case is about the library to interact with the network, so... There's also implementing the protocols.
In addition to whether or not all of your various dev teams preferred languages have a supported client SDK, you also have the build vs. buy issue if you're plugging COTS applications into your service mesh, there is no way to force a third party vendor to reengineer their application specifically for you.
This probably dictates a lot of Google's famous "not invented here" behavior, but most organizations can't afford to just write their entire toolchain from scratch and need to use applications developed by third parties.
It is the technically better solution IMO/IME, too.
But that doesn't work when you're trying to sell enterprises the idea of 'just move your workloads to Kubernetes!'. :)
> a good RPC library
I like that approach. If you use client libraries, new RPC mechanisms are "free" to implement (until you need to troubleshoot upgrades). It's also an argument against statically linking.
For instance, if running services on the same machine, io-uring can probably be used? (I'm a noob at this). eBPF for packet switching/forwarding between different hosts, etc.
This may no longer be the case, but back at Google I remember one day my Java library no longer using the client library logger, but spawning some other app and talking to it (sending logs to it). That other app used to be a fat client, linked into our app, supported by another team. At first I was wtf... Then it hit me: this other team can update their "logging" binary on a different cycle than us (hence we don't have to be on the same "build" cycle). All they needed to provide us with was a very "thin" and rarely changing interface library. And they can write it in any language they like (Java, C++, Go, Rust, etc.)
Also, it doesn't need to be a .so (or .dll/.dylib) - just some quick IPC to send messages around. Actually it can be better. For one, if their app is still buffering messages, my app can exit while theirs still runs. Or security reasons (or not having to think about these), etc. etc. So still statically linked, but processes talking to each other. (Granted, this does not always work for some special apps, like audio/video plugins, but I think it works fine for the case above.)
It does feel a bit like we're trying to monkey patch compiled code but the benefits are pretty clear.
I would argue pretty strenuously that this is not what is being done.
The sockets layer is becoming a facade which can guarantee additional things to applications which are compiled against it, and you've got dependency injection here so that the application layer can be written agnostically and not care about any of those concerns at all.
Well, OK, but the dependency injection is not statically checked in this case; it's changed dynamically, perhaps while the application is running. Is that not similar to a monkey patch?
That’s great if you write all your software. If you want to use someone else’s thing then you have to wrap it in that magic everywhere client.
What if a client library does not yet exist for your language?
In a large orga, you limit the languages available for projects to well supported ones internally, ie. to those that are known to have a port of the RPC/metrics/status/discovery library. Also makes it easier to have everything under a single build system, under a single set of code styles, etc.
If some developers want to use some new language, they have to first in put in the effort by a) demonstrating the business case of using a new language and allocating resources to integrate it into the ecosystem b) porting all the shared codebase to that new language.
Absolutely. I was thinking what if there's a good business reason to use a different language that's not the norm for your org. Then you're stuck with an infra problem preventing you from using the right tool for the job.
Of course, this is the exception to the rule you described well :)
I don't think of it as an infra problem, but as an early manifestation of effort that would arise later on, anyway: long-term maintenance of that new language. You need people who know the language to integrate it well with the rest of the codebase, people who can perform maintenance on language-related tasks, people who can train other people on this language, ... These are all problems you'd have later on, but are usually handwaved away as trivial.
Throughout my career nearly every single company I've worked in had That One Codebase written by That One Brilliant Programmer in That One Weird Language that no-one maintains because the original author since left, the language turns out to be dead and because it's extremely expensive to hire or train more people to grok that language just for this project.
There are only 5 languages. JavaScript, C++, Java, Python, C#
This is basically the same set of languages people were writing 20 years ago and will probably be the same set of languages people will write in 20 years from now.
It really depends on your domain. I haven't seen C# a lot, nor python, in some orgs.
For some (like me), it's more a superset of C, assembly, bash, maybe lisp, python and matlab.
For others, it's going to be JavaScript, PHP, CSS, HTML..
I agree though that a library is usually domain-specific, and that you can probably easily identify the subset of languages that you really need official bindings for (thereby making my comment a bit useless, sorry for the noise).
The big secret is that sidecars can only help so much. If you want distributed tracing, the service mesh can't propagate traces into your application (so if service A calls service B which calls service C, you'll never see that end to end with a mesh of sidecars). mTLS is similar; it's great to encrypt your internal traffic on the wire, but that needs to get propagated up to the application to make internal authorization decisions. (I suppose in some sense I like to make sure that "kubectl port-forward" doesn't have magical enhanced privileges, which it does if your app is oblivious to the mTLS going on in the background. You could disable that specifically in your k8s setup, but generally security through remembering to disable default features seems like a losing battle to me. Easier to have the app say "yeah you need a key". Just make sure you build the feature to let oncall get a key, or they will be very sad.)
For that reason, I really do think that this is a temporary hack while client libraries are brought up to speed in popular languages. It is really easy to sell stuff with "just add another component to your house of cards to get feature X", but eventually it's all too much and you'll have to just edit your code.
I personally don't use service meshes. I have played with Istio but the code is legitimately awful, so the anecdotes of "I've never seen it work" make perfect sense to me. I have, in fact, never seen it work. (Read the xDS spec, then read Istio's implementation. Errors? Just throw them away! That's the core goal of the project, it seems. I wrote my own xDS implementation that ... handles errors and NACKs correctly. Wow, such an engineering marvel and so difficult...)
I do stick Envoy in front of things when it seems appropriate. For example, I'll put Envoy in front of a split frontend/backend application to provide one endpoint that serves both the frontend or backend. That way production is identical to your local development environment, avoiding surprises at the worst possible time. I also put it in front of applications that I don't feel like editing and rebuilding to get metrics and traces.
The one feature that I've been missing from service meshes, Kubernetes networking plugins, etc. is the ability to make all traffic leave the cluster through a single set of services, who can see the cleartext of TLS transactions. (I looked at Istio specifically, because it does have EgressGateways, but it's implemented at the TCP level and not the HTTP level. So you don't see outgoing URLs, just outgoing IP addresses. And if someone is exfiltrating data, you can't log that.) My biggest concern with running things in production is not so much internal security, though that is a big concern, but rather "is my cluster abusing someone else". That's the sort of thing that gets your cloud account shut down without appeal, and I feel like I don't have good tooling to stop that right now.
> If you want distributed tracing, the service mesh can't propagate traces into your application (so if service A calls service B which calls service C, you'll never see that end to end with a mesh of sidecars)
Why not? AFAIK traces are sent from the instrumented app to some tracing backend, and a trace-id is carried over via an HTTP header from the entry point of the request until the last service that takes part in that request. Why a sidecar/mesh would break this?
I think the point is that the service mesh can't do the work of propagation. It needs the client to grab the input header, and attach it to any outbound requests. From the perspective of the service mesh, the service is handling X requests, and Y requests are being sent outbound. It doesn't know how each outbound request maps to an input.
So now all of the sudden we do need a client library for each service in order to make sure the header is being propagated correctly.
But tracing cannot be done with a sidecar and no modification to the service code anyway. With a sidecar (or eBPF) you will get blackbox metrics for free (connection throughput, latency, errors, etc.) but tracing needs to be done inside the code (even if automatically, by some third-party library/addon, or by instrumenting manually). I understand the point that, once you are there instrumenting for tracing, you can also instrument for metrics and not use a sidecar. But to be fair, distributed tracing is something that's only catching on now, and metrics already give you some kind of visibility that it's better to have than not to have. On top of that you can add tracing and improve observability.
You described the problem correctly.
I think it's a stretch to say that requires a client library, though. It should be straightforward to have whatever library you are already using for http requests pass those headers through.
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overv...
> a trace-id is carried over via an HTTP header from the entry point of the request until the last service that takes part in that request.
But it's not, unless you specifically code your services to do that. Which isn't hard, but just means plugging an unmodified service into a service mesh isn't enough.
This. Header trace propagation is a godsend.
I'm sure someone will write leftPad in eBPF any day now.
Indeed. We could even embed a WASM runtime (headless v8?) so one can execute arbitrary JavaScript in-kernel… wait :)
eBPF is far too limited to run a WASM runtime. That's why the proposed article approach is even possible.
> Identity-based Security: Relying on network identifiers to achieve security is no longer sufficient, both the sending and receiving services must be able to authenticate each other based on identities instead of a network identifier.
Kinda semi-offtopic but I am curious to know if anyone has used identity part of a WireGuard setup for this purpose.
So say you have a bunch of machines all connected in a WireGuard VPN. And then instead of your application knowing host names or IP addresses as the primary identifier of other nodes, your application refers to other nodes by their WireGuard public key?
I use WireGuard but haven’t tried anything like that. Don’t know if it would be possible or sensible. Just thinking and wondering.
We're a global platform that runs an intra-fleet WireGuard mesh, so we have authenticated addressing between nodes; we layer a couple dozen lines of BPF C on top of that to extend the authentication model to customer address prefixes. So, effectively, we're using WireGuard as an identity. In fact: we do so explicitly for peering connections to other services.
So yeah, it's a model that can work. It's straightforward for us because we have a lot of granular control over what can get addressed where. It might be trickier if your network model is chaotic.
I too am interested in this.
I long for the day when Kubernetes services, virtual machines, dedicated servers and developer machines can all securely talk to each other in some kind of service mesh, where security and firewalls can be implemented with "tags".
Tailscale seems to be pretty much this, but while it seems great for the dev/user-facing side of things (developer machine connectivity), it doesn't seem like it's suited for the service-to-service communication side? It would be nice to have one unified connectivity solution with identity-based security rather than e.g. Consul Connect for services, Tailscale / WireGuard for dev machine connectivity, etc.
>I long for the day where Kubernetes services, virtual machines, dedicated servers and developer machines can all securely talk to eachother in some kind of service mesh, where security and firewalls can be implemented with "tags".
That's exactly what Scalable Group Tags (SGTs) are -
https://tools.ietf.org/id/draft-smith-kandula-sxp-07.html
Cisco implements this as a part of TrustSec
One of the methods that Cilium (which implements this eBPF-based service mesh idea) uses to implement authentication between workloads is WireGuard. It does exactly what you describe above.
In addition it can also be used to enforce based on service specific keys/certificates as well.
Isn't the Wireguard implementation in Cilium between nodes only, not workloads (pods)?
It can do both. It can authenticate and encrypt all traffic between nodes, which then also encrypts all traffic between the pods running on those nodes. This is great because it also covers pod-to-node and all control plane traffic. The encryption can also use specific keys for different services to authenticate and encrypt pod-to-pod traffic individually.
You'd be adding a whole new layer of what would effectively be dynamic routing. It's doable, but it's not a trivial amount of effort. Especially if you want everything to be transparent and automagic.
There's earlier projects like CJDNS which provide pubkey-addressed networking, but they're limited in usability as they route based on a DHT.
Ziti (Apache) provides bootstrapped* identity based security (and programmable, least privilege overlays).
Disclosure: founder of company which sells SaaS on top of Ziti FOSS.
* https://ziti.dev/blog/bootstrapping-trust-part-5-bootstrappi...
I understand how BPF works for transparently steering TCP connections. But the article mentions gRPC, which means HTTP/2. How can the BPF module be a replacement for a proxy here? My understanding is it would need to understand HTTP/2 framing and maintain buffers, which all sound like capabilities that require more than BPF.
Are they implementing an HTTP/2-capable proxy in native kernel C code and making APIs to that accessible via BPF?
The model I'm describing contains two pieces: 1) Moving away from sidecars to per-node proxies that can be better integrated into the Linux kernel concept of namespacing instead of artificially injecting them with complicated iptables redirection logic at the network level. 2) Providing the HTTP awareness directly with eBPF using eBPF-based protocol parsers. The parser itself is written in eBPF which has a ton of security benefits because it runs in a sandboxed environment.
We are doing both. Aspect 2) is currently done for HTTP visibility and we will be working on connection splicing and HTTP header mutation going forward.
What does an HTTP parser written in BPF look like? Bounded loops only --- meaning no string libraries --- seems like a hell of a constraint there.
Bounded loops plus the 1M-instruction limit in the 5.4 kernel (no record at hand about the exact version) give a large range of supported headers. Also note that this BPF code operates at the network level, which is subject to the MTU limit as well: usually 1500 bytes, and now up to tens of KBs (65,525 bytes maximal in theory according to https://www.lifewire.com/definition-of-mtu-817948, but my networking knowledge is poor). This makes it possible to effectively handle all possible headers.
HTTP is actually fine.
HTTP2 will be a bigger issue as it has HPACK, and Huffman coding, that would be very complicated to maintain inside BPF runtime. I haven't thought about it closely yet. But based on our experience at http://px.dev, I am not aware of any glaring technical obstacles.
This is interesting and all, but I've also written bounded loop BPF code on 5.6 kernels, and it is not easy to get the verifier to accept seemingly obvious loops. I'm not saying it's impossible, I'm saying I'd like to see what this code actually looks like. I'd be a little shocked if it just looked exactly like Node's HTTP parser.
I need to double-check when the bounded loop patch got into the kernel; I suppose it's 5.6 as you mentioned above.
What I actually was thinking is that one can write C code and ask the compiler to unroll it.
```c
#pragma unroll
for (/* ... */; i < 100; ++i) {
    /* parsing code */
}
```
Also, the other comment notes the state bookkeeping for HTTP needed to maintain parsing state when it spans multiple packets, assuming here we are talking about XDP probes.
One quick idea is to use BPF_TABLE(, uint128_t, some data structure); I haven't tested if uint128_t is OK as a key type. And the data structure in the value needs more thought. Roughly, I am thinking of turning any state bookkeeping into some BPF tables, keyed by whatever data matches the context. This probably means uint128_t as an IPv4/v6 address, and a nested map with the port as key. Or a combined v4 IP & port.
It'll be interesting. I suppose the code from Isovalent will eventually be open sourced. Or is it already so? Haven't checked yet.
Bounded loops are 5.3. I'm just saying that after like 9 months of development following their introduction, it remained tricky to get the verifier to accept loops with seemingly obvious bounds. I know the feature works (I did ultimately get some loops working!) but I could not have straightforwardly ported userland C code to do it.
You've always been able to unroll loops, but of course you're chewing up code space doing that.
I don't know what BPF_TABLE is (I think it's a BCC-ism?) but BPF hash maps can take 16 byte keys. But notice that you're now writing something that looks nothing at all like Node's HTTP parser.
I'm not doubting that they did this work. I just want to know what it ends up looking like!
Oh nice, we haven't tried bounded loops, because our product is committed to supporting kernels as old as 4.13.
BPF_TABLE is BCC.
Another challenge I can see is where to actually store the state of a connection. Even if we just focus on HTTP/1.1, not all headers will be received at once, and data from previous segments needs to be carried forward. Would it be eBPF maps? Those also seem rather limited for this use case, and are probably also not extremely fast.
I can imagine getting something to work for HTTP/1.1, but HTTP/2 with multiplexing and stateful header compression is a completely different beast.
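To make the carried-forward state concrete, here is a plain-C sketch (names hypothetical) of the kind of small struct one might store as a BPF map value and update per segment:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical per-connection parse state, the sort of thing one would
 * store as a BPF map value to resume parsing across TCP segments. */
struct http_state {
    uint32_t bytes_seen;  /* header bytes consumed so far */
    uint8_t  crlf_run;    /* how much of "\r\n\r\n" has matched */
    uint8_t  done;        /* 1 once the header block is complete */
};

/* Feed one segment; returns 1 when the end-of-headers marker has been
 * seen, even when it is split across calls. */
static int feed_segment(struct http_state *st, const char *data, size_t len)
{
    static const char end[4] = {'\r', '\n', '\r', '\n'};
    for (size_t i = 0; i < len; ++i) {
        st->bytes_seen++;
        if (data[i] == end[st->crlf_run])
            st->crlf_run++;
        else
            st->crlf_run = (data[i] == '\r') ? 1 : 0;
        if (st->crlf_run == 4) {
            st->done = 1;
            return 1;
        }
    }
    return st->done;
}

/* Demonstration: run two segments through a fresh state. */
static int demo_two_segments(const char *a, const char *b)
{
    struct http_state st = {0};
    feed_segment(&st, a, strlen(a));
    return feed_segment(&st, b, strlen(b));
}
```

The point is that the map value only needs a few bytes of resumable state, not the buffered headers themselves; whether that is fast enough under real traffic is exactly the open question above.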
It doesn't look too different from the majority of HTTP parsers out there written in C. Here is the one from NodeJS as an example [0].
[0] https://github.com/nodejs/http-parser/blob/main/http_parser....
Node's HTTP parser doesn't have to placate the BPF verifier, is why I'm asking.
Doing this with eBPF is definitely an improvement, but when I look at some of the sidecars we run in production, I often wonder why we can't just... integrate them into the application.
You can! There are downsides though for any sufficiently polyglot organization, which is maintaining all the different client SDK's that need to use that.
Sidecars are often useful for platform-centric teams that would like to have access to help manage something like secrets, mTLS, or traffic shaping in the case of Envoy. The team that's responsible for that just needs to maintain a single sidecar rather than all of the potential SDK's for teams.
Especially if you have specific sidecars that only work on a specific infrastructure, for example if you have a Vault sidecar that deals with secrets for your service over EKS IAM permissions, you suddenly can't start your service without a decent amount of mocking and feature flags. It's nice to not have to burden your client code with all of that.
Also, there is a decent amount of work being done on gRPC to speak XDS which also removes the need for the sidecar [0].
Another thing is that your main application artifact can stay static while your sidecar reacts to configuration changes/patches/vulns/updates. Depending on your architecture, this can let some components last for years without a change even though the sidecar/surrounding configuration is doing all sorts of stuff. Back when more people ran Java environments, there were all sorts of settings you could change with just the JVM, without the bytecode moving, for how JCE worked, which was extraordinarily helpful.
It depends on your environment and architecture combined with how fast you can move especially with third party components. Having the microservice be 'dumb' can save everything.
> Especially if you have specific sidecars that only work on a specific infrastructure, for example if you have a Vault sidecar that deals with secrets for your service over EKS IAM permissions, you suddenly can't start your service without a decent amount of mocking and feature flags. It's nice to not have to burden your client code with all of that.
Could you please elaborate on this? I don't fully understand what you mean. Especially, I don't understand if "Its nice to not have to burden your client code with all of that" applies to a setup with or without sidecars.
Take Vault, for example. Rather than having to toggle a flag in your service to get a secret, you could have the Vault sidecar inject the secret automatically into your container, as opposed to passing a configuration flag `USE_VAULT` to your application, which would conditionally use a baked-in Vault client to fetch your secret for you.
Your service doesn't really care where the secret comes from, as long as it can use that secret to connect to some database, API or whatever. So IMO it makes your application code a bit cleaner knowing that it doesn't have to worry about where to fetch a secret from.
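To make the contrast concrete, here is a hedged sketch of the application side (file path and env var name are invented for illustration): the app just reads the secret from wherever it was placed, and cannot tell whether a sidecar or anything else put it there:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* The application only knows "the secret lives at this path". Whether
 * a Vault sidecar wrote the file or something else did is invisible. */
static int read_secret(char *out, size_t cap)
{
    const char *path = getenv("DB_PASSWORD_FILE");
    if (!path)
        path = "/var/run/secrets/db-password";  /* sidecar-injected */
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    size_t n = fread(out, 1, cap - 1, f);
    fclose(f);
    out[n] = '\0';
    if (n > 0 && out[n - 1] == '\n')  /* injected files often end in \n */
        out[n - 1] = '\0';
    return 0;
}

/* Demonstration: simulate the sidecar writing the secret file, then
 * have the app read it back through the env-var indirection. */
static int demo_check(const char *path, const char *secret)
{
    char buf[128];
    FILE *f = fopen(path, "w");
    if (!f)
        return 0;
    fprintf(f, "%s\n", secret);
    fclose(f);
    setenv("DB_PASSWORD_FILE", path, 1);
    if (read_secret(buf, sizeof buf) != 0)
        return 0;
    return strcmp(buf, secret) == 0;
}
```

No `USE_VAULT` flag, no Vault client dependency; in a test environment you just write the file yourself.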
This is actually exactly the case I was thinking of. We have a few applications that use vault in a much more in-depth way than fetching a couple of secrets, so have the need to interact with it directly. We then have the much more common case of applications that use vault to fetch their database credentials, and for those we use a sidecar to fetch them at startup and another to renew them every 30 minutes.
The 2x sidecars do with 150 lines of YAML configuration what could be done with a library and 10 lines of java. And I don't buy the other theoretical benefits either. Easier to update? Each service can reference the library centrally from our monorepo whereas the YAML is copy-pasted to every service. It's also statically type-checked. Polyglot? Yeah, fair, but we're an almost entirely JVM shop.
Some of this could maybe be made easier with something like Kubevela but I don't think you're actually eliminating any complexity that way, just hiding it.
Ok, so you are indeed advocating for the sidecar approach (and on this I fully agree, especially this Vault example)
For a moment I thought you're talking about POSIX directory services.
The author of linkerd argues that splitting this responsibility will improve stability, as you'll have a homogeneous interface (sidecar proxy) over a heterogeneous group of pods. Updating a sidecar container (or using the same one across all applications) is possible, whereas if it's integrated into the application you'll encounter many more barriers and need much wider coordination.
There are good reasons more often than not.
Being able to pick up something generic rather than something language-specific.
Not having to do process supervision (which includes handling monitoring and logs) within your application.
Not making the application lifecycle subservient to needs such as log shipping and request rerouting. People get signal traps wrong surprisingly often.
My gut is that using sidecars doesn't really solve these problems straight up; it just moves them to the orchestrator.
Which is not bad. But that area is also often misconfigured for supervision. And trapping signals remains mostly broken in all sidecars.
Like it or not the socket has become the demarcation mechanism we use. Therefore all software ends up deployed as a thing that talks on sockets. Therefore you can't/shouldn't put functionality that belongs on the other end of the socket inside that thing. If you do that it's no longer the kind of thing you wanted (a discrete unit of software that does something). It's now a larger kind of component (software that does something, plus elements of the environment that software runs within). You probably don't want that.
The irony is arguing for monolithic kernels with a pile of such layers on top.
Offtopic: I really like the style of the diagrams. I remember seeing something similar elsewhere. Are these drawn manually, or are they the result of some tool?
OP here: It's whimsical.com. I really love it.
Thank you, Thomas! I really admire all that you have done with Cilium.
It's not clear how eBPF will deal with mTLS. I actually asked that when interviewing at a company using eBPF for observability into Kubernetes, and the answer was that they didn't know.
Yea, if you're getting TLS termination at the load balancer prior to k8s ingress then it's pretty nice.
Then you should interview again but with us.
This is not too different from wpa_supplicant, used by several operating systems for key management on wireless networks. The complicated key negotiation and authentication can remain in user space; the encryption with the negotiated key can be done in the kernel (kTLS) or, when eBPF can control both sides, it can even be done without TLS, by encrypting with a network-level encapsulation format so it works for non-TCP traffic as well.
Hint: We are hiring.
The answer to this is simple - TLS will start being terminated at the pods themselves. The frontend load balancer will also terminate TLS to the public sphere, and then will authenticate its connection to your backends as well. Kubernetes will provide x509 certificates suitable for service-to-service communications to pods automatically.
The work is still in the early phases, so the exact form this will take has yet to be hammered out, but there's broad agreement that this functionality will be first-class in k8s in the future. If you want to keep running proxies for the other features they provide, great - they'll be able to use the certificates provided by k8s for identity. If you'd like to know more, come to one of the SIG-Auth meetings :)
I am wondering how this would solve the problem of mTLS while still supporting service-level identities. Is it possible to move the mTLS to listeners instead of a sidecar, or some other mechanism?
From a resource perspective this makes sense but from a security perspective this drives me a little bit crazy. Sidecars aren't just for managing traffic, they're also a good way to automate managing the security context of the pod itself.
The current security model in Istio delivers a pod specific SPIFFE cert to only that pod and pod identity is conveyed via that certificate.
That feels like a whole bunch of eggs in 1 basket.
What the proposed architecture allows is to continue using SPIFFE or another certificate management solution to generate and distribute the certificates but use either a per-node proxy or an eBPF implementation to enforce it. Even if the authentication handshake remains in a proxy but data encryption moves to the kernel then that is a massive benefit from an overhead perspective. This already exists and is called kTLS.
There is a good talk about this (and more) from KubeCon:
Not convinced that this is a better solution than just implementing these features as part of the protocol. For example, most languages have libraries that support gRPC load balancing.
https://github.com/grpc/proposal/blob/master/A27-xds-global-...