WolfSSL Sucks Too, So Now What?

OpenSSL sucks. The BoringSSL and AWS-LC forks are Googled and Amazoned to death; they don't care about anyone but their own use cases. I can't remember ever having a good experience with software using GnuTLS. LibreSSL is incomplete...

FOREWORD

This post is about the experience of taking a leap of faith and using WolfSSL as a drop-in replacement for an existing Haproxy server which traditionally uses OpenSSL. The WolfSSL project specifically has an OpenSSL API compatibility layer so you can presumably swap out OpenSSL almost anywhere. I encountered some unexplainable errors with it in my initial testing, reported a bug, forgot about it, and moved on. Later I revisited my infra and swapped Haproxy+WolfSSL in, hit the bug again, recognized the bug, remembered I did indeed report this already, and finally followed up on it.

Anthony was responding on the GitHub issue in near real time, which is pretty interesting as I've never had that happen before. It was clumsier than being in a proper chat, but if that's how he responds on GitHub I bet their support contracts have a pretty incredible SLA response time.

The goal of this post was to document a bizarre debugging journey for a problem I could not reproduce with anything except one language (Erlang/OTP via Elixir), and also to make it clear to Elixir/Erlang devs searching the web in the future how to identify the root cause of this error. In the process I learned that TLS 1.3 has a feature I didn't realize even existed, that WolfSSL 5.8.4 doesn't handle it properly, and that there is a workaround.

The experience of swapping WolfSSL in and running into unexplainable errors is what sucked. The inability to successfully communicate the urgency and importance of fixing this sucked. I don't have any interest in arguing with people on GitHub, especially when I'm not a TLS expert (why listen to me?), so I walked away.

Also, orange site commenters need to get a life. They have amplified this post to the point that it's really dragging WolfSSL itself through the mud.

It sounds like WolfSSL is going to get this sorted out pretty quickly, which is great to hear. Swapping out a TLS library for a server is a pretty risky change, so it's no wonder we nearly have a monoculture in this area. I dream of a world where we have several full-featured production-ready options other than OpenSSL or its forks. Perhaps we're almost there and this is one of the last road bumps?

What happened now?

Last year Haproxy published an article about how terribly slow OpenSSL has become. It made the rounds a few times, and I had an itch to scratch, so I helped enable FreeBSD to package a variant of Haproxy built against WolfSSL. This seemed like an easy way to give people wider exposure to WolfSSL, as it's unlikely we'll see this happen on most Linux distros. There, the only people running a WolfSSL-backed Haproxy are the ones who know what they're getting into and built it themselves. I haven't actually checked whether Arch, Gentoo, Nix, etc. have done something similar, but they'd be the easiest places to produce a similar haproxy-wolfssl package.

So I did it, and I ran WolfSSL in a few places, and then I hit a bug. I reported the bug, forgot about it, and moved on. And then I hit it again and was motivated to actually figure it out. So I reopened the bug and fumbled my way through debugging the issue until the root cause was identified.

TLS 1.3 is defined in RFC 8446. It works quite a bit differently from TLS 1.2, which caused its authors no end of issues, to the point that they documented "The design of TLS 1.3 was constrained by widely deployed non-compliant TLS middleboxes".

Ahh yes, the infamous middleboxes. Great. Those invisible pieces of garbage that can tamper with your traffic and you'll generally never know they exist until they cause you grief. And they will.

Middlebox Hell

Hell is definitely a place where middleboxes were invented and no amount of wishcasting will remove them from existence. Although maybe some Etsy witches could provide some guidance as they have incredible luck solving problems...

Anyway, so we have all these middleboxes and they suck, and we want the network to have better security guarantees than TLS 1.2, but the boxes only understand TLS 1.2, and TLS 1.3 can't exist if the boxes break it, so TLS 1.3 has to be able to pretend to be TLS 1.2. That's where we're at with this.

So the authors hemmed and hawed about this and came up with a solution: Middlebox Compatibility Mode.

Essentially, clients can optionally set a non-empty session ID in the ClientHello to fool the middleboxes, and the client and server exchange dummy change_cipher_spec records. This is useless and just adds latency to establishing the TLS session, but it will work. Fair.
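
To make those two tricks concrete, here is a minimal sketch of the client side. It is not taken from any real TLS stack: the module and function names (MiddleboxCompat, legacy_session_id/1, dummy_ccs_record/0) are made up for illustration. The byte values follow RFC 8446: a fresh 32-byte session ID, and a change_cipher_spec record with content type 20, legacy record version 0x0303, length 1, and a single payload byte 0x01.

```elixir
defmodule MiddleboxCompat do
  @moduledoc "Hypothetical illustration of RFC 8446 Appendix D.4; not a real TLS stack."

  # In compatibility mode the ClientHello carries a fresh 32-byte
  # legacy_session_id; otherwise the field is left empty.
  def legacy_session_id(true), do: :crypto.strong_rand_bytes(32)
  def legacy_session_id(false), do: <<>>

  # The dummy change_cipher_spec record as it appears on the wire:
  # content type 20, legacy version 0x0303, length 0x0001, payload 0x01.
  def dummy_ccs_record, do: <<20, 0x03, 0x03, 0x00, 0x01, 0x01>>
end
```

The session ID means nothing to a TLS 1.3 server beyond "echo this back and send me the dummy record"; it only exists so the handshake looks like a TLS 1.2 session resumption to anything sitting in the middle.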

The Upside Down

The RFC is pretty clear about how this is all meant to play out.

RFC 8446

This "compatibility mode" is partially negotiated: the client can opt to provide a session ID or not, and the server has to echo it.

and

RFC 8446

if the client sends a non-empty session ID, the server MUST send the change_cipher_spec as described in this appendix.
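
Those two requirements boil down to a check the client can run against the server's first flight. A rough sketch, assuming a hypothetical module (CompatCheck) and a made-up, simplified view of the server's reply: a map holding the echoed session ID and the list of message types seen. Real stacks, Erlang's ssl included, do this inside their handshake state machines rather than in one function.

```elixir
defmodule CompatCheck do
  # sent_id: the legacy_session_id the client put in its ClientHello.
  # reply:   hypothetical simplified view of the server's response,
  #          %{session_id_echo: binary, records: [atom]}.
  def check(sent_id, %{session_id_echo: echo, records: records}) do
    ccs_seen? = Enum.member?(records, :change_cipher_spec)

    cond do
      # The server has to echo whatever session ID the client sent.
      echo != sent_id -> {:error, :session_id_not_echoed}
      # A non-empty session ID obliges the server to send change_cipher_spec.
      sent_id != <<>> and not ccs_seen? -> {:error, :missing_change_cipher_spec}
      true -> :ok
    end
  end
end

id = :crypto.strong_rand_bytes(32)

# The failing case from this post: the ID is echoed but no dummy record arrives.
CompatCheck.check(id, %{session_id_echo: id, records: [:server_hello, :encrypted_extensions]})
# → {:error, :missing_change_cipher_spec}
```

A client that sends an empty session ID never hits the second branch, which is why turning the mode off on the client side sidesteps the problem.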

But WolfSSL says "thanks but no thanks". The entire middlebox compatibility functionality is gated behind compiling the library with WOLFSSL_TLS13_MIDDLEBOX_COMPAT. [Update: fresh round of testing, this feature DOES work when enabled. Fault was discovered in my test environment.]

So the current state is that WolfSSL cannot be trusted to work correctly with TLS 1.3 clients unless the library was specifically built with this flag enabled. The normal TLS 1.3 configure flags aren't enough. So by default, interoperability depends on how forgiving the clients are, but you shouldn't be forgiving when implementing the RFC correctly... The GitHub issue comment left at the end leads me to believe that they aren't really interested in RFC compliance. There isn't a middle ground here or a "different way" of implementing middlebox compatibility. It's either RFC compliant or not. And they're not. Asking me to open a new issue to discuss this behavior instead of it being a high priority for them to open up a new issue internally to fix this is odd. I'm not here to do their homework for them.

Note

Correction: previous edit mentioned WolfSSL is owned by ARM here, but I've mentally swapped PolarSSL for WolfSSL. Whoops.

This sucks extra hard because WolfSSL is also used in a lot of embedded devices. The right thing to do from a security PoV should be to always enable Middlebox Compatibility Mode on your TLS 1.3 clients to increase the chances that your TLS 1.3 handshake will be successfully established. But now we can't do that. Should we wait a few years for WolfSSL to have releases out there with this fixed? Well, since it's often used in embedded devices too you might want to wait a decade before you can use this feature blindly if you don't control both the client and server deployments. Big yikes.

The Plaintiff

Currently I've only identified one victim of this decision, but there's bound to be more out there. Erlang/OTP has its own ssl library implementation and you can rightfully assume that they've taken Joe's advice to heart when adding TLS 1.3 support:

Joe Armstrong

Make it work, then make it beautiful, then if you really, really have to, make it fast.

So to cover their butts, they opted to enable middlebox_comp_mode by default. (If you want it fast and you know it's safe to do so -- turn it off)

And now every Elixir/Erlang/etc HTTP client fails to connect to a WolfSSL HTTPS server if TLS 1.3 is available.

Where Do We Go From Here?

OpenBSD was probably right. We just need to get people to focus on LibreSSL and forget about these other libraries. As Haproxy noted, LibreSSL isn't a victim of the OpenSSL 3.0 screwups because it forked earlier, but it's missing some optimizations. I think that's probably a fair trade-off and the gaps will be filled in due time.

So don't be like me. This was hubris I guess. Sure, I thought I could be clever and have faster TLS termination for my websites, but all it did was lead me to wasting a lot of time learning about something I really didn't care to know, and then writing this stupid blog post. You've been warned.

Elixir PoC

A PoC for Elixir 1.17.3 (compiled with Erlang/OTP 26) is as simple as below:

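#!/usr/bin/env elixir

# Replace with the HTTPS endpoint served by WolfSSL.
url = "https://some-wolfssl-endpoint"

# :httpc is an Erlang library and expects a charlist, not a binary.
url = String.to_charlist(url)

# :httpc lives in :inets; TLS support comes from :ssl.
{:ok, _} = Application.ensure_all_started(:inets)
{:ok, _} = Application.ensure_all_started(:ssl)

# Debug logging for :ssl makes the handshake failure visible.
:logger.set_application_level(:ssl, :debug)

http_options =
  [
    ssl: [
      verify: :verify_peer,
      cacerts: :public_key.cacerts_get(),
      depth: 2,
      customize_hostname_check: [
        match_fun: :public_key.pkix_verify_hostname_match_fun(:https)
      ],
      versions: [:"tlsv1.2", :"tlsv1.3"],
      # Middlebox Compatibility Mode: the setting that triggers the failure.
      middlebox_comp_mode: true
    ]
  ]

# Return the response body as a binary instead of a charlist.
options = [body_format: :binary]

:httpc.request(:get, {url, []}, http_options, options)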

The error is going to look like this:

11:00:44.996 [warning] Description: ~c"Failed to assert middlebox server message"
     Reason: [missing: {:change_cipher_spec, 1}]

11:00:45.014 [notice] TLS :client: In state :hello_middlebox_assert at ssl_gen_statem.erl:821 generated CLIENT ALERT: Fatal - Unexpected Message
 - {:unexpected_msg,
 {:internal,
  {:encrypted_extensions,
   %{
     elliptic_curves: {:supported_groups,
      [:secp521r1, :secp384r1, :secp256r1, :x25519, :ffdhe2048]}
   }}}}

That's when you'll know you've been thrown to the wolves. 🤬 If you change those http_options to set middlebox_comp_mode to false it will work as expected.
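
As a sketch of that workaround, the switch can be made explicit with a tiny helper that overrides middlebox_comp_mode in an existing list of ssl options. The module name TlsOpts and the function put_middlebox_compat/2 are hypothetical; the option key is the same one used in the PoC above.

```elixir
defmodule TlsOpts do
  # Returns ssl_opts with middlebox_comp_mode forced to the given value.
  # Pass false when the server is known not to send the dummy
  # change_cipher_spec record.
  def put_middlebox_compat(ssl_opts, enabled?)
      when is_list(ssl_opts) and is_boolean(enabled?) do
    Keyword.put(ssl_opts, :middlebox_comp_mode, enabled?)
  end
end

# Usage with the PoC's variables (not run here, it needs a live endpoint):
#
#   ssl = TlsOpts.put_middlebox_compat(http_options[:ssl], false)
#   :httpc.request(:get, {url, []}, [ssl: ssl], options)
```

This only changes what the client asks for. It does nothing for clients you don't control, which is why the server-side behavior is the real issue.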