I Wail, For My Tailscale Fails-- How My Packets Got Dropped Beyond the Pale



An innocent start: setting up ollama for autocomplete

I don't use autocomplete much these days, so I unsubscribed from that service. However, I do still need to tab tab tab once a week for something simple. Thankfully, a lot of competitors have caught up and deliver a similar feature at a satisfactory level, with self-hosted models to boot. This is a particular edge case for me because I have a Windows PC with a 4090 sitting around accumulating dust while I'm coding on my Mac, so running a modern model for small contexts is totally feasible.

I have to do this all on WSL (Windows Subsystem for Linux), which runs a virtualized Linux on Windows' Hyper-V hypervisor. Well, I don't have to, but I won't deign to use PowerShell or batch to install and manage a program on native Windows 11. Luckily, I already had it installed. Now, it's as easy as:

  1. Downloading Ollama (a "package" manager, but for models)
  2. Downloading a few models tuned for autocomplete (Qwen2.5 Coder)
  3. Adding the ollama inference endpoint to the VS Code Continue extension's config

Another Brick in the Wall

The autocompletions had high variance in latency, so I tried to open my Grafana dashboard to dig deeper, and was hit with this:

wsl is my magicDNS shorthand for the wsl device on my tailnet, which is running ollama

This surprised me, because I set up the dashboard by hand on the WSL side, so I know it worked at some point. Sure enough, when I go to my PC and open localhost:3033:

Grafana loads locally

Okay... is this a tailnet issue? But wait! When I set up ollama earlier, I ran curl to get the model name so I could shove it into the autocomplete config.

> curl http://wsl:11434/v1/models
{"object":"list","data":[{"id":"qwen2.5-coder:32b","object":"model","crea...}

So tailnet baseline is working at least in some cases. Can I curl the grafana dashboard?

> curl http://wsl:3033/
<a href="/login">Found</a>.

Some observations:

  1. It works
  2. It was blazingly fast
  3. It's obviously not the home page-- it looks like a tiny 302 redirect to login.

At this moment, the Empiricist and the Intuitionist struggle for the steering wheel:

Empiricist[1]: What works? What doesn't? Payload size? Protocol? Tailnet? Public IP? Local? Content type? Run an experiment with all these factors permuted, with control groups and trial groups.

Intuitionist[2]: What's going on at every layer down to the bytes on the wire? Can we see the packets to understand what's going on at the bit level? What exactly is the browser receiving when it gets that error? And why does it work for curl?

If I learned anything from Tequila Sunset, it's that I have to be the one to call the shots and make these two work together. I decided on making more curl calls with different parameters, while also using them to build a mental model as quickly as possible.

These two are always fighting

Let's curl for login and measure more things:

> curl -m 10 -L -o /dev/null -w '%{size_download} bytes in %{time_total}s\n' http://wsl:3033/login
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
curl: (28) Connection timed out after 10145 milliseconds
0 bytes in 10.145160s

Okay... interesting. That's very interesting.

  1. Not a 404, just timed out in 10s
  2. No bytes received, none in transit
  3. Recall we're able to visit this page on the browser on the machine locally

Let's sanity check that the root page still works and get some more info:

curl -m 10 -o /dev/null -w '%{size_download} bytes in %{time_total}s\n' http://wsl:3033/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    29  100    29    0     0    188      0 --:--:-- --:--:-- --:--:--   189
29 bytes in 0.153534s

Yes, it still works, and it's only 29 bytes-- is it a size thing? I know we can send a specified number of bytes via ping, so let's use that:

> ping -c 3 -s 29 wsl
PING redacted_hostname (redacted_ip): 28 data bytes
36 bytes from redacted_ip: icmp_seq=0 ttl=64 time=77.566 ms
> ping -c 3 -s 55000 wsl
# the login page on the browser ended up being around 55kb
PING redacted_hostname (redacted_ip): 55000 data bytes
ping: sendto: Message too long
ping: sendto: Message too long
Request timeout for icmp_seq 0
ping: sendto: Message too long
Request timeout for icmp_seq 1

It times out, and we get an interesting message: sendto: Message too long. This reaffirms my suspicion that we're on the right track and it's size related. The two pings also tell me that somewhere between 29 bytes and 55kb, we cross some kind of threshold that breaks the request. The exact number might be important-- it could be a magic constant that's easily googlable or verifiable. Let's find it using an algorithm I haven't coded since college:

import subprocess, sys

host, lo, hi = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])

# True if a single ping with an n-byte payload succeeds.
# (-W is the reply timeout; on macOS it's in milliseconds.)
probe = lambda n: subprocess.run(
    ["ping", "-c1", "-s", str(n), "-W", "2000", host],
    capture_output=True
).returncode == 0

# binary search: invariant is that lo always works and hi always fails
while lo + 1 < hi:
    mid = (lo + hi) // 2
    lo, hi = (mid, hi) if probe(mid) else (lo, mid)
    print(f"  {mid}: {'OK' if lo == mid else 'FAIL'}")

print(f"\nthreshold: {lo}b works, {hi}b fails")
> python3 size_search.py wsl 28 55000
  27514: FAIL
  13771: FAIL
  6899: OK
  10335: FAIL
  8617: FAIL
  7758: OK
  8187: FAIL
  7972: OK
  8079: OK
  8133: OK
  8160: OK
  8173: OK
  8180: OK
  8183: OK
  8185: FAIL
  8184: OK

threshold: 8184b works, 8185b fails

8184 seems suspiciously close to 2**13 = 8192, but my Encyclopedia[3] roll failed. We'll keep it in mind as we continue our investigation. At this point, Intuitionist wanted a stab at it:

Let's see the goods

We'll use a shark to see the contents on the wire when we ping 8184 bytes and 8185 bytes. There must be a hint in there somewhere.

# In one terminal
ping -c 1 -s 8184 wsl

# In another
sudo tshark -i utun0 icmp
Capturing on 'utun0'
    1   0.000000 Mac → WSL IPv4 1280 Fragmented IP protocol (proto=ICMP 1, off=0, ID=ab1b)
    2   0.000070 Mac → WSL IPv4 1280 Fragmented IP protocol (proto=ICMP 1, off=1256, ID=ab1b)
    3   0.000312 Mac → WSL IPv4 1280 Fragmented IP protocol (proto=ICMP 1, off=2512, ID=ab1b)
    4   0.000358 Mac → WSL IPv4 1280 Fragmented IP protocol (proto=ICMP 1, off=3768, ID=ab1b)
    5   0.000393 Mac → WSL IPv4 1280 Fragmented IP protocol (proto=ICMP 1, off=5024, ID=ab1b)
    6   0.000421 Mac → WSL IPv4 1280 Fragmented IP protocol (proto=ICMP 1, off=6280, ID=ab1b)
    7   0.000446 Mac → WSL ICMP 680 Echo (ping) request  id=0xd943, seq=0/0, ttl=64
    8   0.028338 WSL → Mac IPv4 1280 Fragmented IP protocol (proto=ICMP 1, off=0, ID=8e16)
    9   0.028353 WSL → Mac IPv4 1280 Fragmented IP protocol (proto=ICMP 1, off=1256, ID=8e16)
   10   0.031461 WSL → Mac IPv4 1280 Fragmented IP protocol (proto=ICMP 1, off=2512, ID=8e16)
   11   0.031472 WSL → Mac IPv4 1280 Fragmented IP protocol (proto=ICMP 1, off=3768, ID=8e16)
   12   0.032734 WSL → Mac IPv4 1280 Fragmented IP protocol (proto=ICMP 1, off=5024, ID=8e16)
   13   0.032740 WSL → Mac IPv4 1280 Fragmented IP protocol (proto=ICMP 1, off=6280, ID=8e16)
   14   0.034452 WSL → Mac ICMP 680 Echo (ping) reply    id=0xd943, seq=0/0, ttl=64 (request in 7)

So we send 1 ping and get 1 reply, each broken up into packets of at most 1280 bytes, totaling 8360 bytes. The difference between 8360 and 8184 must be the sum of all the overhead bytes. It really bothered me that the cumulative bytes aren't cleanly divisible by 7, the total number of packets. Perhaps the overhead bytes are different for the last packet? Intuitionist had a million more questions that would have taken us far astray from our set course, but I steered us back onto our original path. Let's run it again with 1 more byte, since that was our measured cutoff.
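An aside before that: the accounting does work out with constant per-frame overhead, if you assume macOS utun frames prepend a 4-byte protocol-family header before the IP header (a tun device has no Ethernet framing-- this header size is my assumption). A quick sketch:

```python
# Each captured utun0 frame = 4B utun family header + 20B IP header + data.
# IP fragment offsets must be multiples of 8 bytes, which is why each full
# fragment carries 1256 bytes of data rather than 1260.
UTUN_HDR, IP_HDR, ICMP_HDR = 4, 20, 8

frames = [1280] * 6 + [680]                 # frame sizes tshark reported
data = [f - UTUN_HDR - IP_HDR for f in frames]

assert sum(frames) == 8360                  # the bothersome total
assert data[0] == 1256                      # matches the offset step in the capture
assert sum(data) == ICMP_HDR + 8184         # one whole 8192-byte ICMP message
```

So the overhead is the same 24 bytes on every frame; the total just isn't divisible by 7 because the last fragment is short. Anyway, back to the run with one extra byte: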

# In one terminal
ping -c 1 -s 8185 wsl
PING wsl.tailXXXXXXX.ts.net (100.115.172.35): 8185 data bytes
ping: sendto: Message too long
^C
--- wsl.tailXXXXXXX.ts.net ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss

# In another
> sudo tshark -i utun0 icmp
Capturing on 'utun0'

# ... no output

Huh? We don't even make it to the wire this time. Does ping not even make the syscall to send the message to the socket? Let's validate that assumption:

> sudo dtruss ping -c 1 -s 8185 wsl 2>&1 | rg sendto
sendto_nocancel(0x5, 0x9DAC00AA0, 0x1C)          = 28 0
sendto_nocancel(0x5, 0x1051CC210, 0x44)          = 68 0
sendto_nocancel(0x5, 0x9DAC00AA0, 0x1C)          = 28 0
sendto(0x3, 0x104B78014, 0x2001)                 = 8193 0

> sudo dtruss ping -c 1 -s 8184 wsl 2>&1 | rg sendto
sendto_nocancel(0x5, 0xC1EC00AA0, 0x1C)          = 28 0
sendto_nocancel(0x5, 0x100A84210, 0x44)          = 68 0
sendto_nocancel(0x5, 0xC1EC00AA0, 0x1C)          = 28 0
sendto(0x3, 0x100268014, 0x2000)                 = 8192 0

Hmm this is not the output I was expecting. Why are we making sendto syscalls but seeing nothing on the wire? We make syscalls even if the payload size is too big, really? Let's try again with a ridiculous number:

> sudo dtruss ping -c 1 -s 9999999 wsl 2>&1 | rg sendto
# no output

This is what I was expecting. Hold on-- then there's Some Secret Third Thing that still fails to send the ping, but straddles the line between making the syscall or not? I wanted to see the entire syscall progression without rg (a grep alternative), and a few familiar keywords called out to me:

> sudo dtruss ping -s 8185 wsl
SYSCALL(args)            = return
PING wsl.tailXXXXXXX.ts.net (100.115.172.35): 8185 data bytes
getuid(0x0, 0x0, 0x0)            = 0 0
socket(0x2, 0x3, 0x1)            = 3 0
getuid(0x0, 0x0, 0x0)            = 0 0
setuid(0x0, 0x0, 0x0)            = 0 0
getuid(0x0, 0x0, 0x0)            = 0 0
write_nocancel(0x1, "Request timeout for icmp_seq 0\n\0", 0x1F)          = 31 0

We ask the system what user we are, and we get 0: anyone should be able to guess what user could possibly have such a perfect number. It's root. So we're calling ping as root, but I didn't call sudo ping! Or at least, not intentionally-- I ran dtruss with sudo because you obviously need escalated privileges to peer into syscalls:

dtruss ping -s 8185 wsl
dtrace: failed to initialize dtrace: DTrace requires additional privileges

So that means, if we run ping as root, we can get away with bigger packets than non-root? This sounds right. I let Empiricist take over here for a bit to figure this out, while Intuitionist bangs at the confines of its temporary jail.

> ping -c 1 -s 8185 wsl
PING wsl.tailXXXXXXX.ts.net (100.115.172.35): 8185 data bytes
ping: sendto: Message too long

> sudo ping -c 1 -s 8185 wsl
Password:
PING wsl.tailXXXXXXX.ts.net (100.115.172.35): 8185 data bytes
8193 bytes from 100.115.172.35: icmp_seq=0 ttl=64 time=295.612 ms

Suspicions confirmed. So how far can I push ping with sudo?

❯ sudo python3 payload_size_limit_search.py wsl 28 100000
  50014: OK
  75007: FAIL
  62510: OK
  68758: FAIL
  65634: FAIL
  64072: OK
  64853: OK
  65243: OK
  65438: OK
  65536: FAIL
  65487: OK
  65511: FAIL
  65499: OK
  65505: OK
  65508: FAIL
  65506: OK
  65507: OK

threshold: 65507b works, 65508b fails

Another proximity to a magic constant: 2 ** 16 = 65536, just 28 bytes away. The gap is bigger than in the previous non-sudo pairing, so it's not the same limit-- but this time my Encyclopedia roll succeeded: 65,535 is the maximum size of an entire IP packet, regardless of protocol.

For IPv4, the IP header is 20 bytes. 65,535 (IP max) - 20 (IP header) = 65,515 bytes, 8 bytes away from our maximum payload size. And ping's protocol, ICMP, has exactly 8 bytes of header: 65,515 - 8 = 65,507, precisely our threshold.
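Spelled out as a check, with header sizes per the IPv4 and ICMP specs:

```python
IP_MAX = 2**16 - 1   # IPv4 Total Length is a 16-bit field -> 65,535 bytes
IP_HDR = 20          # minimal IPv4 header, no options
ICMP_HDR = 8         # ICMP echo header

# The biggest echo payload any single IPv4 packet can carry:
assert IP_MAX - IP_HDR - ICMP_HDR == 65507   # our measured sudo threshold
```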

There's probably something about how root is allowed a bigger socket buffer than non-root, but we can come back to that later. I know it isn't relevant to our issue, because we didn't run curl as root, so we're hitting a wall unrelated to socket buffer sizes.

Let's see the syscalls for non-root ping, attaching to a process ID instead so we can run ping as a non-root user while tracing it as root.

ping -c 5 -s 8185 wsl & sudo dtruss -p $! 2>&1 | rg sendto
[1] 30925
PING wsl.tailXXXXXXX.ts.net (100.115.172.35): 8185 data bytes
ping: sendto: Message too long
ping: sendto: Message too long
Request timeout for icmp_seq 0
sendto(0x3, 0x100C28014, 0x2001)                 = -1 Err#40
write_nocancel(0x2, "sendto\0", 0x6)             = 6 0
ping: sendto: Message too long
Request timeout for icmp_seq 1

Nice, check out that error from sendto: -1 Err#40. From the trusty dusty manual:

man errno | rg 40 -C 1

     40 EMSGSIZE Message too long. A message sent on a socket was larger than
             the internal message buffer or some other network limit.

So, it's unfortunate that the error isn't specific about which limit we're hitting.

  1. Even when we fragment packets, there's a max kernel-level payload size limit we hit
  2. It's bigger if we ping as root
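Incidentally, you don't need ping to see this errno; any UDP socket refuses a datagram whose IP packet would exceed 65,535 bytes. A minimal sketch (the address and port are throwaway values, purely illustrative):

```python
import errno
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
try:
    # 65,508B payload + 8B UDP header + 20B IP header = 65,536 > the IP max
    s.sendto(b"x" * 65508, ("127.0.0.1", 9))
    err = None
except OSError as e:
    err = e.errno
s.close()

assert err == errno.EMSGSIZE  # "Message too long", same as ping's Err#40
```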

I don't see any more crumbs to follow.

At this point, I started experimenting with how much data I could send from the WSL side.

> ping -s 1150 droplet
PING droplet.tailXXXXXXX.ts.net (100.97.107.120) 1150(1178) bytes of data.
1158 bytes from droplet.tailXXXXXXX.ts.net (100.97.107.120): icmp_seq=1 ttl=64 time=66.3 ms

I couldn't stop thinking about how the packets were auto-fragmented. What happens if I force them to not fragment?

> ping -s 1300 -M do droplet
PING droplet.tailXXXXXXX.ts.net (100.97.107.120) 1300(1328) bytes of data.
ping: local error: message too long, mtu=1280

Oh? We got more information out of the same error message than we did on the Mac side: mtu=1280. Encyclopedia vaguely recalls this is about packet size limits, so let's see if we can send a payload right under this value.

ping -s 1252 -M do droplet
PING redacted_tailnet_host (redacted_ip) 1252(1280) bytes of data.
ping: local error: message too long, mtu=1200
ping -s 1200 -M do droplet
PING redacted_tailnet_host (redacted_ip) 1200(1228) bytes of data.
ping: local error: message too long, mtu=1200
ping: local error: message too long, mtu=1200
^C
--- redacted_tailnet_host ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1082ms

➜ ping -s 1199 -M do droplet
PING redacted_tailnet_host (redacted_ip) 1199(1227) bytes of data.
ping: local error: message too long, mtu=1200

So going just under the MTU doesn't work either-- most likely because the MTU applies at a lower layer than the ping payload, so what has to fit under it is the payload plus the header overhead. This is the point where Intuitionist busted out of its cage: I needed to know what this was and how it worked. After an hour of wandering around the web:

MTU (maximum transmission unit) is the largest packet size an interface will send (not receive). It's configurable, but there are standardized defaults that vary across OSes. When the kernel passes a packet (assuming it's under the socket buffer size) to the NIC, it'll split the packet up if it's bigger than the MTU set for that interface-- unless the don't-fragment flag is set (-M do on Linux's ping), in which case it's rejected with EMSGSIZE. Each NIC along the network path has its own MTU-- which means there's a "weakest link" MTU that bottlenecks the entire path. This is called the path MTU.
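For a 1280-byte MTU, the arithmetic that kept appearing in those ping errors looks like this (IPv4, minimal headers):

```python
MTU = 1280                             # the limit those errors reported
IP_HDR, ICMP_HDR, UDP_HDR = 20, 8, 8   # IPv4 / ICMP echo / UDP header sizes

# Largest single-packet payload when fragmentation is forbidden:
assert MTU - IP_HDR - ICMP_HDR == 1252   # a 1252B ping payload -> exactly 1280 on the wire
assert MTU - IP_HDR - UDP_HDR == 1252    # same ceiling for a raw UDP datagram
```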

Could it be that this is the limit we're looking for? Let's grab it on the WSL:

> ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1280 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:15:5d:f5:f5:4c brd ff:ff:ff:ff:ff:ff
20: tailscale0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1280 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 500
    link/none

Right right right, I vaguely recall now how tailscale works: it sends a packet to a virtual interface, then WireGuard wraps/encrypts it, and then sends it out the main NIC (in this case eth0). To recap the onion layers:

    ┌──────────────────────────────────────────────────────────────┐
    │ Outer IP Header (eth0 starts here)                 20 B      │
    │ ┌────────────────────────────────────────────────────────┐   │
    │ │ UDP Header                                        8 B  │   │
    │ │ ┌──────────────────────────────────────────────────┐   │   │
    │ │ │ WireGuard Header                           32 B  │   │   │
    │ │ │ ╔════════════════════════════════════════════╗   │   │   │
    │ │ │ ║ encrypted (tailscale0 starts here)         ║   │   │   │
    │ │ │ ║ ┌────────────────────────────────────────┐ ║   │   │   │
    │ │ │ ║ │ Inner IP Header                  20 B  │ ║   │   │   │
    │ │ │ ║ ├────────────────────────────────────────┤ ║   │   │   │
    │ │ │ ║ │ ICMP Header                       8 B  │ ║   │   │   │
    │ │ │ ║ ├────────────────────────────────────────┤ ║   │   │   │
    │ │ │ ║ │ Payload                           N B  │ ║   │   │   │
    │ │ │ ║ └────────────────────────────────────────┘ ║   │   │   │
    │ │ │ ╚════════════════════════════════════════════╝   │   │   │
    │ │ └──────────────────────────────────────────────────┘   │   │
    │ └────────────────────────────────────────────────────────┘   │
    └──────────────────────────────────────────────────────────────┘
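Summing the layers in the diagram, a tailscale0-sized packet picks up 60 bytes of framing before it reaches eth0 (a sketch of the arithmetic):

```python
OUTER_IP, UDP, WIREGUARD = 20, 8, 32   # header sizes from the diagram
overhead = OUTER_IP + UDP + WIREGUARD  # tunnel framing per packet

inner = 1280                           # a full tailscale0-MTU packet
assert overhead == 60
assert inner + overhead == 1340        # bigger than eth0's 1280 MTU
```

Which already hints at trouble: anything near tailscale0's MTU can't leave eth0 in one piece.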

In any case, 1280 bytes on tailscale0 is the local path MTU on WSL. Let's watch both interfaces on WSL (tailscale0 and eth0) as we ping it from my Mac:

# on the macmini
ping -c 30 -W 3000 -s 1220 wsl
PING wsl.tailXXXXXXX.ts.net (100.115.172.35): 1220 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2

Old news.

> sudo tshark -i tailscale0 -a duration:15 icmp
Running as user "root" and group "root". This could be dangerous.
Capturing on 'tailscale0'
    1 0.000000000  100.92.8.62 → 100.115.172.35 ICMP 1248 Echo (ping) request  id=0x1efa, seq=0/0, ttl=64
    2 0.000064472 100.115.172.35 → 100.92.8.62  ICMP 1248 Echo (ping) reply    id=0x1efa, seq=0/0, ttl=64 (request in 1)
    3 1.076146346  100.92.8.62 → 100.115.172.35 ICMP 1248 Echo (ping) request  id=0x1efa, seq=1/256, ttl=64
    4 1.076206459 100.115.172.35 → 100.92.8.62  ICMP 1248 Echo (ping) reply    id=0x1efa, seq=1/256, ttl=64 (request in 3)
    5 2.070698932  100.92.8.62 → 100.115.172.35 ICMP 1248 Echo (ping) request  id=0x1efa, seq=2/512, ttl=64
    6 2.070753872 100.115.172.35 → 100.92.8.62  ICMP 1248 Echo (ping) reply    id=0x1efa, seq=2/512, ttl=64 (request in 5)
    7 3.118225397  100.92.8.62 → 100.115.172.35 ICMP 1248 Echo (ping) request  id=0x1efa, seq=3/768, ttl=64
    8 3.118287102 100.115.172.35 → 100.92.8.62  ICMP 1248 Echo (ping) reply    id=0x1efa, seq=3/768, ttl=64 (request in 7)
8 packets captured

tailscale0 is responding! At least, it thinks it's responding (to eth0), because obviously on the Mac side we didn't receive anything back. One packet in, one packet out. Tit for tat. Each 1248 bytes.

> sudo tshark -i eth0 -a duration:15 "host 192.168.4.21 and not port 41641 or (host 192.168.4.21 and len > 500)"
Running as user "root" and group "root". This could be dangerous.
Capturing on 'eth0'
    1 0.000000000 192.168.4.21 → 172.28.10.118 WireGuard 1322 Transport Data, receiver=0x08C609CF, counter=5, datalen=1248
    2 0.000584128 172.28.10.118 → 192.168.4.21 IPv4 1290 Fragmented IP protocol (proto=UDP 17, off=0, ID=ed2f)
    3 0.000593567 172.28.10.118 → 192.168.4.21 WireGuard 66 Transport Data, receiver=0x0BBE8220, counter=5, datalen=1248
    4 1.075943019 192.168.4.21 → 172.28.10.118 WireGuard 1322 Transport Data, receiver=0xD8479C8E, counter=1, datalen=1248
    5 1.076814372 172.28.10.118 → 192.168.4.21 IPv4 1290 Fragmented IP protocol (proto=UDP 17, off=0, ID=edc3)
    6 1.076828835 172.28.10.118 → 192.168.4.21 WireGuard 66 Transport Data, receiver=0x096E02A4, counter=0, datalen=1248

This is fascinating. From eth0's perspective, it's one packet in but two packets out. And the packet in is 1322 bytes, not 1248. Lastly, the packets out don't cumulatively add up to 1322 either; they add up to 1356 bytes. So:

  1. eth0 is operating at a lower level of abstraction than tailscale0, which adds additional overhead. As per the onion diagram above, this is what we expected.
  2. packets leaving eth0 are being fragmented by the kernel because they are bigger than the MTU.
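The byte counts reconcile exactly if you assume these eth0 captures include a 14-byte Ethernet header in the reported frame lengths (my assumption; tshark frame sizes on an Ethernet-style interface usually do):

```python
ETH, IP, UDP, WG = 14, 20, 8, 32
inner = 1248                      # the ICMP packet tailscale0 exchanged

# Inbound: one unfragmented WireGuard datagram
assert ETH + IP + UDP + WG + inner == 1322

# Outbound: the 1308-byte IP packet exceeds the 1280 MTU and is split.
# Fragment data must align to 8 bytes, so it splits as 1256 + 32.
ip_payload = UDP + WG + inner     # 1288 bytes behind the outer IP header
frag1, frag2 = 1256, ip_payload - 1256
assert ETH + IP + frag1 == 1290   # the first captured fragment
assert ETH + IP + frag2 == 66     # the trailing WireGuard sliver
assert 1290 + 66 == 1356
```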
# on the macmini
sudo tshark -i en1 -a duration:15 "udp port 41641 or ip[6:2] & 0x3fff != 0"
Capturing on 'Wi-Fi: en1'
<crickets>

So

  1. wsl:tailscale0 is receiving a ping from macmini, and, in its opinion, successfully replying to wsl:eth0
  2. wsl:eth0 is getting slightly bigger packets, just big enough to have to break them up, and sending those out, in its opinion, successfully
  3. ✨WSL ➜ Windows✨
  4. ✨The Router and Beyond✨
  5. macmini:en1 is not receiving anything, not even fragments.

Somebody is asleep at the wheel. The worst part is that none of the layers think there's a real problem, yet clearly the chain is broken somewhere. This is the sad part of the fire-and-forget network paradigm at the core of how packets travel.

I have to keep peeling this onion: "tailscale0 to eth0 to router" is not good enough anymore. I know that WSL is virtualized on top of Windows, so the packets must cross over somewhere. Where is wsl:eth0 sending the data?

# Trace the resolved IP (macmini is on same router as wsl, so just trace the local LAN IP directly).
# What's the route to get to my local macmini device?
ip route get 192.168.4.21
192.168.4.21 via 172.28.0.1 dev eth0 src 172.28.10.118 uid 1000

What's 172.28.0.1? Not sure, let's check the interfaces from the windows side:

j@JUICE C:\Users\j>ipconfig
...
Ethernet adapter vEthernet (WSL (Hyper-V firewall)):

   Connection-specific DNS Suffix  . :
   Link-local IPv6 Address . . . . . : fe80::95d3:f237:f4f:6383%51
   IPv4 Address. . . . . . . . . . . : 172.28.0.1
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   Default Gateway . . . . . . . . . :

Bingo: Hyper-V is the hypervisor Windows uses to run WSL. There are more layers to this, but for now let's just see if it's dropping any packets:

j@JUICE C:\Users\j>pktmon start --capture --comp all --type drop

# Ping from mac

j@JUICE C:\Users\j>pktmon stop
Flushing logs...
Merging metadata...

j@JUICE C:\Users\j>pktmon counters

vSwitch: WSL (Hyper-V firewall)
 Id Name                       Counter  Direction     Packets          Bytes | Direction     Packets          Bytes
 -- ----                       -------  ---------     -------          ----- | ---------     -------          -----
 73 TCPIP                      Drops    Rx                 10          6,640 | Tx                  0              0

Woo! A smoking gun: the Hyper-V NAT is dropping packets. At first I thought this was wrong because of the TCPIP label, but it turns out that's just the name of the Windows networking driver and, confusingly, has nothing to do with TCP specifically-- it handles all protocols.

We can grab the packet logs from pktmon:

PS C:\Users\j> Get-Content drops.txt | Select-String '172.28' | Select-Object -First 20

Drop: ip: 172.28.10.118.41641 > 192.168.4.21.60582: UDP, bad length 1280 > 1248

UDP bad length! Encyclopedia recognizes this one-- it's when the length field in the header disagrees with the size of the payload that actually arrived. The packet handler drops it as plausibly corrupted. Are we cutting the packet off somewhere in transit between eth0 and the Hyper-V NAT? Let's look at the raw packets:

I'm hacking the mainframe

Okay, there's a lot going on here, but I've highlighted what I found interesting from top to bottom visually:

  1. The length of the 2 fragmented packets match what we saw above in tshark
  2. At the IP layer, the packet has metadata about the fragments, including the offset. There's just 2, and this is the first one and so it's 0.
  3. In the IP header, there's a protocol header which asserts that this is a UDP packet
  4. Wireshark is letting us know that it reassembled the fragments, I just have to go to click the next packet to see the whole thing
  5. Wireshark can't, or won't, parse the rest of the data in this view, even though we know it's a UDP payload, because it knows it would just see a fragment of it anyways

Let's see the reassembled fragments!

Packets, assemble

Wow, Wireshark is awesome. It correctly parses the rejoined packet as containing a UDP datagram, and we can actually see transport-level metadata like the length now-- 05 08 (1288) in the hex view. But pktmon earlier showed 1280 > 1248 → bad length, not 1288 and 1256. But wait-- the pairs are the same distance apart, each shifted by exactly 8-- the number of bytes in a UDP header. pktmon must have stripped off the header and compared just the payload sizes.
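Reconciling the two reports, using the fragment sizes from the eth0 capture (and my assumption that pktmon compares UDP payload sizes after stripping the 8-byte header):

```python
UDP_HDR = 8
udp_len_field = 0x0508             # 1288, from the reassembled hex view
frag1_data = 1256                  # IP payload of the first 1290B fragment

claimed = udp_len_field - UDP_HDR  # 1280: payload the UDP header promises
actual = frag1_data - UDP_HDR      # 1248: payload the first fragment delivers
assert (claimed, actual) == (1280, 1248)   # pktmon's "bad length 1280 > 1248"
```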

All the signs point to the NAT trying to parse the first fragment in isolation, just like pktmon did, but observing an invariant violation due to the differences in asserted and actual UDP payload sizes, and deeming the packet as corrupt and just dropping it. But I have so many questions now:

  1. Why doesn't Hyper-V NAT just collect all the fragments and reassemble? It knows it's a fragment, it's right there in the IP header, which it can read!
  2. Why does it even need to enforce this length invariant? It should just do what it's told and pass along the packets!
  3. So this means this isn't even related to tailscale at all-- if we send a UDP packet from WSL that's bigger than the MTU, the WSL kernel will also split it up and we'll hit the same issue?
  4. HOW MANY MORE ONION LAYERS MUST I PEEL UNTIL I SEE THE ANSWER MY EYES ARE SO WATERY I AM CRYING I

Let's validate #3 really quickly:

# WSL: Send 'A' * 1252 to macmini via UDP, which is our max size as per the MTU
➜ python3 -c "import socket; s=socket.socket(socket.AF_INET, socket.SOCK_DGRAM); s.sendto(b'A'*1252, ('192.168.4.21', 42069))"

# WSL: Send 'B' * 1253 to macmini via UDP, which is bigger than the MTU
➜ python3 -c "import socket; s=socket.socket(socket.AF_INET, socket.SOCK_DGRAM); s.sendto(b'B'*1253, ('192.168.4.21', 42069))"

# WSL: Send 'C' * 1252 to macmini via UDP, back to a legal size
➜ python3 -c "import socket; s=socket.socket(socket.AF_INET, socket.SOCK_DGRAM); s.sendto(b'C'*1252, ('192.168.4.21', 42069))"

On the mac side:

nc -u -l 42069
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

❯ nc -u -l 42069
# No "B" received, just waited until C arrived.
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

Another affirmation that we're absolutely on the right track-- the problem is that the Hyper-V NAT just cannot handle fragmented packets. Any packet from the WSL side bigger than eth0's MTU gets broken up by the kernel. That means I should be able to fix this, in theory, by setting the eth0 MTU high enough that packets from tailscale0 never fragment.

$$ \begin{aligned} \texttt{eth0\_mtu} &\geq \texttt{tailscale0\_mtu} + \texttt{wireguard\_hdr} + \texttt{udp\_hdr} + \texttt{ip\_hdr} \\ &= \texttt{tailscale0\_mtu} + 32 + 8 + 20 \\ &= 1280 + 60 \\ &= 1340 \end{aligned} $$

Conversely, you can also pin eth0_mtu and solve for tailscale0_mtu, which would get you 1220. Let's test both!

ip -j link show | jq '.[] | select(.ifname == "eth0" or .ifname == "tailscale0") | {name: .ifname, mtu: .mtu}'
{
  "name": "eth0",
  "mtu": 1280
}
{
  "name": "tailscale0",
  "mtu": 1280
}

sudo ip link set eth0 mtu 1340

On the macmini

ping -c 5 -W 3000 -s 1220 wsl
PING wsl.tailXXXXXXX.ts.net (100.115.172.35): 1220 data bytes
1228 bytes from 100.115.172.35: icmp_seq=0 ttl=64 time=118.578 ms
1228 bytes from 100.115.172.35: icmp_seq=1 ttl=64 time=8.221 ms

Nice. Let's set it back to make sure the cause and effect is isolated:

sudo ip link set eth0 mtu 1280

Then from macmini:

ping -c 5 -W 3000 -s 1220 wsl
PING wsl.tailXXXXXXX.ts.net (100.115.172.35): 1220 data bytes
1228 bytes from 100.115.172.35: icmp_seq=0 ttl=64 time=58.571 ms
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2

Great, now change the tailscale0 mtu instead:

sudo ip link set tailscale0 mtu 1220

from the macmini

ping -c 5 -W 3000 -s 1220 wsl
PING wsl.tailXXXXXXX.ts.net (100.115.172.35): 1220 data bytes
1228 bytes from 100.115.172.35: icmp_seq=0 ttl=64 time=10.608 ms
1228 bytes from 100.115.172.35: icmp_seq=1 ttl=64 time=8.182 ms
1228 bytes from 100.115.172.35: icmp_seq=2 ttl=64 time=77.913 ms

Woo! Now, to loop it all the way back, I should be able to load that website, right? Right???

lfg

My problem is solved, and I could've just moved on with my life here. But all those questions I enumerated above were nagging at me. I can't let this go. I must know how this all works to a satisfactory level of abstraction and detail.

Unfortunately, at this point my biggest nightmare has come true: I must learn more about NAT, Hyper-V, and packet lifecycles to progress any further in this investigation.

Post Doc

You might ask: why didn't you just look this up? Surely other people have run into this issue? And to those rash, insolent readers I say:

yeah you're right. I found a lot of the answers to my questions almost immediately. But haven't you read The Hobbit? The Odyssey? Moby Dick? Don Quixote?? The Princess Bride??? It's about the journey, not the destination. Anyways, here's the destination:

We all remember this journey

They knew

I did the best I could to distill as much of the information I learned, but this is all relatively new to me. If something's wrong or incorrect please send me an email via the page footer. This should be apparent by now, but I'm not a network engineer! This is just my personal understanding and I'm happy to be proven wrong and learn.

This is, as many would have figured by now, a not-so-new and well-known issue with running tailscale on WSL2[4]. Tailscale fixed it by detecting if it's running on WSL and bumping the MTU of eth0 to 1360 if so[5]. I'm not sure why they chose 20 extra bytes; maybe they wanted to future-proof it. We saw the raw packet headers and tested the less conservative number ourselves, and it worked. Maybe it's for IPv6, whose header uses 20 more bytes? Or my mental model is completely wrong and there's conditional overhead we just didn't hit during testing.
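Working the arithmetic with standard header sizes actually supports the IPv6 guess (a back-of-envelope sketch using textbook header sizes, not Tailscale's stated reasoning):

```python
# Per-layer overhead of a WireGuard data packet, in bytes.
IPV4_HEADER = 20
IPV6_HEADER = 40
UDP_HEADER = 8
# WireGuard data message: type/reserved (4) + receiver index (4)
# + nonce counter (8) + Poly1305 auth tag (16).
WIREGUARD_DATA = 4 + 4 + 8 + 16  # 32

INNER_MTU = 1280  # tailscale0's MTU: the largest inner packet

# Outer MTU eth0 needs so a full inner packet fits without fragmenting:
print(INNER_MTU + IPV4_HEADER + UDP_HEADER + WIREGUARD_DATA)  # 1340
print(INNER_MTU + IPV6_HEADER + UDP_HEADER + WIREGUARD_DATA)  # 1360
```

1340 is exactly the IPv4-only number we tested; 1360 would be the IPv6-safe choice.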

MSS Clamping

In addition, they also employ something called MSS clamping[6]. MSS stands for Maximum Segment Size, the maximum data payload of a TCP segment, negotiated between hosts during the SYN/SYN-ACK handshake. But wait, doesn't Tailscale use UDP? Well yes, but only on the outermost layer. Obviously you can curl and browser-fetch over a tailnet, and both use TCP, which we also tested above. The TCP packets are wrapped in encrypted WireGuard packets, and those WireGuard packets are transmitted via UDP. They don't wrap TCP in TCP because that leads to my favorite Led Zeppelin song: a Communication Breakdown[7].

So clamping here is a technique where a firewall or router intercepts the MSS exchange and artificially lowers the MSS value in both the SYN and SYN-ACK packets. But why not just negotiate the right MSS in the first place, why require an interception? In a centralized, abstraction-less world that would be the right move, it's intuitive: you know the right MSS, you know what's going to happen, why not just be upfront about it?

Remember that the TCP packets are encapsulated inside the WireGuard packet. So tailscale0 advertises MTU - TCP_IPV4_overhead = 1280 - 40 = 1240. From its perspective, that's completely correct; I can't fault it here. However, it doesn't know that its packets will later get wrapped in a bunch of extra WireGuard + UDP overhead. So when it sends a 1240-byte TCP segment, perfectly valid under the negotiated MSS, the wrapped packet ends up oversized, fragmented, and dropped, and the other party is left hanging.

This means if we clamp the MSS with the additional WireGuard overhead in bytes (60, so 1240 - 60 = 1180), the browser should still be able to send TCP data even with the default 1280 MTUs on both interfaces. We can test that out real quick with a curl:

ip -j link show | jq '.[] | select(.ifname == "eth0" or .ifname == "tailscale0") | {name: .ifname, mtu: .mtu}'
{
  "name": "eth0",
  "mtu": 1280
}
{
  "name": "tailscale0",
  "mtu": 1280
}

sudo iptables -t mangle -A FORWARD -o tailscale0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1180
sudo iptables -t mangle -A OUTPUT -o tailscale0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1180
curl -o /dev/null -w '%{size_download} bytes in %{time_total}s' http://wsl:9090/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  7551    0     0    0     0      0      0 --:--:--  0:00:17 --:--:--     0

Hmm, nope. This didn't work. And I have no idea why. And honestly, I'm too tired in this journey to find out. Maybe I will claude it later.

Anyways, I don't get why we would do this when it won't work for non-TCP packets. I bet a ping also fails:

ping -c 5 -W 3000 -s 1220 wsl
PING wsl.tailXXXXXXX.ts.net (100.115.172.35): 1220 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1

Now granted, I probably use 99% TCP. And there's probably some giga-brain intelligent designer reason to use MSS clamping instead of setting MTU, assuming there's some way to make it work on WSL. But that reason is not apparent to me.

You are conntrack-ually obliged to not drop my packets

I still really wanted to know why Hyper-V spazzed out at seeing fragmented packets, and why it needs to peer into the UDP layer in the first place.

Turns out, my intuition was right. We really should be reassembling the packets, and that's exactly what Linux's conntrack kernel feature does[8].
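Conceptually, reassembly is just collecting fragments and splicing their payloads back together at the right offsets. A toy sketch (field names are mine; real conntrack also handles timeouts, overlapping fragments, and per-connection state):

```python
# Toy IP fragment reassembly: the job conntrack does before netfilter
# rules (and a NAT) ever see the packet.
def reassemble(fragments):
    """fragments: dicts with 'offset', 'more_fragments', 'payload'.
    Returns the full payload, or None if fragments are still missing."""
    frags = sorted(fragments, key=lambda f: f["offset"])
    assembled = b""
    for f in frags:
        if f["offset"] != len(assembled):
            return None  # hole: wait for more fragments (or time out)
        assembled += f["payload"]
    if frags[-1]["more_fragments"]:
        return None  # the final fragment hasn't arrived yet
    return assembled
```

Once reassembled, the NAT would see a complete UDP header again and could translate normally; skipping this step is exactly how a header-less second fragment ends up inscrutable.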

There's no official documentation for the specs of WSL, which is very typical Microsoft. Instead, we can see that people are posting issues on the WSL github repo about this exact limitation[9], in direct contradiction to the behavior of Linux's fragmentation handling. We'll never know why they do this.

I know fragmentation handling comes with its own set of problems, such as collection timeout. But if you choose to eschew reassembly for a no-frag policy as an explicit design tradeoff, you should make that the root of the error-- not the fact that the UDP headers are wrong. The Hyper-V NAT is able to see the IP headers and know it's fragmented, and before even peering further into the packets, it should be dropped for that reason. It would have been way more straightforward if that were the case. Alas, we'll never know, as the NAT internals are closed source.

And the reason it needs to peer inside the UDP layer in the first place? This one is so obvious that it's a testament to my mental state above. It's a NAT, duh! It needs to Network Address Translate-- swap the packet's source fields from the VM's internal address to the host machine's actual address so that the receiver knows where to send replies (since the original virtual-subnet IP is meaningless to the receiver).
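The translation itself can be sketched in a few lines (a toy model with names of my choosing, not Hyper-V's actual implementation). Note that rewriting the source port requires parsing the UDP header, which is exactly what a header-less second fragment denies the NAT:

```python
# Toy NAT: rewrite the source of outbound packets and remember the
# mapping so replies can be forwarded back to the inner host.
nat_table = {}  # external port -> (inner_ip, inner_port)

def outbound(pkt, public_ip, public_port):
    # Remember who sent this so the reply can find its way back.
    nat_table[public_port] = (pkt["src_ip"], pkt["src_port"])
    # Rewrite the source: the receiver only ever sees the public address.
    return {**pkt, "src_ip": public_ip, "src_port": public_port}

def inbound(pkt):
    # Look up which inner host this reply belongs to and rewrite the
    # destination back to the original address.
    inner_ip, inner_port = nat_table[pkt["dst_port"]]
    return {**pkt, "dst_ip": inner_ip, "dst_port": inner_port}
```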

Send Help

Apparently there is some form of feedback for situations like this at the TCP level. When a router receives a packet with DF=1 set (which is true for almost all modern uses of TCP) that is bigger than the next hop's MTU, it sends an ICMP message back to the source[10] with the lower MTU needed. The source should then readjust its MTU to the requested amount and try again. Without this, the source just keeps resending packets because it never receives any kind of response.

  Client                           Router                         Server
    │                                 │                              │
    │── 1500-byte packet, DF=1 ──►    │         MTU 1280             │
    │                                 │◄─X too big, can't fragment   │
    │◄── ICMP Type 3 Code 4 ─────     │                              │
    │    "Next-Hop MTU: 1280"         │                              │
    │                                 │                              │
    │  TCP updates its path MTU cache:│                              │
    │  "route to Server has MTU 1280" │                              │
    │  MSS = 1280 - 40 = 1240         │                              │
    │                                 │                              │
    │── 1280-byte packet, DF=1 ──►    │ ─ 1280-byte packet ─────────►│
    │                                 │   Woohoo!                    │
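The client-side reaction in the diagram boils down to a per-destination cache update (sketched with hypothetical names):

```python
pmtu_cache = {}  # destination -> discovered path MTU

def handle_frag_needed(dst, next_hop_mtu):
    """React to ICMP Type 3 Code 4: remember the smaller path MTU
    for this route and derive the new TCP MSS from it."""
    pmtu_cache[dst] = next_hop_mtu
    # MSS = path MTU minus IPv4 header (20) and TCP header (20).
    return next_hop_mtu - 40
```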

TCP caches this per-destination. You can see the cache on Linux, if the message was received and handled:

ip route get 192.168.4.21
192.168.4.21 via 172.28.0.1 dev eth0 src 172.28.10.118 uid 1000
    cache

... but it wasn't. There's no cached path MTU here. As you might suspect, Hyper-V shenanigans prevent this from happening, but to their credit there's more at play here.

  ┌─────────────────────────────────────┐
  │ Outer IP header (WireGuard)         │
  │   DF = 0  ← "fragmentation is OK"   │
  │ ┌─────────────────────────────────┐ │
  │ │ Outer UDP header                │ │
  │ │ ┌─────────────────────────────┐ │ │
  │ │ │ WireGuard encryption        │ │ │
  │ │ │ ┌───────────────────────┐   │ │ │
  │ │ │ │ Inner IP header       │   │ │ │
  │ │ │ │   DF = 1  ← encrypted │   │ │ │
  │ │ │ │ Inner TCP segment     │   │ │ │
  │ │ │ └───────────────────────┘   │ │ │
  │ │ └─────────────────────────────┘ │ │
  │ └─────────────────────────────────┘ │
  └─────────────────────────────────────┘
  1. Hyper-V NAT silently drops fragments without sending any ICMP error back (and since the outer packet has DF=0, the spec doesn't even require ICMP Frag Needed here-- the kernel fragmented correctly)
  2. Even if the WSL kernel received this message, the eth0 interface is operating at the outside-of-WireGuard UDP layer of abstraction. The mechanism to handle this would be at the TCP layer.
  3. Even if there were logic to transpose the error and its fix into the inner TCP layer, the eth0 layer can't read or write to the inner TCP packet at the tailscale0 layer because it's encrypted by WireGuard.

It's a deliberate design choice on WireGuard's part to set DF=0, since it knows it's going to be delivered via UDP and it's better than getting into that error state above.

Ironically, this reasonable choice is exactly what triggers the Hyper-V bug. If WireGuard had set DF=1, the kernel would have refused to send the oversized packet out eth0 and returned an error locally-- which WireGuard could have handled.
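The kernel's dilemma at eth0 hinges entirely on that DF bit. A simplified sketch of the IPv4 send path:

```python
def send(packet_len, df, link_mtu):
    """What happens to an IPv4 packet at an interface (simplified)."""
    if packet_len <= link_mtu:
        return "sent whole"
    if df:
        # DF=1: refuse and surface an error to the sender locally
        # (or via ICMP Frag Needed if we're a router).
        return "error: fragmentation needed"
    # DF=0: split into link-MTU-sized fragments. Legal per the spec,
    # but exactly what Hyper-V's NAT silently drops.
    return "fragmented"
```

With WireGuard's DF=0, the oversized outer packet takes the "fragmented" branch and sails into the NAT; with DF=1 it would have failed loudly right there.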

ReCursed Center and the Unexpected Edge Case

Somehow, this circus gets even funnier. When I was writing up this blogpost at the Recurse Center on my macbook-air (so not my macmini at home), I was playing around with the MTU solution when I encountered a weird, unexpected outcome: with MTU set back to 1280 for both interfaces, I was able to load http://wsl:3033 on my browser! I was so confused, but Empiricist experimented with a bunch of scenarios:

  1. macbook-air curl to wsl also works
  2. macbook-air ping of size 4000 also works
  3. macmini still hangs at all 3 above
  4. Setting either of the MTUs to the correct level fixes the issue on macmini as expected, and has no effect on macbook-air.

Huh? What's going on? Does my macbook-air have a special setting? I tried on my iPhone and I got the same results as the macbook-air. I went home later and while trying to replicate it, suddenly all my devices had the same issue as the macmini! Is it the wifi??

I went back to Recurse Center the next day, SSH'd into the WSL, and ran the tailscale ping command to both devices:

➜ tailscale ping macbook-air
pong from macbook-air (100.127.62.73) via DERP(nyc) in 19ms
pong from macbook-air (100.127.62.73) via DERP(nyc) in 25ms
pong from macbook-air (100.127.62.73) via DERP(nyc) in 13ms
pong from macbook-air (100.127.62.73) via DERP(nyc) in 19ms
pong from macbook-air (100.127.62.73) via DERP(nyc) in 21ms
pong from macbook-air (100.127.62.73) via DERP(nyc) in 37ms
pong from macbook-air (100.127.62.73) via DERP(nyc) in 18ms
pong from macbook-air (100.127.62.73) via DERP(nyc) in 65ms
pong from macbook-air (100.127.62.73) via DERP(nyc) in 20ms
pong from macbook-air (100.127.62.73) via DERP(nyc) in 24ms
direct connection not established

➜ tailscale ping macmini
pong from macmini (100.92.8.62) via DERP(nyc) in 42ms
pong from macmini (100.92.8.62) via 192.168.4.21:60582 in 191ms

Okay the macmini one makes sense to me, we've seen it above during our investigation-- the shortest path possible is via LAN and that's what it takes. But wsl can't establish a direct path to macbook-air after 10 tries, and it has to use the DERP relay[11], which is the fallback relay system used by Tailscale to securely route traffic between devices when direct peer-to-peer connections (NAT traversal) fail. But why can't we establish that direct connection?

Tailscale writes amazing, concise documentation and I was able to find the reason instantly: it's due to a property called MappingVariesByDestIP[12]. If it's set to false, we can connect directly. If it's set to true, we cannot, and must use DERP. We can check it like this:

# macmini
tailscale netcheck 2>&1 | rg MappingVariesByDestIP
        * MappingVariesByDestIP: false

# macbook-air
tailscale netcheck 2>&1 | rg MappingVariesByDestIP
        * MappingVariesByDestIP: true

Ding! But what is it? And why are the values set differently? The key is in the name, which is pretty great! This flag captures whether the router reuses the same external port for different destination hosts when you send UDP packets from behind its NAT. If false, then every receiver sees our device as, say, 73.x.x.x:34567 in the packet's source fields, no matter who the receiver is. This is what allows Tailscale's hole punching to work in the first place: the coordination server tells two peers in the tailnet about each other using that source_ip:port identifier, they each send a request to the other, and it always gets to the intended destination.

If the flag is true, then the router uses a different port for each destination server. So the coordination server sees 73.x.x.x:34567, tells wsl about it, and when wsl tries to send a request to macbook-air, the router drops the request because it is not the destination address associated with this port. Since this won't work, you need a DERP server to facilitate and relay communication between the two nodes. This adds latency, and even though it's encrypted, it's always better to just skip the middleman.

But why are the values different between my home router and Recurse's router? It just is! It's not some value emitted by the router or anything; it's a custom Tailscale determination, made by empirically testing how the router implements NAT and whether hole punching works.
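The two NAT behaviors can be modeled as one toy class (hypothetical names; real NATs are messier than a counter):

```python
import itertools

class ToyNat:
    """Endpoint-independent NAT reuses one external port per inner
    socket; endpoint-dependent NAT mints a new port per destination."""
    def __init__(self, varies_by_dest_ip):
        self.varies = varies_by_dest_ip
        self.ports = itertools.count(34567)
        self.mapping = {}

    def external_port(self, inner_socket, dest_ip):
        # Key the mapping by destination too if the NAT varies by dest IP.
        key = (inner_socket, dest_ip) if self.varies else inner_socket
        if key not in self.mapping:
            self.mapping[key] = next(self.ports)
        return self.mapping[key]
```

With varies_by_dest_ip=False, the port the coordination server learned is the same one a peer can reach, so hole punching works. With True, the peer's packets arrive at a port the NAT never associated with them and get dropped, so DERP it is.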

So obviously, the next step is to grab the Recurse router and flash a rewrite to the firmware to allow for holepunching:

⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⣴⠶⠚⠛⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⡴⠟⠉⢻⠀⠀⠀⠀⢻⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⡾⠋⠀⠀⠀⢸⡇⠀⠀⠀⢸⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣴⠟⡇⠀⠀⠀⠀⢸⡇⠀⠀⠀⢸⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣾⠃⠀⣿⠀⠀⠀⠀⢸⡇⠀⠀⠀⢸⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣾⠃⠀⠀⢸⠀⠀⠀⠀⢸⡇⠀⠀⠀⢸⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣼⠃⠀⠀⠀⢸⠀⠀⠀⠀⢸⡇⠀⠀⠀⢸⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⣿⠀⠀⠀⠀⢸⠀⠀⠀⠀⢸⠀⠀⠀⠀⢸⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⡟⠀⠀⠀⠀⢸⠀⠀⠀⠀⣸⠀⠀⠀⠀⣾⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣇⠀⠀⠀⠀⢸⠀⠀⠀⠀⣿⠀⠀⠀⠀⡿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠸⣿⠀⠀⠀⠀⢸⠀⠀⠀⠀⡇⠀⠀⠀⢰⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⣸⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢹⡆⠀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠀⢀⡟⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣼⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣿⡀⠀⠀⠀⠀⠀⠀⠀⠒⠲⠶⢯⣄⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣤⣤⠶⠶⠶⣤⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡴⠚⠉⠸⣇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠻⢦⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣤⡶⠛⠉⠁⠀⠀⠀⠀⠈⣧
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⡾⠋⠀⠀⠀⠀⢹⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⢀⣴⠞⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿
⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⡾⠋⠀⠀⠀⠀⠀⠀⠈⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣆⠀⠀⠀⢀⣠⠞⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣿
⠀⠀⠀⠀⠀⠀⠀⠀⢠⡟⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠙⣧⢀⡴⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣤⠶⠞⠋⢩⡇
⠀⠀⠀⠀⠀⠀⠀⢠⡟⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡀⠀⠀⠀⠀⠀⠀⢀⣠⣾⣿⣿⣏⣠⣤⣶⣿⣿⣿⣿⣀⣤⠴⠞⠋⠉⠀⠀⠀⠀⣾⠁
⠀⢠⣤⣤⣄⣀⠀⣼⣿⣄⠀⠀⠀⠀⢠⡏⢳⠀⠀⠀⠀⠀⠀⠀⠀⠀⢰⡏⢹⡄⠀⠀⠀⠀⣠⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡟⠉⠀⠀⠀⠀⠀⠀⠀⠀⢰⡏⠀
⠀⢸⣿⣿⣿⣿⣿⣿⣿⣿⣆⠀⠀⠀⢸⣧⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣷⣾⡇⠀⠀⠀⣰⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡟⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡟⠀⠀
⠀⠈⣿⣿⣿⣿⣿⣿⣿⣿⣿⡆⠀⠀⠸⣿⡿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠸⣿⣿⠃⠀⠀⢰⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣤⠀⠀⠀⠀⠀⠀⢀⡾⠁⠀⠀
⠀⢀⣼⣿⣿⣿⣿⣿⣿⣿⣿⣿⡶⢤⣤⣌⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣈⣡⣤⣤⣤⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⠀⠀⠀⠀⠀⣼⠃⠀⠀⠀
⢠⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⠀⠀⡼⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠀⠀⠀⠀⠀⢸⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠿⠛⠁⠀⠀⠀⠀⣼⠃⠀⠀⠀⠀
⠻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣇⣸⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣼⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⢻⡷⠤⣤⣤⣀⣀⣀⣀⡼⠃⠀⠀⠀⠀⠀
⠀⠀⠈⠉⢻⣿⣿⣿⣿⣿⣿⢿⣿⣿⠛⢦⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⠴⠋⠘⣿⣿⣿⣿⠿⢿⣿⣿⣿⣿⣿⣿⠀⢷⡀⠀⠀⠀⠀⢀⡿⠁⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠈⣿⣿⣿⣿⠟⠁⠀⠙⢻⣤⣄⣈⠙⠓⠶⠦⠤⠤⠤⠴⠶⠚⠉⠀⠀⠀⣀⣼⠿⠛⠁⠀⠀⠉⣿⣿⣿⣿⣿⠀⠈⣧⠀⠀⠀⣰⠟⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠘⠿⠛⠁⠀⠀⠀⢠⠏⠀⠈⠉⠛⠳⢶⣶⠦⢤⣤⣤⡤⠴⠶⠶⣿⠋⠉⠀⠀⠀⠀⠀⠀⠀⣿⠀⠉⠛⠋⠀⠀⠸⡆⢠⡾⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣰⠏⠀⠀⠀⠀⠀⠀⠀⢈⡿⠶⣤⣀⠀⠀⠀⠀⢸⡄⠀⠀⠀⠀⠀⠀⠀⠀⢸⡇⠀⠀⠀⠀⠀⠀⢿⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢰⠏⠀⠀⠀⠀⠀⠀⠀⣰⡟⠁⠀⢸⡏⠙⠛⠲⠶⠾⣧⠀⠀⠀⠀⠀⠀⠀⠀⢸⡇⠀⠀⠀⠀⠀⠀⢸⡆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⠏⠀⠀⠀⠀⠀⠀⣠⡾⠋⠀⠀⠀⢸⠃⠀⠀⠀⠀⠀⣸⣆⠀⠀⠀⠀⠀⠀⠀⢸⣧⡀⠀⠀⠀⠀⠀⠀⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⢠⡶⣫⠀⠀⠀⠀⠀⣠⡾⠋⠀⠀⠀⠀⠀⡼⣀⢀⡀⠀⠀⣰⠏⠹⣆⠀⠀⠀⠀⠀⠀⢸⠀⠙⠳⣄⡀⠀⣀⠀⡘⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠹⠾⢧⣧⣤⠤⠶⠛⠁⠀⠀⠀⠀⠀⠀⠀⠛⠿⠿⠤⠶⠞⠁⠀⠀⠹⣆⠀⢠⡀⢀⠀⢸⠀⠀⠀⠀⠙⠶⠾⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠹⢤⣄⣷⣬⠷⠚⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀

Just kidding! Fooled you. That would be weird as hell.

Anyways, why does using the DERP relay fix our MTU issue? Here's the zinger: connections through DERP use TCP, not UDP. Based on our learnings above, it's pretty clear why this helps: TCP, unlike mr. fire-and-forget UDP, actually knows how to handle segment sizing and MTU on every hop. We never get to the point where the Hyper-V NAT sees a fragmented packet, because the kernel properly sizes the segments for eth0 before cramming them into the TCP socket.

Learnings

There's a lot here, but the most prominent for me is:

  1. Hyper-V NAT sucks
  2. I should google my problems sooner. But I was indulging in a journey!
  3. I feel a lot more comfortable a few layers deeper in the network stack than when I started.
  4. So many things! DERP, NAT, Hyper-V, conntrack, Tailscale, network interfaces, Wireshark, and so much more.

Reach out to me at [email protected] if you have any suggestions, comments, or feedback.


  1. "Disco Elysium Skill: Logic", Disco Elysium Wiki. The rational, hypothesis-driven skill that excels at deduction and controlled experimentation. ↩︎

  2. "Disco Elysium Skill: Inland Empire", Disco Elysium Wiki. The gut-feeling, intuition-driven skill that hungers to understand the deeper "why" behind everything. ↩︎

  3. "Disco Elysium Skill: Encyclopedia", Disco Elysium Wiki. The trivia-recall skill that surfaces potentially relevant (or irrelevant) facts from memory. ↩︎

  4. "Install Tailscale on Windows with WSL 2", Tailscale Docs. Official documentation covering WSL2-specific considerations and known limitations. ↩︎

  5. "version/distro,wgengine/router: raise WSL eth0 MTU when too low #7441", tailscale/tailscale. The PR that detects WSL and bumps the eth0 MTU to 1360 to avoid the packet size mismatch. ↩︎

  6. "Troubleshooting", Tailscale Docs. Tailscale's troubleshooting guide, which recommends MSS clamping to fix TCP connections that hang on large payloads. ↩︎

  7. "Why TCP Over TCP Is A Bad Idea", Olaf Titz (archived). Classic write-up explaining how nesting TCP inside TCP causes retransmission feedback loops and catastrophic throughput collapse. ↩︎

  8. "Netfilter: Packet defragmentation", Wikipedia. How Linux's conntrack module reassembles fragmented IP packets before they reach the firewall rules. ↩︎

  9. "Can't send or receive fragmented UDP packets #6082", microsoft/WSL. Community-reported issue confirming that WSL2's virtual NIC silently drops fragmented UDP packets. ↩︎

  10. "RFC 792: ICMP Type 3, Code 4 — Destination Unreachable: Fragmentation Needed", IETF. The ICMP message type that tells the sender to reduce its packet size when a router can't forward it due to MTU limits. ↩︎

  11. "How Tailscale works: Encrypted TCP Relays (DERP)", Tailscale Blog. Explanation of Tailscale's fallback relay system used when direct peer-to-peer NAT traversal fails. ↩︎

  12. "Device connectivity: Mapping varies by destination IP address", Tailscale Docs. Why some NAT types prevent direct connections, forcing traffic through DERP relays instead. ↩︎