Broken VPNs, the Year 2038, and certs that expired 100 years ago (theregister.com)
This is a great mystery story, with a satisfying ending. And this
> "I generally start troubleshooting an issue by asking the system what it is doing," explained Zimmie. "Packet captures, poking through logs, and so on. After a few rounds of this, I start hypothesizing a reason, and testing my hypothesis. Basic scientific method stuff. Most are simple to check. When they're wrong, I just move on. As I start narrowing down the possibilities, and the hypotheses are proven, it's electric. Making an educated guess and proving it's right is incredibly satisfying."
is an approach every one of us should internalize.
Binary search (or bisecting) is also an incredibly valuable approach that I don’t see junior and intermediate engineers reach for nearly as often as they should.
When something is failing, find a midpoint between where things are working and where the bug is manifesting. Do you see evidence of the bug there? If so, the problem was introduced earlier, so look earlier in the pipeline. If not, look later. Repeat.
In my experience this process is the primary distinguisher between those who flail around looking for a root cause and the people who can rapidly come to an answer.
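A toy sketch of that bisection loop, with made-up stage names and a made-up `shows_bug` check, just to make the shape concrete:

```python
def first_bad_stage(stages, shows_bug):
    """Binary-search an ordered pipeline for the first stage showing the bug.

    `stages` runs from a known-good start to the point where the bug
    manifests; `shows_bug(stage)` inspects that stage's output (logs,
    packet captures, intermediate data) and returns True if the bug is
    already visible there.
    """
    good, bad = -1, len(stages) - 1   # the bug is known to show at the end
    while bad - good > 1:
        mid = (good + bad) // 2
        if shows_bug(stages[mid]):
            bad = mid                 # bug visible here: it was introduced earlier
        else:
            good = mid                # still looks fine here: look later
    return stages[bad]

# Hypothetical pipeline; the bug first becomes visible in the "transform" stage.
pipeline = ["ingest", "parse", "validate", "transform", "store", "render"]
print(first_bad_stage(pipeline, lambda s: s in ("transform", "store", "render")))
```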
Good call. When you've got no idea where to start, that's how to start.
Mostly, though, I think people "flail" because they don't know the pipeline well enough to even do that. I know I've been in that position before, when approaching completely new (to me) systems. (Sometimes there isn't someone more knowledgeable you can ask!) That's where I find hypothesis -> test -> refine particularly useful. You're still wrong far, far more often than you're right, but it stops feeling like flailing, and more like making progress towards understanding the system well enough to apply other techniques (whatever they might be) more smartly.
`git bisect` is one of those things I wish I'd internalized sooner in my career. It can be so incredibly powerful, especially when you just hand it a shell script (`git bisect run`) and let it rip without having to guide it by hand.
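For anyone who hasn't tried `git bisect run`: the command you hand it just needs to exit 0 for a good revision, non-zero for a bad one, and 125 to skip a revision that can't be tested. A minimal sketch of such a test script; the build command and the repro script are placeholders for whatever your project actually uses:

```python
#!/usr/bin/env python3
# Usage (revisions are hypothetical):
#   git bisect start
#   git bisect bad HEAD
#   git bisect good v1.2.0
#   git bisect run python3 bisect_test.py
import subprocess
import sys

# If the tree doesn't even build at this revision, tell bisect to skip it
# (exit 125) rather than mislabel it as good or bad.
if subprocess.run(["make", "-j4"], capture_output=True).returncode != 0:
    sys.exit(125)

# Run the one check that reproduces the bug; ./run_repro.sh is a placeholder.
repro = subprocess.run(["./run_repro.sh"], capture_output=True)
sys.exit(0 if repro.returncode == 0 else 1)
```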
Once someone understands a complex system well enough to find a good midpoint, are they still a junior engineer?
I use this technique all the time to help people who are stuck with problems using software - often caused by bugs. Divide-and-conquer quickly isolates the issue. I try to share the technique when I use it, or just offer it as a suggestion.
Part of why it's so useful is that you hardly have to understand anything about the system internally. If you don't already have a working case, just reduce the complexity of what you're doing until it works, to find the lower bound.
That random guessing is like gambling - you hope for a big quick payout but when your hypothesis fails, you end up worse off than before. Wasted time and no closer to the solution.
100%
I've wondered why this isn't second nature to engineers, junior or otherwise.
Maybe they don't really understand the pipeline? ("I enter the value in the web form and it just appears in the database.")
I think I kind of internalized that idea from my early soft eng courses; after seeing how efficiently a computer can find a result by cutting the set in half repeatedly, I've tried to apply that approach elsewhere when it fits.
I like that term. I always called this "divide and conquer."
I remember one of the Car Talk guys used the term "binary chop" once when talking to a caller about diagnosing a problem.
> I don’t see junior and intermediate engineers reach for nearly as often as they should
Or senior engineers
That's just how you debug any system.
If you're on this site and haven't already internalized it...how do you debug?
How do you debug if this isn’t what you’re doing? I’m genuinely curious… are you using some sort of advanced tools like a psychopath?
1. Internet search 2. Make random changes 3. Test 4. Start over.
I can't even imagine how that would work with any complicated system.
Not well. It's pretty frustrating to observe its practitioners in the wild.
Ran into a case where a whole datacenter became untethered from its NTP upstream and drifted off into a timezone of its own creation. Customer was failing authentication for a data product we sold them (TSIG was failing). I was on the phone with them for an hour, reassuring them constantly that everything was working for our other customers, tailing logs, and reporting what I saw.
More datacenter stakeholders kept joining the call, most of whom had nothing to do with our data product. Many times I heard people ask "have they found the problem yet" as though.. what? We were the best tech support they had for an entire data center going dark? After an hour somebody noticed that the clocks on servers in the datacenter didn't match up with their laptop; shortly after that I was able to extricate myself from the call... still watching the logs, their downloads started working again a short while later.
> Many times I heard people ask "have they found the problem yet" as though.. what? We were the best tech support they had for an entire data center going dark?
Possible. Some companies are mostly lacking in competent technical people, so anyone who knows what they're doing will quickly find themselves pulled into every possible task; I see no reason why this shouldn't include external parties.
Especially if things enter very strange territory, like NTP running wild.
I remember that some time ago, one of our customers had trouble with our on-prem installation. Eventually it seemed that the database had been corrupted. At that point I could tell the poor guy on the other side was following along with what I was doing with my colleague on a couple of other, similar systems, but I noticed he was getting pretty nervous, so I figured as long as the clock runs, whatever. It's kind of what I do a lot at work, and I don't like leaving people out in the rain like that.
And eventually we could confirm that large numbers of VMDKs had been corrupted in various ways. It seemed another vendor had let the SAN they were managing run full, or into some other catastrophic situation. And their backup appliance also didn't work.
> Possible.
I had dealt with these people for several years. In my company, I changed teams several times.
But I still got emails inviting me to internal meetings: they thought? assumed? hoped? ...that I was an internal consultant.
This was pursuant to a larger project and I was on the call for that one and pointed out that I was no longer on the implementation team but was thrilled to be there... because I was. That final / initial deployment resulted in WTFs and I said "woot!" and dropped the call.
> resulted in WTFs
It was a visibility / cybersecurity product and it pretty much immediately paid for itself. ;-)
I have a very soft spot for this kind of "campfire story". Open Office not printing on Tuesdays comes to mind. Anyone got some more?
I'm partial to the 500 mile email: https://www.ibiblio.org/harris/500milemail.html
A tale I use in interviews is "The Homesick Laptop's Replacement Desktop That Ate Hard Drives in Summer".
Long story short, Dell ship-of-Theseus'd an entire machine looking for an issue that only happened on cloudy-hot days when the disks were under high load. It turned out to be an air conditioner out of phase with the rest of the system, causing EMI that the power supply just let through.
Reminds me of something I had to deal with in my audio system.
I was getting an awful buzzing/static sound in my speakers. I went through my chain one component at a time, unplugging each and seeing whether the noise went away.
As it turns out, my PC's video card was barfing electrical noise through every port...including the ethernet port. Unfortunately I didn't know the difference between unshielded and shielded twisted pair and had used shielded by mistake.
That shielded twisted pair allowed the noise to go out of my GPU, into the motherboard, through the ethernet port, then down to my ethernet switch. From there, the switch connected to the raspberry pi I used for streaming, where it helpfully forwarded that noise straight into the DAC and therefore the rest of the chain.
I tell you, that drove me nuts!
This story comes to mind: https://web.mit.edu/jemorris/humor/500-miles
I briefly collected them on a website called "bedtime stories for engineers". I really wish someone would put in the effort to collect those stories, because they're just so good.
Google "My car is allergic to vanilla ice cream". I can only find rehashes of it, not the actual source, so I didn't link to it, but the basic story is easy to find.
> I suspect the NTP server had a badly faulty internal clock which ran very fast.
A time server with a defective clock seems to be a serious problem. Zimmie says the time server was an appliance; so someone is selling as an appliance a time server that can't tell the time.
Not only could it not tell the time, it also had catastrophic bugs in time handling and doesn't handle y2k38. If that was on my network, the vendor would get yeeted immediately.
I was quite aware that our (different company) time server was based on getting CDMA signal, and - oh, wait - CDMA was retired last year? Luckily, we could open our internet firewall to let it talk to external stratum 1 servers and configure it to be stratum 2, rather than the stratum 1 we bought. Replacements using GPS are in process, but are impeded by weak GPS signal in the data centers. Antennas are to be implemented...
The 2038 thing I get, but the clock drift of BILLIONS of seconds really scares me. What kind of fucked up setup can lead to something like this?
Not nearly as interesting a story: in 1996 I visited a customer who was using us for dialup services, but reported that some of their Windows desktops couldn't connect.
It didn't take me long to figure out that the computers that weren't working had their clocks set well into the 21st century. The shell couldn't even display the year properly; I assumed a Y2K incompatibility, but after so many years I can't remember exactly what I saw.
Anyway, easy fix, but I never did find out what caused such a weird glitch in their environment. It's small wonder that many people aren't fluent with computers: they misbehave in such a wide variety of ways.
Last year I had one (of many) freshly provisioned Linux VMs change its clock to the year 2257 two nights in a row. Never figured that out sadly; reprovisioning "fixed" it.
> Ambassador Kosh's ship arrives at the Epsilon III Jumpgate two days ahead of schedule. Upon leaving his ship the Minbari assassin approaches Kosh in the guise of Jeffrey Sinclair, whom Kosh recognises as Entil'Zha Valen. When he extends a hand in greeting to his 'old friend' the Minbari slaps a skin tab dosed with Florazyne, causing Kosh to collapse and lose consciousness
Oh no.
Why does this NTP implementation accept a sudden change of 4 billion seconds? For example, the NTP implementation in Windows refuses to change the clock by more than 54,000 seconds.
When synchronising Windows with an external source, you can slowly correct the time using: https://learn.microsoft.com/en-us/windows/win32/api/sysinfoa...
Using that API avoids sudden jumps in time. The cost is that if a correction is required then the system time will be incorrect until the difference settles to zero. And you ideally need some PID control so that the system time settles quickly to match the "correct" external time.
For example, you can spread a 1 second adjustment over an hour. Sometimes being up to one second out is less of a problem than a sudden jump of one second.
It is useful to have time monotonically increasing if you have software that depends on time differences (e.g. timestamps stored in logging systems).
Not sure if Microsoft gimped the API after XP - this note seems bad: "Currently, Windows Vista and Windows 7 machines will lose any time adjustments set less than 16." That makes it difficult to use the API to keep the time closely synchronised.
Isn't this what Google dubbed "smearing" in their Spanner paper?
Kinda. Clock slewing is the term used in NTP-land and is basically about making this kind of adjustment for small time differences (generally you want to run a control loop that adjusts the rate of the internal clock so that the time difference goes to zero, as opposed to simply changing the time, as one will result in a smooth time reading while the other will cause periodic jumps). Smearing is basically pretending that the leap second doesn't exist and using clock slewing to paper over the difference.
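A minimal sketch of slewing, assuming a constant maximum slew rate of 500 ppm; this is only an illustration of the idea, not how ntpd or chrony actually adjust the kernel clock:

```python
import time

def slew(offset_s: float, max_rate: float = 0.0005, tick: float = 1.0):
    """Bleed off `offset_s` seconds of clock error without ever stepping.

    max_rate is the fraction of each tick we are willing to stretch or
    shrink the clock by (500 ppm here, a common slew limit).
    """
    remaining = offset_s
    while abs(remaining) > 1e-6:
        # Correct at most max_rate * tick seconds per tick, in either direction.
        step = max(-max_rate * tick, min(max_rate * tick, remaining))
        remaining -= step
        # A real implementation would nudge the kernel clock frequency here;
        # this sketch just reports what it would do.
        print(f"adjust by {step * 1e6:+.0f} us this tick, {remaining:+.6f} s left")
        time.sleep(tick)

# Bleeding off a 1 s error at 500 ppm takes ~2000 ticks (~33 minutes here).
slew(1.0)
```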
AFAIK they were using Linux, and they were still keeping all the servers synchronised (smearing at the same rate).
Linux systemd-timesyncd seems to have some limit, too. When our appliance-like systems get an invalid time from the hardware clock, they boot with a system clock of 2019 IIRC. systemd-timesyncd does not correct that, despite the ntp server working correctly as far as I can tell.
Well, it happens rarely enough and we have a workaround, so the bug report still sits in the queue behind more urgent problems. Haven't read the source yet, which is what you should do on Linux, of course...
This reminded me of this article from last year: https://arstechnica.com/security/2023/08/windows-feature-tha... (HN discussion: https://news.ycombinator.com/item?id=37151220)
I'm not sure I see why it was revoking the certificates; when you renew a certificate that's about to expire, you can just let the old one expire, right?
I'd say that more often than not people building this sort of stuff in-house have no idea what they're doing. So although that part of the design doesn't make much sense it's not astonishing to see it.
A PKI provides a deeply technical solution to a hard problem you probably don't have. This technology is most often deployed when somebody has a different, easy problem, but they don't like the relatively easy non-technical solution.
This can go back to your old buddy NTP, specifically DHCP handing out NTP servers on untrusted networks. If you control the network (and therefore the time?) and you manage to get the full expired certificate, you may be able to MITM the victim successfully. If you force the CRL check first then things won't match up. I have no idea about the feasibility of faking the CRL though, so it might be a wash.
Seems like it’d be fairly difficult in practice to change time on a host such that you can use an expired certificate without breaking a bunch of other stuff
The solution to the year 2038 problem is to upgrade your time-since-the-Unix-epoch fields to 64-bit integers. Hopefully this won't be an actual issue 14 years from now, because it's such a simple fix.
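The rollover itself is easy to demonstrate. A quick sketch (Python standing in for a C program that keeps time in a signed 32-bit `time_t`):

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2**31 - 1  # the last second a signed 32-bit time_t can represent

print(EPOCH + timedelta(seconds=MAX_INT32))   # 2038-01-19 03:14:07+00:00

# One second later a signed 32-bit counter wraps to its most negative value,
# which lands back in 1901.
wrapped, = struct.unpack("<i", struct.pack("<I", (MAX_INT32 + 1) & 0xFFFFFFFF))
print(wrapped)                                # -2147483648
print(EPOCH + timedelta(seconds=wrapped))     # 1901-12-13 20:45:52+00:00
```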
About a decade ago I was involved in the development of an embedded product for industrial use cases. The kind of stuff you install once and use for 20-40 years. The library we used for displaying the time breaks around 2036 (so a bit ahead of y2k38). But the person responsible would long be in retirement by then and the issue doesn't impact critical functions, so it was decided not to do anything about it. This version of the product is still sold today. I doubt this story is uncommon.
It's just like Y2K, and just as pervasive, but harder to explain to upper management. My guess is that it won't go smoothly.
There's a perception that y2k was overblown because we spent tons of money and didn't have the problems that were suggested in the media
Of course, the fact that the problems were overhyped but, importantly, FIXED by all that money doesn't come into it; it turned into a cry-wolf situation.
The media wasn't telling us that the problems were getting fixed though. It kept on hyping the doomsday in whatever way it could. We were supposed to wake up in 2000 to find our fridge door open and melted ice all over the floor. Why? Fridges didn't have clocks that told them to turn themselves off, or clocks at all.
That anybody believed the media's claim that refrigerators had computers in them in the 1990s seems like a case of the Gell-Mann Amnesia effect[0]. Smart refrigerators didn't even exist until LG released one in June 2000 that was a commercial failure[1].
[0]: https://en.wikipedia.org/wiki/Michael_Crichton#GellMannAmnes... [1]: https://en.wikipedia.org/wiki/Smart_refrigerator#History
I remember the news telling us that all sorts of unexpected devices had computers in them these days and computer=Y2K bug risk. Microwaves certainly did so fridges would be an easy leap to make.
Oh NTP... I remember a series of extremely annoying incidents that were caused by time skew on hundreds of Linux VMs in our data center. Our setup was typical of a startup - built to be good enough at first, and fall apart at scale.
Every VM ran CentOS, and every one of them hit the default CentOS ntp servers. These are run by volunteers. The pool is generally good quality but using it the way we did was extremely stupid.
Every few weeks we'd have one of these "events" where hundreds of VMs in a data center would skew, causing havoc with authentication, replication, and clustering. We also had an alert that would notify the machine owner if drift exceeded some value. If that happened in the middle of the night, the oncall from every single team would get woken. And if they simply "acked" the alert and went back to sleep, the drift would continue, and by morning their service would almost certainly be suffering.
Whatever about diagnosing the cause, I started by writing a script that executed a time fix against a chosen internal server, just to resolve the immediate issue. I also converted the spam alert into one that Sensu (the monitoring/alerting system we used) would aggregate into a single alert to the fleet ops team. In other words, if >2% of machines were skewed by more than a few seconds, warn us. At >4%, go critical. (Only critical alerts would page the oncall outside sociable hours.)
Long story short, we switched to chrony because, unlike ntpd, we could convince it to "just fix the damn time": ntpd would refuse to correct the time if the jump was too big, and would just drift off forever until manually fixed. (No amount of config hacking and reading 'man ntpd' got around this.) We also chose two bare-metal servers in each data center to act as internal NTP servers, reducing the possibility of DoSing those volunteer NTP servers and getting our IP range blacklisted or fed dud data. Problem solved right there, and we also ended up with better monitoring of time skew across our fleet.
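A toy version of the aggregation rule described a couple of paragraphs up; the thresholds match the ones above, but the function name and data shape are made up for illustration:

```python
def fleet_skew_alert(skew_by_host: dict[str, float],
                     max_skew_s: float = 5.0,
                     warn_frac: float = 0.02,
                     crit_frac: float = 0.04) -> str:
    """Alert on the fraction of skewed hosts, not on each individual host."""
    skewed = sum(1 for s in skew_by_host.values() if abs(s) > max_skew_s)
    frac = skewed / max(len(skew_by_host), 1)
    if frac > crit_frac:
        return "critical"   # pages the oncall, even at 3am
    if frac > warn_frac:
        return "warning"    # visible to fleet ops, no page outside sociable hours
    return "ok"

# One skewed VM out of hundreds stays quiet; a fleet-wide event goes critical.
hosts = {f"vm{i}": 0.1 for i in range(200)}
print(fleet_skew_alert(hosts | {"vm999": 30.0}))                        # ok
print(fleet_skew_alert(hosts | {f"bad{i}": 60.0 for i in range(20)}))   # critical
```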
Related and fascinating article that came up on HN recently after the originator of NTP, David Mills, died: https://www.newyorker.com/tech/annals-of-technology/the-thor...
(Just turn off JavaScript to read it if you hit a paywall).
> the CRL size for the median certificate is 51KB and that half of all CRLs are under 900B.
What? So there are no CRLs between 900B and 51KB, and the first one larger than 51KB just happened to be the median one??
Not sure, but: median certificate (so each CRL has a multiplicity of however many certificates would use it, or perhaps of how many times it is actively retrieved) vs median CRL size (each CRL listed once)
Or they meant mean for the first one, I guess.
Edit: it's the former, from the paper:
> We immediately observe that half of all CRLs are under 900 B. However, this statistic is deceiving: if you select a certificate at random from the Leaf Set, it is unlikely to point to a tiny CRL, since the tiny CRLs cover very few certificates.
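A made-up illustration of why those two medians differ so much: most CRLs are tiny, but a randomly chosen certificate almost always points at one of the big ones, so weighting by certificates shifts the median way up.

```python
from statistics import median

# (CRL size in bytes, number of leaf certificates pointing at it) -- made up.
crls = [(600, 5), (800, 5), (900, 5), (51_000, 50), (300_000, 10)]

# Median over CRLs, each counted once: half of all CRLs are tiny.
print(median(size for size, _ in crls))                      # 900

# Median over certificates: weight each CRL by how many certs use it.
per_cert = [size for size, n_certs in crls for _ in range(n_certs)]
print(median(per_cert))                                      # 51000
```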