Broken VPNs, the Year 2038, and certs that expired 100 years ago (theregister.com)
This is a great mystery story, with a satisfying ending. And this
> "I generally start troubleshooting an issue by asking the system what it is doing," explained Zimmie. "Packet captures, poking through logs, and so on. After a few rounds of this, I start hypothesizing a reason, and testing my hypothesis. Basic scientific method stuff. Most are simple to check. When they're wrong, I just move on. As I start narrowing down the possibilities, and the hypotheses are proven, it's electric. Making an educated guess and proving it's right is incredibly satisfying."
is an approach every one of us should internalize.
Binary search (or bisecting) is also an incredibly valuable approach that I don’t see junior and intermediate engineers reach for nearly as often as they should.
When something is failing, find a midpoint between where things are working and where the bug is manifesting. Do you see evidence of the bug there? If so, the problem was introduced earlier, so look earlier in the pipeline. If not, look later. Repeat.
In my experience this process is the primary distinguisher between those who flail around looking for a root cause and the people who can rapidly come to an answer.
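A toy sketch of that bisection loop, with made-up stage names and a made-up `shows_bug` check, just to make the shape concrete:

```python
def first_bad_stage(stages, shows_bug):
    """Binary-search an ordered pipeline for the first stage showing the bug.

    `stages` runs from a known-good start to the point where the bug
    manifests; `shows_bug(stage)` inspects that stage's output (logs,
    packet captures, intermediate data) and returns True if the bug is
    already visible there.
    """
    good, bad = -1, len(stages) - 1   # the bug is known to show at the end
    while bad - good > 1:
        mid = (good + bad) // 2
        if shows_bug(stages[mid]):
            bad = mid                 # bug visible here: it was introduced earlier
        else:
            good = mid                # still looks fine here: look later
    return stages[bad]

# Hypothetical pipeline; the bug first becomes visible in the "transform" stage.
pipeline = ["ingest", "parse", "validate", "transform", "store", "render"]
print(first_bad_stage(pipeline, lambda s: s in ("transform", "store", "render")))
```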
Good call. When you've got no idea where to start, that's how to start.
Mostly, though, I think people "flail" because they don't know the pipeline well enough to even do that. I know I've been in that position before, when approaching completely new (to me) systems. (Sometimes there isn't someone more knowledgeable you can ask!) That's where I find hypothesis -> test -> refine particularly useful. You're still wrong far, far more often than you're right, but it stops feeling like flailing, and more like making progress towards understanding the system well enough to apply other techniques (whatever they might be) more smartly.
`git bisect` is one of those things I wish I'd internalized sooner in my career. It can be so incredibly powerful, especially when you just hand it a shell script (`git bisect run`) and let it rip without having to guide it by hand.
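For anyone who hasn't tried `git bisect run`: the command you hand it just needs to exit 0 for a good revision, non-zero for a bad one, and 125 to skip a revision that can't be tested. A minimal sketch of such a test script; the build command and the repro script are placeholders for whatever your project actually uses:

```python
#!/usr/bin/env python3
# Usage (revisions are hypothetical):
#   git bisect start
#   git bisect bad HEAD
#   git bisect good v1.2.0
#   git bisect run python3 bisect_test.py
import subprocess
import sys

# If the tree doesn't even build at this revision, tell bisect to skip it
# (exit 125) rather than mislabel it as good or bad.
if subprocess.run(["make", "-j4"], capture_output=True).returncode != 0:
    sys.exit(125)

# Run the one check that reproduces the bug; ./run_repro.sh is a placeholder.
repro = subprocess.run(["./run_repro.sh"], capture_output=True)
sys.exit(0 if repro.returncode == 0 else 1)
```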
Once someone understands a complex system well enough to find a good midpoint, are they still a junior engineer?
I use this technique all the time to help people who are stuck with problems using software - often caused by bugs. Divide-and-conquer quickly isolates the issue. I try to share the technique when I use it, or just offer it as a suggestion.
Part of why it's so useful is that you hardly have to understand anything about the system internally. If you don't already have a working case, just reduce the complexity of what you're doing until it works, to find the lower bound.
That random guessing is like gambling - you hope for a big quick payout but when your hypothesis fails, you end up worse off than before. Wasted time and no closer to the solution.
100%
I've wondered why this isn't second nature to engineers, junior or otherwise.
Maybe they don't really understand the pipeline? ("I enter the value in the web form and it just appears in the database.")
I think I kind of internalized that idea from my early soft eng courses; after seeing how efficiently a computer can find a result by cutting the set in half repeatedly, I've tried to apply that approach elsewhere when it fits.
I like that term. I always called this "divide and conquer."
I remember one of the Car Talk guys used the term "binary chop" once when talking to a caller about diagnosing a problem.
> I don’t see junior and intermediate engineers reach for nearly as often as they should
Or senior engineers
That's just how you debug any system.
If you're on this site and haven't already internalized it...how do you debug?
How do you debug if this isn’t what you’re doing? I’m genuinely curious… are you using some sort of advanced tools like a psychopath?
1. Internet search 2. Make random changes 3. Test 4. Start over.
I can't even imagine how that would work with any complicated system.
Not well. It's pretty frustrating to observe its practitioners in the wild.
Ran into a case where a whole datacenter became untethered from its NTP upstream and drifted off into a timezone of its own creation. Customer was failing authentication for a data product we sold them (TSIG was failing). I was on the phone with them for an hour, reassuring them constantly that everything was working for our other customers, tailing logs, and reporting what I saw.
More datacenter stakeholders kept joining the call, most of whom had nothing to do with our data product. Many times I heard people ask "have they found the problem yet" as though.. what? We were the best tech support they had for an entire data center going dark? After an hour somebody noticed that the clocks on servers in the datacenter didn't match up with their laptop; shortly after that I was able to extricate myself from the call... still watching the logs, their downloads started working again a short while later.
> Many times I heard people ask "have they found the problem yet" as though.. what? We were the best tech support they had for an entire data center going dark?
Possible. Some companies are mostly lacking in competent technical people, so anyone who knows what they're doing will quickly find themselves pulled into every possible task; I see no reason why this shouldn't include external parties.
Especially if things enter very strange territory, like NTP running wild.
I remember that some time ago, one of our customers had trouble with our on-prem installation. Eventually it seemed that the database had been corrupted. At that point I could tell the poor guy on the other side was following along with what I was doing with my colleague on a couple of other, similar systems, but I noticed he was getting pretty nervous, so I figured as long as the clock runs, whatever. It's kind of what I do a lot at work, and I don't like leaving people out in the rain like that.
And eventually we could confirm that large numbers of VMDKs had been corrupted in various ways. It seemed another vendor had let the SAN they were managing run full, or into some other catastrophic situation. And their backup appliance also didn't work.
> Possible.
I had dealt with these people for several years. In my company, I changed teams several times.
But I still got emails inviting me to internal meetings: they thought? assumed? hoped? ...that I was an internal consultant.
This was pursuant to a larger project and I was on the call for that one and pointed out that I was no longer on the implementation team but was thrilled to be there... because I was. That final / initial deployment resulted in WTFs and I said "woot!" and dropped the call.
> resulted in WTFs
It was a visibility / cybersecurity product and it pretty much immediately paid for itself. ;-)
I have a very soft spot for this kind of "campfire story". Open Office not printing on Tuesdays comes to mind. Anyone got some more?
I'm partial to the 500 mile email: https://www.ibiblio.org/harris/500milemail.html
A tale I use in interviews is "The Homesick Laptop's Replacement Desktop That Ate Hard Drives in Summer".
Long story short, Dell ship-of-Theseus'd an entire machine looking for an issue that only happened on cloudy-hot days when the disks were under high load. It turned out to be an air conditioner out of phase with the rest of the system, causing EMI that the power supply just let through.
Reminds me of something I had to deal with in my audio system.
I was getting an awful buzzing/static sound in my speakers. I went through my chain one component at a time, unplugging each and seeing whether the noise went away.
As it turns out, my PC's video card was barfing electrical noise through every port...including the ethernet port. Unfortunately I didn't know the difference between unshielded and shielded twisted pair and had used shielded by mistake.
That shielded twisted pair allowed the noise to go out of my GPU, into the motherboard, through the ethernet port, then down to my ethernet switch. From there, the switch connected to the raspberry pi I used for streaming, where it helpfully forwarded that noise straight into the DAC and therefore the rest of the chain.
I tell you, that drove me nuts!
This story comes to mind: https://web.mit.edu/jemorris/humor/500-miles
I briefly collected them on a website called "bedtime stories for engineers". I really wish someone would put in the effort to collect those stories, because they're just so good.
Google "My car is allergic to vanilla ice cream". I can only find rehashes of it, not the actual source, so I didn't link to it, but the basic story is easy to find.
> I suspect the NTP server had a badly faulty internal clock which ran very fast.
A time server with a defective clock seems to be a serious problem. Zimmie says the time server was an appliance; so someone is selling as an appliance a time server that can't tell the time.
Not only could it not tell the time, it also had catastrophic bugs in time handling and doesn't handle y2k38. If that was on my network, the vendor would get yeeted immediately.
I was quite aware that our (different company) time server was based on getting CDMA signal, and - oh, wait - CDMA was retired last year? Luckily, we could open our internet firewall to let it talk to external stratum 1 servers and configure it to be stratum 2, rather than the stratum 1 we bought. Replacements using GPS are in process, but are impeded by weak GPS signal in the data centers. Antennas are to be implemented...
The 2038 thing I get, but the clock drift of BILLIONS of seconds really scares me. What kind of fucked up setup can lead to something like this?
Not nearly as interesting a story: in 1996 I visited a customer who was using us for dialup services, but reported that some of their Windows desktops couldn't connect.
It didn't take me long to figure out that the computers that weren't working had their clocks set well into the 21st century. The shell couldn't even display the year properly; I assumed a Y2K incompatibility, but after so many years I can't remember exactly what I saw.
Anyway, easy fix, but I never did find out what caused such a weird glitch in their environment. It's small wonder that many people aren't fluent with computers: they misbehave in such a wide variety of ways.
Last year I had one (of many) freshly provisioned Linux VMs change its clock to the year 2257 two nights in a row. Never figured that out sadly; reprovisioning "fixed" it.
> Ambassador Kosh's ship arrives at the Epsilon III Jumpgate two days ahead of schedule. Upon leaving his ship the Minbari assassin approaches Kosh in the guise of Jeffrey Sinclair, whom Kosh recognises as Entil'Zha Valen. When he extends a hand in greeting to his 'old friend' the Minbari slaps a skin tab dosed with Florazyne, causing Kosh to collapse and lose consciousness
Oh no.
Why does this NTP implementation accept a sudden change of 4 billion seconds? For example, the NTP implementation in Windows refuses to change the clock by more than 54,000 seconds.
When synchronising Windows with an external source, you can slowly correct the time using: https://learn.microsoft.com/en-us/windows/win32/api/sysinfoa...
Using that API avoids sudden jumps in time. The cost is that if a correction is required then the system time will be incorrect until the difference settles to zero. And you ideally need some PID control so that the system time settles quickly to match the "correct" external time.
For example, you can spread a 1 second adjustment over an hour. Sometimes being up to one second out is less of a problem than a sudden jump of one second.
It is useful to have time monotonically increasing if you have software that depends on time differences (e.g. timestamps stored in logging systems).
Not sure if Microsoft gimped the API after XP - this note seems bad: "Currently, Windows Vista and Windows 7 machines will lose any time adjustments set less than 16." That makes it difficult to use the API to keep the time closely synchronised.
Isn't this what Google dubbed "smearing" in their Spanner paper?
Kinda. Clock slewing is the term used in NTP-land and is basically about making this kind of adjustment for small time differences (generally you want to run a control loop that adjusts the rate of the internal clock so that the time difference goes to zero, as opposed to simply changing the time, as one will result in a smooth time reading while the other will cause periodic jumps). Smearing is basically pretending that the leap second doesn't exist and using clock slewing to paper over the difference.
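A minimal sketch of slewing, assuming a constant maximum slew rate of 500 ppm; this is only an illustration of the idea, not how ntpd or chrony actually adjust the kernel clock:

```python
import time

def slew(offset_s: float, max_rate: float = 0.0005, tick: float = 1.0):
    """Bleed off `offset_s` seconds of clock error without ever stepping.

    max_rate is the fraction of each tick we are willing to stretch or
    shrink the clock by (500 ppm here, a common slew limit).
    """
    remaining = offset_s
    while abs(remaining) > 1e-6:
        # Correct at most max_rate * tick seconds per tick, in either direction.
        step = max(-max_rate * tick, min(max_rate * tick, remaining))
        remaining -= step
        # A real implementation would nudge the kernel clock frequency here;
        # this sketch just reports what it would do.
        print(f"adjust by {step * 1e6:+.0f} us this tick, {remaining:+.6f} s left")
        time.sleep(tick)

# Bleeding off a 1 s error at 500 ppm takes ~2000 ticks (~33 minutes here).
slew(1.0)
```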
AFAIK they were using Linux, and they were still keeping all the servers synchronised (smearing at the same rate).
Linux systemd-timesyncd seems to have some limit, too. When our appliance-like systems get an invalid time from the hardware clock, they boot with a system clock of 2019 IIRC. systemd-timesyncd does not correct that, despite the ntp server working correctly as far as I can tell.
Well, it happens rarely enough and we have a workaround, so the bug report still sits in the queue behind more urgent problems. Haven't read the source yet, which is what you should do on Linux, of course...
This reminded me of this article from last year: https://arstechnica.com/security/2023/08/windows-feature-tha... (HN discussion: https://news.ycombinator.com/item?id=37151220)
I'm not sure I see why it was revoking the certificates; when you renew a certificate that's about to expire, you can just let the old one expire, right?
I'd say that more often than not people building this sort of stuff in-house have no idea what they're doing. So although that part of the design doesn't make much sense it's not astonishing to see it.
A PKI provides a deeply technical solution to a hard problem you probably don't have. This technology is most often deployed when somebody has a different, easy problem, but they don't like the relatively easy non-technical solution.
This can go back to your old buddy NTP, specifically DHCP handing out NTP servers on untrusted networks. If you control the network (and therefore the time?) and you manage to get the full expired certificate, you may be able to MITM the victim successfully. If you force the CRL check first then things won't match up. I have no idea about the feasibility of faking the CRL though, so it might be a wash.
Seems like it’d be fairly difficult in practice to change time on a host such that you can use an expired certificate without breaking a bunch of other stuff
The solution to the year 2038 problem is to upgrade your time-since-the-Unix-epoch fields to 64-bit integers. Hopefully this won't be an actual issue 14 years from now, because it's such a simple fix.
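The rollover itself is easy to demonstrate. A quick sketch (Python standing in for a C program that keeps time in a signed 32-bit `time_t`):

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2**31 - 1  # the last second a signed 32-bit time_t can represent

print(EPOCH + timedelta(seconds=MAX_INT32))   # 2038-01-19 03:14:07+00:00

# One second later a signed 32-bit counter wraps to its most negative value,
# which lands back in 1901.
wrapped, = struct.unpack("<i", struct.pack("<I", (MAX_INT32 + 1) & 0xFFFFFFFF))
print(wrapped)                                # -2147483648
print(EPOCH + timedelta(seconds=wrapped))     # 1901-12-13 20:45:52+00:00
```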
About a decade ago I was involved in the development of an embedded product for industrial use cases. The kind of stuff you install once and use for 20-40 years. The library we used for displaying the time breaks around 2036 (so a bit ahead of y2k38). But the person responsible would long be in retirement by then and the issue doesn't impact critical functions, so it was decided not to do anything about it. This version of the product is still sold today. I doubt this story is uncommon.
It's just like Y2K, and just as pervasive, but harder to explain to upper management. My guess is that it won't go smoothly.
There's a perception that y2k was overblown because we spent tons of money and didn't have the problems that were suggested in the media
Of course, the fact that the problems were overhyped but, importantly, FIXED by all that money doesn't come into it; it turned into a cry-wolf situation.
The media wasn't telling us that the problems were getting fixed though. It kept on hyping the doomsday in whatever way it could. We were supposed to wake up in 2000 to find our fridge door open and melted ice all over the floor. Why? Fridges didn't have clocks that told them to turn themselves off, or clocks at all.
That anybody believed the media's claim that refrigerators had computers in them in the 1990s seems like a case of the Gell-Mann Amnesia effect[0]. Smart refrigerators didn't even exist until LG released one in June 2000 that was a commercial failure[1].
[0]: https://en.wikipedia.org/wiki/Michael_Crichton#GellMannAmnes... [1]: https://en.wikipedia.org/wiki/Smart_refrigerator#History
I remember the news telling us that all sorts of unexpected devices had computers in them these days and computer=Y2K bug risk. Microwaves certainly did so fridges would be an easy leap to make.
Oh NTP... I remember a series of extremely annoying incidents that were caused by time skew on hundreds of Linux VMs in our data center. Our setup was typical of a startup - built to be good enough at first, and fall apart at scale.
Every VM ran CentOS, and every one of them hit the default CentOS ntp servers. These are run by volunteers. The pool is generally good quality but using it the way we did was extremely stupid.
Every few weeks we'd have one of these "events" where hundreds of VMs in a data center would skew, causing havoc with authentication, replication, and clustering. We also had an alert that would notify the machine owner if drift exceeded some value. If that happened in the middle of the night, the oncall from every single team would get woken. And if they simply "acked" the alert and went back to sleep, the drift would continue, and by morning their service would almost certainly be suffering.
Whatever about diagnosing the cause, I started by writing a script that executed a time fix against a chosen internal server, just to resolve the immediate issue. I also converted the spam alert into one that Sensu (the monitoring/alerting system we used) would aggregate into a single alert to the fleet ops team. In other words, if >2% of machines were skewed by more than a few seconds, warn us. At >4%, go critical. (Only critical alerts would page the oncall outside sociable hours.)
Long story short, we switched to chrony because, unlike ntpd, we could convince it to "just fix the damn time": ntpd would refuse to correct the time if the jump was too big, and would just drift off forever until manually fixed. (No amount of config hacking and reading 'man ntpd' got around this.) We also chose two bare-metal servers in each data center to act as internal NTP servers, reducing the possibility of DoSing those volunteer NTP servers and getting our IP range blacklisted or fed dud data. Problem solved right there, and we also ended up with better monitoring of time skew across our fleet.
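A toy version of the aggregation rule described a couple of paragraphs up; the thresholds match the ones above, but the function name and data shape are made up for illustration:

```python
def fleet_skew_alert(skew_by_host: dict[str, float],
                     max_skew_s: float = 5.0,
                     warn_frac: float = 0.02,
                     crit_frac: float = 0.04) -> str:
    """Alert on the fraction of skewed hosts, not on each individual host."""
    skewed = sum(1 for s in skew_by_host.values() if abs(s) > max_skew_s)
    frac = skewed / max(len(skew_by_host), 1)
    if frac > crit_frac:
        return "critical"   # pages the oncall, even at 3am
    if frac > warn_frac:
        return "warning"    # visible to fleet ops, no page outside sociable hours
    return "ok"

# One skewed VM out of hundreds stays quiet; a fleet-wide event goes critical.
hosts = {f"vm{i}": 0.1 for i in range(200)}
print(fleet_skew_alert(hosts | {"vm999": 30.0}))                        # ok
print(fleet_skew_alert(hosts | {f"bad{i}": 60.0 for i in range(20)}))   # critical
```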
Related and fascinating article that came up on HN recently after the originator of NTP, David Mills, died: https://www.newyorker.com/tech/annals-of-technology/the-thor...
(Just turn off JavaScript to read it if you hit a paywall).
> the CRL size for the median certificate is 51KB and that half of all CRLs are under 900B.
What? So there are no CRLs between 900B and 51KB, and the first one larger than 51KB just happened to be the median one??
Not sure, but: median certificate (so each CRL has a multiplicity of however many certificates would use it, or perhaps of how many times it is actively retrieved) vs median CRL size (each CRL listed once)
Or they meant mean for the first one, I guess.
Edit: it's the former, from the paper:
> We immediately observe that half of all CRLs are under 900 B. However, this statistic is deceiving: if you select a certificate at random from the Leaf Set, it is unlikely to point to a tiny CRL, since the tiny CRLs cover very few certificates.
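A made-up illustration of why those two medians differ so much: most CRLs are tiny, but a randomly chosen certificate almost always points at one of the big ones, so weighting by certificates shifts the median way up.

```python
from statistics import median

# (CRL size in bytes, number of leaf certificates pointing at it) -- made up.
crls = [(600, 5), (800, 5), (900, 5), (51_000, 50), (300_000, 10)]

# Median over CRLs, each counted once: half of all CRLs are tiny.
print(median(size for size, _ in crls))                      # 900

# Median over certificates: weight each CRL by how many certs use it.
per_cert = [size for size, n_certs in crls for _ in range(n_certs)]
print(median(per_cert))                                      # 51000
```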