DNS is for people - not for IT infrastructure

The Domain Name System exists because it's difficult for people to remember IP addresses (185.15.59.224) and much easier to remember domain names (wikipedia.org).

Regarding internet-accessible services, it makes sense to publish websites, API endpoints or similar services using DNS, as people have to interfact with them. The added benefit of a domain name is that the associated IP address can change without the client being affected.

This article isn't against DNS for public services, but it questions if we should use DNS for internal IT infrastructure (independent of cloud vs. onprem)

It's always DNS

Although DNS can be a very beneficial service, it can also become a liability. If you want a reliable system, you want as little components as possible. Every additional component adds a potential risk of failure. In addition, more components may create unforeseen behaviour and interactions that can cause outages (circular dependancies, and so on). If you can avoid adding components, you'll have a better chance of building a reliable system.

Within the IT operations space, DNS has made a bit of a name for itself. Many may remember this little haiku.

It’s not DNS
There’s no way it’s DNS
It was DNS

(source)

There are multiple(1) high-profile(2) incidents where DNS was involved. In these linked cases, the root-cause of the incident isn't the DNS system itself. Yet, because the root-cause affects the DNS service - which is in the critical path for virtually all services - the incident has such a huge impact.

The Facebook / Meta outage was so significant because it locked people out of buildings (physical access) due to 'circular' dependancies on DNS being available. Again, it can be said that the circular dependancy is the root-cause, but the blast radius of DNS is in many cases so enormous that it may be difficult to have a clear end-to-end picture of potential risk.

The case against DNS for internal IT infrastructure

From the perspective of IT operations, DNS has a drawback: DNS clients cache DNS records based on TTL. Different DNS client implementations can behave differently, but even if you have a fairly homogenous environment, the only way to assure clients (in this case other servers) use the updated IP address, is to control them and force a DNS refresh.

That got me thinking, why would we use DNS for infrastructure services? It isn't necessary for machine-to-machine communication. Instead of configuring domain names that may not resolve, we can just directly inject the appropriate IP address(ess) into configuration files. It's easy to configure systems with tools like Ansible or pyinfra at scale.

The counter argument could be that DevOPS / platform engineers are also humans, and it's much easier to spot misconfigurations or to troubleshoot if domain names are configured Instead of IP addresses.

Fortunately, we still have /etc/hosts, which we can easily provision. Still no DNS service required! This way, we can configure domain names and pretend to use DNS. I also suspect that DNS queries against /etc/hosts are quite responsive.

DNS as generic security risk

As of today, most network traffic is encrypted by default, or tunneled through an encrypted channel. DNS is - by default - the exception. Regarding internal IT infrastructure (cloud or 'onprem'), the network may be considered as a secure environment. An attack on the DNS service, spoofing packets, and so on, can be very disruptive though. Setting up DNSSEC may alleviate this problem, but that also introduces another administrative burden with it's own risk of misconfiguration. It's yet another layer of complexity. And we assume that internal infrastructure supports DNSSEC.

DNS as an Egress Exfiltration risk

Because egress filtering (filtering of outbound connections) can be cumbersome, it's often omitted, because the systems involved are 'trusted'. This is unfortunate as this makes life easier for an attacker. Any kind of resource required for an attack can be acquired on the vulnerable system with a simple outbound query towards the internet. Proper egress filtering of network traffic can be the difference between a succesfull and unsuccessful hacking attempt.

A lack of egress filtering also makes it much easier for an attacker to exfiltrate data. And the thing is: any IP protocol can be used to exfiltrate data, including DNS¹.

This is how: the attacker gets a domain runs their internet-accessible authoritative nameserver for this domain. Now the attacker can make DNS requests to said domain like sensitivedata.evil.domain from the hacked system and you can extract all the data from the rogue DNS server logs².

Although a hacked server may not be able to directly interact with the attacker-controlled DNS server, by issuing DNS requests for the attacker-controlled domain, these requests will pass the local forwarding DNS server and be forwarded towards the attacker-controlled authoritative DNS server. See also tools like dnscat2 or iodine

Due to this risk, there is a case to be made, to - at least - not allow systems to query public DNS records. As servers may need to interfact with services on the internet (update servers, APIs, and so on), such access can be facilitated by a proxy server using allow-listed domains.

Evaluation and closing words

In the end, everything is a tradeoff, where people must balance benefits and drawbacks against the context of their infrastructure, their particular risk appetite and even organisational structure and culture.

That said, I think it's reasonable to explore if DNS can be avoided altogether within the IT infrastructure to increase reliability and robustness.

Feel free to share your thoughts and feelings about this if you feel so inclined. Maybe leave a comment on the Hacker News thread.

Postscript on HN response (update)

Update: I do find the responses on the HN thread very interesting from an antropological perspective. How people respond and what implicit assumptions they make. One thing that I found interesting is how many people implicitly assume an environment is 'kubernetes-like', and they focus on 'service discovery'. Problems do arise if you have many IP address mutations for your infrastructure, especially at scale. Because I was thinking about very old-fashioned virtual machine or bare-metal style environment at a small to medium scale (a scale that is much more common I believe, most environments aren't Facebook/Google or even 1/100000 that size).

I realised that I should have given more context in this article, under which conditions I think my 'DNS alternative' could work well and acknowledge where it would not. Anyway, acknowledging my fault for not creating enough context, I do find the knee-jerk responses, hostility and sometimes personal attacks curious. Only very few* people seem to pause, stay calm, think about my article, and acknowledge that it could work, within specific circumstances. Also many skipped a core part of the article which was thinking about risk and complexity, in some sense it wasn't only about DNS.

I also just realise that I made a very important mistake. Servers hosting applications don't need DNS resolving for the internet. I got a feeling from some responses that I was proposing that /etc/hosts would be updated with thousands of DNS records for external internet-facing services. I was only talking about internal infrastructure where internal services, like other servers, or NTP, SMTP, or whatever is configured. I think I did address this assumption in the article, but not nearly explicitly enough.