I recently ran into a weird issue after switching an application's container base image from Debian to Alpine Linux. Things worked out fine in staging and sandbox environments, but suddenly in production (of course!) the application failed to resolve a service it needs to talk to.
After going through the usual checklist (is the service up, is it a temporary networking issue, is this happening all the time on all nodes, etc.) and a few hours of tearing our hair out, we traced this to a combination of issues with the DNS implementation in Alpine Linux and the DNS service (Google DNS) we're using.
The first issue is that musl libc, the C standard library that powers Alpine Linux, supports neither DNS over TCP nor EDNS (Extension Mechanisms for DNS).
The second issue is how Google DNS answers some queries over UDP. This might be a security measure to prevent DNS amplification attacks, but I'm not entirely sure about this.
Disclaimer: I'm not a DNS expert at all. I only watched talks and read through mailing lists to dig into the issue we've been seeing, and I think I found a reasonable explanation.
Looking into DNS
DNS originally talked, and still primarily talks, over UDP on port 53. A DNS query is a single UDP packet, to which a DNS server sends back a DNS answer, again in a single UDP packet. However, because DNS answers are getting bigger and bigger (in our case the issue stems from running Kubernetes with an ingress on every node of the cluster, so a complete answer would return 30+ IP addresses), the standard was extended with two options:
- Increasing the size of the UDP packet above 512 bytes via the Extension Mechanisms for DNS (EDNS)
- Switching the protocol from UDP to TCP
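To make the two options a bit more concrete, here's a minimal sketch in Ruby of what they look like on the wire. The helper names (encode_name, build_query, frame_for_tcp) and the 4096-byte payload size are hypothetical choices for illustration, not something musl or Google DNS prescribes. The layout follows the DNS message format: EDNS adds an OPT pseudo-record (type 41) whose class field advertises the accepted UDP payload size, and DNS over TCP prefixes the same message with a 2-byte length.

```ruby
# Encode a hostname as DNS labels: each label is length-prefixed,
# and the name is terminated by a zero byte.
def encode_name(name)
  name.split('.').map { |label| [label.bytesize].pack('C') + label.b }.join + "\x00".b
end

# Build an A-record query. When udp_payload_size is given, an EDNS OPT
# pseudo-record (type 41) is appended; its CLASS field carries the largest
# UDP answer the client is willing to accept.
def build_query(hostname, id: 0x1234, udp_payload_size: nil)
  additional = udp_payload_size ? 1 : 0
  header   = [id, 0x0100, 1, 0, 0, additional].pack('n6') # 0x0100 = recursion desired
  question = encode_name(hostname) + [1, 1].pack('n2')    # QTYPE A, QCLASS IN
  return header + question unless udp_payload_size

  # Root name, TYPE 41 (OPT), CLASS = payload size, TTL 0, empty RDATA.
  opt = "\x00".b + [41, udp_payload_size, 0, 0].pack('nnNn')
  header + question + opt
end

# DNS over TCP sends the same message, prefixed with its length
# as a 2-byte big-endian integer.
def frame_for_tcp(message)
  [message.bytesize].pack('n') + message
end

plain = build_query('example.com')
edns  = build_query('example.com', udp_payload_size: 4096)
puts "plain UDP query: #{plain.bytesize} bytes"
puts "EDNS query:      #{edns.bytesize} bytes"
puts "TCP framed:      #{frame_for_tcp(plain).bytesize} bytes"
```

A resolver that supports neither option can only send the plain query and has to live with whatever fits into a 512-byte answer.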
Alpine Linux, or rather musl libc, doesn't support either of those options. To replicate the situation we've been running into, you can either start up an Alpine Docker image or use dig with the +noedns and the +ignore flags. The first one disables EDNS support, the second one disables retries via TCP in case the truncation flag was set in the initial DNS response via UDP.
In the response, the tc flag (short for truncation) is set, but there's no answer section at all, not even a truncated one. How come?
Well, it took me some time to figure this out. This section of the Google Public DNS documentation got me thinking: https://developers.google.com/speed/public-dns/docs/security#rate_limit
If DNS queries over UDP from one source IP address exceed the average bandwidth or amplification limit consistently (the occasional large response will pass), queries may be dropped or only a small response may be sent. Small responses may be an error response or an empty response with the truncation bit set (so that most legitimate queries will be retried via TCP and succeed). Not all systems or programs will retry via TCP, and DNS over TCP may be blocked by firewalls on the client side, so some applications may not operate correctly when replies are truncated. Nonetheless, truncation allows RFC-compliant clients to work properly in most cases.
That's exactly what we've been seeing: answers with the truncation flag set, but an empty answer section. So Google DNS (and other DNS providers) is essentially forcing clients to upgrade from UDP to TCP so that it can't be abused for DNS amplification attacks. Pretty clever!
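To make the truncation flag concrete, here's a rough sketch in Ruby of how a client could spot such a response. The names DnsHeader, parse_dns_header and needs_tcp_retry? are made up for illustration. The bit layout follows the DNS header format from RFC 1035, where TC is bit 0x0200 of the 16-bit flags field and the answer count is the fourth 16-bit field.

```ruby
# Hypothetical struct holding the header fields we care about.
DnsHeader = Struct.new(:id, :truncated, :answer_count)

# Parse the fixed 12-byte DNS header from a raw response.
def parse_dns_header(packet)
  raise ArgumentError, 'a DNS header needs at least 12 bytes' if packet.bytesize < 12

  # Six big-endian 16-bit fields: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT.
  id, flags, _questions, answers, _authority, _additional = packet.unpack('n6')
  DnsHeader.new(id, (flags & 0x0200) != 0, answers)
end

# A client that supports DNS over TCP repeats the query via TCP
# whenever the truncation bit is set.
def needs_tcp_retry?(packet)
  parse_dns_header(packet).truncated
end

# Header of an empty, truncated response: QR, TC, RD and RA set, no answers.
empty_truncated = [0x1234, 0x8380, 1, 0, 0, 0].pack('n6')
header = parse_dns_header(empty_truncated)
puts "truncated=#{header.truncated} answers=#{header.answer_count}"
```

A resolver without TCP support sees the same header, but all it can do with it is report that there are no answers.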
How do I fix my application?
Rich Felker, the maintainer of musl libc, suggests using a DNS library in your language/environment of choice which supports DNS over TCP. In fact, that's not too hard: most languages already have this as part of their standard library or as a package.
For Ruby, this is as simple as doing a require 'resolv-replace' somewhere at the very start of your application (e.g. in your config/application.rb file before any code is run).
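A minimal sketch of what that looks like, assuming a plain Ruby script rather than a Rails app. The resolver_replaced? helper is hypothetical and only there to check that the patch is active. It assumes that resolv-replace defines IPSocket.getaddress in a file called resolv-replace.rb, which is how the versions I'm aware of do it.

```ruby
# Must be loaded before any code opens sockets.
require 'resolv-replace'

# Hypothetical helper: true when IPSocket.getaddress is the one defined by
# resolv-replace, i.e. lookups go through Ruby's Resolv library instead of
# the resolver of the C library.
def resolver_replaced?
  location = IPSocket.method(:getaddress).source_location
  !location.nil? && File.basename(location.first) == 'resolv-replace.rb'
end

puts "resolv-replace active: #{resolver_replaced?}"
```

From then on, socket classes such as TCPSocket resolve hostnames through Resolv, which repeats a query via TCP when the answer comes back truncated.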