Some recurring email bounces have tipped me off to a really strange DNS issue on the boxes I admin. I really need help on this as it's impacting a real customer in a real way: delayed emails.
Often, and somewhat randomly, DNS will fail to resolve a domain. Sometimes it will fail to resolve it for >4 hours and trigger a diagnostic bounce from my sendmail.
When I run tests at the command line (dig) on the domains in question (the ones I've seen in email bounces) I will often very quickly get a resolve failure, but usually 5s later another dig to the same domain will resolve 100% ok!
Every box I am testing on has a similar config with BIND named 9.10 (9.8 on one box) running as the local recursive resolver. /etc/resolv.conf on all is 127.0.0.1. So that means every lookup that isn't cached is going to the root NS's.
When it fails to resolve, named.log logs an entry like:
20-Apr-2016 00:37:28.276 query-errors: debug 1: client 127.0.0.1#33971 (artscouncil.mb.ca): query failed (SERVFAIL) for artscouncil.mb.ca/IN/A at query.c:7769
20-Apr-2016 00:37:28.276 query-errors: debug 2: fetch completed at resolver.c:3658 for artscouncil.mb.ca/A in 0.215778: SERVFAIL/success [domain:artscouncil.mb.ca,referral:1,restart:3,qrysent:2,timeout:0,lame:0,neterr:2,badresp:0,adberr:0,findfail:0,valfail:0]
For manual tests it only logs one; for the real-life sendmail problems I'll see dozens/hundreds of the same thing, trying hour after hour (usually around the sendmail queue retry times).
One of my boxes is *extremely* well connected in the US, and while it seems to have errors slightly less often, it still has them. All the rest are on various levels of Shaw or MTS, res or business.
This seems to have just started popping up maybe 6 months ago. It feels like it's getting worse.
I've set up an easy test on the actual domains with the most problems:
rndc flush
dig +short sportmanitoba.ca
dig +short gymcan.org
dig +short brandoneagles.ca
dig +short interactivegym.org
dig +short artscouncil.mb.ca
rndc flush
dig +trace sportmanitoba.ca
dig +trace gymcan.org
dig +trace brandoneagles.ca
dig +trace interactivegym.org
dig +trace artscouncil.mb.ca
Maybe some others can run those tests on their boxes (but only if you're running BIND as caching resolver, which many/most people won't be).
Here's where it gets interesting: the +short tests fail on at least 1 of the domains (at random) about 1/8 of the time! On at least 5 different boxes out there! But +trace has never failed once on any box or any domain. It's like +trace does something different, maybe slowing the process down or something, that allows it to always succeed. (Failure for +short shows up as a missing line in the output; for +trace you have to look at the domain/IP returned near the bottom.)
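If anyone wants to hammer on it, a rough loop like this will repeat the +short test and count the misses (untested sketch; assumes bash, dig and rndc in the PATH, and that an empty answer means failure):

#!/bin/bash
# Repeat the +short test and count empty (failed) answers.
domains="sportmanitoba.ca gymcan.org brandoneagles.ca interactivegym.org artscouncil.mb.ca"
fails=0
for run in $(seq 1 20); do
    rndc flush
    for d in $domains; do
        if [ -z "$(dig +short "$d")" ]; then
            echo "FAIL: $d (run $run)"
            fails=$((fails+1))
        fi
    done
    sleep 10
done
echo "total failures: $fails"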
So, is it just these particular domains?? Something wrong on their (DNS) side? Or is it more domains, not just these? Is there any way to diagnose what exactly is failing? I find it bizarre that *all* of these domains regularly go down for 4+ hours causing an email bounce!?! Or is there something horribly wrong on my BIND caching DNS servers?
Perhaps they are slow, and my BINDs are just not waiting long enough. Is there a way to tell BIND to be more patient waiting for DNS packets to come in?
Maybe it's something regarding IPv6? (I'm doing this all in IPv4 and have no current interest in 6. And I'm only looking for A records.)
I have packet traces of the above sample commands when the lookups fail, but I can't really figure out what it's doing, other than one boatload of traffic for a tiny dns query. I can provide a trace privately on demand if you think you can help.
Even more odd, I never seem to have a problem with interactive things like web browsing. If this was a problem with all domains, I should see this in firefox all the time, but I don't. Maybe Firefox doesn't even obey resolv.conf and does its own thing, or retries heavily itself?
I also checked to ensure my iptables aren't dropping packets related to this.
Lastly, answers of "just use 8.8.8.8" aren't helpful because I also need to handle dynamic local, and in some cases, external DNS (often with multiple views), all in the same BIND/box (and I like uniformity across boxes for ease of admin). Sure, I could try another resolver, but I see no reason BIND can't be made to work, as it has for me for 20 years. And if this is a BIND bug, I want to submit it to help solve it.
Thanks guys!
Without taking the time to examine these carefully, I'd guess that those domains are being served off less-than-stellar DNS servers, and the fault is likely not at your end. There's a disgustingly-large % of DNS service in the wild that's outright held together with chewing gum and baling twine... bad glue records in particular are a problem. Examine the chain of authoritative servers for each and I'll bet you find some commonalities. Also there are dozens of DNS "lint" tools that will help you track down other people's errors as well as your own. Best guess without testing: domain has 3-4 servers listed at gTLD, only 2-3 of those are authoritative for the domain, and something along the line has an illegally-short TTL. -Adam
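Concretely, something along these lines shows the delegation and whether each listed server actually answers authoritatively (ns06.domaincontrol.com is just an example of a delegated server name; substitute whatever the trace actually hands back):

# walk the delegation from the roots down
dig +trace NS gymcan.org
# then ask each delegated server directly; a proper answer should have the aa flag set
dig +norecurse SOA gymcan.org @ns06.domaincontrol.com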
On 2016-04-20 Adam Thompson wrote:
Without taking the time to examine these carefully, I'd guess that those domains are being served off less-than-stellar DNS servers, and
Theo found most were hosted at godaddy (I guess that's what "domaincontrol.com" is?)... does that make your above statement less (or more!?!) likely? :-)
problem. Examine the chain of authoritative servers for each and I'll bet you find some commonalities. Also there are dozens of DNS "lint" tools that will help you track down other people's errors as well as your own. Best guess without testing: domain has 3-4 servers listed at gTLD, only 2-3 of those are authoritative for the domain, and
I'm digging into things, looking at the available tools you and Theo pointed to.
It's very bizarre. I just ran a quick test, manually typing dig <domain> one by one. On all but 1 of the domains I listed originally, dig immediately returned SERVFAIL on my first try! And when I up-arrowed 2s later and hit return to retry, each of those then succeeded (NOERROR).
The SERVFAIL ones return very quickly, all within 99-177ms. One outlier attempt that gave me SERVFAIL returned in 1ms... I guess it had a negative result cached (probably a sendmail queued for it).
Before I delve too much into this, I'd sure love it if someone else who runs BIND as a recursive resolver (or maybe even dnsmasq, as long as it does its own recursion) could just try my +short test a few times to see if they can reproduce it. Just cut & paste, takes 2 secs.... I have been known to have, shall we say, "customized" configs on relevant things like BIND and iptables.
rndc flush
dig +short sportmanitoba.ca
dig +short gymcan.org
dig +short brandoneagles.ca
dig +short interactivegym.org
dig +short artscouncil.mb.ca
For kicks I added in 5 more domains that I never have problems with, like well known companies, certain user groups, and one that I control the DNS server of. I reran the test 11 times, about 10s apart. In 6/11 tries I got 1 SERVFAIL. The others had no errors. All 6 failures were for the above domains, never once the "known good" ones I just added. So that's encouraging. (The failures were on sportmanitoba.ca x2, artscouncil.mb.ca x3, brandoneagles.ca x1.)
So the theory of "badly behaved name servers beyond my control" looks like it might be correct.
Assuming for a moment we're positive that's the case, should I be contacting someone on the other side about fixing this? I doubt the domain holders know/care about such technical things, but one would think the DNS hosting company might? (I certainly would want to know!)
Thanks!
On Apr 20, 2016, at 4:58 PM, Trevor Cordes trevor@tecnopolis.ca wrote:
Before I delve too much into this, I'd sure love it if someone else who runs BIND as a recursive resolver (or maybe even dnsmasq, as long as it does its own recursion) could just try my +short test a few times to see if they can reproduce it. Just cut & paste, takes 2 secs.... I have been known to have, shall we say, "customized" configs on relevant things like BIND and iptables.
I just ran this from home on my caching BIND resolver about 12 times, and I had 1 dig +short gymcan.org fail out of all of the runs; it came back almost immediately, not a 5s timeout. I am my own internet provider (in terms of IP access) as of last Tuesday, so I know I'm not seeing Shaw or MTS middleware messing with my packets.
Theodore Baschak - AS395089 - Hextet Systems https://ciscodude.net/ - https://hextet.systems/ https://theodorebaschak.com/ - http://mbix.ca/
On 2016-04-20 Theodore Baschak wrote:
I just ran this from home on my caching BIND resolver about 12 times, and I had 1 dig +short gymcan.org fail out of all of the runs; it came back almost immediately, not a 5s timeout. I am my own internet provider (in terms of IP access) as of last Tuesday, so I know I'm not seeing Shaw or MTS middleware messing with my packets.
Wow! Thanks a ton! That rules out a lot of possibilities and makes me feel a lot better. I guess it's just something I will have to live with, as long as it's limited to a few domains only.
Isn't it curious how the failures come back super quickly, rather than after some reasonable timeout... that's what I see too.
Interesting the others didn't fail for you, but I see that too, where one will be flakier than the others at random. I wouldn't be surprised if running the test at another time gives you a different result.
Further, during additional tests while capturing with tcpdump and visualizing with wireshark, tcpdump consistently tells me it is dropping 10-45 packets out of the roughly 1200 it captures while I dig +short just 2 domains in a 5s test. Is that level of drop normal? I couldn't get a single capture that didn't have drops. Is this just tcpdump not keeping up, or are these drops also not making it to BIND itself?? (Though I tested and the same # of drops occur on the "good" domains too.)
Also, to rule out iptables I added an accept all rule for --sport 53 (udp and tcp) super early in my external interface chain. Didn't help one bit. But I'm now 99% sure it's not iptables.
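Roughly equivalent rules, if anyone wants to replicate that test (eth0 is just a placeholder for the external interface; my real chains have different names):

iptables -I INPUT 1 -i eth0 -p udp --sport 53 -j ACCEPT
iptables -I INPUT 1 -i eth0 -p tcp --sport 53 -j ACCEPT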
I found 4 surprising things in wireshark:
1. After rndc flushing it takes about 600 packets to resolve one dinky domain name?? Wow!
2. AAAA records are coming across the wire; in fact they outnumber A records. I have IPv6 as "turned off"/blocked as much as you can on a modern Linux box, so I'm not sure why AAAA is showing up, but I guess it's neither here nor there as long as the A's are coming back ok. They certainly add a bucket of useless packets to the 600 total though.
3. Looks like I'm getting back (must be automatic) DNSSEC stuff in some of these packets. I don't have any of that configured in my BIND config, so unless it "just works" with no new config lines, it is probably just being ignored.
4. There's a fair amount of TCP port 53 traffic going on! My guess would have been it was all limited to UDP. Guess I'm behind the times...
Not sure if any of these revelations sheds any light.
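For anyone poking at a similar capture, a few display filters along these lines are handy (field names per wireshark's DNS dissector; dns.pcap is a placeholder file name):

tshark -r dns.pcap -Y 'dns.qry.type == 28'      # AAAA queries/answers
tshark -r dns.pcap -Y 'dns.flags.rcode != 0'    # anything that came back with an error rcode
tshark -r dns.pcap -Y 'tcp.port == 53'          # DNS over TCP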
I don't think what you are seeing is normal and to me it's all hinting at something local. I feel that there is something common to all your setups causing this. I don't think it's the upstream DNS providers.
One thing that pops to mind is UDP packet fragmentation. Perhaps there is something in the network setup or filtering (iptables) which is causing UDP packets to fragment but is dropping the second part of the fragment? This is a surprisingly common problem on a lot of firewalls, for example Sonicwall. Perhaps force dig to use TCP to see if the results are different (dig +tcp <host>).
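One quick way to check the fragmentation theory on the wire, while re-running the digs, is to watch for fragmented IPv4 packets alongside the DNS traffic (sketch only; eth0 is a placeholder):

# 'ip[6:2] & 0x3fff != 0' matches any packet with the MF flag set or a non-zero fragment offset
tcpdump -ni eth0 'udp port 53 or (ip[6:2] & 0x3fff != 0)'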
That's just one possibility among many. Swap out one of the machines for a totally different system (Windows laptop maybe?) and repeat the tests. If you don't get the same failure rate while on the same connection, then you know it's your setup.
John
On 2016-04-20 John Lange wrote:
I don't think what you are seeing is normal and to me it's all hinting at something local. I feel that there is something common to all your setups causing this. I don't think it's the upstream DNS providers.
Actually, you are probably correct in that this now seems to be a local BIND + upstream DNS problem. I guess I could try setting up dnsmasq, courtesy of MUUG's recent daemon-dash presentation, temporarily to see how that fares. I'm not sure what I'll find...
One thing that pops to mind is UDP packet fragmentation. Perhaps there is something in the network setup or filtering (iptables) which is causing UDP packets to fragment but is dropping the second part of the fragment? This is a surprisingly common problem on a lot of
I have this:
$iptables -N fragments
$iptables -A fragments -m limit --limit 10/minute -j LOG --log-prefix $p"99d fragments: "
$iptables -A fragments -j DROP
$iptables -A aad_first -f -j fragments
That runs very early in my rules to ditch all frags, but I just checked both /v/l/messages (where these are logged) and iptables -L -v | grep fragments, and both show zero hits, nada, on all boxes I am testing on, even immediately after these SERVFAIL tests.
So that can't be it. (In general I have a (rate-limited) LOG before nearly every DROP in my iptables, so I should see entries in /v/l/messages if iptables were throwing things away during these tests. And I just confirmed that I am not hitting any of the drops that aren't logged.)
I thought about kernel-level (apart from iptables) frag dropping, but I see nothing user-settable (I thought there might be an "on/off" switch like /proc/sys/net/ipv4/conf/all/rp_filter). It appears to be something you only play with in iptables, not a kernel sysctl.
firewalls, for example Sonicwall. Perhaps force dig to use TCP to see if the results are different (dig +tcp <host>).
Good idea. Curiouser and curiouser... I get 1-2 lookup failures on almost every single test when I use +tcp +short. That's worse than the previous tests (0-1 failures). That really does start to narrow down the possible causes!
That's just one possibility among many. Swap out one of the machines for a totally different system (Windows laptop maybe?) and repeat the
Windows won't help because it isn't running a local recursive resolver (well, I guess I could try Windows Server but that is beyond the scope...). It is a good idea though to try to replace what I can, perhaps a different distro or a BSD, or a different resolver.
The fact that +trace has yet to have any error at all means that it may be possible to make a resolver that won't fail in this way on these domains. That's why I think it might be something specific to BIND, perhaps. I doubt this happens if you set your resolv.conf to 8.8.8.8 because I bet the "big guys" are doing something more robust than BIND.
For kicks I just tried adding +tries=X to the dig commands, first =5 then =10 then =100 and the failure rate appears to stay pretty constant. Strangely, the time the commands take doesn't really change(?!).
Also, look into state and connection tracking in your iptables rules.
I'm using pretty stock idioms:
$iptables -A inextern -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
very early in the ruleset. Also, my temporary accept-everything SPT=53 (UDP and TCP) rules should have caught any weird-state packets. Not ruling it out completely, as this is very complex stuff, but it's been ages (10+ years) since I've had a conntrack bug.
I'll keep hunting...
Thanks for all the tips y'all; keep them coming and I'll try 'em!
I turned on extreme debug logging on BIND named and triggered a SERVFAIL and here's what it shows:
21-Apr-2016 00:44:55.592 client: debug 3: client 127.0.0.1#42594: UDP request
21-Apr-2016 00:44:55.592 client: debug 5: client 127.0.0.1#42594: using view '_default'
21-Apr-2016 00:44:55.592 security: debug 3: client 127.0.0.1#42594: request is not signed
21-Apr-2016 00:44:55.592 security: debug 3: client 127.0.0.1#42594: recursion available
21-Apr-2016 00:44:55.592 client: debug 3: client 127.0.0.1#42594: query
21-Apr-2016 00:44:55.592 client: debug 10: client 127.0.0.1#42594 (gymcan.org): ns_client_attach: ref = 1
21-Apr-2016 00:44:55.592 security: debug 3: client 127.0.0.1#42594 (gymcan.org): query (cache) 'gymcan.org/A/IN' approved
21-Apr-2016 00:44:55.592 client: debug 3: client 127.0.0.1#42594 (gymcan.org): replace
21-Apr-2016 00:44:55.592 client: debug 3: client @0x7f438001c6a0: udprecv
21-Apr-2016 00:44:56.224 query-errors: debug 1: client 127.0.0.1#42594 (gymcan.org): query failed (SERVFAIL) for gymcan.org/IN/A at query.c:7769
21-Apr-2016 00:44:56.224 client: debug 3: client 127.0.0.1#42594 (gymcan.org): error
21-Apr-2016 00:44:56.224 client: debug 3: client 127.0.0.1#42594 (gymcan.org): send
21-Apr-2016 00:44:56.224 client: debug 3: client 127.0.0.1#42594 (gymcan.org): sendto
21-Apr-2016 00:44:56.224 client: debug 3: client 127.0.0.1#42594 (gymcan.org): senddone
21-Apr-2016 00:44:56.224 client: debug 3: client 127.0.0.1#42594 (gymcan.org): next
21-Apr-2016 00:44:56.225 client: debug 10: client 127.0.0.1#42594 (gymcan.org): ns_client_detach: ref = 0
21-Apr-2016 00:44:56.225 client: debug 3: client 127.0.0.1#42594 (gymcan.org): endrequest
21-Apr-2016 00:44:56.225 query-errors: debug 2: fetch completed at resolver.c:3658 for gymcan.org/A in 0.632030: SERVFAIL/success [domain:gymcan.org,referral:2,restart:4,qrysent:2,timeout:0,lame:0,neterr:2,badresp:0,adberr:0,findfail:0,valfail:0]
Too bad they don't show even more info, but we can still wireshark the details.
So the error seems to be a "neterr", which the BIND docs describe as: "The number of erroneous results that the resolver encountered in sending queries at the domain zone. One common case is the remote server is unreachable and the resolver receives an ICMP unreachable error message."
But I confirmed no ICMP unreachable came in. "One common case"... I wonder what the other cases are!
Aside: I wiresharked making sure to capture ICMP as well, and no ICMP came across during the SERVFAIL, so that also helps to exclude fragmentation issues, as they should trigger an ICMP can't-fragment packet.
The problem: IPv6! Argh!
After that named debug I just posted, I poked around named.conf's debug options a bit more and saw that there were whole classes ("category") of logs I could enable. So I turned everything on at level 10 and hit the bug. Tada! Plain as day, there's my problem staring me in the face:
21-Apr-2016 01:06:05.188 lame-servers: info: network unreachable resolving 'brandoneagles.ca/A/IN': 2607:0:2:4:216:36:178:2#53
21-Apr-2016 01:06:05.188 query-errors: debug 1: client 127.0.0.1#44035 (brandoneagles.ca): query failed (SERVFAIL) for brandoneagles.ca/IN/A at query.c:7769
A lame-servers "network unreachable" for an /A/ record, attempting to use a distinctly IPv6-looking address. One minute and a quick google later, I see the solution: run BIND in "-4" mode. It's not enough to "disable" v6 by not listening on any v6 address; you must switch BIND into a "completely v4, never use v6" mode.
On Fedora or RHEL/CentOS/Oracle, edit /etc/sysconfig/named and add:
OPTIONS="-4"
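After restarting named (assuming the stock named.service unit), a quick sanity check that the flag actually took effect is to look at the running command line:

systemctl restart named
ps -C named -o args=     # should include -4 in the arguments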
Problem completely disappears immediately upon repeated tests on many different boxes.
(Hilarious, I probably would have seen ICMP messages had my tcpdump or ip6tables been set to log them! I guess tcpdump is just v4 by default?)
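Actually, tcpdump isn't v4-only by default; more likely my 'port 53'-style capture filter just never matched ICMP/ICMPv6. Next time, a capture along these lines should catch any v6-side errors too (the file name is just a placeholder):

tcpdump -ni any -s 0 -w dns-debug.pcap 'port 53 or icmp or icmp6'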
So the reason the original symptom seemed to get worse is that all these various NS's must slowly be turning on IPv6 DNS serving, one by one. And the reason it doesn't happen on every query is that some sort of round-robin or random server selection must be going on, by either BIND or the upstream DNS servers, which randomly lands me on an IPv6 nameserver address when looking up an A record? Dunno, that's my guess.
Hmm, it still gets me wondering why the "big guys" I added to my earlier tests, who almost certainly also have v6 enabled, never SERVFAILed? Perhaps there is a DNS bug on the remote side where the buggy guys are sending me AAAA records when they aren't supposed to? Looking at the other named debug output I can see that, in this case, they send me back, just before the final lookup:
;ns1.westmancom.com. 172800 IN A 216.36.128.2
;ns1.westmancom.com. 172800 IN AAAA 2607:0:2:1:216:36:128:2
;ns2.westmancom.com. 172800 IN A 216.36.128.3
;ns2.westmancom.com. 172800 IN AAAA 2607:0:2:1:216:36:128:3
And almost certainly my resolver is picking one of those records at random, failing 50% of the time when it chooses an AAAA.
Why is the v6 failing on my boxes, you ask? Because I have it as "off" as it can be in Fedora without messing up the whole OS (learned this the hard way a long time ago):
#ip6tables -L -n -v
Chain INPUT (policy DROP 0 packets, 0 bytes)
 pkts bytes target  prot opt in  out  source  destination
  239 12170 ACCEPT  all       lo  *    ::/0    ::/0

Chain FORWARD (policy DROP 0 packets, 0 bytes)
 pkts bytes target  prot opt in  out  source  destination

Chain OUTPUT (policy DROP 0 packets, 0 bytes)
 pkts bytes target  prot opt in  out  source  destination
  239 12170 ACCEPT  all       *   lo   ::/0    ::/0
On purpose. Last I checked, my and my customers' Shaw and MTS (some residential!) connections didn't provide IPv6 ability/routing, but then again, I haven't checked in forever. If the ISPs I work with don't support it, v6 is completely a moot point for me (no, I don't want to tunnel). So I always do my best to disable v6. I don't want to be on the bleeding edge of networking, I just want stuff to work and be as secure as possible. Translating my massive v4 iptables ruleset into v6 is for a day I hope to long postpone.
I recall a long time back I had the same problem with squid, and had to figure out how to de-v6 squid to get things to work. I guess nearly every daemon will need this at some point.
<PRO-ACTIVE RANT TO WARD OFF v6ER RANTS> I wish the world would just pick a day where we switch off IPv4 completely and are all forced to use IPv6. I really hate the current hybrid approach. I would hazard a guess 95% of the home user / small business world (where I dwell) will be using IPv4 for 95% of their traffic for another 10 years. They said 4 was dead 20 years ago and here we are, still 95/95+. Heck, forget 10 years, it may be forever!
The day I switch is the day that a) some "real life" important servers go 6-only, and b) every server in the world supports 6. On that day I will change everything to 6 and disable 4. If the day comes where (a) occurs before (b) I will *not* be a happy camper as that means we'll all be forced to hybrid.
Good luck to all the valiant early adopters of 6 trying to fight the good fight, hopefully you'll make "switchover day" a much easier battle. Until then, OPTIONS="-4"!! (And thanks IPv6 for wasting 3 more of my evenings!) </PRO-ACTIVE RANT TO WARD OFF v6ER RANTS>
On Thu, 21 Apr 2016, Trevor Cordes wrote:
The problem: IPv6! Argh!
Is there a firewall blocking *TCP* port 53? With DNSSEC and IPv6 we will see more DNS responses that are too big to send over UDP.
Ref: http://serverfault.com/questions/404840/when-do-dns-queries-use-tcp-instead-...
Regards, Daryl
On 2016-06-11 Daryl F wrote:
On Thu, 21 Apr 2016, Trevor Cordes wrote:
The problem: IPv6! Argh!
Is there a firewall blocking *TCP* port 53? With DNSSEC and IPv6 we will see more DNS responses that are too big to send over UDP.
Since it's outgoing traffic we're talking about: nope, my firewalls do not block TCP port 53 in IPv4, as I'm pretty open about chain OUTPUT (but not FORWARD).
I discussed the issue further with a group of MUUG guys at a meeting, and we believe (haven't tested yet) the problem was almost definitely two things: 1) I dropped all IPv6, and 2) I *dropped*, not *rejected*, all IPv6.
We believe both have to hold, hence why not many people hit this bug: many people (and most stock routers) would reject by default (iptables ... -j REJECT), which *probably* tells BIND to immediately give up on 6 and try 4, mitigating the bug. That's the theory, anyhow.
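Untested, but the idea would be something along these lines: leave the DROP policies alone and just make the last rule an explicit reject, so applications get an immediate ICMPv6 error back instead of silence:

# lo is already accepted earlier in the chains; everything else gets an immediate error
ip6tables -A OUTPUT -j REJECT --reject-with icmp6-adm-prohibited
ip6tables -A INPUT -j REJECT --reject-with icmp6-adm-prohibited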
I'll test it one of these days...
I ran dig +short gymcan.org a whole pile of times and it never failed for me. I also ran it directly against the authoritative name server (dig @ns06.domaincontrol.com. gymcan.org) with the same result (no failures).
Also, I monitored it with tcpdump and the packet size is not larger than 72 bytes so fragmentation is unlikely.
I'm still suspicious of the iptables setup. I'd try stopping the firewall entirely (set all the default policies to ACCEPT and flush the rules) and run the tests again, just to fully rule that out.
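For a throwaway test (easy to undo by re-running your firewall script), that boils down to roughly:

# default policies to ACCEPT, then flush all rules (repeat for ip6tables if you like)
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -F
iptables -t nat -F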
I think the thing you need to solve is why you are dropping packets. That isn't normal, and since it's spread across multiple servers on different providers, it's most likely your config.
John
On Apr 20, 2016, at 2:03 AM, Trevor Cordes trevor@tecnopolis.ca wrote:
When I run tests at the command line (dig) on the domains in question (the ones I've seen in email bounces) I will often very quickly get a resolve failure, but usually 5s later another dig to the same domain will resolve 100% ok!
I often use a utility called "check-soa" to check that each of the nameservers listed in the last NS response responds with an SOA: https://github.com/bortzmeyer/check-soa
When running it against each of the 5 domains below I did experience occasional delays in command-line output. Since none of the latency values reported went up, I'm guessing that delay was from the original NS response.
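Usage is basically just the zone name (going from memory here); it queries every listed nameserver for the zone's SOA and reports which ones answer and how quickly:

check-soa gymcan.org
check-soa artscouncil.mb.ca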
rndc flush
dig +short sportmanitoba.ca
dig +short gymcan.org
dig +short brandoneagles.ca
dig +short interactivegym.org
dig +short artscouncil.mb.ca
rndc flush
dig +trace sportmanitoba.ca
dig +trace gymcan.org
dig +trace brandoneagles.ca
dig +trace interactivegym.org
dig +trace artscouncil.mb.ca
Here's where it gets interesting: the +short tests fail on at least 1 of the domains (at random) about 1/8 of the time! On at least 5 different boxes out there! But +trace has never failed once on any box or any domain. It's like +trace does something different, maybe slowing the process down or something, that allows it to always succeed. (Failure for +short shows up as a missing line in the output; for +trace you have to look at the domain/IP returned near the bottom.)
AFAIK +trace doesn't use your caching resolver except to get the list of root (.) nameservers/IPs to start the iteration from, so this is definitely doing something different.
So, is it just these particular domains?? Something wrong on their (DNS) side? Or is it more domains, not just these? Is there any way to diagnose what exactly is failing? I find it bizarre that *all* of these domains regularly go down for 4+ hours causing an email bounce!?! Or is there something horribly wrong on my BIND caching DNS servers?
4 of 5 of these domains are on godaddy, the other has DNS handled by Westman.
Perhaps they are slow, and my BINDs are just not waiting long enough. Is there a way to tell BIND to be more patient waiting for DNS packets to come in?
Normally DNS has a multi-second timeout. I'm not sure of the technical details of how BIND handles SERVFAILs, but BIND does note which servers respond quicker and weights those.
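If you do want BIND to be more patient, the knob I'm aware of is resolver-query-timeout in the options block; rough sketch only (9.10 takes a value in seconds, capped at 30, and I believe newer releases switched the units to milliseconds):

options {
    // total time the resolver spends on one recursive query before answering SERVFAIL
    resolver-query-timeout 30;
};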
Maybe it's something regarding IPv6? (I'm doing this all in IPv4 and have no current interest in 6. And I'm only looking for A records.)
If you don't have v6 addresses on things, generally they won't be asking for AAAA records, so I don't think this should be an issue for you.
Theodore Baschak - AS395089 - Hextet Systems https://ciscodude.net/ - https://hextet.systems/ https://theodorebaschak.com/ - http://mbix.ca/