On Apr 20, 2016, at 2:03 AM, Trevor Cordes <trevor@tecnopolis.ca> wrote:

When I run tests at the command line (dig) on the domains in question (the
ones I've seen in email bounces) I will often very quickly get a resolve
failure, but usually 5s later another dig to the same domain will resolve
100% ok!

I often use a utility called "check-soa" to check that each of the nameservers listed in the last NS response responds with an SOA.

When running it against each of the 5 domains below, I did see occasional delays in the command-line output. Since none of the reported latency values went up, I'm guessing the delay came from the original NS lookup.
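
For anyone without check-soa handy, here is a rough sketch of the same kind of check using only dig. It is not the check-soa utility itself; the function name soa_per_ns and the 2-second, single-try limits are my own choices for illustration. It asks the local resolver for the NS set, then asks each listed server directly for the SOA and prints the serial it gets back.

```shell
# Rough per-nameserver SOA check using only dig (not the real check-soa).
# soa_per_ns is a hypothetical helper name.
soa_per_ns() {
    domain=$1
    # Get the NS set from the configured resolver; drop dig's ";;" error lines.
    for ns in $(dig +short NS "$domain" | grep -v '^;'); do
        # +norecurse: ask the server only for data it holds itself.
        # +time=2 +tries=1: give up quickly so a silent server stands out.
        serial=$(dig +short +norecurse +time=2 +tries=1 SOA "$domain" "@$ns" |
                 awk '!/^;/ { print $3; exit }')
        printf '%s\t%s\n' "$ns" "${serial:-NO ANSWER}"
    done
}
```

Usage would be something like "soa_per_ns artscouncil.mb.ca". Note the nameserver names themselves are still resolved through the local resolver, so this does not bypass it completely.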

Every box I am testing on has a similar config with BIND named 9.10 (9.8
on one box) running as the local recursive resolver. /etc/resolv.conf on
all is 127.0.0.1. So that means every lookup that isn't cached is going
to the root NS's.

When it fails to resolve, named.log logs an entry like:
20-Apr-2016 00:37:28.276 query-errors: debug 1: client 127.0.0.1#33971 (artscouncil.mb.ca): query failed (SERVFAIL) for artscouncil.mb.ca/IN/A at query.c:7769
20-Apr-2016 00:37:28.276 query-errors: debug 2: fetch completed at resolver.c:3658 for artscouncil.mb.ca/A in 0.215778: SERVFAIL/success [domain:artscouncil.mb.ca,referral:1,restart:3,qrysent:2,timeout:0,lame:0,neterr:2,badresp:0,adberr:0,findfail:0,valfail:0]
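
With dozens or hundreds of these lines, the bracketed counters are easier to compare once the zero ones are stripped out. A small sketch, assuming the "fetch completed" line format shown above; the function name fetch_counters is made up, and it reads log text on stdin.

```shell
# Print the non-zero counters from "fetch completed" lines of a named log.
# fetch_counters is a hypothetical helper; it assumes the log format shown above.
fetch_counters() {
    grep 'fetch completed' |
    # Keep the name/type and the bracketed counter list.
    sed -n 's/.* for \([^ ]*\) in [0-9.]*: .*\[\(.*\)\]$/\1 \2/p' |
    tr ',' ' ' |
    awk '{
        out = $1
        for (i = 2; i <= NF; i++) {
            split($i, kv, ":")
            # Skip the domain field and any counter that is zero.
            if (kv[1] != "domain" && kv[2] + 0 != 0)
                out = out " " kv[1] "=" kv[2]
        }
        print out
    }'
}
```

Fed the sample line above (for example "fetch_counters < named.log"), this leaves referral, restart, qrysent and neterr. In that sample timeout is 0, neterr is 2, and the fetch finished in about 0.2 s, which would suggest a network-level error rather than the resolver running out of patience, though I have not checked exactly what BIND counts under neterr.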

For manual tests it only logs one; for the real-life sendmail problem
ones, I'll see dozens or hundreds of the same thing, trying hour after hour
(usually around the sendmail queue retry times).

One of my boxes is *extremely* well connected in the US, and while it
seems to have errors slightly less often, it still has them. All the rest
are on various levels of Shaw or MTS, res or business.

This seems to have just started popping up maybe 6 months ago. It feels
like it's getting worse.

I've set up an easy test, on the actual domains with the most problems:

rndc flush
dig +short sportmanitoba.ca
dig +short gymcan.org
dig +short brandoneagles.ca
dig +short interactivegym.org
dig +short artscouncil.mb.ca

rndc flush
dig +trace sportmanitoba.ca
dig +trace gymcan.org
dig +trace brandoneagles.ca
dig +trace interactivegym.org
dig +trace artscouncil.mb.ca

Maybe some others can run those tests on their boxes (but only if you're
running BIND as caching resolver, which many/most people won't be).

Here's where it gets interesting: the +short tests I can get to fail at
least 1 of the domains (at random) about 1/8 of the time! On at least 5
different boxes out there! But +trace has never failed once on any box or
any domain. It's like +trace does something different, maybe slowing the
process down or something, that allows it to always succeed. (Failure for
the +short is a missing line in output, +trace you have to look at the
domain/ip returned near the bottom.)
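
To put a number on the "about 1/8 of the time" figure, the +short test can be repeated in a loop. This is only a sketch: the function name count_failures is made up, it needs rndc access to the local named, and it treats an empty +short answer as a failure, as described above.

```shell
# Repeat the flush-and-lookup test and count empty answers per domain.
# count_failures is a hypothetical helper: first argument is the number of
# rounds, the remaining arguments are the domains to look up.
count_failures() {
    rounds=$1
    shift
    i=0
    while [ "$i" -lt "$rounds" ]; do
        rndc flush >/dev/null
        for d in "$@"; do
            # A failed +short lookup prints no answer line; ignore ";;" messages.
            if [ -z "$(dig +short "$d" | grep -v '^;')" ]; then
                echo "$d"
            fi
        done
        i=$((i + 1))
    done | sort | uniq -c
}
```

For example, "count_failures 20 sportmanitoba.ca gymcan.org brandoneagles.ca interactivegym.org artscouncil.mb.ca" prints each domain that failed at least once, prefixed by how many of the 20 rounds it failed in.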

AFAIK +trace doesn't use your caching resolver except to find out which nameservers/IPs to query for the root (.), so it is definitely doing something different.

So, is it just these particular domains?? Something wrong on their (DNS)
side? Or is it more domains, not just these? Is there any way to
diagnose what exactly is failing? I find it bizarre that *all* of these
domains regularly go down for 4+ hours causing an email bounce!?! Or is
there something horribly wrong on my BIND caching DNS servers?

4 of 5 of these domains are on godaddy, the other has DNS handled by Westman.

Perhaps they are slow, and my BINDs are just not waiting long enough.
Is there a way to tell BIND to be more patient waiting for DNS packets to
come in?

Normally DNS has a multi-second timeout. I'm not sure of the technical details of how BIND handles SERVFAILs, but BIND does note which servers respond more quickly and favours those.

Maybe it's something regarding IPv6? (I'm doing this all in IPv4 and have
no current interest in 6. And I'm only looking for A records.)

If you don't have v6 addresses on things, they generally won't be asking for quad-A (AAAA) records, so I don't think this should be an issue for you.

I have packet traces of the above sample commands when the lookups fail,
but I can't really figure out what it's doing, other than one boatload of
traffic for a tiny dns query. I can provide a trace privately on demand
if you think you can help.

Even more odd, I never seem to have a problem with interactive things like
web browsing. If this was a problem with all domains, I should see this
in firefox all the time, but I don't. Maybe Firefox doesn't even obey
resolv.conf and does its own thing, or retries heavily itself?

I also checked to ensure my iptables aren't dropping packets related to
this.

Lastly, answers of "just use 8.8.8.8" aren't helpful because I also need
to handle dynamic local, and in some cases, external DNS (often with
multiple views), all in the same BIND/box (and I like uniformity across
boxes for ease of admin). Sure, I could try another resolver, but I see
no reason BIND can't be made to work, as it has for me for 20 years. And
if this is a BIND bug, I want to submit it to help solve it.

Thanks guys!
_______________________________________________
Roundtable mailing list
Roundtable@muug.mb.ca
http://www.muug.mb.ca/mailman/listinfo/roundtable