Every box I am testing on has a similar config, with BIND named 9.10 (9.8
on one box) running as the local recursive resolver. /etc/resolv.conf on
all of them points to 127.0.0.1, so every lookup that isn't cached is
resolved by named itself, starting from the root name servers.
When it fails to resolve, named.log logs an entry like:
20-Apr-2016 00:37:28.276 query-errors: debug 1: client 127.0.0.1#33971 (artscouncil.mb.ca): query failed (SERVFAIL) for artscouncil.mb.ca/IN/A at query.c:7769
20-Apr-2016 00:37:28.276 query-errors: debug 2: fetch completed at resolver.c:3658 for artscouncil.mb.ca/A in 0.215778: SERVFAIL/success [domain:artscouncil.mb.ca,referral:1,restart:3,qrysent:2,timeout:0,lame:0,neterr:2,badresp:0,adberr:0,findfail:0,valfail:0]
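
The bracketed counters at the end of the "fetch completed" entry are the
quickest way to see what kind of failure named recorded: in the sample
above timeout is 0 while neterr is 2, which looks like a network-level
error rather than a slow server, though that reading is an assumption.
Here is a rough sketch for pulling the non-zero counters out of such
entries. The function names are made up for illustration, and it assumes
each log entry sits on a single line, as it does in named.log itself.

```shell
#!/bin/sh
# Sketch only: fetch_counters and nonzero_counters are hypothetical helper
# names, not part of BIND.

# Read "fetch completed" log lines on stdin and print the bracketed
# counters, one key:value pair per line.
fetch_counters() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' | tr ',' '\n'
}

# Keep only the counters whose value is greater than zero. The "domain"
# field is skipped explicitly because its value is a name, not a count.
nonzero_counters() {
    fetch_counters | awk -F: '$1 != "domain" && $2 + 0 > 0'
}
```

Something like `grep 'fetch completed' named.log | nonzero_counters`
then shows which counters were actually hit across all the failures.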
A manual test logs this only once; for the real-life sendmail failures,
I see dozens or hundreds of the same entry, repeating hour after hour
(usually around the sendmail queue retry times).
One of my boxes is *extremely* well connected in the US, and while it
seems to have errors slightly less often, it still has them. All the rest
are on various levels of Shaw or MTS, residential or business.
This seems to have just started popping up maybe 6 months ago. It feels
like it's getting worse.
I've set up an easy test, using the actual domains with the most problems:

rndc flush
dig +short sportmanitoba.ca
dig +short gymcan.org
dig +short brandoneagles.ca
dig +short interactivegym.org
dig +short artscouncil.mb.ca

rndc flush
dig +trace sportmanitoba.ca
dig +trace gymcan.org
dig +trace brandoneagles.ca
dig +trace interactivegym.org
dig +trace artscouncil.mb.ca

Maybe some others can run those tests on their boxes (but only if you're
running BIND as caching resolver, which many/most people won't be).
Here's where it gets interesting: with the +short tests I can get at
least one of the domains (at random) to fail about 1 run in 8, on at
least 5 different boxes out there! But +trace has never failed once on
any box or any domain. It's like +trace does something different, maybe
slowing the process down or something, that allows it to always succeed.
(A failure with +short shows up as a missing line in the output; with
+trace you have to look at the domain/IP returned near the bottom.)
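
Since a +short failure is just a missing line, the test is easy to
script so the failure rate can be counted instead of eyeballed. This is
only a sketch: check_domains and run_trials are names invented here, it
treats any empty dig +short output as a failure (which would also catch
a domain that simply has no A record), and it needs to run as a user
that is allowed to do rndc flush.

```shell
#!/bin/sh
# Sketch only: check_domains and run_trials are hypothetical helper names.

# Print "FAIL <domain>" when dig +short returns nothing for a domain,
# and "ok <domain>" otherwise.
check_domains() {
    for d in "$@"; do
        if [ -z "$(dig +short "$d")" ]; then
            echo "FAIL $d"
        else
            echo "ok $d"
        fi
    done
}

# Usage: run_trials COUNT domain...
# Flush the cache before each run so every lookup is resolved from
# scratch, then report how many runs had at least one failed domain.
run_trials() {
    n=$1
    shift
    bad=0
    i=0
    while [ "$i" -lt "$n" ]; do
        rndc flush
        if check_domains "$@" | grep -q '^FAIL'; then
            bad=$((bad + 1))
        fi
        i=$((i + 1))
    done
    echo "$bad of $n runs had at least one failure"
}
```

For example, `run_trials 40 sportmanitoba.ca gymcan.org brandoneagles.ca
interactivegym.org artscouncil.mb.ca` repeats the +short test 40 times
and prints a single summary line at the end.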