The problem: IPv6! Argh!
After that named debug output I just posted, I poked around named.conf's logging options a bit more and saw that there are whole classes ("categories") of logs I could enable. So I turned everything on at debug level 10 and triggered the bug. Tada! Plain as day, there's my problem staring me in the face:
21-Apr-2016 01:06:05.188 lame-servers: info: network unreachable resolving 'brandoneagles.ca/A/IN': 2607:0:2:4:216:36:178:2#53
21-Apr-2016 01:06:05.188 query-errors: debug 1: client 127.0.0.1#44035 (brandoneagles.ca): query failed (SERVFAIL) for brandoneagles.ca/IN/A at query.c:7769
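For anyone wanting to reproduce this, here is a rough sketch of the kind of logging block that should yield lines in the format above. It is not the exact config I used: the channel name debug_all, the file path, and the handful of categories picked are all illustrative assumptions. It is wrapped in a small shell function so the text can be printed and merged by hand into whatever logging block named.conf already has.

```shell
#!/bin/sh
# Print an example BIND "logging" block to stdout.
# Assumptions: the channel name "debug_all" and the log file path are
# illustrative only; adjust them to the local named.conf layout.
print_debug_logging_block() {
    cat <<'EOF'
logging {
    channel debug_all {
        file "data/named-debug.log" versions 3 size 20m;
        severity debug 10;
        print-time yes;
        print-category yes;
        print-severity yes;
    };
    category default { debug_all; };
    category lame-servers { debug_all; };
    category query-errors { debug_all; };
    category resolver { debug_all; };
};
EOF
}
```

Running print_debug_logging_block shows the block; after merging it into named.conf, named-checkconf is a cheap way to catch syntax slips before restarting named.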
The lame-servers line reports "network unreachable" while resolving an A record, and the server address it tried is distinctly IPv6-looking. One minute and a quick Google search later, I had the solution: BIND must be run in "-4" mode. It's not enough to "disable" v6 by not listening on any v6 address; that only controls where BIND accepts queries, not which transport it uses to reach other nameservers. You must switch BIND into a "completely 4, never use 6" mode.
On Fedora or RHEL/CentOS/Oracle, edit /etc/sysconfig/named, add: OPTIONS="-4"
Problem completely disappears immediately upon repeated tests on many different boxes.
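Since I had to repeat this on many boxes, here is a minimal sketch of how the edit could be scripted. The function name set_named_ipv4_only is made up, and it assumes the file holds at most one OPTIONS="..." line in the usual Fedora/RHEL sysconfig style; anything fancier should be edited by hand.

```shell
#!/bin/sh
# Ensure a sysconfig-style file contains "-4" in its OPTIONS line.
# Usage: set_named_ipv4_only /etc/sysconfig/named
# Assumption: at most one OPTIONS="..." line, double-quoted.
set_named_ipv4_only() {
    conf="$1"
    [ -f "$conf" ] || touch "$conf"
    if grep -q '^OPTIONS=.*-4' "$conf"; then
        return 0    # already forced to IPv4, nothing to do
    elif grep -q '^OPTIONS="' "$conf"; then
        # Prepend -4 to the existing options, leaving a .bak copy behind.
        sed -i.bak 's/^OPTIONS="\(.*\)"/OPTIONS="-4 \1"/' "$conf"
    else
        # No OPTIONS line yet, so append one.
        printf 'OPTIONS="-4"\n' >> "$conf"
    fi
}
```

After running it against /etc/sysconfig/named as root, named still needs a restart (systemctl restart named on current Fedora) before the option takes effect.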
(Hilarious, I probably would have seen ICMPv6 errors had my tcpdump or ip6tables been set to show them! I had guessed tcpdump is v4-only by default, but as far as I know it captures v6 too unless the filter expression excludes it.)
So the reason the original symptom seemed to get worse over time is that all these various nameservers must slowly be turning on IPv6 DNS service, one by one. And the reason it doesn't happen on every query is that some sort of round-robin or random server picking must be going on, in either BIND or the upstream DNS servers, so only some of the time do I end up trying to reach a nameserver at its IPv6 address to look up an A record. Dunno, that's my guess.
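To see which nameservers are actually triggering this, the debug log can be boiled down to the distinct v6 addresses that came back "network unreachable". This is only a sketch: the function name is made up, and the pattern assumes the message format shown earlier (address, then # and the port, after the last ": ").

```shell
#!/bin/sh
# Read named log text on stdin and print each distinct IPv6 server address
# that appears in a "network unreachable resolving" message.
# IPv4 addresses are skipped because the pattern allows only hex digits
# and colons before the "#port" suffix.
unreachable_v6_servers() {
    grep 'network unreachable resolving' \
        | sed -n 's/.*: \([0-9A-Fa-f:]*:[0-9A-Fa-f:]*\)#[0-9][0-9]*.*/\1/p' \
        | sort -u
}
```

Feed it the debug log on stdin, for example unreachable_v6_servers < named-debug.log (that file name is a placeholder for wherever your debug channel writes).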
Hmm, that still gets me wondering why the "big guys" I added to my earlier tests, who almost certainly also have v6 enabled, never SERVFAILed. Perhaps there is a DNS bug on the remote side where the buggy guys are sending me AAAA records when they aren't supposed to? Looking at the other named debug output, I can see that in this case they send me this just before the final lookup:
;ns1.westmancom.com. 172800 IN A 216.36.128.2
;ns1.westmancom.com. 172800 IN AAAA 2607:0:2:1:216:36:128:2
;ns2.westmancom.com. 172800 IN A 216.36.128.3
;ns2.westmancom.com. 172800 IN AAAA 2607:0:2:1:216:36:128:3
And almost certainly my resolver is picking one of these records at random, failing roughly 50% of the time when it chooses an AAAA.
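A quick way to check how the address records for a set of nameservers split between v4 and v6 is to count them. The sketch below assumes lines shaped like the four above (name, TTL, IN, type, address, with an optional leading ";"); the function name is made up. It only counts what was handed back, it says nothing about how BIND actually chooses among them, so the 50% figure remains my guess.

```shell
#!/bin/sh
# Count A versus AAAA records in dig/named-style record text read from stdin.
# Expected line shape: "name TTL IN TYPE address"; a leading ";" on the
# name is harmless because only fields 3 and 4 are inspected.
count_address_families() {
    awk '$3 == "IN" && $4 == "A"    { a++ }
         $3 == "IN" && $4 == "AAAA" { aaaa++ }
         END { printf "A=%d AAAA=%d\n", a, aaaa }'
}
```

Paste the records in on stdin; an even split between the two counts would at least be consistent with the coin-flip theory.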
Why is v6 failing on my boxes, you ask? Because I have it as "off" as it can be in Fedora without messing up the whole OS (learned this the hard way a long time ago):
# ip6tables -L -n -v
Chain INPUT (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
  239 12170 ACCEPT     all      lo     *       ::/0                 ::/0

Chain FORWARD (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
  239 12170 ACCEPT     all      *      lo      ::/0                 ::/0
On purpose. Last I checked, my own and my customers' Shaw and MTS (some residential!) connections didn't provide IPv6 ability/routing, but then again, I haven't checked in forever. If the ISPs I work with don't support it, v6 is completely a moot point for me (no, I don't want to tunnel). So I always do my best to disable v6. I don't want to be on the bleeding edge of networking, I just want stuff to work and be as secure as possible. Translating my massive v4 iptables ruleset into v6 is for a day I hope to long postpone.
I recall a long time back I had the same problem with squid, and had to figure out how to de-v6 squid to get things to work. I guess nearly every daemon will need this at some point.
<PRO-ACTIVE RANT TO WARD OFF v6ER RANTS> I wish the world would just pick a day where we switch off IPv4 completely and are all forced to use IPv6. I really hate the current hybrid approach. I would hazard a guess 95% of the home user / small business world (where I dwell) will be using IPv4 for 95% of their traffic for another 10 years. They said 4 was dead 20 years ago and here we are, still 95/95+. Heck, forget 10 years, it may be forever!
The day I switch is the day that a) some "real life" important servers go 6-only, and b) every server in the world supports 6. On that day I will change everything to 6 and disable 4. If the day comes where (a) occurs before (b) I will *not* be a happy camper as that means we'll all be forced to hybrid.
Good luck to all the valiant early adopters of 6 trying to fight the good fight, hopefully you'll make "switchover day" a much easier battle. Until then, OPTIONS="-4"!! (And thanks IPv6 for wasting 3 more of my evenings!) </PRO-ACTIVE RANT TO WARD OFF v6ER RANTS>