Last modified: 2014-09-06 19:58:54 UTC
DNS look-ups from Labs instances for the IP addresses of Labs instances fail every once in a while (all times UTC):
| tools-webproxy.eqiad.wmflabs : Aug 24 20:09:33 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 25 18:23:00 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 25 23:28:06 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| Date: Sat, 23 Aug 2014 20:27:11 +0000 (3 days, 5 hours ago)
| error: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": can't resolve host name
| Date: Tue, 26 Aug 2014 06:39:11 +0000 (19 hours, 30 minutes ago)
| error: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": can't resolve host name
This happens several times each day during the beta-scap-eqiad Jenkins job [0] as well:
ssh: Could not resolve hostname deployment-mediawiki02.eqiad.wmflabs: Name or service not known
The failing host name varies from run to run, and the job generally succeeds if it is re-run immediately upon notification of failure.
[0]: https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/
Today the frequency of those occurrences has increased quite a bit at Tools:
| tools-webproxy.eqiad.wmflabs : Aug 27 00:03:35 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 27 01:25:37 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 27 02:25:43 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 27 03:40:40 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 27 06:24:43 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 27 11:39:49 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 27 14:25:02 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| Date: Wed, 27 Aug 2014 07:05:11 +0000 (8 hours, 15 minutes ago)
| error: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": can't resolve host name
I don't recall hosts other than tools-master and tools-webproxy having been affected in a major way in the past. With tools-webproxy probably doing a lot of lookups for the log files, my assumption is that the DNS server or OpenStack's or Ubuntu's network layer has some per-client-host throttling in place. It may be worth testing whether installing a caching dnsmasq locally solves the problem. In that case, non-cached queries need to be forwarded to the Labs DNS server so that the special rewrites in openstack::network-service are honoured.
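To put a number on "increased quite a bit", the quoted mail lines can be tallied per host and day. The following is a minimal Python sketch, not something that was run as part of this report; the function name `failures_per_day` and the regular expression are illustrative assumptions about the line format seen in the quotes, and `SAMPLE` contains only lines copied from this report.

```python
import re
from collections import Counter

# Assumed line format, taken from the quoted mails in this report:
# "| <host> : <Mon DD> <HH:MM:SS> : <user> : unable to resolve host <host>"
FAILURE = re.compile(
    r"^\|?\s*(?P<host>\S+) : (?P<day>[A-Z][a-z]{2} +\d{1,2}) \d{2}:\d{2}:\d{2} : "
    r"\S+ : unable to resolve host"
)


def failures_per_day(lines):
    """Count 'unable to resolve host' reports per (host, day)."""
    counts = Counter()
    for line in lines:
        match = FAILURE.match(line.strip())
        if match:
            counts[(match.group("host"), match.group("day"))] += 1
    return counts


# Lines copied from the report; the last one is a qmaster error and is
# deliberately not matched by the pattern above.
SAMPLE = """\
| tools-webproxy.eqiad.wmflabs : Aug 24 20:09:33 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 25 18:23:00 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 25 23:28:06 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 27 00:03:35 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 27 01:25:37 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 27 02:25:43 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 27 03:40:40 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 27 06:24:43 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 27 11:39:49 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| tools-webproxy.eqiad.wmflabs : Aug 27 14:25:02 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
| error: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": can't resolve host name
""".splitlines()

if __name__ == "__main__":
    for (host, day), count in sorted(failures_per_day(SAMPLE).items()):
        print(f"{host} {day}: {count}")
```

For the diamond lines quoted here this gives one failure on Aug 24, two on Aug 25 and seven on Aug 27 for tools-webproxy.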
dnsmasq is a [bleep]ing piece of unreliable [bleep] that crumbles under the lightest load. I've been meaning to have a real DNS server in labs for a while now, and this increase in failures just bumped that up in priority.
I ran "iptables -A OUTPUT -d 10.68.16.1 -p udp -m udp --dport 53" on all Tools instances to get an idea of the order of magnitude between tools-master, tools-webproxy and the rest. To display:
| scfc@tools-login:~$ sudo iptables -nvxL
| Chain INPUT (policy ACCEPT 20004 packets, 10193318 bytes)
|     pkts      bytes target     prot opt in     out     source               destination
| Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
|     pkts      bytes target     prot opt in     out     source               destination
| Chain OUTPUT (policy ACCEPT 19480 packets, 4383218 bytes)
|     pkts      bytes target     prot opt in     out     source               destination
|      139     9403            udp  --  *      *       0.0.0.0/0            10.68.16.1           udp dpt:53
| scfc@tools-login:~$
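To compare the counters across instances, the "iptables -nvxL" output can be reduced to one pair of numbers per host. This is a minimal Python sketch under the assumption that the counting rule has no target, as in the command above; the function name `dns_counters` is hypothetical, and the sample text is the tools-login output quoted in this comment with the prompt lines left out.

```python
def dns_counters(iptables_output, resolver="10.68.16.1"):
    """Sum (packets, bytes) over rules that count UDP port 53 traffic to resolver.

    Expects the text printed by "iptables -nvxL"; -x makes the counters exact
    integers, so rule lines start with two purely numeric fields.
    """
    packets = 0
    total_bytes = 0
    for line in iptables_output.splitlines():
        fields = line.lstrip("| ").split()
        if len(fields) < 2 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue  # chain headers and column headers
        if resolver in fields and "dpt:53" in fields:
            packets += int(fields[0])
            total_bytes += int(fields[1])
    return packets, total_bytes


# Output quoted above for tools-login.
SAMPLE_OUTPUT = """\
Chain INPUT (policy ACCEPT 20004 packets, 10193318 bytes)
    pkts      bytes target     prot opt in     out     source               destination
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination
Chain OUTPUT (policy ACCEPT 19480 packets, 4383218 bytes)
    pkts      bytes target     prot opt in     out     source               destination
     139     9403            udp  --  *      *       0.0.0.0/0            10.68.16.1           udp dpt:53
"""

if __name__ == "__main__":
    pkts, size = dns_counters(SAMPLE_OUTPUT)
    print(f"DNS queries to 10.68.16.1: {pkts} packets, {size} bytes")
```

Running the same function on the output from tools-master, tools-webproxy and an ordinary exec node would give the order-of-magnitude comparison directly.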
Replacing dnsmasq is... more complicated than reasonable because of the way it's being invoked and managed by OpenStack. As a first pass, I'm going to enable local caching of name resolution; I expect this will lighten the load on dnsmasq by an order of magnitude or more and will make resolution more robust even if dnsmasq does falter.
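Why a local cache cuts the upstream load can be illustrated with a toy model. The Python sketch below is not how nscd works internally and is not the configuration from the change that follows; the class `CachingResolver`, its 60-second TTL and its fallback to a stale answer when the upstream lookup fails are all assumptions made for illustration.

```python
import socket
import time


class CachingResolver:
    """Toy TTL cache in front of a name lookup function (illustration only)."""

    def __init__(self, lookup=socket.gethostbyname, ttl=60.0, clock=time.monotonic):
        self._lookup = lookup
        self._ttl = ttl
        self._clock = clock
        self._cache = {}  # hostname -> (address, expiry time)
        self.upstream_queries = 0

    def resolve(self, hostname):
        now = self._clock()
        cached = self._cache.get(hostname)
        if cached is not None and cached[1] > now:
            return cached[0]  # fresh entry: no upstream query at all
        self.upstream_queries += 1
        try:
            address = self._lookup(hostname)
        except OSError:
            # Upstream faltered (socket.gaierror is an OSError). This sketch
            # falls back to an expired answer if one exists.
            if cached is not None:
                return cached[0]
            raise
        self._cache[hostname] = (address, now + self._ttl)
        return address


if __name__ == "__main__":
    resolver = CachingResolver()
    for _ in range(1000):
        try:
            resolver.resolve("tools-master.eqiad.wmflabs")
        except OSError:
            pass  # the name only resolves inside Labs
    print("upstream queries:", resolver.upstream_queries)
```

A host that repeats the same lookup many times within the TTL sends a single query upstream in this model, which is the effect hoped for on tools-webproxy.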
Change 157816 had a related patch set uploaded by coren: Labs: provide saner nscd defaults https://gerrit.wikimedia.org/r/157816
Change 157816 merged by coren: Labs: provide saner nscd defaults https://gerrit.wikimedia.org/r/157816
I reset the counters at 16:15Z because data from before and after the merge is hard to compare :-). I suggest that we revisit this after Thursday (2014-09-04); if the error mails have stopped by then (or significantly decreased), I would consider this issue fixed.
I haven't seen any errors since Tuesday morning, so the change of the nscd configuration seems to have fixed the issue.
Just recurred
(In reply to jeremyb from comment #10)
> Just recurred
To be precise:
| [...]
| Date: Sat, 06 Sep 2014 19:02:02 +0000 (50 minutes, 59 seconds ago)
| error: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": can't resolve host name
| [...]
If that remains the only occurrence, I would still consider this fixed.