[buug] semi-random example of DNS, etc. troubleshooting (www.meetup.com., etc.)

Michael Paoli Michael.Paoli at cal.berkeley.edu
Fri Feb 28 07:00:26 PST 2014


So, random example of some DNS (etc.) troubleshooting, and what *not* to do
with your production DNS (at least if you can avoid it! - not always
possible in some circumstances).

So, yesterday, 2014-02-27, a bit after noon, I notice some issues
accessing www.meetup.com. and secure.meetup.com.  I figure it might be
an issue with the proxy/firewall stuff at work, so I figure I'll check
again later from somewhere I don't have anything potentially mucking
about with my general access to The Internet.  So, many hours later, I
check from home ... still not working, so I dig a little deeper.

What am I finding in DNS (and likely the same for many other DNS servers
doing the appropriate expected caching)?

$ dig -t A www.meetup.com. +noall +answer | grep '^[    ]*[^    ;]'
www.meetup.com.         208     IN      A       38.123.132.30
$

and test connectivity with that:
$ nc -z 38.123.132.30 80 || echo FAILED
FAILED
$
It doesn't work.  Also use of traceroute with options and option
arguments -n -T -p 80 didn't show anything particularly interesting
(never got much of a response beyond about 1st hop or so, and never
connected).

So, I do a trace on DNS, to see if results are rather consistent - and
alas, quite different:
$ dig -t A www.meetup.com. +trace | grep '^[    ]*[^    ;]' | \
  fgrep www.meetup.com.
www.meetup.com.         300     IN      A       190.93.246.143
www.meetup.com.         300     IN      A       141.101.114.144
www.meetup.com.         300     IN      A       190.93.244.143
www.meetup.com.         300     IN      A       190.93.247.143
www.meetup.com.         300     IN      A       190.93.245.143
$

But do those IP addresses work for TCP connections on port 80?
$ (for a in 190.93.246.143 141.101.114.144 190.93.244.143 \
  190.93.247.143 190.93.245.143; do nc -z "$a" 80 || echo FAILED; done)
$
Yes, they all connected, no errors on any of them.
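
The loop above only says that something failed, not which address.  A
minimal sketch of a variant that names each address as it goes might
look like the following.  The function name check_tcp and the NC
override variable are hypothetical (not from the session above), and
the -w 5 timeout assumes an nc that accepts -w together with -z:

```shell
# Probe each address on one TCP port and report per address.
# NC is a hypothetical override hook (defaults to nc), so the loop
# itself can be exercised without touching the network.
check_tcp() {
    port=$1
    shift
    rc=0
    for addr in "$@"; do
        if ${NC:-nc} -z -w 5 "$addr" "$port" >/dev/null 2>&1; then
            echo "ok $addr:$port"
        else
            echo "FAILED $addr:$port"
            rc=1
        fi
    done
    # Non-zero if any single probe failed.
    return "$rc"
}
```

Used as e.g. check_tcp 80 190.93.246.143 141.101.114.144, it returns
non-zero if any of the listed addresses failed to connect.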

So, why/where are we getting different DNS data from cached versus
uncached lookups?
Our cached NS records:
$ dig -t NS meetup.com. +noall +answer | grep '^[       ]*[^    ;]'
meetup.com.             5793    IN      NS      ns4.p06.dynect.net.
meetup.com.             5793    IN      NS      ns3.p06.dynect.net.
meetup.com.             5793    IN      NS      ns1.p06.dynect.net.
meetup.com.             5793    IN      NS      ns2.p06.dynect.net.
$

What about uncached?
$ dig -t NS meetup.com. +trace +noall +answer | \
  grep '^[        ]*[^    ;]' | fgrep meetup.com
meetup.com.             86400   IN      NS      tom.ns.cloudflare.com.
meetup.com.             86400   IN      NS      lisa.ns.cloudflare.com.
$
A very different set of results.
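
Comparing the two NS sets by eye works fine for a handful of records.
A rough sketch of doing the same comparison mechanically, on saved dig
output, could be as follows.  The function names ns_targets and
ns_sets_match are made up for illustration, and the sketch assumes the
usual five-column dig record layout (name, TTL, class, type, data):

```shell
# Print the sorted, lower-cased NS targets for one zone, taken from
# dig record lines on stdin.  Lines for other owner names (e.g. the
# com. or root NS sets that +trace also prints) are ignored.
ns_targets() {
    awk -v z="$1" 'tolower($1) == z && $4 == "NS" { print tolower($5) }' |
        sort -u
}

# Succeed only if two chunks of dig output carry the same NS set.
# $1 = zone (lower case, with trailing dot), $2 and $3 = dig output text.
ns_sets_match() {
    [ "$(printf '%s\n' "$2" | ns_targets "$1")" = \
      "$(printf '%s\n' "$3" | ns_targets "$1")" ]
}
```

With the cached answer saved in one variable and the +trace output in
another, ns_sets_match meetup.com. "$cached" "$traced" || echo DIFFERENT
would flag the mismatch seen here.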

What are the com. NS records? (these also change quite infrequently)
$ dig -t NS com. +noall +answer | grep '^[      ]*[^    ;]'
com.                    159466  IN      NS      b.gtld-servers.net.
com.                    159466  IN      NS      j.gtld-servers.net.
com.                    159466  IN      NS      i.gtld-servers.net.
com.                    159466  IN      NS      l.gtld-servers.net.
com.                    159466  IN      NS      a.gtld-servers.net.
com.                    159466  IN      NS      d.gtld-servers.net.
com.                    159466  IN      NS      h.gtld-servers.net.
com.                    159466  IN      NS      m.gtld-servers.net.
com.                    159466  IN      NS      g.gtld-servers.net.
com.                    159466  IN      NS      k.gtld-servers.net.
com.                    159466  IN      NS      c.gtld-servers.net.
com.                    159466  IN      NS      f.gtld-servers.net.
com.                    159466  IN      NS      e.gtld-servers.net.
$

Let's see what one of those has to say about NS for meetup.com.:
$ dig @e.gtld-servers.net. -t NS meetup.com. +noall +answer +authority \
  +comment | sed -ne '/^;; A/{;p;d;};/^[^;]/p'
;; AUTHORITY SECTION:
meetup.com.             172800  IN      NS      tom.ns.cloudflare.com.
meetup.com.             172800  IN      NS      lisa.ns.cloudflare.com.
$
Most particularly, note the TTL of 172800 seconds (== 48 hours).
TTLs of authority records for NS of subdomains directly under com. also
very rarely change - so those have probably been TTLs of 48 hours for
a very long time.  So, almost certainly the change has occurred within
the past 48 hours.
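
For the seconds-to-hours arithmetic on TTLs, a tiny helper saves doing
it in one's head.  This is just a sketch using shell integer arithmetic;
the name ttl_human is made up for illustration:

```shell
# Convert a TTL in seconds to hours, minutes and seconds.
ttl_human() {
    ttl=$1
    printf '%dh %dm %ds\n' \
        "$((ttl / 3600))" "$((ttl % 3600 / 60))" "$((ttl % 60))"
}

ttl_human 172800    # the authority TTL from e.gtld-servers.net.
ttl_human 5793      # the remaining TTL on the cached dynect NS records
```

That gives 48h 0m 0s for the authority TTL, and 1h 36m 33s for what was
left on the cached NS records at that point.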

So, it would appear meetup.com. quite recently changed their NS servers;
however, the older NS servers give A records for www.meetup.com. pointing
to IPs that no longer work - and these haven't yet expired from cache.
So that means meetup.com has broken their service for many users,
possibly for as long as 48 hours.

Do we see other corroborating evidence of such a recent change?
$ 2>&1 whois -H meetup.com | fgrep -i -e 'Name Server' -e 'Updated Date:'
    Name Server: LISA.NS.CLOUDFLARE.COM
    Name Server: TOM.NS.CLOUDFLARE.COM
    Updated Date: 27-feb-2014
Updated Date: 2014-02-27T14:12:40-0800
Name Server: lisa.ns.cloudflare.com
Name Server: tom.ns.cloudflare.com
$
Yes - looks like they updated it 2014-02-27

If I take no manual explicit action, when should it be "all better"?
$ dig -t NS meetup.com. +noall +answer | grep '^[       ]*[^    ;]'
meetup.com.             4179    IN      NS      ns3.p06.dynect.net.
meetup.com.             4179    IN      NS      ns1.p06.dynect.net.
meetup.com.             4179    IN      NS      ns4.p06.dynect.net.
meetup.com.             4179    IN      NS      ns2.p06.dynect.net.
$
In another 4179 seconds, when the NS records that are apparently no
longer usefully functional for [www.]meetup.com. expire.  At any point
after that, queries for data in/under meetup.com. will have to go back up
to the com. NS servers, pick up the updated authority records, follow
those to get the updated NS records, and then all will be fine again.
Doing a wee bit o' searching, it appears I'm not the only one having run
into meetup.com's booboo today, e.g. searching Twitter.com.:
Anna Brown
@mediagirl
Is there a DNS outage today? http://meetup.com and
http://statcounter.com are both down.
12:45 PM - 27 Feb 2014

Now, ... not sure exactly what issues meetup.com was dealing with, but
if they could've switched their NS servers more than 48 hours in
advance of the old IPs and old service no longer working, they would
have avoided all disruptions of service.  But in not having managed to
do that (perhaps there was unexpected failure of the old IPs/service?),
they caused at least some disruptions.  And it's not within their power
(or anyone else's) to flush the older cached DNS data out of other
people's caches ahead of the TTL values it was already cached with.
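
How long a given caching resolver will keep serving the old data can be
read off the TTL column of its cached answer.  A small sketch that picks
out the smallest remaining TTL from dig record lines follows; the name
min_ttl is hypothetical, and it assumes the TTL is the second column, as
in the dig output shown above:

```shell
# Print the smallest TTL among the dig record lines on stdin.
# Comment lines (starting with ';') and lines whose second field is
# not a plain number are skipped.
min_ttl() {
    awk '$1 !~ /^;/ && $2 ~ /^[0-9]+$/ {
             if (m == "" || $2 + 0 < m) m = $2 + 0
         }
         END { if (m != "") print m }'
}

# Intended use (needs network access, so left as a comment here):
#   dig -t NS meetup.com. +noall +answer | min_ttl
```

The number printed is how many more seconds that particular cache will
hold the record set before it has to ask again.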

And, something over 4179 seconds later, we now have:
$ dig -t NS meetup.com. +noall +answer | grep '^[       ]*[^    ;]'
meetup.com.             86400   IN      NS      lisa.ns.cloudflare.com.
meetup.com.             86400   IN      NS      tom.ns.cloudflare.com.
$
Sweet, ... let's see if that all looks good now ...
$ dig -t A www.meetup.com. +noall +answer | grep '^[    ]*[^    ;]'
www.meetup.com.         300     IN      A       190.93.244.143
www.meetup.com.         300     IN      A       190.93.245.143
www.meetup.com.         300     IN      A       190.93.246.143
www.meetup.com.         300     IN      A       190.93.247.143
www.meetup.com.         300     IN      A       141.101.114.144
$ (for a in 190.93.244.143 190.93.245.143 190.93.246.143 \
  190.93.247.143; do nc -z "$a" 80 || echo FAILED; done)
$ dig -t A secure.meetup.com. +noall +answer | grep '^[         ]*[^    ;]'
secure.meetup.com.      300     IN      A       190.93.246.143
secure.meetup.com.      300     IN      A       190.93.247.143
secure.meetup.com.      300     IN      A       141.101.114.144
secure.meetup.com.      300     IN      A       190.93.244.143
secure.meetup.com.      300     IN      A       190.93.245.143
$ (for a in 190.93.246.143 190.93.247.143 141.101.114.144 \
  190.93.244.143; do nc -z "$a" 443 || echo FAILED; done)
$
All that looks good, ... and the acid test - try using site with browser
from same client host ...
Well, no longer a DNS issue anyway, ... but they still are having issues:
Website is offline No cached version of this page is available.
Error 522 Ray ID: 103d8d1171de0295
Connection timed out
You (Browser): Working
San Jose (CloudFlare): Working
www.meetup.com (Host): Error
What happened?
The initial connection between CloudFlare's network and the origin web  
server timed out. As a result, the web page can not be displayed.
What can I do?
If you're a visitor of this website:
Please try again in a few minutes.
If you're the owner of this website:
Contact your hosting provider letting them know your web server is not  
completing requests. An Error 522 means that the request was able to  
connect to your web server, but that the request didn't finish. The  
most likely cause is that something on your server is hogging  
resources. Additional troubleshooting information here.

Following the link at the end:
https://support.cloudflare.com/hc/en-us/articles/200171906-Error-522
Gives some general details about the 522 error that they're reporting,
but without any specifics about exactly why (e.g. DNS name used or
attempted, IP address(es) used or attempted, port, response or timeout
or whatever) ... so, something for meetup.com to figure out between
themselves and their service provider (apparently cloudflare.com).

doing some searches ... Google ... Google News ... Twitter ...:
Jonathan Carter @jonathanrcarter  4h
@Register_com Also we are slowly coming back up with our DNS requests.
http://meetup.com  had a ddos yesterday - could be linked
1:02 AM - 28 Feb 2014

Hmmmm, possibly related?  But seems likely unrelated - at least to the
DNS issue seen earlier - though it might explain why
http://www.meetup.com/ is still effectively down (serves up a hosted
error page).  In any case, http://www.meetup.com/ and
https://secure.meetup.com/login/ show highly similar hosted page errors.

Let's see what else we can see on Twitter:
*Lots* of buzz about "meetup.com" being down:
https://twitter.com/search?q=meetup.com%20down
Also, fair amount of buzz on that about no tech news having picked it up
yet, and of meetup.com apparently being under DDoS attack (though I
can't say I've yet spotted the claims of alleged/reported DDoS attacks
being reported from a necessarily reliable and/or authoritative source
... yet).
Still spotting nothing relevant on Google News searches - tried
"meetup.com" along with relevant terms, etc.:
down OR ddos OR dns OR cloudflare
... still nothing ... yet
Internet Storm Center https://isc.sans.edu/
Nothing especially noteworthy at present
Cloudflare itself?  Presently shows all good for today and yesterday,
http://www.cloudflare.com/system-status
shows all good most recent 6 days including today, except a pair of
issues in two non-US locations on 2014-02-25.
CloudFlare shows some CloudFlareStatus tweets in the last couple days,
but nothing that immediately appears to be relevant to meetup.com.
Checking a bit more on searches ...
found:
https://twitter.com/intent/user?screen_name=Meetup
So, ... does appear meetup.com claims to be and likely is under DDoS
attack.  And some bits there would also explain the recent DNS changes,
notably:
Meetup @Meetup
We are implementing a solution, but can't yet publish a time estimate  
of when Meetup services will be available globally.
about 13 hours ago
and more recently:
Meetup @Meetup
Meetup is still under a DDoS attack. Our team is fighting back, but  
unfortunately, we're still seeing intermittent outages.
about 2 hours ago
Looks like this was Meetup's first tweet on the matter:
Meetup Support @meetup_support  22h
Meetup is down for the moment. Our team is working on fixing it right  
now. Sorry for the inconvenience!
7:44 AM - 27 Feb 2014

So, in brief summary, looks like meetup.com. had issues, presumably from
a DDoS attack.  They did some (disruptive) DNS changes to attempt to fix
the issue (apparently starting to use, or changing, content delivery
network (CDN) providers):
http://en.wikipedia.org/wiki/Content_delivery_network
... but that appears to not yet suffice, even after the DNS issues
corrected themselves over time.  It could take at most 48 hours to pick
up the changes, due to the TTL on the authority records for the
meetup.com. NS records.  Had they also changed (or been able to change)
some other records on the older NS servers, that might possibly have
reduced the time for the DNS issue, but not necessarily by all that much.
And until the issue between the (apparently new) CDN provider and
meetup.com is worked out, the DNS being "all better" still isn't enough
to get them effectively up and running again.

So, why still down?  Guesstimate: CDN & DDoS.  A CDN can typically
withstand DDoS very well.  However, the CDN also has to talk to the
back-end ("origin") servers for the hosted site.  If those aren't
protected from attack, and/or if the CDN is being accessed in ways such
that most or all traffic must be passed back to the origin site (e.g.
things that can't be cached, such as submitting updates to a page, like
adding a comment), then the DDoS attack impacts may be mostly passed
through to the origin site, overwhelming it.  Anyway, maybe when it's
"all better now", Meetup.com will have a nice technical write-up of it
somewhere.



