Why was youtube down?


A technical explanation of a similar event that happened in 1997
Adrian Chadd adrian at ucc.gu.uwa.edu.au
Wed Aug 16 22:53:07 WST 2006
Applications of Murphy's Law:
The "AS7007 Incident"
Adrian Chadd

It was an average day in 1997. The Internet was a fledgling by today's
standards. Internet operators (mostly!) trusted one another. SMTP servers
would be open relays; a number of open web proxies and anonymous dialout
servers were available. People were worried about running out of IP space.
Network Operators were worried about the CPU on their routers being
taxed dealing with a full routing table of ~45,000 entries.

Then, suddenly, the internet stopped working. Network Operators everywhere
sprang into action to discover the cause of the lack of traffic.
And there it was. As far as the routing protocols were concerned, the
entire internet existed in one location - some crappy Bay Networks
router in AS7007.

The problem was fixed rather quickly - the misbehaving router was pulled
from the network. But this didn't solve the problem. Routers were still
crashing all over the internet. Where were the announcements coming from?
How could one stop it? Was the Internet, kept running by gaffa tape,
IRC and sushi, finally coming to an end?

Everything settled down a few hours later. Network Operators around the
globe began discussing the impact of this outage and how it could be
prevented. The Internet did fundamentally change - but unlike a lot of
other changes, the general users knew nothing about it.

What is BGP? BGP is the protocol networks on the internet use to tell
other networks two things: that they exist, and which networks can be
reached through them. It is also how they learn to reach the other networks
on the Internet. Routers receive BGP information, decide upon the "best"
path to take to a destination network and update their routing table.

BGP uses a few metrics to determine the "best" path. The most obvious metric
is the number of networks between them and the destination network - the
"AS Path length". A shorter AS path length is generally better. This isn't
the whole story but as you'll see, it didn't matter.
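
As a rough sketch of the "shorter AS path wins" idea (and only that idea -
real BGP implementations look at several other attributes before they get
to path length), here is a toy comparison in Python. The route names and
AS numbers are made up for illustration; the AS numbers come from the
range reserved for documentation.

```python
# Hypothetical candidate routes towards one destination network.
# Each value is the list of AS numbers the announcement passed through.
candidates = {
    "path A": [64500, 64501, 64502],
    "path B": [64510, 64502],
}


def shortest_as_path(routes):
    """Return the name of the route with the fewest AS hops."""
    return min(routes, key=lambda name: len(routes[name]))


best = shortest_as_path(candidates)
print(best)
```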

The other metric is how specific the route is. A more specific route is
preferred over a more general route, regardless of AS path length or any
other metric. So if you see an announcement for 130.95.0.0/16 (ie,
130.95.0.0 -> 130.95.255.255) via path A and an announcement for
130.95.0.0/24 (ie, 130.95.0.0 -> 130.95.0.255) via Path B, traffic destined
to any host inside 130.95.0.0/24 will flow via path B regardless of how
much closer path A is.
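
The "most specific route wins" rule can be sketched with Python's standard
ipaddress module. This is a minimal illustration using the two prefixes
from the example above, not how a real router stores its table; the
lookup() helper and the "path A" / "path B" labels are assumptions made
for the sketch.

```python
import ipaddress

# The two announcements from the example; the values are just labels.
routes = {
    ipaddress.ip_network("130.95.0.0/16"): "path A",
    ipaddress.ip_network("130.95.0.0/24"): "path B",
}


def lookup(table, address):
    """Return the label of the most specific route covering address, or None."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in table if addr in net]
    if not matches:
        return None
    # The longest prefix length is the most specific match.
    return table[max(matches, key=lambda net: net.prefixlen)]


print(lookup(routes, "130.95.0.10"))  # host inside the /24
print(lookup(routes, "130.95.7.10"))  # host covered only by the /16
```

Note that nothing in lookup() cares how long the AS path behind each route
is. That is the whole point.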

So there's this router in AS7007. It learnt the entire internet routing
table via BGP. It began converting most routes into /24s - ie,
routes which covered 256 IP addresses. Somehow - and this part is fuzzy -
it then managed to "leak" this table back into BGP and reannounced almost
every available network to the entire internet. Deaggregated down to /24s.
As originating from its own AS number.

So the AS path was removed (ie, every network on the internet looked like
it originated in AS7007) and every announcement was very specific (/24).
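
To get a feel for what deaggregating into /24s does to the size of a
routing table, here is a small sketch, again using Python's ipaddress
module. The deaggregate() helper is hypothetical and only shows the
arithmetic; it says nothing about what the router in question actually
did internally.

```python
import ipaddress


def deaggregate(prefix, new_prefix=24):
    """Split a prefix into subnets of length new_prefix.

    Prefixes that are already that specific (or more) are returned as-is.
    """
    net = ipaddress.ip_network(prefix)
    if net.prefixlen >= new_prefix:
        return [net]
    return list(net.subnets(new_prefix=new_prefix))


pieces = deaggregate("130.95.0.0/16")
print(len(pieces), pieces[0], pieces[-1])
```

A single /16 turns into 256 separate announcements, each one more specific
than the original.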

So, as far as the routers on the internet were concerned, every network
everywhere could be reached by sending traffic to AS7007.

And, they did. The internet existed at the end of a 45 Mbit/s pipe, connected
to AS7007.

This was rectified quickly. The port was shut off and the announcements
ceased. But the problem didn't go away. Routers kept passing on this
massive 250,000-entry routing table and, in many cases, would then crash.
They'd reboot; re-learn all the routes from a peer, re-distribute them,
and crash again.

Not only that, but routers worked in finite time over links which transmitted
at a finite data rate with a latency bounded by the speed of light. These
announcements bounced around the internet for hours. Many internet
backbones solved the problem by turning off all their equipment, shutting
off the ports, staging reloads of their equipment, adding route announcement
filters to reject receiving the routes in the first place, and then
turning on their network connectivity.

The aftermath? Network Operators began filtering route announcements
from their peers and customers. At a coarse level - customers could
only announce networks originating from their AS numbers. At a fine-grained
level - some companies only accepted route announcements matching
certain criteria. This involved first registering your network inside
the RADB - in which you would describe your network, the networks you
announced and how you connected to other networks. Most networks did
something in between. Vendors began adding "magic" to their routers to allow
administrators to control how many announcements a peer could send before
shutting that peer off or ignoring further announcements. The usual talk
of "cryptographically signed" data popped up but nothing happened for
a long while.
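
The two defences described above - only accepting prefixes a peer has
registered, and cutting a peer off once it sends too many routes - can be
sketched as a toy model. Everything here (the PeerSession class, its
fields, the limit of 2) is made up for illustration and does not correspond
to any vendor's configuration syntax or to the RADB's data format. The
second prefix below is from a range reserved for documentation.

```python
import ipaddress


class PeerSession:
    """Toy model of one peer with a prefix allow-list and a prefix limit."""

    def __init__(self, allowed, max_prefixes):
        self.allowed = [ipaddress.ip_network(p) for p in allowed]
        self.max_prefixes = max_prefixes
        self.accepted = set()
        self.shut_down = False

    def announce(self, prefix):
        """Return True if the route is accepted, False if it is dropped."""
        if self.shut_down:
            return False
        net = ipaddress.ip_network(prefix)
        # Filter: the prefix must sit inside something the peer registered.
        if not any(net.subnet_of(a) for a in self.allowed):
            return False
        # Limit: one new route too many shuts the whole session off.
        if net not in self.accepted and len(self.accepted) >= self.max_prefixes:
            self.shut_down = True
            return False
        self.accepted.add(net)
        return True


peer = PeerSession(allowed=["130.95.0.0/16"], max_prefixes=2)
results = [
    peer.announce("130.95.0.0/16"),  # registered
    peer.announce("192.0.2.0/24"),   # not registered, filtered
    peer.announce("130.95.1.0/24"),  # registered, reaches the limit
    peer.announce("130.95.2.0/24"),  # over the limit, session shut off
    peer.announce("130.95.0.0/16"),  # ignored, session is already off
]
print(results, peer.shut_down)
```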

And the owner of AS7007 was never able to live it down.

Disclaimers:

Much hand-waving has been done about IP routing here. I could be more specific
but the article would be much, much longer. Email me if you're interested
in a further explanation.

References:

* Someone first noticing what was going on
http://www.merit.edu/mail.archives/nanog/1997-04/msg00340.html

* What happened
http://www.merit.edu/mail.archives/nanog/1997-04/msg00444.html

* "Delayed Internet Routing Convergence"
http://portal.acm.org/citation.cfm?id=347428&dl=ACM&coll=&CFID=15151515&CFTOKEN=6184618

* "Understanding BGP Misconfiguration"
http://citeseer.ist.psu.edu/mahajan02understanding.html

* "BGP Design Principles"
http://www.riverstonenet.com/support/bgp/design/index.htm
