[180947] in North American Network Operators' Group

Open letter to Level3 concerning the global routing issues on June

daemon@ATHENA.MIT.EDU (Martin Millnert)
Fri Jun 12 11:36:06 2015

X-Original-To: nanog@nanog.org
From: Martin Millnert <millnert@gmail.com>
To: NANOG <nanog@nanog.org>
Date: Fri, 12 Jun 2015 17:32:31 +0200
Errors-To: nanog-bounces@nanog.org


--=-5tZYy8C9NP+TdOKnHDL8
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Dear Level3,

The Internet is a cooperative effort, and it works well only when its
participants take constructive actions to address errors and remedy
problems.
Your position as a major Internet Carrier bestows upon you a certain
degree of responsibility for the correct operation of the Internet all
across (and beyond) the planet. You have many customers. Customers will
always occasionally make mistakes. You as a major Internet Carrier have
a responsibility to limit, not amplify, your customers' mistakes.
Other major carriers implement technical measures that severely limit
the damage from customer mistakes and keep it from having global impact.
Other major carriers also implement operational procedures in addition
to technical measures.
In combination, these measures drastically reduce the outage-hours that
result from customer configuration errors.

At 08:44 UTC on Friday 12th of June, one of your transit customers,
Telekom Malaysia (AS4788), began announcing the full Internet table back
to you, which you accepted and propagated to your peers and customers,
causing global outages for close to 3 hours.
[ https://twitter.com/DynResearch/status/609340592036970496 ]
During this 3-hour window, it appears (from your own service outage
reports) that you did nothing to stop the global Internet outage, and
that Telekom Malaysia themselves eventually resolved it. This lack of
action on your end, and your disregard for the correct operation of the
global Internet, are astonishing. These mistakes do not need to happen.
AS4788 under normal circumstances announces ~1900 IPv4 prefixes to the
Internet. You accepted several hundred thousand prefixes from them. A
max prefix setting on the session would have severely limited the
damage. We expect that such limits are part of your practices as well,
but they failed. When they fail, it should not take ~3 hours to shut
down the session(s).
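
To make the max prefix point concrete, here is a minimal Python sketch
of the idea, not a description of any router's actual implementation.
The BgpSession class, its fields, and any limit value passed to it are
hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BgpSession:
    """Toy model of one customer BGP session (not a real router API)."""
    peer_asn: int
    max_prefixes: int                      # hypothetical per-session limit
    established: bool = True
    accepted: set = field(default_factory=set)

    def receive(self, prefix: str) -> bool:
        """Accept a prefix, or tear the session down if the limit is hit."""
        if not self.established:
            return False
        is_new = prefix not in self.accepted
        if is_new and len(self.accepted) >= self.max_prefixes:
            # Limit exceeded: shut the session and drop what was learned.
            self.established = False
            self.accepted.clear()
            return False
        self.accepted.add(prefix)
        return True
```

With a limit set somewhat above the ~1900 prefixes normally announced,
a session modelled this way goes down long before several hundred
thousand prefixes have been accepted and propagated.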

Many operators, in despair, turned down their peering sessions with you
once it was clear you were causing the outages and no immediate fix was
in sight. This improved the situation for some, but not for all. Had
you deployed proper IRR filtering to reject the bad announcements, the
impact would have been far less critical.
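
As a rough illustration of what IRR filtering means here, the Python
sketch below accepts a customer announcement only if it is covered by a
route object registered for the same origin AS. The function names, the
(prefix, origin) tuple format, and the documentation prefixes and AS
numbers are assumptions made for the example. Real filters are
generated from IRR data by dedicated tooling and usually also restrict
prefix length.

```python
import ipaddress


def build_filter(route_objects):
    """Parse (prefix, origin_asn) pairs as registered in an IRR."""
    return [(ipaddress.ip_network(p), asn) for p, asn in route_objects]


def permitted(prefix, origin_asn, irr_filter):
    """True if the announcement is covered by a registered route object."""
    net = ipaddress.ip_network(prefix)
    for registered, asn in irr_filter:
        if (asn == origin_asn
                and net.version == registered.version
                and net.subnet_of(registered)):
            return True
    return False


# Hypothetical customer with a single registered route object.
customer_filter = build_filter([("192.0.2.0/24", 64500)])
leaked_ok = permitted("198.51.100.0/24", 64501, customer_filter)
```

In a leak of the full table, nearly every announcement resembles the
last call: a prefix and origin that the customer never registered, so
it would be rejected at the edge instead of being propagated.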

One local example of the direct consequences of your ~3 hours of
inaction: Swedish payment terminals experienced problems all over the
country. The Swedish economy was directly affected by your inaction.
There were queues when I was buying lunch! Imagine the food rage. The
situation was probably similar in other places around the globe where
people were awake.

Operators around the planet are curious:
  - Did Level3 not detect or understand that it was causing global
Internet outages for ~3 hours?
  - If Level3 did in fact detect or understand it was causing global
Internet outages, why did it not properly and immediately remedy the
situation?
  - What is Level3 going to do to address these questions and begin work
on restoring its credibility as a carrier?

We all understand that mistakes do happen (in applying customer
interface templates, etc.). However, the Internet is all too pervasive in
everyday life today for anything but swift action by carriers to remedy
breakage after the fact. It is absolutely not sufficient to let a
customer spend 3 hours detecting and fixing a situation like this one. It
is unacceptable that no swift action was taken on your end to limit the
global routing issues you caused.

Sincerely,
Martin Millnert
Member of Internet Community - no carrier / ISP affiliation.

--=-5tZYy8C9NP+TdOKnHDL8--

