A bit of an unofficial post-mortem on the #Optus #outage yesterday. I have no insider knowledge. All I can do is look at what Optus's networking gear told the rest of the world through #BGP, and make some informed guesses based on that.
The problem yesterday started at about 4am, when Optus told the world 'I no longer have any internet connectivity', and 'Do not send any internet traffic to me, at all'. The technical description is that they withdrew ALL of their routes from the #DFZ (which is "The Internet", as seen by all the core routers that ACTUALLY control the internet).
However, as a precursor at about 3am there was a hint that things weren't perfect, as there was a flurry of changes from Optus to the outside world saying, roughly, 'Something has changed inside my network, but you can still keep sending me stuff'.
Now, as two final bits of possibly relevant information, the default for maximum-prefix on #Cisco #ASR9000 is 1048576 (this number is 'the number of routes that can be accepted by this router'), and MOST IMPORTANTLY the DFZ ("the internet") has about 980,000 routes in it at the moment. That's only 68k odd routes LESS than the default maximum.
I'd be amazed if Optus has fewer than 100k internal routes that aren't visible to the internet, but are visible internally.
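To make the arithmetic concrete, here is a rough Python sketch using only the figures quoted above. The function names are made up for illustration, and the 100k internal-route count is my guess, not a measured value.

```python
# Figures quoted in this post. The internal route counts passed in below
# are assumptions for illustration, not anything Optus has published.
MAX_PREFIX_DEFAULT = 1_048_576  # default maximum-prefix cited above (2**20)
DFZ_ROUTES = 980_000            # approximate DFZ size cited above


def headroom(limit: int, external_routes: int) -> int:
    """Routes left before the limit, counting only the external table."""
    return limit - external_routes


def exceeds_limit(limit: int, external_routes: int, internal_routes: int) -> bool:
    """True if external plus internal routes would go over the limit."""
    return external_routes + internal_routes > limit


print(headroom(MAX_PREFIX_DEFAULT, DFZ_ROUTES))                # 68576
print(exceeds_limit(MAX_PREFIX_DEFAULT, DFZ_ROUTES, 100_000))  # True
```

In other words, anything more than 68,576 extra routes on top of the full DFZ table would tip a router sitting on that default over the edge.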
So here's what I think happened. At 3am, the first core #router was upgraded, and a new config was put in place. It did not join the network correctly, and things were half broken. What SHOULD have happened is that all the changes stopped there, and were either rolled back or put on hold for further investigation (my guess at the cause being that more than 1mil routes were visible to the router, causing it to shut down).
However, at 4am someone decided 'Well, maybe if we upgrade the SECOND one, that'll fix the first one'. That broke the SECOND one too, and took Optus completely off the internet.
(Continued, see next for why this is far worse than it should have been)