Optus CEO Kelly Bayer Rosmarin and head of networks Lambo Kanagaratnam faced a Senate hearing on 17 November into the telco's nationwide outage on 8 November.
“We do everything we can to prevent these types of outages. We have multiple layers of redundancy, both geographical redundancy, physical redundancy, power redundancy; it is highly unusual for all of the different networks which are segregated to be down at the same time,” Rosmarin told the Senate inquiry.
Kanagaratnam said Optus ran a network outage exercise in October, but it did not cover a full outage of the network.
“We didn't have a plan in place for that specific scale of outage. It was unexpected,” he said. “We have high levels of redundancy and it's not something that we expect to happen.”
Rosmarin said Optus' plans cater for a wide range of possible outages across different networks and geographies, and are tailored to whether the fault sits at the core or the periphery.
“So, I believe our plans are structured in a way that really empowers the right group of people to take the steps necessary to restore communication to customers and escalate through the chain of command as required,” she said.
The Optus CEO detailed the root cause of the issue: 90 Cisco PE (provider edge) routers each hit a failsafe mechanism and independently shut down following an upgrade on the international peering network at one of the Singtel internet exchanges (STiX) in North America.
“Our network has to be designed to cope with diverting from where the upgrade is to an alternative link,” she explained. “What was coming through that link needed to be diverted to another link which happened to be configured differently and then propagated through our network in a way that triggered these failsafes in each of the different routers.”
In its submission to the Senate, Optus detailed that during the upgrade, the network received changes in routing information from an alternate Singtel peering router.
These routing changes were propagated through multiple layers of its IP core network. As a result, at around 4:05am, the pre-set safety limits on a significant number of Optus network routers were exceeded. Although the software upgrade resulted in the change in routing information, it was not the cause of the incident, Optus said.
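To make the mechanism concrete, the following is a minimal sketch of how a pre-set limit on accepted routes can take down many routers at once when the same oversized update reaches all of them. It is an illustration only: Optus' submission as quoted here does not specify the limit or how it is enforced, so the `Router` class, the `route_limit` value and the prefixes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical router with a pre-set limit on accepted routes."""
    name: str
    route_limit: int  # assumed safety threshold, not an Optus figure
    routes: set = field(default_factory=set)
    online: bool = True

    def receive(self, prefixes):
        """Accept route updates; shut down if the limit is exceeded."""
        if not self.online:
            return
        self.routes.update(prefixes)
        if len(self.routes) > self.route_limit:
            # Failsafe: take the router offline rather than run overloaded.
            self.online = False


def propagate(routers, prefixes):
    """Deliver the same update to every router and list those now offline."""
    for router in routers:
        router.receive(prefixes)
    return [r.name for r in routers if not r.online]


# Three routers with identical limits (illustrative values only).
routers = [Router(f"pe{i}", route_limit=100) for i in range(3)]
normal_update = {f"10.0.{i}.0/24" for i in range(50)}
oversized_update = {f"172.16.{i}.0/24" for i in range(200)}

tripped_after_normal = propagate(routers, normal_update)
tripped_after_oversized = propagate(routers, oversized_update)
print(tripped_after_normal)     # []
print(tripped_after_oversized)  # ['pe0', 'pe1', 'pe2']
```

Because every router in the sketch applies the same limit to the same update, each one trips independently, which is why redundancy across otherwise segregated devices does not help in this scenario.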
During the outage, 228 emergency calls to Triple Zero failed to connect.
Rosmarin also deflected questions about the status of her tenure with the telco.