Log data generated by the Hewlett Packard Enterprise (HPE) storage hardware being used by the Australian Taxation Office (ATO) revealed potential issues months before the agency’s systems were hit by last year's massive outage.
The hardware trouble struck the ATO in December last year, when an “unprecedented” failure of 3PAR storage area network (SAN) hardware, which had been upgraded by HPE in November 2015, resulted in widespread outages across many of the ATO’s systems.
Now, the ATO has released its much-anticipated report into the outage, revealing that analysis of SAN log data for the six months preceding the incident indicated potential issues with the Sydney SAN similar to those experienced during the December outage.
While HPE and fellow integration partner, DXC Technology, continue to investigate the issues related to the outage, the report reveals that HPE had taken some actions in response to the problems flagged by the log data.
Specifically, since May 2016, at least 77 events related to components that failed during the December 2016 incident were logged in the ATO’s incident resolution tool, managed by IT contractor, Leidos.
In addition, at least 159 alerts were recorded in SAN device monitoring and management logs, the ATO report stated.
Some actions had been initiated by Leidos and HPE in response to the indicators, including the collation of incidents by Leidos and infrastructure maintenance such as the changing of cables on the Sydney SAN by HPE.
Despite these actions, alerts continued to be reported that indicated these actions did not resolve the potential SAN stability risk.
“We were not made fully aware of the significance of the continuing trend of alerts, nor the broader systems impacts that would result from the failure of the 3PAR SAN,” the ATO said in the report.
Ultimately, the ATO said that the massive outage experienced in December 2016 resulted from the compound impact of several factors, including multiple SAN component failures on the agency’s Sydney SAN, which included failures associated with stressed fibre optic cabling.
At this stage of the investigation, the ATO considers that stressed fibre optic cabling issues were a major contributor to this outage – regardless of the actions taken by the ATO’s external IT partners, which included the replacement of specific cables.
Other factors contributing to the failure included subsequent unsuccessful attempts by the system to auto-recover in response to the component failures. Consequently, the SAN was unable to provide read/write services to the applications it supported.
Meanwhile, control, management and monitoring systems being placed “in-band” also played a part, with these systems relying on the same data pathways as the production systems that were supporting impacted services.
The second outage
The report also revealed that a second outage that hit the ATO’s systems on 2 February was caused by further issues associated with the cables.
The second outage followed remedial work by HPE on the SAN fibre optic cables, according to the ATO. Unfortunately, during one cable replacement exercise, the agency was informed that data cards attached to the SAN had been dislodged.
“This caused the 3PAR SAN to act in a similar way to that noted during the December outage,” the ATO said. “This included unsuccessful steps to automatically remediate, followed by a systems shut‑down to preserve data integrity. HPE communicated this Priority 1 incident to us immediately.”
As a result, HPE and the ATO monitored the cables around the clock following the outage, until they were comprehensively replaced between 23 and 26 March.
“We have since been advised that SAN alerts ceased completely once the new fibre optic cables were installed,” the ATO said in its report.
The report also outlines other issues that arose when the initial outage occurred early in the morning on 12 December 2016, revealing that firmware supporting impacted disk drives in the affected SAN prevented those drives from re-booting.
Despite having met the ATO-specified conditions for categorisation as a “Priority 1” incident, service provider logs indicated the incident was not escalated to this level until around 7.00am that morning, almost seven hours after the hardware first ran into trouble.
Further, system management, configuration, monitoring, and data recovery systems that relied on the SAN also experienced outages, which extended the recovery process for some applications.
In addition, the impact of pre-incident design and build decisions was material in extending the time to recover data and bring production and supporting systems online.
The SAN was neither designed nor built to cater for more than a single drive failure or a single cage failure, the report said.
The storage hardware build also included a “daisy‑chain” cage configuration, which exacerbated the risk of errors spreading across cages, as occurred during the incident.
“Although a viable design option at the time of SAN implementation, no evidence has been presented of subsequent options being explored by HPE to mitigate this risk,” the ATO said.