ATO remains stumped by failed SAN cable conundrum
- 23 March, 2018 14:19
The Australian Taxation Office (ATO) is still trying to pin down whether the cabling issue contributing to the “unprecedented” failure of its Hewlett Packard Enterprise (HPE)-upgraded storage hardware in late 2016 was the result of faulty hardware or installation problems.
The hardware trouble struck the ATO in December 2016 when 3PAR storage area network (SAN) hardware that had been upgraded in November 2015 by HPE failed, resulting in widespread outages among many of the ATO’s systems. This was followed by a series of smaller outages.
In June last year, the ATO released its own report into the outages, revealing that analysis of SAN log data for the six months preceding the incident indicated potential issues with the Sydney SAN similar to those experienced during the December outage.
The report revealed that the ATO considered issues with stressed fibre optic cabling to be a major contributor to the initial outage, regardless of the actions taken by the ATO’s external IT partners, which included the replacement of specific cables.
The report also revealed that a second outage that hit the ATO’s systems on 2 February was caused by further issues associated with the cables.
The second outage followed remedial work by HPE on the SAN fibre optic cables, according to the ATO. During one cable replacement exercise, however, the agency was informed that data cards attached to the SAN had been dislodged.
Now, the agency’s CIO, Ramez Katf, has told the Parliamentary Committee investigating the digital delivery of government services that it is still unknown whether the cabling issues arose from problems with the cables themselves or if the fault lay at the hands of the cables’ installers.
“For every one of our disc arrays, we had two cables going to it, the reality was…we have not received a…definitive answer as to whether the cables were defective themselves, or the way they were installed was [the] defect,” Katf told the Parliamentary Committee during a public hearing on 23 March. “But, it actually was pervasive across the entire array.
“We had a situation where a single cable failed, and the other one continued, but in this instance, multiple cables failed and created a problem across the eight of our 200-and-something disks, which impacted all of our systems,” he said.
The comments came as Katf detailed how the system ultimately went down despite being designed with failover capabilities intended to make it resilient to failure.
“We actually believed that we had the resilience capability, the failover capability built in to that technology, and the reality is that piece of hardware failed in that dimension, in that it should have given us that capability to failover,” Katf said.
“The construct we went into for that particular piece of hardware was to avoid a single point of failure. And, in essence, the nature of the circumstances of that event actually created that problem for us.
“So, that resilience factor was built in, but failed,” he said.
The failure of the hardware ultimately saw HPE strike a settlement with the ATO. Since the initial outage, however, HPE’s Enterprise Services business, which was handling the project, was spun out of HPE and merged with fellow integrator CSC to form DXC Technology, which was left with responsibility for the work involved in mopping up the fallout from the outages.
“The settlement recoups key costs incurred by the ATO, and provides additional and higher grade IT equipment giving the ATO a world-class storage network,” Australian Commissioner of Taxation, Chris Jordan, told Senate Estimates committee members on 30 May last year.
While the initial outage occurred under HPE’s watch, the existing contract with the ATO under which the storage hardware was maintained, as well as a $195 million extension awarded late last year, now rests with DXC Technology.
Katf told the Parliamentary Committee that the penalty imposed on HPE (now DXC Technology) was appropriate, given the terms laid out in the service level agreements the ATO had struck with the integrator in its contract.
At the same time, the other major integrator contracted by the ATO, Leidos, escaped any penalties related to the systems failure, as it maintained its SLAs under its contract with the agency, according to Katf.
“We went through a very rigorous evaluation of both, including external advice from legal counsel around whether they breached their obligation, and we believed that they did not, and so we did not enforce any penalty,” Katf said.
“I know the parameters of their [Leidos’] contract, and yes I do believe it was appropriate. I think their responsibilities did not traverse to the point where they breached any of their obligations.
“And that’s exactly why we did with DXC, because we were very clear that they did not meet their service levels,” he said.