The PricewaterhouseCoopers (PwC) report into the Australian Taxation Office (ATO) systems failure found that the SAN design implemented by the service provider did not leverage 'out of the box' automated technical resilience or data and system recovery features.
In pre-incident observations, PwC said that decisions made in the design and build phase led to a resilience posture lower than required to cater for the impact of technical failures.
Professional services industry heavyweight PwC was appointed in January 2017 to conduct an independent review of the outage, which began in December 2016 and was caused by the failure of two HPE 3Par storage area networks (SANs).
"Specific design and build characteristics adopted by the service provider for the SAN and enabling infrastructure – some of which were geared towards better performance versus higher resilience – exacerbated the scale of the failure and compromised response and recovery capabilities," said the report.
According to the report, the scale, scope and duration of the outage were caused by several issues, including that the potential for an incident of that nature and scale had not been explicitly considered by the ATO or its SAN service provider.
"This resulted in readiness levels in terms of technical resilience and recovery capability (as designed and implemented in the specific ATO SAN instance) insufficient to cater for the scale and scope of the technical failure," the report said.
PwC made several recommendations addressing the issues found in its review. These included replacing the current physical infrastructure; re-architecting and reconfiguring a new SAN with a rebalanced focus on resilience and performance; providing a centralised logging capability; migrating management, monitoring and recovery systems to separate and non-dependent storage; and pursuing, with priority, unanswered technical questions that pose an ongoing risk in relation to the use of 3Par.
To strengthen the ATO’s governance, risk and response capabilities, PwC recommended augmenting current ATO technical knowledge and expertise pertaining to infrastructure design and implementation planning; supporting design governance processes; establishing a permanent and dedicated resilience ‘run’ function as part of the business continuity management capability; and providing an enterprise-wide focus on preparing for, testing, and responding to disruptive events.
The report also outlined that the ATO should consolidate, streamline, update, and simplify existing business continuity management documentation to clearly articulate the relationship between, and accountability for, business continuity, disaster recovery, and resilience planning.
The ATO said it was taking action in response to the PwC report.
"The ATO is well-advanced in implementing the recommendations of both reports, including fixing irritants and enhancing systems performance, refreshing the Tax and BAS Agent Portal to better meet the needs of the tax profession and improving our IT design and governance," the ATO said.
In June 2017, the ATO released a report into the outage, which revealed that analysis of SAN log data for the six months preceding the incident indicated potential issues with the Sydney SAN similar to those experienced during the December outage.
HPE took some actions at the time, including the replacement of specific cables, but errors continued to be reported, indicating the actions did not resolve the potential SAN stability risk.