The Australian Taxation Office (ATO) is set to commission an investigation into what it is calling the “worst unplanned system outage in recent memory,” after spending almost a week working with Hewlett Packard Enterprise (HPE) to restore services.
Trouble first struck the public agency on 12 December, after storage hardware that had been upgraded in November 2015 by HPE experienced an “unprecedented” failure. The problems with the HPE SAN hardware were compounded by the subsequent failure of the ATO’s back-up systems.
The issues saw some of the ATO’s core internal systems and public-facing services go down for days, with some systems still being brought online almost a week after the initial incident.
In a statement issued on 16 December, Australia’s Commissioner of Taxation, Chris Jordan, said that the subsequent restoration and resumption of data and services was very complex and time-consuming, due largely to the failure of the back-up arrangements that were in place.
“ATO staff and people at HPE have worked around the clock over the last week to bring systems back online progressively,” Jordan said. “Our website, the Tax Agent Portal, BAS Portal and Business Portals are up and running and we are seeing continued improvements in their functionality.
“We are continuing to work on the stabilisation of ATO Online services,” he said. “We are realistic that there may be some intermittent performance issues in the next couple of days as the full restoration process proceeds.
“We, and HPE, will continue to work on the stability and performance of all of our systems and we will have staff working over the weekend to catch-up on any backlog of work,” he said.
Now, as the ATO continues to work with HPE to restore services, Jordan said that the agency will fully investigate the events of the past week, with a comprehensive and independent review to help answer questions such as exactly what happened, how and why, and what measures need to be taken to avoid a similar situation in the future.
“The review will be conducted by an independent expert who will determine the nature of the failure(s) and their root cause(s), the adequacy of back-up and contingency arrangements, and the likelihood of recurrence,” he said.
As yet, no blame has been laid at the feet of any single party, but the Commissioner has already moved to defend the agency’s internal competencies, saying that “the issues we have experienced this week do not relate to our overall IT capability or skills”.
The last time a government agency suffered a similarly high-profile technology failure, IBM, the project lead contracted by the Australian Bureau of Statistics (ABS) for the 2016 Census portal, bore the brunt of a full-blown Parliamentary inquiry.
Like the ATO’s storage-related outage, the Census portal outage that followed a series of distributed denial of service (DDoS) attacks in August resulted in an essential government portal becoming unusable for a period measurable in days rather than hours.
The end result was a “very substantial financial settlement” between IBM and the government in relation to the estimated $30 million price tag resulting from the Census outage, multiple condemnations by Prime Minister Malcolm Turnbull, and a blow to the company’s brand equity in Australia.
For its part, the ATO has moved quickly to get to the bottom of its technology woes. It has already outlined the draft scope for an independent expert to undertake an end-to-end review into what happened and why, and what needs to happen to ensure the ATO and the community are not exposed to this type of incident in the future.
The independent investigation is set to examine at least seven key areas, including the definitive description of the failure and its root cause, and factors leading to the outage and contributing to its duration, scale, and scope.
It will also look at the adequacy of back-up and contingency strategies and arrangements, why failover to the secondary site did not work, the adequacy of restoration procedures for infrastructure and applications, and whether there is anything unique or unusual in the physical and technical ATO technology infrastructure that suggests a high risk of a similar failure further down the track.
Of particular note to the ATO’s technology partners in general, and HPE in particular, is the agency’s plan to look into the adequacy, speed, and robustness of the critical event response provided by “various vendors and other non-ATO entities”.
As yet, HPE has remained tight-lipped about the role its hardware played in the outage and details about its engagement by the ATO in relation to its work to restore services.
It is understood, however, that the systems outage was caused by the failure of two new HPE 3Par storage area network (SAN) units acquired by the ATO in 2015, which resulted in the temporary loss of up to one petabyte of data.
At the time of writing, HPE’s official response to the issue had provided little detail about the work it has undertaken with the ATO to restore systems and rectify the outage.
“HPE has taken immediate action to help resolve the hardware issues which have impacted the ATO’s online services, portals and website,” HPE told ARN in a statement.
“This is a top priority for HPE and we continue to manage this closely with our client to ensure that all the systems are restored to functionality as soon as possible,” the company said.