ATO systems outage audit prompts IT supplier SLA scrutiny

Audit points to "inadequacies" in business continuity management planning relating to critical infrastructure

The service level agreements (SLAs) the Australian Taxation Office (ATO) has with its major external IT providers have come under the spotlight more than a year after the agency’s “unprecedented” failure of storage hardware in late 2016.

The ATO was struck by widespread systems outages after 3PAR storage area network (SAN) hardware supplied by Hewlett Packard Enterprise (HPE) in late 2015 unexpectedly failed in December 2016.

Now, a report into the initial outage – and subsequent associated outages – released by the Australian National Audit Office (ANAO) on 20 February, has recommended that Australia’s tax collector include tolerances in its IT services contracts that “align with service standards associated with [IT] systems, where possible”.

“With the major ICT service contracts scheduled to be renegotiated in 2018, the ATO has an opportunity to align service measures across its ICT contracts and also align service standards with the outage tolerances in its ICT service contracts,” the ANAO said in its report, Unscheduled taxation system outages.

The recommendation echoes some of the findings of earlier investigations into the systems failure, with the ATO’s own internal audit into its contract and relationship with DXC Technology examining whether any aspects of the arrangements exceeded the ATO’s risk tolerances.

DXC Technology was the external supplier left holding the ATO’s mammoth centralised computing contract, which covers the provision of certain storage infrastructure, seven years after it was originally awarded to HPE in December 2010.

In April 2017, when HPE’s Enterprise Services business merged with CSC Australia, the contract, worth approximately $160 million per year, came under the auspices of the resulting entity, DXC Technology.

The internal audit found that while there were no immediate issues apparent in contractual arrangements, there were broader issues surrounding the extent of strategic alignment of the contracted IT service providers’ offerings with ATO business objectives.

That report made several recommendations in this regard, and noted that, “at an entity-level, greater definition is required as to how the ATO engages with key vendors, supported by greater analysis and monitoring of arrangements, including periodic reporting to the ATO Executive”.

“In this way, the ATO will better define and achieve strategic value from vendors, with better visibility and control of the breadth of, and reliance upon, vendor arrangements,” the ATO’s July 2017 HPE Review Product, Services and Relationships Report stated.

Broadly, the ANAO’s report suggested that the ATO’s responses to the system failures and unscheduled outages were largely effective, “despite inadequacies in business continuity management planning relating to critical infrastructure”.

In addition to calling for the ATO to build aligned systems outage tolerances into its contracted SLAs with external vendors, the ANAO made two other recommendations.

The Audit Office recommended that the ATO update its business continuity management, IT service continuity management and risk management frameworks to improve and better integrate the identification and treatment of risks to critical infrastructure that may lead to system failures.

“The December 2016 and February 2017 incidents highlight that the ATO did not have a sufficient level of understanding of system failure risks,” the ANAO’s report stated. “The ATO’s risk management and BCM [business continuity management] processes did not include an assessment of risks associated with storage area networks, which were a potential single point of failure.

“Moreover, BCM processes were limited in planning for critical infrastructure and ICT system failure to the data centres.

“As a consequence, the ATO – including DXC and Leidos – were not prepared for the possibility of complete system failure caused by storage failure.”

The ANAO also recommended that the ATO determine the level of availability of services associated with IT systems to include in service standards and subsequently report performance against those standards.
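To make the idea of aligning contract outage tolerances with service standards concrete, here is a minimal Python sketch of the underlying arithmetic. It is an illustration only: the function names, the 99.5 per cent availability figure and the tolerance values are assumptions chosen for the example, and none of them comes from the ANAO report or the ATO’s contracts.

```python
def allowed_downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Downtime permitted in a period by an availability standard (hypothetical helper)."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period_hours * (1 - availability_pct / 100)


def tolerance_is_aligned(contract_tolerance_hours: float,
                         availability_pct: float,
                         period_hours: float) -> bool:
    """True if the contract's outage tolerance fits within the service standard."""
    return contract_tolerance_hours <= allowed_downtime_hours(availability_pct, period_hours)


# Illustrative figures only: a 99.5% standard measured over a 30-day month.
HOURS_IN_MONTH = 30 * 24
budget = allowed_downtime_hours(99.5, HOURS_IN_MONTH)  # roughly 3.6 hours

# A contract tolerating 8 hours of outage a month would exceed that standard,
# while one tolerating 2 hours would sit within it.
print(tolerance_is_aligned(8.0, 99.5, HOURS_IN_MONTH))
print(tolerance_is_aligned(2.0, 99.5, HOURS_IN_MONTH))
```

Under these assumed numbers, a supplier could meet an eight-hour contractual tolerance while the agency still missed its own service standard, which is the kind of mismatch the recommendation is aimed at.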

For its part, the ATO has agreed to all three recommendations, with the agency’s CIO, Ramez Katf, saying that a dedicated program of work to enhance the ATO’s IT systems’ resilience, performance and stability was already underway.

“We will focus on improving our IT design and governance, further strengthening our cyber security posture and improving the technology used by ATO staff to ensure they have the right tools to do their job,” he said.

SAN migration project

The ongoing program comes after the ATO approved a new storage strategy in February 2017 that was proposed by DXC Technology.

According to the ANAO report, the strategy proposed by DXC involved the migration of all data off the failed storage array, and the replacement of the storage network devices with new XP7 storage components at both the agency’s Sydney Data Centre and the Western Sydney Data Centre.

The work to replace the previous 3PAR SAN was carried out in the ATO’s Rebuild Program under the so-called SAN migration project, and included replacing the damaged disk drives, replacing all optical cables, updating the firmware and independent testing.

The ANAO said that DXC Technology decommissioned the failed 3PAR SAN supporting the production environment by July 2017. The SAN migration was completed in phases once the new XP7 SAN was installed. A final report and certification were issued in June 2017.

As previously reported, the defective 3PAR SAN was then sent to HPE laboratories in the United States for forensic analysis into the root cause of the failed storage drives.

A report from DXC Technology is expected in early 2018.

“The SAN Migration project was to ‘stand up’ dual XP7 SANs in the Sydney and the Western Sydney Data Centres,” the report stated.

“The storage environment includes replicated storage arrays across data centres, a feature absent from the ‘original’ 3PAR SAN-supported environment. The updated storage configuration will be monitored for capacity and performance,” it said.

The ANAO also said that, according to DXC Technology, the dual XP7 storage configuration should provide better performance than the prior infrastructure.

Specifically, if a storage drive appears to be failing, the array is designed not to lock the troubled drive straight away, but instead to rebuild its contents onto one of the spare storage drives.

“The storage array will continue to write to the disk until it is completely rebuilt on the spare drive and then lock out the failing drive,” the ANAO said.

“This storage configuration provides the ATO with the capability to run its [IT] systems even if two storage disks fail simultaneously.”
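The rebuild behaviour the ANAO describes can be pictured with a short Python sketch. It is a toy model under stated assumptions, not HPE’s or DXC Technology’s implementation: the `Drive` and `StorageArray` classes, their fields and the `handle_failing_drive` method are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare drives by identity, not by contents
class Drive:
    name: str
    blocks: list = field(default_factory=list)
    locked: bool = False  # a locked drive no longer accepts I/O


@dataclass
class StorageArray:
    active: list  # drives currently serving data
    spares: list  # idle drives held in reserve

    def handle_failing_drive(self, failing: Drive) -> Drive:
        """Rebuild a failing drive onto a spare, then lock the failing drive out."""
        if not self.spares:
            raise RuntimeError("no spare drive available for rebuild")
        spare = self.spares.pop(0)
        # The failing drive stays unlocked while its contents are copied,
        # so it can keep serving I/O during the rebuild.
        for block in failing.blocks:
            spare.blocks.append(block)
        # Only once the spare holds a complete copy is the failing drive locked.
        failing.locked = True
        self.active[self.active.index(failing)] = spare
        return spare


# Usage: one drive in a two-drive set starts to fail and a spare takes over.
d1, d2, s1 = Drive("d1", [1, 2, 3]), Drive("d2", [4]), Drive("s1")
array = StorageArray(active=[d1, d2], spares=[s1])
replacement = array.handle_failing_drive(d1)
```

The point of the ordering is that the lock happens last: the troubled drive is taken out of service only after the spare has a full copy of its data.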

This feature is understandably seen as an important factor given that the system recovery tools used to restore IT services – data management, system monitoring and backup/restore – were in the same data centre on the affected SAN that went down in December 2016.

The system failure meant that those tools were unavailable, and that there were no backup or redundant system recovery tools available on other IT systems to detect and analyse the incident, or to support efforts to recover and restore services.

The ATO said it has since been working to ensure its IT service continuity management focuses on IT technical architecture and design, and on operational risk management, to strengthen the identification and treatment of risks to critical IT infrastructure that may lead to system failures.