Major systems crash at Department for Work and Pensions

A configuration issue between Windows 2000 and XP caused the department's machines to crash on Monday 22 November, and it was not until Friday of that week that all systems were fully functioning. November's crash happened in part because of the power of remote upgrade tools, which allow technicians to modify tens of thousands of machines using simple routines at a single terminal.

It is believed that an EDS operator allowed Windows XP to be installed on 40,000 computers instead of the 30 in the trial. Procedures have been strengthened to ensure that a single operator cannot perform a remote upgrade accidentally, EDS said.

Some of the DWP's welfare systems run in part on Fujitsu mainframes that date back more than a decade. In its strategy documents, the DWP said all of its equipment could not be feasibly replaced with fully integrated systems in the short term. The DWP is the largest employer in central government and spends £1bn a year on IT systems and services.

EDS runs most of the IT services and systems for the department under an "Accord" which was awarded under the private finance initiative. Contracts awarded to EDS under Accord have a total value of £4.5bn, representing 85% of DWP business by value. But since the Accord was signed in 1999, the DWP has decided to move away from PFI under the umbrella of a series of framework arrangements called "Unity". It said it is also reducing the proportion of business with EDS.

Key point

A simple error caused this crash, which affected some 40,000 computers and doubtless millions of clients. One operator instigated the disruption whilst testing an upgraded pilot system based around Windows XP in a drive to cut costs. Instead of remotely upgrading the 30 systems in the pilot as intended, the remote system management tool applied the upgrade to over 40,000 computers. Including a few basic business continuity (BC) measures in the pilot project would have helped to identify the risk and to put in place measures preventing a mass upgrade until the pilot had been fully tested and the upgrade was ready for deployment.

Large-scale systems rely on various remote management tools to manage the IT environment effectively but, as the DWP found, these tools increase the risk to the business of a cascade event from a simple source. Organisations need to be very careful to ensure procedures are in place to control when and how they are used.
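
As an illustration only, the kind of procedural control described above could be sketched as a pre-flight check that runs before any remote upgrade is pushed. This is not the procedure EDS or the DWP adopted. The names (UpgradeRequest, check_rollout, RolloutBlocked) and the threshold of 50 machines are hypothetical, chosen purely to show two safeguards: the target list must stay within the approved pilot list, and a large rollout needs sign-off from someone other than the operator who requested it.

```python
from dataclasses import dataclass, field


class RolloutBlocked(Exception):
    """Raised when a requested upgrade fails a safety check."""


@dataclass
class UpgradeRequest:
    # All fields are hypothetical, for illustration only.
    operator: str                  # person requesting the upgrade
    targets: list[str]             # machines the tool would modify
    approved_targets: set[str]     # machines signed off for this change
    approvers: set[str] = field(default_factory=set)  # people who signed off


def check_rollout(request: UpgradeRequest, max_unreviewed: int = 50) -> list[str]:
    """Return the sorted target list if the request is safe, else raise."""
    targets = set(request.targets)

    # Safeguard 1: nothing outside the approved list may be touched.
    unexpected = targets - request.approved_targets
    if unexpected:
        raise RolloutBlocked(
            f"{len(unexpected)} target(s) are outside the approved list"
        )

    # Safeguard 2: a large rollout needs a second person's approval,
    # so the requesting operator cannot approve their own change.
    independent = request.approvers - {request.operator}
    if len(targets) > max_unreviewed and not independent:
        raise RolloutBlocked(
            f"{len(targets)} targets exceed {max_unreviewed}; "
            "independent approval is required"
        )

    return sorted(targets)
```

With a 30-machine pilot list, a request naming those 30 machines passes, while a request naming 40,000 machines is rejected before anything is changed. The same check also stops a single operator from approving a full deployment alone.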