IT directors failing to assess risk of human error in datacentre systems


Best practice a must for datacentres to prevent avoidable failures


Companies are investing hundreds of thousands of pounds in high-availability systems for datacentres, but are failing to follow the best practice maintenance procedures needed to avoid a single point of failure. Even though the IT within datacentre sites can offer 99.99% availability and no single point of failure, IT directors are failing to assess the risk of human error in mechanical, electrical and IT systems, said Mick Dalton, chairman of the British Institute of Facilities Management.
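To put that availability figure in context, a back-of-the-envelope sketch (illustrative only, not from the article; the targets shown are common examples) shows how little downtime a 99.99% target actually permits, roughly 52 minutes a year:

# Back-of-the-envelope downtime budget implied by an availability target.
# Illustrative sketch only; the targets below are common examples, not figures from the article.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # about 525,960 minutes

def downtime_budget_minutes(availability: float) -> float:
    """Maximum yearly downtime, in minutes, consistent with an availability figure."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for target in (0.999, 0.9999, 0.99999):
    print(f"{target * 100:.3f}% availability allows about "
          f"{downtime_budget_minutes(target):.1f} minutes of downtime per year")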

Dalton, who is also group operations director at Global Switch, has seen examples of downtime arising from someone simply plugging in a device that was not approved for the datacentre. For example, datacentres have been brought down by incidents as mundane as a janitor plugging in a vacuum cleaner and IT staff plugging in radios with faulty power cables. To prevent such problems occurring, Dalton recommended that IT directors ensure T-bar power sockets and plugs are used throughout the datacentre.

Problems can also arise from poor maintenance practices. Keysource, an electrical engineering consultancy specialising in datacentres, said, “Essential, ongoing service and maintenance often falls short of the rigorous regime required to deliver high levels of availability.” IT directors should also check the reliability of the datacentre's water cooling system, said Mark Seymour, a director at Future Facilities, which provides thermodynamics modelling tools to identify hotspots in cooling systems. This is because an inability to cool hot equipment will cause servers to shut down due to 'thermal shock'.
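As a minimal illustration of the kind of check such hotspot modelling supports (a sketch only, not Future Facilities' software; the rack names, readings and the 27°C inlet limit, based on the commonly cited ASHRAE recommended upper bound, are assumptions):

# Minimal hotspot check: flag racks whose inlet air temperature exceeds a limit.
# Sketch only -- the readings and the 27 C limit (ASHRAE recommended upper bound) are assumptions.

INLET_LIMIT_C = 27.0

inlet_temps_c = {  # hypothetical sensor readings per rack, in degrees Celsius
    "rack-a1": 23.5,
    "rack-a2": 29.1,
    "rack-b1": 26.8,
}

hotspots = {rack: temp for rack, temp in inlet_temps_c.items() if temp > INLET_LIMIT_C}

for rack, temp in sorted(hotspots.items()):
    print(f"WARNING: {rack} inlet at {temp:.1f} C exceeds the {INLET_LIMIT_C:.1f} C limit")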

All this shows that IT directors need to assess IT, electrical and mechanical systems, people and processes in order to have a thorough understanding of where points of failure can occur and how the risks can be minimised. IDC analyst Claus Egge said, “The only way to ensure that fall-back plans work is to test them.” He noted that many IT sites do not test, and that even a site that tests regularly may not have tested the exact events that cause downtime.

Over the summer there have been several high-profile datacentre glitches resulting in downtime. Co-location provider Telehouse suffered an outage on 17 August at its London Docklands datacentre after a phase of its power supply burnt through. CSC had a datacentre power failure at its Maidstone facility on 30 July, which disrupted NHS IT services in the North West and West Midlands. The glitch at the CSC datacentre was initially caused by maintenance work on the uninterruptible power supply (UPS) system, which led to a short circuit. The circuit breakers tripped, causing a total loss of power that lasted for 45 minutes.

Key Points

Many firms are failing to follow best practice datacentre maintenance procedures. The risk of human error in mechanical, electrical and IT systems is being overlooked, and datacentres have been brought down by routine maintenance. IT directors need to assess their mechanical and electrical risks to understand why datacentres fail. Common failures are:

Inadequate risk assessments and method statements that do not address and mitigate the real potential for downtime.

Limited UPS battery testing and performance monitoring to measure gradual failure of components

Incorrect circuit breaker settings

Poor co-ordination between datacentre managers, facilities management and IT.

Lack of planning for multiple concurrent events or cascade failure.

Insufficient live testing of real failure situations

Not enough built-in redundancy

Inexperienced, unsupervised engineers carrying out essential service and maintenance

Source: Keysource
