IT Infrastructure failing as if the past two decades never happened - Part 2

In Part 1 of this series, we examined recent data center outages and the reasons why these “cautionary tales” came to pass. Now, let’s discuss practical tips for minimizing the risk of outages in business-critical infrastructure.

Getting past misconceptions

Human error and/or equipment failure is often cited as the root cause of many engineering system outages, but most of the time, these elements don’t actually cause major disasters on their own. Rather, they are symptoms of a larger issue – poor management and operations practices.

Leadership decisions and priorities that result in a lack of adequate staffing and training, an organizational culture dominated by “fire drills,” or budget cuts that reduce necessary maintenance can result in pervasive failures that flow from the top down.

Although front-line operator error may sometimes appear to cause an incident, a single mistake (just like a single data center component failure) isn’t typically enough to bring a robust complex system to its knees – unless the system is already teetering on the edge of critical failure as a result of numerous underlying risk factors.

It’s true that vulnerabilities are present within even the best-designed data centers. Companies with complex IT systems combat the risk of failure with multiple layers of protection and backup. So again, when IT failures take place, they are not due to a lack of backup systems or any one issue in particular; they are an indication of poor management.

Catastrophic data center incidents like the ones we saw in 2017 are avoidable if organizations design their infrastructure up to industry standards, with redundancy and other preventative measures baked in, and implement stringent management and operations best practices.

Every business should conduct thorough failure analyses and apply the lessons learned when developing and refining its program, so that business-critical facilities become resilient and successful over the long term. An organization’s responsiveness, and its familiarity with and adherence to documented procedures, are key to evaluating performance.

Practical considerations for minimizing risk

Throughout the past 20 years, Uptime Institute has delivered operations assessments across hundreds of data center facilities and has identified key management shortfalls that increase risk.

Many data center programs – even rigorous operations that have been successful – are subject to various risks and can be improved through continuous assessment and development.

Take a moment to review your program with an objective eye; if you can answer yes to any of the following questions, you may be experiencing a crisis in management rigor:

· Are data center staff voicemail boxes full, emails going unanswered, or inbox size limits exceeded?

· Are critical meetings missed or routinely cancelled?

· Does your data center team report a lack of time for training?

· Are there any whisperings about a potential shortage of qualified staff?

· Are certain team members performing work outside their competency?

· Does your staff experience high personnel turnover?

· Has maintenance exceeded its budget? How about energy cost estimates?

· Do the backs of your servers or your cable trays look like a spaghetti pot blew up?

· Do your equipment and cabling lack clear labelling?

It can be relatively easy to determine other underlying risk factors that are being left untended by management. Walk through your facility and ask yourself these questions to ensure the appropriate processes and documentation are in place:

· Are there any combustible materials on the raised floor, in the battery room, or electrical rooms? All incoming equipment should be stripped of packaging outside of critical space.

· Are unrelated items—office furniture, shelving units, tools—stored in critical space? This is a fire, safety and contamination issue.

· Do any fire extinguishers on the premises have out-of-date tags?

· When was the last time you reviewed housekeeping policies and procedural documentation?

· If the facility operates a raised floor, what is the condition of the underfloor plenum? This area should be cleaned regularly; ask to see the schedule.

· How many employees have access to the critical space? Does your organization even have an access policy for staff?

· Are non-vetted individuals being allowed in critical areas? Ask to see the vendor check-in and training requirements; non-vetted individuals should never be allowed.

· Are panels, switchboards, and valves labelled to indicate “normal” operating positions?

· Is arc flash labelling installed on all panels and PDUs?

For over a decade, data center cooling practices have called for air flow isolation—cool air delivered to the front of a rack of IT equipment and hot air exhausted out the back.

In a raised floor environment, rows of equipment are typically arranged in a Hot Aisle – Cold Aisle configuration, in which perforated tiles deliver cool air to the cold aisle or server intakes.

When reviewing your organization’s cooling procedures, consider the following indicators of poor bypass air flow management. These factors can result in heightened risk, cooling inefficiencies, wasted money and poor adherence to key management best practices:

· There are grated or perforated panels in the Hot Aisle.

· There are unsealed cutouts in the raised floor.

· There are uncovered gaps in the racks between IT hardware.

Here are several other key steps that can help identify elements of your data center operations that reflect poor management procedures and increase the risk of downtime:

· Ask to see records and schedules for maintenance activities on batteries, engine generators, and mechanical systems.

· Review staffing documentation—overtime rates greater than 10 percent can lead to an increase in human error, which can increase the likelihood of an outage. Are roles and responsibilities documented? Are qualifications listed?

· Ask to see the list of preventive maintenance activities. Are the activities fully scripted? What is the quality control process?

· Find out who keeps critical documentation on equipment, including warranty info, maintenance records, and performance data.

· Revisit your process for maintaining the reference library (staffing, equipment, maintenance, procedures, and scripts).

· Analyze your team’s training records, annual budget, and time allocation.
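The staffing check above, which flags overtime rates greater than 10 percent, is simple arithmetic that can be run against timesheet data. The following is a minimal sketch, not a prescribed method: it assumes overtime rate means overtime hours divided by regular hours, and the StaffHours record, the function names, and the sample roster are all hypothetical illustrations.

```python
from dataclasses import dataclass

# The 10 percent figure comes from the staffing bullet above.
OVERTIME_THRESHOLD = 0.10


@dataclass
class StaffHours:
    """Hypothetical timesheet summary for one person over one period."""
    name: str
    regular_hours: float
    overtime_hours: float


def overtime_rate(entry: StaffHours) -> float:
    """Overtime hours as a fraction of regular hours (an assumed definition)."""
    if entry.regular_hours <= 0:
        raise ValueError("regular_hours must be positive")
    return entry.overtime_hours / entry.regular_hours


def flag_high_overtime(entries, threshold=OVERTIME_THRESHOLD):
    """Return the names of staff whose overtime rate exceeds the threshold."""
    return [e.name for e in entries if overtime_rate(e) > threshold]


# Made-up sample data for illustration only.
roster = [
    StaffHours("operator_a", regular_hours=160.0, overtime_hours=8.0),
    StaffHours("operator_b", regular_hours=160.0, overtime_hours=24.0),
]

flagged = flag_high_overtime(roster)
print(flagged)
```

If your organization defines overtime rate differently, for example as a share of total hours worked, adjust overtime_rate accordingly before comparing against the threshold.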

Organizations are continuing to adopt various new IT models to deal with the ever-growing reliance on technology and data in modern business. As such, availability has never been more important.

While it’s virtually impossible for an organization’s site processes, procedures, and culture to be perfect, successful IT infrastructure teams remain hyper-focused on preventing failure.

This means staying vigilant at all times and constantly addressing (and readdressing) the considerations listed above to pinpoint hidden vulnerabilities in your IT operations. Those findings can serve as the basis for productive conversations about change and improvement. The fact that your facility hasn’t experienced an incident yet doesn’t mean it’s immune.

A solid commitment to management and operations excellence can have a tremendous impact on the performance of your IT infrastructure, so ask the hard questions and cover all your bases to eliminate preventable outages.

Lee Kirby is the president of Uptime Institute
Matt Stansberry is the senior director of content & publications at Uptime Institute

Lee Kirby has more than 30 years of global experience in the private and public sectors and has successfully led several technology start-ups and turn-arounds as well as built and run world-class global operations. Lee is a trusted adviser for various organizations in the data center sector and has provided interim leadership to emerging and transforming technology companies. He retired as President of Uptime Institute in October 2018, where he provided thought leadership, global services strategy, new product development and strategic marketing. In addition to his many years as a successful technology industry leader, he balanced a successful military career of over 36 years (Ret. Colonel).
