Turing's Man Blog

Data center infrastructure maintenance standards

The following statement appears in the white paper "Tiered Infrastructure Maintenance Standards (TIMS) For Mission-Critical Environments" by Lee Technologies: […] Business and government have invested billions in pursuit of "buying" uptime. However, an unavoidable fact of life remains: While beginning with a strong foundation is crucial, ongoing maintenance will make the difference between success and failure. And the more critical the mission, the more intense the maintenance program needed. […] So, while we attend to the requirements defined by the Uptime Institute Tier Classification or "ANSI/TIA-942. Telecommunications Infrastructure Standard for Data Centers", especially at the design and construction stages of our data center facilities, we shouldn't forget that uptime is also a direct result of a wisely balanced maintenance program. It seems obvious, but how can we measure our current maintenance standards and match them against Uptime Institute or ANSI/TIA-942 tiers? How can we define an appropriate maintenance program for a given data center, with all its limitations and design features? TIMS (Tiered Infrastructure Maintenance Standards) appears to be the only formalized industry proposal we can rely on for now, so let's review it briefly for further inspiration.

 

To be fair, the data center industry had no formalized maintenance standards before 2006, when Lee Technologies introduced the concept in the white paper "Tiered Infrastructure Maintenance Standards (TIMS) For Mission-Critical Environments", written by Bob Woolley and co-authored by Mike Hagan.

Of course, this doesn't mean that no good maintenance programs existed. There were, and still are, many good examples in the industry. However, when the discussion turns to uptime and reliability, data center professionals tend to talk about design considerations, redundancy, and Uptime Institute or ANSI/TIA-942 tiers. For some reason, solid maintenance is rarely mentioned, even though it is nothing to be ashamed of and no less important than the infrastructure itself. It is the other side of the same coin: both areas are equally important, both can be financially measured, justified and assessed, and only together can they deliver the required uptime and reliability. There is no reliable data center infrastructure without well-balanced maintenance.

Still, some organizations have excellent infrastructure designed with reliability in mind (say, Tier IV) and have invested tremendous amounts of money in construction because they know that reliable infrastructure means safe business, yet something goes wrong when it comes to daily maintenance and operations. I don't want to go into drastic scenarios where there is no maintenance program at all; those are marginal cases. However, many companies act as if "run to fail" were a serious way of managing operational risk. I don't accept that approach, so I won't discuss it further here. My point is that we often want an optimal maintenance program for a wisely designed data center, but we don't know how to define the rules and requirements that balance and match the two areas. We want a program that suits our site with all its features and limitations, such as the redundancy available on critical components and systems, the required uptime and the tier level. Yet we struggle to decide which parts of the maintenance program are critical, and we also don't know where to look for recommendations or formalized guidelines.

Moreover, we often struggle to convey the importance of a proper maintenance plan to management, which badly complicates the budgeting side of data center operations. We must agree that a data center is made reliable (and our business clients safe) not only by solid design and state-of-the-art infrastructure, but also by a properly defined maintenance program, based on standards and requirements that suit and support the available infrastructure. If we can explain infrastructure design decisions from the uptime and reliability perspective to win financial approval, we must be able to do the same for maintenance standards. Our maintenance program must therefore fit the capabilities of the data center infrastructure, be measurable and be based on formalized rules. Where can we find these rules, or at least guidelines and inspiration for defining our own? TIMS can help a great deal.

By analogy with the Uptime Institute and ANSI/TIA-942 tiers for data center infrastructure reliability, Lee Technologies defined four tiers of maintenance standards:

  • TIMS-1: Run to Fail
  • TIMS-2: Unstructured
  • TIMS-3: Structured
  • TIMS-4: Facilitated

Here is how the original white paper describes each of the four TIMS levels:

 

[…]

TIMS-1 Run to Fail

This level of service reflects the old adage, "If it isn't broken, don't fix it." Maintenance is purely reactive at this level; when equipment fails, a technician is summoned to perform the repair. In areas where the system has redundancy, there may be little or no effect on the critical load for an isolated failure. The lack of a preventive maintenance program, however, will increase the likelihood of simultaneous failures, which can take down even redundant systems.

Operating at TIMS-1 implies that the perceived cost of an outage is low compared with the cost of preventative maintenance. And in a time of tight IT budgets, deferring maintenance is often viewed as an easy way to cut costs. But any perceived short-term savings in maintenance costs will likely be overshadowed in the long term by more costly outages and expensive repairs.

A lack of system redundancy may also invoke a run-to-fail strategy, where maintenance on a non-redundant component would necessitate removing a portion of the critical load from service. Ironically, the same lack of redundancy will guarantee an unplanned outage when (not if) a failure occurs.

 

TIMS-2 Unstructured Maintenance

TIMS-2 maintenance is characterized by the performance of routine preventative maintenance tasks without an overlying set of processes and procedures to ensure effectiveness and predictability. The fact that it is commonly performed by qualified manufacturer's service representatives or trusted in-house technical staff can create a false sense of security. Even qualified personnel can make mistakes or focus too intently on individual system components without considering the system as a whole. This approach may deliver adequate results in some environments, but it does not meet the expectations of mission-critical data centers. Unfortunately, this level of service is the industry norm. Service contracts for preventative maintenance are commonly low bid with the difference being recovered on follow-up corrective maintenance work, which is lucrative.

Simply following manufacturer's recommendations is no guarantee that all necessary steps are being taken to maximize availability. If the maintenance program lacks a detailed scope of work for each piece of equipment that factors in system interdependencies, chances are that important steps are being neglected. If methods of procedure (MOPs) are not employed on critical systems to detail each step in the maintenance process, the risk of human error occurring during maintenance events is elevated.

A common characteristic of Unstructured Maintenance is an over-reliance on individual effort. It is reassuring to rely on a trusted individual who has been providing maintenance services for years, but this creates a high degree of risk when an organization's facility-maintenance knowledge resides inside the head of individual technicians, who are susceptible to making mistakes no matter how experienced.

Unstructured, under-documented maintenance programs create an environment in which equipment failure is tolerated and the risk of human error is elevated.

 

TIMS-3 Structured Maintenance

Structured Maintenance is designed to maximize uptime by removing guesswork and minimizing the negative effects of human error. TIMS-3-level maintenance is a complicated task that requires discipline and experience to execute. Each component of the maintenance process is closely controlled; policies are established to control how information is gathered, acted upon and recorded, precisely managing how and when work is performed. Identifying and training qualified personnel is part of a formal program, as is supervision and performance evaluation.

Structured Maintenance is an extremely proactive process that unites best practices for each maintenance element, integrating them into a program that is more than the sum of its components. The goal is to systematically eliminate variables that can introduce errors.

Structured Maintenance programs include a formal staff training program; a document library that includes a scope of service and standard operating procedures (SOPs) for all site equipment; a change management program that uses methods of procedure (MOPs) for maintenance activities along with a formal work process; a strong vendor management program; rigorous quality control procedures and specialized support systems such as a computerized maintenance management system (CMMS) and electronic document management system (EDMS); and 24/7 on-site staffing.

Importantly, a facility with a high Uptime Institute Tier rating is not required to enact a Structured Maintenance program. Rather, the critical systems must simply be maintained to the program standards. In the event that concurrent maintenance is not possible, data center managers may have to organize a controlled shutdown of some services, but this is significantly better than an unplanned, uncontrolled shutdown that was preventable.

 

TIMS-4 Facilitated Maintenance

Facilitated Maintenance is the highest level of maintenance service. It combines a Structured Maintenance program with a system topology that facilitates maintenance by providing multiple power and cooling distribution paths with redundant components. Such a design allows individual pieces of equipment to be isolated and maintained without a disruption in services. Another important component is a building management system (BMS), which continually monitors the critical infrastructure, trends equipment performance, alerts operators when conditions fall outside allowable parameters and allows automated control of equipment sequencing.

Data center operations achieve the highest possible level of reliability for their assets when Structured Maintenance is performed in this environment. Automated systems eliminate much of the risk of human error and can respond more quickly and appropriately to sudden changes. Continuous monitoring of the critical systems, and the ability to trend specific operating parameters, facilitates predictive maintenance of critical systems. Operating in a Facilitated Maintenance model enables managers to easily isolate redundant system components for comprehensive testing and maintenance, greatly increasing reliability while minimizing the risk of downtime.

[…]
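
As a rough illustration of how the four levels described above could be told apart, here is a minimal Python sketch. The `MaintenanceProgram` fields, the `classify` function and the all-or-nothing rule for Structured Maintenance are assumptions made for this example; the white paper describes the levels in prose and does not define a scoring algorithm.

```python
from dataclasses import dataclass
from enum import IntEnum


class TimsLevel(IntEnum):
    RUN_TO_FAIL = 1
    UNSTRUCTURED = 2
    STRUCTURED = 3
    FACILITATED = 4


@dataclass(frozen=True)
class MaintenanceProgram:
    """Hypothetical checklist loosely derived from the quoted descriptions."""
    preventive_maintenance: bool = False     # routine PM tasks are performed
    formal_training: bool = False            # formal staff training program
    sops_and_mops: bool = False              # document library with SOPs, MOPs in use
    change_management: bool = False          # formal work and change process
    vendor_management: bool = False          # strong vendor management program
    cmms_and_edms: bool = False              # CMMS and EDMS support systems
    staffed_24x7: bool = False               # 24/7 on-site staffing
    concurrently_maintainable: bool = False  # redundant power and cooling paths
    bms_monitoring: bool = False             # BMS monitors and trends the plant


def classify(program: MaintenanceProgram) -> TimsLevel:
    """Map a checklist to a TIMS level (a simplification, not the official method)."""
    if not program.preventive_maintenance:
        return TimsLevel.RUN_TO_FAIL
    structured = all((
        program.formal_training,
        program.sops_and_mops,
        program.change_management,
        program.vendor_management,
        program.cmms_and_edms,
        program.staffed_24x7,
    ))
    if not structured:
        return TimsLevel.UNSTRUCTURED
    # Facilitated = Structured program plus a topology and BMS that support it.
    if program.concurrently_maintainable and program.bms_monitoring:
        return TimsLevel.FACILITATED
    return TimsLevel.STRUCTURED


# Example: preventive maintenance exists, but nothing is formalized around it.
print(classify(MaintenanceProgram(preventive_maintenance=True)).name)
```

A real assessment would be far less binary than this; the point is only that each level builds on the previous one, and that topology alone does not lift a site above TIMS-2.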

 

Conclusion

We have now gone through the basic concepts of Tiered Infrastructure Maintenance Standards, so we can draw some conclusions. First, most data centers are maintained at the TIMS-2 level (which the white paper calls the industry norm), and little has changed since 2006. Second, raising the standard takes more than qualified personnel, maintenance plans, SOPs and MOPs; we also need proper tools, including dedicated software (BMS, CMMS, EDMS). This need can be partly covered by implementing a DCIM system, but such systems are not general-purpose, all-in-one solutions: their functionality is well defined, yet not universal. There are no silver bullets, and honestly, in many cases we need a good BMS, CMMS and EDMS more than a complex DCIM. Third, we shouldn't push maintenance into the background and focus only on infrastructure design and efficiency. Both areas are foundations of data center reliability; they are tightly connected, and neither works without the other.

The following is a huge simplification, offered only to reinforce the key ideas of this article, but the final thought can be summarized in an easy-to-remember equation:

 

infrastructure (measured in Tiers) + maintenance (measured in TIMS) = data center operational reliability (measured in uptime [%])
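
To make the uptime percentage on the right-hand side more tangible, it can be converted into hours of downtime per year. The sketch below is plain arithmetic in Python; the function name `annual_downtime_hours` and the 8,760-hour year (leap years ignored) are assumptions of this example, not something taken from the white paper.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored in this sketch


def annual_downtime_hours(uptime_percent: float) -> float:
    """Return the yearly downtime, in hours, implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


for uptime in (99.9, 99.99):
    hours = annual_downtime_hours(uptime)
    print(f"{uptime}% uptime allows about {hours:.3f} hours of downtime per year")
```

With these inputs, 99.9% works out to about 8.76 hours a year and 99.99% to about 0.88 hours (roughly 53 minutes), which shows how quickly the downtime budget shrinks with each additional "nine".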

  
