Ten rules to live by to avoid a data center failure.
Shakespeare could have been speaking as a data center manager when he wrote,
There’s a special providence in the fall of a sparrow. If it be now, ’tis not to come; if it be not to come, it will be now; if it be not now, yet it will come. The readiness is all.
Essentially, if something is fated to happen, it will happen. If not now, then later. If not later, then now. Therefore, it is essential to be ready for when it happens and not if it happens.
Much like Hamlet, a modern data center manager must be ready and prepared to act when there is a critical network failure. Here are ten rules to live by when managing a data center.
1. Have a spare or failover plan for when your equipment fails.
Equipment fails for one of four reasons: dust, heat, misuse, or “other”. Heavy dust collects in environments without air cleaning, and fine particulate dust collects even in facilities with “cleaned” air. Dust is everywhere. Have a plan to manage it.
Heat is the sum of the ambient temperature and the heat generated by the equipment. A small rise in heat from blocked or partially blocked airflow, or even a small reduction in normal airflow, can and often does produce hardware failure.
Misuse is anything from physical damage to an improper firmware upgrade to incorrect configuration or use. Use “proper care and feeding” rules for all your equipment at the data center.
“Other” could be anything from a blown capacitor to a bad chip to a spilled drink. As a data center manager, you must be prepared for any piece of hardware to fail: the network cables, the module, switch, router, PDU, or firewall, the internet connections, and the data center itself.
One can set up critical equipment in failover pairs. Of course, redundancy comes with a cost, but the alternative is system downtime. Before designing or building, know what kind of downtime due to hardware failure your business can accept.
You should have spares available or accessible, or active failovers in place, to meet your business needs.
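One way to put numbers on acceptable downtime is to compare a single device with a failover pair. The sketch below is illustrative only. It assumes the two units fail independently and that failover is instantaneous, neither of which is guaranteed in practice. The function names and the 0.99 figure are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def pair_availability(single: float) -> float:
    """Availability of a failover pair built from two identical units.

    Assumption: the units fail independently and failover is instantaneous.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    # The pair is down only when both units are down at the same time.
    return 1.0 - (1.0 - single) ** 2


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one device
    pair = pair_availability(single)
    print(f"single unit:   {annual_downtime_hours(single):.1f} h/year")
    print(f"failover pair: {annual_downtime_hours(pair):.2f} h/year")
```

Comparing the two figures against the downtime your business can accept gives a rough basis for deciding where redundancy is worth its cost.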
2. Don’t rely on support agreements.
The SolarWinds hack, in which attackers compromised a vendor’s software updates, succeeded in part because the promise of remotely managed security had led firms to reduce their own dedicated data center staff.
Your business almost certainly has support agreements with OEMs (original equipment manufacturers), software companies, hardware support companies, data center support companies, security firms, and local support engineers. However, the agreements are only as good as the staff available at the time you need support. A support timeframe of a few days to repair or replace may be unacceptable for your business. OEMs and hardware companies favor replacing the hardware over troubleshooting what may be a simple fix. A simple cable issue could be checked and fixed in five minutes by someone on the ground, whereas a remote hardware support tech might find it hard to recognize the issue as a cable problem rather than a switch problem.
Support agreements and outsourcing often mean that companies understaff their data center teams. An overreliance on outside support could mean that few people (or no one at all) understand how the data center was set up in the first place.
A person with an intimate knowledge of your data center is more useful than a support agreement.
3. Be vigilant to a possible back door.
As in the Home Depot hack, there may be administrative back doors that leave firms vulnerable to sabotage.
Despite their best efforts, even knowledgeable and talented data center teams cannot guarantee that a data center will be 100% cyber secure. There are many issues that a team must keep up with to avoid being hacked. The procedures are more a matter of people and bookkeeping than of technology. For example, when a network admin leaves a company, the data center team must immediately change all passwords. Any network access available to a previous network admin must be closed. The same goes for any person who leaves the company, even if their network access is limited. Delayed firmware and software upgrades can also leave back doors into the company network.
Your data center is only as safe as the weakest security link, whether human or software.
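The bookkeeping side of this rule can be partly automated. The following is a minimal sketch that assumes you can export, for each system, the set of account names that are still active, and that you keep a list of people who have left. The function name and data layout are hypothetical, not taken from any particular tool.

```python
def accounts_to_revoke(active_accounts, departed):
    """Return, per system, the active accounts that belong to departed staff.

    active_accounts: dict mapping a system name to a set of account names.
    departed: set of account names for people who have left the company.
    """
    findings = {}
    for system, users in active_accounts.items():
        leftover = users & departed  # accounts that should already be closed
        if leftover:
            findings[system] = leftover
    return findings


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    inventory = {
        "core-switch": {"alice", "bob"},
        "firewall": {"carol"},
    }
    print(accounts_to_revoke(inventory, departed={"bob"}))
```

A check like this only finds named accounts. Shared passwords that a departed admin knew still have to be changed by hand.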
4. Have an upgrade fallback plan to avoid data center failure.
Never start a network, software, or firmware upgrade unless you have a fallback plan. Firmware that the OEM calls “better” may work well from the OEM’s perspective but not with the rest of your data center equipment. Your best bet is to try the firmware in a test setup before putting it into production. Testing before production applies to networks of any size. Even with testing before installation, there is always a risk that the test environment differs from the production environment.
Always have an option to fall back to the current setup before starting an upgrade.
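The pattern of upgrade, verify, and fall back can be sketched in a few lines. This is an illustration under assumptions, not a real deployment tool. The three callables are hypothetical placeholders that you would supply for your own equipment, for example a function that pushes firmware, one that runs health checks, and one that restores the previous image.

```python
from typing import Callable


def upgrade_with_fallback(
    apply_upgrade: Callable[[], None],
    verify: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Apply an upgrade and keep it only if verification passes.

    Returns True if the upgrade was kept, False if it was rolled back.
    """
    try:
        apply_upgrade()
        if verify():
            return True
    except Exception as exc:
        # A failed upgrade or a crashing health check both trigger rollback.
        print(f"upgrade failed: {exc}")
    rollback()
    return False
```

The design choice worth noting is that the rollback step is required up front. If you cannot write it, you do not yet have a fallback plan.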
5. Make only essential changes to your data center.
Overpromising and underdelivering is a great formula for failure. Underpromising and overdelivering is a better formula for keeping a data center up and running. Successive minor upgrades are better than large, extensive upgrades that can have unforeseen problems that are hard to troubleshoot.
Upgrade one step at a time to isolate issues as they occur.
6. Design your data center with failures in mind.
Ensure the data center has as few single points of failure, or “SPOFs”, as possible. A SPOF can be anything, including a minor piece of hardware, a script, or a piece of code, that could bring your data center to a halt if it fails. Small data centers tend to have more of them. Large data centers should have almost none. You must be aware of the potential for failures and have plans, procedures, and spares to address the data center failures you expect.
Know your single point(s) of failure and plan accordingly.
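If you keep a list of which devices connect to which, a first pass at finding SPOFs can be automated. The sketch below is one simple approach, not a complete audit. It removes each device in turn and checks whether a path still exists between two endpoints. The device names and the list-of-links format are assumptions made for illustration.

```python
from collections import deque


def reachable(links, src, dst, removed=None):
    """Breadth-first search over undirected links, skipping one removed device."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, src, dst):
    """Devices whose loss alone cuts every path between src and dst."""
    candidates = {n for link in links for n in link} - {src, dst}
    return {n for n in candidates if not reachable(links, src, dst, removed=n)}


if __name__ == "__main__":
    # Hypothetical topology: one firewall in front of two switches.
    links = [
        ("internet", "fw"),
        ("fw", "sw1"), ("fw", "sw2"),
        ("sw1", "server"), ("sw2", "server"),
    ]
    print(single_points_of_failure(links, "internet", "server"))
```

This only covers connectivity between two chosen endpoints. Scripts, code, and power still need to be reviewed separately.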
7. Do not power all your equipment from the same source.
At a hosting facility, ask for power from two separate power distribution cabinets. Switches, servers, and routers should have dual power supplies. Connect each power supply to a different PDU, with each PDU fed from a separate power distribution cabinet.
You should do the same when setting a data center up in your office. If you do not have two separate power sources, you will be setting up your data center with a power failure risk.
A dual power source ensures that power is not your SPOF.
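A cabling inventory makes this rule easy to check. The following sketch assumes two hypothetical mappings, one from each device to the PDUs its power supplies plug into and one from each PDU to its power distribution cabinet. It flags any device that does not reach two distinct cabinets.

```python
def single_fed_devices(device_pdus, pdu_cabinet):
    """Devices whose power supplies do not reach two distinct cabinets.

    device_pdus: dict mapping a device name to a list of PDU names.
    pdu_cabinet: dict mapping a PDU name to its power distribution cabinet.
    """
    at_risk = []
    for device, pdus in device_pdus.items():
        cabinets = {pdu_cabinet[pdu] for pdu in pdus}
        if len(cabinets) < 2:
            at_risk.append(device)
    return sorted(at_risk)


if __name__ == "__main__":
    # Hypothetical inventory: srv1 uses two PDUs, but both hang off cab-1.
    device_pdus = {
        "sw1": ["pdu-a", "pdu-b"],
        "srv1": ["pdu-a", "pdu-c"],
        "rtr1": ["pdu-a"],
    }
    pdu_cabinet = {"pdu-a": "cab-1", "pdu-b": "cab-2", "pdu-c": "cab-1"}
    print(single_fed_devices(device_pdus, pdu_cabinet))
```

Note that two PDUs are not enough on their own. A device cabled to two PDUs on the same cabinet still loses power when that cabinet fails.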
8. Back up configurations, scripts, firmware, and software.
Many firms fail to maintain a version control system for the little pieces of code that allow the equipment to function together. Discovering every little piece of code used to run the data center is half the battle. The other half is choosing a code repository to store it in and teaching the team to upload each piece of code whenever it is used or becomes part of the data center. Without a system to manage code and changes, data center failure is more likely.
Back up your code.
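As a starting point before a proper repository is in place, even a small script can keep every distinct version of a configuration file. The sketch below is a stopgap under stated assumptions, not a replacement for version control, which would also record who changed what and when. The function name and file-naming scheme are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Optional


def snapshot(source: Path, backup_dir: Path) -> Optional[Path]:
    """Copy source into backup_dir unless an identical copy is already there.

    The copy is named <original name>.<first 12 hex chars of its SHA-256>,
    so each distinct version of a file is stored exactly once.
    Returns the new copy's path, or None if nothing changed.
    """
    digest = hashlib.sha256(source.read_bytes()).hexdigest()[:12]
    target = backup_dir / f"{source.name}.{digest}"
    if target.exists():
        return None  # this exact content was already backed up
    backup_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)
    return target
```

Running `snapshot` on each configuration file and script after every change leaves a trail of versions to fall back on.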
9. Manage airflow and cooling to avoid data center failure.
You should have designed your data centers for performance, security, redundancy, backup, failover, and other factors. Before turning the data center on for production, make sure you have considered airflow. A surefire way to have equipment fail is to block airflow with cables or panels or to misdirect airflow by having open rack equipment slots that allow air to flow around equipment and not through the equipment.
Make sure that the cabinet equipment spacers are in place on the cabinet, that module covers are in place, and that cables do not block or affect airflow into any piece of equipment. Small deviations in airflow can create catastrophic equipment failure.
Manage airflow and ventilation FANATICALLY or your data center WILL fail.
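Open rack slots are easy to miss by eye and easy to find from a rack layout. The following sketch assumes a hypothetical layout record in which equipment and spacers (often called blanking panels) are each listed as inclusive ranges of rack units. It reports the units that are covered by neither.

```python
def uncovered_slots(rack_height, equipment, spacers):
    """Rack units that hold neither equipment nor a spacer.

    equipment and spacers are lists of (first_unit, last_unit) ranges,
    numbered from 1 and inclusive at both ends.
    """
    covered = set()
    for first, last in list(equipment) + list(spacers):
        covered.update(range(first, last + 1))
    return [unit for unit in range(1, rack_height + 1) if unit not in covered]


if __name__ == "__main__":
    # Hypothetical 10U rack: gear in units 1-2 and 5-6, a spacer in 3-4.
    print(uncovered_slots(10, equipment=[(1, 2), (5, 6)], spacers=[(3, 4)]))
```

Any unit the function returns is a gap where air can flow around the equipment instead of through it.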
10. Never do Friday upgrades.
The worst time to do an upgrade or try to change a data center is on the Friday before a long weekend. Friday or rushed upgrades are fertile ground for a data center failure. Should anything go wrong, you are left with a disgruntled data center team and a rushed installation or upgrade. Friday is the day of the week when a data center team should be doing final checks to ensure that the data center is stable and secure for the weekend.
The best data center management teams agree to a maintenance window with the operations team to avoid a rushed upgrade. Most maintenance windows we have seen are on Monday afternoons or evenings for smaller companies and Saturday nights for larger companies.
Plan your upgrades so you have time to fix or revert if there is a problem.
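A maintenance window can be sanity-checked before anyone touches production. This sketch assumes you can estimate how long the upgrade and a full revert would each take. It rejects any window that starts on a Friday or that is too short to hold both. The function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta

FRIDAY = 4  # datetime.weekday() counts Monday as 0


def window_allows(start, end, upgrade, rollback):
    """True if the window avoids Friday and fits the upgrade plus a full revert.

    start, end: datetime bounds of the agreed maintenance window.
    upgrade, rollback: timedelta estimates for the work and for reverting it.
    """
    if start.weekday() == FRIDAY:
        return False
    return start + upgrade + rollback <= end


if __name__ == "__main__":
    # Hypothetical four-hour Saturday night window.
    start = datetime(2024, 3, 2, 22, 0)
    end = start + timedelta(hours=4)
    print(window_allows(start, end, timedelta(hours=2), timedelta(hours=1)))
```

Budgeting for the revert as well as the upgrade is the point of the check. A window that only fits the upgrade leaves no time to back out.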