March 27, 2026 | First published June 24, 2021

Ten ways to avoid a data center failure

There’s a special providence in the fall of a sparrow. If it be now, ’tis not to come; if it be not to come, it will be now; if it be not now, yet it will come. The readiness is all.
~ Hamlet

Shakespeare could have been speaking as a data center manager. If something is fated to happen, it will happen; if not now, then later, and if not later, then now. The only sensible posture is to be ready for when it happens, not to wonder whether it will.

Much like Hamlet, a modern data center manager must be ready to act when a critical network failure occurs. Failure in a data center is not a question of if, but of when. Hardware fails, configurations drift, people make mistakes, and small issues cascade faster than expected. The only variable is timing.

No one writes a post-mortem about the failure that did not happen. The spare that was on the shelf. The rollback that took four minutes. The credential that was revoked the day someone left. That is what readiness looks like.

1. Have a spare or failover plan

Equipment fails for four reasons: dust, heat, misuse, and everything else.

Dust is everywhere. Even in filtered environments, fine particulates build up and restrict airflow. Heat is cumulative. A small obstruction or reduced airflow can quietly push equipment past its tolerance. Misuse ranges from physical damage to bad firmware to incorrect configuration.

“Everything else” covers anything from failed components to accidental damage. It does not matter what failed; what matters is how quickly you can recover.

In the 2017 AWS S3 outage, a routine command entered with a small error removed far more capacity than intended, and the many services that depended on S3 were not designed to fail independently of it.

If you cannot replace or fail over quickly, you are accepting downtime.

2. Do not rely on support agreements

Support agreements look solid until you need them.

The SolarWinds breach exposed how much trust organizations place in external systems and managed services. That trust often replaces internal understanding.

A simple issue on-site can take hours through remote support. A loose cable becomes a diagnostic exercise. A configuration issue becomes a replacement cycle.

A person who knows your environment will fix problems faster than a contract ever will.

3. Assume there is a back door

No environment is fully secure. Access accumulates. Credentials linger. Updates get delayed.

The Home Depot breach started through a third-party account. Not an advanced exploit. Just access that should have been controlled.

When someone leaves, access must go immediately. Firmware and software must stay current. Security is not a one-time task.
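
As a rough illustration, offboarding can be checked mechanically. The sketch below compares a directory export of active accounts against an HR departure list and flags anything that should already be gone; the file names and column headings are placeholders, not a real system.

    # Hypothetical offboarding check: flags accounts belonging to people
    # who have already left. File names and columns are placeholders.
    import csv

    def load_usernames(path, column="username"):
        with open(path, newline="") as f:
            return {row[column].strip().lower() for row in csv.DictReader(f)}

    active_accounts = load_usernames("active_accounts.csv")   # directory export
    departed_users = load_usernames("departures.csv")         # HR offboarding list

    for account in sorted(active_accounts & departed_users):
        print(f"REVOKE: {account} is still active after departure")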

Your environment is only as secure as the access you forgot to remove.

4. Always have a rollback plan

Never start an upgrade without a way back.

Firmware that works perfectly in testing can behave differently under production load. Dependencies, traffic patterns, and edge cases do not show up until it matters.

If you cannot revert quickly, you are not in control of the change.
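
One way to make that concrete: capture a known-good snapshot before the change, decide what “healthy” means, and check it immediately afterwards. The sketch below is generic Python; get_running_config() and health_check() are placeholders for whatever applies to your own gear, not real library calls.

    # Minimal pre-change / post-change sketch. The two helper functions are
    # placeholders: pull the real config and run real reachability tests there.
    import time
    from pathlib import Path

    def get_running_config() -> str:
        return "hostname core-sw-01\n..."       # placeholder for the actual config pull

    def health_check() -> bool:
        return True                             # placeholder: pings, routing adjacencies, services

    snapshot = Path(f"pre_change_{time.strftime('%Y%m%d_%H%M%S')}.cfg")
    snapshot.write_text(get_running_config())   # known-good state, saved before touching anything

    # ... perform the upgrade or configuration change here ...

    if not health_check():
        print(f"Health check failed: restore from {snapshot}, then investigate offline.")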

5. Make fewer, smaller changes

Large changes create large unknowns.

In 2019, a small ISP mistakenly advertised thousands of incorrect BGP routes, briefly disrupting major services including Cloudflare. A single configuration error propagated globally within minutes.

Change one thing at a time, or troubleshoot everything at once.

6. Design out single points of failure

Single points of failure are not always obvious.

In real incidents, access switches have become root bridges and tried to carry traffic loads they were never designed for. Spanning tree elects its root by lowest bridge priority and MAC address, so a switch left at defaults can quietly win the election and pull core traffic through uplinks sized for a wiring closet.
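
A lightweight guard, assuming you already collect “show spanning-tree” output from your switches: compare the reported root bridge address against the one you intended. The intended address and the sample output below are placeholders.

    # Hypothetical check: is the STP root bridge the switch we meant it to be?
    # INTENDED_ROOT and sample_output are placeholders.
    import re

    INTENDED_ROOT = "0008.e3ff.fd90"   # bridge MAC of the core switch configured as root

    sample_output = """
    VLAN0010
      Root ID    Priority    32778
                 Address     001a.6c44.2b01
    """

    match = re.search(r"Root ID.*?Address\s+([0-9a-f.]+)", sample_output, re.S)
    if match and match.group(1) != INTENDED_ROOT:
        print(f"WARNING: root bridge is {match.group(1)}, not {INTENDED_ROOT} - check priorities")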

If one failure can stop everything, it eventually will.

7. Separate your power sources

Dual power supplies only help if they are truly independent.

Many outages trace back to “redundant” systems sharing a single upstream dependency.
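
The arithmetic is unforgiving. The toy calculation below, with made-up availability figures, shows how a single shared device upstream erases most of the benefit of dual feeds.

    # Toy numbers only: how one shared upstream dependency caps redundancy.
    feed = 0.999            # assumed availability of each feed on its own
    shared_device = 0.999   # assumed availability of a single ATS/panel both feeds pass through

    independent = 1 - (1 - feed) ** 2        # two truly independent feeds
    behind_shared = independent * shared_device

    print(f"Independent feeds:        {independent:.6f}")    # ~0.999999
    print(f"Behind one shared device: {behind_shared:.6f}")  # back to ~0.999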

If both feeds fail together, they were never separate.

8. Back up everything that matters

Configurations, scripts, and small pieces of code often hold environments together.

Teams usually discover gaps after failure, when rebuilding depends on undocumented pieces.
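
Automating the collection removes most of the guesswork. Below is a minimal nightly backup sketch using the netmiko library; the inventory, credentials, and device types are placeholders, and it assumes devices reachable over SSH.

    # Minimal config backup sketch using netmiko (pip install netmiko).
    # Inventory and credentials are placeholders; set device_type per platform.
    from datetime import date
    from pathlib import Path
    from netmiko import ConnectHandler

    inventory = [
        {"device_type": "cisco_ios", "host": "10.0.0.1", "username": "backup", "password": "change-me"},
        {"device_type": "cisco_ios", "host": "10.0.0.2", "username": "backup", "password": "change-me"},
    ]

    backup_dir = Path("config-backups") / date.today().isoformat()
    backup_dir.mkdir(parents=True, exist_ok=True)

    for device in inventory:
        conn = ConnectHandler(**device)
        config = conn.send_command("show running-config")
        conn.disconnect()
        (backup_dir / f"{device['host']}.cfg").write_text(config)
        print(f"Saved {device['host']}")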

If you cannot rebuild it from backup, it is not backed up.

See also: 12 components of a successful network disaster recovery plan

9. Manage airflow and cooling

Airflow problems are easy to create and hard to detect.

Blocked vents, missing blanking panels, and poor cable management all affect how air moves through equipment. Small inefficiencies add up.
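
Even basic monitoring catches most of this early. The sketch below, with made-up sensor readings, flags rack inlet temperatures that drift outside the commonly cited ASHRAE recommended envelope of roughly 18-27°C.

    # Toy inlet-temperature check; readings are made up, thresholds follow the
    # commonly cited ASHRAE recommended inlet range of about 18-27 C.
    LOW_C, HIGH_C = 18.0, 27.0

    inlet_temps_c = {          # placeholder readings from rack inlet sensors
        "rack-a1": 22.5,
        "rack-a2": 29.1,       # e.g. missing blanking panels recirculating hot exhaust
        "rack-b1": 24.0,
    }

    for rack, temp in sorted(inlet_temps_c.items()):
        if not LOW_C <= temp <= HIGH_C:
            print(f"ALERT: {rack} inlet at {temp:.1f} C (outside {LOW_C}-{HIGH_C} C)")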

Manage airflow and cooling relentlessly or your hardware will fail.

See also: The future of data center cooling

10. Never do Friday upgrades

Timing matters more than people admit.

Changes made before weekends or holidays remove your ability to respond properly. When something breaks, the team is tired, short-staffed, or unavailable.

Give yourself time to fix what you break.

Readiness is all

Failure will come. Not dramatically, and not always for obvious reasons. A cable, a credential, a minor change. The difference between an incident and an outage is readiness.

About NetworkTigers

NetworkTigers is the leader in the secondary market for Grade A, seller-refurbished networking equipment. Founded in January 1996 as Andover Consulting Group, which built and re-architected data centers for Fortune 500 firms, NetworkTigers provides consulting and network equipment to global governmental agencies, Fortune 2000, and healthcare companies. www.networktigers.com.

Mike Syiek
Mike Syiek is Founder and President of NetworkTigers and has more than two decades of experience in networking, IT infrastructure, and the global technology supply chain.
