The Failover Fantasy

The redundancy is built. The health checks are configured. The test ran successfully. The standby instance came online in 45 seconds. The report was filed. Leadership is reassured.

The test ran against a clean staging environment with no load, no state, and no active connections. Production has none of these luxuries.

⸻

The State Problem

Failover works when both sides have the same state. In testing, they do. The database is small. The replication lag is negligible. The application is stateless.

In production, the standby has fallen 12 seconds behind the primary during peak load. When the failover fires, the standby comes online with data that is 12 seconds old. The application begins serving stale data. Customer transactions committed during those 12 seconds are missing.

The health checks are green. The system is running. The data is wrong.

Nobody tested this, because nobody tests failover under load. The test environment does not have the traffic to reveal the replication gap.
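
One way to make the gap visible is to have the promotion step look at replication lag, not just at health. The following is a minimal sketch, not a real database API: ReplicaStatus, promotion_decision, the one-second threshold, and the transaction rate in the usage lines are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    """Hypothetical snapshot of a standby; field names are illustrative."""
    healthy: bool             # what a typical health check reports
    replication_lag_s: float  # seconds the standby trails the primary


def promotion_decision(status: ReplicaStatus, max_lag_s: float = 1.0) -> str:
    """Classify a promotion instead of treating 'healthy' as 'safe'.

    Returns 'refuse', 'promote', or 'promote_with_data_loss'.
    max_lag_s is an assumed tolerance, not a recommended value.
    """
    if not status.healthy:
        return "refuse"
    if status.replication_lag_s > max_lag_s:
        # The health check is green, but recent writes never reached the standby.
        return "promote_with_data_loss"
    return "promote"


def estimated_lost_transactions(lag_s: float, tx_per_s: float) -> float:
    """Rough upper bound on writes missing from the standby."""
    return lag_s * tx_per_s


if __name__ == "__main__":
    standby = ReplicaStatus(healthy=True, replication_lag_s=12.0)
    print(promotion_decision(standby))
    # 50 transactions per second is an assumed rate, chosen only for the arithmetic.
    print(estimated_lost_transactions(standby.replication_lag_s, tx_per_s=50.0))
```

The point of the third return value is that it forces someone to acknowledge the loss at promotion time, instead of discovering it in customer complaints.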

⸻

The Connection Storm

When the primary fails, every client loses its connection simultaneously. Every client reconnects simultaneously. The standby receives a connection storm that vastly exceeds the steady-state load.

Connection pools overflow. Thread pools are exhausted. The standby, which handles normal traffic without issue, collapses under the synchronized reconnection wave that failover produces.

The system failed over successfully. The standby came online. Then the standby failed, because nobody designed for the transient load pattern that failover creates.
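
A common client-side mitigation for this pattern is exponential backoff with jitter, which spreads reconnect attempts over time instead of letting them arrive together. The sketch below is one possible shape for it; the function name reconnect_delay and the default base and cap values are assumptions, not taken from any particular client library.

```python
import random
from typing import Optional


def reconnect_delay(attempt: int,
                    base_s: float = 0.5,
                    cap_s: float = 30.0,
                    rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    Exponential backoff with full jitter: the ceiling doubles on each attempt
    up to cap_s, and the actual delay is drawn uniformly from [0, ceiling].
    """
    source = rng if rng is not None else random
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return source.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Illustration: 1,000 clients that lost their connection at the same moment.
    rng = random.Random(42)
    delays = [reconnect_delay(3, rng=rng) for _ in range(1000)]
    print(f"reconnects spread from {min(delays):.2f}s to {max(delays):.2f}s")
```

Jitter only helps if the clients actually use it. A drill under real traffic is what shows whether they do.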

⸻

The DNS Propagation

The failover changes the DNS record. The standby is now the primary. The TTL expires in five minutes.

But some clients cache DNS aggressively. Some internal services ignore TTL entirely. For the next 30 minutes, a share of traffic continues to hit the dead primary. These requests time out. The users see errors. The monitoring dashboard shows the failover as successful, because it measures the standby, not the clients still pointed at the corpse.
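
Measuring from the client side could look something like the sketch below. It assumes each client host reports which addresses it currently resolves for the service name; resolve_ipv4 uses the standard library's socket.getaddrinfo, while stale_clients, the client names, and the 192.0.2.x documentation addresses are hypothetical. Note that getaddrinfo reflects the resolver the process sees, so it will not reveal an application that cached an address inside its own connection pool.

```python
import socket
from typing import Dict, List, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """IPv4 addresses this host currently resolves for hostname."""
    infos = socket.getaddrinfo(hostname, None,
                               family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is (address, port).
    return {info[4][0] for info in infos}


def stale_clients(observed: Dict[str, Set[str]], dead_primary_ip: str) -> List[str]:
    """Names of clients whose resolution still includes the dead primary."""
    return sorted(name for name, ips in observed.items()
                  if dead_primary_ip in ips)


if __name__ == "__main__":
    # Hypothetical reports collected from three client hosts after failover.
    observed = {
        "checkout": {"192.0.2.10"},                # still on the old primary
        "billing": {"192.0.2.20"},                 # moved to the standby
        "search": {"192.0.2.10", "192.0.2.20"},    # mixed answers
    }
    print(stale_clients(observed, dead_primary_ip="192.0.2.10"))
```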

⸻

The Honest Drill

A failover that has not been tested in production under load is a theory, not a capability.

Run a failure drill during business hours. Not in staging. In production. With traffic. Measure the blast radius. Measure the data loss. Measure the recovery time with real clients reconnecting against real state.
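
As a rough illustration of what "measure" can mean here, the sketch below computes recovery time and error rate from client-side probe results gathered during a drill. The Probe record and both function names are assumptions for this example. Recovery time is defined, arbitrarily but explicitly, as the span from the first failed probe to the first success after the last failure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Probe:
    """One client-side request made during the drill (hypothetical record)."""
    t_s: float  # seconds since the drill started
    ok: bool    # whether the request succeeded


def recovery_time_s(probes: List[Probe]) -> Optional[float]:
    """Seconds from the first failure to the first success after the last failure.

    Returns 0.0 if nothing failed, and None if the service never recovered
    within the observed window.
    """
    ordered = sorted(probes, key=lambda p: p.t_s)
    failures = [p.t_s for p in ordered if not p.ok]
    if not failures:
        return 0.0
    first_fail, last_fail = failures[0], failures[-1]
    for p in ordered:
        if p.ok and p.t_s > last_fail:
            return p.t_s - first_fail
    return None


def error_rate(probes: List[Probe]) -> float:
    """Fraction of probes that failed."""
    if not probes:
        return 0.0
    return sum(1 for p in probes if not p.ok) / len(probes)
```

Counting from the first failure to the last one, not to the first success, matters: a standby that comes up and then collapses under reconnects would otherwise look like a fast recovery.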

This is uncomfortable. It might cause a brief degradation. But a planned, measured degradation is far cheaper than an unplanned one at 2 AM during peak season, when you discover for the first time that the failover does not work.
