Focus on Shadowbase Product Management

Keith B. Evans

Why an Active/Passive Business Continuity Solution is Not Good Enough

Active/Passive – Classic Disaster Recovery

In an active/passive architecture, all transactions are executed on a single system (the active node), and the database updates are replicated to a backup system (the passive node). If the active node fails, a failover to the backup node is executed: users are switched to the backup node, the applications are brought up with the local (synchronized) database opened for read/write access, and processing resumes. This architecture is by far the most common, but it is also the most flawed.

The key issue with this architecture is that the backup node and the failover procedures are very difficult to test. Because testing requires an outage of the primary node and may take a long time, failover testing is very often not performed at all, or is not run to completion because it takes longer than the available outage window. Restarting the production system after the test may also fail, which is another reason why testing is skipped. Given this lack of testing and the resulting uncertainty about the state of the backup system and the takeover procedures, management is often slow to initiate a failover when a real outage occurs, which delays recovery further. The architecture is therefore risky: the state of the backup system is not really known, failover faults are likely to occur, and the failover may be unsuccessful or at least take a long time. For all of these reasons, recovery with this architecture can take several hours or even days. A basic active/passive architecture offers some protection, but it is not the best solution and should only be considered as a starting point, or used for non-mission-critical applications.

Active/“Almost-Active” – Sizzling-Hot-Takeover (SZT)

Although it looks almost the same as a classic active/passive architecture, sizzling-hot-takeover (SZT) has one difference that makes it a much better solution. All transactions are still routed to a single active node, but the backup node’s applications are already up and running, with the local database open for read/write access. The key benefit of SZT over active/passive is that the backup system is ready to go when you need it. Because the applications are running on the backup node with the local database open read/write, test transactions can be sent at any time to validate the backup system, with no impact on the active system. The backup system can therefore be validated regularly, and it becomes a known-working system. It is, to all intents and purposes, a fully active system, except that it is not processing online transactions. When the primary node suffers an outage, the decision to fail over can be made immediately, with confidence that failover faults will not arise and that the failover will succeed quickly. Recovery times for SZT are much shorter and more repeatable than for classic active/passive architectures. SZT is the minimum level of business continuity solution that should be employed for mission-critical applications.
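The regular validation of the backup node can be pictured with a small simulation. This is a minimal Python sketch under stated assumptions, not Shadowbase code: the Node class, the validate_backup function, and the TEST- key prefix are hypothetical names invented for illustration, and a "test transaction" is assumed to be a write to a reserved key followed by a read-back on the backup node's own database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a system with a local read/write database."""
    name: str
    app_running: bool = True
    database: dict = field(default_factory=dict)

    def execute(self, key: str, value: str) -> str:
        """Run one transaction: write the value, then read it back."""
        if not self.app_running:
            raise RuntimeError(f"{self.name}: application is not running")
        self.database[key] = value
        return self.database[key]


def validate_backup(backup: Node, probe_id: int) -> bool:
    """Send a test transaction to the backup only; the active node is untouched."""
    key = f"TEST-{probe_id}"          # reserved test key (an assumption)
    try:
        ok = backup.execute(key, "ping") == "ping"
    except RuntimeError:
        return False
    backup.database.pop(key, None)    # remove the test record again
    return ok


active = Node("active")
backup = Node("backup")
print(validate_backup(backup, 1))     # True: the backup is known to be working
backup.app_running = False            # simulate a fault on the backup
print(validate_backup(backup, 2))     # False: found before a real outage occurs
```

Because the probe only touches the backup node, it could be scheduled as often as desired; in a classic active/passive setup the equivalent check would require an outage of the primary.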

Active/Active – Partitioned

In a partitioned active/active architecture, the applications are active on all nodes, transactions are routed to all nodes, and each node has a copy of the database, which is kept synchronized by bi-directional data replication. The data is partitioned, however, so that each transaction is routed to a specific node based on some key in the data, the user from which the transaction originated, or a similar criterion. For example, the database may be split by customer name, with all transactions for customers A-M executed on one node and those for customers N-Z on the other. This architecture provides most of the benefits of active/active, but avoids one of the biggest potential issues: data collisions, where the same record is updated simultaneously on multiple nodes, resulting in database inconsistency.
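The customer-name split in the example above can be sketched as a simple routing function. This is an illustrative Python sketch only: the node names node1 and node2 and the function route_by_customer are hypothetical, and a real deployment would route on whatever key suits the application.

```python
# Hypothetical node names; the A-M / N-Z split follows the example in the text.
PARTITIONS = {
    "node1": ("A", "M"),
    "node2": ("N", "Z"),
}


def route_by_customer(customer_name: str) -> str:
    """Return the node that owns this customer's partition."""
    first = customer_name.strip()[:1].upper()
    for node, (low, high) in PARTITIONS.items():
        if low <= first <= high:
            return node
    # Names that do not start with A-Z have no owner in this simple scheme.
    raise ValueError(f"no partition for customer name {customer_name!r}")


for name in ("Evans", "Nguyen", "zimmer"):
    print(name, "->", route_by_customer(name))
```

Because every update for a given customer is executed on exactly one node, two nodes never update the same customer record at the same time, which is why this model avoids data collisions.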

The benefits of active/active architecture compared to classic active/passive and SZT are:

Faster recovery: on an outage, only half of the users (in a two-node configuration) are affected and need to be switched. The other half of the users see no outage at all.
Less data loss: about half as much data is lost (in a two-node configuration), because only the updates still in the replication stream on the failed node are lost.
No testing costs or issues, and no failover faults: all systems in the configuration are known to be working (this is also true of SZT).
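The first benefit is simple arithmetic, sketched below in Python. The function name users_to_switch and the figure of 10,000 users are illustrative assumptions, as is the even spread of users across nodes; the text itself only states the two-node case.

```python
def users_to_switch(total_users: int, nodes: int, active_active: bool) -> int:
    """Users who must be switched when a single node carrying users fails."""
    if total_users < 0 or nodes < 2:
        raise ValueError("need a non-negative user count and at least two nodes")
    if not active_active:
        # Active/passive and SZT: every user is on the one active node.
        return total_users
    # Active/active: users are assumed to be spread evenly; round up.
    return (total_users + nodes - 1) // nodes


print(users_to_switch(10_000, 2, active_active=False))  # 10000
print(users_to_switch(10_000, 2, active_active=True))   # 5000
```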

Active/Active – Route Anywhere

Active/active-route anywhere architecture is the same as the active/active-partitioned model described above, except that the partitioning aspect is removed; any transaction can be executed by any node (hence the name, route anywhere). This architecture has all the benefits of the active/active-partitioned model, but also eliminates two of the issues with that model. It does not require partitioning (which may not be possible in all cases), and the workload is evenly load-balanced across the nodes, since transaction routing is unconstrained.

There is always a price to pay, and in this case, it is the possibility of data collisions. For some applications data collisions may be practically impossible. (For example, it is highly unlikely that the same credit/debit/ATM card would be used simultaneously for multiple transactions.) If collisions are possible, then they must be addressed. Shadowbase software includes functionality to automatically detect, report, and resolve data collisions. User exits are also provided to enable more sophisticated processing of data collisions if necessary.
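To make the idea of a data collision concrete, here is a minimal Python sketch. It does not show how Shadowbase detects or resolves collisions; it assumes one common approach, in which each replicated change carries the value the source node saw before its update, a mismatch with the target's current value signals a collision, and the later timestamp wins (with the node name as a tiebreaker). The Update and Replica classes and all their fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Update:
    """A replicated change: what the source node saw before, and what it wrote."""
    key: str
    before: Optional[str]
    after: str
    timestamp: float
    node: str


class Replica:
    """Hypothetical node holding a local copy of the database."""

    def __init__(self, name):
        self.name = name
        self.db = {}
        self.stamp = {}   # key -> (timestamp, node) of the last accepted write

    def local_update(self, key, value, timestamp):
        """Apply a local transaction and return the change to replicate."""
        upd = Update(key, self.db.get(key), value, timestamp, self.name)
        self.db[key] = value
        self.stamp[key] = (timestamp, self.name)
        return upd

    def apply_replicated(self, upd):
        """Apply a change from another node, detecting and resolving collisions."""
        collided = self.db.get(upd.key) != upd.before
        incoming = (upd.timestamp, upd.node)
        if collided and incoming <= self.stamp.get(upd.key, (float("-inf"), "")):
            return "collision: kept local value"
        self.db[upd.key] = upd.after
        self.stamp[upd.key] = incoming
        return "collision: took incoming value" if collided else "applied"


a, b = Replica("A"), Replica("B")
a.db["acct"] = b.db["acct"] = "100"          # both copies start synchronized
ua = a.local_update("acct", "90", timestamp=1.0)
ub = b.local_update("acct", "80", timestamp=2.0)
print(b.apply_replicated(ua))                # collision: kept local value
print(a.apply_replicated(ub))                # collision: took incoming value
print(a.db == b.db)                          # True: both copies converge on "80"
```

A latest-timestamp policy is only one possible rule. Applications with different requirements would substitute their own logic at the point where the collision is detected.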

Conclusion

To implement a business continuity plan, an IT architecture must be selected that maintains services in the event of an outage (planned or unplanned). Many users select, and never get beyond, a basic active/passive architecture, but it has many issues that can prevent a successful and timely failover. This model is reactive and risky, and it provides a false sense of security. The likelihood of an extended outage is high; consequently, so is the likelihood of a very expensive outage. Active/passive architectures are simply not good enough for mission-critical applications.

The more sophisticated business continuity solutions (SZT and active/active), while more complex and somewhat more expensive to implement, are in fact far more cost-effective when looked at through the lens of total cost of ownership (TCO). If you are running an active/passive architecture, it will probably only take one outage during peak hours to find this fact out the hard way. SZT is only marginally more complex than active/passive to implement, yet the benefits are significant. SZT should be the absolute minimum architecture chosen for applications which must remain available.

SZT itself should only be seen as a stepping stone to a fully active/active architecture. An active/active architecture halves outage and data loss costs (in a two-node configuration) and significantly improves the utilization of systems, since there is no idle backup system. Active/active architectures provide the only acceptable solution for applications that have high-value transactions where data loss must be minimized, and/or that must be continuously available.

The remedy is in your hands. The attention-grabbing outage headlines need never apply to your company, nor the embarrassing meetings with senior management to explain why the outage persisted for so long. With Shadowbase data replication, the solutions are available today to make extended outages a thing of the past. If you are currently using an active/passive configuration, do not pass go; move immediately to an SZT setup. If you are already running in SZT mode, congratulations, but also consider whether you should be taking the next step to a fully active/active implementation. Making the right decision in hindsight is too late when it comes to protecting the availability of your mission-critical applications and data.

More information about these products can be found at http://hp.com/go/nonstopcontinuity. A recent article on this subject that was published in The Connection is also available here.
