Business Continuity FAQ

Frequently Asked Questions

Question #1: Has Gravic ever been involved with helping a customer recover its system after a significant disaster? If so, what was the worst part of the recovery?

Answer: Yes, we have, and quite frankly the personnel involved are both the greatest asset and the greatest weakness in a successful and rapid recovery.

Essentially, the preparation and availability of the personnel make all the difference. A famous IT colleague once quipped, “When things go wrong, people get stupider.” In our experience that is absolutely true. Almost no one is at their best in the wee hours of the morning over a long holiday weekend, which is invariably when a system crash happens.

During a recovery, it is the people who know the systems, infrastructure, and especially the application software and application data that matter the most.

They are the ones who can save the business, as they usually have the knowledge about what ultimately needs to be done to “fix” what is wrong to get the application and data back online and functioning properly. Having technical people who know performance, data comm., security, etc., matters as well; but when data loss is suspected or the application keeps faulting, the applications team is the best source of information for determining what matters now and what can wait until later. These employees must be educated and practiced in the recovery procedures for the application. Companies must invest in actual and formal testing, practicing, and education on how to recover operations when a serious fault occurs.

HPE NonStop Servers “hide” localized faults so well from staff that many become complacent and think they and their environments are invincible. But this complacency can kill a business if it does not prepare for the inevitable faults, such as datacenter fires, regional power grid outages, or even the horrors of the next 9/11.

More importantly, why are we talking about “recovering” at all?

It is a uni-directional, active/passive term, associated with the older disaster recovery architectures. More advanced, higher levels of availability can be deployed to avoid the need for a recovery, meaning that the application services at another location actually survive the geographic fault and continue the application processing.

In a best practice architecture, this service will occur automatically.

We always advise customers to move beyond a uni-directional, active/passive, “recovery” architecture and into the more advanced and higher performing sizzling-hot-takeover and fully active/active, automatic failover architectures to achieve these benefits.

Question #2: Are there any differences in the effort to recover an environment using ENSCRIBE, SQL/MP or SQL/MX? If so, what are they?

Answer: Put simply, there are massive differences! The first step is to realize that each environment is different.

Enscribe allows a lot more flexibility in recovering lost or broken data and files than the SQL environments do. However, consider what is being used to back up the environment. Is it tape- or virtual-tape-based backup/restore? Data replication? Or nothing, meaning that no method for business continuity (BC) is in use?

The second step is to understand the details.

Please realize that each HPE NonStop file system supports several file/table types, including structured and unstructured types, and each type requires a slightly different recovery process.

More specifically, is the environment:

    • Without a BC solution?
      • If that is the case, forget it. There is nowhere to go from there, except to quit and close the business. You just earned your “CNN Moment.” It is critical for companies to have BC for crisis recovery. This is comparable to a fire department having access to water during a fire.
    • Using TMF protection?
      • The recovery process differs somewhat between TMF-protected and non-TMF environments, although there are some similarities.
    • Using tape or virtual tape?
      • Hopefully, the backup/restore capabilities for all file/table types have been tested.
    • Using SQL?
      • SQL has extra requirements when doing a backup/restore, such as making sure that the catalog information is recorded. Failing to do so can lead to an inability to register the tables in a new catalog when they are restored. Additionally, SQL generally requires that the application programs be re-compiled (e.g., via SQLCOMP) after the tables are restored, which can also take time, especially if there are hundreds to thousands of application programs.
    • Backing up open files/tables?
      • This method can lead to broken and unusable files/tables being restored. (By open we mean the file or table is opened for updating by the application.) To avoid this issue, the data files/tables need to be closed for updating, which typically means an application outage. Is this step feasible for the business each time a backup is taken?
    • Using TMF?
      • TMF-based recovery is a best practice in the tape/virtual tape space (e.g., online dump/restore and audit trail roll-forward). TMF online dumps allow the application to remain online for updating during the dump operation, and there is no fear of broken files/tables being backed up. Hence, no application outage is needed. Saving the audit trails provides the “roll-forward” capability to bring the file/table being restored current and consistent, applying all committed data and removing any inconsistent/aborted data that occurred during the online dump. For SQL data recoveries, the same issue of re-compiling the programs against the restored tables applies, which can be lengthy when there are many application programs.
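
The roll-forward idea in the TMF bullet above can be sketched in a few lines of Python. This is a toy model only, not TMF's record format or interface: the audit record layout `(txid, op, key, value)` and the function name `roll_forward` are assumptions made for illustration. It also assumes the dump is already consistent, so it does not model backing out uncommitted changes that an online dump may have captured.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical audit record: (txid, op, key, value), where op is
# "put", "delete", "commit", or "abort". Not TMF's real record format.
AuditRecord = Tuple[int, str, Optional[str], Optional[str]]


def roll_forward(dump: Dict[str, str], audit_trail: List[AuditRecord]) -> Dict[str, str]:
    """Rebuild a table from a dump plus the audit records written after it."""
    # Pass 1: find which transactions committed.
    committed = {txid for txid, op, _, _ in audit_trail if op == "commit"}

    # Pass 2: replay, in order, only the changes of committed transactions.
    table = dict(dump)
    for txid, op, key, value in audit_trail:
        if txid not in committed:
            continue  # aborted or still in flight at failure: never applied
        if op == "put":
            table[key] = value
        elif op == "delete":
            table.pop(key, None)
    return table


dump = {"acct-1": "100", "acct-2": "50"}
trail = [
    (1, "put", "acct-1", "80"),
    (2, "put", "acct-2", "999"),
    (1, "put", "acct-3", "20"),
    (1, "commit", None, None),
    (2, "abort", None, None),
    (3, "delete", "acct-2", None),  # no commit record: in flight at failure
]
restored = roll_forward(dump, trail)
print(restored)
```

In this sketch, transaction 1 is applied in full, while the aborted transaction 2 and the unfinished transaction 3 leave no trace, so `acct-2` keeps its dumped value.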

Regardless, any form of backup/restore or online dump/recovery will take more time than using a change data capture (CDC)-based data replication engine to replicate, back up, and recover data.

The third step is to understand how CDC-based data replication engine products work; they track, copy, and distribute new application data changes in near real-time from a source system to a target system.

At the time of failure, data replication products start with a target database already loaded and synchronized with the source database; therefore, most of the data is already available to applications in the backup location. Generally, the data replication product lags the changes being made by the application at the source by a few seconds (to perhaps a few minutes in special cases); this lag is called replication latency. Failing over is then generally quite fast: the data replication engine might need to clean up any incomplete transactions in process at the time of failure, but the database is then usually available for the applications to come up from that point forward. In some architectures (e.g., active/active), the target application environment could already be running and is thus available instantaneously. This practice is best, as it reduces the risk that a failover fault will occur and prevent the recovery from proceeding successfully.
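
To make the capture, ship, and apply cycle concrete, here is a minimal Python sketch of a CDC-style replicator. It is an illustration under stated assumptions, not how Shadowbase or any other product is implemented: the class `ToyReplicator`, its methods, and the change layout are all hypothetical, and latency is expressed as a count of unapplied changes rather than as elapsed time.

```python
from collections import deque
from typing import Deque, Dict, Tuple

# Hypothetical change layout: (sequence number, key, new value).
Change = Tuple[int, str, str]


class ToyReplicator:
    """Ships changes from a source change log to a target copy, in order."""

    def __init__(self) -> None:
        self.source: Dict[str, str] = {}
        self.target: Dict[str, str] = {}
        self.pending: Deque[Change] = deque()  # captured but not yet applied
        self.next_seq = 1

    def write(self, key: str, value: str) -> None:
        """Application update at the source; the change is captured for shipping."""
        self.source[key] = value
        self.pending.append((self.next_seq, key, value))
        self.next_seq += 1

    def replicate(self, max_changes: int) -> int:
        """Apply up to max_changes captured changes to the target, oldest first."""
        applied = 0
        while self.pending and applied < max_changes:
            _, key, value = self.pending.popleft()
            self.target[key] = value
            applied += 1
        return applied

    def lag(self) -> int:
        """Replication latency, expressed here as a count of unapplied changes."""
        return len(self.pending)


rep = ToyReplicator()
rep.write("order-1", "new")
rep.write("order-1", "paid")
rep.write("order-2", "new")

rep.replicate(max_changes=2)
lag_before = rep.lag()            # one change has not reached the target yet

rep.replicate(max_changes=10)     # drain the rest
in_sync = rep.target == rep.source
```

Because the target already holds almost all of the data, a failover in this model only has to deal with whatever is still in `pending`, which is why the latency window determines how much data is at risk.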

Recognize the Benefits of a CDC Data Replication Engine

Some data replication architectures enable an application to remain online during a recovery. By comparison, application outages are longer with tape/virtual tape solutions. We generally view backup/restore (or online dumps/roll-forward) as having a mean time to recover (MTTR) of hours to days vs. seconds to minutes for data replication. Since the backup database is available at all times, the SQL programs on the target can already be SQL-compiled, so no extra time is needed for that operation. Essentially, during a failover, the data replication engine handles the data, which allows the team to focus on ensuring that the application is up and running and that network communications are appropriately switched to route user requests to the surviving system and application environment.

Question #3: What is the biggest advantage of using a data replication product like Shadowbase software to recover data vs. virtual tape?

Answer: Comparing virtual tape and a data replication solution is not a proper apples-to-apples comparison. In a business continuity (BC) context, the two simply are not comparable in today’s high to continuous availability world for business- and mission-critical applications. We previously wrote an article in The Connection about this topic.

Data Replication

In a BC context, the whole goal of data replication is to keep one or more copies of the data up-to-date in another location. If the copy is geographically dispersed, then true geographic disaster tolerance exists, because when a failure occurs, applications can quickly be recovered using the remote data copy. Data replication should be used to maintain a consistent and complete copy of the data in another location that is quickly accessible to the applications and is current at all times. By current, we mean that the copy is up-to-date, or has only very low latency, where the changes being replicated into it are only a moment (e.g., sub-second to a few seconds) behind the time when the changes were made at the source database.

Tape/Virtual Tape

Tape/virtual tape works on a different principle. It provides a data copy capability, and is perfect for making an accurate and consistent copy of data at a particular point-in-time, and freezing that copy at the time it was taken. However, the copy itself is not directly available for applications to use, and if it is needed, it will take some time (hours to days) to restore. Tape/virtual tape should be used to save a consistent and complete copy of the data at a particular point-in-time (e.g., daily, weekly, monthly, quarterly, or yearly). It is quite useful for storing online dumps and audit trails in case they are later needed for file or table recovery.

The Modern Datacenter Needs Both Tape/Virtual Tape Operations and Data Replication

The issue with data replication is that it is so fast and absolute. Accidentally purging a necessary file or table on the source will lead to that purge being replicated to the target, and the file or table is then lost from both copies. However, tape/virtual tape can accurately retrieve a snapshot of that file or table as of the point when it was backed up, and recovery can proceed from there.
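
The following Python sketch illustrates the point with a deliberately simple model. Everything in it is hypothetical (the table names, the `purge` and `restore_from_snapshot` helpers, and the use of in-memory dictionaries to stand in for the source, the replicated target, and the tape copy); it only shows why a replica alone cannot undo a mistake that a point-in-time copy can.

```python
import copy
from typing import Dict

Database = Dict[str, Dict[str, str]]  # table name -> rows

source: Database = {"CUSTOMER": {"c1": "Ada"}, "ORDERS": {"o1": "c1"}}
target: Database = copy.deepcopy(source)    # replicated copy, kept current
snapshot: Database = copy.deepcopy(source)  # point-in-time backup (tape/virtual tape)


def insert(table: str, key: str, value: str) -> None:
    """Application update at the source, mirrored to the target by replication."""
    source[table][key] = value
    target[table][key] = value


def purge(table: str) -> None:
    """Purge a table at the source; replication faithfully mirrors the purge."""
    source.pop(table, None)
    target.pop(table, None)


def restore_from_snapshot(table: str) -> None:
    """Bring a table back exactly as it was when the snapshot was taken."""
    source[table] = copy.deepcopy(snapshot[table])


insert("ORDERS", "o2", "c1")   # happens after the snapshot was taken
purge("ORDERS")                # the accidental purge
lost_everywhere = "ORDERS" not in source and "ORDERS" not in target

restore_from_snapshot("ORDERS")
recovered_rows = source["ORDERS"]
```

After the restore, `recovered_rows` contains only the order that existed when the snapshot was taken; the later order `o2` would have to be recovered by other means, such as rolling forward saved audit trails.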

Please submit your questions to us, and we will get back to you as soon as possible!