What is Disaster Recovery as a Service (DRaaS)?

Until the last few years, companies wanting quick failover to a remote site faced enormous cost, time, and complexity. Consequently, only very large companies with deep pockets could afford to implement offsite disaster recovery.

Today, technology and the internet have enabled highly cost-effective ways to deliver DR services to organizations that traditionally could not afford such capabilities. At the same time, today's consumers expect the services they use to always be available, which drives many companies to look at implementing DR failover so business is not interrupted.

Hosting your own disaster recovery site can be cost-prohibitive, not only in money but also in time and effort. Costs include hosting the remote site, managing servers, managing applications, monitoring the backups and replication, and regular testing. However, it's possible to offset these costs by using a service provider offering Disaster Recovery as a Service, or DRaaS.

DRaaS is a way for organizations to use a service provider, like Managecast, that protects virtual servers in a cloud environment by offering the infrastructure, software, and management for the DR solution.

Failover

Organizations utilizing DRaaS replicate their data to the service provider continuously or periodically, depending on their desired Recovery Point Objective (RPO). Then, in a DR event, the organization can fail over all or part of its environment by simply powering on its VMs in the service provider's cloud-DR infrastructure and continuing to operate.
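As a rough, vendor-neutral illustration of how the RPO drives the replication schedule, here is a minimal Python sketch (the function name and the 15-minute target are hypothetical) that checks whether the newest replicated restore point still falls within the agreed RPO window:

```python
# Minimal sketch (not tied to any particular DRaaS product): check whether the
# most recent replicated restore point still satisfies the agreed RPO.
from datetime import datetime, timedelta, timezone

def rpo_met(last_restore_point: datetime, rpo: timedelta) -> bool:
    """Return True if the newest restore point is within the RPO window."""
    return datetime.now(timezone.utc) - last_restore_point <= rpo

# Example: data last replicated 3 minutes ago, against a hypothetical 15-minute RPO.
last_point = datetime.now(timezone.utc) - timedelta(minutes=3)
print(rpo_met(last_point, timedelta(minutes=15)))  # True
```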

The organization accesses the failed-over replicas through predefined methods. In a partial failover of only some of the organization's servers, the local network can be extended to the cloud-DR environment, allowing users to access the servers as if they were still hosted locally. Alternatively, in a full failover the organization's servers can be accessed remotely, for example through a web console, VPN, or remote desktop services. Service providers can also provide new public IPs to minimize downtime for public-facing applications.

An example of a web console used for failover and testing DR.

If, after the failover, the organization is able to get its local infrastructure back up and running, then depending on the DR solution it can also fail back to production. Failing back means replicating any changes made in the DR environment during the failover back to the production side.

Testing

After replicating to the service provider, it is necessary to perform regular DR testing to make sure things go smoothly in a real DR situation. Most DRaaS providers allow organizations to perform their own testing, which lets them set their own test criteria.

Testing can be as simple as logging into the service provider's web console, powering on a VM, and verifying application or service functionality.
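To make that verification step concrete, here is a minimal Python sketch of a post-failover smoke test that checks whether key services on the failed-over replicas answer on their expected TCP ports; the host names and ports are placeholders, not part of any provider's tooling:

```python
# Minimal post-failover smoke test: verify that key services on the failed-over
# replicas are answering on their expected TCP ports. Hosts and ports below are
# placeholders for your own environment.
import socket

CHECKS = {
    "web-replica.example.com": 443,   # hypothetical web front end
    "sql-replica.example.com": 1433,  # hypothetical SQL Server
}

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, port in CHECKS.items():
    status = "OK" if port_open(host, port) else "FAILED"
    print(f"{host}:{port} {status}")
```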

Costs

While not all service providers charge for DRaaS the same way, a common model is billing for usage per hour, meaning the organization is charged only for what it actually uses.
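As a back-of-the-envelope illustration of usage-based pricing, the short Python sketch below totals a hypothetical month of standby storage plus one DR test; all of the rates and quantities are assumptions, not any provider's actual pricing:

```python
# Back-of-the-envelope sketch of a usage-based DRaaS bill. The rates are
# entirely hypothetical; real providers price storage, compute, and tests
# differently.
storage_gb = 2048          # replicated data kept at the provider
storage_rate = 0.05        # $ per GB per month (assumed)
compute_hours = 8          # hours of powered-on VMs for a DR test this month
compute_rate = 1.50        # $ per VM-hour (assumed)
vms_tested = 5

monthly_cost = storage_gb * storage_rate + compute_hours * compute_rate * vms_tested
print(f"Estimated monthly cost: ${monthly_cost:,.2f}")  # $162.40
```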

Management

In some cases the DRaaS provider will offer additional management of the replication process. This can include monitoring the replication, alerting the organization to any potential issues, and even providing a fully managed service.

While an organization may view DR as an additional cost, for a DRaaS service provider backup and replication are the sole focus. By using a service provider for DRaaS, the organization gains access to that expertise and can leverage it for any DR need.

Zerto backup fails unexpectedly

We had a recent issue with Zerto backups that took some time to remedy. A combination of issues exposed the problem, and here is a rundown of what happened.

We had a customer with about 2 TB of VMs replicating via Zerto. We wanted to provide backup copies using the Zerto backup capability. Keep in mind Zerto is primarily a disaster recovery product and not a backup product (read more about that here: Zerto Backup Overview). The replication piece worked flawlessly, but we were trying to create longer-term backups of virtual machines using Zerto's backup mechanism, which is different from Zerto replication.

Zerto performs a backup by writing all of the VMs within a VPG to a disk target. It's a full copy, not an incremental, so it's a large backup every time it runs, especially for a VPG holding a lot of VMs. We originally used a 1 Gigabit network to transfer this data, but quickly learned we needed to upgrade to 10 Gigabit to accommodate these frequent large transfers.
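The rough arithmetic behind that upgrade is easy to sketch: even at ideal line rate (real-world throughput will be lower), a ~2 TB full VPG backup ties up a 1 Gbit/s link for hours, as the small Python example below shows:

```python
# Rough arithmetic behind the network upgrade: how long a ~2 TB full VPG backup
# takes at 1 Gbit/s versus 10 Gbit/s, assuming ideal line-rate transfers.
def transfer_hours(size_tb: float, link_gbps: float) -> float:
    size_bits = size_tb * 1e12 * 8        # TB -> bits (decimal units)
    return size_bits / (link_gbps * 1e9) / 3600

for link in (1, 10):
    print(f"{link:>2} Gbit/s: ~{transfer_hours(2, link):.1f} hours")
# ~4.4 hours at 1 Gbit/s, ~0.4 hours at 10 Gbit/s
```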

However, we found that the backup would fail most of the time, seemingly at random. The failure message was:

“Backup Protection Group ‘VPG Name’. Failure. Failed: Either a user or the system aborted the job.”

To resolve the issue we opened several support cases with Zerto, upgraded from version 3.5 to version 4, implemented 10 Gigabit networking, and put the backup repository directly on the Zerto Virtual Manager server.

After opening several cases with Zerto we finally had a Zerto support engineer thoroughly review the Zerto logs. They found frequent disconnection events. With this information we explored the site-to-site VPN configuration and found minor mismatches in the IPsec configuration on each side of the VPN, which were causing very brief disconnections. These disconnections were causing the backup to fail. Lesson learned: it's important to ensure the VPN endpoints are configured 100% identically. We use VMware vShield to establish the VPN connections, and vShield doesn't provide much flexibility to change VPN settings, so we had to change the customer's VPN configuration to match the vShield configuration.

Even though we seemed to have solved the issue by fixing the VPN settings, we asked Zerto if there was any way to make sure the backup process ran even if there was a connection problem. They shared with us a tidbit of information that has enabled us to achieve 100% backup success:

There is a tweak that can be implemented in the ZVM which allows the backup to continue in the event of a disconnection, but the drawback is that the ZVMs will remain disconnected until the backup completes. As of now, there's no way to both let the backup continue and let the ZVMs reconnect. For this customer it was an acceptable risk to have a window of time during which replication stops in order to make a good backup. In our case we ran the backup on Sunday, when the RPO wasn't as critical, and even then replication only halts if there is a disconnection between the sites, which became even more rare once we fixed the VPN configuration.

The tweak:

  • On the Recovery (target) ZVM, open the file C:\Program Files (x86)\Zerto\Zerto Virtual Replication\tweaks.txt (it may be on another drive, depending on the install).
  • In that file, insert the following string (on a new line if the file is not empty):
    t_skipClearBlockingLine = 1
  • Save and close the file, then restart the Zerto Virtual Manager and Zerto Virtual Backup Appliance services.

Now, when you run a backup, either scheduled or manual, any ZVM <-> ZVM disconnection events should not cause the backup to stop.

I hope this helps someone else!

Zerto Backup Overview

Zerto is primarily a disaster recovery solution that relies on a relatively short-term journal, which retains data for a maximum of 5 days (at great expense in disk storage). Many Zerto installations keep only a 4-hour journal to minimize the storage it requires. Zerto is a great disaster recovery solution, but it is not as strong as a backup solution. Many customers will augment Zerto with a backup product for long-term retention of past data.
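A rough sketch of why a longer journal is expensive: the journal has to hold roughly the change rate times the retention window. The figures in the Python example below (2 TB protected, 5% daily change) are assumptions for illustration, not Zerto numbers:

```python
# Rough sketch of why a long journal is expensive: journal size grows with the
# change rate times the retention window. The 5%/day change rate is an
# assumption, not a Zerto figure.
protected_tb = 2.0
daily_change_rate = 0.05   # fraction of protected data rewritten per day (assumed)

for journal_hours in (4, 24, 120):          # 4 hours, 1 day, 5 days
    journal_tb = protected_tb * daily_change_rate * (journal_hours / 24)
    print(f"{journal_hours:>3}h journal: ~{journal_tb:.2f} TB")
# ~0.02 TB at 4h, ~0.10 TB at 24h, ~0.50 TB at 5 days
```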

Long-term retention is the ability to go back to previous versions of data, which is often needed for compliance reasons. Think about the ability to go back weeks, months, and even years to past versions of data. Even if not driven by compliance, the need to go back in time to view past versions of data is very useful in situations such as:

  • Cryptolocker-type ransomware corrupts your data and the corruption is replicated to the DR site.
  • Legal discovery – for example, reviewing email systems as they were months or even years ago.
  • Inadvertent overwriting of critical data, such as a report that is updated quarterly. Clicking “Save” instead of “Save As” is a good example of how this can happen.
  • Unexpected deletion of data that takes time to recognize.

For reference and further clarification, check out the differences between disaster recovery, backup and business continuity.

Even though Zerto is primarily a disaster recovery product, it does have some backup functions.

Zerto backup functionality involves making an entire copy of all of the VMs within a VPG. We sometimes break up VPGs to facilitate efficient backups: one big VPG results in one big backup, which can take many hours (or days) to complete. Since it's an entire copy of the VPG, it takes a significant amount of time and storage space to store the copy. Each backup is a full backup; no incremental/differential backup capability currently exists within Zerto.

It is also advisable to write the backups to a location that supports de-duplication, such as Windows Server 2012. It still takes time to write the backup, but de-duplication will dramatically lower the storage footprint required for backing up Zerto VPGs. Without de-duplication on the backup storage you will see a large amount of storage consumed by each full backup of the VPGs.
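To illustrate the difference, here is a small Python sketch comparing the storage needed for a run of full backups with and without de-duplication; the 2 TB VPG size, 12 retained copies, and 10:1 dedup ratio are assumptions for the example only:

```python
# Illustration of why de-duplication matters for repeated full backups: storage
# for N full copies with and without dedup. The 10:1 ratio is an assumption;
# actual ratios depend on how much data changes between backups.
vpg_size_tb = 2.0
full_backups_retained = 12
dedup_ratio = 10.0   # assumed

raw_tb = vpg_size_tb * full_backups_retained
deduped_tb = raw_tb / dedup_ratio
print(f"Without dedup: {raw_tb:.1f} TB   With ~{dedup_ratio:.0f}:1 dedup: {deduped_tb:.1f} TB")
# Without dedup: 24.0 TB   With ~10:1 dedup: 2.4 TB
```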

Zerto supports the typical grandfather-father-son backup with daily, weekly and monthly backups for 1 year. Zerto currently does not support backups past 1 year, so even with Zerto backups, the long-term retention of data is not as good as with other products designed to be backup products. However, Zerto really shines as a disaster recovery tool when you need quick access to the latest version of your servers. It’s backup capabilities will get better with time.