AWS Disaster Recovery Solution Using the CloudEndure Tool


What is DR?

    Natural disasters such as earthquakes or floods, technical failures such as power or network loss, and human actions can all cause business loss for your organization. To safeguard against such outages, organizations should implement a Disaster Recovery (DR) solution for their on-premises or cloud environments. This blog describes the latest AWS Disaster Recovery solution we implemented for one of our customers using the CloudEndure tool.

Problem statement

The customer had been relying only on a backup strategy, which does not provide business continuity with minimal business impact when a disaster happens. They therefore needed a solution that brings up the infrastructure seamlessly in the event of a disaster.

The customer expected a Recovery Time Objective (RTO) of 45 minutes and a Recovery Point Objective (RPO) of 10 minutes. RTO is the maximum acceptable delay between the interruption of service and the restoration of service. RPO is the maximum acceptable amount of time since the last data recovery point.
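To make the two objectives concrete, here is a minimal sketch that checks the timings of an outage or drill against the 45-minute RTO and 10-minute RPO stated above. The function name and the sample timestamps are illustrative assumptions, not measurements from the customer's environment.

```python
from datetime import datetime, timedelta

RTO = timedelta(minutes=45)  # customer's Recovery Time Objective
RPO = timedelta(minutes=10)  # customer's Recovery Point Objective


def meets_objectives(outage_start, service_restored, last_recovery_point):
    """Return (rto_ok, rpo_ok) for a single outage or drill."""
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_recovery_point
    return downtime <= RTO, data_loss_window <= RPO


# Hypothetical drill timings, for illustration only.
outage = datetime(2021, 1, 1, 12, 0)
restored = datetime(2021, 1, 1, 12, 38)    # 38 minutes of downtime
last_point = datetime(2021, 1, 1, 11, 53)  # 7 minutes before the outage
print(meets_objectives(outage, restored, last_point))  # (True, True)
```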

CloudEndure, an AWS Disaster Recovery tool, minimizes downtime and data loss by providing fast, reliable recovery of servers and data. CloudEndure Disaster Recovery continuously replicates your machines (including the operating system, system state configuration, databases, applications, and files) into a low-cost staging area in your target AWS account and preferred Region.

Target Region

Choosing the target region is a key factor. The customer's compliance requirements and the network latency for end users must be considered when choosing the right region for DR. In our scenario the source region is N. Virginia (us-east-1). The Oregon region (us-west-2) was chosen for the DR setup because it satisfies the customer's compliance requirement (movement of data between countries is not acceptable) and because of its proximity to the source region. Oregon is also cost-effective compared to other regions.

DR Architecture:

[Architecture diagram: Source Region, Staging Area, Target Region]

Implementation

  1. Complete the networking setup for CloudEndure Disaster Recovery.
  2. Register for a CloudEndure DR account, create a DR project, and get an agent installation token.
  3. Install the CloudEndure Agent on the source VMs.
  4. Configure the Blueprint for the target servers in the CloudEndure DR User Console.
  5. Perform failover from us-east-1 to us-west-2.

Complete the networking setup for CloudEndure

Replicate the VPC and other networking setup in the DR region. If there is any private IP dependency for the environment, it is important to maintain the same subnet range.
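As a minimal sketch of the subnet-range requirement, the following check compares the subnet CIDR blocks of the source VPC with those created in the DR region and reports any source range that is missing. The function name and the CIDR values are hypothetical and only illustrate the idea.

```python
import ipaddress


def missing_subnets(source_cidrs, target_cidrs):
    """Return source subnet ranges that have no identical range in the DR VPC."""
    target = {ipaddress.ip_network(cidr) for cidr in target_cidrs}
    return [cidr for cidr in source_cidrs
            if ipaddress.ip_network(cidr) not in target]


# Hypothetical CIDR blocks, for illustration only.
source = ["10.0.1.0/24", "10.0.2.0/24"]
dr = ["10.0.1.0/24", "10.0.3.0/24"]
print(missing_subnets(source, dr))  # ['10.0.2.0/24']
```

An empty result means every source subnet range exists in the DR region, so servers with private IP dependencies can keep their addresses after failover.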

Adjust the security settings in both regions to allow the ports necessary for CloudEndure DR replication and authentication.

Communication over TCP Port 443:

  • Between the Source Machines and the CloudEndure Service Manager.
  • Between the Staging Area and the CloudEndure Service Manager.

Communication over TCP Port 1500:

  • Between the Source Machines and the Staging Area
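The port requirements above can be sketched as security group rules for the staging area. The example below only builds rule dictionaries in the `IpPermissions` shape that boto3's EC2 `authorize_security_group_ingress` and `authorize_security_group_egress` calls accept; it does not call AWS. The helper names, the source CIDR, and the open `0.0.0.0/0` egress range are assumptions for illustration, and the egress range should be narrowed to whatever your security policy allows.

```python
def tcp_rule(port, cidr, description):
    """Build one TCP rule in the EC2 IpPermissions format."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr, "Description": description}],
    }


def staging_area_rules(source_cidr):
    """Rules for the staging area security group (hypothetical helper)."""
    return {
        # Source machines send replicated data to the staging area on TCP 1500.
        "ingress": [tcp_rule(1500, source_cidr, "Replication from source machines")],
        # The staging area talks to the CloudEndure Service Manager on TCP 443.
        # 0.0.0.0/0 is an assumption; restrict it where possible.
        "egress": [tcp_rule(443, "0.0.0.0/0", "HTTPS to CloudEndure Service Manager")],
    }


rules = staging_area_rules("10.0.0.0/16")  # hypothetical source VPC range
print(rules["ingress"][0]["FromPort"], rules["egress"][0]["FromPort"])  # 1500 443
```

The source machines need the matching outbound rules: TCP 443 to the Service Manager and TCP 1500 to the staging area.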

Register for a CloudEndure DR account, create a DR project, and get an agent installation token.

  • Create a CloudEndure DR account by providing an email address

The first step in using CloudEndure is creating a project. There are two types of projects:

  1. Migration
  2. Disaster Recovery

Select Disaster Recovery.

After creating a project, select the source and destination environments. In this solution the source is N. Virginia and the destination is Oregon. To integrate CloudEndure with AWS, an IAM user has to be created with the necessary permissions. The required permissions are listed in the CloudEndure IAM policy.

Source and Destination Environments

Once the replication setup is complete, the CloudEndure agent has to be installed. The installation instructions are available on the How to Add Machines page. Agents are available for Linux and Windows. In this case, the source servers run Windows.

Install CloudEndure Agent on Source VMs

Download the agent installer and run it. During the installation process it prompts for the agent installation token, as shown in the screenshot below.

Once installation is complete on all the servers, CloudEndure starts creating the replication servers in the staging area and launches target machines using the configuration that has been set up in the Blueprint. The Blueprint setup is described below.

Blueprint Configuration

While waiting for the initial data sync to complete and the machines to enter the Continuous Data Protection state, choose Machines in the CloudEndure dashboard to go to the Blueprint page.

The Blueprint can be configured in two ways:

  1. Manual
  2. Automated

Manual

The Blueprint defines how the target VMs are launched, for example the instance type, subnet, security group, private IP, and other details. In manual mode, all of these details have to be entered by hand.
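Because manual entry is error-prone, it can help to sanity-check the values before typing them into the console. The sketch below is a hypothetical helper, not the CloudEndure API or its field names: it holds the settings mentioned above and flags a private IP that falls outside the chosen subnet.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class TargetBlueprint:
    """Hypothetical container for manually entered launch settings."""
    instance_type: str
    subnet_cidr: str
    security_groups: list = field(default_factory=list)
    private_ip: str = ""

    def problems(self):
        """Return a list of human-readable issues; empty if none are found."""
        issues = []
        if not self.security_groups:
            issues.append("no security group selected")
        if self.private_ip:
            ip = ipaddress.ip_address(self.private_ip)
            if ip not in ipaddress.ip_network(self.subnet_cidr):
                issues.append("private IP is outside the target subnet")
        return issues


# Hypothetical values, for illustration only.
bp = TargetBlueprint("m5.large", "10.0.1.0/24", ["sg-example"], "10.0.2.15")
print(bp.problems())  # ['private IP is outside the target subnet']
```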

Automated

In automated mode, select the copy-of-source option, which lets CloudEndure fetch the source VM's settings, such as disk type and private IP, and apply the same settings in the target region.

Perform failover from North Virginia to Oregon

To perform failover, the VMs should be in the Continuous Data Protection state in the CloudEndure console.

There is an option to fail over in Test Mode. It is recommended to test the setup before the actual drill.

You can choose a recovery point from the past hour, available at 10-minute intervals, or create a new recovery point to minimize data loss. In a real disaster, however, when the source machine is no longer available, CloudEndure uses the most recent consistent snapshot to launch a recovery instance.
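The choice of recovery point can be illustrated with a short sketch. It assumes recovery points at 10-minute intervals over the past hour, as described above, and picks the most recent one taken at or before a given moment, for example just before data corruption was noticed. The function name and timestamps are hypothetical; in practice the selection is made in the CloudEndure console.

```python
from datetime import datetime, timedelta


def pick_recovery_point(points, not_after):
    """Return the most recent recovery point at or before not_after, or None."""
    candidates = [p for p in points if p <= not_after]
    return max(candidates) if candidates else None


# Hypothetical recovery points: every 10 minutes over the past hour.
now = datetime(2021, 1, 1, 12, 0)
points = [now - timedelta(minutes=10 * i) for i in range(1, 7)]  # 11:50 ... 11:00

corruption_seen = datetime(2021, 1, 1, 11, 34)
print(pick_recovery_point(points, corruption_seen))  # 2021-01-01 11:30:00
```

Passing the current time as `not_after` returns the latest available point, which mirrors the default behavior when the source machine is lost.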

After testing, the Disaster Recovery Lifecycle column shows “Tested Recently” with a green bar to the left of the machine name. This means the machine is now ready to be failed over.

Steps for failover

  • Launch the VMs in Recovery mode (actual failover).
  • After the VMs launch, point the Route 53 record to the new public IP or load balancer endpoint so that traffic is directed to the DR region. (A Route 53 health check is configured to minimize the downtime.)

Amazon Route 53 DNS failover can be configured with primary and secondary records using Route 53 health checks.

  1. If the primary health check fails, traffic is directed to the secondary web server.
  2. If the secondary health check fails, traffic returns to the primary web server (after failback).
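The primary and secondary behavior above corresponds to a pair of Route 53 failover records. The sketch below only builds the `ChangeBatch` dictionary that boto3's `change_resource_record_sets` call accepts; it does not call AWS. The record name, IP addresses, and health check ID are placeholders, and the helper names are assumptions for illustration.

```python
def failover_record(name, role, ip, health_check_id=None, ttl=60):
    """Build one UPSERT change for a failover A record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "TTL": ttl,  # a short TTL lets clients pick up the switch quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary_ip, secondary_ip, primary_health_check_id):
    """Primary record in the source region, secondary record in the DR region."""
    return {
        "Comment": "DNS failover between source and DR regions",
        "Changes": [
            failover_record(name, "PRIMARY", primary_ip, primary_health_check_id),
            failover_record(name, "SECONDARY", secondary_ip),
        ],
    }


# Placeholder values, for illustration only.
batch = failover_change_batch(
    "app.example.com.", "192.0.2.10", "198.51.100.10", "example-health-check-id"
)
print([c["ResourceRecordSet"]["Failover"] for c in batch["Changes"]])
# ['PRIMARY', 'SECONDARY']
```

With records like these, Route 53 answers with the secondary address only while the primary health check is failing, and returns to the primary once it is healthy again.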

Conclusion

DR is the real savior whenever there is an outage due to natural calamities or technical glitches. It is important for every organization to have DR in place. With CloudEndure, AWS has made it simpler for architects to set up DR and manage it efficiently.
