We recently had the opportunity to work with a global communications company that shapes how consumers, businesses, governments and militaries around the world communicate. As part of our work with them, we were establishing a new AWS Disaster Recovery (DR) process, part of which involved keeping a Jenkins backup in another AWS Availability Zone (AZ). In helping the firm with its cross-AZ DR process, we ran into a challenge with Jenkins job status, which affected the AZ status. This AWS Disaster Recovery blog is the story of how we approached the situation, creating a unique solution that met the customer’s needs all in one place.
The Project Goal
The communications company already had an established DR solution. In its pursuit of constant improvement, it wanted to hone that solution further, ensuring seamless uptime and availability through cross-region DR. To do so, it brought the DevOps team at Flux7 on board to establish a world-class solution for Jenkins backup, restore, and DR within a 30-minute recovery point objective (RPO).
Because Jenkins is a critical piece of the cloud stack, it’s important to have a Jenkins backup that includes its data and configurations. Many organizations take an automated approach, with an init script launching an EC2 instance and installing Jenkins, and the backup then restored from S3. This approach handles instance and AZ failure effectively. However, we realized that it was not an ideal solution for this customer: the Jenkins Home Directory risks filling up, it doesn’t support cross-region DR, and it won’t retain job status. As a result, we needed to find an elegant alternative.
With a goal of extreme resiliency, the AWS Premier Consulting Partners at Flux7 created a robust solution that backed up Jenkins across AZs without disrupting Jenkins itself. Specifically, we created a Jenkins AMI with an attached EBS volume for the Jenkins Home Directory. To create backups, we used a Lambda function to take nightly snapshots of the Jenkins AMI, which were then copied to the DR region. The EBS volume holding the Jenkins Home Directory was snapshotted every 30 minutes, and those snapshots were also copied to the DR region.
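The 30-minute volume backup described above could be sketched roughly as follows. This is a minimal illustration using boto3, not the code used on the project: the function names, the regions, and the shape of the Lambda event are all assumptions made for the example. The EC2 clients are passed in as arguments so the core function can be exercised with stubs.

```python
from datetime import datetime, timezone


def snapshot_and_copy(source_ec2, dr_ec2, volume_id, source_region):
    """Snapshot one EBS volume and copy the snapshot to the DR region.

    source_ec2 and dr_ec2 are boto3 EC2 clients for the primary and DR
    regions. Returns (source_snapshot_id, dr_snapshot_id).
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    snapshot = source_ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"jenkins-home {stamp}",
    )
    snapshot_id = snapshot["SnapshotId"]

    # A snapshot has to reach the 'completed' state before it can be copied.
    source_ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # copy_snapshot is called on the destination-region client.
    copy = dr_ec2.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id}",
    )
    return snapshot_id, copy["SnapshotId"]


def lambda_handler(event, context):
    """Hypothetical entry point, e.g. invoked by a 30-minute schedule."""
    import boto3  # imported here so snapshot_and_copy can be run with stubs

    # Placeholder values for illustration, not taken from the project.
    source_region, dr_region = "us-east-1", "us-west-2"
    source_id, dr_id = snapshot_and_copy(
        boto3.client("ec2", region_name=source_region),
        boto3.client("ec2", region_name=dr_region),
        event["volume_id"],  # assumed event field
        source_region,
    )
    return {"source_snapshot": source_id, "dr_snapshot": dr_id}
```

A nightly AMI backup would likely follow the same pattern with the EC2 `create_image` and `copy_image` calls. Note that waiting for a large snapshot inside a single Lambda invocation may exceed the function timeout, so a real setup might split the snapshot and copy steps.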
In the event of an instance failure, the restore process kicked in with:
- An instance created using the latest snapshot AMI in the DR region
- A new volume created from the latest snapshot and attached at the Jenkins home mount point through CloudFormation
- Groovy initialization scripts restoring connectivity settings
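To make the restore steps more concrete, here is a rough sketch of how the latest AMI and snapshot in the DR region might be looked up and handed to a CloudFormation stack as parameters. It is an illustration only: the `Purpose` tag, its values, and the parameter keys `JenkinsAmiId` and `JenkinsHomeSnapshotId` are hypothetical names, and the template itself is not shown.

```python
def latest_snapshot_id(ec2, purpose_tag="jenkins-home"):
    """Return the ID of the most recently started completed snapshot."""
    response = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[
            {"Name": "status", "Values": ["completed"]},
            # Hypothetical tag used to mark Jenkins home snapshots.
            {"Name": "tag:Purpose", "Values": [purpose_tag]},
        ],
    )
    snapshots = response["Snapshots"]
    if not snapshots:
        raise LookupError("no completed Jenkins home snapshot found")
    return max(snapshots, key=lambda s: s["StartTime"])["SnapshotId"]


def latest_image_id(ec2, purpose_tag="jenkins-ami"):
    """Return the ID of the most recently created available AMI."""
    response = ec2.describe_images(
        Owners=["self"],
        Filters=[
            {"Name": "state", "Values": ["available"]},
            # Hypothetical tag used to mark Jenkins AMIs.
            {"Name": "tag:Purpose", "Values": [purpose_tag]},
        ],
    )
    images = response["Images"]
    if not images:
        raise LookupError("no available Jenkins AMI found")
    # CreationDate is an ISO 8601 string, so string order matches time order.
    return max(images, key=lambda i: i["CreationDate"])["ImageId"]


def restore_parameters(dr_ec2):
    """Build CloudFormation parameters for a restore stack (keys are assumed)."""
    return [
        {"ParameterKey": "JenkinsAmiId",
         "ParameterValue": latest_image_id(dr_ec2)},
        {"ParameterKey": "JenkinsHomeSnapshotId",
         "ParameterValue": latest_snapshot_id(dr_ec2)},
    ]
```

The returned list has the shape that the CloudFormation `create_stack` call accepts for its `Parameters` argument, so a restore could pass it straight through. An account with a very large number of snapshots would probably want paginated lookups instead of a single call.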
All in One Solution
The Flux7 and customer teams liked this solution because it met the goal of handling instance and AZ failures within an aggressive RPO, while also meeting the need for a single solution. In addition, it addressed weaknesses of other approaches to Jenkins backup, restore, and DR. While there are more moving parts, the solution has many benefits, including:
- Backups are independent of Jenkins
- The use of EBS volumes makes the solution more robust for easy expansion
- Snapshots are a tried and tested approach
- Job status is retained
- Jenkins Home can live on a separate volume
Along the way, Flux7 taught the customer each of the steps involved in the process and, more importantly, how to effectively manage and maintain the solution going forward to ensure long-term system resiliency and availability. In the process, Flux7 was able to introduce system automation to the communications company, which is now using AWS Systems Manager (SSM) to automate a wide variety of management tasks such as applying OS patches, creating Amazon Machine Images (AMIs), and configuring applications at scale.
Overall, with the Jenkins DR solution, the customer is able to significantly streamline its process, getting eight apps up and running in just 30 minutes. The previous manual process was time consuming and prone to fat-finger and other human errors, any of which would force a restart and an even lengthier process.
For additional DevOps best practices, please see our DevOps case studies reference page or subscribe to our blog below.
Post Date: 10/18/2018