We recently had the opportunity to work with a publicly traded media organization to enable blue/green and rolling restart deployment pipelines for its customer-facing website. The website is moving from on-premise to the public cloud as part of a larger effort to replatform the organization’s entire data center and its hundreds of applications. (See our blog, DevOps Adoption Case Study: Developing an AWS Cloud Migration Path, for additional background.) The overarching goal of the project, which runs on Amazon EKS in AWS, is to increase the organization’s agility while assuring uptime and high availability for the website, which is the company’s revenue engine.
The website generates revenue for the company in a few ways:
- It features a paywall for content
- It serves as the portal for paying subscribers to access content
- It offers passes for conferences, webinars, and other events for sale
More than just serving content, the website also curates content, features a recommendation engine, has a search engine and more.
Based on the advice of the DevOps consulting experts at Flux7, our customer team agreed that a container-based microservices infrastructure would be the best approach. The website comprises over 40 services, such as Java/Tomcat real-time transactional apps and AWS Batch jobs. It was decided that the real-time apps would be containerized and run inside Kubernetes, while the batch jobs would be containerized but run in AWS Batch. The Flux7 and client teams would deploy Kubernetes clusters on AWS with KOPS.
Developing Kubernetes Proofs of Concept
As we began moving through the process, it became clear from a developer productivity standpoint that Proofs of Concept (POCs) were needed before the applications could be modernized and moved to Kubernetes. As a result, the project included several POCs:
- Moving secrets from HashiCorp Vault into Kubernetes containers
- Creating shared storage
- Working with apps with in-memory session state
- Sharing storage communication among the firm’s proprietary apps
- Using open-source Kubernetes with role-based access control (RBAC)
- Implementing two solutions for zero-downtime deployment: a blue/green deployment and a rolling restart deployment
As the POCs progressed and different teams worked on them, we realized that a single Kubernetes cluster would not be enough. For example, the RBAC POC needed a cluster with the RBAC plugin installed, while the Vault POC needed a cluster set up to communicate with HashiCorp Vault.
To address this, we created a Kubernetes Factory with several pipelines through which people could deploy as many Kubernetes clusters as needed into a sandbox AWS account. The KOPS pipeline creates a Kubernetes cluster on demand, so each time a cluster is needed for a POC or to test a concept, the customer can simply go to the portal and create one.
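As a rough illustration of what one factory pipeline step might do, the Python sketch below assembles a `kops create cluster` command for a multi-Availability Zone cluster with a private topology. It only builds the argument list and runs nothing. The cluster name, S3 state store, zones, node count, and the choice of Calico networking are all assumptions made for this example, not details of the customer's pipelines.

```python
from typing import List


def kops_create_cluster_cmd(
    name: str,
    state_store: str,
    zones: List[str],
    node_count: int = 3,
) -> List[str]:
    """Build (but do not execute) a `kops create cluster` command line.

    All argument values are hypothetical placeholders for illustration.
    """
    if len(zones) < 3:
        # Assumption: spread masters over three zones for high availability.
        raise ValueError("expected at least three availability zones")
    node_zones = ",".join(zones)
    master_zones = ",".join(zones[:3])  # odd number of master zones
    return [
        "kops", "create", "cluster",
        f"--name={name}",
        f"--state={state_store}",
        f"--zones={node_zones}",
        f"--master-zones={master_zones}",
        f"--node-count={node_count}",
        "--topology=private",    # nodes and masters in private subnets
        "--networking=calico",   # assumed CNI; private topology needs one
    ]


if __name__ == "__main__":
    cmd = kops_create_cluster_cmd(
        name="poc-rbac.k8s.example.com",          # hypothetical cluster name
        state_store="s3://example-kops-state",    # hypothetical state bucket
        zones=["us-east-1a", "us-east-1b", "us-east-1c"],
    )
    print(" ".join(cmd))
```

A pipeline could pass a list like this to its own process runner, with one cluster name per POC so that teams do not collide in the sandbox account.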
While the Kubernetes clusters produced through the Kubernetes Factory were initially used for POCs, they were production-ready: multi-Availability Zone, high-availability clusters created using KOPS in secure VPCs inside AWS. This enabled the customer to conduct individual POCs at full speed in parallel.
The AWS Migration
Once the customer’s POCs were proven and its team’s skills were developed, we were able to leverage a large number of developers to migrate its 40 services into Kubernetes. Once a service was ready to move into Kubernetes, we would:
- Write a Dockerfile based on a Dockerfile template that had been created
- Write a Kubernetes YAML file based on a YAML file template that had been created
- Use the sandbox cluster to test that everything deployed as expected
- Push the code into our code repository
- Create a container pipeline to build the image and deploy it onto the development cluster
- Once the service passed on the development cluster, have the developer obtain certification from security and operations, move the service into staging, and queue it up for QA
- Once QA was complete, move the service into production
- Update the Apache API gateway so that public traffic coming in for that service is routed to the Kubernetes cluster while traffic for older services continues to be routed on-premise
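The final step, routing each service at the gateway according to whether it has been migrated, can be pictured with a small Python sketch. This is not the customer's Apache configuration; the path prefixes and backend names are hypothetical and serve only to show the decision being made.

```python
from typing import Iterable

KUBERNETES = "kubernetes"   # hypothetical backend label
ON_PREMISE = "on-premise"   # hypothetical backend label


def pick_backend(path: str, migrated_prefixes: Iterable[str]) -> str:
    """Route a request path to Kubernetes if its service has been migrated.

    A prefix matches the path itself or anything beneath it, so "/search"
    matches "/search" and "/search/articles" but not "/searching".
    """
    for prefix in migrated_prefixes:
        base = prefix.rstrip("/")
        if path == base or path.startswith(base + "/"):
            return KUBERNETES
    # Services that have not been migrated yet stay on the old platform.
    return ON_PREMISE


if __name__ == "__main__":
    migrated = ["/search", "/recommendations"]  # hypothetical migrated services
    for p in ["/search/articles", "/events/2019", "/searching"]:
        print(p, "->", pick_backend(p, migrated))
```

Under this model, promoting a service to production ends with adding its prefix to the migrated list, so services can move one at a time without a big-bang cutover.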
With this process in place, the speed at which the customer is migrating applications is quite brisk, with the team demonstrating two to three new services every week on the new system.
Production-Grade Rolling Restart and Blue/Green Pipelines
For this customer, we created two deployment pipelines:
- A rolling restart pipeline, which updates a single application and is used for fast, minor updates
- A blue/green pipeline, which creates a separate stack of applications and is used for major updates, e.g. an API change
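To make the rolling restart idea concrete, here is a minimal Python simulation of replacing an application's instances a few at a time so that most of them keep serving traffic throughout. It is a toy model, not Kubernetes' rollout controller or the customer's pipeline; the `max_unavailable` parameter is named after the Kubernetes setting of similar purpose purely for familiarity.

```python
from typing import List


def rolling_restart(
    pods: List[str], new_version: str, max_unavailable: int = 1
) -> List[List[str]]:
    """Simulate a rolling restart and return the pod versions after each batch.

    At most `max_unavailable` pods are replaced at once, so the remaining
    pods stay available while each batch restarts.
    """
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    current = list(pods)  # copy so the caller's list is not modified
    history = []
    for start in range(0, len(current), max_unavailable):
        end = min(start + max_unavailable, len(current))
        for i in range(start, end):
            current[i] = new_version  # replace this batch of pods
        history.append(list(current))
    return history


if __name__ == "__main__":
    for step in rolling_restart(["v1", "v1", "v1"], "v2"):
        print(step)
```

With three instances and one replaced at a time, at least two are always on a running version, which is why this style suits fast, minor updates that stay compatible with the previous release.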
The customer wanted to have uninterrupted domain parking services so that it could perform blue/green deployments without affecting redirect services. Notably, during a blue/green deployment, both the blue and green stacks continue to serve traffic, so if either one is canceled, terminated, or rolled back, service remains uninterrupted. This is possible because of a manual connection-draining step in which traffic is fully routed from the old stack to the new stack. Once existing connections to the old stack are fully drained, the old stack is deleted.
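The ordering described above (deploy the new stack, switch traffic, drain, then delete) can be sketched as a small state tracker in Python. The class and method names are hypothetical, and the connection count stands in for whatever signal the real pipeline uses to decide that draining is complete.

```python
class BlueGreenRelease:
    """Tracks one blue/green cutover from an old stack to a new stack."""

    def __init__(self, old_connections: int) -> None:
        self.live = "old"            # stack that receives new traffic
        self.new_deployed = False
        self.old_deleted = False
        self.old_connections = old_connections  # open connections on old stack

    def deploy_new(self) -> None:
        # The new stack comes up alongside the old one; traffic is unchanged.
        self.new_deployed = True

    def switch_traffic(self) -> None:
        if not self.new_deployed:
            raise RuntimeError("new stack is not deployed yet")
        self.live = "new"

    def rollback(self) -> None:
        # Rolling back is only possible while the old stack still exists.
        if self.old_deleted:
            raise RuntimeError("old stack is gone; cannot roll back")
        self.live = "old"

    def drain(self, closed: int) -> None:
        # Existing connections on the old stack finish over time.
        self.old_connections = max(0, self.old_connections - closed)

    def delete_old(self) -> None:
        # The manual gate: delete only after cutover and a full drain.
        if self.live != "new" or self.old_connections > 0:
            raise RuntimeError("old stack is still in use")
        self.old_deleted = True


if __name__ == "__main__":
    release = BlueGreenRelease(old_connections=3)
    release.deploy_new()
    release.switch_traffic()
    release.drain(3)
    release.delete_old()
    print("old stack deleted:", release.old_deleted)
```

Keeping deletion as a separate, guarded action is what leaves a rollback path open until someone explicitly decides the old stack is no longer needed.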
Similarly, in the blue/green pipeline, application containers in the old stack communicate only with containers within the old stack, while new stack containers communicate only with other new stack containers. For a new stack, the setup deploys new containers only for the services that are being updated, so it does not waste resources by copying the entire microservices environment. Containers that do not change between deployments are not recreated.
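One way to picture the "only deploy what changed" behavior is a comparison of the image each service runs in the old stack against the image requested for the new one. The Python sketch below assumes a simple mapping from service name to image tag; the service names and tags are invented for illustration and say nothing about how the actual pipeline detects changes.

```python
from typing import Dict, Tuple


def plan_new_stack(
    old: Dict[str, str], new: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split the desired services into those to deploy and those left as-is.

    A service is deployed into the new stack if it is new or its image
    differs from the one currently running; otherwise it is not recreated.
    """
    to_deploy = {name: image for name, image in new.items() if old.get(name) != image}
    unchanged = {name: image for name, image in new.items() if old.get(name) == image}
    return to_deploy, unchanged


if __name__ == "__main__":
    # Hypothetical service names and image tags.
    running = {"search": "search:1.4", "paywall": "paywall:2.0"}
    desired = {"search": "search:1.5", "paywall": "paywall:2.0", "events": "events:1.0"}
    deploy, keep = plan_new_stack(running, desired)
    print("deploy:", sorted(deploy))
    print("keep:", sorted(keep))
```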
14-28x Increase in Release Frequency
Prior to moving to Amazon EKS, Kubernetes, and the public cloud, this customer conducted releases every two weeks. However, once the team was able to execute the blue/green pipeline in its production environment, its confidence grew significantly. Now, with the aid of automation, the customer has increased its release frequency 14-28x, issuing up to two releases per day.
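The 14-28x figure follows from the two cadences quoted above. As a quick check, this Python snippet compares one release every 14 days with one or two releases per day; the function name is just a label for the arithmetic.

```python
from fractions import Fraction


def frequency_multiplier(old_interval_days: int, new_releases_per_day: int) -> Fraction:
    """Return how many times more often releases happen at the new cadence."""
    old_rate = Fraction(1, old_interval_days)  # releases per day before
    new_rate = Fraction(new_releases_per_day)  # releases per day after
    return new_rate / old_rate


if __name__ == "__main__":
    print(frequency_multiplier(14, 1))  # one release per day
    print(frequency_multiplier(14, 2))  # two releases per day
```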
In addition, the SLA for the firm’s client-facing website services is 24/7 with zero downtime; small outages are acceptable only for internal-facing services. As a result, blue/green and rolling deployments are now used wherever possible for the company’s high-availability client tier. And because the website can be updated multiple times each day, with the size or complexity of an update having little effect on deployment, teams can deliver innovation faster and grow customer satisfaction.
Additional Benefits Achieved
In addition to increasing its release frequency, the firm has gained greater insight into its systems. SysOps now has historical data specific to each upgrade, which allows the team to manage the performance of the upgrade process more effectively. The data tracks relevant information for each job, including success or failure, to support rollback when necessary. The company can also follow the deployment process through a visual monitor, accessible to authorized users, that helps ensure the process is secure and successful.
The customer team also has greater control over the environment, with the ability to manually control when the previous ‘blue’ deployment environment is deleted so they can ensure minimal user disruption when the new deployment begins. Notably, the application can be deployed while existing users continue to interact with the website uninterrupted until a manual action is taken to remove the existing ‘blue’ infrastructure.
The customer team has successfully migrated its on-premise pipelines to Amazon EKS and is now deploying a record number of application updates for its website with rolling restart and blue/green deployments. Its services running in AWS have experienced higher availability and uptime, giving the client team significant confidence in their ability to meet SLAs and grow customer satisfaction. Next, the customer plans to use Amazon CloudFront for content delivery in order to reduce global latency and improve its geodiversity.
Post Date: 01/17/2019