Urgent AWS Migration Shows Need for Automation - Preparation - Flexibility
- February 14, 2017
We have been working closely with a customer who is undergoing a business transformation. As a multimedia equipment manufacturer, the organization has a loyal following for its high-quality devices. However, like many companies facing the convergence of markets and new customer demands, the company has embarked on a metamorphosis. Traditionally very focused on hardware, the company had largely ignored its software even though it offered customers real value. Part of the company’s transformation was a move to treat its software like a full-fledged offering, rather than a free supplement. An upcoming product release marked the first (and biggest) step in cementing this change in company direction.
Flux7, a DevOps consulting firm based in Austin, TX, was called in to help usher the organization through an IT transformation and AWS migration in support of the company’s metamorphosis. We employed our three-step engagement process (assessment/design, implementation, and knowledge transfer) with this customer, working initially on an AWS architecture for the group, which led to the development of an infrastructure delivery process. After establishing this foundation, we began work clearing the company’s backlog.
During this time, our engineers noticed an issue that affected the customer experience: the company’s AWS VPC CIDR overlapped with a public IP address space, and the overlap had been blocking a large number of customers from using the website. We later learned that the customer’s AWS accounts and infrastructure were initially set up by developers who did not have a solid understanding of networking, and during that setup they chose a VPC range that overlapped with public IP address space. The environment was created and the number of apps kept growing over time. As the team structure changed, people realized this was an issue and added it to the ‘fix list’, yet no one realized just how big an issue it was. It was assumed to affect only outbound traffic, and because the overlapping IP ranges were not considered important, it was not given high priority.
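This kind of overlap can be caught before any infrastructure is built. The following is a minimal sketch, using only Python’s standard `ipaddress` module, of a check that a proposed VPC CIDR sits inside one of the RFC 1918 private ranges and does not collide with another range of interest. The function names and the example CIDRs are illustrative assumptions, not the customer’s actual values or tooling.

```python
import ipaddress

# RFC 1918 private ranges, the conventional choices for a VPC CIDR.
RFC1918_BLOCKS = [
    ipaddress.IPv4Network("10.0.0.0/8"),
    ipaddress.IPv4Network("172.16.0.0/12"),
    ipaddress.IPv4Network("192.168.0.0/16"),
]


def is_private_vpc_cidr(cidr: str) -> bool:
    """Return True if the IPv4 CIDR lies entirely inside an RFC 1918 block."""
    network = ipaddress.IPv4Network(cidr)
    return any(network.subnet_of(block) for block in RFC1918_BLOCKS)


def cidrs_overlap(first: str, second: str) -> bool:
    """Return True if the two IPv4 CIDRs share at least one address."""
    return ipaddress.IPv4Network(first).overlaps(ipaddress.IPv4Network(second))


if __name__ == "__main__":
    # Hypothetical candidates: one private range, one in public address space.
    for candidate in ("10.20.0.0/16", "172.32.0.0/16"):
        status = "ok" if is_private_vpc_cidr(candidate) else "overlaps public space"
        print(f"{candidate}: {status}")
```

A check like this could run as part of whatever automation creates the VPC, so a public range is rejected before the first instance is launched.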
When we informed the customer of the issue, their first reaction was that it was already on their list of things to fix. However, when we informed them that it was actually impacting their customers on a daily basis, the discussion immediately intensified, raising the fix from a ‘nice to have’ to ‘urgently needed today’. Unfortunately, the only way to fix the issue was to completely migrate the environment to a new VPC, including its nearly 100 production application services with external customers and inter-dependencies, existing and legacy assets, and a beta product (see our blog post about AWS VPC best practices). To make things even more challenging, the product release was scheduled only two months from the day the issue was discovered, and the issue needed to be fixed prior to launch.
Our approach to this project was like the woodsman who was asked, “What would you do if you had just five minutes to chop down a tree?” To which he answered, “I would spend the first two and a half minutes sharpening my axe.” And that’s exactly what we did. We held daily brainstorming sessions in which we actively identified problems, resources available for problem resolution, near-term goals and tracking for project completion. We also mapped items that could go wrong, designed experiments, and created checklists and runbooks. As there was no room for error, we aggressively tackled any and all issues before they could become a real concern. While we didn’t have the luxury of scoping things out in advance and knowing the plan going in, these regular sessions let us address the challenge and also helped us tackle common problems such as fitting work into overnight maintenance windows.
We also took several measures to ensure business continuity. To ensure thoroughness in our planning, we tested all parts of the process. For any unknowns, we designed PoCs to instill confidence in our plan. We created formal checklists, complete with rollback plans and a notification hierarchy. And we created sandbox environments.
Luckily, the customer was a heavy user of Ansible and had used it to create their environment and instances. This existing automation was a godsend: as we began work, we were able to quickly create new environments, set up sandbox routers to route to the new services, and have QA perform manual and automated testing using the sandbox routers and application servers.
Indeed, the biggest reason the project succeeded was that the customer already had very good automation in place. From day one, our AWS consultants took advantage of the pre-existing automation and used it to bring up servers and load balancers similar to what already existed. We also used Ansible to configure quick test environments for any changes we were rolling out, and ran test suites on them.
In addition, the customer was using an NGINX server as a web proxy in front of the internal services. The configuration of this server was also set through Ansible, which gave us a good point from which to manipulate the configuration for internal testing. Thankfully, the customer had steadily invested in automation with a dedicated DevOps team, which resulted in a solid foundation to work from.
Striking a Balance
As the project moved forward, we learned that there was a balance we needed to maintain between keeping the scope on point and fixing issues that arose. For example, we observed that this client was not using IAM instance profiles. For those unfamiliar with the concept, an instance profile is a way to attach an IAM role to an EC2 instance so that any application running on that instance can access the AWS services permitted by that role. It thus eliminates the need to hard-code AWS API keys in application configuration files and is considered an excellent security control by auditors.
However, introducing IAM roles at this client would have required coordination with the application team and additional QA. Instance profiles also had a shortcoming: they could not be attached to a running instance, i.e., you could only attach a profile when you created the instance. (Note that this shortcoming has since been addressed: look for a story from us on that shortly.) Since we were creating new instances for all applications, we did not want to miss the opportunity to enable the customer to use IAM roles in the future. Thus, we made the decision to attach IAM instance profiles with empty policies. This had minimal impact on our effort, but with the “empty” profile attached to each running instance, the customer could easily switch to the use of roles in the future.
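To make the “empty” profile idea concrete, here is a minimal sketch of what such a role amounts to: a trust policy that lets EC2 assume the role, and no permission policies attached. Because permissions can be added to the role later, the instance never has to be recreated. The helper names, the naming convention, and the dictionary layout are hypothetical and for illustration only; this is not the customer’s Ansible code.

```python
import json


def ec2_trust_policy() -> dict:
    """Trust policy that allows the EC2 service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def empty_profile_spec(app_name: str) -> dict:
    """Describe a role and instance profile that grant no permissions yet."""
    name = f"{app_name}-instance-role"  # hypothetical naming convention
    return {
        "role_name": name,
        "instance_profile_name": name,
        "assume_role_policy_document": json.dumps(ec2_trust_policy()),
        # Intentionally empty: permissions are attached to the role later,
        # once the application team is ready to stop using API keys.
        "attached_policies": [],
    }


if __name__ == "__main__":
    print(json.dumps(empty_profile_spec("billing"), indent=2))
```

With boto3, a spec like this would typically feed the IAM client’s `create_role`, `create_instance_profile`, and `add_role_to_instance_profile` calls; the same result can be reached through Ansible or CloudFormation.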
With a tight timeframe and a critical issue to solve, we learned a lot in the course of this project. The most important lessons we learned were to:
- Leverage automation: A solid investment in DevOps and automation, with Ansible as its backbone, led to significant efficiency and productivity. Without this automation, we would not have met the project’s deadline.
- Maintain flexibility: Flexibility in our schedule, and in our attitude, allowed us to do what was necessary to achieve the desired result in the given timeframe.
- Seek balance: Analyzing the three key “fix now or fix later” considerations (the value of the fix, the cost to make the fix today, and the cost of fixing it later) helped us ensure we were striking the right balance.
- Communicate: Frequent communication, coupled with a creative and proactive approach to a spectrum of potential scenarios, allowed us to quickly and effectively triage any concern that arose.
Applying these lessons, we were able to successfully deliver the migration within the two-month timeframe, while:
- Building in AWS security best practices and enhancements
- Implementing automatic monitoring of mission-critical infrastructure
- Completing the work early enough to give the application teams ample time to get comfortable with the environment
- Updating and optimizing configuration management Ansible libraries
- Optimizing Docker infrastructure components for ease of maintenance and future enhancements
Just as importantly, the new product launch was a smash hit with all customers now able to use the website to manage their use of the company’s products. The best part is that this company’s customers are able to enjoy their new product without hiccup or frustration given our ability to quickly migrate this company’s environment.
We hope that you gain as much from these lessons as we did. The right architecture is critical to success, starting with properly setting up your AWS Account. If you’d like to know more about our approach to DevOps, automation, and project management to achieve specific business goals, please read more about our DevOps Consulting. Or, for additional best practices and lessons learned, please subscribe to our blog.