HPC on AWS: A Workload-Based Approach to Computing at Scale
- May 13, 2020
Bringing your HPC workloads to the cloud is a daunting challenge at first. You’ve probably got decades of history built into your current on-premises system. You know it inside and out. You know all its strengths and its warts. You probably spent a considerable amount of time testing and honing it so that it worked “just so”. But now, your CIO or your VP or your Director has a new cloud initiative and suddenly you must come up with a plan to translate decades of custom work and tuning into a public cloud setup. Where to begin?
A workload-based approach
The best way to start is to go through a period of discovery, wherein you assess your workloads based on a set of criteria. Afterward, you design and build out any infrastructure that you need, run a proof of concept and then transition it into production. Rinse and repeat for as many workloads as you want.
Cloud workload discovery
Start by cataloging your workloads into a simple spreadsheet. Use an objective rating system based on characteristics like network dependencies, data payload size, intellectual property considerations, spikes in demand, business value, and others based on your unique situation.
Characteristics to consider for good testing candidates:
- Runs with minimal network dependencies
- Has a small data payload rather than a large one
- Can be run by you alone, without involving an engineering or business team
- Runs with minimal resources for pipecleaning, yet scales up easily when you want to test at scale
- Uses data without a high degree of sensitivity (for a Proof of Concept)
Pro Tip: Don’t get bogged down in the precision of your ratings. Catalog your workloads, assign values based on what you know, and move on. Once your objective criteria are set and scored, use them to select your candidate workloads. I recommend identifying two to four.
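If you would rather script the scoring than keep it in a spreadsheet, the idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the criterion names, the weights, the 1-to-5 rating scale (5 meaning "good cloud candidate") and the workload names are all hypothetical, so substitute your own.

```python
from dataclasses import dataclass

# Hypothetical criteria and weights. Each rating runs from 1 (poor
# cloud candidate) to 5 (good cloud candidate) on that criterion.
WEIGHTS = {
    "network_independence": 3,
    "small_payload": 2,
    "low_ip_risk": 2,
    "demand_spikes": 1,
    "business_value": 2,
}


@dataclass
class Workload:
    name: str
    ratings: dict  # criterion name -> rating, 1 to 5

    def score(self) -> int:
        # Weighted sum; a criterion with no rating contributes nothing.
        return sum(WEIGHTS[c] * self.ratings.get(c, 0) for c in WEIGHTS)


def pick_candidates(workloads, count=3):
    """Return the `count` highest-scoring workloads, best first."""
    return sorted(workloads, key=lambda w: w.score(), reverse=True)[:count]


# Made-up catalog entries, standing in for rows of the spreadsheet.
catalog = [
    Workload("regression-suite", {"network_independence": 5, "small_payload": 4,
                                  "low_ip_risk": 4, "demand_spikes": 2, "business_value": 3}),
    Workload("monte-carlo-sim", {"network_independence": 4, "small_payload": 5,
                                 "low_ip_risk": 5, "demand_spikes": 3, "business_value": 3}),
    Workload("legacy-signoff", {"network_independence": 1, "small_payload": 2,
                                "low_ip_risk": 1, "demand_spikes": 5, "business_value": 5}),
    Workload("nightly-report", {"network_independence": 3, "small_payload": 3,
                                "low_ip_risk": 4, "demand_spikes": 1, "business_value": 2}),
]

candidates = pick_candidates(catalog)
for workload in candidates:
    print(workload.name, workload.score())
```

Rough weights are fine here. The goal is a defensible short list, not a precise ranking.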
During this phase, you’ll want to do a deep dive on your two to four candidate workloads. In addition to asking workload or workflow owners about any support pain points you should know about, here are some key points to consider in your assessment:
Network dependencies
In one of my previous jobs, we had an engineer leave the company. In accordance with IT security policy, this engineer’s home directory was archived. Immediately, we started getting phone calls saying that a critical workload was failing all over the network. You guessed it: there was a critical file somewhere in that engineer’s home directory that HAD to be there for everything to work properly.
Pro Tip: If you’ve got workloads with these kinds of dependencies, it’s best to move along and find something simpler. If you simply must select a workload that you know has dependencies but don’t know what they are, I highly recommend Ellexus Breeze to give you a Bill of Materials on what NFS dependencies your workload has.
Data payload size
Some workloads have a very small initial payload size, even if the results balloon during execution. Transferring hundreds of gigabytes or even terabytes of information just to set up your workload will lengthen your testing cycles.
Pro Tip: Rather than waiting for data to arrive, choose a workload with a small payload. This way you can more rapidly run a variety of test cases.
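To put rough numbers on the wait, here is a back-of-the-envelope estimate in Python. The link speeds and the 80% efficiency factor are assumptions for illustration, not measurements; real throughput depends on your network path and transfer tooling.

```python
def transfer_hours(payload_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move `payload_gb` gigabytes over a
    link rated at `link_mbps` megabits per second.

    `efficiency` is an assumed fraction of the rated speed that you
    actually achieve (protocol overhead, contention and so on).
    """
    payload_megabits = payload_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    seconds = payload_megabits / (link_mbps * efficiency)
    return seconds / 3600


# Compare a small payload with two large ones on an assumed 1 Gbps link.
for size_gb in (5, 500, 5000):
    print(f"{size_gb:>5} GB over 1 Gbps: {transfer_hours(size_gb, 1000):.2f} hours")
```

Under these assumptions, a few gigabytes stage in about a minute, while multiple terabytes take the better part of a day before your first test can even start.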
Intellectual property considerations
Consider the intellectual property content of the workload. If your organization has special requirements for data protection, such as HIPAA, you’ll want to consult your legal team about data handling procedures. Starting right off with your most sensitive data will lead to a long and frustrating cloud engagement.
Pro Tip: Workloads with lower IP risk will more quickly jumpstart your testing efforts.
Spikes in demand
For workloads that spike in demand, there are two strategies. In the first, you move the workload that is the cause of the spike to the cloud. In the second, you migrate another workload to “make room” for the workload that spikes.
Pro Tip: While it is tempting to focus on the spiking workload candidates, sometimes there are factors that make that impractical, such as IP security constraints or the amount of data that must be transferred or staged for successful execution. Don’t overlook another workload that is easier to move and test, but that will free up enough on-premises capacity to accommodate your primary capacity driver.
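The "make room" strategy comes down to simple arithmetic, sketched below in Python. The core counts and workload names are invented for illustration, and the sketch assumes every workload hits its peak at the same time, which is a deliberately pessimistic simplification.

```python
def workloads_that_free_enough(capacity: int, peak_cores: dict, spiking: str) -> list:
    """List the workloads (other than the spiking one) whose move to the
    cloud would, on its own, free enough on-premises cores to cover the
    shortfall at peak demand.

    Returns an empty list when everything already fits on-premises or
    when no single workload is large enough.
    """
    shortfall = sum(peak_cores.values()) - capacity
    if shortfall <= 0:
        return []  # no shortfall, nothing needs to move
    return sorted(
        name
        for name, cores in peak_cores.items()
        if name != spiking and cores >= shortfall
    )


# Hypothetical cluster: 2000 cores on-premises, 2400 needed at peak.
peak_cores = {
    "timing-signoff": 1200,  # the workload that spikes
    "regression": 600,
    "dev-sandbox": 450,
    "reporting": 150,
}

options = workloads_that_free_enough(2000, peak_cores, spiking="timing-signoff")
print(options)
```

In this made-up case the shortfall is 400 cores, so moving either of the two mid-sized workloads would leave room for the spike without touching the spiking workload itself.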
Business value
Finally, you really must consider the business value of migrating a particular workload to the cloud. Will it speed your time to market? Is it a critical component of your pipeline? Does your on-premises infrastructure constantly struggle to meet peak demand for a time-sensitive workload, while you can never justify purchasing the additional resources for that peak?
Pro Tip: While cloud gives you a lot of flexibility for such workloads, you’ll need to convince the business owners of the value. Although they may have assigned you this project, think about your workload assessment in their terms before approaching them with your conclusions.
Infrastructure build-out, PoC, and transition to production
After you’ve identified your workload, characterized its needs, and received buy-in to build out infrastructure and conduct a proof of concept, the real work begins.
Pro Tip: Building infrastructure the “traditional” way will get you going, but it will not give you the solid foundation you need; for a secure, scalable, reproducible infrastructure, cloud best practices are critical.
Our experienced AWS architects can help you effectively navigate the transition to production. We work with your team to set up code pipelines that build infrastructure using cloud-native technologies like AWS CloudFormation. The result is that you can set up an HPC cluster identical to prior deployments in a matter of hours, anywhere in the world.
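To give a flavor of what "infrastructure as code" means in practice, here is a minimal Python sketch that generates a tiny CloudFormation template as JSON. It is an illustration only, not the pipeline described above: the function name, the logical resource name and the output name are hypothetical, and a real HPC cluster needs far more (networking, compute, storage, a scheduler). The one resource shown, a cluster placement group, is a common building block for tightly coupled jobs.

```python
import json


def build_template(description: str) -> dict:
    """Build a minimal, illustrative CloudFormation template.

    The logical names ClusterPlacementGroup and PlacementGroupRef are
    hypothetical; pick names that suit your own conventions.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": description,
        "Resources": {
            # A cluster placement group asks EC2 to pack instances close
            # together for low-latency node-to-node networking.
            "ClusterPlacementGroup": {
                "Type": "AWS::EC2::PlacementGroup",
                "Properties": {"Strategy": "cluster"},
            }
        },
        "Outputs": {
            "PlacementGroupRef": {"Value": {"Ref": "ClusterPlacementGroup"}}
        },
    }


template = build_template("Illustrative HPC building block")
template_json = json.dumps(template, indent=2)
print(template_json)
```

Because the template is just text under version control, a pipeline can deploy the same file again in another region or account (for example with `aws cloudformation deploy`), and that repeatability is what makes a new cluster match the previous ones.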
*This was originally written by Flux7 Inc., which has become Flux7, an NTT DATA Services Company as of December 30, 2019