3 Best Practices for On-Prem to Snowflake

  • December 01, 2022

At NTT DATA, we believe in delivering significant business outcomes for organizations across many industries with the use of the Snowflake Data Cloud. The Snowflake Data Cloud is available on all three major clouds (AWS, Azure, and GCP), and it supports a range of workloads including data warehousing, data lakes, and data science. With years of experience under our belt, we’ve helped companies modernize and migrate away from costly, rigid, capital-intensive data warehousing appliances such as Teradata, Netezza, and Exadata.

Similarly, companies that have attempted to use big data platforms (Cloudera, Hortonworks, MapR), for data warehousing and BI interactivity, are aggressively moving those workloads to Snowflake. Why? These companies want to take advantage of consumption-based pricing, native SQL (no esoteric tools and skills required), zero infrastructure and management overhead, and unlimited, instant compute scalability — all while still paying by the second.

These technology approaches, from cost-intensive DW appliances to traditional big data platforms, have at least one thing in common: the vast majority of those implementations were deployed on-premises.

Many of the applications and underlying databases that drive the operational and financial systems of record for these companies are on-prem (SAP, Oracle EBS, Oracle RDBMS, SQL Server, DB2, et al). These same systems also feed the data warehouses and analytics platforms.

As a result, we are constantly asked questions such as:

  • What do you recommend for getting data to the cloud and into Snowflake?
  • We are really impressed with Snowflake, but can you help us with data movement, data integration, data engineering, or data pipelining from our source systems in our data center?
  • How should I approach a bulk migration to Snowflake that is 20TB+?
  • What is the best way to acquire incremental loads and ensure a reliable change data capture process?

With these factors in mind, I want to share three best practices that we’ve learned when it comes to making the choice to move from on-prem to Snowflake.

#1 ELT over ETL

“Extract, Transform, Load” is the most common paradigm in on-prem environments. There, the ability to transform is constrained by compute: it is finite, expensive, and shared, so heavy transformation jobs can degrade performance for the key users and stakeholders who depend on a certain level of responsiveness from their analytics, reports, and applications.

An ETL process limited by on-prem compute is like a morning commute with only one vehicle. If your SUV is unavailable, you can try to convince a partner or friend to lend you theirs, but the chances of success are low, and even when you succeed, it comes only after a time-consuming process of justifying why you need it.

On the other hand, when you are faced with unexpected personal transportation issues, you can make use of options like Uber, Lyft, a rental car, the train, or the bus. These options are typically low-cost, low-maintenance, and available on demand. They also give you the flexibility to choose your starting point and destination to fit your needs, with no long-term investment in special skills.

Much like having unlimited commute options, a modern ELT approach gives you unlimited compute when you move from on-prem to the cloud. With ELT, you push the processing (the “T”, transformations) into Snowflake and leverage its low-cost, instant-on, unlimited scalability and elasticity for transformations and processing.

Could you lift and shift your ETL environment? Short answer: yes. Is it the best approach in most cases? No.

When choosing to go the ELT route, you can expect to avoid having to solve a lot of data movement-based performance issues. This is because you won’t be pulling data out of Snowflake to do transformations in another engine or moving data back and forth from system to system. As a result, you’ll gain both cost efficiency (in terms of real dollars) and compute efficiency.

If you are looking to use an ELT-based approach, Fivetran, Matillion, dbt, and Talend are just a few of the solutions to consider. They integrate closely with Snowflake and use Snowflake compute for transformations, which makes them valuable companions on your journey from on-prem to the cloud.
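To make the ELT pattern concrete, here is a minimal Python sketch of the load-then-transform flow. The stage, table, and column names are hypothetical, and the statements are printed rather than executed; in practice you would run them with your client of choice (for example, the Snowflake Python connector).

```python
# Minimal ELT sketch: land raw data first, then transform inside Snowflake,
# so the warehouse (not an external engine) does the heavy lifting.
# Stage, table, and column names below are hypothetical.

def extract_load_sql(stage: str, raw_table: str) -> str:
    """Bulk-load raw files from a stage into a landing table (the E and L)."""
    return (f"COPY INTO {raw_table} FROM @{stage} "
            f"FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)")

def transform_sql(raw_table: str, target_table: str) -> str:
    """Push the transform (the T) down to Snowflake via INSERT ... SELECT."""
    return (f"INSERT INTO {target_table} "
            f"SELECT order_id, customer_id, amount, "
            f"TO_DATE(order_ts) AS order_date "
            f"FROM {raw_table} WHERE amount IS NOT NULL")

statements = [
    extract_load_sql("sales_stage", "raw.orders"),
    transform_sql("raw.orders", "analytics.orders"),
]
# Each statement would be executed in Snowflake, e.g. cursor.execute(stmt)
# with the Snowflake Python connector. No data leaves Snowflake for the T step.
for stmt in statements:
    print(stmt)
```

The key point is that the second statement runs entirely inside Snowflake compute; nothing is pulled out to an external transformation engine and pushed back in.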

#2 Get to Know Your Data Warehouse: Snowflake (of course!)

Snowflake has some unique attributes that set it apart from anything else in the market today—most notably from on-prem solutions. These three aspects really stand out:

1. It just works!

This will be self-evident as you take it for a test drive. You can get $400 in free credits for 30 days to see for yourself.

2. Snowflake was developed in the cloud, for the cloud.

It is a true cloud-native architecture and more importantly, a SaaS-based solution. You do get to pick your cloud of choice for your Snowflake service from AWS, Azure, or GCP.

3. Snowflake was developed specifically for data warehousing and analytics workloads.

This means that Snowflake is not a transactional database, it’s not an operational data store, it’s not a historian, it’s not a time-series database, and it’s not a graph database. Snowflake’s platform makes it incredibly simple to store, manage, and process both structured AND semi-structured data.

Instead of continuing to move forward with traditional, on-prem methods, try diving into Snowflake’s cloud-native architecture. Then, apply what you’ve learned to how you run your data processes and analytics workloads in the cloud.

Caching is King

Snowflake has three levels of cache: result cache, local disk cache, and remote disk cache. The result cache holds the results of every query executed in the past 24 hours. These are available across Snowflake’s independent virtual warehouses. Query results returned to one user are available to anyone else who executes the same query (provided the underlying data has not changed). This is a powerful feature from both a performance and cost perspective.
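To illustrate the result cache's behavior (not its implementation), here is a toy Python model of the semantics described above: an identical query text against unchanged data returns the cached result for up to 24 hours, regardless of which user or warehouse asks.

```python
import time

# Toy model of Snowflake's result-cache semantics (not the real engine):
# identical query text + unchanged underlying data => the stored result is
# reused, across users and warehouses, for up to 24 hours.
TTL_SECONDS = 24 * 60 * 60

class ResultCache:
    def __init__(self):
        self._cache = {}  # (query_text, data_version) -> (result, stored_at)

    def put(self, query_text, data_version, result, now=None):
        now = time.time() if now is None else now
        self._cache[(query_text, data_version)] = (result, now)

    def get(self, query_text, data_version, now=None):
        now = time.time() if now is None else now
        entry = self._cache.get((query_text, data_version))
        if entry and now - entry[1] < TTL_SECONDS:
            return entry[0]  # cache hit: no warehouse compute is consumed
        return None          # miss: data changed or the 24-hour window passed

cache = ResultCache()
cache.put("SELECT COUNT(*) FROM orders", data_version=1, result=42, now=0)
hit = cache.get("SELECT COUNT(*) FROM orders", data_version=1, now=3600)
miss_data = cache.get("SELECT COUNT(*) FROM orders", data_version=2, now=3600)
miss_ttl = cache.get("SELECT COUNT(*) FROM orders", data_version=1,
                     now=TTL_SECONDS + 1)
```

The two miss cases capture why this is both a performance and a cost feature: a hit costs no warehouse credits, while a change to the underlying data or an expired window transparently falls back to real compute.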

On the Fly Capacity and Elasticity (Up and Down)

In the advanced analytics space, new and unanticipated questions come up constantly. Below are some questions you can confidently ask of your Snowflake service, each of which is effectively impossible with traditional platforms:

  • Do I want to instantly resize a Snowflake warehouse to improve performance on large, complex queries, or auto-enable a multi-cluster warehouse to handle more concurrent users? You can do either, or both.
  • Should I auto-suspend to take advantage of Snowflake’s per-second billing and not burn through credits when a warehouse is not in use? Which apps would fit this profile so that I avoid a lot of suspend/resumes? Easy to select in Snowflake.
  • How can I avoid warehouse compute contention and max capacity issues for my users and workloads? Hint: Snowflake virtual warehouses are completely independent compute clusters operating against the same shared storage.
  • Will I still have to create an elaborate caching structure for my BI tools in order to avoid burdening Snowflake? Nope.
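Each of the bullets above maps to a single warehouse setting. As a sketch (with a hypothetical warehouse name and illustrative values), these helpers generate the corresponding ALTER WAREHOUSE statements; WAREHOUSE_SIZE, MAX_CLUSTER_COUNT, AUTO_SUSPEND, and AUTO_RESUME are the relevant parameters.

```python
# Sketch: the elasticity knobs from the bullets above, expressed as
# one-line ALTER WAREHOUSE statements. "BI_WH" and the values are
# illustrative; substitute your own warehouse names and sizes.

def resize_sql(warehouse: str, size: str) -> str:
    """Instantly resize a warehouse for large, complex queries."""
    return f"ALTER WAREHOUSE {warehouse} SET WAREHOUSE_SIZE = '{size}'"

def concurrency_sql(warehouse: str, max_clusters: int) -> str:
    """Let the warehouse scale out to handle more concurrent users."""
    return f"ALTER WAREHOUSE {warehouse} SET MAX_CLUSTER_COUNT = {max_clusters}"

def suspend_policy_sql(warehouse: str, seconds: int) -> str:
    """Stop burning credits when idle; resume on demand, billed per second."""
    return (f"ALTER WAREHOUSE {warehouse} SET AUTO_SUSPEND = {seconds} "
            f"AUTO_RESUME = TRUE")

for stmt in (resize_sql("BI_WH", "LARGE"),
             concurrency_sql("BI_WH", 4),
             suspend_policy_sql("BI_WH", 60)):
    print(stmt)
```

Because each change is a single statement with no infrastructure to provision, "resize for the morning crunch, shrink for the afternoon" becomes an operational habit rather than a capacity-planning project.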

Yes, You Can Monitor Your Snowflake Usage and Costs

Once you understand how to think about ELT and data integration, and you get to know Snowflake, the next questions tend to revolve around cost monitoring.

For example:

  • This is the first time I’ll be moving data and analytics workloads to the cloud and Snowflake. How will I monitor costs to ensure that I don’t get an unwelcome surprise?
  • To what level of detail can I monitor users, services, type of usage, etc?
  • How many credits am I using on a daily, weekly or monthly basis? What are those costing me in real dollars?
  • What is my burn rate, how is it trending, and what are the prime contributors?
  • Who is consuming the most compute time in the service?
  • What’s my usage over the last 30 days?

There are many options depending on your requirements, from daily or weekly summaries in a BI dashboard sent to a broad team, to near-real-time views of specific KPIs and cost metrics.

The underlying data is available in Snowflake’s Account Usage views, ready to be queried and reported on at whatever level of detail you need for visibility. Caching, elasticity, and cost monitoring are just a few examples of what makes Snowflake worth getting to know.
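As a sketch of what such monitoring can look like, the query below summarizes 30 days of credit consumption from the Account Usage view WAREHOUSE_METERING_HISTORY, and a small helper converts credits to dollars. The per-credit price is an assumption (substitute your contracted rate), and the example rows are made up to show the shape of the result.

```python
# Sketch of a daily burn-rate report. The Account Usage view name is real
# (snowflake.account_usage.warehouse_metering_history); the $3.00 per-credit
# price is an assumption -- substitute your contracted rate.

USAGE_SQL = """
SELECT TO_DATE(start_time) AS usage_date,
       warehouse_name,
       SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, 2
"""

def daily_cost(rows, price_per_credit=3.00):
    """Turn (date, warehouse, credits) rows into dollar cost per day."""
    totals = {}
    for usage_date, _warehouse, credits in rows:
        totals[usage_date] = totals.get(usage_date, 0.0) + credits * price_per_credit
    return totals

# Made-up rows illustrating what the query above might return:
rows = [("2022-11-28", "BI_WH", 10.0),
        ("2022-11-28", "ELT_WH", 4.0),
        ("2022-11-29", "BI_WH", 6.5)]
costs = daily_cost(rows)
```

Running the query on a schedule and feeding the totals into a dashboard answers the burn-rate, trend, and top-contributor questions above without any extra tooling.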

#3 Avoid Being Overwhelmed by Option Fatigue

When it comes to the options associated with data integration for moving from on-prem to the cloud, it’s easy to become overwhelmed and intimidated. You may feel pressured to select the perfect data integration tool—or the “one tool to rule them all.”

However, the reality is that, whether you are a startup, medium-sized organization, or large global enterprise, the proliferation of data means that you’re going to need to address a wide range of use case dimensions:

  • Data sources
  • Data targets
  • Data types
  • Data locations
  • Data volumes
  • Transformation requirements
  • Automation needs
  • Monitoring options
  • Incremental change needs
  • Varying bulk load sizes
  • Network considerations
  • Security parameters
  • UI and ease of use
  • OEM network
  • Organizational issues
  • Risk tolerance
  • Range of skill sets
  • Other priorities

All of these dimensions (and more) feed into the decision of what to use, when, why, and how, as well as when to take a different approach or pick up another tool.

Is there a one-size-fits-all tool that will take care of all of the use cases, patterns, and scenarios? Even if you’re 100% comfortable that the technical factors are covered, business factors should also be considered, such as:

  • Company viability
  • Ease of doing business
  • Number of deployments
  • Size of deployments
  • Pricing model
  • Contracting timeline and process

Those are just a few examples and your organization’s mileage may vary — what’s important to you or the most critical will be unique to your situation.

So how do I find a single tool that meets my individual technical and business criteria? Well, I hate to break it to you, but there is no “one tool to rule them all.”

Instead of focusing on the “perfect” integration solution, focus on the key patterns you have and, most importantly, on driving data- and analytics-driven business outcomes. Rather than spending your time searching for a perfection that doesn’t exist, keep three to four options in your data integration “toolbox” that you can pull out based on the requirement, timing, and available skill sets.

Even if a vendor claims to have the perfect solution for all use cases because it was a perfect fit for similar customers, no such solution exists. Every organization differs in its technology stack, policies toward new tools, security restrictions, business processes, and, of course, global reach and localization needs.

Therefore, in order to help clients avoid “option fatigue” and get to a decision point, it’s important to demystify the range of data integration solutions.

In the on-prem world—especially with data warehousing workloads—it may seem easy to gravitate towards a single IT-centric solution. However, a modern approach demands thinking beyond a single tool to anticipate and stay ahead of the data and analytics demands of your stakeholders.

Typically, long, drawn-out, IT, data, and analytics projects aren’t welcomed—everyone wants to move quickly and move on to the next high-value area.

To avoid the option fatigue that comes along with selecting data integration tools, keep these tips in mind:

  1. Don’t get locked into a “unicorn” tool mindset.
  2. Stay outcome-focused and keep it simple (until you can’t).
  3. Determine the correct balance between technical and business fit for your organization.
  4. Be realistic about your delivery capabilities and get help if you need it.
  5. Think in terms of a 4S model: Simple, Speedy, Sustainable, and Self-Serve.

Go Modern

Don’t think about the cloud and Snowflake as a lift-and-shift of your current ETL environment. Instead, use this as an opportunity to modernize your approach to deliver high-value outcomes and avoid the pitfalls that others have dealt with along the way.

Beginning the journey will require some key decisions, and those decisions will make an impact. With a sound strategy, team, approach, and a few tool options, you can make the most of going from on-prem to the cloud with Snowflake.

Need Snowflake Cloud Data Warehousing and Migration Assistance?

If you’d like additional assistance in this area, NTT DATA is your Trusted Global Innovator. We offer consulting services that will help you discover the possibilities of Snowflake, and we would love to work through your specifics in this area so you can achieve your goals of successfully moving from on-prem to the cloud.
