Data Build Tool (DBT) integration with Snowflake
- October 06, 2021
This tech article describes the integration of DBT with Snowflake.
Figure 1. Role of DBT in data analytics
Introduction to DBT
Data Build Tool (DBT) is a Python application that acts as a transformation tool. It focuses on the transformation part of extract, load, transform (ELT) processes, performing the data engineering work before the data is used for analytics. DBT transforms data using select statements and materializes the results as tables and views, making the transformation process simple and effective.
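As a rough sketch of what such a transformation looks like, the hypothetical model below (the model, upstream model and column names are illustrative, not part of any standard project) is just a select statement that DBT materializes when `dbt run` executes:

```sql
-- models/customer_orders.sql (hypothetical model and column names)
-- A DBT model is a select statement; DBT materializes the result set
-- as a table or view in the target schema when `dbt run` is executed.
{{ config(materialized='table') }}

select
    customer_id,
    count(order_id)  as order_count,
    max(order_date)  as latest_order_date
from {{ ref('stg_orders') }}   -- upstream staging model built from the raw loaded data
group by customer_id
```

The ref() call is what lets DBT infer dependencies between models, which is also what drives the lineage graphs described under the key features below.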
DBT comes in two variants:
- DBT Command Line Interface (CLI), which is run from a terminal
- DBT Cloud, which is a web-based application with an integrated development environment (IDE)
Figure 2. Data Build Tool
Key features of DBT
- Documentation. DBT documentation is updated as models are developed and is easily accessible throughout the development process. It is generated from the descriptions provided, the dependencies between models, the SQL files, the sources and the tests defined.
- Data lineage. The data pipeline in DBT is represented as lineage graphs, which make the data visible and show how it maps to the business logic. The complete flow of data, from the source to the target system, is shown in the graph, providing transparency.
- Version control. The integration of Git with DBT brings version control. All the models, generated tests, sources, packages and other configurations used in a project are versioned in the connected Git repository.
- Testing. DBT comes with prebuilt tests for uniqueness, not-null values, referential integrity and accepted values. You can use Jinja and SQL to write custom test cases that go beyond the default tests (a custom test is sketched after this list).
- Reusability. The Jinja templating framework makes it possible to develop models and macros that can be reused across a project.
- Data refreshes within DBT Cloud. Using DBT Cloud, it's possible to schedule refreshes in the production environment based on business requirements, without a separate orchestration tool.
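To illustrate the testing point above, here is a hedged sketch of a custom generic test written in Jinja and SQL; the test name and file path are hypothetical, not part of DBT's built-in test set:

```sql
-- macros/test_is_positive.sql (hypothetical custom generic test,
-- using the {% test %} block supported in DBT 0.20 and later)
-- A generic test is a parameterized select; the test fails for a given
-- model column if the query returns any rows.
{% test is_positive(model, column_name) %}

select {{ column_name }}
from {{ model }}
where {{ column_name }} <= 0

{% endtest %}
```

Once defined, the test can be attached to any model column in a schema .yml file, alongside the built-in unique, not_null, relationships and accepted_values tests.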
Why DBT and Snowflake?
Snowflake is a data warehouse offered as software as a service (SaaS) that is faster, more user friendly and more flexible than a traditional data warehouse. Snowflake's data warehouse was developed using a SQL database engine, with an architecture designed specifically for the cloud. Snowflake is built on a multicluster, shared data architecture, which separates the storage layer from the compute (query processing) layer.
DBT is an open-source tool that’s used to manage the ELT load in Snowflake. Features include the following:
- Converting tables to views. Changing the materialization in a single config file is enough to turn a table into a view (see the sketch after this list).
- Storing procedures. Logic that would traditionally be kept in stored procedures is created as DBT models, which can be accessed and modified.
- Combining transformation logic. DBT groups similar transformation logic together using DBT tags.
- Providing version control. DBT supports version control by integrating with GitHub.
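As a sketch of the materialization and tagging points above (the model name, tag and upstream model are hypothetical), both settings can be declared in a model's config block, or centrally in the project's dbt_project.yml file:

```sql
-- models/finance/daily_revenue.sql (hypothetical model)
-- Switching materialized from 'table' to 'view' is enough to rebuild this
-- object as a view on the next run; the tag groups related transformation
-- logic so it can be run together, e.g. dbt run --select tag:finance.
{{ config(materialized='view', tags=['finance']) }}

select
    order_date,
    sum(amount) as revenue
from {{ ref('stg_payments') }}
group by order_date
```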
There’s also an active open-source community, which can help enhance development by sharing experiences from fellow developers rather than starting from scratch.
Integration of DBT with Snowflake
Step 1: Create a Snowflake account
Create an account with Snowflake. Pay attention to the highlighted details in the worksheet (Figure 3), which will be used to connect the database to DBT.
Figure 3. Snowflake Worksheet
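As a hedged example, the details DBT Cloud will ask for in Step 3 can be confirmed from the worksheet with Snowflake's context functions, and dedicated objects for DBT can be created if needed (the warehouse and database names below are placeholders):

```sql
-- Confirm the account, role, warehouse and database details that
-- DBT Cloud will ask for when the connection is configured in Step 3.
select current_account(), current_region(), current_role(),
       current_warehouse(), current_database();

-- Optionally, create dedicated objects for DBT to use (placeholder names).
create warehouse if not exists dbt_wh with warehouse_size = 'XSMALL';
create database if not exists dbt_db;
```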
Step 2: Create a DBT Cloud account and set up a new project
- Create a DBT account using Try DBT Cloud Free and sign in
- Create a new project following the steps below
Figure 4. DBT Project
- Click Begin to proceed
- On the next screen, enter an appropriate DBT project name and click Continue
- On the next screen, select Snowflake from the list of data warehouses
Step 3: Connect to a Snowflake instance
The following fields are required when creating a Snowflake connection:
- Account. This is the Snowflake account to which you will connect. If the URL for the Snowflake account is, for example, abc12345.east-us-2.azure.snowflakecomputing.com, then the account name should be abc12345.east-us-2.azure.
- Role. This is the role you’ll use after connecting to Snowflake (optional field).
- Database. Establish a connection with this logical database in the data warehouse to run queries.
- Warehouse. This is the virtual warehouse you’ll use to run queries.
- Authentication method:
- Username/password. Enter the Snowflake username (specifically, the login_name) and the corresponding password; DBT Cloud uses these credentials to run queries in Snowflake on behalf of the Snowflake user.
- Key pair. The key pair authentication method is based on Snowflake's key pair authentication and is used to authenticate the user's credentials when they're accessed from a DBT Cloud project. After generating an encrypted key pair, set rsa_public_key for the Snowflake user, as sketched below.
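For the key pair option, the Snowflake side of the setup looks roughly like the following (the user name and key value are placeholders; the public key is the one generated for the key pair):

```sql
-- Attach the generated public key to the Snowflake user that DBT Cloud
-- will authenticate as (run with a role that is allowed to alter the user).
alter user dbt_user set rsa_public_key = 'MIIBIjANBgkqh...';

-- Verify that the key was registered for the user.
describe user dbt_user;
```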
Figure 5. Snowflake Connection with DBT
Once the credentials are entered, test the connection. If the connection is successful, proceed to connect DBT to an empty Git repository.
References:
https://docs.getdbt.com/docs/introduction
— By Rini Shiny M