Snowflake is a cloud-based Data Warehouse as a Service (DWaaS) platform. Surveys indicate DWaaS deployments are growing rapidly as more organizations shift from on-premises systems to cloud data warehouses.
Why Snowflake Data Cloud?
Snowflake combines the power of data warehousing and big data platforms at a lower cost than traditional data warehousing solutions. Additionally, Snowflake runs on all three major cloud service providers, namely Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, making it a flexible platform.
Snowflake – Architecture
Snowflake's architecture keeps storage and compute separate. Compute clusters consist of stateless worker nodes that can be scaled on demand. Each compute cluster is referred to as a virtual warehouse, and all SQL queries on Snowflake run on these virtual warehouses.
The top layer of the Snowflake architecture is the Cloud Services Layer. This layer is responsible for authentication, infrastructure management, metadata management, query parsing and optimization, and access control.
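As a quick illustration of how compute is provisioned separately from storage, the following Python sketch uses the snowflake-connector-python package to create a virtual warehouse and run a query on it. The account, user, and warehouse names are placeholders.

```python
# pip install snowflake-connector-python
import snowflake.connector

# Connection parameters are placeholders; substitute your own account details.
conn = snowflake.connector.connect(
    account="xy12345.us-east-1",
    user="ANALYST",
    password="********",
)
cur = conn.cursor()

# A virtual warehouse is a compute cluster that can be created, resized,
# suspended, and resumed independently of the stored data.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS demo_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
""")
cur.execute("USE WAREHOUSE demo_wh")

# Every SQL query runs on the currently selected virtual warehouse.
cur.execute("SELECT CURRENT_WAREHOUSE(), CURRENT_VERSION()")
print(cur.fetchone())

cur.close()
conn.close()
```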
Snowflake – Data Loading
Snowflake supports ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) for loading data.
In ETL, the data is first loaded into a staging environment, transformed to the required format, and then loaded into the data warehouse. In the case of ELT, the raw data is directly loaded into the warehouse. Transformation logic is applied to the data as and when required.
Snowflake can load data directly from cloud-based storage such as an AWS S3 bucket. The data must be available in a supported format such as CSV or JSON. Data can also be loaded into Snowflake using its third-party partners, such as Hevo Data or Informatica.
ELT Technique of Data Loading
- Bulk Load
Snowflake bulk loads data in formats such as CSV or JSON. A staging environment, such as an AWS S3 bucket or Google Cloud Storage, is required to host the files.
Loading bulk data from an S3 bucket requires setting up an IAM role on the AWS side. The IAM role must be granted read access to the bucket (and write access if loaded files are to be removed afterwards). For tighter security, the virtual private cloud (VPC) ID of the Snowflake account can be added to the S3 bucket policy so that access is restricted to Snowflake.
Once the configuration is in place, data can be loaded from the S3 bucket into Snowflake. The schema of the target table is typically defined up front, although Snowflake can also infer it from the staged files.
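As a minimal sketch of this bulk-load flow, the snippet below uses the Python connector to create a storage integration tied to the IAM role, an external stage pointing at the S3 bucket, and a COPY INTO command that loads the staged CSV files. The bucket name, role ARN, and target table sales_raw are hypothetical, and the target table is assumed to exist already.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.us-east-1", user="LOADER", password="********",
    warehouse="demo_wh", database="demo_db", schema="public",
)
cur = conn.cursor()

# The storage integration delegates S3 access to an IAM role that has read
# permission on the bucket (bucket name and role ARN are placeholders).
cur.execute("""
    CREATE STORAGE INTEGRATION IF NOT EXISTS s3_int
    TYPE = EXTERNAL_STAGE
    STORAGE_PROVIDER = 'S3'
    ENABLED = TRUE
    STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-load-role'
    STORAGE_ALLOWED_LOCATIONS = ('s3://my-landing-bucket/sales/')
""")

# An external stage points at the S3 prefix holding the CSV files.
cur.execute("""
    CREATE STAGE IF NOT EXISTS sales_stage
    URL = 's3://my-landing-bucket/sales/'
    STORAGE_INTEGRATION = s3_int
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# Bulk load everything currently staged into the (pre-created) target table.
cur.execute("COPY INTO sales_raw FROM @sales_stage ON_ERROR = 'CONTINUE'")
print(cur.fetchall())  # one status row per loaded file

cur.close()
conn.close()
```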
- Loading Data from AWS S3 Buckets Using Snowpipe
Snowpipe enables loading files from a staging platform, such as an AWS S3 bucket, into Snowflake as and when new files become available. The initial setup is similar to that for bulk loading.
Additionally, an S3 event notification has to be configured on the AWS side so that the SQS (Simple Queue Service) queue associated with the pipe is notified as soon as a new file is uploaded to the bucket. Snowpipe then runs the bulk loading process to load the data into Snowflake.
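Continuing the hypothetical names from the bulk-load sketch above, the snippet below creates a pipe with AUTO_INGEST enabled and prints its metadata; the notification_channel value returned by SHOW PIPES is the ARN of the SQS queue to which the S3 bucket's "object created" event notification should be pointed.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.us-east-1", user="LOADER", password="********",
    warehouse="demo_wh", database="demo_db", schema="public",
)
cur = conn.cursor()

# AUTO_INGEST = TRUE makes the pipe load files automatically whenever an
# S3 event notification arrives on its Snowflake-managed SQS queue.
cur.execute("""
    CREATE PIPE IF NOT EXISTS sales_pipe
    AUTO_INGEST = TRUE
    AS COPY INTO sales_raw FROM @sales_stage
""")

# The notification_channel column of SHOW PIPES holds the SQS queue ARN to
# use when configuring the S3 bucket's event notification.
cur.execute("SHOW PIPES LIKE 'sales_pipe'")
print(cur.fetchall())

cur.close()
conn.close()
```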
ETL Technique of Data Loading
- Using Snowflake's Third-Party ETL Tools
Hevo Data is a third-party partner of Snowflake that allows ETL from any cloud-based source. The source may be files uploaded to a cloud storage system such as Google Drive or an AWS S3 bucket, or any managed relational or non-relational database.
If the source is a non-relational database, such as MongoDB, Hevo Data is responsible for converting the data into a structured format before pushing it to Snowflake.
It is also possible to write custom transformation logic in Python within Hevo Data. The transformation logic is applied directly to the extracted data, and the transformed data is then pushed into Snowflake.
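Hevo Data exposes its own event and transformation classes, so the sketch below is only a generic illustration of the kind of row-level logic such a transformation might contain; the field names are made up and the function is not Hevo's actual interface.

```python
from datetime import datetime, timezone
from typing import Optional

def transform(record: dict) -> Optional[dict]:
    """Normalize one extracted record before it is pushed to Snowflake."""
    # Drop records that are missing a primary key.
    if not record.get("order_id"):
        return None

    # Flatten a nested field and standardize the timestamp to UTC ISO-8601.
    record["customer_email"] = (record.pop("customer", {}) or {}).get("email")
    ts = record.get("created_at")
    if isinstance(ts, (int, float)):
        record["created_at"] = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()

    # Rename a field to match the target Snowflake column.
    record["amount_usd"] = record.pop("amount", None)
    return record
```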
- Custom ETL Using Snowpipe REST API
This approach requires setting up a Snowpipe pipe to load data into Snowflake from cloud staging storage such as an S3 bucket. In this case, the SQS notification does not need to be set up.
A custom ETL pipeline involves a compute stage where the data is fetched from any source. Transformation logic may be applied to the data by writing code in any language. Once the data is transformed, it has to be saved in a format such as CSV or JSON in an AWS S3 bucket.
Once the files are uploaded to the S3 bucket, the Snowpipe REST API is called with the list of files to be loaded, and Snowpipe then loads the data into Snowflake. Another Snowpipe REST endpoint can be used to monitor the status of the loading process.
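As a rough sketch of this flow, the snippet below uses the snowflake-ingest Python package, which wraps the Snowpipe REST endpoints for submitting files (insertFiles) and checking load history (insertReport). It assumes key-pair authentication is configured for the user and that the pipe from the earlier sketch exists; all names and paths are placeholders.

```python
# pip install snowflake-ingest
from snowflake.ingest import SimpleIngestManager, StagedFile

# Key-pair authentication: the PEM private key registered for the user.
with open("rsa_key.p8") as f:
    private_key = f.read()

ingest = SimpleIngestManager(
    account="xy12345",
    host="xy12345.us-east-1.snowflakecomputing.com",
    user="LOADER",
    pipe="DEMO_DB.PUBLIC.SALES_PIPE",   # fully qualified pipe name
    private_key=private_key,
)

# Submit the list of newly uploaded files (paths are relative to the stage
# referenced by the pipe); this calls the insertFiles REST endpoint.
files = [StagedFile("2024/06/01/orders_001.csv", None)]
response = ingest.ingest_files(files)
print(response)

# Poll the load history (insertReport endpoint) to monitor loading status.
history = ingest.get_history()
for entry in history.get("files", []):
    print(entry.get("path"), entry.get("status"))
```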
Snowflake – Data Visualization
Snowflake supports SQL queries for fetching data, and third-party visualization platforms such as Tableau can be used to visualize it. A Tableau dashboard is set up with the Snowflake connection credentials, and each dashboard has an initial query that fetches the required data from Snowflake. Custom transformations and visualizations may be applied to the data as needed.
Snowflake also provides native connectors for Python and Node.js. Data may be imported directly into a custom application by passing an SQL query through these connectors, and custom modules for transforming, cleaning, and visualizing the data may be built as required.
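As a brief sketch (table and column names are hypothetical), the snippet below uses the Python connector to pass an SQL query to Snowflake and pull the result set into a pandas DataFrame for custom transformation, cleaning, or visualization.

```python
# pip install "snowflake-connector-python[pandas]"
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.us-east-1", user="ANALYST", password="********",
    warehouse="demo_wh", database="demo_db", schema="public",
)
cur = conn.cursor()

# Pass an SQL query through the connector and fetch the results as a
# pandas DataFrame for downstream processing or plotting.
cur.execute("""
    SELECT region, SUM(amount_usd) AS revenue
    FROM sales_raw
    GROUP BY region
    ORDER BY revenue DESC
""")
df = cur.fetch_pandas_all()
print(df.head())

cur.close()
conn.close()
```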