Cloud Data Warehousing – Everything You Need To Know

We are surrounded by data but starved for insights.”

~ Jay Baer, marketing and customer experience expert

Living in a data-driven world satisfies the need for information. However, data proves useful only while providing insights. There is a massive gap between being data-conscious and being insights-driven. Thus, a robust process to mine out useful information from the scores of data is a prime necessity.

What is Cloud Data Warehousing?

Data warehousing (DW) is the process of digitally collecting, analyzing and managing business data from varied heterogeneous sources to create meaningful insights. The core of business intelligence i.e. data analysis and reporting are possible only because of the presence of the data warehouse. Thus, data warehouses are electronic storage of information that can be converted into useful insights as demanded by the user’s requirements.

Data warehouses can be implemented by companies

On-premise
In the cloud (this being cloud data warehousing, which is what this blog is about)
As a mixture of both

Cloud data warehousing incorporates data in a public cloud and optimizes it for scaling business analytics.

Key Features

MPP (Massively Parallel Processing)

Cloud data warehouses involved with big data projects consist of a multitude of servers running parallel to distribute processing and loads.

Columnar Data Stores

Cloud data warehouses process data by columns rather than rows making it dramatically faster.

Data Warehouse Architecture

A typical data warehouse has three separate layers – Data Layer, Semantics Layer, and Analytics Layer.

Data Layer

The bottom tier – extracts, converts, and loads data from various sources using ETL technologies into the database server, data marts, and data lakes. In this tier, metadata is developed, and data integration methods such as data virtualization are utilized to seamlessly merge and aggregate data.

Semantics Layer

This is the middle tier that rearranges data for rapid, complex queries and analyses using Online analytical processing (OLAP) and online transactional processing (OLTP) servers.

Analytics Layer

The front-end client layer is the highest tier. It has data warehouse access capabilities. This renders the users capable of interacting with data –

Creating dashboards and reports
Tracking KPIs
Mining and analyzing data
Constructing apps, etc.

This level might feature a workbench or sandbox area for data exploration and new data model development.

Thus, the architecture declares the 4 basic components of data warehousing-

A central database
ETL (extract, transform, load) tools
Metadata
Access tools

The 3 Types of Data Warehouses

Enterprise data warehouse (EDW)
Operational data store (ODS)
Data Mart

Enterprise Data Warehouse (EDW)

It is a centralized warehouse and provides:

decision support services
a uniform approach to data organization and representation.
data categorization by subject

Operational Data Store (ODS)

An ODS comes in handy when an organization’s reporting need cannot be met by either a data warehouse or an OLTP system. The data warehouse in ODS is refreshed in real-time.

Data Mart

A data mart is developed for a special business line, for eg. sales or funding. It is a subdivision of a data warehouse. An independent data mart aids data collection directly from sources.

The 4 Stages of Data Warehousing

Offline Operational Database
Offline Data Warehouse
Real-time Data Warehouse
Integrated Data Warehouse

Offline Operational Database

In this stage, data is simply copied from the operational system to another server and does not influence the system’s performance.

Offline Data Warehouse

The data warehouse receives regular data updates from the Operational Database – mapped and changed to fulfill objectives.

Real-time Data Warehouse

At this step, whenever a transaction occurs in the operational database, the data warehouse is updated automatically.

Integrated Data Warehouse

At this level, data warehouses are regularly updated when the operating system makes a transaction. This triggers the data warehouse to generate transactions, to be returned to the operational system.

Advantages of Using Cloud Data Warehousing Tech

Cloud-based data warehouses are on a popularity rise. It is used in a wide range of sectors from airlines to healthcare to banking and investment. Let’s take a look at the reasons behind its preference for on-premise warehouses-

Real-time cloud analytics – gathers data over time and provides accurate analysis.
Elastic – able to integrate data from various sources.
Scalable – can manage huge amounts of data.
Focused – subject-oriented, emphasizes a particular area
Low cost of operation

Comparison Between 4 Cloud Data Warehousing Technologies

All of the below-mentioned cloud data warehousing providers are fully managed services that allow for fast querying and data analysis using SQL.

Amazon Redshift
Google BigQuery
Azure Synapse Analytics
Snowflake

Amazon Redshift
- A cloud data warehousing technology, provided by Amazon
- Handles datasets of sizes ranging from a few gigabytes to a petabyte
- Serves big names such as Pfizer, Pizza Hut, Nasdaq, and Equinox.

Source: Cloud Data Warehouse – Amazon Redshift– Amazon Web Services

Google BigQuery
- A multi-cloud-based data warehousing technology, provided by Google
- Completely serverless and built-in machine learning
- Rapidly analyzes terabytes to petabytes of data
- Serves big names such as Confluent, Informatica, Striim, and Funnel

Source: BigQuery Enterprise Data Warehouse | Google Cloud

Azure Synapse Analytics
- A cloud data warehousing technology, provided by Microsoft
- Provides the freedom to query data on one’s terms, either serverless or dedicated options—at scale.
- Has more certifications than any other cloud provider
- Serves big names such as Walgreens, M&S, Neogrid, and CCC Marketing.

Snowflake
- A cloud-based data warehouse, that runs no cloud of its own, and uses major cloud providers AWS, GCP, and Azure
- The architecture ensures that the storage and compute clusters remain separated
- Data integration is in-built
- Serves big names such as Capital One, JetBlue, Anthem, and HubSpot

To read more on Snowflake, Link to the Snowflake blog

Let’s look at a comparative study chart of their services:

	Amazon Redshift	Google BigQuery	Azure Synapse Analytics	Snowflake
Offered by	Amazon	Google	Microsoft	Doesn’t run its cloud, uses major public cloud vendors
Administration & Management	Instance size to be selected correctly, and configuring and scaling nodes manually, hence requires AWS expertise	Completely serverless – provisions are automatic	Both serverless and dedicated options are available	Simple features. Cloud provider and virtual warehouse sizes to be selected
Storing Data	Amazon Simple Storage Service (S3)	Google Cloud Storage	Azure Data Lake Storage	No storage. Uses AWS, GCP, or Azure**
Scalability	Decoupled storage. Computes with RA3 nodes	Storage and compute scales independently	Serverless option – scales independently.Dedicated option – extra storage added manually	Users can scale storage. Computes independently. Automatically adds and removes nodes
Data Integration Ecosystem	AWS Glue	Google Cloud Dataflow	Azure Data Factory	In-built
Analytics Ecosystem	For BI – AWS Quicksight For data – a roster of analytics platforms.	For data – Google workplace (simple Google Sheets upload) + Google Cloud. For BI – Looker.	For analytics – Azure ecosystem + PowerBI. For NoSQL – CosmosDB.	Completely based on the Snowflake platform
Ingestion of Streaming Data	No in-built capability	Possible. Code that calls on the streaming API can be used to insert records one by one.	Yes, using Apache Spark streaming functionality	Possible with Snowpipe
Data Backup and Recovery	Yes

**It is preferable to host Snowflake in the cloud that hosts all other services within the company for seamless integration. For example, if all of your other services are hosted on AWS, host Snowflake there as well.