We are surrounded by data but starved for insights.”
~ Jay Baer, marketing and customer experience expert
Living in a data-driven world satisfies the need for information. However, data proves useful only while providing insights. There is a massive gap between being data-conscious and being insights-driven. Thus, a robust process to mine out useful information from the scores of data is a prime necessity.
What is Cloud Data Warehousing?
Data warehousing (DW) is the process of digitally collecting, analyzing and managing business data from varied heterogeneous sources to create meaningful insights. The core of business intelligence i.e. data analysis and reporting are possible only because of the presence of the data warehouse. Thus, data warehouses are electronic storage of information that can be converted into useful insights as demanded by the user’s requirements.
Data warehouses can be implemented by companies
- On-premise
- In the cloud (this being cloud data warehousing, which is what this blog is about)
- As a mixture of both
Cloud data warehousing incorporates data in a public cloud and optimizes it for scaling business analytics.
Key Features
- MPP (Massively Parallel Processing)
Cloud data warehouses involved with big data projects consist of a multitude of servers running parallel to distribute processing and loads.
- Columnar Data Stores
Cloud data warehouses process data by columns rather than rows making it dramatically faster.
Data Warehouse Architecture
A typical data warehouse has three separate layers – Data Layer, Semantics Layer, and Analytics Layer.
- Data Layer
The bottom tier – extracts, converts, and loads data from various sources using ETL technologies into the database server, data marts, and data lakes. In this tier, metadata is developed, and data integration methods such as data virtualization are utilized to seamlessly merge and aggregate data.
- Semantics Layer
This is the middle tier that rearranges data for rapid, complex queries and analyses using Online analytical processing (OLAP) and online transactional processing (OLTP) servers.
- Analytics Layer
The front-end client layer is the highest tier. It has data warehouse access capabilities. This renders the users capable of interacting with data –
- Creating dashboards and reports
- Tracking KPIs
- Mining and analyzing data
- Constructing apps, etc.
This level might feature a workbench or sandbox area for data exploration and new data model development.
Thus, the architecture declares the 4 basic components of data warehousing-
- A central database
- ETL (extract, transform, load) tools
- Metadata
- Access tools
The 3 Types of Data Warehouses
- Enterprise data warehouse (EDW)
- Operational data store (ODS)
- Data Mart
- Enterprise Data Warehouse (EDW)
It is a centralized warehouse and provides:
- decision support services
- a uniform approach to data organization and representation.
- data categorization by subject
- Operational Data Store (ODS)
An ODS comes in handy when an organization’s reporting need cannot be met by either a data warehouse or an OLTP system. The data warehouse in ODS is refreshed in real-time.
- Data Mart
A data mart is developed for a special business line, for eg. sales or funding. It is a subdivision of a data warehouse. An independent data mart aids data collection directly from sources.
The 4 Stages of Data Warehousing
- Offline Operational Database
- Offline Data Warehouse
- Real-time Data Warehouse
- Integrated Data Warehouse
- Offline Operational Database
In this stage, data is simply copied from the operational system to another server and does not influence the system’s performance.
- Offline Data Warehouse
The data warehouse receives regular data updates from the Operational Database – mapped and changed to fulfill objectives.
- Real-time Data Warehouse
At this step, whenever a transaction occurs in the operational database, the data warehouse is updated automatically.
- Integrated Data Warehouse
At this level, data warehouses are regularly updated when the operating system makes a transaction. This triggers the data warehouse to generate transactions, to be returned to the operational system.
Advantages of Using Cloud Data Warehousing Tech
Cloud-based data warehouses are on a popularity rise. It is used in a wide range of sectors from airlines to healthcare to banking and investment. Let’s take a look at the reasons behind its preference for on-premise warehouses-
- Real-time cloud analytics – gathers data over time and provides accurate analysis.
- Elastic – able to integrate data from various sources.
- Scalable – can manage huge amounts of data.
- Focused – subject-oriented, emphasizes a particular area
- Low cost of operation
Comparison Between 4 Cloud Data Warehousing Technologies
All of the below-mentioned cloud data warehousing providers are fully managed services that allow for fast querying and data analysis using SQL.
- Amazon Redshift
- Google BigQuery
- Azure Synapse Analytics
- Snowflake
- Amazon Redshift
- A cloud data warehousing technology, provided by Amazon
- Handles datasets of sizes ranging from a few gigabytes to a petabyte
- Serves big names such as Pfizer, Pizza Hut, Nasdaq, and Equinox.
Source: Cloud Data Warehouse – Amazon Redshift– Amazon Web Services
- Google BigQuery
- A multi-cloud-based data warehousing technology, provided by Google
- Completely serverless and built-in machine learning
- Rapidly analyzes terabytes to petabytes of data
- Serves big names such as Confluent, Informatica, Striim, and Funnel
Source: BigQuery Enterprise Data Warehouse | Google Cloud
- Azure Synapse Analytics
- A cloud data warehousing technology, provided by Microsoft
- Provides the freedom to query data on one’s terms, either serverless or dedicated options—at scale.
- Has more certifications than any other cloud provider
- Serves big names such as Walgreens, M&S, Neogrid, and CCC Marketing.
- Snowflake
- A cloud-based data warehouse, that runs no cloud of its own, and uses major cloud providers AWS, GCP, and Azure
- The architecture ensures that the storage and compute clusters remain separated
- Data integration is in-built
- Serves big names such as Capital One, JetBlue, Anthem, and HubSpot
To read more on Snowflake, Link to the Snowflake blog
Let’s look at a comparative study chart of their services:
Amazon Redshift | Google BigQuery | Azure Synapse Analytics | Snowflake | |
Offered by | Amazon | Microsoft | Doesn’t run its cloud, uses major public cloud vendors | |
Administration & Management | Instance size to be selected correctly, and configuring and scaling nodes manually, hence requires AWS expertise | Completely serverless – provisions are automatic | Both serverless and dedicated options are available | Simple features. Cloud provider and virtual warehouse sizes to be selected |
Storing Data | Amazon Simple Storage Service (S3) | Google Cloud Storage | Azure Data Lake Storage | No storage. Uses AWS, GCP, or Azure** |
Scalability | Decoupled storage. Computes with RA3 nodes | Storage and compute scales independently | Serverless option – scales independently.Dedicated option – extra storage added manually | Users can scale storage. Computes independently. Automatically adds and removes nodes |
Data Integration Ecosystem | AWS Glue | Google Cloud Dataflow | Azure Data Factory | In-built |
Analytics Ecosystem | For BI – AWS Quicksight For data – a roster of analytics platforms. | For data – Google workplace (simple Google Sheets upload) + Google Cloud. For BI – Looker. | For analytics – Azure ecosystem + PowerBI. For NoSQL – CosmosDB. | Completely based on the Snowflake platform |
Ingestion of Streaming Data | No in-built capability | Possible. Code that calls on the streaming API can be used to insert records one by one. | Yes, using Apache Spark streaming functionality | Possible with Snowpipe |
Data Backup and Recovery | Yes |
**It is preferable to host Snowflake in the cloud that hosts all other services within the company for seamless integration. For example, if all of your other services are hosted on AWS, host Snowflake there as well.