Enterprise data warehouses (EDWs) became necessary in the 1980s, when organizations shifted from using data purely for day-to-day operations to using it to fuel critical business decisions. Data warehouses differ from operational databases in purpose: operational transactional databases capture data from individual transactions, while data warehouses aggregate that transactional data for analytics.
Data warehouses are popular because they help break down data silos and ensure data consistency. You can aggregate and analyze relevant data from multiple sources without worrying about inconsistent and inaccessible data. This consistency promotes data integrity, so you can trust the insights to make informed decisions. Additionally, data warehouses are great at offering historical intelligence. Because data warehouses collect large amounts of historical data over time, you can access and evaluate your previous decisions, identify winning trends, and adjust strategies as needed.
However, organizations today are moving beyond just batch analytics on historical data. Internal users and customers alike are demanding speedy updates based on real-time data. With much of the data centralized in their data warehouse, data teams try to continue to leverage the data warehouse for these new real-time needs. Often though, they learn that data warehouses are too slow and too expensive to run low latency, high concurrency workloads on real-time data.
In this article, we’ll explore the strengths and shortcomings of three prominent data warehouses today: Google BigQuery, Amazon Redshift, and Snowflake. We’ll specifically highlight how they may not be the best solutions for real-time analytics.
BigQuery is Google’s data warehouse service and one of the first cloud data warehouses released to the public. This fast, serverless, highly scalable, and cost-effective multi-cloud data warehouse has built-in machine learning, business intelligence, and geospatial analysis capabilities for querying massive amounts of structured and semi-structured data.
BigQuery pricing has two main components: query processing costs and storage costs. For query processing, BigQuery charges $5 per TB of data processed by each query, with the first TB of data per month free. For storage, BigQuery offers up to 10GB of free data storage per month and $0.02 per additional GB of active storage, making it very economical for storing large amounts of historical data.
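To make the two pricing components concrete, here is a minimal sketch of a monthly cost estimate using only the figures quoted above ($5 per TB queried after a 1 TB free tier, $0.02 per GB of active storage after 10 GB free). Actual bills vary by region and pricing updates, so treat the constants as assumptions to verify against Google's current price list.

```python
# Rough BigQuery monthly cost estimate from the on-demand figures
# quoted in the article. Constants are assumptions, not a price quote.
QUERY_RATE_PER_TB = 5.00     # per TB processed, after free tier
FREE_QUERY_TB = 1.0          # first TB per month is free
STORAGE_RATE_PER_GB = 0.02   # per GB of active storage, after free tier
FREE_STORAGE_GB = 10.0       # first 10 GB per month is free

def estimate_monthly_cost(tb_queried: float, gb_stored: float) -> float:
    """Return an estimated monthly cost in USD."""
    query_cost = max(0.0, tb_queried - FREE_QUERY_TB) * QUERY_RATE_PER_TB
    storage_cost = max(0.0, gb_stored - FREE_STORAGE_GB) * STORAGE_RATE_PER_GB
    return round(query_cost + storage_cost, 2)

# 5 TB queried and 1,000 GB stored: 4 billable TB + 990 billable GB.
print(estimate_monthly_cost(5.0, 1000.0))  # -> 39.8
```

Note how a light query workload under 1 TB with a small dataset can cost nothing at all, which is part of why BigQuery is economical for historical data.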
BigQuery provisions infrastructure and resources, automatically scaling compute capabilities and storage capacity up to petabytes of data based on your organization’s needs. This feature helps you focus on gaining valuable insights from your data instead of spending time on infrastructure and warehouse management.
Its high-speed streaming ingestion API (up to 3GB per second of data input) supports continuous analysis and reporting. After ingesting the data, BigQuery can apply its built-in machine learning and visualization features to power dashboards for making important decisions.
BigQuery aims to provide fast queries on massive datasets. However, data inserted via its streaming API isn’t available to query for two to three minutes. So, it’s not real-time data.
Amazon Redshift is a fully managed cloud data warehouse for SQL analytics. It analyzes structured and unstructured data from other warehouses, operational databases, and data lakes.
Pricing starts at $0.25 per hour and then scales up or down depending on usage. Redshift can scale up to exabytes of storage data, making it an excellent option if you’re handling extensive datasets.
It integrates with the Amazon Kinesis Data Firehose extract, transform, and load (ETL) service. This integration quickly ingests streaming data and analyzes it for quick use. However, this ingested data isn’t available immediately. Because there is a 60-second buffering delay, the information is near real-time rather than actually real-time.
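The buffering behavior above is worth spelling out: Firehose delivers a batch when either a size threshold or a time threshold is reached, whichever comes first, so a record's delay depends on how fast data is arriving. The sketch below illustrates this with assumed threshold values (a 5 MB buffer and the 60-second interval mentioned above); check the Firehose documentation for the configurable ranges.

```python
# Why Firehose is "near real-time": a buffered record is delivered when
# either the size threshold or the time threshold is hit, whichever
# comes first. Threshold defaults below are illustrative assumptions.
def delivery_delay_seconds(ingest_rate_mb_per_s: float,
                           buffer_size_mb: float = 5.0,
                           buffer_interval_s: float = 60.0) -> float:
    """Approximate worst-case seconds before a buffered record is delivered."""
    time_to_fill = buffer_size_mb / ingest_rate_mb_per_s
    return min(time_to_fill, buffer_interval_s)

# A trickle of data (0.01 MB/s) waits out the full 60 s interval:
print(delivery_delay_seconds(0.01))  # -> 60.0
# A heavy stream (10 MB/s) fills the 5 MB buffer in half a second:
print(delivery_delay_seconds(10.0))  # -> 0.5
```

Either way, there is always some buffering delay between an event occurring and it being queryable in Redshift.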
As with all data warehouses, Redshift query performance is not real-time. One way to increase query speed is to select the ideal sort and distribution keys. However, this method requires prior knowledge of the intended query, which isn’t always possible. So, Redshift may not be ideal for fast, ad-hoc real-time queries.
Snowflake cloud data warehouse has become an increasingly popular option. Snowflake provides quick and easy SQL analytics on structured and semi-structured data. You can provision compute resources to get started with this service.
Snowflake’s high-performance, flexible architecture also enables you to scale your Snowflake use up and down, with per-second pricing. Snowflake’s separate compute and storage functions scale independently, allowing more pricing flexibility. Cost can be difficult to estimate as it’s obscured by credits, but pricing starts at $2 per credit for compute resources and $40/TB per month for active storage. Even though Snowflake is a fully managed service, you need to select a cloud provider (AWS, Azure, or Google Cloud) to start.
The Snowpipe feature manages continuous data ingestion. However, this continuous streaming data isn’t available for a few minutes. This delay makes it unappealing for real-time analytics because you can’t query data immediately. Snowpipe costs can also increase dramatically as more file ingestions are triggered.
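To see why file-heavy ingestion can inflate Snowpipe bills, here is an illustrative calculator. The $2-per-credit rate comes from the pricing discussed above; the per-file notification overhead (roughly 0.06 credits per 1,000 files) is an assumption to verify against Snowflake's current billing documentation, and compute charges are excluded entirely.

```python
# Illustrative Snowpipe per-file overhead cost. The overhead figure is
# an assumption based on Snowflake's documented ~0.06 credits per
# 1,000 files; compute credits for the actual loading are excluded.
CREDIT_PRICE_USD = 2.00
FILE_OVERHEAD_CREDITS_PER_1000 = 0.06

def snowpipe_file_overhead_usd(files_per_day: int, days: int = 30) -> float:
    """Monthly per-file overhead cost in USD, excluding compute."""
    total_files = files_per_day * days
    credits = total_files / 1000 * FILE_OVERHEAD_CREDITS_PER_1000
    return round(credits * CREDIT_PRICE_USD, 2)

# One tiny file per second (86,400 files/day) for a month:
print(snowpipe_file_overhead_usd(86_400))  # -> 311.04
```

The takeaway: streaming many small files triggers many ingestions, so overhead scales with file count rather than data volume.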
Finally, as with all scan-based systems, even though Snowflake can return complex query results faster than many alternatives, those queries can still take many seconds or even minutes, making it a sub-par solution for real-time analytics. Paying for larger virtual warehouses leads to faster performance, but the results are still too slow for real-time use cases.
Three Reasons Data Warehouses Aren’t Made For Real-Time Data
While data warehouses have their strengths — especially when it comes to processing large amounts of historical data — they aren’t ideal for processing low latency, high concurrency workloads on real-time data. This is true for the three data warehouses mentioned above. Here are the reasons why.
First, data warehouses are not built for mutability, a necessity for real-time data analytics. To ensure fast analytics on real-time data, your data store must be able to update data quickly as it arrives. This is especially true for event streams, where multiple events may need to be combined to reflect the true state of a single real-life object, and where network problems or software crashes can cause data to be delivered late. Late-arriving events need to be reloaded or backfilled into data that has already been written.
Data warehouses, in contrast, use immutable data structures, because data that doesn’t need to be continuously checked against the original source is easier to scale and manage. The cost of that immutability is that data warehouses expend significant processing power and time to update data, resulting in high data latency that can rule out real-time analytics.
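The contrast can be sketched with two toy stores: a mutable store applies a late-arriving correction as a cheap keyed upsert, while an immutable, append-only store must rewrite the whole segment containing the record. These are hypothetical structures for illustration, not how any particular warehouse lays out data internally.

```python
# Toy contrast: mutable update-in-place vs. immutable segment rewrite.
# Hypothetical data layout, for illustration only.

# Mutable store: a late-arriving event is a cheap O(1) keyed upsert.
mutable_store = {"order-1": {"status": "pending"}}
mutable_store["order-1"] = {"status": "shipped"}

# Immutable store: segments are frozen; "updating" one record means
# rebuilding the entire segment that contains it.
immutable_segment = (("order-1", "pending"), ("order-2", "paid"))

def rewrite_segment(segment, key, new_value):
    """Rebuild a whole immutable segment to change a single record."""
    return tuple((k, new_value if k == key else v) for k, v in segment)

immutable_segment = rewrite_segment(immutable_segment, "order-1", "shipped")
print(mutable_store["order-1"]["status"])        # shipped
print(dict(immutable_segment)["order-1"])        # shipped
```

At warehouse scale, that segment rewrite translates into reprocessing large immutable files for every small correction, which is where the update cost and latency come from.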
Second, data warehouses have high query latency. This is because data warehouses don’t rely on indexes for fast queries and instead organize data into its compressed, columnar format. Without indexes, data warehouses must run heavy scans through large portions of the data for each query. This can result in queries taking tens of seconds or longer to run, especially as data size or query complexity grows.
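A minimal sketch makes the scan-versus-index distinction concrete: a scan-based query touches every row to find matches, while an indexed store jumps straight to them after a one-time index build. The row counts here are tiny, but the asymptotics (O(n) per query versus roughly O(1) per lookup) are what matter at billions of rows.

```python
# Scan vs. index on a toy table of 10,000 rows. Illustrative only.
rows = [{"id": i, "country": "US" if i % 100 == 0 else "DE"}
        for i in range(10_000)]

# Scan-based lookup: examines every row on every query.
def scan(rows, country):
    return [r["id"] for r in rows if r["country"] == country]

# Index-based lookup: one hash probe per query after a one-time build.
index = {}
for r in rows:
    index.setdefault(r["country"], []).append(r["id"])

assert scan(rows, "US") == index["US"]  # same answer, far less work per query
print(len(index["US"]))  # -> 100
```

The index trades extra storage and write-time work for query speed, which is the opposite of the trade-off a compressed columnar warehouse makes.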
Finally, data warehouses require extensive data modeling and ETL work to ensure the data is high quality, consistent, and well structured for running applications and achieving consistent results. Not only is it resource-intensive and time-consuming to build and maintain these data pipelines, but they are also relatively rigid so new requirements that emerge later on need new pipelines, which add significant cost and complexity. Processing the data also adds latency and reduces the value of the data for real-time needs.
A Real-Time Analytics Database To Complement the Data Warehouse
Rockset is a fully managed, cloud-native real-time analytics database that enables sub-second queries on fresh data for customer-facing data applications and dashboards. Although Rockset isn’t a data warehouse and doesn’t replace one, it works well to complement data warehouses such as Snowflake to perform real-time analytics on large datasets.
Unlike data warehouses that store data in columnar format, Rockset indexes all fields, including nested fields, in a Converged Index. Rockset’s cost-based query optimizer leverages the Converged Index to automatically find the most efficient way to run low latency queries. It does this by exploiting selective query patterns within the indexed data and accelerating aggregations over large numbers of records. Rockset does not scan any faster than a cloud data warehouse; instead, it works hard to avoid full scans altogether, which is what allows it to run sub-second queries on billions of rows.
Like Snowflake and BigQuery, Rockset separates storage costs from compute costs, and its pay-as-you-go model ensures that you pay for only what you use.
Although Rockset isn’t suitable for storing large volumes of less frequently used data, it’s an excellent option for performing real-time analytics on terabyte-sized active datasets. Rockset can provide query results with milliseconds of latency within two seconds of data generation.
For example, Ritual, a health-meets-technology company, needed real-time analytics to better personalize the buying experience on their website. Ritual uses Snowflake as their cloud data warehouse, but found the query performance too slow for their needs, so Rockset was brought in to complement Snowflake. By leveraging Rockset’s built-in connection with Snowflake, Ritual was able to query both historical and new data almost instantly and serve personalized offers with sub-second latency across their entire customer base.
Data warehouses became popular with the need to understand the large amounts of data that were being collected. The three most popular data warehouses today, Google BigQuery, Amazon Redshift, and Snowflake, continue to be important tools for batch analytics on historical data. Without a data warehouse, it can be difficult to get a precise enough picture to draw insights and make profitable decisions.
However, although most cloud data warehouses can perform multiple, complex queries on enormous datasets, they’re not ideal for building real-time solutions for data applications. This is because data warehouses were not built for low latency, high concurrency workloads. The data in a data warehouse is immutable, making it expensive and slow to make frequent small updates. The columnar format and lack of automatic indexing also slow down performance and drive up costs.
Rockset is a real-time analytics platform that enables fast analytics on real-time data. Its advanced indexing feature comprehensively processes these datasets to produce query results within milliseconds.
A solution like Rockset doesn’t replace your data warehouse, but it’s ideal as a complement for cases when you need fast analytics on real-time data. If you are building data apps or require low latency, high concurrency analytics on real-time data, try Rockset.