Cutting Through the Confusion: Data Warehouse vs. Data Lake vs. Data Lakehouse Share: ITRex Date Published 11 March 2022 Categories Blog Reading Time 15-Minute Read Choosing the right data storage solution for business has never been more challenging with all the technological advances. Struggling to harness data sprawl, CIOs across industries are facing tough challenges. One of them is where to store all of their enterprise’s data to deliver robust data analytics. There have traditionally been two storage solutions for data: data warehouses and data lakes. Data warehouses mainly store transformed, structured data from operational and transactional systems, and are used for fast complex queries across this historical data. Data lakes act as a dump, storing all kinds of data, including semi-structured and unstructured data. They empower advanced analytics like streaming analytics for live data processing or machine learning. Historically, data warehouses were expensive to roll out because you needed to pay for both the storage space and computing resources, apart from skills to maintain them. As the cost of storage has declined, data warehouses have become cheaper. Some believe data lakes (traditionally a more cost-efficient alternative) are now dead. Some argue data lakes are still trendy. Meanwhile, others are talking about a new, hybrid data storage solution — data lakehouses. What’s the deal with each of them? Let’s take a close look. This blog explores key differences between data warehouses, data lakes, and data lakehouses, popular tech stacks, and use cases. It also provides tips for choosing the right solution for your company, though this one is tricky. What is a data warehouse? Data warehouses are designed to store structured, curated data, organizing datasets in tables and columns. This data is easily available to users for traditional business intelligence, dashboards, and reporting. Data warehouse architecture A three-tier architecture is the most commonly used approach to designing data warehouses. It comprises: Bottom tier: A staging area and the database server of the data warehouse that is used to load data from various sources. An extraction, transformation, and loading (ETL) process is a traditional approach to pushing data into the data warehouse Middle tier: A server for online analytical processing (OLAP) that reorganizes data into a multidimensional format for fast calculations Top tier: APIs and frontend tools for working with data Figure 1: Data Warehouse Reference Architecture There are three other vital components of a data warehouse that should be mentioned: the data mart, the operational data storage, and metadata. Data marts belong to the bottom tier. They store subsets of the data warehouse data, serving individual business lines. Operational data stores act as a repository providing a snapshot of the organization’s most current data for operational reporting based on simple queries. They may be used as an interim layer between the data sources and the data warehouse. There is also metadata — data describing the data warehouse data — which is stored in special-purpose repositories, also at the bottom layer. Data warehouse evolution and technologies Data warehouses have been around for a few decades. Traditionally, data warehouses were hosted on premises, meaning companies had to purchase all hardware and deploy software locally, either paid or open-source systems. They also needed a whole IT team to maintain the data warehouse. On the bright side, traditional data warehouses were bringing in (and still do so today) a fast time-to-insight with no latency issues, total control of data together with one hundred percent privacy, and minimized security risk. With cloud ubiquity, many organizations now choose to migrate to cloud data warehouse solutions where all data is stored in a cloud. It is analyzed in a cloud, too, using some type of an integrated query engine. There are a variety of established cloud data warehouse solutions in the market. Each provider offers its unique set of warehouse capabilities and different pricing models. For example, Amazon Redshift is organized as a traditional data warehouse. Snowflake is similarly. Microsoft Azure is an SQL data warehouse, while Google BigQuery is based on a serverless architecture offering in essence software-as-a-service (SaaS), rather than infrastructure or platform-as-a-service like, for instance, Amazon Redshift. Among well-known on-premises data warehouse solutions are IBM Db2, Oracle Autonomous Database, IBM Netezza, Teradata Vantage, SAP HANA, and Exasol. They are also available on the cloud. Cloud-based data warehouses are obviously cheaper because there is no need to buy or roll out physical servers. Users pay only for the storage space and computing power as needed. Cloud solutions are also much easier to scale or integrate with other services. Serving highly specific business needs with top data quality and fast insights, data warehouses are here to stay for long. Data warehouse use cases Data warehouses deliver high-speed and high-performance analytics on petabytes and petabytes of historical data. They are fundamentally designed for BI-type queries. A data warehouse might give an answer about, for instance, sales in a particular time period, grouped by region or division, and year-on-year movements in sales. Key use cases for data warehouses are: Transactional reporting to deliver a picture of business performance Ad-hoc analysis/reporting to provide answers to standalone and “one-off” business challenges Data mining to extract useful knowledge and hidden patterns from data to solve complex real-world problems Dynamic presentation through data visualization Drilling down to go through hierarchical dimensions of data for details Having structured business data in one easily accessible location outside operational databases is pretty much important to any data mature company. However, traditional data warehouses don’t support big data technology. They are also updated in batch, with records from all sources processed periodically in one go, which means that the data can become stale by the time it’s rolled up for analytics. The data lake seems to resolve these constraints. With a tradeoff. Let’s explore. What is a data lake? Data lakes mostly collect unrefined raw data in its original form. Another key difference between the data lake and the data warehouse is that data lakes store this data without arranging it into any logical relationships that are called schemas. However, this is how they enable more sophisticated analytics. Data lakes pull in (i) transactional data from business applications such as ERP, CRM, or SCM, (ii) documents in .csv and .txt formats, (iii) semi-structured data such as XML, JSON, and AVRO formats, (iv) device logs and IoT sensors, and (v) images, audio, binary, PDF files. Data lake architecture Data lakes use a flat architecture for data storage. Its key components are: Bronze zone for all data ingested into the lake. Data is stored either as-is for batch patterns or as aggregated datasets for streaming workloads Silver zone where data is filtered and enriched for exploration according to business needs Gold zone where curated, well-structured data is stored for applying BI tools and ML algorithms. This zone often features an operational data store that feeds traditional data warehouses and data marts Sandbox where data can be experimented with for hypothesis validation and tests. It is implemented either as a completely separate database for Hadoop or other NoSQL technologies or as a part of the gold zone. Figure 2: Data Lake Reference Architecture Data lakes don’t inherently contain analytics capabilities. Without them, they just store raw data that is not useful in its own right. So, organizations build data warehouses or leverage other tools on top of data lakes to put data to use. To make sure a data lake doesn’t turn into a data swamp, it is important to have an efficient data management strategy to include built-in data governance and metadata management in data lake design. In an ideal world, data sitting in a data lake should be cataloged, indexed, validated, and easily available to data users. This is rarely a case though and many data lake projects fail. This can be avoided: regardless of the maturity of a data team, it is critical to install at least essential controls to enforce data validation and quality. Data lake evolution and technologies The rise of big data in the early 2000s has brought both grand opportunities and grand challenges for organizations. Business needed new technology to analyze these massive, messy, and ridiculously fast-growing datasets to capture a business impact from the big data. In 2008, Apache Hadoop came up with an innovative open-source technology for collecting and processing unstructured data on a massive scale, paving the way for big data analytics and data lakes. Shortly after, Apache Spark emerged. It was easier to use. In addition, it provided capabilities for building and training ML models, querying structured data using SQL, and processing real-time data. Today data lakes are predominately cloud-hosted repositories. All top cloud providers such as AWS, Azure, and Google offer cloud-based data lakes with cost-effective object storage services. Their platforms come with various data management services to automate deployment. In one scenario, for instance, a data lake might consist of a data storage system like the Hadoop Distributed File System (HDFS) or Amazon S3 integrated with a cloud data warehouse solution like Amazon Redshift. These components would be decoupled from services in the ecosystem which might include Amazon EMR for data processing, Amazon Glue that provides the data catalog and transformation functionality, the Amazon Athena query service, or Amazon Elasticsearch Service that is used to build a metadata repository and index data. Local data lakes are still common because of usual cloud concerns like security, privacy, or latency. There are also on-premise storage vendors that offer some products for data lakes, but their data lake offerings, however, are not well-defined. Unlike data warehouses, data lakes don’t have many years of real-world deployments behind them. There is still much criticism describing the data lake concept as blurry and ill-defined. Critics also argue that few people in any organization have the skills (or enthusiasm for that matter) to run exploratory workloads against raw data. The idea that data lakes should be used as a central repository for all enterprises’ data needs to be approached with caution, they say. There has also been a provocative talk that data lake days are numbered. The following reasons are cited: Data lakes can’t scale compute resources efficiently on demand (well, this is because they are not intended by design in the first place) Data lakes carry a big technology debt, with their creation primarily driven by marketing hype, rather than technical reasons (the same has happened with many data warehouses too) With the rise of cloud data warehouse solutions, data lakes don’t any longer offer significant cost benefits (the cost issue is not that much straightforward as it’s hard to forecast computing costs) Such criticism is an inherent part of any younger technology. However, data lakes do have clear use cases like streaming analytics. And just yet, they don’t threaten data warehouses. At some point, data lakes even triumphed over data warehouses, offering wider analytics capabilities, cost-effectiveness, and flexibility in terms of data stored. However, as data warehouse technologies have matured, many agree there is no obvious winner now. It is generally advisable to maintain them both or… go for a hybrid architecture. Read on. Data lake use cases The main idea about data lakes is to give business access to all available data from all sources as quickly as possible. Data lakes do not just give a picture of what happened yesterday. Storing massive amounts of data, data lakes are designed to enable organizations to learn more about both the present (using streaming analytics) and the future (using big data solutions, including predictive analytics and machine learning). Key use cases for data lakes are: Feeding an enterprise data warehouse with datasets Performing stream analytics Implementing ML projects Building advanced analytics charts using long-established enterprise BI tools like Tableau or MS Power BI Building custom data analytics solutions Running root cause analysis that allows data teams to trace problems to their roots With strong data engineering skills to move raw data into an analytics environment, data lakes can be extremely relevant. They allow teams to experiment with data to understand how it can be useful. This might involve building models to dig through data and try out different schemas to view the data in new ways. Data lakes also allow wrangling with stream data that is pouring in from web logs and IoT sensors and is not suited for a traditional data warehouse approach. In short, data lakes enable organizations to unearth patterns, anticipate changes, or find potential business opportunities around new products or current processes. Used for different business needs, data lakes and data warehouses are often implemented in tandem. Before we move to the next data storage concept, let’s quickly recap the key differences between the data warehouse and the data lake. Data warehouse vs. data lake What about a new hybrid architecture, data lakehouses? Marketing aside, the key idea about a data lakehouse is to bring computing power to a data lake. Architecturally, the data lakehouse usually consists of: Storage layer to store data in open formats (e.g., Parquet). This layer can be called a data lake, and it is separated from the computing layer Computing layer that gives the organization warehouse capabilities, supporting metadata management, indexing, schema enforcement, and ACID (Atomicity, Consistency, Reliability, and Durability) transactions APIs layer to access data assets Serving layer to support various workloads, from reporting to BI, data science, or machine learning. Figure 3: Data Lakehouse Reference Architecture Touted as a solution marrying the best of both worlds, the data lakehouse addresses both: Data warehouse constraints, including lack of support of advanced data analytics that relies both on structured and unstructured data and significant scaling costs with traditional data warehouses that do not separate storage from computing resources Data lake challenges, including data duplication, data quality, and the need to access multiple systems for various tasks or implement complex integrations with analytics tools The data lakehouse is a new advancement in the data analytics scene. The concept was first used in 2017 in relation to the Snowflake platform. In 2019, AWS used the data lakehouse term to describe its Amazon Redshift Spectrum service that allows users of its data warehouse service Amazon Redshift to search through data stored in Amazon S3. In 2020, the data lakehouse term came into widespread usage, with Databricks adopting it for its Delta Lake platform. The data lakehouse might have a bright future ahead as companies across industries are adopting AI to improve service operations, offer innovative products and services, or drive marketing success. Structured data from operational systems delivered by data warehouses is ill-suited for smart analytics, while data lakes are just not designed for robust governance practices, security, or ACID compliance. Data lake vs. data lakehouse So data warehouse vs. data lake vs. data lakehouse: which to choose Whether you want to build a data storage solution from scratch or modernize your legacy system to support ML or improve performance, the right answer won’t be easy. There’s still a lot of mess about key differences, benefits, and costs, with offerings and pricing models from vendors rapidly evolving. Besides, it’s always a difficult project even if you have stakeholders’ buy-in. However, there are some key considerations when choosing the data warehouse vs. data lake vs. data lakehouse. The primary question you should answer is: WHY. A good point here to remember is that key differences between data warehouse, lakes, and lakehouses do not lie in technology. They are about serving different business needs. So why do you need a data storage solution in the first place? Is it for regular reporting, business intelligence, real-time analytics, data science, or other sophisticated analysis? Is data consistency or timeliness more important for your business needs? Spend some time developing use cases. Your analytics needs should be well defined. You should deeply understand your users and skillsets too. A few rules of thumbs are: A data warehouse is a good bet if you have exact questions and know what analytics results you want to get regularly. If you are in a highly regulated industry like healthcare or insurance, you might need to comply with extensive reporting regulations above all. So, a data warehouse will be a better choice. If your KPIs and reporting requirements can be addressed with simple historical analysis, a data lake or a hybrid solution will be an overkill. Go with a data warehouse instead. If your data team is after experimental and exploratory analysis, choose a data lake or a hybrid solution. However, you’ll need strong data analytics skills to work with unstructured data. If you are a data mature organization that wants to leverage machine learning technology, a hybrid solution or data lake will be a natural fit. Consider also your budget and time constraints. Data lakes are surely faster to build than data warehouses, and probably cheaper. You might want to implement your initiative incrementally and add capabilities as you scale up. If you want to modernize your legacy data storage system, then again, you should ask WHY you need this. Is it too slow? Or doesn’t it allow you to run queries on bigger data sets? Is some data missing? Do you want to pull out a different type of analytics? Your organization has spent a lot of money on the legacy system, so you definitely need a strong business case to ditch it. Tie it to an ROI too. Data storage architectures are still maturing. It is impossible to say for sure how they will evolve. However, no matter which path you’ll take, it is useful to recognize common pitfalls and make the most of the technology that is already here. We hope this article has cleared up some confusion about data warehouses vs. data lakes vs. data lakehouses. If you still have questions or need top tech skills or advice to build your data storage solution, drop ITRex a line. They will help you. Originally published at https://itrexgroup.com on February 23, 2022.