Topics
Data Warehouses and Lakes

Data Warehouses and Lakes

Databases are typically used by applications to store several types of data that are used by the users of the application. Applications need to frequently read and write data from and to a database. For example, an online shopping application would store data about products and allow users to browse through products. It allows them to add products to a shopping cart and finally complete a purchase transaction, storing all these transactions also in the database.

While the user is shopping using the application the access to the data needs to be fast, to provide a better user experience.

Most businesses need to perform a lot of analysis of customer, product and transaction data to determine their product and marketing strategy. They would like to know which customers (by location, age, gender and other demographics) are buying what products. To get this information, they need to analyze historic transaction data to determine patterns and trends. If analysis algorithms were to run on the same database that the application is using, it could add a lot of load on the database server, since analytics queries are complex. This could slow down the application performance and spoil the user experience.

One of the solutions to this challenge is to create a copy of historic transaction data, to be used solely for analysis. Such a copy of the database is known as a Data Warehouse. Data from a data warehouse is only usually read-only and is rarely modified since it must be an accurate copy of the transaction history. All modifications to the data must be made in the primary application database and the data warehouse may be refreshed periodically with the updated transaction data.

The primary application database is referred to as the Online Transaction Processing (OLTP) database. OLTP databases are usually highly optimized for performance, whereas data warehouses may not need to be as highly optimized since performance is important but less so than for an end-user application. Also, OLTP data is usually normalised while data in a data warehouse is often denormalized for faster retrieval (since data insertion or modification is not expected).

Data Warehouses usually store historic data from relational databases which hold structured data, typically used transactional systems, operational databases, and line of business applications. The term Data Lakes is used to refer to databases that store historic data from NoSQL or relatively unstructured OLTP databases such as those used in IoT systems and social media applications. The purpose of Data Warehouses and Data Lakes is the same, to serve as an input for data analysis.