Databases are typically classified as relational (SQL) or NoSQL, and as transactional (OLTP), analytic (OLAP), or hybrid (HTAP). Departmental and special-purpose databases were initially considered a significant improvement in business practices, but were later derided as “islands.” Attempts to create an integrated database for all data across the enterprise are classified as data lakes if the data is left in its native format, and as data warehouses if the data is brought into a common format and schema. Subsets of a data warehouse are called data marts.
Data warehouse defined
Essentially, a data warehouse is an analytic database, usually relational, that is created from two or more data sources, typically to store historical data, which may reach petabyte scale. Data warehouses often have substantial compute and memory resources for running complex queries and generating reports. They are often the data sources for business intelligence (BI) systems and machine learning.
Why use a data warehouse?
One of the main motivations for using an enterprise data warehouse (EDW) is that an operational (OLTP) database limits the number and kinds of indexes you can create, because every extra index slows down writes, and that in turn slows down analytic queries. Once you have copied the data to the data warehouse, you can index everything that matters for analysis and get good analytic query performance without affecting the write performance of the OLTP database.
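As a minimal sketch of that idea, the snippet below uses Python's built-in sqlite3 module as a stand-in for a warehouse. The table, column, and index names are made up for illustration, and a real EDW would use its own engine and tuning tools. It adds an index to the warehouse copy only and asks the query planner whether an analytic query uses it.

```python
import sqlite3

# Hypothetical warehouse copy of a sales table (all names are illustrative).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, sold_on TEXT, amount REAL)"
)
warehouse.executemany(
    "INSERT INTO sales (region, sold_on, amount) VALUES (?, ?, ?)",
    [
        ("north", "2021-01-05", 120.0),
        ("south", "2021-01-05", 80.0),
        ("north", "2021-01-06", 45.5),
    ],
)

# The index exists only in the warehouse copy; the OLTP source is not touched.
warehouse.execute("CREATE INDEX idx_sales_region ON sales (region)")

query = "SELECT SUM(amount) FROM sales WHERE region = ?"

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = warehouse.execute("EXPLAIN QUERY PLAN " + query, ("north",)).fetchall()
plan_text = " ".join(row[3] for row in plan)

north_total = warehouse.execute(query, ("north",)).fetchone()[0]
print(plan_text)
print(north_total)
```

The exact wording of the plan differs between SQLite versions, but it should name idx_sales_region, showing that the filter is answered from the index rather than a full table scan.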
Another reason to use an enterprise data warehouse is that it lets you combine data from multiple sources for analysis. For example, a sales OLTP application probably doesn’t need to know the weather at the point of sale, but sales forecasts can take advantage of that data. If you add historical weather data to your data warehouse, it is easy to factor it into your models of historical sales data.
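The sales and weather example might look like the following sketch, again using sqlite3 as a stand-in. The two tables, their columns, and the sample values are assumptions made up for illustration, not a schema from any particular product.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE sales (store TEXT, sold_on TEXT, amount REAL);
CREATE TABLE weather (store TEXT, observed_on TEXT, high_temp_c REAL);
INSERT INTO sales VALUES ('A', '2021-07-01', 300.0), ('A', '2021-07-02', 120.0);
INSERT INTO weather VALUES ('A', '2021-07-01', 31.0), ('A', '2021-07-02', 18.5);
""")

# Join the two sources on store and date so each sale carries that day's weather.
rows = db.execute("""
    SELECT s.sold_on, s.amount, w.high_temp_c
    FROM sales AS s
    JOIN weather AS w
      ON w.store = s.store AND w.observed_on = s.sold_on
    ORDER BY s.sold_on
""").fetchall()

for sold_on, amount, high_temp_c in rows:
    print(sold_on, amount, high_temp_c)
```

Because both data sets sit in the same database, the join is a single query. Doing the same against a live OLTP system plus an external weather feed would need application code to line the two up.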
Data warehouse and data lake
A data lake, which stores files of data in their native format, is essentially “schema on read.” This means that any application that reads data from the lake must impose its own types and relationships on the data. A data warehouse, on the other hand, is “schema on write.” That is, data types, indexes, and relationships are imposed on the data when it is stored in the EDW.
“Schema on read” is suitable for data that may be used in several contexts. It poses little risk of losing data, although there is a risk that the data will never be used at all. (Qubole, a vendor of cloud data warehouse tools for data lakes, estimates that 90% of the data in most data lakes is inactive.) “Schema on write” is suitable for data that has a specific purpose and that must relate properly to data from other sources. The risk is that misformatted data will not convert properly to the desired data type and will be discarded during import.
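The difference can be sketched in a few lines of Python. The record layout, the function names, and the decision to skip bad values on read are all assumptions made for illustration.

```python
import json

# Raw records as they might sit in a data lake: everything is text.
RAW_LINES = [
    '{"sku": "A1", "qty": "3", "price": "9.50"}',
    '{"sku": "B2", "qty": "two", "price": "4.00"}',  # misformatted quantity
    '{"sku": "C3", "qty": "1", "price": "20.00"}',
]


def load_schema_on_write(lines):
    """Coerce types at load time; rows that fail conversion are discarded."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            table.append(
                {
                    "sku": record["sku"],
                    "qty": int(record["qty"]),
                    "price": float(record["price"]),
                }
            )
        except (KeyError, ValueError):
            rejected.append(line)
    return table, rejected


def total_quantity_schema_on_read(lines):
    """Keep the raw text; this reader imposes its own types when it queries."""
    total = 0
    for line in lines:
        record = json.loads(line)
        try:
            total += int(record["qty"])
        except (KeyError, ValueError):
            continue  # this reader skips the value; another could parse "two"
    return total


table, rejected = load_schema_on_write(RAW_LINES)
lake_total = total_quantity_schema_on_read(RAW_LINES)
print(len(table), len(rejected), lake_total)
```

With schema on write, the misformatted row is gone from the typed table after import. With schema on read, all three raw lines are still in the lake, and each reader decides what to do with the odd one.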
Data warehouse and data mart
A data warehouse contains data for the entire enterprise, while a data mart contains data for a specific line of business. A data mart can be dependent on the data warehouse, independent of the data warehouse (that is, drawn from an operational database or an external source), or a hybrid of the two.
Reasons for creating a data mart include using less space, returning query results faster, and costing less to run than a full data warehouse. Data marts often contain summarized and selected data instead of, or in addition to, the detailed data found in the data warehouse.
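A dependent data mart holding summarized, selected data could be derived roughly as follows. This is a sketch using sqlite3, and the sales_detail and retail_mart tables are hypothetical names chosen for the example.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
CREATE TABLE sales_detail (business_line TEXT, sold_on TEXT, amount REAL);
INSERT INTO sales_detail VALUES
  ('retail', '2021-03-01', 100.0),
  ('retail', '2021-03-01', 50.0),
  ('retail', '2021-03-02', 25.0),
  ('wholesale', '2021-03-01', 900.0);

-- The mart keeps one business line only, summarized to one row per day.
CREATE TABLE retail_mart AS
  SELECT sold_on, COUNT(*) AS orders, SUM(amount) AS revenue
  FROM sales_detail
  WHERE business_line = 'retail'
  GROUP BY sold_on;
""")

mart_rows = warehouse.execute(
    "SELECT sold_on, orders, revenue FROM retail_mart ORDER BY sold_on"
).fetchall()
print(mart_rows)
```

The mart has fewer rows and columns than the detail table, which is where the space, speed, and cost savings come from. In a real deployment the mart would usually live in a separate database or schema and be refreshed on a schedule.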
Data warehouse architecture
In general, a data warehouse has a layered architecture: source data, a staging database, ETL (extract, transform, and load) or ELT (extract, load, and transform) tools, the data storage proper, and data presentation tools. Each layer serves a different purpose.
Source data often includes operational databases from sales, marketing, and other parts of the business. It may also include social media and external data such as surveys and demographics.
The staging layer stores the data retrieved from the data source. If the source is unstructured, such as social media text, this is where the schema is imposed. This is also the place where quality checks are applied, deleting poor quality data and fixing common mistakes. The ETL tool pulls the data, performs the necessary mappings and transformations, and loads the data into the data storage layer.
ELT tools store the data first and transform it later. If you use ELT tools, you can also use a data lake and skip the traditional staging layer.
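The ordering difference between ETL and ELT can be sketched with sqlite3 standing in for the storage layer. The sample rows, the cleaning rules, and the table names are assumptions for illustration. Real tools do far more, but the point is where the transform runs.

```python
import sqlite3

# Hypothetical source rows: a messy customer name and an amount stored as text.
SOURCE_ROWS = [(" Alice ", "12.50"), ("BOB", "7"), ("carol", "n/a")]


def clean(name, amount_text):
    """Transform step: normalize the name and convert the amount; None rejects."""
    try:
        return name.strip().lower(), float(amount_text)
    except ValueError:
        return None


def run_etl(rows):
    """ETL: transform in the pipeline, then load only clean rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE payments (customer TEXT, amount REAL)")
    cleaned = [c for c in (clean(*row) for row in rows) if c is not None]
    db.executemany("INSERT INTO payments VALUES (?, ?)", cleaned)
    return db.execute(
        "SELECT customer, amount FROM payments ORDER BY customer"
    ).fetchall()


def run_elt(rows):
    """ELT: load the raw rows first, then transform inside the database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_payments (customer TEXT, amount TEXT)")
    db.executemany("INSERT INTO raw_payments VALUES (?, ?)", rows)
    db.execute("""
        CREATE TABLE payments AS
        SELECT lower(trim(customer)) AS customer,
               CAST(amount AS REAL) AS amount
        FROM raw_payments
        WHERE amount GLOB '[0-9]*'
    """)
    return db.execute(
        "SELECT customer, amount FROM payments ORDER BY customer"
    ).fetchall()


etl_result = run_etl(SOURCE_ROWS)
elt_result = run_elt(SOURCE_ROWS)
print(etl_result)
print(elt_result)
```

Both paths end with the same clean table here. The ELT path also keeps raw_payments, so the unconverted row is still available if the transformation rules change later.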
The data storage layer of a data warehouse contains cleansed, transformed data that is ready for analysis. It is often a row-oriented relational store, but it can also be column-oriented or have inverted-list indexes for full-text search. Data warehouses often have far more indexes than operational data stores, to speed up analytic queries.
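For readers unfamiliar with inverted-list indexes, here is a toy version in plain Python. The documents and function names are made up, and production full-text engines add tokenization, ranking, and compression that this sketch leaves out.

```python
from collections import defaultdict

# Hypothetical free-text notes keyed by a document id.
DOCUMENTS = {
    1: "late delivery refund requested",
    2: "refund issued for damaged item",
    3: "delivery arrived early",
}


def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, *terms):
    """Return the ids of documents that contain every given term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(DOCUMENTS)
print(search(index, "refund"))
print(search(index, "delivery", "refund"))
```

A lookup touches only the posting sets for the query terms instead of scanning every document, which is why this structure suits full-text search.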
Data presentation from a data warehouse is often done by running SQL queries, which may be constructed with the help of a GUI tool. The output of the SQL queries is used to create tables, charts, dashboards, reports, and forecasts, often with BI (business intelligence) tools.
Recently, data warehouses have begun supporting machine learning to improve the quality of models and forecasts. For example, Google BigQuery has added SQL statements that support linear regression models for forecasting and binary logistic regression models for classification. Some data warehouses also integrate with deep learning libraries and automated machine learning (AutoML) tools.
Cloud data warehouse and on-premises data warehouse
A data warehouse can be implemented on-premises, in the cloud, or as a hybrid. Historically, data warehouses were always on-premises, but the capital cost and lack of scalability of on-premises servers in data centers were sometimes a problem. EDW installations grew when vendors began offering data warehouse appliances. Now, however, the trend is to move all or part of the data warehouse to the cloud to take advantage of the inherent scalability of cloud EDW and the ease of connecting to other cloud services.
The downside of putting petabytes of data in the cloud is the operational cost, both for cloud data storage and for cloud data warehouse compute and memory resources. While the time to upload petabytes of data to the cloud may seem like a major barrier, hyperscale cloud vendors now offer high-capacity, disk-based data transfer services.
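To see why upload time looks like a barrier, a back-of-the-envelope calculation helps. The link speeds below are assumptions picked for illustration, a petabyte is taken as 10^15 bytes, and real transfers carry protocol overhead that the optional efficiency factor only roughly models.

```python
def upload_days(data_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to push data_bytes over a link at the given sustained rate."""
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15  # decimal petabyte, in bytes

days_1gbps = upload_days(PETABYTE, 10**9)    # dedicated 1 Gbit/s link
days_10gbps = upload_days(PETABYTE, 10**10)  # dedicated 10 Gbit/s link

print(f"1 PB at 1 Gbit/s:  about {days_1gbps:.0f} days")
print(f"1 PB at 10 Gbit/s: about {days_10gbps:.1f} days")
```

Under these assumptions a single petabyte takes roughly three months on a fully saturated 1 Gbit/s link, which is why shipping disks can be faster than the network.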
Top-down and bottom-up data warehouse design
There are two main ideas about how to design a data warehouse. The difference between the two is related to the direction of the data flow between the data warehouse and the data mart.
Top-down design (known as the Inmon approach) treats the data warehouse as the centralized data repository for the entire enterprise. Data marts are derived from the data warehouse.
Bottom-up design (known as the Kimball approach) treats data marts as primary and combines them into a data warehouse. By Kimball’s definition, a data warehouse is “a copy of transaction data specifically structured for query and analysis.”
Insurance and manufacturing applications of EDW tend to favor the Inmon top-down design approach. Marketing tends to favor the Kimball approach.
Data lake, data mart, or data warehouse?
Ultimately, all decisions related to an enterprise data warehouse come down to your company’s goals, resources, and budget. The first question is whether you need a data warehouse at all. The next task is to identify your data sources, their sizes, their current growth rates, and what you are currently doing to utilize and analyze them. You can then start experimenting with data lakes, data marts, and data warehouses to see what works for your organization.
We recommend a proof of concept that uses a small subset of the data, hosted on either your existing on-premises hardware or a small cloud installation. Once you’ve validated your design and demonstrated its benefits to your organization, you can scale up to a full-fledged installation with full management support.
Copyright © 2021 IDG Communications, Inc.