Wednesday, January 11, 2012

Fundamentals of Data Warehousing

Data Warehousing (DWH)
Data warehousing is the broad field that covers the processes and methodologies involved in creating, populating, querying and maintaining a data warehouse. The objective of DWH is to maintain historical data and help users make better business decisions with that data.
Data Warehouse:
As per Bill Inmon, "A warehouse is a historical, subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".
By Historical we mean that data is continuously collected from the sources and loaded into the warehouse, and previously loaded data is not deleted for a long period of time. This builds up historical data in the warehouse.
By Subject Oriented we mean that data is grouped by a particular business area rather than the business as a whole.
By Integrated we mean that data is collected and merged from various sources, which may be disparate in nature.
By Time-variant we mean that all data in the data warehouse is identified with a particular time period.
By Non-volatile we mean that data loaded into the warehouse reflects business transactions that have already occurred, so it is not expected to change over time.
Enterprise Data Warehouse (EDW):
An Enterprise Data Warehouse (EDW) is a central, normalized repository containing data for multiple subject areas of an organization. An EDW is generally used to feed subject-area-specific data marts.

We should consider building an EDW when
1. We have clarity on the requirements of multiple business processes.
2. We want a centralized architecture that caters to a single version of the truth.
Data Mart:
When we restrict the scope of a data warehouse to a subset of business processes or a single subject area, we call it a Data Mart. Data marts can be stand-alone entities, be built from a warehouse, or be used to build a warehouse.
1. Stand-Alone Data Marts – These are single-subject-area data warehouses. Standalone data marts can proliferate into many stove-pipe/silo data marts over time.
2. Data Marts fed from an Enterprise Data Warehouse (Top-Down) – This is the Top-Down strategy suggested by Bill Inmon. Here we first build an enterprise data warehouse (EDW) hosting data for multiple subject areas, and from the EDW we build subject-area-specific data marts.
3. Data Marts building a Data Warehouse (Bottom-Up) – This is the Bottom-Up strategy suggested by Ralph Kimball. Here we build subject-area-specific data marts first, keeping in mind the shared business attributes (called conformed dimensions). Using the conformed dimensions we then combine the marts into a warehouse.
Online Transaction Processing Systems (OLTP):
Online Transaction Processing (OLTP) systems are transactional systems that perform the day-to-day business operations; in other words, they automate the day-to-day business processes. In the world of DWH, they are usually one of the types of source system from which the day-to-day business transactional data is extracted. OLTP systems contain the current-valued business data, which is updated as business transactions occur.

Legacy Systems:

Legacy system is the term given to an older, long-established system with very large storage capabilities. Mainframes and the AS/400 are two examples of legacy systems. Applications running on legacy systems are generally OLTP systems. Due to their large storage capabilities, the business transactional data from legacy applications and other OLTP systems is stored on them and maintained for long periods of time. In the world of DWH, legacy systems are often used to extract past historical business data.

Data Profiling:
This is a very important activity, mostly performed during requirements gathering and system study. Data profiling is a process that familiarizes you with the data you will be loading into the warehouse/mart.
In its most basic form, the data profiling process reads the source data and reports on:
Type of data – whether the data is numeric, text, date, etc.
Data statistics – minimum and maximum values, most frequently occurring value.
Data anomalies – junk values, incorrect business codes, invalid characters, values outside a given range.
Using data profiling we can validate business requirements against the business codes and attributes present in the sources, and define exception handling for cases where we receive incorrect or inconsistent data.
Many data profiling tools are available in the market; some examples are Trillium Software and IBM WebSphere Data Integrator's ProfileStage.
In its simplest form, we can even hand-code a project-specific data profiling application. Programming languages like Perl, Java, PL/SQL and shell scripting can be very handy for creating these applications; a minimal sketch follows.
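For illustration, a hand-coded profiler could look like the following Python sketch. The file name customers.csv and its columns are assumptions made for the example, not part of any particular project:

    import csv
    from collections import Counter

    def profile_column(values):
        """Report basic type, statistics and anomaly indicators for one column."""
        non_empty = [v for v in values if v.strip() != ""]
        numeric = [float(v) for v in non_empty
                   if v.lstrip("-").replace(".", "", 1).isdigit()]
        counts = Counter(non_empty)
        return {
            "rows": len(values),
            "nulls": len(values) - len(non_empty),          # possible junk/missing values
            "numeric_ratio": round(len(numeric) / len(non_empty), 2) if non_empty else 0,
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
            "most_common": counts.most_common(1)[0] if counts else None,
        }

    # "customers.csv" is a hypothetical source extract used only for illustration
    with open("customers.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    for column in rows[0]:
        print(column, profile_column([r[column] for r in rows]))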
Data Modeling:
This activity is performed once we have understood the requirements for the data warehouse/mart. Data modeling refers to creating the structure of, and relationships between, the data elements that implement the business rules defined in the requirements.
We create the conceptual model, followed by the logical model, followed by the physical model, with the level of detail increasing as we proceed toward the physical model.
Conceptual Model:
The conceptual model identifies, at a high level, the subject areas involved. Its objective is to define the scope of the project in terms of those subject areas.
Logical Model:
The logical model details the subject areas identified in the conceptual model and the relationships between them. It identifies the entities within each subject area and documents in detail each entity's definition, attributes, candidate keys, data types and relationships with other entities.
Physical Model:
The physical model identifies the table structures, the columns within the tables, the primary keys, the data types and sizes of the columns, the constraints on the tables and the privileges on the tables. Using the physical model we generate the DDL through which the database objects are created; a small sketch follows.
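As a minimal illustration (not tied to any specific project), the physical model for a hypothetical customer dimension might translate into DDL like the following, shown here run against SQLite from Python purely to keep the sketch self-contained:

    import sqlite3

    conn = sqlite3.connect(":memory:")   # throwaway database, for illustration only
    conn.execute("""
        CREATE TABLE dim_customer (
            customer_key   INTEGER PRIMARY KEY,      -- surrogate key
            customer_code  TEXT NOT NULL UNIQUE,     -- business (natural) key
            customer_name  TEXT NOT NULL,
            gender         TEXT CHECK (gender IN ('m', 'f')),
            effective_date TEXT NOT NULL             -- supports time-variant history
        )
    """)
    conn.commit()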
Normalization:
This is a data modeling technique that emphasizes avoiding storing the same data element in multiple places. We follow the three rules of normalization, called First Normal Form, Second Normal Form and Third Normal Form, to achieve a normalized data model.
A normalized data model may result in many tables/entities with multiple levels of relationships, for example table1 related to table2, table2 further related to table3, table3 related to table4, and so on.
First Normal Form (1NF) – every attribute of the entity must be atomic and must depend on the key.
Second Normal Form (2NF) – every non-key attribute must depend on the whole key, not just part of it (no partial dependencies).
Third Normal Form (3NF) – every non-key attribute must depend on nothing but the key (no transitive dependencies).
In theory there are further rules called Boyce-Codd Normal Form, Fourth Normal Form and Fifth Normal Form, but in practice we rarely use the rules beyond 3NF. A small worked example follows.
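To make this concrete, here is a minimal Python sketch with hypothetical order data: in the flat layout the customer's attributes repeat on every order row, while the normalized layout stores them once and reaches them through the key:

    # Un-normalized: customer details repeated on every order row (hypothetical data)
    orders_flat = [
        {"order_id": 1, "customer_id": 10, "customer_name": "Acme", "customer_city": "Pune", "amount": 250},
        {"order_id": 2, "customer_id": 10, "customer_name": "Acme", "customer_city": "Pune", "amount": 125},
    ]

    # Normalized: customer attributes stored once, orders reference the key
    customers = {10: {"customer_name": "Acme", "customer_city": "Pune"}}
    orders = [
        {"order_id": 1, "customer_id": 10, "amount": 250},
        {"order_id": 2, "customer_id": 10, "amount": 125},
    ]

    # Reconstructing the flat view is simply a join on customer_id
    joined = [{**o, **customers[o["customer_id"]]} for o in orders]
    print(joined)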
Dimensional Modeling:
This is the form of data modeling used in data warehousing where we store business-related attributes in tables called Dimension Tables and business-related metrics/measures in tables called Fact Tables. The business metrics are associated with the business attributes by relating the dimension tables to the fact table. The dimension tables are not related to one another.
A fact table can be related to many dimension tables and hence forms the center of the structure when graphically represented. This structure is also called the star schema.
Note that here we do not emphasize normalizing the data; instead we keep fewer levels of relationships. A small sketch follows.
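A minimal star-schema sketch using pandas, with hypothetical product and sales data (the table and column names are assumptions for illustration only):

    import pandas as pd

    # Dimension table: descriptive business attributes
    dim_product = pd.DataFrame({
        "product_key": [1, 2],
        "product_name": ["Widget", "Gadget"],
        "category": ["Hardware", "Hardware"],
    })

    # Fact table: numeric measures keyed to the dimension
    fact_sales = pd.DataFrame({
        "product_key": [1, 1, 2],
        "date_key": [20120110, 20120111, 20120111],
        "sales_amount": [100.0, 150.0, 80.0],
    })

    # Joining the fact to its dimension answers "sales by category"
    report = (fact_sales.merge(dim_product, on="product_key")
                        .groupby("category")["sales_amount"].sum())
    print(report)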
Extract Transform Load (ETL):
These are the back-end processes involved in collecting data from various sources, preparing the data as per the requirements and loading it into the warehouse/mart.
Extraction – This is the process where we select the data from the various source systems and transport it to an ETL server. Source systems can be one or many in number, and similar or completely disparate in nature.
Cleansing – Once the data has been extracted, we need to check that it is valid and ensure it is consistent when it comes from disparate sources. We check for valid values by applying ETL data validation rules; the rules may check for correct data formats (numeric, date, text), data within a given range, correct business codes and junk values. We ensure consistency of data collected from disparate sources by making sure that the same business attribute has the same representation everywhere. For example, the gender of a customer may be represented as “m” and “f” in one source system and as “male” and “female” in another; in this case we convert “male” to “m” and “female” to “f” so that gender from the second source system is consistent with the first. A small sketch of such a rule follows.
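As an illustration, a standardization rule like the gender example above could be hand-coded as follows (the column names and data are hypothetical):

    # Map the disparate gender codes from both source systems to one representation
    GENDER_MAP = {"m": "m", "male": "m", "f": "f", "female": "f"}

    def standardize_gender(value):
        """Return 'm' or 'f' for known codes, or None so the row can be routed to an exception file."""
        return GENDER_MAP.get(str(value).strip().lower())

    rows = [{"customer_id": 1, "gender": "Male"}, {"customer_id": 2, "gender": "f"}]
    cleansed = [{**r, "gender": standardize_gender(r["gender"])} for r in rows]
    print(cleansed)   # [{'customer_id': 1, 'gender': 'm'}, {'customer_id': 2, 'gender': 'f'}]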
Transformation – This is the process where we apply the business rules to the source data before loading it into the warehouse/mart. Transformation, as the term suggests, converts the data based on the business rules, and also adds further business context to the data. The types of transformations we generally use in ETL are sorting data, merging data, comparing data, generating running numbers, aggregating data, changing data types, etc.; a small sketch follows.
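A minimal sketch of a few common transformations (aggregate, sort, generate running numbers) using pandas on hypothetical sales rows:

    import pandas as pd

    sales = pd.DataFrame({
        "store": ["S1", "S1", "S2"],
        "sale_date": ["2012-01-10", "2012-01-10", "2012-01-10"],
        "amount": [100.0, 40.0, 75.0],
    })

    daily = (sales.groupby(["store", "sale_date"], as_index=False)["amount"]
                  .sum()                          # aggregate data
                  .sort_values("store"))          # sort data
    daily["row_key"] = range(1, len(daily) + 1)   # generate running numbers
    print(daily)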
Loading – Once the source data has been transformed by applying the business rules, we load it into the warehouse/mart. There are two types of loads we perform in data warehousing.
History Load / Bulk Load – This is the process of loading the historical data accumulated over a period of time into the warehouse. It is usually a one-time activity where we take the historical business data and bulk load it into the warehouse/mart.
Incremental Load – This follows the history load. Based on the availability of current source data, we load the new or changed data into the warehouse. This is an ongoing process performed periodically, based on the availability of source data (e.g. daily, weekly, monthly).
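A minimal sketch of the incremental idea, assuming the source rows carry a last-updated timestamp and the time of the previous load is tracked in an ETL control table (both hypothetical here):

    from datetime import datetime

    # Normally read from an ETL control table; hard-coded here for illustration
    last_load_time = datetime(2012, 1, 10, 0, 0, 0)

    source_rows = [
        {"order_id": 1, "updated_at": datetime(2012, 1, 9, 15, 0)},
        {"order_id": 2, "updated_at": datetime(2012, 1, 10, 9, 30)},
    ]

    # Only rows changed since the last run are picked up in this cycle
    delta = [r for r in source_rows if r["updated_at"] > last_load_time]
    print(delta)   # only order_id 2 is loaded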
Operational Data Store (ODS):
An Operational Data Store (ODS) is a repository layer that resides between the source systems and the data warehouse. Current, detailed data from disparate source systems is integrated and grouped by subject area in the ODS. The ODS is updated more frequently than a data warehouse; the updates happen shortly after the source system data changes. There are two objectives for an ODS:
1. Source the DWH with detailed, integrated data.
2. Provide the organization with operational integration.
Online Analytical Processing (OLAP):
Online Analytical Processing (OLAP) is the approach to providing analysis of multidimensional data. There are various forms of OLAP (a small pivot example follows this list), namely:
Multidimensional OLAP (MOLAP) – This uses a cube structure to store the business attributes and measures. Conceptually it can be thought of as a multi-dimensional graph with various axes. Each axis represents a category of the business called a dimension (e.g. location, product, time, employee). The data points are the business transactional values, called measures, and the coordinates of each data point relate it to the business dimensions. The cubes are generally stored in a proprietary format.
Relational OLAP (ROLAP) – This uses a database to store the business categories as dimension tables (containing textual business attributes and codes) and the business transactional values (numeric measures) in fact tables. The fact tables are connected to the dimension tables through foreign-key relationships.
Hybrid OLAP (HOLAP) – This uses a cube structure to store some business attributes and database tables to hold the others.
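As a small illustration of the slicing an OLAP tool performs, here is a hypothetical pivot of a sales measure across two dimensions using pandas; a ROLAP engine would produce an equivalent result with SQL against the fact and dimension tables:

    import pandas as pd

    # Hypothetical measures tagged with two dimensions: location and product
    sales = pd.DataFrame({
        "location": ["East", "East", "West", "West"],
        "product": ["Widget", "Gadget", "Widget", "Gadget"],
        "sales_amount": [100, 80, 60, 120],
    })

    # One axis per dimension, with the measure aggregated at each coordinate
    cube_slice = sales.pivot_table(index="location", columns="product",
                                   values="sales_amount", aggfunc="sum")
    print(cube_slice)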
Business Intelligence (BI):
Business Intelligence (BI) is a generic term used for the processes, methodologies and technologies used for collecting, analyzing and presenting business information.
The objective of BI is to help users make better business decisions. The objectives of DWH and BI go hand-in-hand: both aim to enable users to make better business decisions.
Data marts/warehouses provide easily accessible historical data to BI tools, which give end users forecasting/predictive analytical capabilities along with historical business patterns and trends.
BI tools generally come with reporting features like slice-and-dice, pivot-table analysis, visualization and statistical functions that make the decision-making process even more sophisticated.
Decision Support Systems (DSS):
A Decision Support System (DSS) is used for management decision-making and reporting. It is essentially an automated system that gathers business data and helps management make business decisions.

Data Mining:
Data mining, in the broad sense, deals with extracting relevant information from very large amounts of data and making intelligent decisions based on the information extracted.

Building blocks of Data Warehousing:
If you had to build a data warehouse, you would need the following components:
Source System(s) – A source system is generally a collection of OLTP systems and/or legacy systems that contain the day-to-day business data.
ETL Server – These are generally Symmetric Multi-Processing (SMP) or Massively Parallel Processing (MPP) machines.
ETL Tool – These are sophisticated GUI-based applications that perform the work of extracting data from the source systems, transforming and cleansing it, and loading it into databases. The ETL tool resides on the ETL server.
Database Server – Usually the database resides on the ETL server itself. For very large amounts of data we may want separate SMP or MPP machines, clustered to exploit the parallel processing capabilities of the RDBMS.
Landing Zone – Usually a file system, and sometimes a database, used to land the source data from the various source systems. The data in the landing zone is generally in its raw format.
Staging Area – Usually a file system and a database used to house the transformed and cleansed source data before it is loaded into the warehouse or mart. We generally avoid any further cleansing or transformation after data has been loaded into the staging area. The format of the data in the staging area is similar to that in the warehouse or mart.
Data Warehouse – A database housing the normalized model (top-down) or the conformed star schemas (bottom-up).
Reporting Server – These are generally Symmetric Multi-Processing or Massively Parallel Processing machines.
Reporting Tool – A sophisticated GUI-based application that enables users to view, create and schedule reports on the data warehouse.
Metadata Management Tool – Metadata is the description of the data stored in your system. In data warehousing, the reporting tool, the database and the ETL tool each have their own metadata. Looking at the overall picture, however, we need metadata defined all the way from the source systems through the landing zone, staging area, ETL, warehouse, mart and reports. An integrated metadata repository containing definitions from all of these layers is advisable.

