DataWarehousing Concepts

What is Datawarehouse?

A data warehouse is a electronic storage of an Organization's historical data for the purpose of Data Analytics, such as reporting, analysis and other knowledge discovery activities.

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
Notes : 
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.  



Non-volatile: It means that the data once loaded in the warehouse will not get deleted later.
 Time-variant means the data will change with respect to time.
Other than Data Analytics, a data warehouse can also be used for the purpose of data integration, master data management etc.

What is meant by Data Analytics?

Data analytics (DA) is the science of examining raw data with the purpose of drawing conclusions about that information. A data warehouse is often built to enable Data Analytics.

What is a Dimension?

A dimension is something that qualifies a quantity (measure).
For an example, consider this: If I just say… “20kg”, it does not mean anything. But if I say, "20kg of Rice (Product) is sold to a customer on 5th April (date)", then that gives a meaningful sense. These product, customer and dates are some dimension that qualified the measure - 20kg.

Types of Dimensions commonly used in data warehousing 
  • Conformed Dimensions
  • Junk Dimensions
  • Role-playing Dimensions
  • Slowly Changing Dimensions
  • Degenerated Dimensions

What is Conformed Dimension?

A conformed dimension is a dimension that has exactly the same meaning and content when being referred from different fact tables. A conformed dimension can refer to multiple tables in multiple data marts within the same organization. For two dimension tables to be considered as conformed, they must either be identical or one must be a subset of another. There cannot be any other type of difference between the two tables. For example, two dimension tables that are exactly the same except for the primary key are not considered conformed dimensions.
Why is conformed dimension important? This goes back to the definition of data warehouse being "integrated." Integrated means that even if a particular entity had different meanings and different attributes in the source systems, there must be a single version of this entity once the data flows into the data warehouse. The common example would be Time Dimensions.
In other words,
A Dimension that is used in multiple locations is called conformed dimensions. A conformed dimension may be used with multiple fact tables in single database, or across multiple data marts or Data warehouses.

I.e. Above shown Customer and Product Dimensions are Conformed Dimensions as they are connected to Shipment Fact table, Sales Order Fact table, and Service Request Fact table.

What is Junk Dimension?

Junk dimension is the way to solve the problem which comes up whenever we have multiple indicators and we are storing in the fact table or you have separate dimension tables and its increasing tremendously, leading to possible performance and management issues.. In a junk dimension, we combine these indicator fields into a single dimension. This way, we'll only need to build a single dimension table, and the number of fields in the fact table, as well as the size of the fact table, can be decreased. The content in the junk dimension table is the combination of all possible values of the individual indicator fields. 

I.e.:  Assume that we have a gender dimension and marital status dimension. In the fact table we need to maintain two keys referring to these dimensions. Instead of that create a junk dimension which has all the combinations of gender and marital status (cross join gender and marital status table and create a junk table). Now we can maintain only one key in the fact table.


What is Role-playing Dimensions?

A Role Playing Dimensions are the Dimensions which often used for multiple purposes within same database.Here same dimension key is associated with more than one foreign key in the fact table in the database for the different purposes.
I.e.:  In Date dimensions, [FullDateAlternateKey] is associated with [Orderdate key], [Duedate key], and [Shipdate] key in the fact table to solve different purpose in Data warehouse.

What is Slowly Changing Dimensions?

This is widely used Dimensions type. It is the dimensions where attribute values changes with time. There are various types(such as SCD 1,SCD2,SCD3) of Slowly Changing Dimensions (SCD) based on how business manages this dimensions.

What is Degenerated Dimensions?

A degenerate dimension is a dimension which is derived from the fact table and doesn’t have its own dimension table.
In Data warehouse this Dimension often used to show drill through capability where in the report you can see how aggregated number came up.

I.e. Invoice no can be stored in the fact table and then used as separate dimensions for the drill through purpose to find out what invoices are part of total buying cost in report.

What is fact?

A fact is something that is quantifiable (Or measurable). Facts are typically (but not always) numerical values that can be aggregated.

What is Additive,Semi-Aditive and Non-aditive measures?

Additive Measures

Additive measures can be used with any aggregation function like Sum(), Avg() etc. Example is Sales Quantity etc.
Semi Additive Measures

Semi-additive measures are those where only a subset of aggregation function can be applied. Let’s say account balance. A sum() function on balance does not give a useful result but max() or min() balance might be useful. Consider price rate or currency rate. Sum is meaningless on rate; however, average function might be useful.

Non-additive Measures

Non-additive measures are those which can not be used inside any numeric aggregation function (e.g. SUM(), AVG() etc.). One example of non-additive fact is any kind of ratio or percentage. Example, 5% profit margin, revenue to asset ratio etc. A non-numerical data can also be a non-additive measure when that data is stored in fact tables, e.g. some kind of varchar flags in the fact table.

Comments