Thursday, July 4, 2013

THE STAGING AREA

Whenever, we discuss/read about Data Warehousing, one term that we commonly encounter is:  THE STAGING AREA

A staging area is an intermediate storage area between the sources of information and the Data Warehouse (or Datamart).— WIKI
 
                                Diagrammatic representation to ease the understanding:

                                   
On the first look it always seems – Why do we need the staging area? Why can’t we simply load the data from the source system to DWH directly???

Though there are several reasons behind this, let us look into some major ones:

Scenario A :

Consider a workflow, where a DWH integrates data from various transaction systems.

We know that the data has to undergo several transformation/processing while loading into a Data Warehouse;

In the absence of a staging area, the data load will have to go from the OLTP system to the OLAP system directly.

This can severely hamper the performance of the OLTP system as the complex transformative queries will use the Source system resources for a long time which in turn can cascade in worrying the End users of that Transaction system.
                                      

                                                       Pulling directly from an active Transaction
 
 
                                                            Source With the advent of Staging Area

Scenario B:

In an enterprise scenario, we know that warehousing the data involves integrating multiple source systems.

The challenge becomes much more steeper when we have a business requirement of joining the data across systems.

Though this can be achieved using the ETLs to some extent, it becomes a walk-in-the-park if we know that all the data is consolidated and stored in an common database.

The flexibility is especially high when we have Flat-file Sources in the scheme of things.

                                                                          

Scenario C:

Timing reasons

Due to varying business cycles, data processing cycles, hardware and network resource limitations and geographical factors, it is not feasible to extract all the data from all Operational databases at exactly the same time.  For example:

a.       It might be reasonable to extract sales data on a daily basis, however, daily extracts might not be suitable for financial data that requires a month-end reconciliation process. 

b.      Or say, we need to extract data from a database in Tokyo and integrate the contents coming from a DB in Toronto. The extraction time will be different for both the databases to make sure we extract the latest/correct data. Staging proves to be the ideal foil as a Temporary Storage area for the Tokyo data.

Summary and Additional info:
  • To gather data from different sources that will be ready to process at different times.
  • To quickly load information from the operational database, freeing it up as soon as possible. All the needed transformations can then occur without interfering with the operation.
  • Because staging is generally a direct load from Source, in some scenarios, Staging data comes in handy to analyse Source data issues(in cases where we fear that source may have refreshed).
  • Since staging area might not be persistent i.e. , once data is used for the further process it can be deleted.

No comments:

Post a Comment