“A staging area is an intermediate storage area between the sources of information and the
Data Warehouse (or Datamart).—
WIKI
Diagrammatic representation to ease the understanding:
On the
first look it always seems – Why do we need the staging area? Why can’t we
simply load the data from the source system to DWH directly???
Though
there are several reasons behind this, let us look into some major ones:
Scenario A :
Consider a workflow, where a DWH integrates data from
various transaction systems.
We know that the data has to undergo several
transformation/processing while loading into a Data Warehouse;
In the absence of a staging area, the data load will
have to go from the OLTP system to the OLAP system directly.
This can severely hamper the performance of the OLTP
system as the complex transformative queries will use the Source system
resources for a long time which in turn can cascade in worrying the End users
of that Transaction system.
Pulling directly from an active Transaction
Source With the advent of Staging Area
Scenario
B:
In an enterprise scenario, we know that warehousing
the data involves integrating multiple source systems.
The challenge becomes much more steeper when we have a
business requirement of joining the data across systems.
Though this can be achieved using the ETLs to some
extent, it becomes a walk-in-the-park if we know that all the data is
consolidated and stored in an common database.
The flexibility is especially high when we have
Flat-file Sources in the scheme of things.
Scenario
C:
Timing reasons
Due to varying business cycles, data processing cycles, hardware and
network resource limitations and geographical factors, it is not feasible to
extract all the data from all Operational databases at exactly the same
time. For example:
a. It might be reasonable
to extract sales data on a daily basis, however, daily extracts might not be
suitable for financial data that requires a month-end reconciliation
process.
b. Or say, we need to
extract data from a database in Tokyo and integrate the contents coming from a
DB in Toronto. The extraction time will be different for both the databases to
make sure we extract the latest/correct data. Staging proves to be the ideal
foil as a Temporary Storage area for the Tokyo data.
Summary
and Additional info:
- To gather data from different
sources that will be ready to process at different times.
- To quickly load information from
the operational database, freeing it up as soon as possible. All the
needed transformations can then occur without interfering with the
operation.
- Because staging is generally a
direct load from Source, in some scenarios, Staging data comes in handy to
analyse Source data issues(in cases where we fear that source may have
refreshed).
- Since staging area might not be
persistent i.e. , once data is used for the further process it can be
deleted.
No comments:
Post a Comment