Tuesday, December 17, 2013

DWBI - A Business Perspective - Day 5


Questions:

1. Name four key issues to be considered while planning for a data warehouse.
2. Explain the difference between the top-down and bottom-up approaches for building data
    warehouses. Do you have a preference? If so, why?
3. List three advantages each for single-vendor and multi-vendor solutions.
4. What is meant by a preliminary survey of requirements? List six types of information you will
    gather during a preliminary survey.
5. How are data warehouse projects different from OLTP system projects? Describe four such
    differences.
6. List and explain any four of the development phases in the life cycle of a data warehouse project.
7. What do you consider to be a core set of team roles for a data warehouse project? Describe
    three roles from your set.
8. List any three warning signs likely to be encountered in a data warehouse project. What corrective
    actions will you need to take to resolve the potential problems indicated by these three warning
    signs?
9. Name and describe any five of the success factors in a data warehouse project.
10. What is meant by 'taking a practical approach' to the management of a data warehouse project?
      Give any two reasons why you think a practical approach is likely to succeed.

Business Cases:  
  
1. As the recently assigned project manager, you are required to work with the executive sponsor to
    write a justification without detailed ROI calculations for the first data warehouse project in your
    company. Write a justification report to be included in the planning document.
2. You are the data transformation specialist for the first data warehouse project in an airlines
    company. Prepare a project task list to include all the detailed tasks needed for data extraction and
    transformation.
3. Why do you think user participation is absolutely essential for success? As a member of the
    recently formed data warehouse team in a banking business, your job is to write a report on how
    the user departments can best participate in the development. What specific responsibilities for the
    users will you include in your project?
4. As the lead architect for a data warehouse in a large domestic retail store chain, prepare a list of
    project tasks relating to designing the architecture. In which development phases will these tasks
    be performed?

Monday, December 9, 2013

DWBI - A Business Perspective - Day 4

In many companies, some sort of formal justification is needed to initiate and fund an IT project; this is called cost-justification analysis. A rough breakdown of the costs is as follows: hardware, 31%; software including the DBMS, 24%; staff and system integrators, 35%; administration, 10%.

How can you justify the total cost by balancing the risks against the benefits?

How can you calculate the ROI (Return on Investment)?

How can you make a business case?

Usually, senior management demands a business case as a part of the RFI (Request for Information) in order to roll out an RFP (Request for Proposal).

"A business case captures the reasoning for initiating a project or task. It is often presented in a well-structured written document, but may also sometimes come in the form of a short verbal argument or presentation. The logic of the business case is that, whenever resources such as money or effort are consumed, they should be in support of a specific business need." - Wikipedia

I want to present a few typical approaches taken for justifying a DWBI project; pick the one that will work best for your organization. Here are some sample approaches for preparing the business case:

1. Calculate the current technology costs to produce the applications and reports supporting strategic
    decision making. Compare this with the estimated costs for the data warehouse and find the ratio
    between the current costs and proposed costs. See if this ratio is acceptable to senior management.

2. Calculate the business value of the proposed data warehouse with the estimated dollar values for 
    profits, dividends, earnings growth, revenue growth,  and market share growth. Review this
    business value expressed in dollars against the data warehouse costs and come up with the
    justification.

3. Identify all the components that will be affected by the proposed data warehouse and those that
    will affect the data warehouse. Start with the cost items, one by one, including hardware purchase
    or lease, vendor software, in-house software, installation and conversion, on-going support, and
    maintenance costs. Then put a dollar value on each of the tangible and intangible benefits
    including cost reduction, revenue enhancement, and effectiveness in the business community. Go
    further to do a cash flow analysis and calculate the ROI.
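
To make the third approach concrete, here is a minimal worked example of an ROI calculation, written as a small C# program purely for illustration; every figure in it is a hypothetical placeholder, not data from any real project.

using System;

class RoiSketch
{
    static void Main()
    {
        // All numbers below are hypothetical placeholders.
        double totalCost = 1500000;      // hardware + software + staff + administration
        double annualBenefit = 600000;   // estimated tangible and intangible benefits per year
        int years = 3;

        // ROI = (total benefits - total costs) / total costs
        double roi = (annualBenefit * years - totalCost) / totalCost;
        Console.WriteLine("3-year ROI: " + roi.ToString("P0"));   // prints "3-year ROI: 20%" (exact formatting is locale dependent)
    }
}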

Friday, December 6, 2013

DWBI - A Business Perspective - Day 3


Questions:

1. State any three factors that indicate the continued growth in data warehousing. Can you think of
    some examples?
2. Why do data warehouses continue to grow in size, storing huge amounts of data? Give any three
    reasons.
3. Why is it important to store multiple types of data in the data warehouse? Give examples of some
    unstructured data likely to be found in the data warehouse of a Sales and Marketing function.
4. Describe the types of charts you are likely to see in information delivery.
5. What is SMP (Symmetric Multiprocessing) parallel processing hardware? Describe the
    configuration.
6. What is MPP (massively parallel processing) parallel processing hardware? Describe the
    configuration.
7. Explain what is meant by agent technology. How can this technology be used in a data
    warehouse?
8. Describe any one of the options available to integrate ERP with data warehousing.
9. What is CRM? How can you make your data warehouse CRM-ready?
10. What do you mean by a web-enabled data warehouse? Describe three of its functional features.

Business Cases:

1. As a senior analyst on the data warehouse project of a large retail chain, you are responsible for
    improving data visualization of the output results. Make a list of your recommendations.
2. Explain how and why parallel processing can improve performance for data loading and index
    creation.
3. Discuss three specific ways in which agent technology may be used to enhance the value of the 
    data warehouse in a large manufacturing company.
4. Your company is in the business of renting DVDs and video tapes. The company has recently 
    entered into e-commerce and the senior management wants to make the existing data warehouse
    Web-enabled. List and describe any three of the major tasks required for satisfying the
    management's directive.

Wednesday, December 4, 2013

DWBI - A Business Perspective - Day 2

"Master the basics and rest follows...", below are the questionnaire on data warehouse fundamentals.

Questions:

1. Name at least six characteristics or features of a data warehouse.
2. Why is data integration required in a data warehouse, more so there than in an operational
    application?
3. Every data structure in the data warehouse contains the time element. Why?
4. Explain data granularity and how it is applicable to the data warehouse.
5. How are top-down and bottom-up approaches for building a data warehouse different? Discuss the 
    merits and disadvantages of each approach.
6. Why do you need a separate data staging component?
7. Under data transformation, list five different functions you can think of.
8. What are the various data sources for the data warehouse?
9. Name any six different methods for information delivery.
10. What are the major types of metadata in a data warehouse? Briefly mention the purpose of each
      type.
 
Business Cases:

1. A data warehouse is subject-oriented. What would be the major critical business subjects for the  
    following companies?
    a. An international manufacturing company
    b. A local community bank
    c. An international retail chain
2. You are the data analyst on the project team building a data warehouse for an insurance company.
    List the possible data sources from which you will bring the data into your warehouse. State
    your assumptions.
3. For an airlines company, identify three operational applications that would feed into the data
    warehouse. What would be the data load and refresh cycles?
4. Prepare a table showing all the potential users and information delivery methods for a data
    warehouse supporting a large national grocery chain.

DWBI - A Business Perspective - Day 1

Hi, I would like to drive a series on "DWBI - A Business Perspective" that helps you look at DWBI technology as a business person. This series will have questions and business cases to work on. I believe in self-learning rather than spoon-feeding, because it breaks through the barriers of information availability and helps us explore more.

But if you still prefer the earlier approach, these are direct references from
"Data Warehousing Fundamentals for IT Professionals" by Paulraj Ponniah, a book that helped my survival as a Business Analyst. I sincerely thank Paulraj Ponniah Garu for his writings and recommend the book to everyone.

Questions:

1. What do you mean by strategic information? For a commercial bank, name five types of strategic objectives.
2. Do you agree that a typical retail store collects huge volumes of data through its operational systems? Name three types of transaction data likely to be collected by a retail store in large volumes during its daily operations.
3. Examine the opportunities that can be provided by strategic information for a medical center. Can you list five such opportunities?
4. Why were all the attempts by IT to provide strategic information failures? List three concrete reasons and explain.
5. Describe five differences between operational systems and informational systems.
6. Why are operational systems not suitable for providing strategic information? Give three specific reasons and explain.
7. Name six characteristics of the computing environment needed to provide strategic information.
8. What types of processing take place in a data warehouse? Describe them.
9. A data warehouse is an environment, not a product. Discuss.
10. Data warehousing is the only viable means to resolve the information crisis and to provide strategic information. List four reasons to support this assertion and explain them.
11. The current trends in hardware/software technology make data warehousing much more feasible. Explain.

Business Cases:

1. You are the IT Director of a nationwide insurance company. Write a memo to the Executive Vice President explaining the types of opportunities that can be realized with readily available strategic information.
2. For an airlines company, how can strategic information increase the number of frequent flyers? Discuss, giving specific details.
3. You are a Senior Analyst in the IT department of a company manufacturing automobile parts. The marketing VP is complaining about the poor response by IT in providing strategic information. Draft a proposal to him explaining the reasons for the problems and why a data warehouse would be the only viable solution.

Sunday, July 21, 2013

The Fuzzy Lookup Transformation

A lookup becomes Fuzzy when it can match to records that are similar, but not identical to, the lookup key.
The Lookup transformation returns either an exact match or nothing from the reference table, while the Fuzzy Lookup transformation uses fuzzy matching to return one or more close matches from the reference table.

Fuzzy Lookup can be used for cleansing data or for checking its quality before loading it to the destination.
For example, consider a scenario where the Employee data coming from the source file has some Employee Names that are misspelt or contain bad characters. As per the business requirement, we want the source data to be cleansed before loading it to the destination. The source data will be matched against the reference data set (which has the correct data entries), and the close or perfect matches will be loaded into the destination table.

Matching of records between the primary (source) dataset and the reference dataset is controlled by configuring the “Similarity Threshold” in the Fuzzy Lookup editor, which is scaled from 0 to 1.
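
To get a feel for what a 0-to-1 similarity score means, here is a rough conceptual sketch in C#. It uses a simple Levenshtein edit distance, which is not the algorithm SSIS uses internally (SSIS's fuzzy matching is token-based and more sophisticated), but it illustrates how a misspelt name can still score high against its correct form.

using System;

class SimilaritySketch
{
    // Classic Levenshtein edit distance between two strings.
    static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }

    static void Main()
    {
        string source = "Jhon Smiht";     // misspelt name from the source file
        string reference = "John Smith";  // correct entry in the reference dataset

        // Normalize the distance into a 0-1 similarity score.
        double similarity = 1.0 - (double)Levenshtein(source, reference)
                                  / Math.Max(source.Length, reference.Length);
        Console.WriteLine(similarity);    // 0.6 here: it would pass a 0.5 threshold but fail a 0.8 one
    }
}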

               


Happy Learning!!!

The Lookup Transformation:

The Lookup transformation performs lookups by joining data in input columns (the primary dataset) with columns in a reference dataset. The reference dataset can be an existing table, a file, etc.
The Lookup is similar to a Merge Join transformation but is more efficient when dealing with larger data sets.

Before we dig into further technical details, let us try to understand the Lookup through an example.

Consider a scenario where we have information about all the Indian sports personalities along with the sport they are associated with (the primary dataset).
In the second dataset we have information of all the events that are going to be part of the London Olympics 2012(Reference dataset).
We want the details of only those sports personalities who can participate in London Olympics 2012 to be transferred to another Database for further analysis.

                              


In order to achieve the above objective we will use the Lookup transformation, where the primary dataset will be its input and the reference dataset needs to be configured/specified in the Lookup transformation along with the lookup column (in this case “Event_Name”).
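
Conceptually, the Lookup behaves like a join from the primary dataset to the reference dataset on the lookup column. A rough LINQ sketch of the Olympics scenario (with made-up rows) might look like this:

using System;
using System.Linq;

class LookupSketch
{
    static void Main()
    {
        // Primary dataset: sports personalities and their sports (made-up rows).
        var personalities = new[]
        {
            new { Name = "Player A", Sport = "Badminton" },
            new { Name = "Player B", Sport = "Cricket" }
        };
        // Reference dataset: events that are part of London Olympics 2012.
        var olympicEvents = new[] { "Badminton", "Archery" };

        // Keep only the rows whose Sport finds a match in the reference set.
        var matched = personalities.Where(p => olympicEvents.Contains(p.Sport));
        foreach (var p in matched)
            Console.WriteLine(p.Name);   // only "Player A"; "Cricket" finds no match
    }
}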


Additional Info
·         The lookup column in both the primary and reference datasets must be identical (same data type).
·         If the reference table has multiple entries (duplicates in the lookup column), the lookup will quit searching as soon as it finds the first match.
·         If a reference is not found in the lookup (in this case “Cricket”), we can configure the lookup to either redirect those rows to another destination, fail the component, or simply ignore them.
·         Lookup also provides options to cache the reference dataset locally, which helps significantly improve the performance of the package.
(We have three options: a) Full Cache b) Partial Cache c) No Cache.)

Multicast, Union All and Row Count Transformations

The Multicast Transformation:
The Multicast transformation creates copies of the dataset that is passed to it.
If there is a scenario where we pull data from a source but want to send it to two or more destinations, Multicast is the way to go.

Implementing a Multicast to copy Employee data coming from an OLEDB source into 3 different files.


The UNION ALL Transformation:
This is quite the opposite of the Multicast transformation.
UNION ALL is used to merge multiple datasets into one.
(This is similar to SQL’s “UNION ALL”).

Combine (UNION ALL) data coming from a Flat File and Excel and insert into an OLEDB Destination



Row Count:
A Row Count transformation stores the number of rows passing through it in an SSIS variable.
This information can be used to functionally manipulate the flow of the package.

 
The Row Count transformation will store the actual number of records that passed through it in the variable “File_Row_Count” (highlighted above).

In the above scenario, we can decide on further execution of the package based on the value stored in the “File_Row_Count” variable.
(In most cases we abandon further processing of the file if we know that no data was passed into it.)
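
As a rough illustration, a Script Task placed after the Data Flow could read the variable and fail or succeed accordingly. The sketch below assumes the standard SSIS Script Task host (it will not compile as a standalone program) and the variable name used above:

public void Main()
{
    // Read the count captured by the Row Count transformation.
    int rowCount = (int)Dts.Variables["User::File_Row_Count"].Value;

    if (rowCount == 0)
    {
        // No data came through the file: abandon further processing.
        Dts.TaskResult = (int)ScriptResults.Failure;
        return;
    }
    Dts.TaskResult = (int)ScriptResults.Success;
}

Alternatively, a precedence constraint with the expression @[User::File_Row_Count] > 0 achieves the same gating without any code.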

Derived Column, Copy Column and Data Conversion Transformation

Derived Column Transformation:

A widely used transformation, the Derived Column helps create new column values by applying expressions to the transformation's input columns, where an expression can contain any combination of variables, functions, operators, and columns from the transformation input. The result can be added as a new column or inserted into an existing column as a replacement value.

E.g., assume a scenario where we need to bring employee data from a file and load it into a table. There is a column in the table, “FullName”, which should be loaded as “FirstName+LastName”, where “FirstName” and “LastName” are attributes coming from the source file. Using a Derived Column, we can create a new column FullName = FirstName + LastName on the fly (in the SSIS buffer) and map it to the destination table.

                           

Copy Column Transformation: 
This transformation is used if we need to create a copy of a column on the fly.


Data Conversion Transformation: 
This transformation is used to convert the DataType of Columns while transferring the data.

Note:
The functionality achieved by a Copy Column or a Data Conversion transformation can also be implemented through a Derived Column transformation. But if the requirement is specific, i.e. only to create a pure copy of a column without any manipulation, or just to convert the data type of a column, then it is advisable to use the Copy Column or Data Conversion transformations for improved performance.

(A sophisticated mobile/car may have multiple features but will come at a cost. The object size of the Derived Column transformation is larger since it supports multiple features; hence, if the functionality is a simple one and can be achieved by a simpler (and single) transformation, then from a performance perspective it is always desirable to go with the latter.)

Merge, Merge Join and Union All Transformations

Let us continue our exploration of some more commonly used Data Flow Transformations in SSIS.

In this article we shall look into the Merge, Merge Join and Union All Transformations.

(Although the names may look similar, from a functionality perspective they are entirely different.)

The Merge Transformation:
The Merge Transformation merges two sorted data sets and outputs a sorted data set.
E.g.: Let us consider a scenario where we have sales data coming from the India and US geographies, and the requirement is to have consolidated output data sorted on Product Name.
In the below case the unsorted data from the two sources will be sorted using a Sort transformation and will become the inputs to the Merge transformation, which will consolidate the two data sets and give a sorted output.

Data Flow Designer

Quick peek at the actual Data

The Merge transformation requires that the merged columns in its inputs have matching metadata.
For example, we cannot merge a column that has a numeric data type with a column that has a character data type.

The Merge Join Transformation:
The Merge Join Transformation joins two data sets on a common key (or keys).
It is similar to a SQL join such as Inner Join, Full Outer Join or Left Outer Join; however, the two input datasets can belong to two different sources altogether.
The Merge Join Transformation requires sorted data for its inputs, and the sorting should be done on the column that will be used as the join key.

E.g. Consider a scenario where we have a Table A (Employee) with columns EmpId, Emp_Name and Dept_Id, and a Table B (Department) with Dept_Id and Dept_Name. The requirement here is to have the final data (stored in a SQL table) showing EmpId and Emp_Name from Table A and Dept_Name from Table B.
In order to achieve the above objective we will use the Merge Join Transformation, which will join (Inner Join) the two data sets on the basis of the common key (Dept_Id) column and also provides the option to select the columns we want in the final result set.
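
For intuition, the inner join the Merge Join performs is conceptually the same as a LINQ (or SQL) join on Dept_Id. A minimal sketch with made-up rows:

using System;
using System.Linq;

class MergeJoinSketch
{
    static void Main()
    {
        var employees = new[]        // Table A (Employee)
        {
            new { EmpId = 1, Emp_Name = "Asha", Dept_Id = 10 },
            new { EmpId = 2, Emp_Name = "Ravi", Dept_Id = 20 }
        };
        var departments = new[]      // Table B (Department)
        {
            new { Dept_Id = 10, Dept_Name = "Sales" },
            new { Dept_Id = 20, Dept_Name = "Finance" }
        };

        // Inner join on the common key, selecting only the columns we need.
        var result = employees.Join(departments,
                                    e => e.Dept_Id, d => d.Dept_Id,
                                    (e, d) => new { e.EmpId, e.Emp_Name, d.Dept_Name });
        foreach (var r in result)
            Console.WriteLine(r.EmpId + " " + r.Emp_Name + " " + r.Dept_Name);
    }
}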

Data Flow Designer / Quick peek at the actual Data
*Please note that:
·         We can avoid the Sort transformation if we are providing sorted input to the Merge Join Transformation. However, we need to make sure that the IsSorted property in the source component is set to true.
·         In the source component of a Merge Join it is also necessary to configure the SortKeyPosition of the join-key column (Dept_Id in this case) to “1”.

The Union All Transformation:
The Union All transformation provides much the same functionality as the Merge Transformation, but they differ in the aspects below (a sketch follows the list):
1.       In order to use Union All we don't need a sorted input, as we do for the Merge.
2.       Union All can have more than two input datasets, while the Merge transformation can handle only two.
3.       The output of the Union All is not sorted.
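
The behaviour in points 1 and 3 is essentially that of LINQ's Concat: inputs are appended as-is, with no sorting involved. A tiny sketch:

using System;
using System.Linq;

class UnionAllSketch
{
    static void Main()
    {
        int[] fromFlatFile = { 3, 1 };
        int[] fromExcel = { 2, 4 };

        // Append the second input after the first; no sort, no de-duplication.
        var combined = fromFlatFile.Concat(fromExcel);
        Console.WriteLine(string.Join(",", combined));   // 3,1,2,4 (unsorted)
    }
}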

Transformations in a SSIS Data Flow Task

If we are trying to relate the keyword “Transformation”, as in “Extraction, Transformation, Load” (ETL), to the Transformations in a Data Flow Task (this article), then we are absolutely correct!
As the name suggests, the Transformation components available in a Data Flow Task are used to transform the data/metadata while the data is flowing from a source to a destination (on the fly).

The term “Transformations” has a very wide scope in SSIS. Each transformation component is different from another and serves a different purpose. For example, there are transformations that will aggregate the data, hence changing the granularity of the data set; some can union the data coming from multiple sources and push it into one destination; some are used to filter the data; etc.

Let us begin by looking into one of the commonly used (and easy to understand) Transformations in SSIS:

 The Conditional Split Transformation

Conditional Split can be defined as a component that will route data rows to different outputs depending on the content of the data.
(It is similar to a CASE decision structure in a programming language).

Consider a scenario where a flat file contains consolidated sales data and we would like to split the data based on a Gender column present in the source so that it can be pumped into different destinations for carrying out further analysis.


In the above flow we are using a Conditional Split transformation to divide the data based on the gender.
One set goes into an OLEDB table(Destination 1) and the other into an Excel file(Destination 2).

How this works(functionally):
1.       Read all the data in the Flat File through the Flat File Source Component.
2.       Each record will pass through the Conditional Split transformation and will be parsed, one by one.
3.       The Conditional Split transformation will be configured in such a way that it will check the value in the “Gender” column of the record.
From a functional perspective, the expression used here would be one per output, e.g. Gender == "Female" and Gender == "Male".
4.       This expression would be checked for each record that is passing through this transformation and based on the result the record would be sent to the appropriate destination.
5.       If the value is “Female” the record will be sent to the OLEDB table; if the value is “Male” the record will be sent to the Excel file. (A conceptual sketch follows.)
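
Functionally, each record goes through a CASE-like decision. A conceptual C# sketch of this routing (made-up rows; in SSIS itself this is configured declaratively, not coded):

using System;
using System.Collections.Generic;

class ConditionalSplitSketch
{
    static void Main()
    {
        var rows = new[] { ("Anita", "Female"), ("Vijay", "Male") };
        var toOledbTable = new List<string>();   // Destination 1
        var toExcelFile = new List<string>();    // Destination 2

        // Route each record based on the value in its Gender column.
        foreach (var (name, gender) in rows)
        {
            if (gender == "Female") toOledbTable.Add(name);
            else if (gender == "Male") toExcelFile.Add(name);
        }
        Console.WriteLine("OLEDB: " + string.Join(",", toOledbTable)
                          + "; Excel: " + string.Join(",", toExcelFile));
    }
}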

In the coming articles, we shall explore more on the major/highly used Transformation components in SSIS.

Precedence Constraints

In the previous articles, while looking into the Control Flow tasks, we would have observed pointed arrows that connect those tasks. These arrows are not merely for directing the workflow of an SSIS package but have much more functionality embedded in them.

To begin with, in SSIS these pointed arrows that link control flow tasks are called Precedence Constraints.

Precedence constraints (PCs) link the individual executables (Control Flow Tasks) together and determine how the workflow moves from one executable to the next. (An executable can be a For Loop, Foreach Loop or Sequence container, a control flow task, or an event handler.)

The linked executables can be divided into:
a) the precedence executable (the task from which the precedence constraint originates)
b) the constrained executable (the task at which the precedence constraint terminates/points)

Workflow is defined in the control flow using the Precedence Constraint Editor, which provides options to define the workflow in two ways:

1)       Defining workflow by Success, Failure or Completion - This can be done by setting the Value property in the Precedence Constraint Editor, which provides three options:
§  Success: The precedence executable must run successfully for the constrained executable to run. PC is set to green when the Success option is selected.
§  Failure: The precedence executable must fail for the constrained executable to run. PC is set to red when the Failure option is selected.
§  Completion: The constrained executable will run after the precedence executable runs, whether the precedence executable runs successfully or whether it fails. PC is set to blue in this case.

                   

2)       Defining workflow by Expressions - This can be done by setting the Evaluation Operation property in the Precedence Constraint Editor, which provides the options below:
§  Constraint: The precedence constraint is evaluated solely on the option selected in the Value property.
§  Expression: The precedence constraint is evaluated based on the expression defined in the expression text box (for example, @[User::File_Row_Count] > 0). If the expression evaluates to true, the workflow continues and the constrained executable runs. If the expression evaluates to false, the constrained executable does not run.
§  Expression and Constraint: The precedence constraint is evaluated based on both the Value property and the expression. Both must evaluate to true for the constrained executable to run.
§  Expression or Constraint: The precedence constraint is evaluated based on either the Value property or the expression; if either evaluates to true, the constrained executable runs.

Apart from the above properties, we have one more property in the Precedence Constraint Editor: Logical AND and Logical OR. This property comes into the picture when the constrained task has multiple incoming constraints.

Logical AND: Task ‘C’ will execute when both Task A and Task B complete successfully.
Logical OR: Task ‘C’ will execute when either Task A or Task B completes successfully.


Please note that two tasks will execute in parallel if they are not connected through precedence constraints.

SSIS Containers

Hope the last article gave a good idea about some of the important control flow tasks.
Today we shall look into the major Control Flow Containers used in SSIS.

1.       For Loop Container-
The For Loop Container will iterate over the tasks that we put inside the container a predetermined number of times, or until a condition is met.
(It is similar to the “For Loop” that we have in programming languages.)
As in a typical programming-language expression “For (i=0; i<11; i++)”, we have 3 parameters that need to be configured in the For Loop Container:
a.       InitExpression      (equivalent to i=0; in SSIS expression syntax, @i = 0)
b.       EvalExpression    (equivalent to i<11, i.e. @i < 11)
c.        AssignExpression (equivalent to i++, i.e. @i = @i + 1)

E.g.- Let us assume a scenario where we need to download 10 files from a remote directory every time the package executes;

       this is implemented through a “For Loop Container” having an FTP task in it (as shown below).


2.       Foreach Loop Container-
The Foreach Loop container is used to iterate over a collection of objects (say, a directory of files or a recordset), performing the tasks that are placed inside the container on every iteration. The loop implementation is similar to the Foreach looping structure in programming languages.

E.g.- Rename all the files in a folder; implemented through a “Foreach Loop Container” having a File System Task in it (as shown below).

The Foreach loop container will iterate over all the files in that folder and the File System Task will rename each file the Foreach Loop iterates through.


3.       Sequence Container-
The Sequence container defines a control flow that is a subset of the package control flow. Sequence containers divide the package into multiple separate control flows, each containing one or more tasks and containers. Sequence Containers are used to logically group a set of control flow tasks.

E.g.- Load data for different countries (stored in separate locations) into respective tables.



Note- All the above containers will execute in parallel.


*The SSIS package is a container in itself (the highest-level container).

SSIS Control Flow Tasks

After having a Byte of Data Flow Task in our last article, we shall look into some of the other frequently used SSIS Control Flow Tasks.


1.       EXECUTE SQL TASK -
This task is used to run T-SQL queries from an SSIS package.
The task also provides the ability to pass and receive parameters.
It supports the following connection types: EXCEL, OLE DB, ODBC, ADO, ADO.NET or SQLMOBILE.


2.       FILE SYSTEM TASK  – 
Whenever there is a need to perform File system operations, this task comes into the picture.
E.g. Rename, Delete, Move, Copy files/directories.



3.       EXECUTE PACKAGE TASK - 
This task is used to call one SSIS package from another.
We can also pass variables from the parent package to the child package.


4.       FTP TASK - 
The FTP (File Transfer Protocol) task is used to send or receive files from a remote location.
               

5.       Script Task - 
Whenever there is a requirement that cannot be met by the out-of-the-box features in SSIS, we can use the Script Task, as it gives us the flexibility to use the extensive set of classes in the .NET framework.
We can either use VB.net or C# to code custom scripts.
For example: reading file properties like the created date or the size of a file, or trying to catch hold of a specific string in a random non-formatted text file. (A sketch follows.)
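
A minimal sketch of the kind of .NET code such a Script Task might run (the file path and search string are hypothetical; shown here as a standalone program for readability):

using System;
using System.IO;

class FilePropertiesSketch
{
    static void Main()
    {
        var file = new FileInfo(@"C:\data\input.txt");   // hypothetical path

        // Reading file properties like created date or size.
        Console.WriteLine("Created: " + file.CreationTime + ", Size: " + file.Length + " bytes");

        // Catching hold of a specific string in a non-formatted text file.
        bool found = File.ReadAllText(file.FullName).Contains("ERROR");   // "ERROR" is an assumed search string
        Console.WriteLine(found ? "String found" : "String not found");
    }
}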

Happy Learning!