Sunday, July 21, 2013

The Fuzzy Lookup Transformation

A lookup becomes fuzzy when it can match records that are similar, but not identical, to the lookup key.
The Lookup transformation returns either an exact match or nothing from the reference table, while the Fuzzy Lookup transformation uses fuzzy matching to return one or more close matches from the reference table.

Fuzzy Lookup can be used to cleanse data or to check its quality before loading it to the destination.
For example, consider a scenario where the Employee data coming from the source file has some Employee Names that are misspelt or contain bad characters. As per the business requirement, the source data must be cleansed before loading it to the destination. The source data is matched against a reference data set (which holds the correct entries), and the close or perfect matches are loaded into the destination table.

Matching between the primary (source) dataset and the reference dataset is controlled by the “Similarity Threshold” setting in the Fuzzy Lookup editor, which ranges from 0 to 1; a rough sketch of the idea follows.
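T-SQL has no direct equivalent of the Fuzzy Lookup's internal matching algorithm, but the built-in SOUNDEX/DIFFERENCE functions give a feel for approximate name matching. A minimal sketch, assuming hypothetical EmployeeSource and EmployeeReference tables (an analogy only, not the algorithm SSIS actually uses):

    -- DIFFERENCE() scores phonetic similarity from 0 (none) to 4 (strong);
    -- the >= 3 filter plays the role of a crude similarity threshold.
    SELECT s.EmployeeName AS SourceName,
           r.EmployeeName AS ReferenceName,
           DIFFERENCE(s.EmployeeName, r.EmployeeName) AS SimilarityScore
    FROM dbo.EmployeeSource s
    JOIN dbo.EmployeeReference r
      ON DIFFERENCE(s.EmployeeName, r.EmployeeName) >= 3;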

               


Happy Learning!!!

The Lookup Transformation:

The Lookup transformation performs lookups by joining data in input columns (the primary dataset) with columns in a reference dataset. The reference dataset can be an existing table, a file, etc.
The Lookup is similar in effect to a Merge Join transformation, and is often more efficient when the reference data set is small enough to be cached in memory.

Before we dig into further technical details, let us try to understand the Lookup through an example.

Consider a scenario where we have information about all the Indian sports personalities along with the sport each is associated with (the primary dataset).
In the second dataset we have information about all the events that are going to be part of the London Olympics 2012 (the reference dataset).
We want the details of only those sports personalities who can participate in the London Olympics 2012 to be transferred to another database for further analysis.

                              


In order to achieve the above objective we will use the Lookup transformation: the primary dataset will be its input, and the reference dataset needs to be configured/specified in the Lookup transformation along with the lookup column (in this case “Event_Name”).


Additional Info
·         The lookup column in both the primary and reference datasets must be identical (same data type).
·         If the reference table has multiple entries (duplicates in the lookup column), the lookup stops searching as soon as it finds the first match.
·         If a reference is not found in the lookup (in this case “Cricket”), we can configure the lookup to either redirect those rows to another output, fail the component, or simply ignore the failure (see the sketch below).
·         The Lookup also provides options to cache the reference dataset locally, which can significantly improve the performance of the package.
(We have three options: a) Full Cache, b) Partial Cache, c) No Cache.)
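In T-SQL terms the Lookup behaves much like a LEFT JOIN, where NULLs on the reference side mark the no-match rows. A minimal sketch, assuming hypothetical SportsPersons and OlympicEvents tables from the example above:

    -- Matched rows flow on to the destination; the NULL rows are the ones
    -- we can redirect, fail the component on, or ignore.
    SELECT p.Person_Name, p.Event_Name
    FROM dbo.SportsPersons p
    LEFT JOIN dbo.OlympicEvents e
      ON p.Event_Name = e.Event_Name
    WHERE e.Event_Name IS NOT NULL;  -- IS NULL => no match (the "Cricket" rows)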

Multicast, Union All and Row Count Transformations

The Multicast Transformation:
The Multicast transformation creates copies of the dataset that is passed to it.
If there is a scenario where we pull data from a source but want to send it to two or more destinations, Multicast is the way to go; a rough analogy follows the example below.

Implementing a Multicast to copy Employee data coming from an OLEDB source into 3 different files.
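A rough T-SQL analogy of the same idea, with hypothetical table names (SSIS reads the source only once; the temp table stands in for the data flow buffer):

    -- Read the source once into a buffer...
    SELECT * INTO #EmployeeCopy FROM dbo.EmployeeSource;

    -- ...then feed identical copies of the rows to several destinations.
    INSERT INTO dbo.EmployeeDest1 SELECT * FROM #EmployeeCopy;
    INSERT INTO dbo.EmployeeDest2 SELECT * FROM #EmployeeCopy;
    INSERT INTO dbo.EmployeeDest3 SELECT * FROM #EmployeeCopy;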


The UNION ALL Transformation:
This is quite the opposite of the Multicast transformation.
UNION ALL is used to merge multiple datasets into one.
(This is similar to SQL’s “UNION ALL”, as sketched below.)
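The SQL counterpart, assuming hypothetical staging tables for the two sources:

    -- Rows from both inputs are appended into a single output;
    -- no sorting and no duplicate removal takes place.
    SELECT EmpId, EmpName FROM dbo.Employees_FlatFileStage
    UNION ALL
    SELECT EmpId, EmpName FROM dbo.Employees_ExcelStage;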

                                                            Combine (UNION ALL) data coming from a Flat File and Excel and insert into an OLEDB Destination



Row Count:
A Row Count transformation stores the number of rows passing through it in an SSIS variable.
This information can then be used to control the flow of the package.

 
The Row Count transformation will store the actual number of records that passed through it in a variable “File_Row_Count” (highlighted above).

In the above scenario, we can decide on further execution of the package based on the value stored in the “File_Row_Count” variable.
(In most cases we abandon further processing of the file if we know that no data passed through it; a sketch of this check follows.)
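Inside the package this check is usually an expression on a precedence constraint, such as @[User::File_Row_Count] > 0. Expressed as T-SQL for clarity (the staging table is illustrative):

    DECLARE @File_Row_Count INT;

    -- The Row Count transformation effectively performs this counting
    -- in the data flow buffer.
    SELECT @File_Row_Count = COUNT(*) FROM dbo.StagingEmployee;

    IF @File_Row_Count = 0
        PRINT 'No rows received; abandon further processing.';
    ELSE
        PRINT 'Rows received; continue with the load.';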

Derived Column, Copy Column and Data Conversion Transformation

Derived Column Transformation:

A widely used transformation, the Derived Column creates new column values by applying expressions to the transformation input columns. The expression can contain any combination of variables, functions, operators, and columns from the transformation input, and the result can be added as a new column or inserted into an existing column as a replacement value.

E.g. assume a scenario where we need to bring employee data from a file and load it into a table. There is a column in the table, “FullName”, which should be loaded as “FirstName + LastName”, where “FirstName” and “LastName” are attributes coming from the source file. Using a Derived Column, we can create a new column FullName = FirstName + LastName on the fly (in the SSIS buffer) and map it to the destination table, as sketched below.
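The SSIS expression would be along the lines of FirstName + " " + LastName; the equivalent projection in T-SQL (the source table name is illustrative):

    -- The Derived Column computes this in the data flow buffer; the effect
    -- is the same as this projection.
    SELECT FirstName,
           LastName,
           FirstName + ' ' + LastName AS FullName
    FROM dbo.EmployeeSource;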

                           

Copy Column Transformation: 
This transformation is used if we need to create a copy of a column on the fly.


Data Conversion Transformation: 
This transformation is used to convert the data type of columns while transferring the data, as in the sketch below.
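A T-SQL analogy of what the Data Conversion transformation does to each row (table and column names are illustrative):

    -- e.g. converting string columns read from a flat file to typed columns
    SELECT CAST(EmpId  AS INT)            AS EmpId,
           CAST(Salary AS DECIMAL(10, 2)) AS Salary
    FROM dbo.EmployeeSource;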

Note:
The functionality achieved by a Copy Column or a Data Conversion transformation can also be implemented through a Derived Column transformation. However, if the requirement is specific, i.e. only to create a pure copy of a column without any manipulation, or just to convert the data type of a column, it is advisable to use the Copy Column or Data Conversion transformation for better performance.

(A sophisticated mobile/car may have multiple features but comes at a cost. The object size of the Derived Column transformation is larger since it supports multiple features; hence, if the functionality is simple and can be achieved by a simpler (and single-purpose) transformation, then from a performance perspective it is always desirable to go with the latter.)

Merge, Merge Join and Union All Transformations

Let us continue our exploration of some more commonly used Data Flow Transformations in SSIS.

In this article we shall look into the Merge and the Merge Join transformations.

(Although the names look similar, from a functionality perspective they are entirely different.)

The Merge Transformation:
The Merge transformation merges two sorted data sets and outputs a sorted data set.
E.g.: Let us consider a scenario where we have sales data coming from the India and US geographies and the requirement is to have consolidated output data sorted on Product Name.
In the below case the unsorted data from the two sources will be sorted using a Sort transformation and will become the inputs to the Merge transformation, which will consolidate the two data sets and give a sorted output (a T-SQL sketch follows).
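A rough T-SQL analogy, assuming hypothetical regional sales tables (the Merge itself interleaves two already-sorted streams rather than re-sorting):

    -- Combine the two inputs and present the result ordered on Product Name.
    SELECT ProductName, SalesAmount FROM dbo.Sales_India
    UNION ALL
    SELECT ProductName, SalesAmount FROM dbo.Sales_US
    ORDER BY ProductName;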

                                            Data Flow Designer                                                                                                     

     

Quick peek at the actual Data

The Merge transformation requires that the merged columns in its inputs have matching metadata.
For example, we cannot merge a column that has a numeric data type with a column that has a character data type.

The Merge Join Transformation:
The Merge Join transformation joins two data sets on a common key (or keys).
It is similar to a SQL join (Inner Join, Full Outer Join, or Left Outer Join); however, the two input datasets can belong to two different sources altogether.
The Merge Join transformation requires sorted inputs, and the sorting should be done on the column(s) that will be used as the join key.

E.g. consider a scenario where we have a Table A (Employee) with columns EmpId, Emp_Name and Dept_Id, and Table B (Department) with Dept_Id and Dept_Name. The requirement here is to have the final data (stored in a SQL table) showing EmpId and Emp_Name from Table A and Dept_Name from Table B.
In order to achieve the above objective we will use the Merge Join transformation, which will join (inner join) the two data sets on the common key (Dept_Id) column and also provides the option to select the columns we want in the final result set; the equivalent SQL is sketched below.
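The T-SQL equivalent of this inner join, using the tables from the example:

    -- Unlike SQL, the SSIS Merge Join additionally requires that both of its
    -- inputs arrive already sorted on the join key (Dept_Id).
    SELECT e.EmpId, e.Emp_Name, d.Dept_Name
    FROM dbo.Employee e
    INNER JOIN dbo.Department d
      ON e.Dept_Id = d.Dept_Id;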

                                         Data Flow Designer                                            Quick peek at the actual Data
                                 
*Please note that:
·         We can avoid the Sort transformation if we provide an already-sorted input to the Merge Join transformation. However, we need to make sure that the IsSorted property of the source component's output is set to True.
·         In the source component of a Merge Join it is also necessary to set the SortKeyPosition of the join-key column (Dept_Id in this case) to 1.

The Union All Transformation:
The Union All transformation provides much the same functionality as the Merge transformation, but they differ in the below aspects:
1.       Unlike the Merge, Union All does not require sorted inputs.
2.       Union All can have more than two input datasets, while the Merge transformation can handle only two.
3.       The output of the Union All is not sorted.

Transformations in a SSIS Data Flow Task

If we are trying to relate the keyword “Transformation” as in “Extract, Transform, Load” (ETL) to the transformations in a Data Flow Task (this article), then we are absolutely correct!
As the name suggests, the transformation components available in a Data Flow Task are used to transform the data/metadata while the data is flowing from a source to a destination (on the fly).

The term “Transformations” has a very wide scope in SSIS. Each transformation component is different from the others and serves a different purpose. For example, there are transformations that aggregate the data, hence changing the granularity of the data set; some can union the data coming from multiple sources and push it into one destination; some are used to filter the data, etc.

Let us begin by looking into one of the commonly used (and easy to understand) transformations in SSIS:

 The Conditional Split Transformation

Conditional Split can be defined as a component that will route data rows to different outputs depending on the content of the data.
(It is similar to a CASE decision structure in a programming language).

Consider a scenario where a flat file contains consolidated sales data and we would like to split the data based on a Gender column present in the source, so that it can be pumped into different destinations for carrying out further analysis.


In the above flow we are using a Conditional Split transformation to divide the data based on the gender.
One set goes into an OLEDB table(Destination 1) and the other into an Excel file(Destination 2).

How this works(functionally):
1.       Read all the data in the flat file through the Flat File Source component.
2.       Each record will pass through the Conditional Split transformation and will be parsed one by one.
3.       The Conditional Split transformation will be configured to check the value in the “Gender” column of the record.
Functionally, the expression used here is of the form Gender == "Female" (SSIS expressions use == for comparison).
4.       This expression is checked for each record passing through the transformation and, based on the result, the record is sent to the appropriate destination.
5.       If the value is “Female” the record is sent to the OLEDB table; if the value is “Male” the record is sent to the Excel file (a T-SQL sketch of the split follows).
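A rough T-SQL analogy of the split, with hypothetical staging and destination tables (in SSIS the routing happens row by row in the buffer, and the second destination is an Excel file rather than a table):

    -- Each branch corresponds to one output of the Conditional Split.
    INSERT INTO dbo.FemaleSales
    SELECT * FROM dbo.AllSales WHERE Gender = 'Female';

    INSERT INTO dbo.MaleSales
    SELECT * FROM dbo.AllSales WHERE Gender = 'Male';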

In the coming articles, we shall explore more on the major/highly used Transformation components in SSIS.

Precedence Constraints

In the previous articles, while looking into the control flow tasks, we would have observed pointed arrows that connect these tasks. These arrows are not merely for directing the workflow of an SSIS package but have much more functionality embedded in them.

To begin with, in SSIS these pointed arrows that link control flow tasks are called Precedence Constraints.

Precedence constraints (PCs) link the individual executables (control flow tasks) together and determine how the workflow moves from one executable to the next. (An executable can be a For Loop, Foreach Loop, or Sequence container, a control flow task, or an event handler.)

The linked executables can be divided as:
a) Precedence executable (the task from which the precedence constraint originates)
b) Constrained executable (the task at which the precedence constraint terminates/points to)

Workflow is defined in the control flow using the Precedence Constraint Editor, which provides options to define the workflow in two ways-

1)       Defining Work Flow by Success, Failure or Completion-  This can be done by setting the Value property in the Precedence Constraint editor, which provides three options-
§  Success: The precedence executable must run successfully for the constrained executable to run. PC is set to green when the Success option is selected.
§  Failure: The precedence executable must fail for the constrained executable to run. PC is set to red when the Failure option is selected.
§  Completion: The constrained executable will run after the precedence executable runs, whether the precedence executable runs successfully or whether it fails. PC is set to blue in this case.

                   

2)       Defining Work Flow by Expressions-  This can be done by setting the Evaluation Operation property in the Precedence Constraint editor, which provides the below options-
§  Constraint: The precedence constraint is evaluated solely on the option selected in the Value property.
§  Expression: The precedence constraint is evaluated based on the expression defined in the expression text box (e.g. @[User::File_Row_Count] > 0). If the expression evaluates to true, the workflow continues and the constrained executable runs. If the expression evaluates to false, the constrained executable does not run.
§  Expression and Constraint: The precedence constraint is evaluated based on both the Value property and the expression. Both must evaluate to true for the constrained executable to run.
§  Expression or Constraint: The precedence constraint is evaluated based on either the Value property or the expression; at least one of the two must evaluate to true for the constrained executable to run.

Apart from the above properties we have one more property in the Precedence Constraint Editor: Logical AND and Logical OR. This property comes into the picture when the constrained task has multiple incoming constraints.

Logical AND: Task C will execute only when both Task A and Task B complete successfully.
Logical OR: Task C will execute when either Task A or Task B completes successfully.


Please note that two tasks will execute in parallel if they are not connected through precedence constraints.

SSIS Containers

Hope the last article gave a good idea about some of the important control flow tasks.
Today we shall look into the major Control Flow Containers used in SSIS.

1.       For Loop Container-
The For Loop container will iterate over the tasks that we put inside it a predetermined number of times, or until a condition is met.
(It is similar to the “For Loop” that we have in programming languages.)
As in a typical programming-language expression “for (i=0; i<11; i++)”, there are 3 parameters that need to be configured in the For Loop container (shown here with a hypothetical integer package variable Counter):
a.       InitExpression      (equivalent to i=0, e.g. @Counter = 0)
b.       EvalExpression    (equivalent to i<11, e.g. @Counter < 11)
c.        AssignExpression (equivalent to i++, e.g. @Counter = @Counter + 1)

E.g.- Let us assume a scenario where we need to download 10 files from a remote directory every time the package executes;

       this can be implemented through a For Loop container having an FTP task inside it.


2.       Foreach Loop Container-
The Foreach Loop container is used to iterate over a collection of objects (say a directory of files or a Recordset), performing tasks that are placed inside the container for every iteration. The loop implementation is similar to Foreach looping structure in programming languages.

E.g.- Rename all the files in a folder; implemented through a Foreach Loop container having a File System task in it.

The Foreach loop container will iterate over all the files in that folder and the File System Task will rename each file the Foreach Loop iterates through.


3.       Sequence Container-
The Sequence container defines a control flow that is a subset of the package control flow. Sequence containers group the package into multiple separate control flows, each containing one or more tasks and containers. Sequence Containers are used to logically group a set of control flow tasks.

E.g.- Load data for different countries (stored in separate locations) into respective tables.



Note- All the above containers will execute in parallel, since they are not connected by precedence constraints.


*The SSIS Package is a container in itself (highest level container).

SSIS Control Flow Tasks

After having a Byte of Data Flow Task in our last article, we shall look into some of the other frequently used SSIS Control Flow Tasks.


1.       EXECUTE SQL TASK - 
This task is used to run T-SQL queries from an SSIS package.
The task also provides the ability to pass and receive parameters, as sketched below.
It supports the following connection types: EXCEL, OLE DB, ODBC, ADO, ADO.NET or SQLMOBILE.
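For instance, over an OLE DB connection the parameter marker is a question mark, which is mapped to an SSIS variable on the task's Parameter Mapping page. A minimal sketch (table and column names are illustrative):

    -- Purge old staging rows before a load; '?' receives the value of an
    -- SSIS variable at run time.
    DELETE FROM dbo.StagingEmployee
    WHERE LoadDate < ?;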


2.       FILE SYSTEM TASK  – 
Whenever there is a need to perform File system operations, this task comes into the picture.
E.g. Rename, Delete, Move, Copy files/directories.



3.       EXECUTE PACKAGE TASK - 
This task is used to call one SSIS package from another.
We can also pass the variables from the parent package to the child package.


4.       FTP TASK - 
(File Transfer Protocol) Used to send files to, or receive files from, a remote location.
               

5.       Script Task - 
Whenever there is a requirement that cannot be met by the out-of-the-box features in SSIS, we can use the Script task, as it gives us the flexibility to use the extensive set of classes in the .NET Framework.
We can use either VB.NET or C# to code custom scripts.
For example: reading file properties like the created date or the size of the file; trying to catch hold of a specific string in a random, non-formatted text file.

Happy Learning!

Data Flow Task

In the last couple of articles we explored the internal architecture of SSIS and the SSIS Designer.
In this article we shall look into one of the most important tasks in the SSIS control flow, the one that drives the actual data transfer - the Data Flow Task.

Though the Data Flow Task (DFT) may seem like just another task in the Control Flow toolbox pane, the scope/potential of this task is much higher than that of most other tasks.
As the name suggests this is where we define the flow of data from Source(s) to Destination(s).

Let us explore the Data Flow Task’s window/designer and understand the various components in it.

Consider a scenario where we have Employee data stored in a text file which needs to be sorted on Employee Name and eventually loaded into a SQL Server table.

Let’s see how we can accomplish the above objective using the SSIS Data Flow Task, and use this implementation to understand the various components in the DFT.


Tool Box – The data flow toolbox provides three different types of components:
a.       Source Components
b.       Transformation Components and
c.        Destination Components

Source Component- Data flow source components pull data from different external data sources and make it available to the other components in the data flow. There are many source components available in the toolbox, such as the Excel Source and the OLEDB Source. In the above scenario, since we have the Employee data stored in a text file, we have used the Flat File Source component.

Data Flow Transformations- Data flow transformations are used to perform tasks such as updating, summarizing, cleaning, merging, and sorting the data (on the go). In the above scenario we have used a Sort transformation.

Destination Component - A data flow destination component loads the data into a specific data store. A data flow can include multiple destinations that load data into different data stores; for example, we can have an Excel Destination or a SQL Server Destination. In the above scenario we have used an OLEDB destination that will write the data into the SQL Server table (a hypothetical table definition is sketched below).
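For concreteness, the table behind the OLEDB destination might look like this (a hypothetical definition; the real table would mirror the columns of the source file):

    CREATE TABLE dbo.Employee
    (
        EmpId   INT          NOT NULL,
        EmpName VARCHAR(100) NOT NULL  -- the column the data flow sorts on
    );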

Connection Manager (Source)- Connection managers are primarily the objects/components that store the address of a source/destination. In the above scenario the source connection manager will hold information such as the location of the source file, the format of the file, etc.

Connection Manager (Destination)- In this scenario we have used an OLEDB connection manager, which will hold the connection string to the SQL Server database where we have the table.

Please note that a source or a destination component only points to a particular connection manager; it is the connection manager object that actually stores the physical information of the file/database.
This indirectly means that a single connection manager object can be reused across multiple source/destination components.

We shall dig deep into various Source, Destination and Transformation components in the near future.