When you architect data integration solutions, your design
decisions not only determine how successfully your solution meets functional
requirements, but also how well your solution meets performance requirements.
To make the right performance design decisions, you need to understand the
performance architecture of your data integration tool and, just as
importantly, the techniques that enable you to maximize the tool’s utilization
of system resources such as memory and CPU.
SSIS provides a variety of optimization opportunities to help
you maximize resource utilization while successfully meeting the needs of your
specific data integration scenario.
Before you can take advantage of specific tuning techniques
in SSIS packages, it is important to familiarize yourself with the SSIS
architecture. This architecture consists of two major components: the run-time
engine and the data flow engine.
The run-time engine
The run-time engine is a highly parallel control flow engine
that coordinates the execution of tasks or units of work within SSIS and
manages the engine threads that carry out those tasks. For the most part, the
performance of the run-time engine is most heavily influenced by conditions
external to SSIS, such as network bandwidth, and interaction with external
systems such as database servers, FTP servers, or email servers. When SSIS runs
an Execute SQL Task, for example, it sends a call to the target database and
then waits for a response from the database server before it continues. In this
scenario, the performance of the Execute SQL Task is more dependent on the
performance of the query execution than on the SSIS run-time engine.
The data flow engine
When you use SSIS for data integration, in addition to the
run-time engine, you use the data flow engine that manages the data pipeline.
The data flow engine is invoked by a special task in SSIS called the Data Flow
task. When the Data Flow task executes, the SSIS data flow engine extracts data
from one or more data sources, performs any necessary transformations on the
extracted data, and then delivers that data to one or more destinations.
With data integration solutions, you will likely spend a
large part of your performance tuning time optimizing the data flow engine.
Like the run-time engine, the data flow engine is influenced by external
conditions; however, within SSIS, there are a variety of settings that you can
manipulate to tune the data flow engine’s performance based on the requirements
of the data integration operations.
Buffer Usage
Behind the scenes, the data flow engine uses a
buffer-oriented architecture to efficiently load and manipulate datasets in
memory. The benefit of this in-memory processing is that you do not need to
physically copy and stage data at each step of the data integration. Rather,
the data flow engine manipulates data as it is transferred from source to
destination.
As this data flows through the pipeline, SSIS attempts to
reuse data from prior buffers as much as possible when additional operations
are performed. How buffers are used and reused depend on the type of
transformations that you use in a pipeline.
Row Transformations - Row transformations either
manipulate data or create new fields using the data that is available in that
row. Examples of SSIS components that perform row transformations include
Derived Column, Data Conversion, Multicast, and Lookup. While these components
might create new columns, row transformations do not create any additional
records. Because each output row has a 1:1 relationship with an input row, row
transformations are also known as synchronous transformations. Row
transformations have the advantage of reusing existing buffers and do not
require data to be copied to a new buffer to complete the transformation.
Partially blocking transformations - Partially
blocking transformations are often used to combine datasets. They tend to have
multiple data inputs. As a result, their output may have the same, greater, or
fewer records than the total number of input records. Since the number of input
records will likely not match the number of output records, these
transformations are also called asynchronous transformations. Examples
of partially blocking transformation components available in SSIS include
Merge, Merge Join, and Union All. With partially blocking transformations, the
output of the transformation is copied into a new buffer and a new thread may
be introduced into the data flow.
Blocking transformations - Blocking transformations
must read and process all input records before creating any output records. Of
all of the transformation types, these transformations perform the most work
and can have the greatest impact on available resources. Example components in
SSIS include Aggregate and Sort. Like partially blocking transformations,
blocking transformations are also considered to be asynchronous. Similarly,
when a blocking transformation is encountered in the data flow, a new buffer is
created for its output and a new thread is introduced into the data flow.
Transformations are not the only components that can be
categorized as synchronous or asynchronous. Sources are a special type of
asynchronous component. For example, an RDBMS source component creates two
types of buffers: one for the Success output and one for the Error output. By
contrast, destinations are a special type of synchronous component. You will
see the interactions of source and destinations components when you examine the
Execution Trees of a package.
Execution Trees
Execution trees demonstrate how your package uses buffers and
threads. At run time, the data flow engine breaks down Data Flow task
operations into execution trees. These execution trees specify how buffers and
threads are allocated in the package. Each tree creates a new buffer and may
execute on a different thread. When a new buffer is created such as when a
partially blocking or blocking transformation is added to the pipeline,
additional memory is required to handle the data transformation; however, it is
important to note that each new tree may also give you an additional worker
thread.
Examine the execution trees in the example depicted in Figure
1 and Table 1 where two Employee datasets are combined together and then
aggregated to load into a common destination table.
In Execution Tree 2, SSIS reads data from the Employee
OLE DB Source into the pipeline, a Derived Column transformation adds another
column, and SSIS passes data to the Union All transformation. All of the
operations in this execution tree use the same buffer; data is not copied again
once it is read into the OLE DB Source Output.
In Execution Tree 3, SSIS creates a buffer to hold
error records from the asynchronous Employee OLE DB Source before loading them
into a destination error table.
In Execution Tree 4, SSIS reads data from the Employee
Flat File Source and passes it to the Union All. These two operations use the
same buffer.
In Execution Tree 5, a buffer is created to hold
errors from the asynchronous Employee Flat File Source before loading them into
a destination error table.
In Execution Tree 0, the Partially Blocking Union All
transformation is executed and a new buffer is created to store the combined
data and the aggregate is calculated.
In Execution Tree 1, after the Fully Blocking
Aggregate transformation is completed, the output from the Aggregate operation
is copied into a new buffer and data is loaded into the OLE DB Destination.
This example demonstrates how execution trees can help you
understand buffer usage in a common SSIS package. This example also highlights
how Partially Blocking transformations like Union All and Fully Blocking
transformations like Aggregate create new buffers and threads whereas Row
Transformations like Derived Column do not.
Execution trees are enormously valuable in understanding
buffer usage. You can display execution trees for your own packages by turning
on package logging, enabling logging for the Data Flow task, and then selecting
the Pipeline Execution Tree event. Note that you will not see the execution
trees until you execute the package. When you do execute the package, the
execution trees appear in the Log Events window in Business Intelligence (BI)
Development Studio.
Writers: Elizabeth Vitt, Intellimentum and Hitachi Corporation
Contributors: Donald Farmer, Microsoft Corporation; Ashvini Sharma, Microsoft Corporation; Stacia Misner, Hitachi Consulting
Writers: Elizabeth Vitt, Intellimentum and Hitachi Corporation
Contributors: Donald Farmer, Microsoft Corporation; Ashvini Sharma, Microsoft Corporation; Stacia Misner, Hitachi Consulting
No comments:
Post a Comment