Satish MSBI World: Parallel Processing in SSIS with Example

Friday 17 April 2015

Parallel Processing in SSIS with Example

Parallel Processing in SSIS

In this article, I will explain how many types of parallel processing available in ssis and explain briefly by using one example.

SQL Server Integration Services (SSIS) allows parallel execution. The idea of parallel design is to break a large amount of data into smaller, independent pieces, worked on it. In SSIS services breaking the large amount of data into smaller pieces by using tasks to partitioning the data to be processed. Finally, you need to take the small pieces of independent work and complete the work done faster.

It has specified into two different ways.
• Control Flow Parallelism
• Data Flow Parallelism

Control Flow Parallelism

The first one is MaxConcurrentExecutables, a property of the package. It defines how many tasks (executables) can run simultaneously. The default value for this property is -1, equalling to the total number of processors plus two. You can increase this number in order to allow for a greater degree of parallelism.

This could be good strategy to use if you have multiple packages that pull from multiple sources, have relatively basic transformations, and /or push the data to multiple destinations.

Example: To transfer 8 lakh’s records from one database to another database by using more than one Data Flow task.

Suppose we have a package 8 Data Flow Tasks.

Each Data Flow task (DFT) to transfer 1 lakh records from Source to destination as follows

OLE DB source -->OLE DB Destination

DFT 1 contains 1 to 1 lakh records.
DFT 2 contains 1 lakh to 2 lakh records.
DFT 3 contains 2 lakh to 3 lakh records.
DFT 4 contains 3 lakh to 4 lakh records.
DFT 5 contains 4 lakh to 5 lakh records.
DFT 6 contains 5 lakh to 6 lakh records.
DFT 7 contains 6 lakh to 7 lakh records.
DFT 8 contains 7 lakh to 8 lakh records.

Once u run the package with the default property of MaxConcurrentExecutables: -1

It takes to complete 7.847 seconds.

Now u put the value MaxConcurrentExecutables: 10 and run the package

It gives the result within 3.1657 seconds.

Data Flow Parallelism

The same principals in Control Flow Parallelism apply to the Data Flow. The equivalent property to “MaxConcurrentExecutables” in the Data Flow is called “EngineThreads”. This property is set to 10 by default, but this does not mean that a data flow will use 10 processor threads, but rather that the data flow will use up to 10 threads in order to efficiently process the items in the data flow. Depending on the context of the performance of your package, you may find that partitioning your sources and destinations will increase performance of inserting records, but make sure that there are no dependencies in your data flow.

In some cases it cause a deadlock you will quickly lose whatever performance gain you have achieved.

Example: To transfer 8 lakh’s records from one database to another database by using one Data Flow task.

Suppose we have a package that contains only one Data Flow Task and that Task contains 8 Flows in the form of OLE DB source --> OLE DB Destination.