[DISCLAIMER] This blog post mainly talks about the benefits of our commercial offering, but the practice should be generally applicable to other approaches or components that you might currently be using in SSIS. It is not my intention to turn my blog into a commercial space, but I do believe this blog post will help even if you are using something different, assuming that SSIS is your tool for data integration.
[/DISCLAIMER]
If you have ever been engaged in any Microsoft Dynamics CRM data integration project, I am relatively sure that you have invested time and resources to tune your data integration component to its maximum possible performance so that it takes the least time to finish the data integration tasks.
This blog post shows you how to load one million records into a Microsoft Dynamics CRM 2011 on-premise installation in about two hours using our product, SSIS Integration Toolkit for Microsoft Dynamics CRM, by taking advantage of the Balanced Data Distributor (BDD) component that Microsoft released to the public community for SQL Server Integration Services (SSIS).
In case you don't know the BDD component, here is a little background. BDD is a data flow transformation component that takes a single input and distributes the incoming rows evenly across one or more outputs via multithreading. The purpose of the BDD component is to maximize the throughput of ETL data flow tasks. BDD is useful when a downstream pipeline component (say, the destination component) is the bottleneck of the entire data flow task.
When working with Microsoft Dynamics CRM data integration, we have a perfect reason to use BDD, mainly because writing data into CRM is slow due to the nature of the web service interface. In other words, in most cases you will find that the CRM destination component, which writes data into CRM, is the bottleneck of your data flow task. Using BDD, we can split the incoming rows from upstream pipeline components across multiple CRM destination components so that they write data into CRM concurrently, taking advantage of the multithreading capability of the SSIS engine.
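To make the idea concrete, here is a minimal Python sketch of the same pattern: one source, several concurrent writers, each waiting on a slow call. It is only an analogy and not how SSIS or BDD is implemented. The `write_batch` function, the batch size, and the simulated latency are all assumptions made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SIMULATED_LATENCY_S = 0.002  # assumed round trip per batch; not a measured value


def write_batch(batch):
    """Hypothetical stand-in for one destination writing a batch to CRM."""
    time.sleep(SIMULATED_LATENCY_S)  # the thread just waits on the "web service"
    return len(batch)


def make_batches(rows, batch_size):
    """Split the source rows into fixed-size batches."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def load(rows, outputs, batch_size=10):
    """Hand batches to `outputs` concurrent writers.

    Returns a tuple of (rows written, elapsed seconds).
    """
    batches = make_batches(rows, batch_size)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=outputs) as pool:
        # Each batch goes to whichever writer thread is free next.
        written = sum(pool.map(write_batch, batches))
    return written, time.perf_counter() - start


rows = [{"firstname": f"First{i}", "lastname": f"Last{i}"} for i in range(1000)]
single_written, single_secs = load(rows, outputs=1)
multi_written, multi_secs = load(rows, outputs=10)
print(f"1 output:   {single_written} rows in {single_secs:.2f}s")
print(f"10 outputs: {multi_written} rows in {multi_secs:.2f}s")
```

Because each writer spends most of its time waiting rather than computing, adding writers shortens the total time in this toy model. A real CRM server has limited capacity, so the gain there is smaller.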
To demonstrate the benefits of using the BDD component, I first tried a single CRM destination component in my data flow task without BDD, so the data flow writes data into CRM using a single thread. It took 5 hours, 48 minutes to load 1,000,000 records into the CRM contact entity. Here is what the data flow task looks like.
The following screenshot shows how the data flow runs from the dtexec command line.
Next, I used BDD to split the input into 10 outputs so that we write to the CRM contact entity using 10 concurrent threads. The data flow finished loading 1,000,000 records in 2 hours, 3 minutes. Here is what the data flow task looks like.
The following screenshot shows how the data flow runs from the dtexec command line.
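That is an improvement of about 2.84 times. It is not surprising that it is not exactly 10 times faster, because the ten destinations still compete for the same CRM server, database, and machine resources.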
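For readers who want to check the arithmetic, the short Python sketch below recomputes the speedup from the two durations quoted above. Note that it uses the minute-rounded figures, so it gives roughly 2.83 rather than the 2.84 quoted, which was presumably computed from the unrounded timings. The "efficiency" figure (speedup divided by the number of outputs) is my own addition for illustration.

```python
from datetime import timedelta

# Durations as quoted in this post, rounded to the minute.
single_destination = timedelta(hours=5, minutes=48)
with_bdd = timedelta(hours=2, minutes=3)
outputs = 10

# Dividing one timedelta by another yields a plain float ratio.
speedup = single_destination / with_bdd
efficiency = speedup / outputs  # share of the ideal 10x that was realized

print(f"speedup:    {speedup:.2f}x")
print(f"efficiency: {efficiency:.0%}")
```

In other words, each of the ten threads delivered a bit under 30% of what a perfectly scaling thread would have.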
A few facts
- This is not a scientific benchmark.
- My testing was conducted on a 4-year-old desktop computer with everything installed on a single box. The following is the spec of the computer.
- Processor: Intel Core 2 Quad Q9550 @2.83GHz
- Memory: 8GB PC2-6400 DDR2-SDRAM
- Hard Disk: Seagate 7200RPM SATA 1.5Gb/s
- Operating System: Windows Server 2008 R2
- Database Server: SQL Server 2008 R2
- Microsoft Dynamics CRM Server 2011 with Rollup 6
- SSIS Adapter: KingswaySoft SSIS Integration Toolkit for Microsoft Dynamics CRM
- The testing was done in an on-premise environment. Your data load performance will be different if you are using CRM Online or a partner-hosted environment.
- I intentionally used the 64-bit dtexec.exe in the hope of taking advantage of the SSIS 64-bit runtime. Contrary to what I expected, running the package with the 32-bit dtexec.exe was not slower; it was actually about 10% faster than the 64-bit runtime. The reason is probably related to the cost of memory addressing in the 64-bit runtime.
- My input data is very simple; it has only two fields, firstname and lastname. With more fields, you should expect the data load performance to degrade to some extent.
- I was hoping to load 1 million records into CRM within one hour using BDD, but it still took two hours. With a better IO system and more computing power, I am fairly confident that the goal (one million records within one hour) is achievable.
- The single-destination data flow task writes about 47.84 records per second to the CRM server (54.27 records/s when the 32-bit runtime is used). You may use this as a baseline rate if you want to compare your results with mine.
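If you want to compare your own numbers with these baseline rates, the following Python sketch converts a records-per-second rate into a total load time for one million records, and works out the rate that the one-hour goal would require. The helper names are hypothetical; only the two rates come from this post.

```python
TOTAL_RECORDS = 1_000_000


def load_seconds(rate_per_second, records=TOTAL_RECORDS):
    """Time in seconds to load `records` at a steady rate."""
    return records / rate_per_second


def format_hms(seconds):
    """Render a duration in seconds as hours, minutes and seconds."""
    total = int(round(seconds))
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


baseline_64bit = load_seconds(47.84)    # single destination, 64-bit runtime
baseline_32bit = load_seconds(54.27)    # single destination, 32-bit runtime
required_rate = TOTAL_RECORDS / 3600    # rate needed to finish within one hour

print("64-bit baseline:", format_hms(baseline_64bit))  # 5h 48m 23s
print("32-bit baseline:", format_hms(baseline_32bit))  # 5h 07m 06s
print(f"one-hour goal:   {required_rate:.1f} records/s")  # 277.8 records/s
```

The one-hour goal therefore needs a sustained rate of nearly six times the single-destination 64-bit baseline.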
Summary
- BDD improves data load performance by taking advantage of the multithreading capability of the SSIS engine.
- You should carefully choose the right number of outputs for the BDD component. More is not always better. Depending on your servers' capacity (processor, memory, IO system) and the network latency between your client system and the CRM server, the best number could be 3, 5, 10, or something else, which you can find by running different tests.
- There are many ways to improve data load performance. BDD, the main topic of this blog post, is just one of the easy ones.
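The "run different tests" advice can be organized as a simple sweep: time the same load with several output counts and keep the fastest. The Python sketch below shows the shape of such a sweep. The timing function here is a made-up model with invented parameters, not measured data. In a real test you would replace it with a function that runs your package with the given number of outputs and returns the elapsed seconds.

```python
def simulated_run_seconds(outputs, records=1_000_000, per_writer_rate=50.0,
                          server_capacity=5, contention_penalty=0.05):
    """Hypothetical model only: throughput scales with the number of outputs
    up to an assumed server capacity, then extra outputs add contention."""
    effective_writers = min(outputs, server_capacity)
    slowdown = 1.0 + contention_penalty * max(0, outputs - server_capacity)
    return records / (per_writer_rate * effective_writers) * slowdown


def pick_output_count(candidates, run_seconds):
    """Time each candidate output count and return (best count, all timings)."""
    timings = {n: run_seconds(n) for n in candidates}
    best = min(timings, key=timings.get)
    return best, timings


best, timings = pick_output_count([1, 3, 5, 10, 20], simulated_run_seconds)
for n, secs in timings.items():
    print(f"{n:>2} outputs: {secs / 3600:.2f} h")
print("best output count in this model:", best)
```

With these invented parameters the sweep settles on 5 outputs, and 10 or 20 outputs are slower. Your own environment will have a different sweet spot, which is exactly why the sweep is worth running.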
If you are interested in any of the data flow tasks or sample data, please feel free to let me know, so that I can send you the SSIS package.
Thanks for reading.