I recently met with a development team to evaluate a poorly performing application. I started by having them explain what it was supposed to do and how it worked. The application synchronizes data between the client and vendor systems: it requests data from a service, inserts that data into the database, and then sends some data back to the service. The process was reported as taking a considerable amount of time (as in days and weeks) to synchronize 200 accounts. That was an obvious red flag, and the reason it needed to be evaluated.

So I started asking my usual questions: “How does it work?” and “How does it really work?” (yes, those are two different questions). After a high-level review (developers usually don’t want to expose their design decisions to criticism, so you tend to get broad answers), I broke the process down into smaller chunks and asked specific questions about each step. After about 20 minutes I found that a large set of data was coming down from the service and each record was being inserted into the database 1-by-1, with a blocking call. After that, the data was being sent back to the service, again 1-by-1.

Already at this point I could see a major issue, but I needed more details to confirm. So I asked: are the records in the dataset dependent upon each other, or do they have any relationship? The answer (as I already knew) was no; all of the records were independent. Then I asked: what data is being sent back to the service? The answer was a key for the record. The entire process is

  1. Get dataset from vendor’s service
  2. Insert each record into database, generating a key
  3. Send the key for the record back to the vendor’s service
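In rough code, the original flow looks something like the sketch below. This is a minimal illustration, not the team’s actual code: sqlite stands in for SQL Server, and the service call is stubbed with a list.

```python
import sqlite3

# Minimal sketch of the original 1-by-1 flow. The vendor service call is
# stubbed out (sent_keys.append); sqlite stands in for SQL Server.
def sync_account(records, conn, sent_keys):
    cur = conn.cursor()
    for record in records:                      # one record at a time
        cur.execute("INSERT INTO accounts (payload) VALUES (?)", (record,))
        conn.commit()                           # blocking round trip per row
        key = cur.lastrowid                     # key generated by the insert
        sent_keys.append(key)                   # stand-in for the service call

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, payload TEXT)")
sent = []
sync_account(["rec-%d" % i for i in range(5)], conn, sent)
```

Every record pays the full round-trip cost twice: once to the database and once to the service.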

It’s a fairly simple process, so why is it taking so long to execute? The answer starts with the numbers. Each account that needs to be synchronized has about 10,000-30,000 records. That really isn’t a lot, especially since there aren’t any computational aspects, just simple CRUD. With the details we have now, we know that each record has to be inserted into the database, the key has to be retrieved, and then the key has to be sent back to the service, and that each record is processed 1-by-1 using blocking calls.

If each record takes an average of ~500ms and the account has 10,000 records, that’s 5,000 seconds, or roughly 1.4 hours per account. With 200 accounts, that’s over 11 days. Since the record is inserted using a stored procedure that inserts the row and then returns the key, which is then sent back to the service, we can quickly shave some time off by making the call to the stored procedure asynchronous. The call would no longer block, and when the callback was invoked it could send the key back to the service without holding up the other records. No more blocking.
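That first improvement could be sketched like this, assuming hypothetical `insert_record` and `send_key` functions standing in for the stored-procedure call and the service call:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Sketch: each insert runs on a worker thread, and a completion callback
# sends the key back as soon as that record's insert finishes.
def insert_record(record):
    return record["id"] * 100          # pretend the database returned a key

sent_keys = []
lock = threading.Lock()

def send_key(future):
    with lock:                         # callbacks fire on worker threads
        sent_keys.append(future.result())  # stand-in for the service call

records = [{"id": i} for i in range(1, 6)]
with ThreadPoolExecutor(max_workers=4) as pool:
    for record in records:
        future = pool.submit(insert_record, record)
        future.add_done_callback(send_key)   # no blocking wait per record
```

Note that completion order is no longer guaranteed, which is fine here because the records are independent.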

Alright, that’s a step in the right direction. Or is it? It isn’t. Think about it this way: there are 10,000 individual transactions that need to occur on the database, which means connections, locking, constraint checking and transaction management. Some of this has to occur regardless, but not all of it. If the handshake portion of each call takes ~150ms, that’s 25 minutes of nothing but handshakes. Wasted time. Even though making the calls asynchronous will reduce wall-clock time, the process is still going to take far too long and will consume resources unnecessarily. Here is a chance to start shaving time and reducing resource consumption. But how?
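The handshake arithmetic works out like this:

```python
# Back-of-the-envelope cost of per-record handshakes, per the figures above:
# ~150 ms of connection overhead per call, 10,000 records per account.
handshake_ms = 150
records = 10_000
overhead_min = handshake_ms * records / 1000 / 60
print(f"{overhead_min:.0f} minutes of pure handshake overhead per account")
```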

SQL Server supports bulk inserts: a way of inserting many records into a table at one time. What does this mean? Time savings. A bulk insert works over a single connection, usually takes a single table lock, and can skip constraint checking to achieve its performance; 10,000 records can be inserted into a table in a fraction of a second. Let’s review

  1. Get dataset from vendor’s service
  2. Bulk insert records into database
  3. Retrieve newly inserted keys
  4. Send each record key back to the service
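The shape of the bulk approach is sketched below. In SQL Server the real tools would be `BULK INSERT` or `SqlBulkCopy`; here sqlite’s `executemany` stands in purely to illustrate the pattern of one connection, one statement, one transaction, then one query to retrieve the generated keys.

```python
import sqlite3

# Sketch of the bulk approach: insert the whole batch in a single
# transaction, then fetch all the newly generated keys in one query.
def bulk_sync(records, conn):
    cur = conn.cursor()
    cur.executemany(
        "INSERT INTO accounts (payload) VALUES (?)",
        [(r,) for r in records],
    )
    conn.commit()                              # single transaction
    cur.execute("SELECT id FROM accounts ORDER BY id")
    return [row[0] for row in cur.fetchall()]  # keys for step 4

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, payload TEXT)")
keys = bulk_sync(["rec-%d" % i for i in range(10)], conn)
```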

At this point we could make the calls to the service asynchronous to shave some time off. However, I asked a few more questions: “Does the service have any throttling or connection limits?” The answer was yes, there are limits. So I then asked, “Does the service support batching or bulk submissions?” The answer was yes. So there you have it: we can apply the same logic to the HTTP handshake as we did with the database handshake. It takes time to resolve DNS, make the connection and start the transfer. Each of these operations is amazingly fast on its own, but multiply that amazingly fast time by 10,000 and it becomes a different picture.
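Batching the keys back to the vendor might look like the sketch below. The batch size and the `submit_batch` call are assumptions; the real service’s bulk API will dictate both.

```python
# Sketch of batching keys back to the vendor instead of one call per key.
def chunked(items, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

submitted = []
def submit_batch(batch):           # stand-in for one HTTP call per batch
    submitted.append(list(batch))

keys = list(range(1, 10_001))
for batch in chunked(keys, 500):
    submit_batch(batch)            # 20 calls instead of 10,000
```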

Reducing or removing the cost of the handshake by using bulk operations can significantly reduce overall execution time. Without using multi-threading or other concurrent execution techniques, we’ve reduced the execution time from roughly 1.4 hours to under 3 minutes per account (depending on the vendor’s connectivity limitations). That’s still about 10 hours for the overall process, but it beats waiting more than 11 days. We can reduce this further by running more than one account at a time, though that will depend on the vendor’s service.
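Running accounts concurrently while respecting the vendor’s limits could be sketched as below. `VENDOR_LIMIT` and `sync_account` are hypothetical; the cap would come from the vendor’s documented throttling rules.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of syncing several accounts at once under a connection cap.
VENDOR_LIMIT = 4                        # hypothetical concurrent-connection cap

def sync_account(account_id):
    return (account_id, "done")         # stand-in for the real bulk sync

accounts = list(range(200))
with ThreadPoolExecutor(max_workers=VENDOR_LIMIT) as pool:
    results = list(pool.map(sync_account, accounts))  # at most 4 in flight
```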

I’d like to hear about processes you’ve optimized and how you did it.