Effectively Pivoting Data – Analysis

Before you start thinking about possible optimizations, gather some performance data. Experience has taught me that all theoretical considerations are useless until you have actual measurements.

As a coarse orientation, I measured the execution time of the existing methods for 10,000 parts with 100 values each.
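
There is nothing fancy about the measurement itself; a Stopwatch around each call is enough. The following is only a minimal sketch, and the method names and signatures in the usage comment are my assumptions, not the exact repository code:

```csharp
using System;
using System.Diagnostics;

static class Measure {
    // Wraps an arbitrary call in a Stopwatch and prints the elapsed time.
    public static void Time(string name, Action action) {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        Console.WriteLine("{0}: {1} ms", name, stopwatch.ElapsedMilliseconds);
    }
}

// Usage (signatures assumed for illustration):
// Measure.Time("FetchData", () => pivot.FetchData());
// Measure.Time("GetColumns", () => pivot.GetColumns(configuration, data));
```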

[Figure: pivot_timeconsumption – measured time consumption per method]

The major part of the execution time (86.1%) is spent in the GetColumns method, so that is where I’d expect the biggest optimization potential.

The methods I am analyzing can be found here: https://github.com/dwonisch/PivotExample/blob/master/src/PivotExample/Pivot.cs

Analyze GetColumns

The first step is to analyze the method’s expected behavior. GetColumns should provide the information needed to create the pivot table columns, using the configuration and the completely fetched set of data.

What you can see is that the method iterates through all data rows (one million in this example), checking all configured columns for every row. That sums up to 100 million iterations to get a total of 101 columns returned.
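
I won’t paste the whole method here, but its problematic shape boils down to a nested loop roughly like the following sketch. The type and member names are stand-ins I made up for illustration; the real ones are in Pivot.cs:

```csharp
using System.Collections.Generic;
using System.Linq;

// Stand-in types for illustration; the real ones live in Pivot.cs.
class DataObject { public int ValueId; }
class ColumnConfiguration { public int ValueId; public string DisplayName; public int SortOrder; }
class ColumnDefinition {
    public readonly int ValueId;
    public readonly string DisplayName;
    public readonly int SortOrder;
    public ColumnDefinition(int valueId, string displayName, int sortOrder) {
        ValueId = valueId;
        DisplayName = displayName;
        SortOrder = sortOrder;
    }
}

static class ColumnScan {
    // For every data row the whole configuration is scanned:
    // 1,000,000 rows * 100 configured columns = 100,000,000 iterations,
    // yet the method only returns 101 columns in total.
    public static List<ColumnDefinition> GetColumns(List<DataObject> data, List<ColumnConfiguration> configuration) {
        var columns = new List<ColumnDefinition>();
        foreach (var row in data)
            foreach (var config in configuration)
                if (config.ValueId == row.ValueId && columns.All(c => c.ValueId != config.ValueId))
                    columns.Add(new ColumnDefinition(config.ValueId, config.DisplayName, config.SortOrder));
        return columns;
    }
}
```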

So what are the advantages of iterating over all data rows? There is only one I can think of: not showing columns for which no data is available. In my opinion the complete method could be skipped, because the pivot columns are already defined by the user, and it would be more appropriate to show an empty column than to hide it.

But let’s go on with the decision to hide the columns. There is another place where we already iterate through all the data: the FetchData method. Maybe we can get a performance boost by refactoring the GetColumns method so that it can be used inside the FetchData iteration.
This change sped the whole execution up... whoops, it actually decreased performance by 14%, which I really didn’t expect. The next problem with the method is that the configuration parameter is iterated over multiple times. Replacing those repeated scans with a dictionary should speed it up a little, and I was right: introducing dictionaries reduced the query time by 87%.
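
Roughly, the dictionary-based variant collects the columns while FetchData is already iterating over the rows. Again, this is only a sketch reusing the made-up stand-in types from above, not the exact repository code:

```csharp
using System.Collections.Generic;

// Sketch only: reuses the stand-in types (DataObject, ColumnConfiguration,
// ColumnDefinition) from the previous sketch.
class ColumnCollector {
    private readonly Dictionary<int, ColumnConfiguration> configById;
    private readonly Dictionary<int, ColumnDefinition> seenColumns = new Dictionary<int, ColumnDefinition>();

    public ColumnCollector(IEnumerable<ColumnConfiguration> configuration) {
        // Build the lookup once, so each row costs O(1) instead of O(columns).
        configById = new Dictionary<int, ColumnConfiguration>();
        foreach (var config in configuration)
            configById[config.ValueId] = config;
    }

    // Called once per row from inside the FetchData loop,
    // so no separate pass over all the data is needed.
    public void Collect(DataObject row) {
        if (seenColumns.ContainsKey(row.ValueId))
            return; // column already registered
        ColumnConfiguration config;
        if (configById.TryGetValue(row.ValueId, out config))
            seenColumns.Add(row.ValueId, new ColumnDefinition(config.ValueId, config.DisplayName, config.SortOrder));
    }

    public ICollection<ColumnDefinition> Columns {
        get { return seenColumns.Values; }
    }
}
```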

As you can see, the processing time of GetColumns moved into FetchData and was greatly reduced there. Now it’s time to have a look at the next method.

[Figure: pivot_timeconsumption2 – time consumption after the optimization]

Effectively Pivoting Data

Select data from a database and transpose it into one row per part, containing several values.

That process is better known as creating a pivot table. The principle is not that exciting once understood, but it is rarely done efficiently. One of our recent tasks was to create a report over all production data for each part, based on a configuration that defines which data to use and its sort order.

The next picture shows an example of how the data looks in the database and what it should look like in the report view.

[Figure: pivot1 – data in the database vs. report view]

I’ve highlighted the corresponding values in both tables to give you an idea of where each value from the left table ends up on the right side.
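
In case the picture does not render, here is a tiny made-up version of the same idea (the IDs and values are invented for illustration):

    PARTID | VALUEID | VALUE
    -------+---------+------
    1      | 10      | 4.2
    1      | 20      | 7.5
    2      | 10      | 4.3

becomes

    PARTID | Value 10 | Value 20
    -------+----------+---------------------
    1      | 4.2      | 7.5
    2      | 4.3      | <No data available>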

Development

Your task as a developer is to transform the data as fast as possible while using a reasonable amount of memory.

It’s hard to tell what went wrong during the first implementation, but during review it didn’t meet my expectations.
I’ll try to describe the process in a few words (a condensed code sketch follows the list):

  1. Query the data using a SqlDataReader, converting each record into an object and storing all objects in a List<DataObject>.
  2. Create the pivot data structure by looping through all DataObject entries and fetching all VALUEID values, storing the results in a List<ColumnDefinition> containing display name and sort order information.
  3. Calculate the pivot table by looping over all DataObject entries and filling a DataTable named ReturnTable.
  4. Loop through all cells in ReturnTable and replace null values with “<No data available>”.
  5. Display ReturnTable as the pivot report.
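
Condensed into one sketch, the flow looked roughly like this. All concrete names (the query, the column names, DataObject’s shape) are assumptions for illustration; the actual code is in the repository linked below:

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Sketch of the original, non-optimized flow; not the repository code.
class DataObject {
    public int PartId;
    public int ValueId;
    public string Value;
}

static class NaivePivot {
    public static DataTable Run(string connectionString) {
        // Step 1: materialize every record into a List<DataObject>.
        var rows = new List<DataObject>();
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("SELECT PARTID, VALUEID, VALUE FROM ProductionData", connection)) {
            connection.Open();
            using (var reader = command.ExecuteReader()) {
                while (reader.Read()) {
                    var row = new DataObject();
                    row.PartId = reader.GetInt32(0);
                    row.ValueId = reader.GetInt32(1);
                    row.Value = reader.GetString(2);
                    rows.Add(row);
                }
            }
        }

        // Step 2: collect the distinct VALUEIDs to create one column per value.
        var columnIds = new SortedSet<int>();
        foreach (var row in rows)
            columnIds.Add(row.ValueId);

        // Step 3: build ReturnTable, one row per part, one column per VALUEID.
        var table = new DataTable("ReturnTable");
        table.Columns.Add("PARTID", typeof(int));
        foreach (var id in columnIds)
            table.Columns.Add("Value " + id, typeof(string));

        var rowsByPart = new Dictionary<int, DataRow>();
        foreach (var row in rows) {
            DataRow tableRow;
            if (!rowsByPart.TryGetValue(row.PartId, out tableRow)) {
                tableRow = table.NewRow();
                tableRow["PARTID"] = row.PartId;
                table.Rows.Add(tableRow);
                rowsByPart.Add(row.PartId, tableRow);
            }
            tableRow["Value " + row.ValueId] = row.Value;
        }

        // Step 4: replace empty cells with a placeholder text.
        foreach (DataRow tableRow in table.Rows)
            foreach (DataColumn column in table.Columns)
                if (tableRow.IsNull(column) && column.DataType == typeof(string))
                    tableRow[column] = "<No data available>";

        return table;
    }
}
```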

That code caused a lot of trouble: neither performance nor memory consumption was acceptable. In my test setup, the tests began to fail at 10 million data entries.

[Figure: pivot_nonoptimized_results – test results of the non-optimized implementation]

The code used for my tests can be found here: https://github.com/dwonisch/PivotExample

Optimization

In the next parts of this series I’ll try to optimize that code and point out some of the things that went wrong.

A New Start After a Little Break

The last year brought a lot of time-consuming changes, so I neglected my blogging (and other online activity) a little bit.
Still, I had a lot of new project ideas, blog entries and other tasks queued up in my Trello lists, so my brain wasn’t standing still.

One cause was the birth of my beautiful little girl. There was a lot to learn, since from now on I am responsible for that little girl for many years to come. It is also a lot of fun to spend time with her and watch her grow.

The other great change in my life happened at work. I got the chance to lead our development team at AUTFORCE Automation GmbH and took over responsibility for the whole development process. There are a lot of new duties to master over time, so a lot of my spare time also went into preparation for work.

Now, after a few days with less work and a little spare time while my little girl is sleeping, I have found some time to write a few articles, and hopefully I will find time for even more in the future.