Tag Archives: C#

Effectively Pivoting data–analysis (part 2)

pivot_timeconsumption2

This diagram shows the Pivot Table’s time consumption after the first optimizations. So let’s have a closer look at the CalculatePivot method.

Analyze CalculatePivot

In that method the Pivot Table is generated. That means that values of one column are combined into multiple columns based on an identifier, forming a new DataTable.

The method is iterating over all loaded data, to create rows based on values in PARTID-column and then set the target columns value.
As known from first step that can be joined with that loop inside FetchData method.
The return table structure is currently built after receiving all data, but if the data should be filled at receipt, that must also be done there.
After some more small modifications the time consumption is still rather high and there is only one part left I haven’t touched yet.

At the Beginning of CalculatePivot method it is checked whether an row already exists or it has to be created. The DataTable.Select() method seems to be rather inefficient at finding rows by it’s primary key, especially because it is returning a collection of DataRows.
It is recommended to query primary keys with on of the DataTable.Rows.Find() methods, and well, I have to say it’s worth it.
But is that really the fastest way to get a DataRow? No!

You can get a approximately 4-times boost by tracking your primary key information inside a separate dictionary and query against that.
But don’t forget to remove the primary key definition from data table to get the full benefit.

Following test results are based on building a DataTable with 10.000 rows, querying each 100 times.

DataTable.Select() 40767 ms
DataTable.Rows.Find() 21980 ms
Dictionary with Primary Key 20239 ms
Dictionary without Primary Key 11637 ms

This data shows, that you shouldn’t always rely on the methods provided by .NET Framework. There is often a compromise to be agreed when optimizing method for multiple possible usages. A self implemented method, specialized on that what you currently need, is often much faster.

Interestingly the same applies on DataTable.Columns.Contains(string key) method, the own caching solution is a bit faster than the provided Framework method.

pivot_timeconsumption3

A new comparison shows that only two of the initial methods are left. By moving the major parts into FetchData method, 99% of time used, is spend there. But compared to the first implementation of Pivot Table process, it completes in only 2.4% of time. That equates to a 40-times improvement.

Effectively Pivoting Data–Analysis

Before you start thinking about possible optimizations, fetch some performance data. My experience told me that all kind of considerations would be totally useless before you have gathered some performance analysis.

I used the existing methods as coarse orientation points, by measuring their execution time for 10000 parts each with 100 values.

pivot_timeconsumption

The major part of execution time (86.1%) is used by GetColumns method, so I’d expect the best optimization potential there.

The methods I am analyzing can be found here: https://github.com/dwonisch/PivotExample/blob/master/src/PivotExample/Pivot.cs

Analyze GetColumns

First step would be to analyze the method’s expected behavior. This method should provide the information to create the Pivot Table Columns and is using the configuration and the completely fetched set of data.

What you can see is, that the method is iterating through all data rows, in this example there are 1 million, checking all configured columns. That sums up to 100 million iterations to get a total of 101 columns returned.

So what are the advantages of iterating over all data rows. There is only one I could think of: Not showing columns were no data would be available for. In my opinion the complete method can be skipped, because the Pivot Columns are already defined by the user and it would be more appropriate to show an empty column than to hide it.

But lets go on with the decision to hide the columns, there is another place were we are iterating through all data already, the FetchData method. Maybe we could get a performance boost by refactoring the GetColumns method, so that it can be used inside the FetchData iteration.
This change fastened the complete execution up, whoops, it actually decreased performance by 14% (what I didn’t really expect). The next part of that method is, that the configuration parameter is iterated over multiple times. Replacing it with a dictionary should fasten it up a little bit and I was right, introducing dictionaries reduced query time by 87%.

As you can see the processing time from GetColumns moved into FetchData and could be greatly reduced. Now it’s time to have a look at the next method.

pivot_timeconsumption2

Effectively pivoting data

Select data from database and transpose it into one line per part containing several data.

That process is better known as Pivot Table. The principle is not that exciting, once understood, but it is hardly done efficiently. One of our last tasks was to create a report over all production data for each part regarding a configuration that defines data to use and its sort order.

The next picture will show an example of how the data looks in database and what it should look like in report view.

pivot1

I’ve highlighted the corresponding values in both tables to give you an idea of where each value from the left is used on the right side.

Development

Your task as a developer would be to transform the data as fast as possible using a decent amount of memory.

It’s hard to tell what went wrong during the first implementation, but during review it wouldn’t meet my expectations.
I’ll try to describe the process in a few words:

  1. Query data using SqlDataReader converting each record into an object and storing all objects in a List<DataObject>.
  2. Creating Pivot data structure by looping through all DataObject and fetching all VALUEID  values.
    Storing all results in a List<ColumnDefinition> containing display name and sort order information.
  3. Calculating Pivot Table by looping all DataObject entries creating a DataTable ReturnTable.
  4. Looping through all cells in ReturnTable and replace null values with “<No data available>”.
  5. Display ReturnTable as Pivot Report.

That code caused a lot of troubles, neither performance nor memory consumption were acceptable. In my test setup the test began to fail for 10 million data entries.

pivot_nonoptimized_results

The code used for my tests can be found here: https://github.com/dwonisch/PivotExample

Optimization

I’ll try to optimize that code and show up some things that went wrong in the next parts of this series.