Effectively Pivoting Data–Analysis (Part 2)

[Image: pivot_timeconsumption2]

This diagram shows the Pivot Table’s time consumption after the first optimizations. So let’s have a closer look at the CalculatePivot method.

Analyze CalculatePivot

In this method the Pivot Table is generated: values of one column are spread over multiple columns based on an identifier, forming a new DataTable.

The method iterates over all loaded data, creating rows based on the values in the PARTID column and then setting the target column’s value.
As we know from the first step, this can be merged with the loop inside the FetchData method.
The return table’s structure is currently built after all data has been received, but if the rows are to be filled as the data arrives, the structure has to be created there as well.
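
To sketch the idea (command, returnTable and the column layout are placeholders of mine, the actual code is in the repository linked in this series): rows are created and filled directly while reading, instead of materializing intermediate lists first.

using System.Data;
using System.Data.SqlClient;

using (var reader = command.ExecuteReader()) {
    while (reader.Read()) {
        var partId = reader.GetInt32(0);
        var valueId = reader.GetInt32(1);
        var value = reader.GetString(2);

        // Find or create the row for this part.
        var row = returnTable.Rows.Find(partId); // replaced by a dictionary lookup below
        if (row == null) {
            row = returnTable.NewRow();
            row["PARTID"] = partId;
            returnTable.Rows.Add(row);
        }

        // Set the target column's value directly at receipt.
        row["VALUE" + valueId] = value;
    }
}
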
After some more small modifications the time consumption is still rather high, and there is only one part left that I haven’t touched yet.

At the beginning of the CalculatePivot method it is checked whether a row already exists or has to be created. The DataTable.Select() method turns out to be rather inefficient at finding rows by their primary key, especially because it returns a whole collection of DataRows.
It is recommended to query primary keys with one of the DataTable.Rows.Find() methods, and, well, I have to say it’s worth it.
But is that really the fastest way to get a DataRow? No!

You can get an approximately four-fold boost by tracking your primary key information in a separate dictionary and querying against that.
But don’t forget to remove the primary key definition from the DataTable to get the full benefit.

The following test results are based on building a DataTable with 10,000 rows and querying each row 100 times.

DataTable.Select()               40,767 ms
DataTable.Rows.Find()            21,980 ms
Dictionary with Primary Key      20,239 ms
Dictionary without Primary Key   11,637 ms
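
To illustrate, here is a minimal sketch of the dictionary approach, assuming an integer PARTID key (the column layout and sizes are illustrative, not taken from the measured code):

using System.Collections.Generic;
using System.Data;

var table = new DataTable();
table.Columns.Add("PARTID", typeof(int));
// Deliberately no table.PrimaryKey definition: maintaining that
// constraint costs extra time on every insert.

var rowsByPartId = new Dictionary<int, DataRow>();

for (var partId = 0; partId < 10000; partId++) {
    var row = table.NewRow();
    row["PARTID"] = partId;
    table.Rows.Add(row);
    rowsByPartId.Add(partId, row); // track the key ourselves
}

// Query via the dictionary instead of DataTable.Select()
// or DataTable.Rows.Find().
DataRow existing;
if (!rowsByPartId.TryGetValue(42, out existing)) {
    // row does not exist yet and has to be created
}

Without the primary key definition the DataTable presumably no longer has to maintain its internal index on every insert, which explains the difference between the last two rows of the table above.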

These timing results show that you shouldn’t always rely on the methods provided by the .NET Framework. A framework method often has to compromise to support many possible usages; a self-implemented method, specialized for exactly what you currently need, is often much faster.

Interestingly, the same applies to the DataTable.Columns.Contains(string key) method: a custom caching solution is a bit faster than the framework-provided method.
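
A sketch of such a caching solution, continuing the example above; note that DataTable column names are case-insensitive by default, so the cache should use a case-insensitive comparer to mirror the framework’s behavior:

using System;
using System.Collections.Generic;
using System.Data;

var knownColumns = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
foreach (DataColumn column in table.Columns)
    knownColumns.Add(column.ColumnName);

// Instead of table.Columns.Contains("VALUE42"):
var exists = knownColumns.Contains("VALUE42");
// New columns must be added to both the table and the cache.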

[Image: pivot_timeconsumption3]

A new comparison shows that only two of the initial methods are left. By moving the major parts into the FetchData method, 99% of the time is now spent there. But compared to the first implementation of the Pivot Table process, it completes in only 2.4% of the time, which equates to roughly a 40-fold improvement.

Effectively Pivoting Data–Analysis

Before you start thinking about possible optimizations, gather some performance data. Experience tells me that all kinds of considerations are useless until you have an actual performance analysis in hand.

I used the existing methods as coarse orientation points by measuring their execution time for 10,000 parts, each with 100 values.

[Image: pivot_timeconsumption]

The major part of the execution time (86.1%) is spent in the GetColumns method, so I’d expect the best optimization potential there.

The methods I am analyzing can be found here: https://github.com/dwonisch/PivotExample/blob/master/src/PivotExample/Pivot.cs

Analyze GetColumns

The first step is to analyze the method’s expected behavior. This method provides the information needed to create the Pivot Table columns, and it uses the configuration and the completely fetched set of data.

What you can see is that the method iterates through all data rows, in this example 1 million of them, checking all configured columns for each row. That sums up to 100 million iterations to return a total of 101 columns.

So what is the advantage of iterating over all data rows? There is only one I could think of: not showing columns for which no data is available. In my opinion the complete method could be skipped, because the Pivot Columns are already defined by the user, and it would be more appropriate to show an empty column than to hide it.

But let’s go on with the decision to hide the columns. There is another place where we already iterate through all the data: the FetchData method. Maybe we can get a performance boost by refactoring the GetColumns method so that it can run inside the FetchData iteration.
This change should have sped the whole execution up. Whoops, it actually decreased performance by 14%, which I didn’t expect. The next issue in that method is that the configuration parameter is iterated over multiple times. Replacing it with a dictionary should speed things up a little, and I was right: introducing dictionaries reduced the query time by 87%.
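
A rough sketch of that change (ColumnConfiguration and its members are illustrative names of mine, the real types are in the linked repository):

using System.Collections.Generic;
using System.Linq;

var configuration = new List<ColumnConfiguration> {
    new ColumnConfiguration { ValueId = 1, DisplayName = "Temperature", SortOrder = 1 },
    new ColumnConfiguration { ValueId = 2, DisplayName = "Pressure", SortOrder = 2 },
};

// Build the lookup once, before iterating the data ...
var configByValueId = configuration.ToDictionary(c => c.ValueId);

// ... and resolve each record's configuration in O(1) inside the loop,
// instead of scanning the configuration list once per data row.
ColumnConfiguration column;
if (configByValueId.TryGetValue(2, out column)) {
    // use column.DisplayName and column.SortOrder
}

// Illustrative stand-in for the real configuration type.
class ColumnConfiguration {
    public int ValueId;
    public string DisplayName;
    public int SortOrder;
}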

As you can see, the processing time of GetColumns moved into FetchData and was greatly reduced. Now it’s time to have a look at the next method.

[Image: pivot_timeconsumption2]

Effectively pivoting data

Select data from a database and transpose it into one line per part, containing several values.

That process is better known as building a Pivot Table. The principle is not that exciting once understood, but it is rarely done efficiently. One of our recent tasks was to create a report over all production data for each part, based on a configuration that defines which data to use and its sort order.

The next picture shows an example of how the data looks in the database and what it should look like in the report view.

[Image: pivot1]

I’ve highlighted the corresponding values in both tables to give you an idea of where each value from the left is used on the right side.

Development

Your task as a developer is to transform the data as fast as possible while using a reasonable amount of memory.

It’s hard to tell what went wrong during the first implementation, but during review it didn’t meet my expectations.
I’ll try to describe the process in a few words (a condensed sketch follows the list):

  1. Query the data using a SqlDataReader, converting each record into an object and storing all objects in a List<DataObject>.
  2. Create the Pivot data structure by looping through all DataObjects and fetching all VALUEID values,
    storing the results in a List<ColumnDefinition> that contains display name and sort order information.
  3. Calculate the Pivot Table by looping over all DataObject entries, creating a DataTable ReturnTable.
  4. Loop through all cells in ReturnTable and replace null values with “<No data available>”.
  5. Display ReturnTable as the Pivot Report.
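
For illustration, here is a heavily condensed sketch of steps 2 to 4 (names simplified and chosen by me; the actual test code is in the repository linked below):

using System.Collections.Generic;
using System.Data;

class DataObject {
    public int PartId;
    public int ValueId;
    public string Value;
}

class PivotSketch {
    public static DataTable BuildPivot(List<DataObject> data) {
        var table = new DataTable();
        table.Columns.Add("PARTID", typeof(int));

        // Step 2: one column per distinct VALUEID found in the data.
        foreach (var item in data)
            if (!table.Columns.Contains("VALUE" + item.ValueId))
                table.Columns.Add("VALUE" + item.ValueId, typeof(string));

        // Step 3: one row per PARTID, filling the matching column.
        var rows = new Dictionary<int, DataRow>();
        foreach (var item in data) {
            DataRow row;
            if (!rows.TryGetValue(item.PartId, out row)) {
                row = table.NewRow();
                row["PARTID"] = item.PartId;
                table.Rows.Add(row);
                rows.Add(item.PartId, row);
            }
            row["VALUE" + item.ValueId] = item.Value;
        }

        // Step 4: replace missing values in every cell.
        foreach (DataRow row in table.Rows)
            foreach (DataColumn column in table.Columns)
                if (row.IsNull(column))
                    row[column] = "<No data available>";

        return table;
    }
}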

That code caused a lot of trouble; neither performance nor memory consumption was acceptable. In my test setup, the test began to fail at 10 million data entries.

[Image: pivot_nonoptimized_results]

The code used for my tests can be found here: https://github.com/dwonisch/PivotExample

Optimization

I’ll try to optimize that code and show some of the things that went wrong in the next parts of this series.

A New Start After a Little Break

The last year brought a lot of time-consuming changes, so I neglected my blogging (and other online activity) a little bit.
Anyway, I had a lot of new project ideas, blog entries and other tasks queued up in my Trello lists, so my brain wasn’t standing still.

One cause was the birth of my beautiful little girl. It was something new to learn: from now on I am responsible for that little girl for the years to come. It is also a lot of fun to spend time with her and to watch her grow.

The other great change in my life was at work. I got the chance to lead our development team at AUTFORCE Automation GmbH and took over responsibility for the whole development process. There are a lot of new activities to fulfill and master over time, so a lot of my spare time also went into preparation for work.

Now, after a few days with less work and a little spare time while my little girl is sleeping, I have found some time to write a few articles, and hopefully I will find time for even more in the future.

Your users aren’t SuperUsers

We recently shipped a piece of software that checks the software version of a specific hardware part. For each type of hardware, the user can set a parameter that represents a target software version string.
When the user creates a new parameter set, let’s say for ‘transfer case type B’, the parameter’s default value is an empty string.
The decision was made to check the software version using a regular expression, although the requirement didn’t specify a dynamic software check.
Almost at the end of the commissioning phase I was on site to support a team member who was hunting an issue with serial number validation. I had a look into the log files and found a strange line.

Checking software version: Comparing target value ‘’ to actual value ‘7505’. Software version is valid.

I brought my colleague in to make him aware of a possible bug in this check, but got a devastating answer:

The check is done via regular expressions. When checking against an empty expression, any string will match; if this wasn’t intended, it is the operator’s fault.

That wasn’t the sort of answer I expected to hear, because the user was being held responsible for something we possibly messed up. There aren’t a lot of developers who are fluent in regular expression syntax (without looking it up on the internet), yet we expect our users to know how to write regular expressions that match their software versions. This feature definitely won’t increase the usability of our software as it was intended to.

using System.Text.RegularExpressions;

var targetVersion = "";
var actualVersion = "7505";

// An empty pattern matches every input, so this always returns true.
Regex.IsMatch(actualVersion, targetVersion);

The code above will always return true, no matter what actualVersion is. Since the empty string is the parameter’s default value, the check will always pass and faulty hardware might be delivered. I am a supporter of software that works by default when nothing is configured, but in this case it would be better to fail and point out an incorrect parameter.

var targetVersion = "1|2|3";
var actualVersion = "1";

// Unanchored: the pattern matches anywhere inside the input.
Regex.IsMatch(actualVersion, targetVersion);

We were using another regular expression in a different context and encountered weird behavior. The code above was working for about two days, but then it started to fail. The hardware type changed, and with it the software version: the new value was 15. Everyone expected the check not to match, but it did. Since there are no start and end anchors in the regular expression, it matches whenever the pattern matches anywhere inside the actual value. What the developer really wanted to check was ‘^(1|2|3)$’.
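
The difference in behavior, as a quick check:

// Unanchored: true, because "1" occurs inside "15".
Regex.IsMatch("15", "1|2|3");

// Anchored: the whole input must be exactly 1, 2 or 3.
Regex.IsMatch("15", "^(1|2|3)$");  // false
Regex.IsMatch("1", "^(1|2|3)$");   // true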

These were only two of the problems that drove me to write this blog post, and there are many others that might occur when using regular expressions as user input. Most developers aren’t able to think of all the problems that can occur when regular expressions are used as user-defined input, so why would you expect users to be? My advice is to stick with simple string comparisons until there is an actual requirement for a dynamic check such as regular expressions, and even then not without the right tooling to write those expressions.

Your users aren’t SuperUsers, so please don’t treat them like that.
