This project is read-only.
2
Vote

Fill Method Very Sluggish - Proposing ExtremML adding a "Fast Load" option

description

Love this framework and the power it has to offer! Is there any way that the ExtremeML team can implement a "Fast Load" option within the Fill functionality that will bypass all of the looping iterations that do the extensive checks on columns & rows? This will really help when loading large ADO.NET datasets to Excel sheets.
Attached is a snippet of what I've been leveraging for a high performance export method using a Sql Data Reader and writing to an Excel worksheet.

file attachments

comments

TimCoulter wrote Apr 27, 2010 at 6:52 AM

Many thanks for posting your code.

I am certainly inspired by the idea of a of a technique that will enable Fill() to work with large datasets (and it would be great to see your benchmarks, just to emphasize the value of this initiative) but there are a couple of differences between your approach and that taken by the Fill() method, which we will need to resolve.

The most obvious change is that your algorithm is not table-oriented, it assumes an empty worksheet and appends directly to it, whereas Fill() finds an existing template table and extends it to the required size, then inserts new rows into the worksheet to suit. Fill() has been designed to work with templates, so we must assume that a template designer might have included data below the table template, which would result in invalid data after applying your logic. It may be possible to combine the two approaches, whereby the code first checks the template for compatibility with appending and then uses the appropriate method to apply new rows.

My other concern is the mechanism by which values are applied to cells. ExtremeML currently uses a fairly well-tested mechanism for coercing different data types into Excel-compatible values and I am reluctant to bypass this, as it will represent the beginnings of code redundancy (and the maintenance nightmare that goes with that).

So, I need to look closely at how the existing infrastructure can be extended to incorporate your ideas without sacrificing its strengths. I will post an update here as soon as some progress has been made.

In the meantime, it would be great to see those benchmarks :)

jstone923 wrote Apr 27, 2010 at 4:56 PM

Tim,
Thanks for such a timely response! I'm excited that you are even considering the idea of what I'm proposing. I'm sure my sloppy code tells all that I'm wet behind the ears regarding OOXML techniques. This solution was based upon a last minute request to shred through multiple Sql Data Readers and paste results into new Excel Worksheets, with rowcounts in the thousands... So you can imagine the look on my face when I stumbled upon ExtremeML!

So, please know that I do appreciate the carefulness of your value-to-cell implemenation, and the time it saves with maintenance. I will see if I can produce a "benchmark" rather than just word of mouth. I'm currently short on bandwidth and underwater in a migration project, but I can tell you that the problem seemed to be exponential (as more rows were added, the slowness increased), which makes sense after you explained to me the carefulness of the logic behind the Fill() method. The dataset I would use to benchmark would be 300 columns, 5000 rows..

Thanks again for your detailed response. This has been very helpful!

-- Jordan

TimCoulter wrote Apr 27, 2010 at 8:53 PM

Jordan,

I definitely wouldn't say "wet behind the ears" - the fact that you have managed to populate a worksheet from a data source is a great achievement. I remember spending many hours the first time I tried this. The fact that it performs well is even more encouraging.

You are right that the performance degradation with the Fill() method is non-linear, which I assume is because each row insertion requires some kind of validation on each of the growing set of existing rows. However, there is no explicitly coded behavior in ExtremeML that would explain this degradation, so I have always assumed that it was an unavoidable consequence of invoking row insertion via the SDK. Your sample seems to prove that appending rows is substantially more efficient, and therefore offers the possibility of a fast Fill() method, provided there are no existing rows below the insertion point.

The reason I asked for benchmarks was not simply to create work for you, but because I have seen a number of forum posts that suggest that there are inherent performance problems in the SDK (which is partly why I have not invested any effort in trying to solve the problem). If your benchmarks show that your approach really is viable, it counters the previous evidence and fully justifies finding a way to make your technique work in ExtremeML.

Thanks again for your input and I look forward to your further updates.

jstone923 wrote Apr 30, 2010 at 8:54 PM

Tim,

Hopefully I will be able to get on the benchmark early next week....Hope to post then....

TimCoulter wrote May 1, 2010 at 8:13 AM

Thanks Jordan - in fact I have also been looking closely at this issue and profiling ExtremeML to pin down the performance hot spots. Although I am using your code sample as inspiration, I am trying as hard as possible to implement it without compromising the original architecture of ExtremeML. To be honest, it's a tough challenge, as I had forgotten just how many consequential actions are related to a table Fill() operation - there are plenty, and they all contribute something to the performance problem. One area that seems to offer potential is using keyed collections instead of Linq lookups, especially for the Row and Cell collections, which are intensively accessed during Fill(). Work is progressing well, but there are still a number of obstacles to overcome. I'll post progress soon.

TimCoulter wrote May 5, 2010 at 10:24 AM

Jordan,

To bring this issue up-to-date, I have spent I the last 4 days implementing a variant of your idea for performance improvement, but it has all been rather a waste of time. I started by profiling a table Fill() operation and it appeared that most of the cycles were being used on activities related to collection management. Specifically, most collections in ExtremeML are accessed using lambda expressions in Linq extension methods, which is very susceptible to degradation with large collections. So, I decided to make a significant change to the collection base classes, using a variant of .Net's KeyedCollection class to provide fast lookup. Unfortunately, after implementing more than 200 code changes (and all the work needed to fix the side effects) it has produced less than 10% performance improvement. This is very disappointing, but it seems to confirm my original suspicion that the overall performance is limited by constraints in the OpenXML SDK. It would be reassuring to be proved wrong by your benchmarks but, for the time being, I am inclined to put this effort on hold.

jstone923 wrote May 5, 2010 at 2:47 PM

Tim,

I am honored that you spent the time you did this week to attempt possible "Fast Fill" implementation into ExtremeML. When time permits, I will try to come up with the benchmark numbers you have been asking for, but in no way is my mentality to "prove you wrong". I appreciate the level of detail in your explanations as to why my suggested Fill() feature does not fit in with the functionality and behavior of ExtremeML. Your framework is rich and very powerful and I will continue to leverage it! In the meantime, I will continue to leverage my custom fast load logic for forward-only, non-lookup sheet fills containing high row counts so I can keep management off my back!

jstone923 wrote Jun 1, 2010 at 10:16 PM

Tim,
When time permits, (add the DocumentFormat.OpenXml.dll to this project (it was too large to include in this attachment), apply the database scripts and run the results. As row counts exceed 100, you will notice extreme differences in elapsed time to populate the excel documents.

wrote Jun 1, 2010 at 10:16 PM

anandm wrote Feb 28, 2011 at 5:29 PM

Tim,
I was researching ways to use the OpenXML SDK in real life and came upon your library. It does make it easier to use OpenXML. I also came across http://openexcel.codeplex.com/, which seems to do something similar, but talks about being fast. I have not done any benchmarking and maybe it does not fit into how ExtremeML works, but it maybe worth a look to see if that can give you some insight into making Fill faster..

Anand

deepankar wrote Apr 27, 2011 at 8:49 AM

Hi
In the samples source code, i was running the funciton "PopulateCountriesFromDataSet()", this works perfectly fine, but if i increase the number of rows to 4,000 then the system gets hung and the CPU shows 100% usage? Any solution

wrote Jun 2, 2011 at 3:59 PM

wrote Feb 14, 2013 at 3:16 AM