I have a class library that does calculations (usually AVERAGE) on
(sometimes) large sets of data. The nature of the application is that it
normally does several calculations (5-50 or more depending on user requests)
on the same set of data, so the basic implementation is that I grab all the
data out into a DataTable and then the library takes arguments and uses the
.Compute() method of the DataTable to return results. In most cases, I'm
doing an average of a column on a subset of rows. Subsequent calls do
averages on different subsets of rows.
This all works fine until the DataTable grows to 10k+ rows. Once I pass
that threshold (or thereabouts), the performance goes into the toilet. It
can literally take minutes to pull reports that would take < 10 seconds on a
dataset half the size.
The problem is that I don't want a single request to result in 50+ queries
to my database, yet I must do 50+ calculations. I can add application
servers pretty easily, but I only have the 1 database server. That's my
scarce resource so I want to minimize his work (even though he's better at
doing aggregation that my app server) and put my app servers to work.
I am well aware of the limitations of the DataTable and I am also well aware
that if I run these same queries directly against the database, I get much
better performance (orders of magnitude better). I'm just not sure how to
overcome this.
So my question is how can I pull out possibly large sets of data and do
calculations on the application servers so as not to overload my database
server but avoid this huge performance penalty that I get when working with
large DataTables? I'm willing to give up some performance for smaller sets
of data if necessary (the tradeoff would be worth it, I'm sure, if I can
figure out how to optimize the top end).
I have a small benchmarking application that I'm working with and I have yet
to find any method that out-performs DataTable.Compute(). I'm working under
the assumption that I need to start w/ a DataTable because I will need to
access the data several times (maybe I'm hosed right there - that's why I'm
asking this question).
Here is what I've tried thus far:
1) Manually call DataTable.Select() and get a DataRow[] object - brute force
iterate and do the average
2) Use LINQ against an EnumerableRowCollection<DataRow> pulled from the
DataTable and call .Average()
3) Brute force iterate over the DataTable myself and do the average
4) Call DataTable.CreateDataReader() and brute force iterate on the
DataReader to do the average
Of all of these methods, #1 results in the closest performance numbers to
.Compute() (and it's pretty darn close) and #2 results in the worst (almost
twice as bad) with #3 & #4 falling solidly in the middle of those two.
Does anyone out there have an idea on how to solve this problem?
Many thanks -
Paul Prewett