I have a MS SQL Server database table that contains 5 million records. The
primary key of the table is a GUID.

I need to write an application that reads each of these records and passes
some values to another process.

What's the best way to retrieve the records to be processed?

Reading the records into a DataTable would probably result in far too much
memory usage.

Does using a DataReader pass the load (and hence memory problem) onto SQL
Server instead?

Reading batches of records would be difficult as the primary key is a GUID
and is not sequential.

What about reading the primary key values into a DataTable and then reading
batches of records?

Has anyone had to solve a similar problem?

Thanks,
Peter

Re: Best way to process millions of records by John

John
Tue Jan 04 09:34:44 CST 2005

"Peter Sedman" <psedman@community.nospam> wrote in message
news:%23vyVEAn8EHA.1452@TK2MSFTNGP11.phx.gbl...
>I have a MS SQL Server database table that contains 5 million records. The
> primary key of the table is a GUID.
>
> I need to write an application that reads each of these records and passes
> some values to another process.
>
> What's the best way to retrieve the records to be processed?

I would just use a DataReader and trust SQL Server and ADO.NET to implement
reasonable buffering.

On the other hand, if you have control of the code that sends to the other
process, you may want to batch up what you send to that process, or at least
implement a buffering scheme.

John Saunders



Re: Best way to process millions of records by User

User
Tue Jan 04 10:03:36 CST 2005

I don't see why the "non-sequential" matters.
If you need to read in some kind of order, do an "ORDER BY" in SQL Server
and then use a DataReader.

"Peter Sedman" <psedman@community.nospam> wrote in message
news:#vyVEAn8EHA.1452@TK2MSFTNGP11.phx.gbl...
> I have a MS SQL Server database table that contains 5 million records.
The
> primary key of the table is a GUID.
>
> I need to write an application that reads each of these records and passes
> some values to another process.
>
> What's the best way to retrieve the records to be processed?
>
> Reading the records into a DataTable would probably result in far too much
> memory usage.
>
> Does using a DataReader pass the load (and hence memory problem) onto SQL
> Server instead?
>
> Reading batches of records would be difficult as the primary key is a GUID
> and is not sequential.
>
> What about reading the primary key values into a DataTable and then
reading
> batches of records?
>
> Has anyone had to solve a similar problem?
>
> Thanks,
> Peter
>
>



Re: Best way to process millions of records by W

W
Tue Jan 04 10:31:37 CST 2005

It kinda depends on what the other process is. In general - I would say the
DataReader is definitely preferrable to the DataTable in this case since you
don't appear to need to serialize the data or keep it's state after you pass
it off. To this end - you will definitely save some resources - big time
with that much data.

However, if you are processing 5 million records- I'd look to DTS if
possible - much better suited to many tasks involving that much data - again
though it depends on the nature of the processes you are passing the data
to.

5 million records is a lot to work with in any environment - if you're
processing them all in one process. You may want to look at breaking down
the records into smaller more manageable sets - for many reasons - primiarly
if somethign goes wrong - you can notify and respond a lot quicker if you
break it down into smaller sets.

HTH,

Bill

--
W.G. Ryan, MVP

www.tibasolutions.com | www.devbuzz.com | www.knowdotnet.com
"Peter Sedman" <psedman@community.nospam> wrote in message
news:%23vyVEAn8EHA.1452@TK2MSFTNGP11.phx.gbl...
> I have a MS SQL Server database table that contains 5 million records.
The
> primary key of the table is a GUID.
>
> I need to write an application that reads each of these records and passes
> some values to another process.
>
> What's the best way to retrieve the records to be processed?
>
> Reading the records into a DataTable would probably result in far too much
> memory usage.
>
> Does using a DataReader pass the load (and hence memory problem) onto SQL
> Server instead?
>
> Reading batches of records would be difficult as the primary key is a GUID
> and is not sequential.
>
> What about reading the primary key values into a DataTable and then
reading
> batches of records?
>
> Has anyone had to solve a similar problem?
>
> Thanks,
> Peter
>
>



Re: Best way to process millions of records by William

William
Tue Jan 04 13:13:01 CST 2005

I would take a look at server-side processing in this case--5 million rows
is a lot. Asking the server to sort them first can be an issue--especially
if there is not already an index in place. Once SQL CLR is here, it will be
easy (easier) to code a sophisticated routine to process the rows. At this
point, I would try to do everything I needed to do in a SP. Once the SP has
done its work, it can write the rows to a temporary table (hopefully a
subset of the original) and THEN you can use BCP/DTS to move it somewhere.
Asking ADO.NET (even with a data reader) to move the rows to the client is a
waste of time and resources (IMHO).

--
____________________________________
William (Bill) Vaughn
Author, Mentor, Consultant
Microsoft MVP
www.betav.com
Please reply only to the newsgroup so that others can benefit.
This posting is provided "AS IS" with no warranties, and confers no rights.
__________________________________

"W.G. Ryan eMVP" <WilliamRyan@gmail.com> wrote in message
news:eK25drn8EHA.3820@TK2MSFTNGP11.phx.gbl...
> It kinda depends on what the other process is. In general - I would say
> the
> DataReader is definitely preferrable to the DataTable in this case since
> you
> don't appear to need to serialize the data or keep it's state after you
> pass
> it off. To this end - you will definitely save some resources - big time
> with that much data.
>
> However, if you are processing 5 million records- I'd look to DTS if
> possible - much better suited to many tasks involving that much data -
> again
> though it depends on the nature of the processes you are passing the data
> to.
>
> 5 million records is a lot to work with in any environment - if you're
> processing them all in one process. You may want to look at breaking down
> the records into smaller more manageable sets - for many reasons -
> primiarly
> if somethign goes wrong - you can notify and respond a lot quicker if you
> break it down into smaller sets.
>
> HTH,
>
> Bill
>
> --
> W.G. Ryan, MVP
>
> www.tibasolutions.com | www.devbuzz.com | www.knowdotnet.com
> "Peter Sedman" <psedman@community.nospam> wrote in message
> news:%23vyVEAn8EHA.1452@TK2MSFTNGP11.phx.gbl...
>> I have a MS SQL Server database table that contains 5 million records.
> The
>> primary key of the table is a GUID.
>>
>> I need to write an application that reads each of these records and
>> passes
>> some values to another process.
>>
>> What's the best way to retrieve the records to be processed?
>>
>> Reading the records into a DataTable would probably result in far too
>> much
>> memory usage.
>>
>> Does using a DataReader pass the load (and hence memory problem) onto SQL
>> Server instead?
>>
>> Reading batches of records would be difficult as the primary key is a
>> GUID
>> and is not sequential.
>>
>> What about reading the primary key values into a DataTable and then
> reading
>> batches of records?
>>
>> Has anyone had to solve a similar problem?
>>
>> Thanks,
>> Peter
>>
>>
>
>



Re: Best way to process millions of records by Frans

Frans
Tue Jan 04 13:28:24 CST 2005

Peter Sedman wrote:
> I have a MS SQL Server database table that contains 5 million records. The
> primary key of the table is a GUID.
>
> I need to write an application that reads each of these records and passes
> some values to another process.

so your select should only return those columns.

> What's the best way to retrieve the records to be processed?

what's the process doing with the values? in this case, with a lot of
rows, server-side processing is often the only way to get a decent
performance. Unless that other process is not under your control of
course and simply needs to get fed by all 5 million rows.

> Reading the records into a DataTable would probably result in far too much
> memory usage.

yes, beyond 57,000 rows, the datatable is dead in the water.

> Does using a DataReader pass the load (and hence memory problem) onto SQL
> Server instead?

If you do a SELECT * from table and table contains 5 million rows,
Sqlserver will have to store the temp resultset somewhere. This is often
done in memory if the resultset is small, but with a large resultset, it
might be it uses the tempdb to store the cursor. (depends on the size of
the resultset, if the resultset is 30MB for example, it still might use
just memory to keep the resultset)

a datareader is in fact a server-side cursor. The sqlclient will
receive batches of data from the server and these are not that large. So
this could work in your situation.

> Reading batches of records would be difficult as the primary key is a GUID
> and is not sequential.
>
> What about reading the primary key values into a DataTable and then reading
> batches of records?

no can do. way too many rows.

Even though I'm generally against the usage of stored procedures in
many situations, in your particular case it suits the job I think, so in
short: best way to do this is to write a proc which does the processing
that other process is doing for you, which solves you from the
data-pumping you have to do otherwise.

Frans.

--
------------------------------------------------------------------------
Get LLBLGen Pro, productive O/R mapping for .NET: http://www.llblgen.com
My .NET blog: http://weblogs.asp.net/fbouma
Microsoft MVP (C#)
------------------------------------------------------------------------

Re: Best way to process millions of records by Sahil

Sahil
Tue Jan 04 14:46:01 CST 2005

Peter here are my 2 cents.

In order of decreasing preference I would use.

1. Dataset/DataTable <---- totally stay away from this in this case.
2. Datareader
3. Cross linked queries
4. DTS
5. BCP <--- my recommendation

.. so essentially my topmost recommendation is BCP, but I wanna add 1 more
thing. GUID is a pig compared to Int32. See if you can do away with GUID on
a table with a million rows.

BTW, for a comparison, I wouldn't be surprised if BCP is about a 100 to a
1000 times faster than DataReader.

If the other process is something OTHER than another DB, then you should
look into DTS. ... What is the other process that expects these millionish
rows?

- Sahil Malik
http://dotnetjunkies.com/weblog/sahilmalik



"Peter Sedman" <psedman@community.nospam> wrote in message
news:#vyVEAn8EHA.1452@TK2MSFTNGP11.phx.gbl...
> I have a MS SQL Server database table that contains 5 million records.
The
> primary key of the table is a GUID.
>
> I need to write an application that reads each of these records and passes
> some values to another process.
>
> What's the best way to retrieve the records to be processed?
>
> Reading the records into a DataTable would probably result in far too much
> memory usage.
>
> Does using a DataReader pass the load (and hence memory problem) onto SQL
> Server instead?
>
> Reading batches of records would be difficult as the primary key is a GUID
> and is not sequential.
>
> What about reading the primary key values into a DataTable and then
reading
> batches of records?
>
> Has anyone had to solve a similar problem?
>
> Thanks,
> Peter
>
>



Re: Best way to process millions of records by W

W
Tue Jan 04 18:26:38 CST 2005

Amen to that. Even on my dual processor box - 5 million records is a hog.

--
W.G. Ryan MVP (Windows Embedded)

TiBA Solutions
www.tibasolutions.com | www.devbuzz.com | www.knowdotnet.com
"William (Bill) Vaughn" <billvaRemoveThis@nwlink.com> wrote in message
news:eI9wpFp8EHA.824@TK2MSFTNGP11.phx.gbl...
> I would take a look at server-side processing in this case--5 million rows
> is a lot. Asking the server to sort them first can be an issue--especially
> if there is not already an index in place. Once SQL CLR is here, it will
be
> easy (easier) to code a sophisticated routine to process the rows. At this
> point, I would try to do everything I needed to do in a SP. Once the SP
has
> done its work, it can write the rows to a temporary table (hopefully a
> subset of the original) and THEN you can use BCP/DTS to move it somewhere.
> Asking ADO.NET (even with a data reader) to move the rows to the client is
a
> waste of time and resources (IMHO).
>
> --
> ____________________________________
> William (Bill) Vaughn
> Author, Mentor, Consultant
> Microsoft MVP
> www.betav.com
> Please reply only to the newsgroup so that others can benefit.
> This posting is provided "AS IS" with no warranties, and confers no
rights.
> __________________________________
>
> "W.G. Ryan eMVP" <WilliamRyan@gmail.com> wrote in message
> news:eK25drn8EHA.3820@TK2MSFTNGP11.phx.gbl...
> > It kinda depends on what the other process is. In general - I would say
> > the
> > DataReader is definitely preferrable to the DataTable in this case since
> > you
> > don't appear to need to serialize the data or keep it's state after you
> > pass
> > it off. To this end - you will definitely save some resources - big
time
> > with that much data.
> >
> > However, if you are processing 5 million records- I'd look to DTS if
> > possible - much better suited to many tasks involving that much data -
> > again
> > though it depends on the nature of the processes you are passing the
data
> > to.
> >
> > 5 million records is a lot to work with in any environment - if you're
> > processing them all in one process. You may want to look at breaking
down
> > the records into smaller more manageable sets - for many reasons -
> > primiarly
> > if somethign goes wrong - you can notify and respond a lot quicker if
you
> > break it down into smaller sets.
> >
> > HTH,
> >
> > Bill
> >
> > --
> > W.G. Ryan, MVP
> >
> > www.tibasolutions.com | www.devbuzz.com | www.knowdotnet.com
> > "Peter Sedman" <psedman@community.nospam> wrote in message
> > news:%23vyVEAn8EHA.1452@TK2MSFTNGP11.phx.gbl...
> >> I have a MS SQL Server database table that contains 5 million records.
> > The
> >> primary key of the table is a GUID.
> >>
> >> I need to write an application that reads each of these records and
> >> passes
> >> some values to another process.
> >>
> >> What's the best way to retrieve the records to be processed?
> >>
> >> Reading the records into a DataTable would probably result in far too
> >> much
> >> memory usage.
> >>
> >> Does using a DataReader pass the load (and hence memory problem) onto
SQL
> >> Server instead?
> >>
> >> Reading batches of records would be difficult as the primary key is a
> >> GUID
> >> and is not sequential.
> >>
> >> What about reading the primary key values into a DataTable and then
> > reading
> >> batches of records?
> >>
> >> Has anyone had to solve a similar problem?
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >
> >
>
>



Re: Best way to process millions of records by Sahil

Sahil
Wed Jan 05 17:46:58 CST 2005

>>Even on my dual processor box

<green> I'm not jealous, I'm not jealous </green>


"W.G. Ryan eMVP" <WilliamRyan@NoSpam.gmail.com> wrote in message
news:u#fLBzr8EHA.3236@TK2MSFTNGP15.phx.gbl...
> Amen to that. Even on my dual processor box - 5 million records is a hog.
>
> --
> W.G. Ryan MVP (Windows Embedded)
>
> TiBA Solutions
> www.tibasolutions.com | www.devbuzz.com | www.knowdotnet.com
> "William (Bill) Vaughn" <billvaRemoveThis@nwlink.com> wrote in message
> news:eI9wpFp8EHA.824@TK2MSFTNGP11.phx.gbl...
> > I would take a look at server-side processing in this case--5 million
rows
> > is a lot. Asking the server to sort them first can be an
issue--especially
> > if there is not already an index in place. Once SQL CLR is here, it will
> be
> > easy (easier) to code a sophisticated routine to process the rows. At
this
> > point, I would try to do everything I needed to do in a SP. Once the SP
> has
> > done its work, it can write the rows to a temporary table (hopefully a
> > subset of the original) and THEN you can use BCP/DTS to move it
somewhere.
> > Asking ADO.NET (even with a data reader) to move the rows to the client
is
> a
> > waste of time and resources (IMHO).
> >
> > --
> > ____________________________________
> > William (Bill) Vaughn
> > Author, Mentor, Consultant
> > Microsoft MVP
> > www.betav.com
> > Please reply only to the newsgroup so that others can benefit.
> > This posting is provided "AS IS" with no warranties, and confers no
> rights.
> > __________________________________
> >
> > "W.G. Ryan eMVP" <WilliamRyan@gmail.com> wrote in message
> > news:eK25drn8EHA.3820@TK2MSFTNGP11.phx.gbl...
> > > It kinda depends on what the other process is. In general - I would
say
> > > the
> > > DataReader is definitely preferrable to the DataTable in this case
since
> > > you
> > > don't appear to need to serialize the data or keep it's state after
you
> > > pass
> > > it off. To this end - you will definitely save some resources - big
> time
> > > with that much data.
> > >
> > > However, if you are processing 5 million records- I'd look to DTS if
> > > possible - much better suited to many tasks involving that much data -
> > > again
> > > though it depends on the nature of the processes you are passing the
> data
> > > to.
> > >
> > > 5 million records is a lot to work with in any environment - if you're
> > > processing them all in one process. You may want to look at breaking
> down
> > > the records into smaller more manageable sets - for many reasons -
> > > primiarly
> > > if somethign goes wrong - you can notify and respond a lot quicker if
> you
> > > break it down into smaller sets.
> > >
> > > HTH,
> > >
> > > Bill
> > >
> > > --
> > > W.G. Ryan, MVP
> > >
> > > www.tibasolutions.com | www.devbuzz.com | www.knowdotnet.com
> > > "Peter Sedman" <psedman@community.nospam> wrote in message
> > > news:%23vyVEAn8EHA.1452@TK2MSFTNGP11.phx.gbl...
> > >> I have a MS SQL Server database table that contains 5 million
records.
> > > The
> > >> primary key of the table is a GUID.
> > >>
> > >> I need to write an application that reads each of these records and
> > >> passes
> > >> some values to another process.
> > >>
> > >> What's the best way to retrieve the records to be processed?
> > >>
> > >> Reading the records into a DataTable would probably result in far too
> > >> much
> > >> memory usage.
> > >>
> > >> Does using a DataReader pass the load (and hence memory problem) onto
> SQL
> > >> Server instead?
> > >>
> > >> Reading batches of records would be difficult as the primary key is a
> > >> GUID
> > >> and is not sequential.
> > >>
> > >> What about reading the primary key values into a DataTable and then
> > > reading
> > >> batches of records?
> > >>
> > >> Has anyone had to solve a similar problem?
> > >>
> > >> Thanks,
> > >> Peter
> > >>
> > >>
> > >
> > >
> >
> >
>
>



Re: Best way to process millions of records by Cor

Cor
Thu Jan 06 03:07:57 CST 2005

Sahil,

Dual processor box does not say anything, they exist already for Windows
systems from the Pentium One time.

Therefore there exist even with 286/386/486 however as far as I remember me
they did nothing in Windows systems and exist of course in non Window OS's
with all types of processors.

Therefore what Bill says is the same as "I have a lot of Pk's" which can
mean really 2 horses before the car.

Cor



Re: Best way to process millions of records by Miha

Miha
Thu Jan 06 03:43:57 CST 2005

Hi Cor,

I am not sure that I entirely understand your post, however, Windows NT and
newer have support for dual processor systems and they take advantage of it.
And I really doubt that 386 and older had SMP capability. :-)

--
Miha Markic [MVP C#] - RightHand .NET consulting & development
SLODUG - Slovene Developer Users Group
www.rthand.com

"Cor Ligthert" <notmyfirstname@planet.nl> wrote in message
news:OZ5F3788EHA.1300@TK2MSFTNGP14.phx.gbl...
> Sahil,
>
> Dual processor box does not say anything, they exist already for Windows
> systems from the Pentium One time.
>
> Therefore there exist even with 286/386/486 however as far as I remember
> me they did nothing in Windows systems and exist of course in non Window
> OS's with all types of processors.
>
> Therefore what Bill says is the same as "I have a lot of Pk's" which can
> mean really 2 horses before the car.
>
> Cor
>



Re: Best way to process millions of records by Cor

Cor
Thu Jan 06 04:36:01 CST 2005

Miha,

I don't know anymore if NT did work with dual 486, however I am sure it was
with Pentium 1, so I did not wanted to start to low.

:-)

Cor



Re: Best way to process millions of records by Cor

Cor
Thu Jan 06 04:39:59 CST 2005

Miha,

I used the wrong words

"Therefore" should be "Before that" I used automaticly the Dutch words "Daar
voor"

Sorry and thanks attending me on that.

Cor