Re: Reading & Searching a Huge file by Ahmad
Ahmad
Sat Mar 15 04:36:29 CDT 2008
Thanks Peter & Jon,
I have to read the file on user request. Now the problem is that I can't
create index as the format of file is fixed. So I think the only option left
is binary search.
Let me explain one more thing which I missed in my previous post. The
numeric value before the [space] can be repeated thousands of time
consequtively. Like.
....... ..............................
2073 SANZP311800Q729T070534N17CEP
2345 SBHPP3200.00Q16000T070534N18CEPXT
2345 SBHPP3200.00Q1000T070534N19CEP
2345 SBHPP3200.00Q5000070534N20CEPTP
....... ..............................
....... ..............................
2345 SLHGP375.00Q4000T070534N26CEP
2567 SBHPP3200.0Q000T070534N23CEP
........ ..............................
Now if I have to find the first occurance of a numeric value 2345. In that
case binary search will help or not?
Regards,
"Peter Duniho" <NpOeStPeAdM@nnowslpianmk.com> wrote in message
news:op.t70rb3ir8jd0ej@petes-computer.local...
> On Fri, 14 Mar 2008 10:35:49 -0700, Ahmad Jalil Qarshi
> <ahmaddearNO@SPAMhotmail.com> wrote:
>
>> [...]
>> Now what I want to is to find a specific numeric value at the beginning
>> of
>> the line. For example in the above few lines the numeric values are
>> 52,96,100,150,153,159,168,210,569,1000,1010 (these values are always
>> sorted). Suppose I have to search 168. Since the values are sorted I can
>> use
>> the binary search technique to search a specific numeric value. But for
>> that
>> I will have to read all the numeric values sequentially into some array
>> in
>> memory. The other way is sequential search i.e to read line by line and
>> fetch the numeric value before [space] and compare it with required one.
>>
>> What is the best possible way to perform searching in the above mentioned
>> file format.
>
> How many times do you need to perform this search?
>
> If it's just once, then I think you could take advantage of the fact that
> the lines are all _almost_ the same length and do a binary search on the
> file itself. You'll have to make a guess, then search for the
> previous/next end-of-line, then check the numeric value at the beginning
> of that line. That's not as efficient as it could be if you had identical
> line lengths, but compared to scanning through a 2GB file it should still
> be plenty fast.
>
> If you're going to perform the search on a regular basis, I'd recommend
> creating an index for the file. Scan it once, storing just the initial
> numeric value and a byte offset within the file representing where that
> record's line starts. Then later you can do your binary search on the
> index instead of the file. This is, I think, what you're suggesting when
> you wrote "read all the numeric values sequentially into some array"; you
> didn't mention also storing the file offsets for each record, but
> hopefully you realize that's implied since otherwise keeping the values in
> an array wouldn't be useful.
>
> Of course, keep in mind that this is the sort of thing that databases are
> really good at. If you have a way to redirect the data into a database
> rather than keeping it in this huge text file, I think that would be the
> best way to handle it.
>
> Pete