Krishna
कृष्ण,LOL
Registered: 05/08/03
Posts: 23,285
Loc: oakland
random sample from large text-file?
#9904576 - 03/03/09 04:25 PM (15 years, 28 days ago)
I need to take a random sample from a very large (~4 GB) text file. Obviously I don't want to read the whole file into memory to do this. My current algorithm is pretty simple: get the total size of the file, pick a random integer between 0 and that size, seek to that byte, read to the end of the (probably partial) line we landed in, grab the next full line, and repeat until we've collected as many lines as needed.
e.g. (in Ruby, though the algorithm itself is pretty language-neutral) Code:
SIZE = `stat -c%s /foo/bar/some_histogram.txt`.to_i

def rand_seek(f)
  f.seek(rand(SIZE))
  f.gets    # discard the (probably partial) line we landed in
  f.gets    # the next full line -- nil if the seek landed inside the last line
end

histo = File.open('/foo/bar/some_histogram.txt')
out   = File.open('sample.txt', 'w+')

10000.times do
  line = rand_seek(histo)
  line = rand_seek(histo) while line.nil?
  out.write(line)
end

histo.close
out.close
Now this seems efficient enough, taking just over a minute to build a random sample of 10,000 lines from the ~4 GB file (without certain checks, e.g. ensuring we don't select the same random line twice). Still, I'm wondering whether there's some more standard way of doing this, e.g. with some *nix command-line tool? I can't use something like shuffle.c, since I don't want to read the entire file into memory, but rather just seek around it, snatching lines at random...
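For the duplicate check mentioned above, one option (a sketch; the helper name and the idea of keying on offsets are mine, not from the post) is to remember the byte offset at which each sampled line starts. `Set#add?` returns nil when the element was already present, so a repeat just triggers another random seek:

```ruby
require 'set'

# Draw one line the file has not yielded before. `seen` holds the byte
# offsets of line starts we have already sampled.
def unique_sample(f, size, seen)
  loop do
    f.seek(rand(size))
    f.gets                           # discard the partial line we landed in
    start = f.pos                    # byte offset where the next line begins
    line  = f.gets
    next if line.nil?                # landed inside the final line; retry
    return line if seen.add?(start)  # Set#add? is nil on duplicates
  end
end
```

Usage would be `seen = Set.new` once, then `out.write unique_sample(histo, SIZE, seen)` per draw. Note this loops forever if you ask for more distinct lines than the file can supply, so in real use the sample size should be sanity-checked against the line count.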
Thanks for any input.
--------------------
Seuss
Error: divide by zero
Registered: 04/27/01
Posts: 23,480
Loc: Caribbean
Last seen: 1 month, 19 days
Re: random sample from large text-file? [Re: Krishna]
#9904788 - 03/03/09 04:53 PM (15 years, 28 days ago)
Build an array of seek offsets first. Then sort the array, traverse it, and seek in ascending order. That way, if multiple reads fall within the same read buffer, you only hit the filesystem once for those seeks, and the in-order reads help the OS if it's doing read-ahead. I would also ditch buffered IO and use memory-mapped IO (mmap) or perhaps pread(). Save the output to another array, filling it in the proper "random" order, then write it all at once.
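A minimal Ruby sketch of the sorted-offsets idea (method name and sample size are illustrative, not from the thread): generate all the offsets up front, sort them so the reads sweep the file in one forward pass, but carry each offset's original slot along so the output keeps its random order.

```ruby
def sorted_sample(path, n)
  size = File.size(path)
  # pair each random offset with its slot in the output, then sort by offset
  offsets = Array.new(n) { rand(size) }.each_with_index.sort_by { |off, _| off }

  lines = Array.new(n)
  File.open(path) do |f|
    offsets.each do |off, slot|
      f.seek(off)
      f.gets                 # discard the partial line we landed in
      lines[slot] = f.gets   # next full line (nil if the seek hit the tail)
    end
  end
  lines
end
```

Sorting 10k integers is negligible next to the disk seeks it saves; nil entries (seeks that landed inside the final line) can be retried or dropped before writing.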
Krishna
कृष्ण,LOL
Registered: 05/08/03
Posts: 23,285
Loc: oakland
Re: random sample from large text-file? [Re: Seuss]
#9914046 - 03/04/09 10:59 PM (15 years, 27 days ago)
Quote:
Seuss said: Build an array of seek offsets first. Then sort the array, traverse it, and seek in ascending order. That way, if multiple reads fall within the same read buffer, you only hit the filesystem once for those seeks, and the in-order reads help the OS if it's doing read-ahead.
hell yes, definitely a smart idea. Not that important when taking a single 10k sample, but I'm sure the next step is going to be taking 1,000 10k samples. (Without getting into the details of what this is for, so I don't get fired: I'm doing a single sample to surface any snags, etc., then passing off a bunch of samples with my notes to less technically apt data-processing folks.)
Quote:
I would also ditch buffered IO and use memory-mapped IO (mmap) or perhaps pread(). Save the output to another array, filling it in the proper "random" order, then write it all at once.
also a smart tip. This job is the first time I've had to work with really large data sets (working in NLP/computational linguistics), and nine months in I still find I'm forgetting basic things to keep in mind when dealing with huge amounts of data, and having to go back and optimize stuff like this after I've run the code on a sample set.
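For reference, newer Ruby exposes pread(2) directly as IO#pread (2.5+, on platforms that support it), so no cursor state or mmap binding is needed. A sketch under the assumption that CHUNK comfortably exceeds the longest line (CHUNK and the helper name are mine):

```ruby
CHUNK = 8192   # assumed upper bound on line length

def pread_line(f, size)
  off = rand(size)
  buf = f.pread(CHUNK, off)   # read at off without moving the file cursor
  nl  = buf.index("\n")       # end of the partial line we landed in
  return nil if nl.nil?
  rest = buf[(nl + 1)..-1]
  stop = rest.index("\n")
  stop && rest[0..stop]       # nil if no complete line fits in the chunk
end
```

Because pread never touches the file position, this is also safe to call from multiple threads sharing one descriptor.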
thanks for the input, Seuss. I know this forum is generally used for algorithm discussion, but I figured (hoped) you, if nobody else, would chime in.
--------------------