Krishna
कृष्ण,LOL

Registered: 05/08/03
Posts: 23,285
Loc: oakland
random sample from large text-file?
    #9904576 - 03/03/09 04:25 PM (15 years, 28 days ago)

I need to take a random sample from a very large (~4 GB) text file. Obviously I don't want to read the whole file into memory to do this. My current algorithm is pretty simple: get the total size of the file, pick a random integer between 0 and that size, seek to that byte in the file, grab the next full line, and repeat until we've collected as many lines as needed.

e.g. (in ruby, though the algorithm itself is pretty language-neutral)
Code:

# Pick a random byte, skip the partial line we landed in, and return the
# next full line. Returns nil if the seek lands inside the last line
# (readline raises EOFError at end-of-file rather than returning nil,
# so without the rescue the nil-retry loop below would never fire).
# Note: the first line of the file can never be selected this way.
def rand_seek(f)
  f.seek(rand(SIZE))
  f.readline           # discard the (likely partial) line we landed in
  f.readline
rescue EOFError
  nil
end

PATH = '/foo/bar/some_histogram.txt'
SIZE = File.size(PATH)  # no need to shell out to stat

histo = File.open(PATH)
out = File.open('sample.txt', 'w')

10000.times do
  line = rand_seek(histo)
  line = rand_seek(histo) while line.nil?  # retry end-of-file misses
  out.write(line)
end

histo.close
out.close



Now this seems efficient enough, taking just over a minute to build a random sample of 10,000 lines from the ~4 GB file (without certain checks - e.g., to ensure we don't select the same random line twice) - however, I'm wondering if there is some more standard way of doing this (e.g., with some *nix command-line tool)? I can't use something like shuffle.c since I don't want to read the entire file into memory, but rather just seek around it snatching lines at random...

Thanks for any input.


Seuss
Error: divide by zero


Registered: 04/27/01
Posts: 23,480
Loc: Caribbean
Last seen: 1 month, 19 days
Re: random sample from large text-file? [Re: Krishna]
    #9904788 - 03/03/09 04:53 PM (15 years, 28 days ago)

Build an array of seek offsets first.  Next, sort the array.  Next, traverse the array and seek.  This way, if you have multiple reads within the same read buffer, you only read from the filesystem for those seeks once.  You also optimize your reads in order to help the OS if it is buffering read aheads.  I would also ditch buffered IO and use memory mapped IO (mmap) or perhaps pread().  Save the output to another array, filling it in the proper "random" order, then write it all at once.

Krishna
कृष्ण,LOL

Registered: 05/08/03
Posts: 23,285
Loc: oakland
Re: random sample from large text-file? [Re: Seuss]
    #9914046 - 03/04/09 10:59 PM (15 years, 27 days ago)

Quote:

Seuss said:
Build an array of seek offsets first.  Next, sort the array.  Next, traverse the array and seek.  This way, if you have multiple reads within the same read buffer, you only read from the filesystem for those seeks once.  You also optimize your reads in order to help the OS if it is buffering read aheads.



hell yes, definitely a smart idea. not that important when taking a single 10k sample, but i'm sure the next step is going to be taking 1000 10k samples. (without getting into the details of what this is for, so i don't get fired: i'm doing a single sample to shake out any snags, then passing off a bunch of samples with my notes to less technically-apt data-processing folks.)

Quote:

I would also ditch buffered IO and use memory mapped IO (mmap) or perhaps pread().  Save the output to another array, filling it in the proper "random" order, then write it all at once.



also a smart tip. this current job is the first time i've had to work with really large data-sets (working in NLP/computational linguistics), and 9 months later i still find i'm forgetting basic things to keep in mind when dealing with huge amounts of data, then having to go back and optimize stuff like this after running the code on a sample set.

thanks for the input Seuss. I know this forum is generally used for algorithm discussion, but I figured (hoped) you, if nobody else, would chime in with some input. :smile:


Copyright 1997-2024 Mind Media. Some rights reserved.
