choose one millon random lines from a file containing one billion lines

This is an interesting problem and I can imagine lots of data intensive companies must be facing these types of challeges where they need to analyse the weblogs. So let me write down the problem statement and look at various available solutions

"Given an apache webserver logfile containing appx 1 billion entries, write an algorithm to select 1 million entries randomly from this file"

Solution:

Here are a few assumptions

1) Because the problem states that the file has appx 1 billion lines, I will assume that we don't know the exact number of lines in the files
2) I will assume that I don't have enough memory in my computer to hold all 1 billion lines. But I do have memory to hold 1 million lines and also I do have a data-structure that can count untill 1 million (longint ?)


Here is an algorithm I propose
$N=1000000
for (1..$N)
{
push @lines, scalar <>;
}


while(<>){
#Generate a random number between 0 and the current line number
$no = $.*(rand());
if($no < $N) { $line[rand $N] = $_; #Now if the random number generated is less than 1,000,000 then randomly select a lines to be replaced from already selected 1m lines and then replace that with the current line } } print @lines; Here two things happen - First as we progress through the file, we generate a random no between 0 & line# and if this random number is <>

Comments

Popular posts from this blog

Impossible