Thursday, November 15, 2007

Split 2 million records into 5000 groups of random lists

Problem:
A list of 2 million names.
Create 5000 groups of names: 51% (2550 groups) with 100 names each, 29% (1450) with 500, 8% (400) with 1000, 6% (300) with 25000, 3% (150) with 50000, and 3% (150) with 100000. The added complexity is that each group should contain a random set of names drawn from the master list.
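As a quick sanity check on these numbers, the group plan can be written down as data and totalled. This is only a sketch; the constant name GROUP_PLAN is made up for illustration and is not part of my script.

```ruby
# Hypothetical name: GROUP_PLAN maps group size => number of groups of that size.
GROUP_PLAN = {
  100     => 2550,
  500     => 1450,
  1_000   => 400,
  25_000  => 300,
  50_000  => 150,
  100_000 => 150
}

# Number of groups overall, and number of name slots across all groups.
total_groups = GROUP_PLAN.values.sum
total_names  = GROUP_PLAN.sum { |size, count| size * count }

puts "groups: #{total_groups}"      # 5000
puts "name slots: #{total_names}"   # 31380000
```

The groups need about 31.4 million name slots in total, far more than the 2 million names in the list, so names have to be reused across groups. Only within a single group can the names be distinct.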

My Solution:
I used Ruby to create a simple script:
1. I dumped the list of 2 million names into a SQLite database.
2. Created a script along the following lines:
....
# pick 100 random rows for this group; only the duns column is written out
db.execute("select duns, name from subject order by random() limit 100") do |row|
  file.puts "#{row[0]}"
end
....

This solution took about 10 hours to execute for my entire list. I wonder if there is a much simpler way to do this.
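One alternative I have not timed: read the list into memory once and let Ruby do the sampling, so the database does not have to order all 2 million rows by random() for every single group. The sketch below makes several assumptions: the whole list fits in memory, a name may appear in more than one group but only once within a group, and the helper name each_random_group is invented for this example. It uses Array#sample, which is available in Ruby 1.9 and later.

```ruby
# Sketch only: assumes the list fits in memory and that groups may overlap.

# Hypothetical helper: yields (group_index, sampled_items) once per group.
# plan is a hash of group size => number of groups of that size.
def each_random_group(items, plan, rng: Random.new)
  index = 0
  plan.each do |size, count|
    count.times do
      # Array#sample(n) returns n elements taken from distinct positions
      yield index, items.sample(size, random: rng)
      index += 1
    end
  end
end

# Tiny demonstration with stand-in data instead of the real 2 million rows.
ids  = (1..1_000).to_a
plan = { 10 => 3, 50 => 2 }
each_random_group(ids, plan) do |i, group|
  puts "group #{i}: #{group.size} ids, first few: #{group.first(3).inspect}"
end
```

For the real data, items would be the duns values read from the subject table in one query, and each yielded group would be written to its own file, as in the script above.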

BTW, my Ruby editor of choice is now NetBeans. It has neat features: code completion is my favourite, and the ability to edit ad-hoc files is another. My previous editor of choice was Eclipse with Ruby plugins.