Feb 21 2012
After my previous adventures in slicing and dicing a huge XML file, I wanted a means to randomly select files. But first, the directory had so many entries it was unwieldy on my laptop. The Python script below divvies the files up into directories of up to 1000 files each. (Adaptable to other contexts via slight tweaking of the filename regex and subdir name generation.)
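#!/usr/bin/python
import os
import re

where = '.'  # source directory

for f in os.listdir(where):
    # Pull the numeric ID out of names like foo_COMM-12345.xml
    m = re.search(r'.*_COMM-([0-9]+)\.xml', f)
    if m:
        # IDs 0-999 go to 000, 1000-1999 to 001, and so on
        subdir = os.path.join(where, "%03d" % (int(m.group(1)) // 1000))
        try:
            os.mkdir(subdir)
        except OSError:
            pass  # subdirectory already exists
        # Build both paths from 'where' so the script also works when it is not '.'
        os.rename(os.path.join(where, f), os.path.join(subdir, f))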
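To check which subdirectory a given file will land in before moving anything, the bucket computation can be pulled out into a small function. This is only a sketch in Python 3: the name `bucket_for`, the `per_dir` parameter, and the sample filenames are made up for illustration.

```python
import re

# Same pattern as the script, with the dot before "xml" escaped.
PATTERN = re.compile(r'.*_COMM-([0-9]+)\.xml')

def bucket_for(filename, per_dir=1000):
    """Return the zero-padded subdirectory name, or None if the name doesn't match."""
    m = PATTERN.search(filename)
    if m is None:
        return None
    # Integer division groups IDs into ranges of per_dir
    return "%03d" % (int(m.group(1)) // per_dir)

# Hypothetical filenames, for illustration only
for name in ["dump_COMM-7.xml", "dump_COMM-1000.xml", "dump_COMM-45999.xml", "notes.txt"]:
    print(name, "->", bucket_for(name))
```

Each bucket covers a range of 1000 IDs, which is why no subdirectory ends up with more than 1000 files. IDs of a million or more simply get a four-digit directory name.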
Now on to the random selection, again with Python:
#!/usr/bin/python
import os
import random
import re
import sys

if len(sys.argv) > 1:
    where = sys.argv[1]
else:
    where = '.'  # source directory

# Keep only the all-digit subdirectories created by the previous script
subdirs = [x for x in os.listdir(where) if re.search('^[0-9]+$', x)]
subdir = os.path.join(where, random.choice(subdirs))
print(os.path.join(subdir, random.choice(os.listdir(subdir))))
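One caveat: picking a subdirectory first and then a file inside it gives every subdirectory the same chance, so a file in a sparsely populated subdirectory is more likely to come up than a file in a full one. If that matters, a variant along these lines picks uniformly across all files instead. It is a Python 3 sketch that assumes the same layout (all-digit subdirectory names); `pick_uniform` is a name invented here.

```python
import os
import random
import re

def pick_uniform(where='.'):
    """Pick one file so that every file under the numeric subdirectories is equally likely."""
    candidates = []
    for sub in sorted(os.listdir(where)):
        path = os.path.join(where, sub)
        # Only descend into all-digit directories
        if re.fullmatch(r'[0-9]+', sub) and os.path.isdir(path):
            candidates.extend(os.path.join(path, f) for f in sorted(os.listdir(path)))
    # random.choice raises IndexError if no files were found
    return random.choice(candidates)
```

The trade-off is that this lists every subdirectory on each call, which is slower than the two-step pick on a very large tree.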
A quick shell loop leverages the Python script to grab files and dump them into a repository of test data. Works in zsh and Bash, perhaps others:
for i in {1..250}; do cp "$(./pick_a_file.py sub_dir_with_files)" "/destination/dir/filename_prefix_$(printf '%03d' "$i").xml"; done
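Since every run of the picker is independent, the loop can copy the same source file more than once. If the test set needs to consist of distinct files, the selection and the copy could be done in one go with something like the following. This is a rough Python 3 alternative rather than the method used above; `copy_sample`, its parameters, and the default prefix are placeholders.

```python
import os
import random
import re
import shutil

def copy_sample(src, dst, count, prefix="filename_prefix_"):
    """Copy `count` distinct, randomly chosen files from src's numeric subdirectories into dst."""
    files = []
    for sub in sorted(os.listdir(src)):
        path = os.path.join(src, sub)
        if re.fullmatch(r'[0-9]+', sub) and os.path.isdir(path):
            files.extend(os.path.join(path, f) for f in sorted(os.listdir(path)))
    # random.sample never returns the same element twice; it raises
    # ValueError if count is larger than the number of files found.
    chosen = random.sample(files, count)
    for i, source in enumerate(chosen, start=1):
        shutil.copy(source, os.path.join(dst, "%s%03d.xml" % (prefix, i)))
    return chosen
```

Calling `copy_sample("sub_dir_with_files", "/destination/dir", 250)` would then mirror the shell loop, minus the duplicates.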