Oct 15 2013

UPDATE 2013-10-16: MacPorts now has a mysql_select package that cleanly solves this problem. Run the following and pip will be able to find mysql_config without issue.

sudo port install mysql_select
sudo port select mysql mysql56

Using pip to install MySQL-python (aka MySQLdb) gives the error “EnvironmentError: mysql_config not found” when run on a system where MySQL has been installed via MacPorts. The solution is to tell the installer where mysql_config can be found by appending “mysql_config = /opt/local/bin/mysql_config5” to site.cfg. Assuming use of virtualenvwrapper (highly recommended!):

pip install --no-install MySQL-python
... ignore errors ...
VENV=$(dirname $(dirname $(which python))); echo "mysql_config = /opt/local/bin/mysql_config5" >> $VENV/build/MySQL-python/site.cfg
pip install --no-download MySQL-python
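
Before editing site.cfg, it can help to confirm which mysql_config binary actually exists on the machine. The following is a minimal sketch, not part of MySQL-python: the helper names and the candidate list are assumptions for illustration, and only the /opt/local/bin/mysql_config5 path comes from this post.

```python
import shutil

# Hypothetical candidate list. Only the last entry is taken from this post;
# the bare names are guesses at what may be on PATH.
CANDIDATES = ["mysql_config", "mysql_config5", "/opt/local/bin/mysql_config5"]

def find_mysql_config(candidates=CANDIDATES):
    """Return the first candidate that resolves to an executable, else None."""
    for name in candidates:
        path = shutil.which(name)  # also accepts absolute paths
        if path:
            return path
    return None

def site_cfg_line(path):
    """Format the line that this post appends to site.cfg."""
    return "mysql_config = %s" % path
```

If find_mysql_config() returns a path, site_cfg_line(path) gives the text to append to site.cfg; if it returns None, MySQL is probably not installed where the installer can see it.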
Oct 15 2013

Virtualenvwrapper is a great way to manage Python environments. This is a quick cheatsheet for using it.


Get Python

On OS X:
<install MacPorts>

sudo port install python27
sudo port select python python27

Install standard packages

sudo easy_install pip
sudo pip install virtualenv
sudo pip install virtualenvwrapper
sudo pip install yolk

Configure virtualenvwrapper

In your .zshrc, .bashrc, etc, add:

source $(dirname $(which python))/virtualenvwrapper.sh

In your .zshenv, .bashrc, etc., add:

export WORKON_HOME=~/.virtualenvs

Create .virtualenvs directory:

mkdir -p ~/.virtualenvs

List available environments

workon

Make an environment

Note: environments are stored in your $WORKON_HOME directory (~/.virtualenvs as configured above), so you can issue these commands from anywhere.

mkvirtualenv myproject

Select an environment

workon myproject

Within an environment, pip install packages as usual.
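
To double-check that pip will install into the environment rather than system-wide, you can ask the interpreter itself. This is a rough sketch with a hypothetical helper name; it relies on sys.real_prefix (set by older virtualenv releases) and sys.base_prefix (set by newer ones and by the stdlib venv module).

```python
import sys

def in_virtualenv():
    """Best-effort check for an active virtual environment."""
    # Older virtualenv releases set sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # Newer virtualenv and the stdlib venv make base_prefix differ from prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix

print(in_virtualenv())
```

Run it with the python on your PATH after workon myproject; it should print True there and False in a fresh shell.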

See the Virtualenvwrapper docs for more information.

Feb 21 2012

After my previous adventures in slicing and dicing a huge XML file, I wanted a means to randomly select files. But first, the directory had so many entries it was unwieldy on my laptop. The Python script below divvies the files up into directories of up to 1000 files each. (Adaptable to other contexts via slight tweaking of the filename regex and subdir name generation.)

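import os
import re
where = '.' # source directory

ls = os.listdir(where)
for f in ls:
  # the number captured from the filename decides the subdirectory
  m = re.search(r'.*_COMM-([0-9]+)\.xml', f)
  if m:
    # 1000 files per subdirectory: 0-999 -> 000, 1000-1999 -> 001, ...
    subdir = os.path.join(where, "%03d" % (int(m.group(1)) // 1000))
    try:
      os.mkdir(subdir)
    except OSError:
      pass # subdirectory already exists
    os.rename(os.path.join(where, f), os.path.join(subdir, f))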
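
To make the subdirectory naming concrete, here is a small sketch that isolates that one step. The function name, the bucket_size parameter and the sample filenames are made up for illustration; the pattern is the one used above, with the dot escaped.

```python
import re

# Same filename pattern as the script above, with the dot escaped.
FILENAME_RE = re.compile(r'.*_COMM-([0-9]+)\.xml')

def bucket_name(filename, bucket_size=1000):
    """Return the zero-padded subdirectory name for a file, or None."""
    m = FILENAME_RE.search(filename)
    if not m:
        return None
    return "%03d" % (int(m.group(1)) // bucket_size)

# Hypothetical filenames, for illustration only.
print(bucket_name("docs_COMM-999.xml"))     # 000
print(bucket_name("docs_COMM-1000.xml"))    # 001
print(bucket_name("docs_COMM-123456.xml"))  # 123
print(bucket_name("README.txt"))            # None
```

Adapting the script to another naming scheme mostly means changing FILENAME_RE and, if needed, the "%03d" width.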

Now on to the random selection, again with Python:

import os
import random
import re
import sys

if len(sys.argv) > 1:
  where = sys.argv[1]
else:
  where = '.' # source directory

# consider only the numeric subdirectories created by the script above
subdirs = [x for x in os.listdir(where) if re.search('^[0-9]+$', x)]
subdir = os.path.join(where,random.choice(subdirs))
print(os.path.join(subdir,random.choice(os.listdir(subdir))))

A quick shell loop leverages the Python script to grab files and dump into a repository of test data. Works on ZSH, Bash, perhaps others:

for i in {1..250}; do cp $(./pick_a_file.py sub_dir_with_files) /destination/dir/filename_prefix_$(printf "%03d" $i).xml; done;
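
Because each loop iteration picks independently, the same file can be copied more than once. If distinct files matter, one possible alternative is to sample without replacement in a single Python process. This is a sketch under that assumption, not the approach used above: pick_distinct is a hypothetical helper, and it chooses uniformly over all files rather than first choosing a subdirectory.

```python
import os
import random
import re

def pick_distinct(where, count, rng=random):
    """Return up to `count` distinct file paths from numeric subdirectories."""
    paths = []
    for name in sorted(os.listdir(where)):
        subdir = os.path.join(where, name)
        # only look inside directories whose names are all digits
        if re.search(r'^[0-9]+$', name) and os.path.isdir(subdir):
            paths.extend(os.path.join(subdir, f)
                         for f in sorted(os.listdir(subdir)))
    # sample without replacement; never ask for more than exists
    return rng.sample(paths, min(count, len(paths)))
```

For example, pick_distinct('sub_dir_with_files', 250) returns a list of 250 different paths, or fewer if the directory holds fewer files.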



Apr 12 2011

Yesterday I needed to decipher a log file in which a dozen threads were simultaneously logging messages. Surely there must be tools for this out there. But I couldn’t find one, so I wrote a Python script to indent each line differently based on thread id. I then looked at it in LibreOffice, but just reading on the terminal would have sufficed. Here’s a trivial demo:

2011-04-11 09:40:12,004 [INFO] [http-3] Hello
2011-04-11 09:40:13,554 [DEBUG] [http-1] Wikipedia
(pronounced /ˌwɪkɨˈpiːdi.ə/ WIK-i-PEE-dee-ə)
2011-04-11 09:40:13,605 [INFO] [http-2] PCC Natural Markets
2011-04-11 09:40:13,688 [INFO] [http-3] World
2011-04-11 09:40:14,015 [INFO] [http-2] began as a food-buying club of 15 families in 1953.
2011-04-11 09:40:16,032 [INFO] [http-1] is a multilingual, web-based, free-content encyclopedia project based on an openly editable model
2011-04-11 09:40:17,775 [INFO] [http-2] Today, it's the largest consumer-owned natural food retail co-operative in the United States.


09:40:12,004	Hello
09:40:13,554		Wikipedia
09:40:13,554		(pronounced /ˌwɪkɨˈpiːdi.ə/ WIK-i-PEE-dee-ə)
09:40:13,605			PCC Natural Markets
09:40:13,688	World
09:40:14,015			began as a food-buying club of 15 families in 1953.
09:40:16,032		is a multilingual, web-based, free-content encyclopedia project based on an openly editable model
09:40:17,775			Today, it's the largest consumer-owned natural food retail co-operative in the United States.
            	http-3	http-1	http-2

Just read down each column for a clear chain of events on that thread; the legend on the last line names the thread for each column. Code below is maintained on GitHub.

#!/usr/bin/env python

# Looks at token in a particular position in each line and indents the line
# differently for each unique identifier found in the file. For example, given
# a log file which contains a thread identifier, contents for each thread will
# be separated out into distinct columns.
# Lines not matching the pattern (e.g. stack traces) are presumed to have
# occurred at the time of and belong to the same identifier as the preceding line.
# Default pattern is: <date> <stamp> ignored [thread_id] <message>
# Yielding output: <stamp><tabs><message>
# An alternate regular expression can be supplied on the command line; it must
# include named capture groups 'stamp', 'id', and 'message'. The default regex
# is: ^\S+ (?P<stamp>\S+) \S+ \[(?P<id>[^\]]+)\] (?P<message>.*)
# If the input contains very long lines it can be helpful to truncate them
# beforehand by e.g. piping through awk '{print substr($0,0,400)}'

import sys,re
if len(sys.argv) > 1:
  pattern = re.compile(sys.argv[1])
else:
  pattern = re.compile(r'^\S+ (?P<stamp>\S+) \S+ \[(?P<id>[^\]]+)\] (?P<message>.*)')

delimiter = "\t" # one extra tab per distinct identifier
max_level = 1
categories = {} # identifier -> indent string
legend = None
indent = ""
stamp = ""

try:
  for line in [l.strip() for l in sys.stdin]:
    m = pattern.match(line)
    if m:
      stamp,identifier,message = [m.group(x) for x in ['stamp','id','message']]
      indent = categories.get(identifier)
      if not legend:
        legend = " " * len(stamp)
      if not indent:
        # first time this identifier is seen: give it the next column
        indent = delimiter * max_level
        categories[identifier] = indent
        max_level += 1
        legend += delimiter + identifier
      print(stamp + indent + message)
    else:
      # carry over stamp and indent from previous line
      print(stamp + indent + line)

  print(legend)
except IOError:
  pass # e.g. output pipe closed early
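
To see what the default regex in the header comment captures, here is a short sketch that applies it to two lines from the demo input. The parse_line helper is a name invented for this example.

```python
import re

# Default pattern from the script's header comment.
DEFAULT_PATTERN = re.compile(
    r'^\S+ (?P<stamp>\S+) \S+ \[(?P<id>[^\]]+)\] (?P<message>.*)')

def parse_line(line):
    """Return (stamp, id, message), or None for a continuation line."""
    m = DEFAULT_PATTERN.match(line)
    if m is None:
        return None
    return m.group('stamp'), m.group('id'), m.group('message')

print(parse_line("2011-04-11 09:40:12,004 [INFO] [http-3] Hello"))
# ('09:40:12,004', 'http-3', 'Hello')
print(parse_line("(pronounced /ˌwɪkɨˈpiːdi.ə/ WIK-i-PEE-dee-ə)"))
# None
```

A None result is what makes the script treat a line as a continuation and reuse the previous line's stamp and indent. A custom regex passed on the command line needs the same three named groups.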
Apr 12 2011

Grepping through log files for lines that match a timestamp is fiddly. It’s hard to catch multi-line entries (e.g. stack traces) and to craft a regex that captures an exact time range. I wrote a little Python script to simplify the process.


./by_time.py 9:40 9:44:15 < example.log

Code below is maintained on GitHub.

#!/usr/bin/env python

# Selects time range from a log file. Lines with no time (e.g. stack traces)
# are presumed to have occurred at the time of the preceding line.
# Assumes first time-like phrase on a line is the timestamp for that line.
# Assumes time format is pairs of digits separated by colons with optional , or
# . initiated suffix. E.g. HH:mm:ss,SSS, HH:mm, etc.
# Does not strip blank lines; just use awk 'NF>0' for that.

import sys,re
time_pattern = re.compile(r"(?:^|.*?\D)(\d{1,2}(?::\d{2})+(?:[,.]\d+)?)")
fields_pattern = re.compile("[:,.]")

if len(sys.argv) < 3:
  sys.stderr.write("Please specify start and end times (e.g. %s 13:50 14:10:01,101).\n" % sys.argv[0])
  sys.exit(1)

for item,index in [["start time",1],["end time",2]]:
  if not time_pattern.match(sys.argv[index]):
    raise ValueError("Cannot parse %s: %s" % (item, sys.argv[index]))

# times become lists of ints, e.g. 9:44:15 -> [9, 44, 15], and compare field by field
start,end = [[int(x) for x in re.split(fields_pattern, s)] for s in sys.argv[1:3]]
too_soon = True

try:
  for line in sys.stdin:
    line = line.strip()
    m = time_pattern.match(line)
    if m:
      t = [int(x) for x in re.split(fields_pattern,m.group(1))]
      if t >= end:
        break # past the end of the requested range
      elif too_soon and t >= start:
        too_soon = False

    # lines without a time inherit the state of the preceding line
    if not too_soon:
      print(line)
except IOError:
  pass # e.g. output pipe closed early
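
The range check works because each time is split into a list of integers and Python compares lists element by element. The sketch below pulls that step out on its own; parse_time is a hypothetical helper name, while the two patterns are the ones from the script.

```python
import re

TIME_PATTERN = re.compile(r"(?:^|.*?\D)(\d{1,2}(?::\d{2})+(?:[,.]\d+)?)")
FIELDS_PATTERN = re.compile("[:,.]")

def parse_time(text):
    """Return the first time-like phrase in `text` as a list of ints, or None."""
    m = TIME_PATTERN.match(text)
    if m is None:
        return None
    return [int(x) for x in FIELDS_PATTERN.split(m.group(1))]

start = parse_time("9:40")
end = parse_time("9:44:15")
stamp = parse_time("2011-04-11 09:40:12,004 [INFO] [http-3] Hello")
print(start, stamp, end)     # [9, 40] [9, 40, 12, 4] [9, 44, 15]
print(start <= stamp < end)  # True
```

A shorter list that is a prefix of a longer one sorts first, so 9:40 matches everything from 09:40:00 onward. Note that the fractional part is compared as a plain integer, which is only reliable when every line uses the same number of fractional digits.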