log file

Yesterday I needed to decipher a log file in which a dozen threads were simultaneously logging messages. Surely there must be tools for this out there. But I couldn’t find one, so I wrote a Python script to indent each line differently based on thread id. I then looked at it in LibreOffice, but just reading on the terminal would have sufficed. Here’s a trivial demo:

2011-04-11 09:40:12,004 [INFO] [http-3] Hello
2011-04-11 09:40:13,554 [DEBUG] [http-1] Wikipedia
(pronounced /ˌwɪkɨˈpiːdi.ə/ WIK-i-PEE-dee-ə)
2011-04-11 09:40:13,605 [INFO] [http-2] PCC Natural Markets
2011-04-11 09:40:13,688 [INFO] [http-3] World
2011-04-11 09:40:14,015 [INFO] [http-2] began as a food-buying club of 15 families in 1953.
2011-04-11 09:40:16,032 [INFO] [http-1] is a multilingual, web-based, free-content encyclopedia project based on an openly editable model
2011-04-11 09:40:17,775 [INFO] [http-2] Today, it's the largest consumer-owned natural food retail co-operative in the United States.

becomes:

09:40:12,004	Hello
09:40:13,554		Wikipedia
09:40:13,554		(pronounced /ˌwɪkɨˈpiːdi.ə/ WIK-i-PEE-dee-ə)
09:40:13,605			PCC Natural Markets
09:40:13,688	World
09:40:14,015			began as a food-buying club of 15 families in 1953.
09:40:16,032		is a multilingual, web-based, free-content encyclopedia project based on an openly editable model
09:40:17,775			Today, it's the largest consumer-owned natural food retail co-operative in the United States.
            	http-3	http-1	http-2

Just read down the columns vertical for a clear chain of events on each thread. Code below is maintained on GitHub.

#!/usr/bin/env python

# Looks at token in a particular position in each line and indents the line
# differently for each unique identifier found in the file. For example, given
# a log file which contains a thread identifier, contents for each thread will
# be separated out into distinct columns.
#
# Lines not matching the pattern (e.g. stack traces) are presumed to have
# occurred at the time of and belong to the same identifier as the preceding line.
#
# Default pattern is: <date> <stamp> ignored [thread_id] <message>
# Yielding output: <stamp><tabs><message>
#
# An alternate regular expression can be supplied on the command line; it must
# include named capture groups 'stamp', 'id', and 'message'. The default regex
# is: ^\S+ (?P<stamp>\S+) \S+ \[(?P<id>[^\]]+)\] (?P<message>.*)
#
# If the input contains very long lines it can be helpful to truncate them
# beforehand by e.g. piping through awk '{print substr($0,0,400)}'

import sys,re
if len(sys.argv) > 1:
  pattern = re.compile(sys.argv[1])
else:
  pattern = re.compile('^\S+ (?P<stamp>\S+) \S+ \[(?P<id>[^\]]+)\] (?P<message>.*)')

delimiter='\t'
max_level = 1
categories = {}
legend = None
indent = ""
stamp = ""

try:
  for line in [l.strip() for l in sys.stdin]:
    m = pattern.match(line)
    if m:
      stamp,identifier,message = [m.group(x) for x in ['stamp','id','message']]
      indent = categories.get(identifier)
      if not legend:
        legend = " " * len(stamp)
      if not indent:
        indent = delimiter * max_level
        categories[identifier] = indent
        max_level += 1
        legend += delimiter + identifier
      print stamp + indent + message
    else:
      # carry over stamp and indent from previous line
      print stamp + indent + line

  print legend
except IOError:
  pass

Grepping through log files for lines that match a timestamp is fiddly. It’s hard to catch multi-line entries (e.g. stack traces) and to craft a regex that captures an exact time range. I wrote a little Python script to simpify the process.

Usage:

 <example.log| ./by_time.py 9:40 9:44:15

Code below is maintained on GitHub.

#!/usr/bin/env python

# Selects time range from a log file. Lines with no time (e.g. stack traces)
# are presumed to have occurred at the time of the preceding line.
#
# Assumes first time-like phrase on a line is the timestamp for that line.
#
# Assumes time format is pairs of digits separated by colons with optional , or
# . initiated suffix. E.g. HH:mm:ss,SSS, HH:mm, etc.
#
# Does not strip blank lines; just use awk 'NF>0' for that.

import sys,re
time_pattern = re.compile("(?:^|.*?\D)(\d{1,2}(?::\d{2})+(?:[,.]\d+)?)")
fields_pattern = re.compile("[:,.]")

if len(sys.argv) < 3:
  print >> sys.stderr, "Please specify start and end times (e.g. %s 13:50 14:10:01,101)." % sys.argv[0]
  exit(1)

for item,index in [["start time",1],["end time",2]]:
  if not time_pattern.match(sys.argv[index]):
    raise ValueError("Cannot parse %s: %s" % (item, sys.argv[index]))

start,end = [[int(x) for x in re.split(fields_pattern, s)] for s in sys.argv[1:3]]
too_soon = True

try:
  for line in sys.stdin:
    line = line.strip()
    m = time_pattern.match(line)
    if m:
      t = [int(x) for x in re.split(fields_pattern,m.group(1))]
      if t >= end:
        break
      elif too_soon and t >= start:
        too_soon = False

    if not too_soon:
      print line
except IOError:
  pass

lankycoder

Untangling multi-threaded log files

Snipping log files by time