Tuesday, March 11, 2008

Logs: Parsing, Tokenizing or Extracting?

As you know, I have long been on a quest to save the world from having to write long and ugly regular expressions (regexes) for log analysis. Back in 2005 (post, big discussion that ensued) and again in 2007 (post, another big discussion that ensued), I polled people for approaches that convert logs into useful information without messing with massive quantities of regular expressions, and also did some research of my own. In all honesty, I didn't notice a major breakthrough.

Until now? Here ("prequel" here and follow-up here) is what looks like an interesting and major development along that line. Indeed, one can automate the processing of some "self-describing" log formats (name=value pairs, comma- or tab-delimited with a descriptive header, sequential names and values [yuck!], XML, etc.) to obtain a semblance of structured data (not just a flow of text) from logs without any human involvement.
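To make the idea concrete, here is a minimal Python sketch of what such automated extraction might look like for two of the formats above: name=value pairs and delimited text with a header line. It is not the approach from the linked posts; the function names and the sample log lines are my own assumptions. Note the irony: it still uses a regex, but one generic regex for every log rather than one per log type.

```python
import csv
import io
import re

# One generic pattern for name=value pairs (an assumption about the format):
# the value is either double-quoted or a run of non-space characters.
PAIR_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def extract_pairs(line):
    """Return a dict of the name=value pairs found in one log line."""
    return {name: value.strip('"') for name, value in PAIR_RE.findall(line)}


def extract_delimited(text, delimiter=","):
    """Return one dict per row of delimited text whose first line is a header."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


if __name__ == "__main__":
    # Hypothetical sample data, not taken from any real device.
    print(extract_pairs('action=login user=jsmith msg="access granted"'))
    print(extract_delimited("time,user,action\n10:00,jsmith,login\n"))
```

Neither function knows anything about the device that produced the log, which is exactly the appeal: the field names come from the log itself.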

But is that an endgame, that "holy grail" of log analysis, or yet another step towards it? First, bad logs break it (e.g. names that contain spaces, or values that contain spaces and are not quoted) and thus call for the return of a human logging expert to write an even fancier regex that can deal with them (then again, bad logs often break human-written rules as well). Second, there is a more important issue that I will bring up. If logs contain "user=jsmith" we can certainly learn a new piece of info (that the "user" was probably "jsmith"). But what if they contain "bla_bla=huh_huh" - and we don't know what "bla_bla" and "huh_huh" mean? Do we really have more information at hand if we tokenize it as "object called 'bla_bla' has the value of 'huh_huh'" compared to just having a single blurb of text "bla_bla=huh_huh"? I personally don't think so - but I've been known to be wrong before :-)
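Both problems are easy to demonstrate with a deliberately naive tokenizer. The sketch below is my own illustration (the `tokenize` function and the log lines are hypothetical, not anybody's product): it splits on whitespace and then on the first "=" in each token.

```python
def tokenize(line):
    """Naive tokenizer: split on whitespace, then on the first '=' in each token."""
    fields = {}
    for token in line.split():
        name, sep, value = token.partition("=")
        if sep:  # keep only tokens that actually contain '='
            fields[name] = value
    return fields


if __name__ == "__main__":
    # A well-behaved log line tokenizes cleanly.
    print(tokenize("user=jsmith action=login"))
    # An unquoted value with a space: "Smith" is silently lost.
    print(tokenize("user=John Smith action=login"))
    # Tokenizes fine, but we still have no idea what it means.
    print(tokenize("bla_bla=huh_huh"))
```

The second line comes back with the user as just "John" and no error at all, which is arguably worse than failing loudly. The third line comes back perfectly structured and just as meaningless as before.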

So, let's review what we have. I decided to organize the current approaches to log analysis in the form of this table (hoping to start a discussion!):

| | Text Indexing | Field Extraction (Algorithmic) | Rule-based Parsing (Manual) |
|---|---|---|---|
| Pros | Easy - no human effort needed: just collect the logs and go | Easy - no per-log effort on behalf of the log analyst (but some creative code needs to be written) | High-quality output: tables, graphics, summaries and easy correlation across diverse log sources (highly useful information!) |
| Cons | Output is low-quality information; really just a flow of raw data (needs more analysis) | Mixed - some new information emerges, but not in all cases (and you can't predict when). In general, no cross-device analysis is enabled ('user' is not the same as 'usr' in another log) | Hard - an expensive logging expert must first understand the logs and then write the rules; normalization across devices implies having a uniform data store for logs |
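The 'user' versus 'usr' problem is where the manual part creeps back in. As a rough sketch (the alias table below is entirely made up for illustration), cross-device normalization might start as nothing more than a hand-maintained mapping from each device's field names to one canonical name:

```python
# Hypothetical alias table: device-specific field name -> canonical field name.
# Somebody who understands the logs has to write and maintain this by hand.
FIELD_ALIASES = {
    "usr": "user",
    "username": "user",
    "src_ip": "src",
    "source": "src",
}


def normalize(fields):
    """Rename already-extracted fields to canonical names; unknown names pass through."""
    return {
        FIELD_ALIASES.get(name.lower(), name.lower()): value
        for name, value in fields.items()
    }


if __name__ == "__main__":
    # Two made-up devices reporting the same event with different field names.
    print(normalize({"usr": "jsmith", "src_ip": "10.1.1.1"}))
    print(normalize({"User": "jsmith", "source": "10.1.1.1"}))
```

Both calls produce the same canonical record, so the two devices can finally be correlated. The code is trivial; the expensive part is the expert knowledge that goes into the table.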

So, what can we conclude? It is too early to retire the human-written rules (so people will still have '\s' and '\w' coming up in bad dreams... :-)), but this automated approach should definitely be used on the logs that will "allow you to do it to them." :-) Personally, I am also very happy that somebody is thinking about such matters ...

Comment away!

Dr Anton Chuvakin