Thursday, March 29, 2007

How to Analyze a Trillion Log Messages?

Somebody posted a message to a loganalysis list seeking help with analyzing a trillion log messages. Yes, you've heard it right - a trillion. Apart from some naive folks suggesting totally unsuitable vendor solutions, there was one smart post from Jose Nazario (here), which implied that the original poster would need to write some code himself. Why?

Here is why (see also my post to the list): assuming 1 trillion records of 200 bytes each, a typical PIX log message size (a bit optimistic, in fact), we are looking at roughly 180TB of uncompressed log data. And we need to analyze it (even if we are not exactly sure for what; hopefully the poster himself knows), not just store it.
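That 180TB figure is just arithmetic; here is the back-of-envelope check, using the 200-byte average from above:

```python
records = 10**12            # one trillion log messages
avg_record_bytes = 200      # rough PIX syslog message size, per the estimate above
total_bytes = records * avg_record_bytes

# Convert to binary terabytes (2**40 bytes each)
tib = total_bytes / 2**40
print(f"{tib:.0f} TiB uncompressed")  # about 182 TiB
```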

Thus, I hate (ehh, make it "have" :-)) to admit that Jose is probably right: writing purpose-specific code might be the only way out. About a year ago, there was a discussion titled "parsing logs ultra-fast inline" on the firewall-wizards list about something very similar. We can look up some old posts by Marcus Ranum there for useful tips on super-fast but purpose-specific log processing.

For example, here he suggests a few specific data structures to "handle truly ginormous amounts of log data quickly" and concludes that "this approach runs faster than hell on even low-end hardware and can crunch through a lot of logs extremely rapidly." One of the follow-ups really hits the point that I am making here and in my post: "if you put some thought into figuring out what you want to get from your log analysis, you can do it at extremely high speeds." A few more useful tips are added here.
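The idea boils down to: decide the one question you want answered, then use the cheapest possible code path to answer it - a substring rejection test plus a counter, not a full parser. A minimal sketch of that in Python (the "Deny" substring check, the field extraction, and the sample lines are illustrative assumptions, not a real PIX parser):

```python
from collections import Counter

def count_denied_sources(lines):
    """Tally denies per source IP with only a substring test and a
    split -- no regex, no full parse of every message."""
    denies = Counter()
    for line in lines:
        # Cheap rejection test first: skip the vast majority of
        # messages without doing any further work on them.
        if "Deny" not in line:
            continue
        # Illustrative extraction: grab the first token that looks
        # like an IPv4 address; adjust to your actual log format.
        for token in line.split():
            if token.count(".") == 3:
                denies[token.split(":")[-1].split("/")[0]] += 1
                break
    return denies

# Hypothetical PIX-style sample lines, just to show the shape:
sample = [
    "%PIX-4-106023: Deny tcp src outside:10.1.2.3/4711 dst inside:192.0.2.9/80",
    "%PIX-6-302013: Built outbound TCP connection",
    "%PIX-4-106023: Deny tcp src outside:10.1.2.3/4712 dst inside:192.0.2.9/22",
]
print(count_denied_sources(sample))  # 10.1.2.3 seen twice
```

The point is the shape of the loop, not the details: one pass, one dictionary, and most lines dismissed after a single substring check.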

So, nothing much we can do here - you are writing some code, buddy :-) And, as far as tips are concerned, here is the "strategy" to solve it:

1. figure out what you want to do

2. write the code to do it

3. run it and wait, wait, wait ... possibly for a long time :-)
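How long is "a long time"? A quick estimate (the 200 MB/s sequential read rate is my assumption for a single commodity disk stream; your hardware may differ):

```python
total_bytes = 180 * 10**12   # ~180 TB of uncompressed logs, per the estimate above
scan_rate = 200 * 10**6      # assumed 200 MB/s sequential read, single stream

seconds = total_bytes / scan_rate
days = seconds / 86_400
print(f"{days:.1f} days for a single sequential pass")  # roughly 10 days
```

Which also tells you the obvious cheat: split the data across N disks or machines and the wait divides by roughly N.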

Indeed, there are many great general purpose log management solutions on the market. However, we all know that there is always that "ginormous" amount of data that calls for custom code, optimized for the task.

Dr Anton Chuvakin