Thursday, March 29, 2007

How to Analyze a Trillion Log Messages?

Somebody posted a message to the loganalysis list seeking help with analyzing a trillion log messages. Yes, you've read that right - a trillion. Apart from some naive folks suggesting totally unsuitable vendor solutions, there was one smart post from Jose Nazario (here), which implied that the original poster would need to write some code himself. Why?

Here is why (see also my post to the list): assuming 1 trillion records of 200 bytes each, which is a typical PIX log message size (a bit optimistic, in fact), we are looking at roughly 180TB of uncompressed log data. And we need to analyze it (even if we are not exactly sure for what - hopefully the poster himself knows), not just store it.
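For those who want to double-check, here is a quick back-of-the-envelope calculation (the 1 trillion count and the 200-byte average are, of course, assumptions):

```python
# Back-of-the-envelope size estimate (assumed: 1e12 records, ~200 bytes each)
records = 10 ** 12
avg_bytes = 200
total_tib = records * avg_bytes / 2 ** 40   # bytes -> tebibytes
print(f"~{total_tib:.0f} TB of uncompressed log data")   # prints ~182 TB
```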

Thus, I hate (ehh, make it "have" :-)) to admit that Jose is probably right: writing purpose-specific code might be the only way out. About a year ago, there was a discussion titled "parsing logs ultra-fast inline" on the firewall-wizards list about something very similar. We can look up some old posts by Marcus Ranum for useful tips on super-fast but purpose-specific log processing.

For example, here he suggests a few specific data structures to "handle truly ginormous amounts of log data quickly" and concludes that "this approach runs faster than hell on even low-end hardware and can crunch through a lot of logs extremely rapidly." One of the follow-ups really hits the point that I am making here and in my post: "if you put some thought into figuring out what you want to get from your log analysis, you can do it at extremely high speeds." A few more useful tips are added here.
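To make "purpose-specific" concrete, here is a minimal sketch in the spirit of those posts - not Marcus's actual code, just an illustration. It answers exactly one pre-decided question (top denied source IPs) and throws away everything else as early as possible. The PIX-style "Deny ... from <ip>" pattern is an assumption, and a real run over 180TB would use C with fixed-position field extraction rather than a regex, but the shape of the code stays the same:

```python
#!/usr/bin/env python
"""Purpose-specific log cruncher sketch: top denied source IPs.

Assumption: PIX-style 'Deny' messages containing 'from <ip>' somewhere in
the line; adjust the pattern to your own log format.
"""
import re
import sys
from collections import Counter

DENY_RE = re.compile(r"Deny .*? from (\d+\.\d+\.\d+\.\d+)")  # assumed format

def top_denied_sources(stream, n=20):
    counts = Counter()
    for line in stream:
        if "Deny" not in line:        # cheap reject before the regex runs
            continue
        m = DENY_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts.most_common(n)

if __name__ == "__main__":
    for ip, hits in top_denied_sources(sys.stdin):
        print(f"{hits:12d}  {ip}")
```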

So, there is not much else we can do here - you are writing some code, buddy :-) And, as far as tips are concerned, here is the "strategy" to solve it:

1. figure out what you want to do

2. write the code to do it (sketched below)

3. run it and wait, wait, wait ... possibly for a long time :-)
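As a hedged illustration of steps 2 and 3, the sketch below assumes step 1 produced a single question ("how many deny events per archived file?") and that the archive is a directory of gzip-compressed files - both assumptions, so adjust to your environment. Spreading files across CPU cores is about the only generic speed-up available; the rest really is just waiting:

```python
#!/usr/bin/env python
"""Step 2 sketch: answer ONE pre-decided question over a pile of log files.

Assumed question: how many 'Deny' lines per file?
Assumed layout: gzip-compressed files named *.log.gz under LOG_DIR.
"""
import glob
import gzip
import os
from multiprocessing import Pool

LOG_DIR = "/var/log/archive"          # assumed archive location

def count_denies(path):
    """One pass over one file, counting only what the question needs."""
    hits = 0
    with gzip.open(path, "rt", errors="replace") as f:
        for line in f:
            if "Deny" in line:
                hits += 1
    return os.path.basename(path), hits

if __name__ == "__main__":
    files = glob.glob(os.path.join(LOG_DIR, "*.log.gz"))
    with Pool() as pool:              # spread files across CPU cores
        for name, hits in pool.imap_unordered(count_denies, files):
            print(f"{hits:12d}  {name}")
```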

Indeed, there are many great general-purpose log management solutions on the market. However, we all know that there is always that "ginormous" amount of data that calls for custom code, optimized for the task.

5 comments:

Anonymous said...

And then there is always visualization, once you have cut down the original log volume a bit. Viz won't help you process events or log entries really fast, but once you have reduced the size a bit, you will be able to get an understanding of what you are dealing with.
Another approach would be to visualize a small portion of your original logs to get a handle on what you are dealing with in the first place. Then the next step would be to develop a script (i.e., a filter) to reduce the amount of input logs to what you are interested in.

Andrew Hay said...

Hey Anton,

When I first learned of the volume of data that this person wanted to work with, the first thought that went through my mind was "why so many logs?"

Granted, the poster works for Ernst & Young and I'm sure there are massive amounts of logs to work with, but I have a hard time believing that the devices in question are capable of logging back to one central point easily.

I suspect that the question was brought forward as the result of a senior management decision to centralize logging. I also think the total count of logs may be based on all devices deployed globally, but I could be wrong.

Anton Chuvakin said...

Raffy, don't be ridiculous :-)

Trillion messages and visualization just don't mix....

And using "Another approach would be to visualize a small portion of your original logs " approach makes the whole problem really simple: ah, just look at a subset. WHICH subset?

Also, how is one supposed to "reduce the amount of input logs to what you are interested in"? This is kinda the whole question here, isn't it?

Anonymous said...

Anton,

I would slightly disagree with you.
If you know what it is you want to get out of the analysis, then you know which initial data subset is relevant to the outcome, right? Especially if there is no need for cross-correlation, and you know which message format your relevant data subset has or which constant data attribute(s) it contains, you can always filter out the rest of the data and save processing time.
If there is a need for cross-correlation, and especially for heuristic search rather than rule-based correlation, then, I would say, there is no data subset in the original logs that we are interested in, and such a subset needs to be created via the normalization process before the analysis begins.

Cheers,
Dmitri

Anton Chuvakin said...

Well, maybe you are right; in this case, we just didn't know what he wanted - just to "analyze the messages" somehow...
