Tuesday, January 15, 2008

I Should Really Not Touch This ....

... I really should not. But - darn it! - how can I miss a potential blog fight related to log management?

So, it seems like Raffy baited some poor folks from Prism with his post on "IT search" (what an abomination of a term!). But, seriously, "IT search" is a marketing term (nothing wrong with that, BTW!), so it will mean whatever the folks who coined it feel at any given moment. I really hate it when folks try to argue objectively with a clear fluke.

I think this debate is mostly about two approaches to logs: collect and parse some logs (typical SIEM approach) vs collect and index all logs (like, ahem, "IT search").

You can see where this one is going, right? :-)

Yes, Virginia! You do need to do BOTH - and you know who does both? LogLogic!


Anonymous said...

I'm with you on this one Anton, both activities are important in a log management solution. You need to parse and heavily utilize the most important (and useful) messages for you, but also store and make available via search or some other facility the full text of every log you've collected.

But, you know, LogLogic isn't the only vendor who does that *wink*

Anton Chuvakin said...

Well, I think I know who you mean :-) But not just "store and make available via search"! Indexing is important since it makes the difference between 'google-speed' and 'grep-speed' which may well be 10000 times different...
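To make the 'google-speed' vs 'grep-speed' point concrete, here is a toy sketch (not any vendor's actual engine): a linear scan reads every line on every query, while an inverted index pays the cost once at build time and then answers with a dictionary lookup. The sample log lines and the function names are made up for illustration.

```python
from collections import defaultdict

def grep_scan(lines, term):
    """Linear scan: touches every line on every query, like grep."""
    term = term.lower()
    return [i for i, line in enumerate(lines) if term in line.lower().split()]

def build_index(lines):
    """Toy inverted index: token -> set of line numbers, built once."""
    index = defaultdict(set)
    for i, line in enumerate(lines):
        for token in line.lower().split():
            index[token].add(i)
    return index

def index_search(index, term):
    """Lookup cost depends on the number of hits, not on total log volume."""
    return sorted(index.get(term.lower(), ()))

# Hypothetical log lines, for illustration only.
logs = [
    "sshd failed password for jsmith",
    "sshd accepted password for alice",
    "kernel link up eth0",
]
index = build_index(logs)
```

Both `grep_scan(logs, "password")` and `index_search(index, "password")` return the same line numbers; the difference is that the scan gets slower as the logs grow, while the lookup mostly does not.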

Anonymous said...

Hmm, I'm not sure how requisite indexing is. I've done some testing over here, and while indexing makes it easier to produce a fast search over larger sets of data, good ol' grep works faster against smaller sets. (Compared to, say, cLucene using 1MB of input data. And, by grep, I mean a fairly optimized Perl script with some RE's =)

Not to say that indexing in general isn't a requisite part of any long-term archiving+searching solution, but that you can skin that cat a few ways and still end up with a fast, hairless feline!

I'm still confused though, are people actually arguing that you should have a product that doesn't actually store the raw logs somewhere? If so - I'm not sure you could effectively call that 'log management' -- 'log analysis', maybe, but certainly not managing the logs themselves.

Anton Chuvakin said...

>good ol' grep works faster against
>smaller sets.

Ah, come on. Such small data sets are not interesting at all. Don't think 'grep+file', think Google + internet. Do you want to do the internet searching by grepping all the webpages? :-)

If you move into GB sizes, grep loses.

>are people actually arguing that
>you should have a product that
>doesn't actually store the raw
>logs somewhere

Not much nowadays, but some SIEM people still say that at times ...

Anonymous said...

I don't even know where to start ... [Anton, you knew I would comment on this ;)]

1. About Chris' testing. Don't use Lucene. We (Splunk) built our own time-optimized index that performs much better and supports our search needs. But that's just a side comment.
2. IT search != grep - I agree with Anton's comment about marketing terms and all. However, there is actually something else behind IT search that you guys are ignoring. IT search has to do with dynamic schemas. You want to be able to build your "top users" report. How can you do that if you don't apply some sort of schema? You can't. You have to somehow parse out the user names from your data. Therefore, it is crucial to have a schema, but IT search imposes the schema when it is needed (at search time) and only on the data that needs it. It does not waste processing power and storage to support a static schema that causes a lot of headaches.
3. IT search supports multi-line data and files and not just single-line log records! I challenge you guys to index configuration files with any of the log management solutions. Good luck!
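
The search-time schema idea from point 2 can be sketched roughly as follows. This is an illustration only, not how Splunk actually implements it: events are stored raw, and a field-extraction regex is applied only when the "top users" report is requested. The sample events, the `user=` pattern, and the function names are all assumptions.

```python
import re
from collections import Counter

# Hypothetical raw events; nothing is parsed when they are stored.
raw_events = [
    "Jan 15 10:01:02 host1 sshd: login user=jsmith src=10.0.0.5",
    "Jan 15 10:01:09 host2 app: request user=alice path=/home",
    "Jan 15 10:02:11 host1 sshd: login user=jsmith src=10.0.0.7",
]

def extract_field(events, pattern):
    """Apply the extraction regex at search time; skip events that do not match."""
    rx = re.compile(pattern)
    for event in events:
        match = rx.search(event)
        if match:
            yield match.group(1)

def top_values(events, pattern, n=10):
    """Build a 'top N' report from whatever the pattern extracts."""
    return Counter(extract_field(events, pattern)).most_common(n)

# The "schema" is just this pattern, supplied when the report is run.
top_users = top_values(raw_events, r"\buser=(\w+)")
```

Changing the report means changing the pattern, not re-parsing or re-storing the data, which is the trade being argued over here.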

Does this make sense?

Anton Chuvakin said...

Ah, this is fun.

First, I admit to this post being a bit of "Raffy-bait" :-)

>IT search has to do with dynamic

Raffy, please finish your SIEM vs "IT search" post series :-) Otherwise, dynamic schemas will sound like poor normalization. If you have a schema per search output, you'd never be able to show 'top users' across all log sources...

>It does not waste processing power
> and storage to support a static

That is obviously smart - but think what you lose when you forfeit a static schema: easy reports across log sources...

>IT search supports multi-line data
> and files and not just
>single-line log records

What you meant to say is that spl was the first to index multi-line logs. Big deal! Others will update (or have updated) their tools to do the same. There is nothing inherent in "IT search" that allows it to analyze multi-line logs. Thus, this is not really an advantage.

Anonymous said...

Anton, now now - we certainly think in 'google' mode (remember, we have all of our customers' log and IDS data - we don't deal with just one customer, but all - at once =). My point was, sometimes we get trained to think in one way and ignore some obvious tests and their conclusions.

Raffy, we've developed our own tech. as well - obviously there's only so much detail I can go into, but suffice to say we'll have no issue searching several years' worth of log data - fulltext or fully normalized, in "real-time" (i.e. "google-time").

Now, here's a question I have - it's been a while since I've played with splunk, but last time I did, it looked like the 'dynamic schemas' were simply a matter of text-based tokenization and then requiring the user to identify the type of the tokens found. If this is the case, isn't that "dynamic schema" just a "user-defined" schema?

Anton Chuvakin said...

>isn't that "dynamic schema" just a
>"user-defined" schema

In fact, it is suckier than that; in some cases 'dynamic schema' just means 'no schema': such a system will not know that usr=jsmith and user=jsmith mean the same thing. "Dynamically", they are different things, since 'user' is not the same as 'usr'.

There is only so much you can automate in metadata extraction. A human analyst brain is, sadly, still needed :-)
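
The usr/user problem can be shown with a rough sketch. It assumes a hand-maintained alias table, which is exactly the human analyst input mentioned above; the table, the sample events, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed alias table, curated by a human: raw key -> canonical field name.
FIELD_ALIASES = {"usr": "user", "username": "user", "user": "user"}

KV = re.compile(r"(\w+)=(\S+)")

def normalize(event):
    """Parse key=value pairs and map known aliases onto one canonical name."""
    fields = {}
    for key, value in KV.findall(event):
        key = key.lower()
        fields[FIELD_ALIASES.get(key, key)] = value
    return fields

# Hypothetical events from two different log sources.
events = ["fw1 action=deny usr=jsmith", "web1 status=200 user=jsmith"]

# Without the alias table, 'usr' and 'user' are counted as unrelated keys.
raw_keys = Counter(key for event in events for key, _ in KV.findall(event))

# With it, both events count toward the same user.
users = Counter(normalize(event)["user"] for event in events)
```

The tokenizer finds the key=value pairs on its own, but nothing in the data says that `usr` and `user` are the same field; that knowledge has to come from the alias table.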

Dr Anton Chuvakin