As I am going through my backlog of topics I wanted to blog about (but didn’t have time for the last 4-6 months), this is the one I really wanted to explore. Here is the scenario:
- Something blows up, hits the fan, starts to smell bad, <insert your favorite incident metaphor> … either in your IT environment or at one of your clients’
- Logs (mostly) and other evidence is taken from all the components of the affected system and packaged for offline analysis
- You get a nice 10MB-10GB pile of juicy log data – and they wants “answers”
- What do you do FIRST? With what tools?
Let’s explore this situation. I know most of you would say “just pile’em into splunk” and, of course, I will do that. However, that is not a full story. As I point out in this 2007 blog post (“Do You Enjoy Searching?”), to succeed with search you need to know what to search for. At this point of our incident investigation, we actually don’t! Meanwhile, the volume of log data beyond a few megabytes makes “trial and error” approach of searching for common clues fairly ineffective.
If you received any hints with the log pile (“I think the user ‘jsmith’ did it” or “it seems like 10.1.3.2 IP was involved”), then you can search for this (and then branch out to co-occurring and related issues and drill-down as needed), but then your investigation will suffer from “tunnel vision” of only seeing this initially reported issue and that is, obviously, a bad idea.
Let’s take a step back and think: what do we want here? what is our problem? We want a way to explore ALL the logs in a pile, across log types, across devices, across all time AND then also following a timeline of events. In other words, we ain’t in “searchland” here, buddy…
If you have an enterprise SIEM sitting around (and one with well-engineered support for diverse historical log imports – which is NOT a certainty, BTW), you should definitely load the logs there as well. I like this approach since you can then run cross-device summary reports over the entire set, slice the set in various ways (type of log source, log source instance, type of log entry – categorized, time period filter, time trend, etc) and data visualization tools (treemaps, trend lines, link maps, and other advanced visuals on parsed, normalized and categorized) help get a big picture view of our pile.
Looking at the open source log tools, does anything look promising for the task? OSSIM can do the trick (even though their historical log import is not my favorite), but nothing else does. In some cases, I used sawmill (free trial) for my “big picture first look”, but it is not cross-device and only shows reports for each log type individually. If I were feeling really adventurous (and was on hourly billing), I could actually send all the logs via a syslog streamer into OSSEC (in order to see the log entries the tool will flag as interesting/alertable), but this is not really something I’d enjoy doing. I am almost tempted to say that you can use something like afterglow, but it relies on parsed data that you’d sill need to cook somehow (such as again using a SIEM). Log2timeline is useful, but only for one dimension – and the one that splunk actually addresses pretty well already.
To generalize, you need (a) a search tool and (b) an exploration tool. The search tool should help you quickly answer SPECIFIC questions. The exploration tool should use data to generate “hints” on WHAT questions you should start asking…