Tuesday, April 11, 2017

Why Your Security Data Lake Project Will FAIL! [BACKUP FROM DEAD GARTNER BLOG]

NOTICE: after Gartner killed ALL blogs in late 2023, I am trying to salvage (via archive.org) some of the most critical blogs I've written while working there, and repost them with backdates here, for posterity. This one reminds the world that I was a huge skeptic of security data lakes.


 Beats me, but for some reason organizations think that they can build A SECURITY DATA LAKE and/or their own CUSTOM BIG DATA SECURITY ANALYTICS tools. Let me tell you what will happen – it will FAIL.

Cue the data swamp jokes. Mention data pond scum. Discuss pissing in the data pool. The result is the same – it likely won’t work.

OK, let me tone this down a bit – it will be successful (however this is defined) for 0.1% of those who try [the percentages are approximate and are meant to increase the dramatic impact of this post, not to share data]

Why am I so adamant about it? During our UEBA research we encountered several organizations that are migrating from DIY/custom security analytics to COTS (typically to UEBA as it has matured). What truly shocked us was that some organizations reported that they did have a custom security analytics project running for a few years – but it is now being shut down due to “huge effort – low value” combination. What was even more shocking, some of the organizations were essentially in a “Fortune 50” class, presumably those global technology elites. It didn’t even work for [some of] them…. The QotD [modified to remove any possible relation to the client] was “we wish we’d never discovered Hadoop – we wasted years of trying to make a security analytics capability out of it.”

Motivated by the use of cheap hardware, reduced data redundancy (store one copy – so wow!) and promise of advanced analytics they went for it … and mostly FAILED.

Some of the reasons for failure or relative lack of success included:

  1. Dirty data – you throw stuff in and then cannot use it; a #1 “fail-cause” (great story about it)
  2. Trouble with collecting data – SIEM vendors spent 10+ years debugging their collectors for a reason…
  3. Trouble with accessing data – data went in – plonk!- and now nobody knows how to get it out to do analysis (great story here)
  4. No value beyond collection – the data lake was created and filled with data, so it is there just in case, but any subsequent project phases stumbled
  5. No value beyond keyword search – data lake was created to enable advanced analytics, but ultimately delivered only basic keyword search of logs
  6. No threat detection value – this happened when somebody hired a big data company to build a security data lake; they build all the plumbing and said “ah, security use cases? you do it!” and left
  7. Failure to conceptualize and define the security analytics use cases – OK, we will now detect threats… OK, how? Well, nobody knows and no time to experiment. And see #1 – dirty data
  8. Security analytics use case design much harder then expected
  9. Much higher bar for analytics and big data expertise talent requirements and failure to acquire said talent.

(note that some are overlapping and/or related)

As we say here, “Given the simplicity of the technical characteristics of a data lake, it shouldn’t come as a surprise that getting value out of this concept is entirely dependent on the availability of advanced programming and analytics skills.” For security, you also need to add threat analysis skills to the mix.

In essence, the only successful project type (and this is not really security analytics, not by a long shot) was “install ELK, throw logs in, search for keywords.” This works well, but this is NOT what they aspired for – not even close. Not even in the same realm.

To conclude, successful custom big data security analytics efforts remain rare outliers, like a flying car. My 2012 post was full of hope – and sadly it didn’t work out. At this point, it is very clear to me that DIY or open source is NOT the way to go for security analytics. Sure, we will continue watching both Spot and Metron, but frankly at this point I am a skeptic.

So, short summary: open source – based log aggregation – sure, custom security analytics – only worked well for a very select few. If you still want to try, feel free to review this for some ideas (if you read it, provide feedback here!). It seems like this document will NOT be updated anytime soon…

Related blog posts on security analytics:

Dr Anton Chuvakin