Tuesday, October 24, 2023

Detection Engineering is Painful — and It Shouldn’t Be (Part 1) [Medium Backup]

 This blog series was written jointly with Amine Besson, Principal Cyber Engineer, Behemoth CyberDefence and one more anonymous collaborator.

This post is our first installment in the “Threats into Detections — The DNA of Detection Engineering” series, where we explore opportunities and shortcomings in the brand new world of Detection Engineering.

Detection Engineering Defined

As many of you already know, detection engineering is the process of building, refining and managing detection content (rules, content, code, logic — whichever word is most suitable for you). It is a relatively new discipline, but it is rapidly gaining importance as the threat landscape becomes increasingly complex and (for top-tier threat actors) more targeted to each environment.

It rewrites the classic SOC-building handbook by putting an emphasis on detection quality from the start, and dedicates capacity directly to the content engineering concern (as we say multiple times in our Autonomic Security Operations materials). Detection engineers also work closely with other security teams, such as threat intelligence and incident response, to ensure that their detections are developed quickly and work well.

It is a challenging field that requires a deep understanding of both security and software engineering. On the security side, detection engineers need to be able to identify and understand the latest threats and attack techniques. They also need to be able to develop and maintain detection rules and signatures that can accurately identify these threats. On the software engineering side, detection engineers need to be able to develop and maintain detection systems that are scalable, reliable, and efficient. They also need to be able to tune and optimize these systems to reduce false positives and negatives.

There are many broad challenges that detection engineers face, including:

  • Messy threat landscape: The tactics and tools that attackers use are constantly changing, making it difficult to keep up with the threats. Understanding attack chains performed by threat actors is complex work that needs to be performed both quickly and reliably (false negatives and/or false positives can ruin the program). Naturally, attackers also have an interest in not being detected, at least in some cases.
  • The need for speed: Detection systems need to be able to identify threats quickly in order to minimize the damage they cause — new detection content needs to be rolled out ASAP once threat intelligence is received. All this should work in the context of increasing volumes of telemetry data, without incurring engineer burnout.
  • The complexity of data and systems: Detection systems need to be able to process large amounts of data from a variety of sources, including network traffic, cloud services and endpoint data. When logs are not parsed properly, or don’t contain the required data, detection quality suffers: garbage in, garbage out. Reliable and fast detection engineering across the IT “layer cake” (from mainframes to containers) present in many environments is not getting easier. The age of “I just need to know a few popular Windows event IDs to detect” is long over.

With capable threat actors mastering both living-off-the-land (LOL) techniques and custom malware, and with the realization that a lean, reproducible, and efficient workflow is required to go from threat to detections, we have seen more signs of the evolution from generic security analysts who operate on canned or lightly tuned detection content to more dedicated roles like threat hunters and detection engineers. We are also seeing further signs of the common L1/L2/L3 concept progressively dissolving, which moves the role of the detection engineer to the front of the detection battle.

Things As They Stand

At many companies, Detection Engineers started to be differentiated as a separate role 2–3 years ago (and, yes, we have seen organizations that had similar roles for a decade or more), and while they’re now in demand, they are still somewhat of a rare breed.

In some organizations, an L3 analyst tends to be designated to do the same activities as detection engineers, but also has the double (triple if they also hunt) duty of performing investigations, or hunting threats without always plugging the results into new detections. This is less optimized than having fully dedicated capacity, but it is indicative of the push toward conscious detection content development (it is perhaps better for an overburdened L3 to do this compared to … nobody, at least as a transition stage).

On the other hand, threat intelligence teams (outside of SOC) have had a hard time scaling to detection engineering needs. This often leaves detection engineers having to do the bulk of the threat research required to build detection backlog items, and sometimes even having to select intelligence sources themselves. Ideally, this process would be collaborative, to allow a continuous and smooth handover of actionable CTI data to detection engineering teams.

In the next part, we will cover scaling the intelligence to detection content pipeline.

UPDATE: the story continues in “Detection Engineering and SOC Scalability Challenges (Part 2)”

Related blogs:

Detection Engineering and SOC Scalability Challenges (Part 2) [Medium Backup]

 This blog series was written jointly with Amine Besson, Principal Cyber Engineer, Behemoth CyberDefence and one more anonymous collaborator.

This post is our second installment in the “Threats into Detections — The DNA of Detection Engineering” series, where we explore the challenges of detection engineering in more detail — and where threat intelligence comes into play (and where some hope appears … but you need to wait for Part 3 for that!)

Contrary to what some may think, detection and response (D&R) success is more about the processes and people than about the SIEM. As one of the authors used to say during his tenure at Gartner, “SOC is first a team, then a process and finally a technology stack” (and he just repeated this at mWISE 2023). And here is another: “A great team with an average SIEM will run circles around the average team with a great SIEM.”

SIEMs, or whatever equivalent term you may prefer (A security data lake perhaps? But please no XDR… we are civilized people here), are essentially large scale telemetry analysis engines, running detection content over data stores and streams of data. The signals they produce are often voluminous without on-site tuning and context, and won’t bring value in isolation and without the necessary process stack.

It is the complex cyber defenders’ knowledge injected at every step of the rule creation and alert (and then incident) response process that is the real value-add of a SOC capability. Note that some of the rules/content may be created by the tool vendor while the rest is created by the customer.

So, yes, process is very important here, yet under the shiny new name of TDIR (Threat Detection and Incident Response) lies essentially a creaky process stack riddled with inefficiencies and toil:

  • Inconsistent internal documentation — and this is putting it generously; enough SOC teams run on tribal knowledge that even an internal Wiki would be a huge improvement for them.
  • Staggered and chaotic project management — SOC project management that is hard to understand and improve, a release/delivery process that is completely irregular, and traceability that is often lost midway through the operational noise.
  • No blueprint to do things consistently — before we talk automation, let’s talk consistency. And this is hard with an ad hoc process that is reinvented every time…
  • No automation to do things consistently and quickly — once the process is clear, how do we automate it? The answer often is “well, we don’t”; anyhow, see the item just above…
  • Long onboarding of new log sources — while the 1990s are over, the organizations where a SOC needs to shove paper forms deep inside some beastly IT organization to enable a new log source have not vanished yet.
  • Low awareness of removed or failed log sources — SOCs with low awareness of removed or failed log sources are at risk of missing critical security events and failed — worse, quietly failed — detections.
  • Large inertia to develop new detection content, low agility — if you turn an annual process into a quarterly one, but what you need is a daily response, have you actually improved things?
  • Inscrutable and unmaintainable detection content — if the detection was not developed in a structured and meaningful way, then both alert triage and further refinement of detection code will … ahem … suffer (this wins the Understatement of the Year award).
  • Technical bias, starting from available data rather than threats — this is sadly very common at less-mature SOCs. “What data do we collect?” tends to predate “what do we actually want to do?” despite the “output-driven SIEM” concept having been invented before 2012 (to be honest, I stole the idea from a Vigilant consultant back in 2012).

While IT around your SOC may live in the “future” world of SRE, DevOps, GitOps and large scale automation, releasing new detections to the live environment is, surprisingly, often heavy on humans, full of toil and friction.

Not only is it often lacking sophistication (copy pasting from a sheet into a GUI), but it is also not tracked or versioned in many cases — which makes ongoing improvement challenging at best.

Some teams have made good progress toward automation by using detection-as-code, but adoption is still minimal. And apart from a handful of truly leading teams, it is often limited to deploying vendor-provided rules or code from public repositories (ahem, “detection as code written by strangers on the internet”, if you’d like…). As a result, it then poses a real challenge of reconciling internal and external rule tracking.
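To make the detection-as-code idea concrete, here is a minimal sketch (all field names, the query syntax and the CI checks are illustrative assumptions, not any specific vendor’s format) of a rule kept as a version-controlled, structured object that gets validated before release:

```python
# Minimal sketch of "detection as code": rules live in version control as
# structured objects and are validated before deployment. Field names and
# the example query language are illustrative assumptions, not a real format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DetectionRule:
    rule_id: str                  # stable identifier, used to reconcile internal/external rules
    title: str
    query: str                    # backend-specific detection logic (SIEM query, etc.)
    attack_techniques: List[str] = field(default_factory=list)  # e.g. ["T1547.001"]
    severity: str = "medium"
    references: List[str] = field(default_factory=list)         # intel items that drove this rule
    false_positive_notes: str = ""

def validate(rule: DetectionRule) -> List[str]:
    """Return a list of problems; an empty list means the rule can be released."""
    problems = []
    if not rule.rule_id or not rule.query:
        problems.append("rule_id and query are mandatory")
    if not rule.attack_techniques:
        problems.append("no ATT&CK mapping - coverage reporting will miss this rule")
    if not rule.references:
        problems.append("no intel reference - provenance is lost")
    if rule.severity not in {"low", "medium", "high", "critical"}:
        problems.append(f"unknown severity: {rule.severity}")
    return problems

if __name__ == "__main__":
    rule = DetectionRule(
        rule_id="win_persistence_registry_run_key",
        title="Suspicious Run key persistence",
        query='registry_path = "*\\\\CurrentVersion\\\\Run*" AND process != "explorer.exe"',
        attack_techniques=["T1547.001"],
        references=["internal-intel-1234"],
    )
    problems = validate(rule)
    if problems:
        for p in problems:
            print("CI check failed:", p)
    else:
        print("CI checks passed; rule can be merged and deployed")
```

From there, a CI pipeline could run such checks on every change and push validated rules to the detection backend, which also goes a long way toward solving the tracking and versioning problem mentioned above.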

An astute reader will also point out that the very process of “machining” raw threat signals into polished production detections is very artisanal in most cases; but don’t despair, we will address this in the next parts of this series! It will be fun!

Apart from that, much of the process of creating new detections has two key problems:

  • Often it starts from available data, and not from relevant threats.
  • Prioritization is still very much a gut feeling affair based on assumption, individual perspective and analysis bias.

Instead, there should be a rolling evaluation of relevant and incoming threats, crossed with current capabilities. In other words, measure detection coverage (how well we detect, in our environment, against the overall known threat landscape), which allows us to build a rolling backlog of threats to detect, identify logging / telemetry gaps, and find key improvement points to steer detection content development. This will turn an arts-and-crafts detection project into an industrial detection pipeline.
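As a toy illustration of that rolling coverage idea — a deliberately simplified sketch that assumes both the threat backlog and the deployed rules are tagged with ATT&CK technique IDs; real coverage scoring should go deeper than technique-level counts, as discussed later in this series:

```python
# Rough sketch of a rolling coverage calculation: cross the current threat
# backlog with deployed detections to find gaps. Assumes (simplistically)
# that both sides are tagged with ATT&CK technique IDs.
relevant_threats = {
    "T1566.001": "Spearphishing attachment (actor X campaign)",
    "T1059.001": "PowerShell execution (actor X campaign)",
    "T1547.001": "Registry run key persistence (actor Y)",
    "T1110.003": "Password spraying against SSO (actor Z)",
}

deployed_detections = {
    "T1566.001": ["email_malicious_attachment_v2"],
    "T1059.001": ["ps_encoded_command", "ps_download_cradle"],
}

covered = sorted(set(relevant_threats) & set(deployed_detections))
gaps = sorted(set(relevant_threats) - set(deployed_detections))

print(f"Coverage: {len(covered)}/{len(relevant_threats)} relevant techniques")
print("Detection backlog (gaps to work on next):")
for technique in gaps:
    print(f"  {technique}: {relevant_threats[technique]}")
```

Even this crude crossing already yields the two artifacts we care about: a coverage number to track over time and a prioritized gap list to feed the detection backlog.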

💸 How about relying on vendors?

What about avoiding all the above trickery, and relying on a wise third party for all your detection content? Well, not only does external detection content quality vary drastically from provider to provider, but such dependency can occasionally become counter-productive.

In theory, the MSSPs / MDRs with a pristine ecosystem to research, build and release should have solid detections for most clients’ conventional threats, but the good ones are few and far between. They often cannot spend their development time creating custom detections (the “economies of scale” argument that is often used to justify an MSSP actually prevents that). Instead, some build broad, generic detection logic, and then spend their — and the customer’s! — time on tuning out False Positives in the live environments of their clients.

From the end-client perspective, there is neither a guarantee nor a complete understanding that the process is going measurably well (especially, and I insist here, in regard to False Negatives) and that it actually leads to an increase in client detection coverage. In this regard, some would say that MSSPs / MDRs compete in a market of lemons when it comes to detections and detection coverage.

For internal or co-managed SOCs (where a small internal team works with a MSSP), relying excessively on externally sourced rules also has the occasional side-effect of lowering the actual understanding of both the threat and the detection implementation, further encouraging a downward spiral.

When working the other way around, handing over in-house detections to the provider for alert response, there is often slowness and protest, as it interferes with their processes and they are (legitimately) concerned that their analysts won’t understand how to process the incidents, since they lack on-site context and tribal knowledge. This plays a real part in an industry where a 70% False Positive rate is rather common: over-relying on response capacity to tune out noisy rules (or a beautiful SOAR at the output of an ugly SIEM, now a classic!), rather than having a defined development lifecycle where lowering FPs is a priority.

With all this being said, integrating vendor-made content (either ready-made rules that come with SIEM tools or outsourced implementation) into creating detections is perfectly viable. However, this is true only as long as you have a strong grasp of the end-to-end process, and understand the technical objectives very well. Only then will the external rules fit into your environment without adding burden…

Related blog:

Build for Detection Engineering, and Alerting Will Improve (Part 3) [Medium Backup]

 This blog series was written jointly with Amine Besson, Principal Cyber Engineer, Behemoth CyberDefence and one more anonymous collaborator.

In this blog (#3 in the series), we will start to define and refine our detection engineering machinery to avoid the problems covered in Parts 1 and 2.

Adopting detection engineering practices should have a roadmap and eventually become a program, effectively re-balancing where efforts go in a SOC by investing in high quality detection creation (and detection content lifecycle, of course).

Put simply, if you spend more time building better detections, then you spend less time triaging bad alerts. Simple, eh? If it were simple everybody would do it!

Embracing leaner, consistent, purpose-driven detection workflows is key, and you may want to assess where you land on these key areas:

⚒️ Breakdown and Backlog: Build a continuous roll of issues corresponding to threats to analyze, and detection requirements to implement. What you are doing next for detection content should be clear in most cases, and yes, this is security, so there will be nasty surprises. Eventually, the only unpredictable tasks would be the genuine rare surprises — your routine detection work would not surprise you.

🌊 Smoothen yer workflow: Remove people interfaces that don’t work, define minimal ceremonies, and put content reviews in the right places. Shorten approval times for releases, and ensure detection quality reviews are followed up. If dealing with an MSSP/MDR, make sure to lay out the governance structure for building custom content jointly with you (JointOps, not finger-pointing).

☣️ Embrace a threat-driven approach: Study adversary tradecraft in detail before making educated calls on what to detect, and where/how. Starting from available telemetry data will more often than not be prone to bias, inefficiency and mistakes.

💡 Embed Intel: A CTI enclave in a SOC will often provide higher ROI than dealing with a separate team, as they understand SOC needs better and plug directly into its processes (this is full of nuance, so YMMV).

⚡ Lower Intel-to-Rule KPIs: Quantify how long it takes to go from intel input (ItR for Intel to Rule, ha-ha, we just made up a new acronym! Take that, Gartner! ;-)) to an actual detection, with as much granularity as possible. A good ItR metric would be to transform high-risk threat intel into working detections in hours or in a few days (a toy sketch of computing this metric appears after these items).

👀 Visibility over assumptions: If you cannot answer accurately within minutes what your detection coverage is and which shortcomings it has — you likely need to start parsing your detection library and threat modeling into qualitative metrics and make that data transparent to the DE team.

🚀 Release Soon, Release Often! There are new threat variants every week, and detection engineering scales directly with the quality of the intelligence input. This is where modern software engineering practices come in handy.

🔥 Quantify, Measure, Orient Operations: Define what healthy operations look like: what FP rate is acceptable for new detections? What turnaround time for tuning should be aimed at? Where are the quick wins, where are the detection gaps? Where is capacity spent, and should it be reassigned to more urgent priorities? Where are the process bottlenecks?

🦾 Automate the hard — and boring — parts: Everything produced during R&D should generate rich and structured knowledge bases, metadata, and metrics. Version your detection library, and roll it out with CI/CD tooling. BTW, this advice alone is worth the price of this blog!
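As promised in the Intel-to-Rule item above, here is a toy sketch of computing that metric; the timestamps and the two-state workflow are invented for illustration, and a real implementation would pull these events from your ticketing or detection pipeline tooling:

```python
# Toy illustration of the "Intel-to-Rule" (ItR) metric: elapsed time from
# receiving an intel item to having a detection live in production.
# Timestamps and workflow states are invented for the example.
from datetime import datetime
from statistics import median

intel_to_rule_events = [
    # (intel received,             detection deployed)
    (datetime(2023, 10, 2, 9, 0),  datetime(2023, 10, 3, 16, 30)),
    (datetime(2023, 10, 5, 14, 0), datetime(2023, 10, 5, 18, 45)),
    (datetime(2023, 10, 9, 8, 30), datetime(2023, 10, 13, 11, 0)),
]

itr_hours = [(deployed - received).total_seconds() / 3600
             for received, deployed in intel_to_rule_events]

print(f"ItR median: {median(itr_hours):.1f} hours, worst: {max(itr_hours):.1f} hours")
```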

💎 Where does ATT&CK fit in the DE picture?

While a great model to categorize adversary tradecraft, and a necessary tool in the detection engineering arsenal, ATT&CK is often overused as a palliative measure to map detections and generate coverage maps — by skipping the detail (“Do you cover T1548? — Ehhh… YES?”)

This does not accurately represent SOC detection performance, since techniques can be fairly broad (it is the purpose of the taxonomy to normalize specifics), and rule quantity doesn’t equate to quality (not when the rules drown analysts in False Positives).

Detection hints from ATT&CK are also rather generic, since a Technique is itself a concept which clusters different procedures together. Thus, while ATT&CK can give a direction for what a SOC needs to develop, it doesn’t give a way to achieve the detection objectives; and that is the detection engineer’s core concern.

The journey to Detection Engineering maturity is hard, but you should now have a clearer perspective to smoothen the journey toward building better detections.

But it all starts with quality input: in our next blog post, we’ll look in more detail at what exactly a Detection Engineering team needs from Threat Intelligence to be fully informed, and propose collaborative models. Stay tuned!

Related blog posts:

Focus Threat Intel Capabilities at Detection Engineering (Part 4) [Medium Backup 10/24/2023]

 

This blog series was written jointly with Amine Besson, Principal Cyber Engineer, Behemoth CyberDefence and one more anonymous collaborator.

In this blog (#4 in the series), we will start to talk about the elephant in the room: how intel becomes detections (and, no, it is not trivial)

Detection Engineers are often picky with the intelligence they receive … and for good reasons. Incomplete, too high-level or overly specific data leads to long analysis time, bias and ultimately inconsistent detection quality and detection coverage gaps. On the other hand, the intel team may process so much intelligence data that it can be a real challenge to understand what the detection team is even expecting. So, in many cases there is an “intelligence — detection gap”, of sorts. This is our topic in this part of the blog series.

To start, in the Detection Engineering Maturity Matrix by Kyle Bailey, intel is referenced on the upper maturity level of the “Threat Operations” Category — and historically this has held true. In some cases, Detection Engineering (DE) teams do not get valuable threat intel at all, and have to do their own threat research as a result (if this is not like that in your organization, congratulations!). But only working in tandem allows us to reach best-in-class results for detection!

Today, let’s explore what’s wrong (for detection, specifically!) with much of the intelligence data provided, what good intel looks like (for DE purposes), and how we can work better together (but do look at the previous posts Part 1, Part 2, Part 3 even if this one is the most interesting for you…)

Intelligence is Often Lacking in TI

A frequent observation in CTI and DE team relationships is that DE asks CTI what they should focus on and which threat they should address first, and CTI cannot deliver precise answers beyond a bundle of ATT&CK techniques (which are largely the same for the vast majority of threat actors and organizations). As we’ve seen in our previous blog post, ATT&CK is not sufficient for more technical DE needs and for detailed planning of the priority detection engineering activities.

In some lucky cases, useful intel is forwarded directly, albeit without processing. Similarly, if the IR team is effective and mature, they will send useful data back to the CTI and DE teams. In this case, naturally, it’ll always be after the fact (i.e. after an incident), whereas the DE target is to develop detections ahead of an incident.

These observations beg the question — what is good intel for detection engineering purposes? For the most part, we can derive some common bad practices:

  • 💅 Just ATT&CK techniques: just giving a set of technique IDs to a DE is not valuable intel (“Attackers used T1548“ — “Great! What do we do with it?”): it may take weeks, in the best cases, to properly develop detections for the scope of a single technique on one platform, and technique-level detections are very hard to reach without resorting to “false positive”-prone approaches. Realistically, most of the time, only well-known procedures will be covered.
  • 🧐 Generalist “intel”: everyone knows brute force is a bad thing — but what did the actor do exactly that we could detect? DE teams need technical threat descriptions. There is a large set of ways to brute force something and an even larger set of ways to detect the resulting activities.
  • 💔 Hard to parse with detection in mind: Nicely documented CTI reports have a bad tendency of containing the entire attack path and a lot of other details that are exciting yet irrelevant for detection. For DE teams, that means breaking down the report into a dozen singular threat elements, each of which needs to be evaluated and processed into detections.
  • 🍂 Lack of variety: Yes, Email and Windows are exploited left and right. But we also have Cloud, SaaS, Remote Workers and other unique threat vectors to look into. If your company uses multiple platforms, but your intel team feeds your DE team with Windows stuff only, attackers will get in and you won’t know it.
  • ⚖️ No value add, just copy from the feed: CTI’s ultimate purpose in an organization is not to compile external content (some of questionable provenance), but to analyze threats through the prism of your business operations and the IT systems in place. “Detect threats that matter” is easy to say, but really hard to do!
  • 🫠 Doesn’t understand the threat: An intel analyst should deeply understand the threat they communicate to DE to accelerate further processing, and ensure enough information is delivered. DE is a process, where any threat can be processed into detection, but only as long as we understand the issue at hand. Ideally, bi-directional CTI-DE comms are established…
  • 🐢 Sloooow: DE aims at moving at the speed of intelligence, thus it needs intel which is closely following the latest developments. Intel done for research can be slow, but intel done for detection cannot be.

🧬 IOCs only get you so far

CTI teams spend a considerable amount of money building an infrastructure that captures, stores, and distributes IOCs to various intel consumers (including whatever detection platforms you use: SIEM, EDR, etc). For detection engineers, using those IOCs to scan incoming logs, search historical ones or enrich existing alerts is a very efficient catch-all. However, most threats nowadays are highly effective at avoiding simple markers and naive rules.

Modern malware, domain generation algorithms and other approaches introduce fuzziness into the behavior that monitoring systems will see, and require different approaches to be effectively detectable. Starting from studying the exact procedures seen in the wild, we can model the common behaviors into low- to high-level detection objectives that remain highly targeted (and thus avoid resorting to trying to detect an entire ATT&CK technique, which may be costly and lengthy compared to the detailed intel received).

For example, a registry key used for persistence may change continuously and thus not be a great way to detect, but the presence of certain patterns in the registry value (certain non-Latin characters, strings or prefixes), highly indicative of a certain actor, may be a great detection objective to successfully and quickly detect an entire campaign. But coming up with this takes skill, time and good intel!
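As a toy sketch of that idea (the field names, the prefix and the “non-Latin characters” pattern below are invented for illustration and do not describe any real campaign): instead of matching an exact indicator, the rule matches the shape of the value that intel says the actor keeps reusing.

```python
# Toy sketch of a behavior-oriented check: instead of matching an exact
# (and easily rotated) registry value, flag values whose *shape* matches
# reported tradecraft - here, Cyrillic characters or a suspicious prefix
# in a Run-key value. Pattern and field names are invented for illustration.
import re

SUSPICIOUS_VALUE = re.compile(r"[\u0400-\u04FF]|^__svc_", re.IGNORECASE)

def is_suspicious_run_key(event: dict) -> bool:
    """Return True if a registry-set event looks like the reported persistence tradecraft."""
    if "\\CurrentVersion\\Run" not in event.get("registry_path", ""):
        return False
    return bool(SUSPICIOUS_VALUE.search(event.get("registry_value", "")))

# Example telemetry events (made up)
events = [
    {"registry_path": r"HKCU\Software\Microsoft\Windows\CurrentVersion\Run",
     "registry_value": "__svc_updater.exe"},
    {"registry_path": r"HKCU\Software\Microsoft\Windows\CurrentVersion\Run",
     "registry_value": r"C:\Program Files\OneDrive\OneDrive.exe"},
]

for e in events:
    print(e["registry_value"], "->", "ALERT" if is_suspicious_run_key(e) else "ok")
```

The point is not this particular check, but that the detection objective survives the actor rotating the exact value, which a plain IOC match would not.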

For CTI to deliver a higher value-add to creating new detections, it must have a stronger focus on compiling intelligence frequently, and on delivering the correct level of technical knowledge to detection engineers. The level of abstraction matters a lot (“they brute forced” is too much abstraction, but “they tried this password 37 times” is definitely too little).

😤 Massive PDFs Ain’t Intelligent!

CTI teams are encouraged by their TIP (threat intelligence platform, which sources and catalogs third-party threat reports) to transform multiple PDFs into a single one, presumably so that it would be easier for humans to read. This is perhaps a vestige of non-cyber intelligence, which was often compiled in long paper reports.

Such aggregated PDFs are, frankly, an annoyance for most detection use cases! In fact, for good DE teams, this is a superb method to make them go mad, as they parse the report over weeks (yes, weeks) into individual objects, then research each one of them to identify detection opportunities. It discourages prioritizing particular aspects of the report, since the entire analysis blocks a DE, who then needs to present and act upon a large body of work instead of neatly packaged work items. Consider that intel is only “intelligent” if it is processed in a way that doesn’t require further manual work and more analysis to understand.

Structuring expert teams

Detection Engineering is such a novel function that many CTI teams are simply unprepared for the additional work needed to support a function that intends to build on intelligence, rather than simply read it.

As a result, many DE teams today actually work very independently, combining threat intelligence and threat modeling into their processes, whereas in a target operating model both of these inputs should come from threat intelligence analysts. In many cases, even when a CTI team exists, it can’t scale to the DE challenge (and, yes, there are beautiful examples of effective CTI-DE collaboration at scale).

Creating new services in the CTI functions thus requires thinking about the organization structure, and how to make teams integrate. Next in the series? An operations model! Very fun!

Related blog posts:

How to Banish Heroes from Your SOC? [Medium Backup 10/12/2023]

This blog was born from two parents: my never-finished blog on why relying on heroism in a Security Operations Center (SOC) is bad, and Phil Venables’ “superb+” blog titled “Delivering Security at Scale: From Artisanal to Industrial.”

BTW, what is heroism? Isn’t that a good thing? Well, an ancient SRE deck defines “IT heroism” as relying on “individuals taking upon themselves to make up for a systemic problem.” As those who have seen the inside of a SOC can attest, this is, ahem, not entirely uncommon in many Security Operations Centers.

If you recall our Autonomic Security Operations (ASO) vision, we advocate for automation, consistent processes and a systematic, engineering-led approach to problems. Yet in real life, heroes are very much needed at many SOCs for their routine operation. This is the essence of our conundrum: human heroism is usually good, but a system that relies on heroes for routine operation is bad.

Here is a great quote from another domain that explains this even better:

“The need for heroism is revealing the fact that you haven’t scaled your organization’s processes to effectively withstand the brunt of the unexpected, leaving it on individuals to bear.” (source)

Is your SOC such a system? If yes, how to change it?

First, where might this show up in your SOC?

  • Heroic alert triage where analysts stay late, extend their shifts, accept escalations at all hours, etc (likely the most common example, frankly)
  • Heroic rule writing, where rules and content get created through heroics: instead of a detection engineering practice, you have a detection firefighting crew…
  • Heroic remediation is the classic “wait, wait, I can fix it” syndrome that, statistically speaking, very rarely leads to a good solution.
  • Another classic: working long hours to resolve an incident alone.
  • Frequently coming up with creative one-off solutions to wide-ranging systemic problems.

What do you want instead? Well, you want an industrial system! What is it? Here, Phil explains it better than I can:

source: Phil’s blog https://www.philvenables.com/post/delivering-security-at-scale-from-artisanal-to-industrial

Now, let’s see if we can quickly contextualize it for the SOC:

source: I just made it :-)

Notice that heroism makes many appearances on Phil’s “artisanal” side of the table. “Dependent on individual artisans [read: heroes] to sustain work”, “Organization success is like spinning plates, if the people don’t show up there’s immediate and catastrophic failure”, “Hard to replicate” all carry the unmistakable mark of an IT hero…

OK, gimme some good news! How to fix it?

Trigger warning: this is going to be scary.

Ready?

source: privately shared

Now for the painful, painful truth: “It’s better to let a process break and uncover a systemic issue (like the need for better tooling or an adjustment of priorities), than to have individuals try to make up for the problem.“

You want more? Sorry, all I got is this ;-) Definitely more thinking and learning is required.

Now a question: have you successfully industrialized or “de-hero-ized” your SOC? Have you used our ASO ideas? What are the lessons? Insights? Key hurdles?

Related blogs:

Dr Anton Chuvakin