Organizations are taking too long to detect attacks. The average time to detect an intruder ranges from two months to 229 days across many reports, and anecdotal evidence from publicized breaches supports these numbers. This means attackers are taking advantage of the challenges inherent in the flood of information bombarding your incident response team every day. This is a problem we need to address by improving the process with better tools.
The incident handling process is similar to continuous flow manufacturing
Continuous flow manufacturing consists of a series of stations, each receiving an item and passing it on to the next stage. Incidents follow a similar pattern:
Alerted to potential incident --> Alert is triaged to strip false positives --> Incident is analyzed --> Action is taken to remediate
An incident is not detected simply because an alert is triggered. It needs to reach the final stage where the issue is confirmed and the impact is understood before detection can be claimed.
Because of this similarity, the incident handling process reminds me of a book that is mandatory reading in most business schools: The Goal. This book examined the manufacturing process through a "Theory of Constraints", which views any manageable system as being limited in achieving more of its goals by a very small number of constraints. These constraints are commonly referred to as bottlenecks because no matter how many items need to flow through them, there is a limit to how much can pass at any one time (like the neck of a bottle). If you want to respond to incidents faster and speed attack detection, you need to make sure that no stage of your incident handling process is a daunting bottleneck. The problem is that, in most organizations, there are two bottlenecks that need to be eliminated if we are to improve response time from months to hours.
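The Theory of Constraints can be illustrated with a toy model of the four-stage pipeline above. This is a minimal sketch with made-up capacities (the stage names and numbers are illustrative assumptions, not real SOC metrics): no matter how much the other stages can handle, end-to-end throughput is capped by the slowest stage.

```python
# Toy model of a four-stage incident handling pipeline.
# Capacities are hypothetical items-per-day figures, chosen only to
# illustrate the Theory of Constraints: the system's overall throughput
# equals the capacity of its narrowest stage.

stages = {
    "alerting": 1000,   # alerts the tooling can raise per day
    "triage": 40,       # alerts a human can properly triage per day
    "analysis": 60,     # incidents that can be analyzed per day
    "remediation": 80,  # incidents that can be remediated per day
}

# The bottleneck is the stage with the minimum capacity.
bottleneck = min(stages, key=stages.get)
throughput = stages[bottleneck]

print(f"Bottleneck stage: {bottleneck} ({throughput} items/day)")
```

In this sketch, doubling alerting or remediation capacity changes nothing; only widening the triage stage moves the whole system, which is exactly the point the book makes about manufacturing lines.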
IR Bottleneck #1: Alert triage is a decision without time for proper deliberation because of the noise
All too often, a significant incident response bottleneck is the first human action: alert triage. Whether they run a 24-hour SOC or a small ad hoc team, the noisy security tools in use today cause CSIRTs to see thousands of alerts each day, which leads them to both:
(a) ignore a great deal of alerts based on extremely quick decisions with little context, and
(b) take a long time to parse each remaining alert, confirm it as an incident, and begin the investigation.
In every massive breach made public, someone in the media jumps to point out that an alert was triggered, but no action was taken. I wrote about alert fatigue a great deal here, but the security solutions that value quantity of alerts over quality of alerts are as much to blame for incident response teams missing the right alerts as the teams and processes built around those tools. If too many alerts are being generated for a team to take a minute apiece to understand, the true incidents are never going to get investigated quickly enough.
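The "minute apiece" point is easy to check with back-of-envelope arithmetic. The figures below are illustrative assumptions (a hypothetical alert volume and team size, not data from any real organization), but they show how quickly the math breaks down:

```python
# Back-of-envelope alert fatigue math (all numbers are hypothetical,
# for illustration only): can a team give every alert one minute?

alerts_per_day = 5000     # assumed daily volume from noisy tools
minutes_per_alert = 1     # the bare minimum to understand an alert
analysts = 3              # assumed triage headcount
shift_minutes = 8 * 60    # one 8-hour shift per analyst

required = alerts_per_day * minutes_per_alert
available = analysts * shift_minutes
shortfall = required - available

print(f"Required: {required} min/day, available: {available} min/day")
print(f"Shortfall: {shortfall} min/day")
```

Under these assumptions the team is thousands of analyst-minutes short every day, so alerts either get ignored or queue up, which is the choice described in (a) and (b) above.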
The InsightIDR team invests heavily in introducing new user behavior analytics detection capabilities that are designed to alert infrequently. We are also trying to ingest the alerts from your other security products and reduce the noise by adding context about the responsible user and trends within the data generated. Our goal is to reduce the number of alerts on such a scale - from thousands per day to tens per day - that you can triage them all without rushing and have time for your many other responsibilities. Our customers have told us that we are delivering on this promise with 20 to 50 alerts per day in the larger organizations, but our aim is to continue to keep the volume low while detecting more and more concerning behaviors.
IR Bottleneck #2: Incident analysis requires a very high level of skill due to available tools
Even more frequently, the investigation process for an incident takes a great deal of time. Tying all related information together to paint a picture of the broader compromise and map out the impact is primarily done by taking one piece of information, such as an IP address, from a noisy alert and conducting dozens of searches through mounds of collected data. There are three reasons this frequently becomes a bottleneck:
- Not having the right data - Given the bottleneck in alert triage, not every incident is identified within thirty days, yet most solutions only keep data searchable for that period. Even if you can obtain data from far enough in the past, every attack follows a different path, so you may not have collected from the sources (i.e. endpoints) necessary to map out an intruder's path through your network.
- Not knowing what questions to ask of your data - There is a well-known concern that only a very small number of highly skilled incident response professionals know what to look for in advanced attacks. This means that either a select few organizations have multiple experts at incident analysis or a large number of organizations have a single expert. The truth is somewhere in the middle, but without better tools and methods, the learning curve to create more of these experts has a troubling slope.
- High speed answers without the right questions - Existing software solutions have focused too much on speeding manual searches through the data and too little on providing access to noteworthy patterns and relevant context in the search results that will help less experienced incident responders gain the understanding necessary to move to remediation faster.
InsightIDR provides a variety of tools for our customers to immediately query endpoints, search logs, or explore patterns in behavior, all aimed at reducing this sizable investigation bottleneck. We believe we need to provide better ways to analyze data from any time or source and tie it to the users and assets involved, so that you can fully scope the incident and determine the impact to the organization.
If you want to see how InsightIDR can help you minimize your bottlenecks, you can watch our brief video or register for a free, guided demo of InsightIDR. I think you'll quickly see how we can improve your incident handling flow.