Nearly a year ago, I likened the incident handling process to continuous flow manufacturing to help focus on the two major bottlenecks: alert triage and incident analysis. At the time of these posts, most incident response teams were just starting to find value amid the buzz of “security analytics,” so I focused on how analytics could help those struggling with search alone. As more teams have brought the two capabilities together, the best practices for balancing them to smooth the process have started to become apparent, so here is what we've heard thus far.
Analytics are meant to replace slow and manual search—but not everywhere
You need to be able to use search as if you're zooming in to find the supporting details for your initial conclusion. It would be great if we could solve every incident like Ace Ventura deducing that Roger Podacter was murdered by remembering a string of seemingly unrelated data points, but as Don Norman explained in The Design of Everyday Things, the limitations of the human working memory demand better designed tools to balance the trade-off between “knowledge in the head” and “knowledge in the world”. Since investigating an incident involves a great deal of information be gathered, “knowledge in the world” must be maximized through analytics and effective visualizations for the context they bring to minimize the necessary “knowledge in the head”.
Search cannot be your only tool for alert triage. Search results provide immediate "knowledge in the head", but little else. They are great for answering clear and pointed questions, but there is not enough context in search results alone. Receiving an alert is the beginning of the incident handling process and if the alert is nothing more than a log line, triage can be a gut feeling based only on previous experience with the specific alert triggered. Sometimes, seeing the log line in the context of the events immediately preceding and following it is enough to understand what occurred, but often, higher level questions need to be asked, such as “who did it?”, “on which asset?”, and “is this common for that person and system?” None of these questions immediately explain the root cause of the triggered alert, but they make it exponentially easier to triage. If you need to manually search your machine data for the answer to every one of these questions to triage every alert, you need to provide your team with a great deal more “knowledge in the world”. This should include manicured views containing the answer to most of these questions on display next to the alert, or a simple click away.
If you are analyzing dozens of incidents per day, every repetitive action is building the bottleneck. This is where analytics play a part in the incident handling process, as they only add value if the conclusion they make is easy to understand. Dozens to hundreds (and hopefully not thousands) of alerts per day require a progression of quick conclusions which others have made in the past before finding a very specific answer to the incident at hand. Easy-to-understand analytics are the answers to the many early questions you need to ask to before you understand the root cause of the alert and decide whether it was either malicious or an action which should be permitted in the future. If the analyst performing triage and the incident analyst (who may be the same person) have access to these analytics as soon as they receive the potential incident in their queue, it vastly reduces the amount of manual searching through data they must perform to close the incident faster than previously possible.
It hurts the flow when you cannot immediately access the data you need
In the previous post of this series, I covered the need to collect data outside (and inside) the traditional perimeter. An adjacent problem is the situations in which the data is being collected but not readily made available to the IR team. Needing to wait for the data to be restored or indexed is a bit like if you were playing Clue with your family and every time someone said "It was Colonel Mustard, in the conservatory, with the candlestick", you all took out your phones and caught up on emails for thirty minutes as the player-to-the-left searched through filing cabinets for her cards disproving the suggestion.
Expensive tools and verbose data sources have led too many teams to delay investigation until the data is acquired for indexing. The first common cause of this challenge is caused by the realities of departmental budgets, which for the majority of teams to pick and choose the data they make unconditionally available to search. Sadly, this leads many teams to choose between firewall, DNS, or web proxy data to have immediately searchable for incident responders, storing the rest on the device itself or pushing it to cold storage. Then, when an incident is deemed severe enough, the team is waiting hours to restore (or forward) the data and get it indexed for search. This sounds crazy to anyone who hasn't had to stay under a budget, but these three devices are logging so much noise in a typical day that it becomes the only perceived sacrifice the team can make. Your team shouldn't have to worry about exceeding data thresholds or adding more members to your cluster just to have this key data constantly available.
Setting reminders to search the data once it's available severely challenges the "without interruption" aspect of continuous flow. The second guaranteed "data loading..." way to slow your team's response to an alert is using technology that takes up to twenty minutes to make it searchable. If receiving an email alert and copying it to your calendar to remind you to investigate in twenty to thirty minutes doesn't sound like the best use of your time, you are likely to be frustrated by the inability to search at the moment an incident has your attention. Unless you're the one incident responder who never has enough to do, you'll probably be busy looking into something else by that time, meaning an even longer gap before the initial alert is investigated, or worse, it gets forgotten altogether until you or someone else on the team reviews the open incidents hours later.
Everything from the query language to the search results needs to be designed for junior analysts
No matter the maturity of your team, the best search and analytics capabilities can offer little value if they don't account for a single truth: you have to hire and train entry-level analysts just as any technical team does. If every tool at your team's disposal is designed with a focus only on the type of data it obtains and not on the usability, your junior analysts have steep learning curves to climb before they are contributing to the efficiency of the entire process. This is not the new hires' faults. If you need to learn a complex query language, become and expert on each data type, or decipher the search result screens which perfectly depict "information overload" before you can contribute to the team, you need to get more comfortable blaming the software at your disposal. Going back to Don Norman, we have all become too accustomed to blaming the user when the design is at fault. Every software solution (for search or otherwise) your team uses needs to be designed for an entry-level analyst.
If searching through machine data is your current need, you can start a free trial of Logentries here and search your data within seconds. If you want to learn more about building your incident response capabilities, check out our incident response toolkit.