Last updated at Wed, 07 Apr 2021 18:31:33 GMT

Merry HaXmas to you! Each year we mark the 12 Days of HaXmas with 12 blog posts on hacking-related topics and roundups from the year. This year, we're highlighting some of the “gifts” we want to give back to the community. And while these gifts may not come wrapped with a bow, we hope you enjoy them.


Sam the snowman taught me everything I know about reindeer [disclaimer: not actually true], so it only seemed logical that we bring him back to explain the journey of machine learning. Wait, what? You don't see the correlation between reindeer and machine learning? Think about it, that movie had everything: Yukon Cornelius, the Bumble, and of course, Rudolph himself. And thus, in sticking with the theme of HaXmas 2016, this post is all about the gifts of early SIEM technology, “big data”, and a scientific process.

SIEM and statistical models – Rudolph doesn't get to play the reindeer games


Just as Rudolph had conformist Donner's gift of a fake black nose promising to cover his glowing monstrosity [and it truly was impressive to craft this perfect deception technology with hooves], information security had the gift of early SIEM technology promising to analyze every event against known bad activity to spot malware. The banking industry had just seen significant innovation in the use of statistical analysis [a sizable portion of what we now call “analytics”] for understanding the normal in both online banking and payment card activities. Tracking new actions and comparing them to what is typical for the individual takes a great deal of computing power and early returns in replicating fraud prevention's success were not good.

SIEM had a great deal working against it when everyone suddenly expected a solution designed solely for log centralization to easily start empowering more complex pattern recognition and anomaly detection. After having witnessed, as consumers, the fraud alerts that can come from anomaly detection, executives starting expecting the same from their team of SIEM analysts. Except, there were problems:

  • the events within an organization vary a great deal more than the login, transfer, and purchase activities of the banking world,
  • the fraud detection technology was solely dedicated to monitoring events in other, larger systems, and
  • SIEM couldn't handle both data aggregation and analyzing hundreds of different types of events against established norms.

After all, my favorite lesson from data scientists is that “counting is hard”. Keeping track of the occurrence of every type of event for every individual takes a lot of computing power and understanding of each type of event. After attempting to define alerts for transfer size thresholds, port usage, and time-of-day logins, no one understood that services like Skype using unpredictable ports and the most privileged users regularly logging in late to resolve issues would cause a bevy of false positives. This forced most incident response teams to banish advanced statistical analysis to the island of misfit toys, like an elf who wants to be a dentist.

“Big Data” – Yukon Cornelius rescues machine learning from the Island of Misfit Toys

There is probably no better support group friend for the bizarre hero, Yukon Cornelius, than “big data” technologies. Just as NoSQL databases, like Mongo, to map-reduce technologies, like Hadoop, were marketed as the solution to every, conceivable challenge, Yukon proudly announced his heroism to the world. Yukon carried a gun he never used, even when fighting Bumble, and “big data” technology is varied and each kind needs to be weighed against less expensive options for each problem. When jumping into a solution technology-first, most teams attempting to harness “big data” technology came away with new hardware clusters and mixed results; insufficient data and security experts miscast as data experts still prevent returns on machine learning from happening today.


However, with the right tools and the right training, data scientists and software engineers have used “big data” to rescue machine learning [or statistical analysis, for the old school among you] from its unfit-for-here status. Thanks to “big data”, all of the original goals of statistical analysis in SIEMs are now achievable. This may have led to hundreds of security software companies marketing themselves as “the machine learning silver bullet”, but you just have to decide when to use the gun and when to use the pick axe. If you can cut through the hype to decide when the analytics are right and for which problems machine learning is valuable, you can be a reason that both Hermey, the dentist, and Rudolph, the HaXmas hero, reach their goal.

Visibility - only Santa Claus could get it from a glowing red nose


But just as Rudolph's red nose didn't make Santa's sleigh fly faster, machine learning is not a magic wand you wave at your SIEM to make eggnog pour out. That extra foggy Christmas Eve couldn't have been foggy all over the globe [or it was more like the ridiculous Day After Tomorrow], but Santa knows how to defy physics to reach the entire planet in a single night, so we can give him the benefit of the doubt and assume he knew when and where he needed a glowing red ball to shine brighter than the world's best LED headlights. I know that I've held a pretty powerful Maglite in the fog and still couldn't see a thing, so I wouldn't be able to get around Mt. Washington armed with a glowing reindeer nose.

Similarly, you can't just hand a machine learning toolkit to any security professional and expect them to start finding the patterns they should care about across those hundreds of data types mentioned above. It takes the right tools, an understanding of the data science process, and enough security domain expertise to apply machine learning to the attacker behaviors well hidden within our chaotically normal environments. Basic anomaly detection and the baselining of users and assets against their peers should be embedded in modern SIEM and EDR solutions to reveal important context and unusual behavior. It's the more focused problems and data sets that demand the kind of pattern recognition within the characteristics of a website or PHP file only the deliberate development of machine learning algorithms can properly address.