Last updated at Mon, 06 Nov 2017 17:53:30 GMT

As a long-time APM guy who has spent quite a bit of time building monitoring tools such as transaction tracers, memory analyzers, and tools for analyzing garbage collection behavior, it pains me to say this, but sometimes an APM solution is not enough.

In fact, a recent blog post by Andreas Grabner of Compuware entitled “Don’t Trust Your Log Files: How and Why to Monitor ALL Exceptions” got me thinking on this topic. Grabner claims that you shouldn’t trust your log files because if you do not catch and log your exceptions, you are not going to see them in your logs. I think it’s fair to say, and pretty obvious, that if you do not log an exception you are not going to see it. However, on the flip side we are seeing more and more users (and we have over 25,000 of them) who are getting real value from logging for exactly this use case. In fact, a recent survey we carried out points to production monitoring and production troubleshooting as the top two ways that our users are utilizing their logs! And if you are smart about how you log, you do not need to have verbose logging turned on to catch key issues in your system. Logging can be quite simple, and deliver value that is much more specific and granular than basic APM.

Use your logs to ‘log all the things’, especially important exceptions
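To make the "smart about how you log" point concrete, here is a minimal sketch in Python of one way to make sure uncaught exceptions still land in your logs without turning on verbose logging. The logger name `app`, the log format, and the helper names `build_logger` and `install_exception_logging` are all assumptions made for illustration; they are not taken from Grabner's post or from any particular product.

```python
import logging
import sys


def build_logger(stream):
    """Create a quiet logger: WARNING and above only, no verbose output."""
    logger = logging.getLogger("app")  # "app" is a hypothetical name
    logger.setLevel(logging.WARNING)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.handlers = [handler]
    logger.propagate = False
    return logger


def install_exception_logging(logger):
    """Send exceptions that nobody caught to `logger` at ERROR level."""

    def hook(exc_type, exc_value, exc_tb):
        # Let Ctrl-C behave normally instead of recording it as an error.
        if issubclass(exc_type, KeyboardInterrupt):
            sys.__excepthook__(exc_type, exc_value, exc_tb)
            return
        # exc_info makes the logging module append the full traceback.
        logger.error("uncaught exception", exc_info=(exc_type, exc_value, exc_tb))

    sys.excepthook = hook
    return hook
```

Calling `install_exception_logging(build_logger(sys.stderr))` once at startup would be enough for this sketch: any exception that escapes the main thread is then written as a single ERROR entry with its traceback, even though the logger itself stays at WARNING level.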

So my question is: why are people turning to logs more and more for production monitoring and troubleshooting, in addition to their expensive, existing monitoring tools? In short, I believe that systems have fundamentally changed over the past few years. Even five years ago, systems were mainly on-premise or in a data center where you had complete control of your environment. Today, however, it is becoming more and more common for systems to be deployed entirely in the cloud, or to at least make use of numerous cloud components. For cloud-based systems, full instrumentation or the use of APM solutions is often not an option, since many parts of the stack may no longer be under your control and the access required to apply an APM solution may not be available.

For example, with IaaS you only have access from the operating system up, i.e. the operating system, the middleware and the application tier. The provider controls everything below the operating system, such as the hypervisor layer, the hardware and the network. For those using PaaS, the situation is even more constrained, since PaaS vendors tend to manage the OS and middleware components on behalf of their users. You therefore only have access to the application tier from an instrumentation perspective. Finally, with SaaS components you generally cannot add any instrumentation of your own and have to rely on whatever instrumentation APIs or endpoints the SaaS vendor provides.

As a result, it is common for traditional APM solutions, which have claimed in the past to give 100% end-to-end visibility for on-premise systems, to give only a fraction of that visibility for cloud-based systems. It is difficult to instrument the cloud, and thus alternative approaches are required to give visibility into cloud-based components, which can otherwise become black boxes from a performance or system monitoring perspective.

While it can be difficult to instrument cloud-based components, in general they tend to produce log data streams or provide access to APIs that can be polled to generate data streams. These data streams can be analyzed to give visibility into your systems.

While instrumentation may not be possible, the existing log data and API data streams provided by cloud vendors can be analyzed by log management and analytics solutions to provide real-time KPI dashboards, giving deep visibility into what are often otherwise perceived as black box components.
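As a rough illustration of turning such a data stream into KPIs, here is a small Python sketch that reads log lines and computes a request count, an error rate and response-time figures. The `key=value` line layout and the field names `status` and `duration_ms` are assumptions made for this example; real cloud vendors each have their own log formats, so the parsing step would need to be adapted.

```python
import re

# Matches key=value pairs such as status=200 or duration_ms=123.
PAIR = re.compile(r"(\w+)=(\S+)")


def parse_line(line):
    """Return the key=value fields found in one log line as a dict."""
    return dict(PAIR.findall(line))


def summarize(lines):
    """Compute simple KPIs from an iterable of log lines."""
    durations = []
    errors = 0
    for line in lines:
        fields = parse_line(line)
        try:
            status = int(fields["status"])
            duration = float(fields["duration_ms"])
        except (KeyError, ValueError):
            continue  # skip lines that are not request records
        if status >= 500:
            errors += 1  # count server-side failures only
        durations.append(duration)

    total = len(durations)
    if total == 0:
        return {"requests": 0, "error_rate": 0.0, "mean_ms": None, "max_ms": None}
    return {
        "requests": total,
        "error_rate": errors / total,
        "mean_ms": sum(durations) / total,
        "max_ms": max(durations),
    }
```

Fed a line such as `2017-11-06T17:53:30Z status=500 duration_ms=300 path=/checkout`, `summarize` counts one request and one error. A log analytics service does the same kind of aggregation continuously and at much larger scale, which is what drives a real-time dashboard.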

While APM solutions are definitely widely used for a view into how your own application code is performing, in many cases APM alone is not enough to give an end-to-end perspective of your system. We are seeing a huge increase in users sending more and more performance metrics into their log data, giving them the ability to use “logs as data” alongside their APM solution and providing some of the deep insights that they really need.
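For anyone wondering what sending performance metrics into log data can look like in practice, here is a minimal Python sketch, assuming you control the application tier. The `timed` helper and the field names `metric`, `outcome` and `duration_ms` are hypothetical choices for this example, not a standard.

```python
import logging
import time
from contextlib import contextmanager


@contextmanager
def timed(logger, name):
    """Time the wrapped block and log the result as key=value pairs."""
    start = time.perf_counter()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "error"
        raise  # the caller still sees the original exception
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        logger.info(
            "metric=%s outcome=%s duration_ms=%.1f", name, outcome, elapsed_ms
        )
```

Wrapping a unit of work as `with timed(logger, "checkout"): ...` emits one line per call, for example `metric=checkout outcome=ok duration_ms=12.3` (the duration will vary). Because the fields are plain `key=value` pairs, a log analytics tool can pull them out and chart them without any agent or instrumentation beyond the logger you already have.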

Would love to hear your thoughts. How can log management & analytics further complement a traditional APM solution?