As a longtime APM guy, and as someone who has spent quite a bit of time building monitoring tools such as transaction tracers, memory analyzers, and tools for analyzing garbage collection behavior, it pains me to say this, but sometimes an APM solution is not enough.
In fact, a recent blog post I read by Andreas Grabner of Compuware, entitled “Don’t Trust Your Log Files: How and Why to Monitor ALL Exceptions”, got me thinking on this topic. Grabner claims that you shouldn’t trust your log files because if you do not catch and log your exceptions, you are not going to see them in your logs. I think it’s fair to say, and pretty obvious, that if you do not log an exception you are not going to see it. However, on the flip side, we are seeing more and more users (and we have over 25,000 of them) who are getting real value from logging for exactly this use case. In fact, a recent survey we carried out points to production monitoring and production troubleshooting as the top two ways that our users are utilizing their logs! And if you are smart about how you log, you do not need to have verbose logging turned on to catch key issues in your system. Logging can be quite simple, and deliver value that is much more specific and granular than basic APM.
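To make the "smart, not verbose" point concrete, here is a minimal sketch in Python using only the standard library. It assumes a logger set to WARNING, so routine debug chatter is dropped, plus a process-wide `sys.excepthook` so that exceptions nobody caught still reach the log. The logger name `app` and the helper names `build_logger` and `install_excepthook` are made up for illustration.

```python
import io
import logging
import sys


def build_logger(stream, name="app"):
    """Return a logger that only emits WARNING and above to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.WARNING)  # verbose logging stays off
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
    logger.handlers = [handler]
    return logger


def install_excepthook(logger):
    """Route exceptions that nothing caught into the log, traceback included."""
    def hook(exc_type, exc_value, exc_tb):
        logger.error("uncaught exception", exc_info=(exc_type, exc_value, exc_tb))

    sys.excepthook = hook
    return hook


# Demo: write to an in-memory buffer instead of a real log file.
buffer = io.StringIO()
logger = build_logger(buffer)
hook = install_excepthook(logger)

logger.debug("cache miss for key=42")  # below WARNING, so it is dropped

try:
    raise ValueError("bad order id")
except ValueError:
    # Simulate what the interpreter does for an uncaught exception.
    hook(*sys.exc_info())

output = buffer.getvalue()
print(output)
```

The hook only covers exceptions that escape to the top of the main thread. Exceptions that are caught and swallowed silently still need an explicit `logger.exception(...)` call at the point where they are handled.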
So my question is: why are people turning to logs more and more for production monitoring and troubleshooting in addition to their expensive, existing monitoring tools? In short, I believe that systems have fundamentally changed over the past few years. If you think back even five years, systems were mainly on-premise or in a data center where you had complete control of your environment. Today, however, it is becoming more and more common for systems to be deployed entirely in the cloud or to at least make use of numerous cloud components. For cloud-based systems, full instrumentation, or the use of APM solutions, is often not an option: many parts of the stack may no longer be under your full control, so the access required to apply an APM solution may not be available. For example, with IaaS you only have access from the operating system up, i.e. the operating system, the middleware and the application tier. The provider controls everything below the operating system, such as the hypervisor layer, the hardware and the network. For those using PaaS, the situation is even more constrained, since PaaS vendors tend to manage the OS and middleware components on behalf of their users. You therefore only have access to the application tier from an instrumentation perspective. Finally, with SaaS components, you generally do not have any ability to instrument and must rely on whatever instrumentation APIs or endpoints the SaaS vendor provides.
As a result, it is common for traditional APM solutions, which have claimed in the past to give 100% end-to-end visibility for on-premise systems, to give only a fraction of that visibility for cloud-based systems. It is difficult to instrument the cloud and, thus, alternative approaches are required to give visibility into cloud-based components, which can otherwise become black boxes from a performance or system monitoring perspective.
While it can be difficult to instrument cloud-based components, in general they tend to produce log data streams or provide access to APIs that can be polled to generate data streams. These data streams can be analyzed to give visibility into your systems:
- SaaS: hosted services such as database as a service and email as a service, as well as modern-day CDNs, provide log forwarding and event metrics via APIs that can be consumed by log management and analytics solutions to provide visibility from a KPI tracking perspective.
- PaaS: PaaS vendors (e.g. Heroku, Cloud Foundry, and Engine Yard) provide log data that can contain both application and system level log events, as well as runtime metrics from the PaaS middleware that give detailed performance and error tracking information about your system. Heroku, one of the leading PaaS vendors, showcases how log data can be utilized in their Log2Viz open source project, which provides a performance dashboard based on real-time analysis of performance metrics recorded in Heroku logs.
- IaaS: IaaS providers tend to provide monitoring APIs (e.g. Rackspace and Amazon CloudWatch). Such APIs can be polled to provide a stream of performance information on cloud server instances as well as on the different services provided by the IaaS vendor.
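As a rough illustration of the "poll an API to generate a data stream" idea, the sketch below turns periodic metric snapshots into key=value log lines. It deliberately does not call any real vendor API: `fake_fetch` is a stand-in for whatever monitoring client you actually use, and the metric names, the source label and the function names are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Iterator


def format_event(source: str, metrics: Dict[str, float], timestamp: int) -> str:
    """Render one metrics snapshot as a single key=value log line."""
    pairs = " ".join(f"{key}={value}" for key, value in sorted(metrics.items()))
    return f"ts={timestamp} source={source} {pairs}"


def poll(
    fetch: Callable[[], Dict[str, float]],
    source: str,
    interval_s: float,
    iterations: int,
    clock: Callable[[], float] = time.time,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Call fetch() a fixed number of times and yield one log line per call."""
    for i in range(iterations):
        yield format_event(source, fetch(), int(clock()))
        if i < iterations - 1:
            sleep(interval_s)  # wait between polls, but not after the last one


def fake_fetch() -> Dict[str, float]:
    """Hypothetical stand-in for a call to a provider's monitoring API."""
    return {"cpu_pct": 41.5, "disk_read_ops": 120}


# Demo with a fixed clock and no real sleeping, so the output is repeatable.
lines = list(
    poll(
        fake_fetch,
        source="iaas.instance-1",
        interval_s=0,
        iterations=2,
        clock=lambda: 1700000000,
        sleep=lambda seconds: None,
    )
)
for line in lines:
    print(line)
```

In practice each line would be written to a file or forwarded to your log management service rather than printed, and the loop would run on a schedule instead of for a fixed number of iterations.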
While instrumentation may not be possible, the existing log data and API data streams provided by cloud vendors can be analyzed by log management and analytics solutions to provide real-time KPI dashboards giving deep visibility into what are often otherwise perceived as black box components.
While APM solutions are definitely widely used for a view into how your own application code is performing, in many cases APM alone is not enough to give an end-to-end perspective of your system. We are seeing a huge increase in users sending more and more performance metrics into their log data, giving them the ability to use “logs as data” alongside their APM solution and providing some of the deep insights that they really need.
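For a feel of what “logs as data” can look like, here is a small sketch that treats key=value log lines as records and summarizes one numeric field. The log format, the `response_ms` field and the sample lines are invented for the example; a log management service would do this kind of aggregation for you at scale.

```python
import re
from statistics import mean

# Matches key=value pairs such as status=200 or path=/checkout.
PAIR = re.compile(r"(\w+)=(\S+)")


def parse_line(line):
    """Turn one key=value log line into a dict of strings."""
    return dict(PAIR.findall(line))


def summarize(lines, field):
    """Compute count, average and max of a numeric field across log lines."""
    values = []
    for line in lines:
        record = parse_line(line)
        if field in record:  # skip lines that do not carry the metric
            values.append(float(record[field]))
    if not values:
        return {"count": 0, "avg": None, "max": None}
    return {"count": len(values), "avg": mean(values), "max": max(values)}


# Made-up sample lines in which the application logs its own response time.
sample = [
    "level=INFO path=/checkout status=200 response_ms=120",
    "level=INFO path=/checkout status=200 response_ms=180",
    "level=ERROR path=/checkout status=500 response_ms=900",
    "level=INFO msg=startup",
]

summary = summarize(sample, "response_ms")
print(summary)
```

The design choice worth noting is that the metric is written at the point where the application already knows it, so no agent or instrumentation layer is needed to recover it later.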
Would love to hear your thoughts. How can log management & analytics further complement a traditional APM solution?