Last updated at Fri, 11 Aug 2017 19:23:40 GMT
Analyzing System Performance Using Log Data
Recently we examined some of the most common behaviors that our community of 25,000 users looked for in their logs, with a particular focus on web server logs. Our research identified the top 15 web server tags and alerts created by our customers. You can read more about these in our https://logentries.com/doc/community-insights/ section, and you can easily create tags or alerts based on the patterns to identify these behaviors in your system's log data.
This week we are focusing on system performance analysis using log data. Again we looked across our community of over 25,000 users and identified five ways in which people use log data to analyze system performance. As always, customer data was anonymized to protect privacy. Over the course of the next week, we will be diving into each of these areas in more detail and will feature customers' first-hand accounts of how they are using log data to help identify and resolve issues in their systems and analyze overall system performance.
Our research looked at more than 200,000 log analysis patterns from across our community to identify important events in log data. Focusing on issues related to system performance and IT operations, we identified the following five areas as trending and common across our user base.
5 Key Log Analysis Insights
1. Slow Response Times
Response times are one of the most common and useful system performance measures available from your log data. They give you an immediate understanding of how long a request is taking to be returned. For example, web server log data can give you insight into how long a request takes to return a response to a client device. This can include time taken for the different components behind your web server (application servers, DBs) to process the request, so it can offer an immediate view of how well your application is performing. Recording response times from the client device/browser can give you an even more complete picture since this also captures page load time in the app/browser as well as network latency.
A good rule of thumb when measuring response times is to follow the three response time limits outlined by Jakob Nielsen in his 1993 publication ‘Usability Engineering' (still relevant today): 0.1 second is about the limit for having the user feel that the system is reacting instantaneously, 1.0 second is about the limit for the user's flow of thought to stay uninterrupted, and 10 seconds is about the limit for keeping the user's attention focused on the dialogue.
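As a minimal sketch of how these three limits might be applied to a measured response time, the function below buckets a value in seconds. The function name and the bucket labels are illustrative assumptions, not terms from Nielsen's text.

```python
# Thresholds in seconds, taken from the three limits quoted above.
INSTANT_LIMIT = 0.1
FLOW_LIMIT = 1.0
ATTENTION_LIMIT = 10.0


def classify_response_time(seconds: float) -> str:
    """Map a response time onto one of the three limits (labels are illustrative)."""
    if seconds <= INSTANT_LIMIT:
        return "instantaneous"
    if seconds <= FLOW_LIMIT:
        return "flow uninterrupted"
    if seconds <= ATTENTION_LIMIT:
        return "attention held"
    return "attention lost"
```

Bucketing like this can make it easier to decide which threshold deserves an alert and which only deserves a tag.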
Slow response time patterns almost always follow the structure below:
- response_time>X
In this context, response_time is the field holding the response time recorded by the server or client, and ‘X' is a threshold. If this threshold is exceeded, you want the event to be highlighted, or a notification to be sent, so that you and your team are aware that somebody is having a poor user experience.
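The threshold check described here could be sketched in Python roughly as follows. The key=value log format, the unit (seconds), and the helper name slow_requests are assumptions made for illustration; real log lines will need their own parsing.

```python
import re

# Assumed log format: key=value pairs with response_time given in seconds.
RESPONSE_TIME_RE = re.compile(r"\bresponse_time=(\d+(?:\.\d+)?)")


def slow_requests(lines, threshold):
    """Yield (response_time, line) for each line whose response_time exceeds threshold."""
    for line in lines:
        match = RESPONSE_TIME_RE.search(line)
        if match and float(match.group(1)) > threshold:
            yield float(match.group(1)), line


# Hypothetical sample lines; the third has no response_time field and is skipped.
sample = [
    "GET /home status=200 response_time=0.08",
    "GET /report status=200 response_time=4.2",
    "GET /health status=200",
]

for response_time, line in slow_requests(sample, threshold=1.0):
    print(f"SLOW ({response_time}s): {line}")
```

In practice the print call would be replaced by whatever tagging or notification mechanism your team uses.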
2. Memory Issues and Garbage Collection
Out-of-memory errors can be pretty catastrophic, as they often result in the application crashing due to lack of resources. You want to know about these when they occur, so creating tags and generating notifications via alerts for these events is always recommended.
Your garbage collection behavior can be a leading indicator of out-of-memory issues. Tracking this and getting notified if heap used vs free heap space is over a particular threshold, or if garbage collection is taking a long time, can be particularly useful and may point you toward memory leaks. Identifying a memory leak before an out-of-memory exception can be the difference between a major system outage and a simple server restart until the issue is patched.
Furthermore, slow or long garbage collection can also be the reason a user experiences slow application behavior. During garbage collection, your system can slow down; in some situations it blocks until garbage collection is complete (e.g. with ‘stop the world' garbage collection).
Below are some examples of common patterns used to identify the memory-related issues outlined above:
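A rough sketch of the two garbage collection checks mentioned above (heap used vs free space, and long collection times) might look like this. The GcSample fields and the default thresholds are hypothetical; they would need to be mapped onto whatever your runtime actually logs.

```python
from dataclasses import dataclass


@dataclass
class GcSample:
    """One garbage collection reading; field names are assumptions for illustration."""
    heap_used_mb: float
    heap_free_mb: float
    gc_pause_ms: float


def gc_warnings(sample, max_used_ratio=0.9, max_pause_ms=500.0):
    """Return a list of warnings for a sample; the thresholds are illustrative defaults."""
    warnings = []
    total = sample.heap_used_mb + sample.heap_free_mb
    # Heap used as a share of total heap; skip the check when there is no data.
    if total > 0 and sample.heap_used_mb / total > max_used_ratio:
        warnings.append("heap usage above threshold")
    if sample.gc_pause_ms > max_pause_ms:
        warnings.append("long garbage collection pause")
    return warnings
```

A steadily rising used-heap ratio across consecutive samples is the kind of trend that may point to a leak before an out-of-memory error appears.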
- Out of memory
- exceeds memory limit
- memory leak detected
- memwatch:leak: Ended heapDiff
- GC AND stats
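To make the patterns above concrete, here is a small sketch that tags a log line with every pattern it matches. It assumes that each pattern is a case-insensitive substring search and that AND means all terms must appear in the line; this is an interpretation for illustration, not a description of any particular product's search syntax.

```python
MEMORY_PATTERNS = [
    "Out of memory",
    "exceeds memory limit",
    "memory leak detected",
    "memwatch:leak: Ended heapDiff",
    "GC AND stats",
]


def matches(pattern, line):
    """True if every AND-separated term of the pattern occurs in the line (case-insensitive)."""
    lowered = line.lower()
    return all(term.strip().lower() in lowered for term in pattern.split(" AND "))


def tags_for(line, patterns=MEMORY_PATTERNS):
    """Return the patterns that match the given log line, in list order."""
    return [pattern for pattern in patterns if matches(pattern, line)]
```

Because the match is a plain substring test, short terms such as GC can also match inside longer words, so tighter patterns may be needed on noisy logs.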
3. Deadlocks and Threading Issues
Deadlocks can occur in many shapes and sizes and can have a range of negative effects, from simply slowing your system down to bringing it to a complete halt. In short, a deadlock is a situation in which two or more competing actions are each waiting for the other to finish, and thus neither ever does. For example, we say that a set of processes or threads is deadlocked when each member of the set is waiting for an event that only another member of the set can cause.
Not surprisingly, deadlocks are among the top five system performance issues that our users write patterns to detect in their systems.
Most deadlock patterns simply contain the keyword ‘deadlock', but some of the common patterns have the following structure:
- ‘Deadlock found when trying to get lock'
- ‘Unexpected error while processing request: deadlock;'
4. High Resource Usage (CPU/Disk/Network)
In many cases, a slowdown in system performance is not the result of a major software flaw but simply of the load on your system increasing without additional resources being available to handle it. Tracking resource usage lets you see when you need additional capacity so that you can, for example, kick off more server instances.
Example patterns used when analyzing resource usage:
- metric=/CPUUtilization/ AND minimum>X
- disk is at or near capacity
- not enough space on the disk
- java.io.IOException: No space left on device
- insufficient bandwidth
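The first pattern above (a CPUUtilization metric whose minimum exceeds X) could be checked along these lines. The key=value line layout and the function names are assumptions for illustration. One reading of why the pattern uses the minimum is that, if even the lowest reading in an interval is above the threshold, the CPU was busy for the whole interval rather than briefly spiking.

```python
import re

# Assumed line layout: space-separated key=value pairs.
KV_RE = re.compile(r"(\w+)=(\S+)")


def parse_fields(line):
    """Parse key=value pairs from a log line into a dict of strings."""
    return dict(KV_RE.findall(line))


def cpu_sustained_high(line, threshold):
    """True if the line is a CPUUtilization metric whose minimum exceeds threshold."""
    fields = parse_fields(line)
    if fields.get("metric") != "CPUUtilization":
        return False
    try:
        return float(fields["minimum"]) > threshold
    except (KeyError, ValueError):
        # Missing or non-numeric minimum: treat as not matching.
        return False
```

A match here is a reasonable trigger for adding capacity, whereas a high maximum alone may only indicate a short burst.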
5. Database Issues and Slow Queries
Knowing when a query failed can be useful, since it allows you to identify situations when a request may have returned without the relevant data and thus helps you identify when users are not getting the data they need. There can be more subtle issues, however, such as when a user is getting the correct results but the results are taking a long time to return. While technically the system may be fine (and bug-free), a slow user experience hurts your top line.
Tracking slow queries allows you to see how your DB queries are performing. Setting acceptable thresholds for query time and reporting on anything that exceeds these thresholds can help you quickly identify when your users' experience is being affected.
Example patterns used to identify failed or slow queries:
- SQL Timeout
- Long query
- Slow query
- WARNING: Query took longer than X
- Query_time > X
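As a sketch of the Query_time > X pattern, the snippet below pulls query times out of log lines and keeps those above a threshold. It assumes lines shaped like the header comment in a MySQL slow query log ("# Query_time: 2.000123  Lock_time: ..."); other databases log this differently, and the function name is hypothetical.

```python
import re

# Matches e.g. "# Query_time: 12.500000  Lock_time: 0.000100 ..."
QUERY_TIME_RE = re.compile(r"Query_time:\s*(\d+(?:\.\d+)?)")


def slow_query_times(lines, threshold_seconds):
    """Return the query times (in seconds) that exceed threshold_seconds."""
    times = []
    for line in lines:
        match = QUERY_TIME_RE.search(line)
        if match:
            query_time = float(match.group(1))
            if query_time > threshold_seconds:
                times.append(query_time)
    return times
```

Lines without a Query_time field, such as the SQL statement itself, are simply ignored.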
Continued Log Data Analysis
As always, let us know if you think we have left out any important issues that you like to track in your log data. To start tracking your own system performance, sign up for our free log management tool and use the patterns listed above to automatically create tags and alerts relevant to your system.