System Monitoring and Troubleshooting

At a Glance:

System monitoring and troubleshooting are fundamental parts of an IT team’s responsibilities. While frameworks like NIST and ITIL offer guidelines for monitoring, these standards often leave a lot of room for interpretation, and implementing a monitoring strategy can be daunting. The sections below provide an overview of the what, where, when, and how of monitoring your IT environment.

Data Types to Monitor

One way to think about monitoring your environment is to consider data in three categories.

First is log data, which can be defined as any data written to a log file, whether it follows a structured format or is simple text. Log data provides a detailed record of the transactions occurring across your IT environment. Second is asset data, which refers to any data taken directly from the asset. This can range from basic resource metrics like CPU and memory to information about the processes and applications running on a given IT asset. Asset data can be particularly useful when monitoring for events that wouldn’t normally be captured in standard log files. Last is network data, which refers to data that’s specific to network performance, including bandwidth, network connection details, and routing behavior.

While monitoring all three of these data types is fundamental to mature IT operations, system monitoring typically focuses on analyzing log data and asset data.
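Because log data can arrive as either structured records or simple text, a collector usually has to handle both. The sketch below is one minimal way to do that in Python, assuming structured lines are JSON objects and everything else is treated as a plain message. The function name, field names, and sample lines are made up for illustration.

```python
import json

def parse_log_line(line):
    """Return a dict for one log line, whether it is JSON or plain text."""
    line = line.strip()
    try:
        record = json.loads(line)
        if isinstance(record, dict):
            # Structured line: keep its fields and flag it as structured.
            record["structured"] = True
            return record
    except json.JSONDecodeError:
        pass
    # Anything else is kept whole as an unstructured message.
    return {"message": line, "structured": False}

# Hypothetical sample lines: one JSON record, one syslog-style text line.
lines = [
    '{"level": "error", "message": "disk full", "host": "db01"}',
    "Jan 12 03:14:07 web01 sshd[412]: Accepted password for alice",
]
records = [parse_log_line(line) for line in lines]
```

Normalizing both kinds of lines into the same dictionary shape makes it easier to search and filter them together later.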

Systems to Monitor

There are a lot of systems you could potentially monitor, and the ones you select will ultimately depend on your environment. Options may include:

Servers: Server monitoring covers a broad range of systems, including servers hosting applications, Active Directory Domain Controllers, file shares, and email servers. Whether it’s a Windows, Linux, or macOS machine, most servers will offer some degree of event logging.

Databases: Many databases offer different logging levels to help administrators debug errors and identify issues that are on the horizon. Typical events logged from databases can include slow queries and SQL timeouts, row limits, memory limitations, and cache issues.

Applications: Applications include both third-party applications you’ve purchased and applications that have been developed in-house. Some third-party applications will write logs to their host, which can then be collected. Applications developed by your internal team should also be built to log important events that can be captured. Consider whether these applications are customer-facing or employee-facing. While application performance monitoring is important regardless of application audience, customer-facing applications and services may deserve more verbose logging.

Cloud Services: Cloud services, especially infrastructure-as-a-service solutions like AWS and Azure, are instrumental to a system monitoring plan. These services may offer log viewing functionality within the service itself, but you can also collect and store logs outside of these services. Collecting and storing all of your logs in a single location can make it easier to find information later.

Containers: Containerization is becoming a popular approach to architecting and hosting both applications and infrastructure, thanks to services like Docker. As infrastructure becomes more compartmentalized, more ephemeral, and more dependent on code than physical machines, monitoring containers becomes an important part of tracking system health.

Employee Workstations: When software or processes on an employee’s machine are in conflict or perhaps flooding your network with packets, being able to see what’s running on an employee’s workstation is necessary. It’s important to be able to do this remotely, as tracking down the physical asset can be time-consuming or not feasible.

Events and Metrics to Monitor

Errors: Logging application and system errors is an easy choice, and the keyword “error” often serves as a good starting place for IT investigations. Some systems categorize errors by type, which can provide indications of which events to pay attention to. 

CRUD events: In general, capturing when information is created, read, updated, or deleted can be useful for debugging issues later, especially in applications. While these events won’t often provide direct indications of an issue, they can be excellent sources of information when tracing an issue back to its root cause.

Transactions: “Transactions” often refer to important events like purchases, subscriptions, cancellations, and submissions. Transactions should be closely monitored for failures and incomplete attempts. Depending on the system, error codes may get logged with important information on what caused the transaction issue. Some database systems, like Microsoft SQL Server, maintain a dedicated transaction log that records database transactions and the modifications they make. For business-level transactions, you may need to centralize this information yourself.

Access requests and permission changes: Logging from a service like Active Directory can offer an important view of user behavior in your environment. Monitoring and collecting data on things like permission changes can help you detect and prevent users from getting unintended admin rights. This type of monitoring is often necessary to meet certain compliance standards.

System metrics: System metrics like CPU, memory, and disk utilization should be closely monitored at all times to prevent system failure. Dramatic changes in these values could indicate an outage or an impending outage. Collecting these metrics over longer periods of time can also help with capacity planning in the future.
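To make the idea of a “dramatic change” in a system metric concrete, here is a minimal Python sketch that flags samples differing sharply from the recent average. The window size, the 30-point threshold, and the sample CPU readings are illustrative assumptions, not recommended values.

```python
from statistics import mean

def find_spikes(samples, window=5, max_jump=30.0):
    """Return indexes of samples that differ from the average of the
    preceding `window` samples by more than `max_jump` points."""
    spikes = []
    for i in range(window, len(samples)):
        baseline = mean(samples[i - window:i])
        if abs(samples[i] - baseline) > max_jump:
            spikes.append(i)
    return spikes

# Hypothetical CPU utilization readings (percent), one per minute.
cpu_percent = [22, 25, 24, 23, 26, 24, 91, 93, 25, 24]
spike_indexes = find_spikes(cpu_percent)
```

Comparing each reading against a rolling baseline, rather than a fixed limit, lets the same check work on systems with very different normal load levels. A real monitoring tool will usually offer more robust anomaly detection than this.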

How to Monitor

Given the breadth of systems, events, and metrics to monitor, centralizing your data collection into a single source of truth is often a good decision, especially because logs stored only on a system can be hard to reach if that system goes down. There are log management solutions available to collect, centralize, and organize logs in a way that makes them easy to search, visualize, and generate alerts from.
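As a rough illustration of what centralizing looks like, the Python sketch below merges logs from two hosts into one time-ordered stream and then searches it for the keyword “error”. It assumes each line starts with an ISO 8601 timestamp and that each source is already in time order. The host names and log lines are invented for the example.

```python
import heapq
from datetime import datetime

def parse(source, line):
    """Split 'ISO-timestamp message' into (timestamp, source, message)."""
    stamp, message = line.split(" ", 1)
    return (datetime.fromisoformat(stamp), source, message)

def centralize(sources):
    """Merge per-source logs (each already in time order) into one stream."""
    streams = [
        [parse(name, line) for line in lines]
        for name, lines in sources.items()
    ]
    return list(heapq.merge(*streams))

# Hypothetical logs from two hosts.
sources = {
    "web01": ["2024-03-01T10:00:00 GET /checkout 200",
              "2024-03-01T10:00:05 ERROR upstream timeout"],
    "db01": ["2024-03-01T10:00:03 ERROR too many connections",
             "2024-03-01T10:00:09 checkpoint complete"],
}
timeline = centralize(sources)
errors = [entry for entry in timeline if "error" in entry[2].lower()]
```

With everything in one timeline, it becomes visible that the database error came two seconds before the web server timeout, which is the kind of cross-system clue that is hard to spot when logs live on separate machines.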

Monitoring may also expand beyond log management to include the monitoring of individual IT assets. This type of monitoring includes the ongoing measurement of resource utilization metrics and the tracking of software and processes running on the assets. Software usage isn’t often captured in traditional logs but can offer vital clues to the root cause of system issues. Being able to not only measure IT asset data but also log the results offers significant visibility across your IT environment.
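One simple way to both measure asset data and log the results is to take periodic snapshots and write them out as structured records. The Python sketch below uses only the standard library and collects a handful of basic facts. The field names are assumptions, and process or software inventories are left out because they typically need platform-specific tools or a third-party library.

```python
import json
import os
import platform
import shutil
from datetime import datetime, timezone

def asset_snapshot(path=None):
    """Collect a few basic facts about the local machine as a dict."""
    path = path or os.getcwd()
    usage = shutil.disk_usage(path)
    used_percent = usage.used / usage.total * 100 if usage.total else 0.0
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "host": platform.node(),
        "os": platform.system(),
        "cpu_count": os.cpu_count(),
        "disk_used_percent": round(used_percent, 1),
    }

snapshot = asset_snapshot()
# One JSON line per snapshot is easy to ship to a central log store.
log_line = json.dumps(snapshot)
```

Run on a schedule, snapshots like this build up the longer-term history that capacity planning depends on.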

When to Monitor

In short, system monitoring should be happening 24/7 if your systems need to maintain constant availability. Often, monitoring can happen in the background without you needing to pay constant attention. With that said, there are some occasions when you should expect to keep an active eye on your system data, including:

System updates: Any time a system is being updated, there is a risk of the update failing or causing unintended complications.

Application deployments and rollbacks: When deploying code (or rolling back code) to applications, there could be unexpected issues, even if all unit tests and acceptance tests pass.

Migrations: Data migrations can often be challenging and present issues with mismatched data types, validation issues, and more.

Peak transaction times: Some businesses have known periods of increased transactions, such as e-commerce companies during holidays or sales. Availability issues occurring during these peak times could have significant consequences if not caught quickly.
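For events like deployments, one way to keep an active eye on things is to compare error counts just before and just after the change. The Python sketch below assumes events have already been collected as (timestamp, level) pairs. The 15-minute window, the level names, and the sample data are illustrative only.

```python
from datetime import datetime, timedelta

def error_counts_around(events, deployed_at, window=timedelta(minutes=15)):
    """Count error events in equal windows before and after a deployment.

    `events` is a list of (timestamp, level) tuples.
    """
    before = after = 0
    for stamp, level in events:
        if level != "error":
            continue
        if deployed_at - window <= stamp < deployed_at:
            before += 1
        elif deployed_at <= stamp < deployed_at + window:
            after += 1
    return before, after

# Hypothetical deployment time and collected events.
deployed_at = datetime(2024, 3, 1, 14, 0)
events = [
    (datetime(2024, 3, 1, 13, 50), "error"),
    (datetime(2024, 3, 1, 13, 55), "info"),
    (datetime(2024, 3, 1, 14, 2), "error"),
    (datetime(2024, 3, 1, 14, 4), "error"),
    (datetime(2024, 3, 1, 14, 9), "error"),
    (datetime(2024, 3, 1, 14, 30), "error"),
]
before, after = error_counts_around(events, deployed_at)
```

A noticeable rise in the second number is a prompt to look closer, and possibly to roll back. The same comparison can be applied around system updates or migrations.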

There are a lot of factors that go into IT system monitoring and troubleshooting. By breaking down your IT environment into which systems and events you need to monitor, you’ll be one step closer to determining the best monitoring strategy and solution for your organization.