3 Steps to Building an Effective Log Management Policy

You’re on call. You’re awoken in the middle of the night by your cell phone in the throes of an SMS frenzy. You’re getting hundreds of messages from your company’s logging service: a record is being written to a database, code is being executed, a new container is being spun up, and on and on. None of these messages matter to you. You just turn off your phone and go back to sleep.

The next day you go into the office only to find out that half the racks in your datacenter went offline during the night. Thousands of customers were without service. You never got the FATAL message. How could you? Your phone was off. Still, you’re the one responsible. You’ll be lucky to keep your job. You do some snooping around only to find out that some junior system admin made it so that the logging service forwarded ALL log messages to your phone regardless of severity.

Later that afternoon you get called into the HR office. The HR Person is there; so is your boss. Your boss asks what you were thinking when you turned off your phone while on call. Your only defense is that the phone was driving you crazy.

There is a silence in the room. Finally, the HR Person asks your boss, “Why was he getting barraged with messages? What is your logging policy?”

Your boss responds meekly, “We don’t have one.”

If you think this scenario is unique, think again. Many companies don’t implement a logging policy until it’s too late.

An enterprise-wide logging policy gives a company consistent, reliable access to the information it needs to operate properly. It also offers a good return on investment.

Implementing an enterprise-wide logging policy is a three-step process: establish the policy; communicate it throughout the company; then, once the policy is in play, enforce it.

Allow me to elaborate.

Establishing a Logging Policy

When it comes to establishing a logging policy, there are two good rules of thumb:

Less is better
Keep it simple

Effective policies are short, direct, and easy to follow. When a policy becomes too complex, people will either forget the fine points or just decide to ignore the rules completely.

Defining Log Levels

There are two parts to logging: sending entries in and getting entries out. In terms of sending log entries in, a good policy is to define each level of logging and when a particular level is to be used.

Level	Description	Example	Interested parties
DEBUG	Information relevant to programmers and system developers	log.debug("entering function foo()")	Persons maintaining the application or system who need detailed insight into the code’s operation, typically developers
INFO	Information that describes operational events	log.info(JSON.stringify(eventData))	Persons monitoring application or system operation: developers, Q/A personnel, system admins
WARN	Information indicating a dangerous condition	log.warn("Memory running low, current usage: 15.9GB, max capacity: 16GB")	Persons monitoring application or system operation: developers, Q/A personnel, system admins
ERROR	Information that describes an application or system error	log.error(JSON.stringify(exception))	Persons monitoring application or system operation: developers, Q/A personnel, system admins
FATAL	Information that a catastrophic state exists	log.fatal("Region US-WEST-1 unavailable")	Persons responsible for the ongoing health and proper operation of the application or system: developers, Q/A personnel, system admins, operations management

Table 1: Log Levels with description, example, and relevant parties
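
The levels in Table 1 can be sketched in code as an ordered set of severities with a threshold. This is a minimal illustration, not a real logging library: createLogger, the numeric ranks, and the entry shape are all assumptions made for the example.

```javascript
// Severity levels from Table 1, ordered from least to most severe.
// The numeric ranks are arbitrary; only their order matters.
const LEVELS = Object.freeze({ DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40, FATAL: 50 });

// createLogger is a hypothetical helper, not part of any real library.
// `minLevel` is the lowest severity that gets written; `sink` receives each entry.
function createLogger(minLevel, sink = (entry) => console.log(JSON.stringify(entry))) {
  if (!Object.keys(LEVELS).includes(minLevel)) {
    throw new Error(`Unknown log level: ${minLevel}`);
  }
  const logger = {};
  for (const [name, rank] of Object.entries(LEVELS)) {
    // Creates log.debug, log.info, log.warn, log.error, and log.fatal.
    logger[name.toLowerCase()] = (message) => {
      if (rank < LEVELS[minLevel]) return false; // below the threshold: dropped
      sink({ level: name, message, timestamp: new Date().toISOString() });
      return true;
    };
  }
  return logger;
}

// Usage: with a threshold of WARN, the debug entry is dropped.
const captured = [];
const log = createLogger("WARN", (entry) => captured.push(entry));
log.debug("entering function foo()");
log.warn("Memory running low, current usage: 15.9GB, max capacity: 16GB");
log.fatal("Region US-WEST-1 unavailable");
console.log(captured.map((entry) => entry.level)); // WARN and FATAL remain
```

Keeping the level definitions in one shared place means every team uses the same names and the same ordering.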

Once log level usage is defined, a policy for storing log data needs to be defined.

Defining Storage Constraints

A typical pattern with companies that embrace logging is that they want to log everything. Yet storing everything, all the time, without constraints comes with risks. Log files can grow large enough to consume all available disk space. Still, there can be good reason to store error information on the hosting machine: for example, it keeps information readily available at the source when inspecting a troubled machine. Thus, a reasonable storage policy is to collect error information directly on the machine generating the error, in addition to sending entries of all log levels to a scalable logging service. Another part of the policy is to store data according to log level, for example by sending WARN and ERROR data to a repository where predictive analysis can be performed.
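
One way to picture such a storage policy is as a routing table from log level to destination. The sketch below is only an illustration under stated assumptions: the destination names ("local", "service", "analysis"), the boundedStore helper, and the choice to keep FATAL entries on the host are all hypothetical, and the in-memory arrays stand in for real storage.

```javascript
// Hypothetical storage policy: which destinations receive each level.
// "local" = the generating host, "service" = central logging service,
// "analysis" = repository used for predictive analysis.
const STORAGE_POLICY = new Map([
  ["DEBUG", ["service"]],
  ["INFO", ["service"]],
  ["WARN", ["service", "analysis"]],
  ["ERROR", ["local", "service", "analysis"]],
  ["FATAL", ["local", "service"]], // assumption: FATAL is also kept on the host
]);

// Keeps only the newest `maxEntries` entries so a store cannot grow without bound.
function boundedStore(maxEntries) {
  const entries = [];
  return {
    entries,
    write(entry) {
      entries.push(entry);
      if (entries.length > maxEntries) entries.shift(); // discard the oldest
    },
  };
}

// Writes an entry to every destination the policy lists for its level.
function route(entry, stores) {
  const destinations = STORAGE_POLICY.get(entry.level);
  if (!destinations) throw new Error(`Unknown log level: ${entry.level}`);
  for (const name of destinations) stores[name].write(entry);
  return destinations;
}

// Usage: the local store is capped at 2 entries; the others are uncapped stand-ins.
const stores = {
  local: boundedStore(2),
  service: boundedStore(Infinity),
  analysis: boundedStore(Infinity),
};
for (const level of ["INFO", "WARN", "ERROR", "ERROR", "ERROR"]) {
  route({ level, message: `sample ${level} entry` }, stores);
}
console.log(stores.local.entries.length, stores.service.entries.length);
```

Capping the local store addresses the disk-space risk, while the central service still receives every entry.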

Defining Access and Notification Policy

Getting log data out depends on the operational needs and security policies of the company in question. In terms of notification, we don’t want to have a scenario like the one described at the beginning of this piece. Too much information will turn into noise. Thus, a notification policy needs to answer the simple questions: who gets notified about what and how does the notification take place?
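
The "who gets notified about what, and how" question can be written down as data. In the sketch below, the recipients, the channels, and the notify helper are placeholders invented for illustration; the point is only that levels with no rule generate no notification at all.

```javascript
// Hypothetical notification policy: who is notified about what, and how.
// Recipient and channel names are placeholders, not recommendations.
const NOTIFICATION_POLICY = new Map([
  ["FATAL", [
    { who: "on-call engineer", how: "sms" },
    { who: "operations management", how: "email" },
  ]],
  ["ERROR", [{ who: "owning team", how: "email" }]],
  ["WARN", [{ who: "owning team", how: "dashboard" }]],
]);

// Levels without a rule (DEBUG and INFO here) notify nobody.
function notificationsFor(entry) {
  return NOTIFICATION_POLICY.get(entry.level) || [];
}

// `send` is supplied by the caller; it would wrap a real SMS or email gateway.
function notify(entry, send) {
  const targets = notificationsFor(entry);
  for (const { who, how } of targets) send(how, who, entry.message);
  return targets.length;
}

// Usage: the INFO entry is silent; the FATAL entry reaches the on-call phone.
const sent = [];
const recordSend = (how, who, message) => sent.push({ how, who, message });
notify({ level: "INFO", message: "a record was written to the database" }, recordSend);
notify({ level: "FATAL", message: "Region US-WEST-1 unavailable" }, recordSend);
console.log(sent.map((s) => `${s.how} -> ${s.who}`));
```

Under a policy like this, the on-call phone in the opening scenario would have received one message instead of hundreds.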

The same questions can be applied to accessing log information. There is a lot of sensitive information stored in logs that is useful to those with evil intentions. Yet many companies treat log data as information that is “just some stuff about our systems.” Those in the know understand otherwise. Thus, having a policy that covers log access is critical. Just ask those in IT Governance who struggle to meet PCI-DSS compliance on a consistent basis.
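
An access policy can be sketched the same way: decide per role what a reader is allowed to see. The role name, the list of sensitive fields, and the viewFor helper below are assumptions made for illustration only, and the masking is shallow (top-level fields only).

```javascript
// Hypothetical sensitive field names and privileged roles;
// a real policy would define its own.
const SENSITIVE_FIELDS = ["password", "authToken", "email"];
const ROLES_WITH_FULL_ACCESS = new Set(["security-auditor"]);

// Returns a copy of the entry as the given role is allowed to see it.
// Only top-level fields are masked; nested objects are not inspected.
function viewFor(role, entry) {
  const view = { ...entry };
  if (ROLES_WITH_FULL_ACCESS.has(role)) return view;
  for (const field of SENSITIVE_FIELDS) {
    if (field in view) view[field] = "[REDACTED]";
  }
  return view;
}

// Usage: a developer sees the event but not the token.
const loginEntry = { level: "ERROR", message: "login failed", authToken: "abc123" };
const developerView = viewFor("developer", loginEntry);
const auditorView = viewFor("security-auditor", loginEntry);
console.log(developerView.authToken, auditorView.authToken);
```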

Communicating a Logging Policy

Getting employees to pay attention to any policy, let alone a policy as ambiguous as logging, is a hard undertaking. Still, for a logging policy to be effective, it has to be known.

The best way to communicate a logging policy is to create a message that will encourage employees to learn about the policy. Once defined, the message needs to be publicized repeatedly. Such communication requires a good deal of creativity. The message could take the form of a listicle published on the IT Operations intranet, titled “7 Facts in the Logging Policy That Will Amaze Your Friends and Impress Your Relatives.” Or it could be an announcement by the department head at the beginning of the weekly IT operations meeting.

Enforcing a Logging Policy

Once a policy has been defined and communicated, it needs to be enforced. The easiest way to enforce a logging policy is to automate as much of it as possible. At the code level, this means building logging into the code base and programming tools that the company's developers use. For example, techniques such as Python decorators, Java annotations, or .NET attributes can generate log events automatically (BeforeSave and AfterSave, for instance). You can also build a set of logging plugins that can be added to a developer's Integrated Development Environment (IDE).
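
As a rough illustration of automatic log events, the sketch below uses a plain JavaScript higher-order function rather than a Python decorator, Java annotation, or .NET attribute; it plays the same role of wrapping a function so that developers do not write the log calls themselves. The withSaveLogging helper, the SaveFailed event name, and the in-memory logger are all hypothetical.

```javascript
// withSaveLogging is a hypothetical helper: it wraps any save function so that
// BeforeSave and AfterSave events are logged automatically.
function withSaveLogging(save, log) {
  return function loggedSave(record) {
    log.info(JSON.stringify({ event: "BeforeSave", record }));
    try {
      const result = save(record);
      log.info(JSON.stringify({ event: "AfterSave", record }));
      return result;
    } catch (err) {
      // SaveFailed is an assumed event name, not one from the policy above.
      log.error(JSON.stringify({ event: "SaveFailed", message: err.message }));
      throw err; // logging must not swallow the failure
    }
  };
}

// Usage with an in-memory logger standing in for the real logging service.
const events = [];
const memoryLog = {
  info: (message) => events.push({ level: "INFO", message }),
  error: (message) => events.push({ level: "ERROR", message }),
};
const saveUser = withSaveLogging((record) => ({ ...record, id: 1 }), memoryLog);
const saved = saveUser({ name: "Ada" });
console.log(saved.id, events.length);
```

Because the wrapper lives in shared code, the policy is applied the same way wherever a save function is wrapped.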

Another technique is to get somebody in IT Operations to volunteer to be a Policy Wonk. At the team level, the Policy Wonk takes the lead ensuring that logging policy is supported in the code review process and as part of DevOps post mortems. At the departmental level, the Policy Wonk can provide expertise at pre-release review.

Another surprisingly effective way to enforce policy is by using tokens of recognition. Yes, the technique is a bit tongue-in-cheek, but as those of us who have broken the nightly build can attest, the Build Breaker Trophy works. People like recognition. Giving an award for supporting a logging policy might seem a bit juvenile. But it works.

The person in the Policy Wonk role should not be heavy-handed. The job is to encourage others to follow the policy at hand, not to become a brute-force enforcer. The goal is to get people to comply willingly, not to acquiesce grudgingly. Without a willingness to comply, people will just slip back into old behaviors.

Putting It All Together

Implementing a simple, easy-to-follow logging policy is an important, cost-effective step toward improving log management within the enterprise. However, the key to maintaining a long-term logging policy is for the enterprise to be willing to change course to meet the conditions at hand. Logging technology will change, as will the scope of logging. The most important thing is to make a commitment throughout IT operations to have a culture that supports a logging policy and to have the flexibility to change the policy to meet the new demands that will appear on the horizon.