
Synopsis:

In this article I will share some ideas from my own experience that streamline and take a lot of the work out of managing a Nagios deployment. I will go over multiple ways to manage your deployment, working toward a more complete solution as you read on. We will begin with Git and cron, extend that to use subtrees, and then move on to an enterprise deployment with Puppet and ERB along with the aforementioned tools.

Git:

My philosophy is that just about everything should be under version control, including most of your Nagios configuration. Git, a version control (or source code management) tool, keeps track of your file changes. The benefits of using Git for Nagios, as with any other project, include but are not limited to the following:

  1. Git is distributed, so it can serve as a backup of your Nagios plug-ins and configuration across many hosts
  2. It keeps a history of all changes with a comment for each one (Solves: “I can’t remember why I did that last month.” and “Ah! Bob was the one who introduced that feature, nice.”)
  3. The ability to revert files to any point in the repository’s lifecycle
    (Solves: “I just broke Nagios, it won’t start, and I can’t figure out what I did wrong.”)
  4. Improved collaboration with team members (Solves: “Bob, can I get that Nagios config file you were working on?”)
  5. Synchronization of repositories across systems (Solves: “Crap, now I have to copy these new Nagios plug-ins to 100+ hosts.”)
  6. No need to log into the Nagios server to change the configuration directory once pulls are automated
  7. Statistics for managers (Solves: “Who’s making the most contributions?”)
  8. A learning tool for the team through reviewing commits (Solves: “Oh, that’s how Bob accomplished that, smart guy.”)

If you do not have an internal Git server then I highly recommend the Github.com service. It’s very inexpensive, and free for public repositories. If you would rather host your own Git server, I’ve come to like the community edition of Gitlab, which has a sleek web interface similar to Github’s. Bitnami also provides the entire Gitlab software stack in a single installer, so it’s nearly effortless to get it up and running. If you do not want a web interface, you can serve Git over SSH, the filesystem, HTTP, etc.

In the solutions for this section I propose at least two repositories: one to store your Nagios configuration files and another to store your Nagios plug-ins.

I store a number of the Nagios plug-ins that I have written on Github.com, where anyone can use them, modify them, or contribute improvements back, all easily with Git. To obtain them, clone the repository:

 git clone https://github.com/jonschipp/nagios-plugins

Using Git to Automate the Distribution of Nagios Plug-ins:

Copying custom Nagios plug-ins, as they become available, to all your monitored systems is a pain.

Fortunately, there’s a cheap and quick solution using Git. It’s outlined below:

  1. Store all, or most, Nagios plugins your organization uses in a Git repository accessible from the network.
  2. Install the Git software on all the monitored machines
  3. Configure authentication to the repository (if required) for each client system e.g. with SSH keys
  4. Create an hourly cron job or task on each machine to pull down the latest plug-ins and changes from the repository

That’s it. You’ll never have to worry about copying plug-ins again. Any new plug-ins added to the repository, or changes to existing ones, will be applied to every host on the hour. Now you just make your changes, push them to your Nagios plug-in repository, take a break or work on something else, and at the top of the hour the hosts will be updated with the latest changes.

If your plug-ins do not hold any confidential or company data then it may be okay to skip authenticating to the repository, but only for reads. Let your security team make the call.
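
If you do need authenticated, read-only access, one option (a rough sketch; the key path and repository URL are placeholders) is a dedicated SSH key registered with your Git server as a read-only deploy key:

 # On each monitored host, as root: generate a key, then add the public key 
 # to the plug-in repository on your Git server as a read-only deploy key. 
 $ ssh-keygen -t rsa -N "" -f /root/.ssh/nagios-deploy 
 $ printf 'Host github.com\n  IdentityFile /root/.ssh/nagios-deploy\n' >> /root/.ssh/config 
 $ git clone git@github.com:company-X/nagios-plugins.git /usr/local/nagios/libexec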

Example Implementation:

  1. For brevity, no authentication
  2. Using default plug-in directory at /usr/local/nagios/libexec
  3. Syncing with my public Github repo

Client machines are configured to use sudo for plug-ins that require non-standard user privileges:

$ cat /etc/sudoers.d/nagios 
# Scripts will fail if nagios cannot use a tty.
Defaults:nagios !requiretty 
# Allow user nagios to run nagios scripts as root 
nagios ALL=(root) NOPASSWD: /usr/local/nagios/libexec/*

As the root user, do:

$ mkdir -p /usr/local/nagios/ 
$ cd /usr/local/nagios 
$ git clone https://github.com/jonschipp/nagios-plugins.git libexec

We created the nagios directory and cloned the repository into the default plug-in directory called libexec.

Let’s create the cron job, run as the root user, to automatically sync with the repository.

$ grep nagios /etc/crontab 
@hourly root cd /usr/local/nagios/libexec && git pull

Note: Be sure to set appropriate plug-in permissions for the files in the repository so permissions get synced properly. For security reasons, the nagios user should only be able to read and execute the plug-ins.

That’s it. Now on your workstation you can clone the same repository and add a new plug-in to it like this:

 $ cd ~/myrepos 
 $ git clone https://github.com/jonschipp/nagios-plugins.git 
 $ cd nagios-plugins 
 $ cp ~/new/plugins/check_everything.sh ~/myrepos/nagios-plugins/ 
 $ git add check_everything.sh 
 $ git commit -m "adding new plug-in to try out called check_everything" 
 $ git push

Then at the next hour, all the machines will have that plug-in. Plug-in removal can be done with:

 $ git rm check_everything.sh 
 $ git commit -m "removing check_everything, because it doesn't really check everything" 
 $ git push

Again, at the next hour, the plug-in will be gone from all those systems. If an hour is not often enough adjust the cron job time as needed.

Using Git to Manage the Nagios Configuration:

Now let’s add the /usr/local/nagios/etc directory to version control so it is easy to keep track of all config changes. One of the big benefits, in my opinion, is being able to make changes to the server configuration, e.g. through Github, without having to log into the Nagios server itself. Another benefit is being able to spin up a new Nagios server with the exact same configuration by installing the dependencies and then cloning the repository.

To get our Nagios server’s configuration into a repository, create a new repository on Github.com, your Git server, etc. Then log into the Nagios server and go to the configuration directory:

 $ ssh nagios-server.company.com 
 $ apt-get install git 
 $ cd /usr/local/nagios/etc 
 $ git init 
 $ git add . 
 $ git commit -m "initial commit adding everything in the directories" 
 $ git remote add origin git@github.com:company-X/nagios-config.git 
 $ git push origin master

Now we have the Nagios server configuration in a repository. Changes can be made to the configuration and pushed back up e.g.:

 $ cd /usr/local/nagios/etc/objects/ 
 $ vim templates.cfg 
 $ git add templates.cfg 
 $ git commit -m "increased max_check_attempts to 2 for critical host template" 
 $ git push

Or, you can set up a cron job on the Nagios server that runs every 5 minutes to pull down the latest configuration from the remote repository. This way, we can make our changes from workstations and avoid having to log into the Nagios server.

 # Nagios config 
 */5 * * * * root cd /usr/local/nagios/etc && git pull

Note 1:
If you do not need to store the configuration on another server, it’s more efficient to make a repository out of the Nagios configuration directory and push directly to the server over SSH. The cron job on the server can then be removed, and the changes will be applied by a hook as soon as the server merges in the pushed data. The one stipulation is that you need to be able to contact the Nagios server directly.
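
One common way to implement this (a sketch; the /opt/git path is just a placeholder) is a bare repository on the Nagios server with a post-receive hook that checks the configuration out into /usr/local/nagios/etc and runs the same validate-and-restart logic:

# On the Nagios server, one-time setup
$ git init --bare /opt/git/nagios-config.git
$ cat /opt/git/nagios-config.git/hooks/post-receive
#!/bin/bash
# Check the pushed configuration out into the live directory,
# then validate it and restart Nagios only if the check passes.
GIT_WORK_TREE=/usr/local/nagios/etc git checkout -f master
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg && service nagios restart
$ chmod +x /opt/git/nagios-config.git/hooks/post-receive

# On your workstation
$ git remote add nagios-server ssh://root@nagios-server.company.com/opt/git/nagios-config.git
$ git push nagios-server master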

Note 2:
Whichever method you choose, stick with it; otherwise it’s possible to create conflicts, e.g. by making a change on the server without syncing it to the repository and then pushing another change from your workstation to the remote repository. When the cron job runs, Git will see that the two locations have diverged and will not apply the changes, and every cron pull will fail until someone manually resolves the conflicts on the server. To avoid this, have your team agree not to make changes directly on the server or on the monitored hosts, i.e. writes should only happen from employee workstations, and the server and monitored hosts should only read (pull).
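
One extra safeguard I would suggest (not part of the workflow above) is making the automated pulls fast-forward only, so a checkout that has diverged fails immediately instead of attempting a merge:

 */5 * * * * root cd /usr/local/nagios/etc && git pull --ff-only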

Git Hooks to Apply a Nagios Configuration:

We can go further by building on the previous section to automatically put new changes into effect after they’ve been pulled down from the remote repository. For this we will use Git hooks, which are points in the Git workflow at which Git can execute scripts.

In each Git repository there is a hidden .git directory where Git stores all its information about the repository, its history, and so on. Inside it is another directory called hooks, which is where we place our script, named after the point in the Git lifecycle at which it should run. A git pull performs two steps: a git fetch and a git merge. We want our hook script to run when Git merges the new changes into the repository, so we must name it post-merge.

The following simple shell script first validates the new Nagios configuration after it is pulled down. If the configuration passes the check, it is applied by restarting Nagios and an e-mail is sent to the admin team indicating that the new configuration was successful and is now in effect. If the configuration does not pass validation, Nagios is not restarted and an e-mail is sent to the admin team mentioning the error in the configuration. In both cases the diff of the latest change is included in the e-mail to give context for the new configuration.

$ cat .git/hooks/post-merge 
#!/bin/bash 

fail(){ 
mail -s "[nagios] [broken] nagios config failed validation" sa-admin@touchofclass.com <<EOF 
$(date) 

There's an error resulting from the changes that were recently made. 
Please make a commit correcting the issue and push it back up to the nagios-configs repository. 

You can manually check the nagios configuration for syntax errors with the following command: 
``/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg'' 

The last commit was (note there may have been multiple commits): 
$(cd /usr/local/nagios/etc && git log -1 -p) 
EOF 
} 

succeed(){ 
mail -s "[admin] New nagios configuration in effect" sa-admin@touchofclass.com <<EOF 
$(date) 
Success! Nagios has been restarted after changes. 

Most recent commit: 
$(cd /usr/local/nagios/etc/ && git log -1 -p) 
EOF 
} 

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg 
if [[ $? -ne 0 ]]; then 
    fail 
    exit 1 
fi 

service nagios restart 

[[ $? -ne 0 ]] || succeed
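
Git only runs hooks that are marked executable, so remember to flag the script accordingly:

 $ chmod +x .git/hooks/post-merge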

Using Git to Manage the NRPE Configuration:

NRPE, the ultimate Nagios plug-in for Linux machines, has a configuration file that is a good candidate for a Git repository, because:

  1. There is only a single NRPE configuration file
  2. All machines can share the same file if done properly

We can reduce complexity by using a single NRPE config file: the file is very simple, and in most situations there’s not much need for different versions of it on different hosts. As explained in the other sections, synchronization is done with a simple cron job so that all systems have the same copy every hour.

To do this without having to worry about the machine-dependent server_address and allowed_hosts variables, we tell NRPE to include another file, i.e. our file in the repository, which is where all our custom checks will be located. This also keeps package updates from overwriting our configuration. Ubuntu’s nrpe package has the include option on by default, so I’ll be using it as the example for the rest of this section:

 $ grep ^include /etc/nagios/nrpe.cfg
 include=/etc/nagios/nrpe_local.cfg 
 # only snippets ending in .cfg will get included 
 include_dir=/etc/nagios/nrpe.d/

Notice that nrpe.cfg includes nrpe_local.cfg and any file in the nrpe.d directory with a .cfg extension. What we need to do is use the nrpe.d directory as the repository, or choose another directory and update the include_dir parameter. I’m choosing to use /etc/nagios/nrpe.d as the repository:

 $ apt-get install git 
 $ rmdir /etc/nagios/nrpe.d 
 $ git clone git@github.com:Org-X/nrpe-config.git /etc/nagios/nrpe.d

There are different ways this could be done, but I’m replacing the nrpe.d folder with our repository. Do that and then configure a cron job on each system.

# Nagios NRPE config 
@hourly root cd /etc/nagios/nrpe.d && git pull

From now on, clone the repository on your workstation, make changes, and push them back up to the remote repo; on the hour, all your machines will have the latest file. We can also add a Git hook to each machine, as explained in the section titled “Git Hooks to Apply a Nagios Configuration”, to automatically restart the NRPE daemon after the git pull so the changes take effect as soon as they arrive. Something as simple as the following would do the trick:

 $ cat .git/hooks/post-merge 
 #!/bin/bash 
 service nrpe restart

Note: In cases where it’s not possible to pass parameters to NRPE from the Nagios server and the monitored servers are not all the same, you can define multiple commands, one for each variation. The following example has multiple versions of check_load with parameters that account for the physical CPU differences of the hosts.

 # System 
 command[check_load_8_cores]=/usr/local/nagios/libexec/check_load -w 6 -c 8 
 command[check_load_16_cores]=/usr/local/nagios/libexec/check_load -w 12 -c 16 
 command[check_load_host_big_dataserver]=/usr/local/nagios/libexec/check_load -w 50 -c 64

I don’t consider this much of an issue, but extra checks will need to be added so that all systems can use the same configuration file. It’s useful to prefix the check name with the hostname if only one host uses the check, e.g. the hostname in the following example is tele:

command[tele_check_voip]=/usr/local/nagios/libexec/check_voip status

However, this can all be avoided by using Git in combination with Puppet and ERB, which will be explained later.

Consolidating Repositories with Git Subtrees:

To build upon the usage of Git, subtrees can be used to combine multiple repositories under one. Subtrees are a little more complicated than what we’ve discussed so far, so I refer you to this excellent article on creating them. A requirement is that each repository must live in its own directory somewhere under the parent. For example, create a new repository called nagios which has two subtrees as subdirectories: nagios-configs and nagios-plugins.

 $ tree nagios 
 nagios 
 |-- nagios-configs 
 `-- nagios-plugins 
 2 directories, 0 files
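
As a rough sketch, a layout like this can be created with the git subtree command that ships with modern Git (the repository URLs below are placeholders):

 $ git init nagios && cd nagios 
 $ git commit --allow-empty -m "parent repository for Nagios subtrees" 
 $ git subtree add --prefix=nagios-configs git@github.com:company-X/nagios-config.git master 
 $ git subtree add --prefix=nagios-plugins https://github.com/jonschipp/nagios-plugins.git master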

The benefit of this is that updating the subtrees pulls in all the changes from each included repository. A useful example is syncing multiple Nagios plug-in repositories, each with different plug-ins from different authors, under one main repository.

 $ tree nagios-plugins/ 
 nagios-plugins/ 
 |-- linux-plugins 
 |   |-- check_load 
 |   |-- check_procs 
 |   `-- check_syslog 
 |-- network-plugins 
 |   |-- check_bps 
 |   |-- check_interface 
 |   `-- check_pps 
 `-- windows-plugins 
     |-- check_iis 
     `-- check_windows_update 
 3 directories, 8 files

You can then create a small script to update the subtrees, which will pull in all the changes from each of the repositories. For Nagios plug-ins, one would follow the same steps as in the section titled “Using Git to Automate the Distribution of Nagios Plug-ins”, but clone the new repository containing the subtrees to libexec. libexec will then have three subdirectories, each with plug-ins.

Macros can then be added to the Nagios resource file that point to each of the subdirectories under libexec. When using check_by_ssh from the server, the command definitions use the macros to find the locations of the plug-ins on the client.

 $ cat /usr/local/nagios/etc/resource.cfg 
 $USER1$=/usr/local/nagios/libexec 
 $USER2$=/usr/local/nagios/libexec/windows-plugins 
 $USER3$=/usr/local/nagios/libexec/linux-plugins 
 $USER4$=/usr/local/nagios/libexec/network-plugins

Building on the previous example: the check_by_ssh plug-in is located in the default plug-in directory on the Nagios server, pointed to by $USER1$, but the plug-in to be executed on the client is in the directory with the Linux plug-ins ($USER3$). This means we can use the macros as shorthand to tell Nagios where to find the plug-ins.

 define command{ 
     command_name check_linux_service 
     command_line $USER1$/check_by_ssh -p 22 -H $HOSTADDRESS$ -l nagios -i /home/nagios/.ssh/$HOSTNAME$ -C 'sudo $USER3$/check_service.sh -o linux -s $ARG1$' 
 }
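
A service definition then calls the command by passing the service name as its argument, e.g. check_command check_linux_service!sshd.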

This is useful for organizations where Windows admins write Windows plug-ins and Linux admins write Linux plug-ins. Each group has its own plug-in repository, which is added to the main plug-in repository as a subtree. The monitored hosts then only need to clone the main repository to get the latest changes. Another use case is when you find a number of plug-in repositories on Github and want to stay current with the authors’ changes; subtrees solve that by allowing you to update each one independently.

I recommend adding a script to the repo to pull in the changes for each subtree. You would run this periodically to pull down the changes and then push them up to the parent repository.

 #!/bin/bash 
 if ! git remote | grep -q linux-plugins; then
     echo "adding linux-plugins remote" 
     git remote add linux-plugins https://github.com/Org-X/linux-plugins.git 
 fi 
 if ! git remote | grep -q windows-plugins; then 
     echo "adding windows-plugins remote" 
     git remote add windows-plugins https://github.com/Org-X/windows-plugins.git 
 fi 
 if ! git remote | grep -q network-plugins; then 
     echo "adding network-plugins remote" 
     git remote add network-plugins https://github.com/Org-X/network-plugins.git 
 fi 
 
 git pull -s subtree -Xsubtree=linux-plugins linux-plugins master 
 git pull -s subtree -Xsubtree=windows-plugins windows-plugins master 
 git pull -s subtree -Xsubtree=network-plugins network-plugins master 
 
 git push
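
Run the script from the top level of the parent repository. The value passed to -Xsubtree must match the directory each subtree was added under, and the final git push publishes the merged updates so the monitored hosts pick them up on their next pull.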

Puppet

Puppet is a system provisioning tool that uses a DSL (Domain Specific Language) to configure hosts into a particular state. There’s a rather large learning curve with Puppet, but once you get over it you will be glad you did.

Using Puppet, or similar, in combination with Git is the ultimate solution.

Puppet + Git:

Using Git repositories in combination with Puppet, we can push the plug-ins out to machines on a Puppet run rather than with a cron job. We can also declare which machines get the NRPE client and the plug-ins on a Puppet run, or which machine becomes a new Nagios server.

Place your Nagios repositories as subtrees under a puppet repository and configure your Puppet manifests to use them:

 $ tree -d 
 . 
 | -- puppet 
 | | -- modules 
 | | | -- nagios 
 | | | | -- files # Files that get copied to systems 
 | | | | | -- nagios-config # subtree 
 | | | | | | -- objects 
 | | | | | | | -- hosts 
 | | | | | | | -- templates 
 | | | | | -- nagios-plugins # subtree 
 | | | | -- manifests # Puppet configuration files 
 | | | | -- templates # Files that get modified e.g. nrpe.cfg using ERB

A file resource in Puppet can be used to automatically copy the files under nagios-plugins to each machine and set their permissions and ownership. When the subtree is updated, Puppet sees that the files have changed and copies over the new changes on each run.

 $nagios_plugins = "/usr/local/nagios/libexec" 
 
 file { $nagios_plugins: 
     ensure => "directory", 
     owner => "root", 
     group => "nagios", 
     mode => 0550, 
     recurse => true, 
     source => "puppet:///modules/nagios/nagios-plugins", 
 }

Installation Automation:

Something like the following can be used to install NRPE on all hosts on a puppet run. In sum:

  1. It installs the openssl-devel package dependency
  2. It copies the install_nrpe.sh shell script to the machine
  3. It executes the shell script, which downloads and compiles NRPE
  4. It ensures the nrpe service is running

class nagios::nrpe_install { 
     $version = '2.15' 
     $install_script = "/usr/local/nagios/install_nrpe.sh" 
 
     package { 'openssl-devel': 
         ensure => installed, 
     } 
 
     file { $install_script : 
         source => "puppet:///modules/nagios/install_nrpe.sh", 
         mode => 755, 
         require => Package["openssl-devel"], 
     } 
 
     exec { "$install_script": 
         logoutput => true, 
         timeout => 600, 
         unless => "/usr/local/nagios/bin/nrpe -h | /bin/grep -q 'Version: 2.15'", 
         require => [ File[$install_script], Class['nagios::nrpe_configure'] ], 
     } 
 
     service { 'nrpe': 
         name => nrpe, 
         ensure => true, 
         hasrestart => true, 
         require => Exec[$install_script], 
     } 
 }
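
Nodes then receive NRPE by declaring the class, e.g. with include nagios::nrpe_install in their node definition or role.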

The server installation can be automated just as easily, but I’m omitting it for brevity. With Puppet I’m able to deploy a production Nagios server with a working configuration, including the web interface, in a little over 2.5 minutes:

$ time vagrant --provision-with puppet 
[default] Running provisioner: puppet...
Running Puppet with server.pp... 
stdin: is not a tty 
notice: /Stage[main]/Nagios::Server_preparation/Group[nagcmd]/ensure: created 
notice: /Stage[main]/Nagios::Server_preparation/User[nagios]/ensure: created 
notice: /Stage[main]/Nagios::Server_preparation/User[www-data]/groups: groups changed '' to 'nagcmd' 
notice: /Stage[main]/Nagios::Server_plugins_install/Exec[installnagiosplugins]/returns: executed successfully 
notice: /Stage[main]/Nagios::Server_plugins_install/Exec[plugin package clean up]/returns: executed successfully 
... 
notice: Finished catalog run in 158.73 seconds 
real 2m46.308s 
user 0m2.244s 
sys 0m0.780s

NRPE Configuration and Embedded Ruby:

Puppet can use ERB (Embedded Ruby) templates to modify files before they are pushed to machines. Facter, a component of Puppet, gathers facts about the machine, e.g. number of CPUs, memory size, etc.; the facts are stored as variables that can be used to perform actions such as writing to a file.

The NRPE configuration file, nrpe.cfg, is a good candidate for this section because parts of it are machine specific, like the server_address variable, which should hold the IP address of the interface NRPE binds to. Here’s a snippet of the nrpe.cfg template’s server address section, where ERB is used to replace server_address with the IP address of eth0 and to fill in the Nagios server address(es) from a variable specified in a Puppet manifest file.

 # SERVER ADDRESS 
 # Address that nrpe should bind to in case there are more than one interface 
 # and you do not want nrpe to bind on all interfaces. 
 # NOTE: This option is ignored if NRPE is running under either inetd or xinetd 
 server_address=<%= @ipaddress_eth0 %> 
 allowed_hosts=<%= @nagios_server_ip %>

If your machines only have one interface, or eth0 is the right one to use, then voila! NRPE is configured automatically.
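
Under the module layout shown earlier, the rendered file would typically be delivered by a Puppet file resource whose content comes from the template() function, e.g. content => template('nagios/nrpe.cfg.erb'), with $nagios_server_ip set in the manifest that declares it (the exact template path depends on your module layout).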

Another example lets us use a single check_load command that works on all machines, even those with different numbers of CPUs, because a fact and an ERB expression automatically set reasonable values for the warning and critical thresholds.

 # System 
 command[check_load]=/usr/lib/plugins/check_load -w <%= (@processorcount.to_f/1.5).round(2) %> -c <%= @processorcount %>
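
For example, on an 8-core host the line above renders as -w 5.33 -c 8.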

And one final example: pieces of the configuration can either end up or not end up in a host’s nrpe.cfg based on an ERB conditional. The following example enables the dont_blame_nrpe option if the machine’s hostname is syslog or syslog1. Note that this option only takes effect if NRPE was compiled with argument processing enabled:

./configure --enable-command-args

dont_blame_nrpe is an option that, when enabled, allows the Nagios server to pass arguments to the commands defined in the NRPE configuration file. This is a security risk that the NRPE creators have warned about, and it should only be used when necessary. To avoid having it enabled on every host, we can use Puppet, Facter (the hostname fact), and ERB to significantly minimize the exposure by enabling dont_blame_nrpe only on hosts that absolutely require it.

 # COMMAND ARGUMENT PROCESSING 
 # This option determines whether or not the NRPE daemon will allow clients 
 # to specify arguments to commands that are executed. This option only works 
 # if the daemon was configured with the --enable-command-args configure script 
 # option. 
 # 
 # *** ENABLING THIS OPTION IS A SECURITY RISK! *** 
 # Read the SECURITY file for information on some of the security implications 
 # of enabling this variable. 
 # 
 # Values: 0=do not allow arguments, 1=allow command arguments 
 <% if (@hostname == 'syslog') || (@hostname == 'syslog1') then %> 
 dont_blame_nrpe=1 
 <% else %> 
 dont_blame_nrpe=0 
 <% end %>

Hopefully this gives you plenty of ideas on how to better manage your Nagios deployment. I also recommend checking out Puppet’s Nagios resource types.
