Last updated at Tue, 05 Dec 2017 20:41:52 GMT
Synopsis:
Having written a number of Nagios plugins, I know that they can take some time to write. Sometimes you only need to solve a simple problem, such as checking file permissions, file existence, mount points, or network settings, where writing a full plugin is inconvenient.
Luckily, the features of a modern shell can be used to create some really concise and convenient Nagios checks and configurations. Some of these checks can be written as shell one-liners in Bash using features such as subshells, control operators, command substitution, and group commands.
In this article I will demonstrate a few examples of using the Bash shell to accomplish simple tasks rather than writing full programs.
Background
Nagios determines the status of a check from the exit code of the script that the check ran. The whole of the plugin API consists of four exit codes that Nagios uses to indicate a particular state of a check. For example, ending the execution of a script with an exit code of 2 will register a critical status in Nagios. To do this in Bash you would use the exit builtin and specify the code as an argument.
exit 2
The available codes Nagios cares about are listed below with their corresponding status.
OK=0
WARNING=1
CRITICAL=2
UNKNOWN=3
Note that Bash also has return codes, which are not to be confused with exit codes. A return code is set with the return builtin inside a function and hands a status back to the caller, which can use it to determine whether the called function succeeded or failed.
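To make the four codes and the return/exit distinction concrete, here is a minimal sketch. The names status_name and check_demo are hypothetical and exist only for illustration; they are not part of Nagios or Bash.

```shell
# Translate a Nagios exit code into its status name.
# (Hypothetical helper; in this sketch anything other than 0-2 is UNKNOWN.)
status_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# A function hands its status back with "return";
# the shell that called it keeps running.
check_demo() {
    return 2
}

code=0
check_demo || code=$?
echo "check_demo returned $code ($(status_name "$code"))"
```

Had check_demo used exit instead of return, the shell running it would have ended before the echo.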
We will use Bash’s control operators, logical AND (&&) and logical OR (||), as shorthand to create a workable command execution order. Successful programs, by convention, exit with a code of 0. Knowing this, we can set up the execution order so that if a program fails an alternate one will execute, or so that another program runs only after the first one succeeds. Both are illustrated with the control operators below:
command1 || command2 # Run command2 if command1 fails
command1 && command2 # Run command2 if command1 succeeds
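A quick way to try this in a terminal is with the true and false commands, which do nothing except exit 0 and 1 respectively. The variable name chain_status is just for illustration.

```shell
false || echo "command2 ran because command1 failed"
true && echo "command2 ran because command1 succeeded"

# The status of the whole list is the status of the last command that ran.
# Here echo is skipped, so the list ends with the status of false.
false && echo "never printed"
chain_status=$?
echo "status of the skipped chain: $chain_status"
```

That last point matters for Nagios, because the status of the whole chain is what the check reports.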
We will also use the exit builtin in subshells to create additional chains of commands. Instructions placed between two parentheses will be executed in a subshell.
( echo "this echo command is executed in a subshell" && exit 0)
If we were to just call exit our program would end but by using subshells we only exit the subshell (a separate process) and can use control operators to handle execution order based on the exit code.
In the example below the final echo command will not execute if we exit the currently running shell. To avoid this we use a subshell, which still returns an exit code to the parent shell that can be used for execution order. It’s best illustrated by trying the following commands in a new terminal window.
false || (echo "I'm being executed in a subshell" && exit 0) && echo "Made it" # Using a subshell
false || echo "I'm being executed in a subshell" && exit 0 && echo "Made it" # Not using a subshell
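The same idea can be tried without risking your terminal by wrapping it in a small function. This is only a sketch; run_chain is a hypothetical name, and its argument stands in for whatever exit code a real check would produce.

```shell
# $1 is the exit code that the stand-in "check" subshell ends with.
run_chain() {
    ( exit "$1" ) && echo "check succeeded" || echo "check failed"
}

run_chain 0
run_chain 2
echo "the calling shell is still running"
```

Only the subshell exits each time, so the parent shell goes on to run the final echo.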
In the following sections I will place the Nagios context of the exit code value in parentheses.
One Liner Command Checks:
Example 1
My first example is to check the number of running processes by name. The command definition below takes two arguments that are provided in the service definition. Awk exits with code 2 (CRITICAL) if the number of lines is greater than the given argument; otherwise awk exits 0 (OK). The number of matching processes that awk has counted is printed as a helpful message irrespective of the condition.
define command {
command_name check_process_number
command_line /usr/bin/pgrep -lf "$ARG1$" | /bin/awk 'NR > $ARG2$ { exit 2 } END { print NR,"instances running" }'
}
We pass the arguments to the command via the service definition. Arguments are separated by the exclamation character. We exit 2 (CRITICAL) if more than 3 ($ARG2$) Nagios daemon processes ($ARG1$) are running.
define service {
host_name localhost
service_description Running Multiple Nagios Instances
check_command check_process_number!/usr/local/nagios/bin/nagios -d!3
use service-template
}
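The awk half of this check can be tried on its own by feeding it lines by hand. In the sketch below, count_check is a hypothetical wrapper, and the threshold is passed with awk's -v option instead of Nagios substituting $ARG2$ into the program text.

```shell
# Reads lines on stdin; $1 is the threshold (the role played by $ARG2$).
count_check() {
    awk -v max="$1" 'NR > max + 0 { exit 2 } END { print NR, "instances running" }'
}

# Two matching lines against a threshold of 3: OK.
rc=0
printf 'nagios\nnagios\n' | count_check 3 || rc=$?
echo "exit code: $rc"

# Four matching lines against a threshold of 3: CRITICAL.
rc=0
printf 'nagios\nnagios\nnagios\nnagios\n' | count_check 3 || rc=$?
echo "exit code: $rc"
```

Because awk stops reading at the first line past the threshold, the count it prints in the CRITICAL case is the threshold plus one rather than the full total.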
Example 2
Our second example is to check for 1Gb NICs negotiated at 100Mb, a problem I’ve been experiencing on a few servers that is typically caused by an Ethernet cable going bad.
This service definition uses NRPE so the command will be specified in the NRPE configuration.
define service {
service_description Check Ethernet Speed
check_command check_nrpe!check_for_100mb_link
hostgroup_name linux-servers
use service-template
}
NRPE will run the command listed below which examines the link speed for any available interfaces named eth0 through eth9. If grep matches 100 it will exit 0 and run the next two commands after the logical AND (&&) operator and then finally exit 2 (CRITICAL). If grep is not able to match the expression it will exit 1 which will cause the command after the OR (||) operator to execute. Once the echo command prints the informational message it will then exit 0 (OK) indicating to Nagios that everything is alright.
command[check_for_100mb_link]=/bin/bash -c '/bin/cat /sys/class/net/eth[0-9]/speed 2> /dev/null | \
/bin/grep "^100$" > /dev/null 2>&1' && echo "CRITICAL: 100Mb link found" && exit 2 \
|| echo "OK: Not 100Mb"
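To test the logic away from a real server, the same check can be written in a longer form that takes the directory as a parameter. This is a sketch under that assumption, not the NRPE command itself; check_link_speed and the temporary directory layout are hypothetical.

```shell
# $1 is the directory holding the interface entries
# (defaults to /sys/class/net, as in the NRPE command).
check_link_speed() {
    local net_dir=${1:-/sys/class/net}
    if cat "$net_dir"/eth[0-9]/speed 2> /dev/null | grep -q '^100$'; then
        echo "CRITICAL: 100Mb link found"
        return 2
    fi
    echo "OK: Not 100Mb"
    return 0
}

# Build a fake interface tree: eth0 at 1000Mb, eth1 at 100Mb.
demo=$(mktemp -d)
mkdir -p "$demo/eth0" "$demo/eth1"
echo 1000 > "$demo/eth0/speed"
echo 100 > "$demo/eth1/speed"

rc=0
check_link_speed "$demo" || rc=$?
echo "exit code: $rc"
rm -rf "$demo"
```

The anchored pattern ^100$ is what keeps a 1000Mb link from matching.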
Example 3
A third example is to check directory or file permissions. Here, I am testing a web directory for execute ( -x ) and read ( -r ) permissions. If the first and second tests both exit 0, a message is printed from a subshell, which also exits 0, indicating to Nagios that the check was successful. If any of the commands exits non-zero, everything after the logical OR (||) is executed, and the last exit code returned is 2 (CRITICAL) for Nagios.
define command {
command_name check_directory_permissions
command_line test -x $ARG1$ && test -r $ARG1$ && (echo "OK: $ARG1$") \
|| (echo "CRITICAL: $ARG1$" && exit 2)
}
define service {
service_description Check Web Dir Permissions
check_command check_directory_permissions!/var/www/html
use service-template
}
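The chain can be tried from a terminal by replacing $ARG1$ with a function argument. In this sketch check_dir_perms is a hypothetical name, and a path that does not exist stands in for a directory with the wrong permissions.

```shell
# Same chain as the command definition, with "$1" in place of $ARG1$.
check_dir_perms() {
    test -x "$1" && test -r "$1" && (echo "OK: $1") \
        || (echo "CRITICAL: $1" && exit 2)
}

demo=$(mktemp -d)

# A directory we just created is readable and searchable by us.
check_dir_perms "$demo"

# A missing path fails the first test, so the CRITICAL branch runs.
rc=0
check_dir_perms "$demo/missing" || rc=$?
echo "exit code: $rc"

rmdir "$demo"
```

Quoting "$1" is a small addition of mine so that paths containing spaces are handled.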
Macro Conditionals:
Nagios has a feature called event handlers that allows you to execute programs after a certain check status. Our example uses an event handler to restart the rsyslog daemon if Nagios detects that the daemon is no longer running. Event handlers are called on a number of different conditions, so the user normally has to test for the right conditions in the event handler script itself before running the appropriate commands. Instead of doing that, we can use Bash’s test builtin to run conditionals on the Nagios macros post-substitution and have all the instructions we need directly in the command definition.
The check_nrpe command at the end will run only if the $SERVICESTATE$ macro is CRITICAL and the $SERVICESTATETYPE$ macro is HARD.
define command{
command_name rsyslog_service_restart
command_line test $SERVICESTATE$ = CRITICAL && \
test $SERVICESTATETYPE$ = HARD && \
$USER1$/check_nrpe -H $HOSTADDRESS$ -t 10 -c 'rsyslog_service_restart' \
|| (exit 0)
}
This means the value of ‘rsyslog_service_restart’ in the NRPE config can be as simple as:
command[rsyslog_service_restart]=service rsyslog restart
This way we can avoid writing another script to handle all the conditions, as is done in the example in the Nagios event handler documentation, which seems unnecessary.
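The conditional chain from the command definition can be exercised without Nagios by passing the two states as arguments. This is a sketch: handler is a hypothetical function, its arguments play the role of $SERVICESTATE$ and $SERVICESTATETYPE$ after substitution, and an echo stands in for the check_nrpe call.

```shell
# $1 stands in for $SERVICESTATE$, $2 for $SERVICESTATETYPE$.
handler() {
    test "$1" = CRITICAL && \
    test "$2" = HARD && \
    echo "restart would run here" \
    || (exit 0)
}

handler CRITICAL HARD
handler CRITICAL SOFT
handler OK HARD
echo "last handler status: $?"
```

Only the first call prints the restart message. The trailing (exit 0) makes every call finish with status 0, so the handler does not report a failure when the conditions are simply not met.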
Command Substitution:
Standard POSIX command substitution ($(..)) or the shell variable character ($) will not work in a command specified in a Nagios config file because Nagios macros also use the dollar sign and will not be able to properly tell when one variable ends and another begins.
$SERVICESTATE$ # Nagios macro
$ echo $(seq 1 10)
1 2 3 4 5 6 7 8 9 10 # Example of command substitution
A way around this is to use Bash’s older command substitution character, the backtick ( ` ). Anything between the backticks is executed and its output is substituted in place of the expression.
$ echo `seq 1 10`
1 2 3 4 5 6 7 8 9 10 # Example of command substitution using the older backtick syntax
I’ve found this useful for displaying system-specific information in Nagios’s notification e-mails. Imagine having multiple Nagios servers in one organization that all have their configurations stored in a Git repository, each with its own branch. Using command substitution to insert the system’s hostname, which indicates the host the notification came from, allows me to keep the exact same command definition for all 3 hosts in Git, which is very nice. It keeps things simple compared with hard-coding the host name in each branch.
For example:
$ hostname -s
nagios-dev
The following command definition sends host notifications to an IRC channel, where `hostname -s` is expanded to the system’s short hostname so that we know which system the notification came from.
# IRC - host notifications
define command{
command_name notify-host-by-irc
command_line /usr/local/bin/ircsay "#nagios" "[`hostname -s`] \
$NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ ($HOSTADDRESS$) is $HOSTSTATE$: $HOSTOUTPUT$"
}
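To preview what the message will look like before wiring it into Nagios, the same layout can be built in a small function. This is only a sketch; format_host_alert is a hypothetical helper, and its arguments stand in for the values Nagios would substitute for the macros.

```shell
# $1 notification type, $2 host name, $3 host address, $4 host state.
# The backticks run "hostname -s" at the moment the message is built.
format_host_alert() {
    echo "[`hostname -s`] $1 Host Alert: $2 ($3) is $4"
}

format_host_alert PROBLEM web1 192.0.2.10 DOWN
```

The host name and address here are placeholders; only the bracketed prefix comes from the machine running the command.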
The result lets me know that this alert came from our dev machine:
04-10 15:13 [nagios-dev] RECOVERY Service Alert: Check Snort Memory for sniffer1 (1.1.1.1) is OK: 04-10 15:13 * RSS OK: 0 processes with command name snort
The same can be done for notification e-mails too:
# 'notify-host-by-email' command definition
define command{
command_name notify-host-by-email
command_line /usr/bin/printf "%b" " ...Host: $HOSTNAME$\nState: $HOSTSTATE$\..." \
| /bin/mail -s "[admins] [`hostname -s`] Host Alert: $HOSTNAME$ is $HOSTSTATE$" \
$CONTACTEMAIL$
}