Nagios Network Monitoring

Nagios - http://www.nagios.org

From their website: "Nagios is a host and service monitor designed to inform you of network problems before your clients, end-users or managers do. It has been designed to run under the Linux operating system, but works fine under most *NIX variants as well. The monitoring daemon runs intermittent checks on hosts and services you specify using external "plugins" which return status information to Nagios. When problems are encountered, the daemon can send notifications out to administrative contacts in a variety of different ways (email, instant message, SMS, etc.). Current status information, historical logs, and reports can all be accessed via a web browser."


Below are my experiences with Nagios, Plugins, and common stumbling blocks I have come across.

General Information and Tidbits

Below are some of the more useful tidbits of Nagios and what things mean, how they work, etc.

Nagios Plugin Return Codes

Sometimes you want to write a simple script to check a few things and have it work with Nagios.  One of the first things you need to understand is the return codes.  Below are the basic return codes your script should return.

Read the following if you really want to start writing your own plugins.  Let me know too and I will even test them.  Ref: http://nagiosplug.sourceforge.net/developer-guidelines.html

Numeric Value   Service Status   Status Description
0               OK               Plugin executed and results are within parameters
1               WARNING          Plugin executed and results are in warning range, or are not functioning properly
2               CRITICAL         Plugin detected the service is not running, or the service is in the critical threshold
3               UNKNOWN          Typically invalid check parameters, missing values, missing plugin components, etc.
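
To make the return codes concrete, here is a minimal sketch of a plugin in Python.  The metric, the thresholds, and the names (evaluate, main, check_example.py) are made up for illustration; the only part taken from the table above is the mapping of 0/1/2/3 to OK/WARNING/CRITICAL/UNKNOWN.

```python
"""Minimal Nagios plugin sketch: map a measured value to an exit code."""

# Exit codes from the table above.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Return the exit code for a 'higher is worse' metric."""
    if value is None:
        return UNKNOWN          # could not obtain a measurement
    if value > crit:
        return CRITICAL
    if value > warn:
        return WARNING
    return OK


def main(argv):
    # Hypothetical usage: check_example.py <value> <warn> <crit>
    try:
        value, warn, crit = (float(a) for a in argv[1:4])
    except ValueError:
        print("UNKNOWN - usage: check_example.py <value> <warn> <crit>")
        return UNKNOWN
    code = evaluate(value, warn, crit)
    # Nagios shows the first line of output as the status text.
    print(f"{LABELS[code]} - value is {value:g}")
    return code

# In a real plugin, finish with: sys.exit(main(sys.argv))
```

The important part is that the process exit status carries the result; the printed line is only what shows up in the web interface and notifications.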

 

Nagios Thresholds and Ranges

When creating alerts, I did not realize at first that thresholds could be defined more narrowly and flexibly using ranges.  There are a few cases where you want to trigger an alert when a value is too low, or less than a certain level.

Ref: http://nagiosplug.sourceforge.net/developer-guidelines.html

Example ranges

Range definition   Generate an alert if x...
10                 < 0 or > 10, (outside the range of {0 .. 10})
10:                < 10, (outside {10 .. ∞})
~:10               > 10, (outside the range of {-∞ .. 10})
10:20              < 10 or > 20, (outside the range of {10 .. 20})
@10:20             ≥ 10 and ≤ 20, (inside the range of {10 .. 20})
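
If you are writing your own plugin, the range rules above can be sketched in a few lines of Python.  This is my own illustration of the rules in the table, not code from the official plugins; parse_range and should_alert are names I made up.

```python
import math


def parse_range(spec):
    """Parse a Nagios threshold range into (start, end, inside)."""
    inside = spec.startswith("@")      # '@' inverts: alert when inside the range
    if inside:
        spec = spec[1:]
    if ":" in spec:
        start_s, end_s = spec.split(":", 1)
    else:
        start_s, end_s = "0", spec     # bare "10" means 0..10
    start = -math.inf if start_s == "~" else float(start_s or 0)
    end = math.inf if end_s == "" else float(end_s)
    if start > end:
        raise ValueError(f"start > end in range {spec!r}")
    return start, end, inside


def should_alert(value, spec):
    """True if value triggers an alert for the given range definition."""
    start, end, inside = parse_range(spec)
    within = start <= value <= end
    return within if inside else not within
```

For example, should_alert(9, "10:") is true because 9 is below the lower bound, while should_alert(15, "@10:20") is true because the "@" form alerts inside the range.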

Command line examples

Command line                 Meaning
check_stuff -w10 -c20        Critical if "stuff" is over 20, else warn if over 10 (will be critical if "stuff" is less than 0)
check_stuff -w~:10 -c~:20    Same as above. Negative "stuff" is OK
check_stuff -w10: -c20       Critical if "stuff" is over 20, else warn if "stuff" is below 10 (will be critical if "stuff" is less than 0)
check_stuff -c1:             Critical if "stuff" is less than 1
check_stuff -w~:0 -c10       Critical if "stuff" is above 10; Warn if "stuff" is above zero
check_stuff -c5:6            The only noncritical range is 5:6
check_stuff -c@10:20         Critical if "stuff" is 10 to 20

NOTE: sometimes you need to escape the special characters.  See the example below...

./check_mssql -H hostname -p 1433 -U username -P password -D database -w 2.0 -c 3.5 \
 -q "exec database.dbo.StoredProcedure variable1, variable2" -W32\: -C10\: -s

The above will WARN when the value my stored procedure returns is LESS than 32 and CRITICAL when LESS than 10.

Nagios Object Inheritance

Nagios object inheritance can be a confusing and tricky topic.  First you should read the official documentation on Nagios object inheritance.  Below is a very simple and straightforward example to illustrate a common application of inheritance you may want to use in your environment.

Your host/service definition entry uses a template...

use          app-server

The app-server template has two contact groups configured...

contact_groups          it,dev

Let's see what we can do in the host/service definition.  Here are some examples and how they should work.

contact_groups          +support

Using the "+" sign, the host/service definition uses the data in the template -and- adds the support group to the alerts.

contact_groups          !dev

Using the "!" sign, the host/service definition uses the data in the template -and- will remove the dev group from the alerts.  If dev was not specified in the template, this would have no effect.

contact_groups          support

Using no modifier will override the template values and only the support group will receive these alerts.

contact_groups          +support,!dev

You can use combinations of modifiers to get the desired results as well.  Remember too that if you use multiple templates, they are applied in order.  The official Nagios documentation has a very useful flow chart on this to help you understand.

Some may ask why not simply list every contact group explicitly in the host/service definition, like this:

contact_groups          it,dev,finance

The downside of the above is that when you want to add another contact group to ALL systems using the template, you then have to remember which systems have explicitly defined values.

Installation Tips & Tricks

@20090224
I walked through a fresh Nagios install on a new CentOS5 Virtual Private Server using the Fedora Quick Install Guide.  I did notice that with my stripped down VPS starting OS, I had to install "make" and "openssl-devel" which were not mentioned in the guide.  Otherwise, with the latest release of nagios and nagios-plugins from http://nagios.org it all went easy as pie.

Addon: nagiosgraph - Let's make pictures!

What does it do?

Scripts to parse perfdata and plugin output, store values in rrd databases and render trending graphs

http://nagiosgraph.sourceforge.net/

IMPORTANT - I found an issue with certain versions of RRDtool.  For "ease", use version 1.2.27 when installing nagiosgraph.

Addon: rrdtool - Let's store the data!

RRDtool is required to get the nagiosgraph Addon working correctly. 

"RRDtool is the OpenSource industry standard, high performance data logging and graphing system for time series data.  Use it to write your custom monitoring shell scripts or create whole applications using its Perl, Python, Ruby, TCL or PHP bindings." - http://oss.oetiker.ch/rrdtool/

I followed the instructions outlined in the package.  I did run into one issue since I had to install libart-2.0 from source as well.  Read about the missing libart error/fix here.  Also to avoid an "undefined symbol: art_alloc" issue, I had to install a specific version of libart and you can find out about that undefined art_alloc issue here.

Tip - NRPE on Ubuntu

I usually stick with Fedora/RedHat and CentOS flavors of Linux.  Recently though, I needed to monitor a new Ubuntu test server via Nagios NRPE.  The sources I was using to build and install just did not play nice on first pass.  So I figured for the one off, I would try the easy way.

apt-get install nagios-nrpe-server

Of course it did not install things into the paths I am used to.  So my C&P config file did not work and I had to find where things got dropped.

NRPE Config file: /etc/nagios/nrpe.cfg

Nagios Plugins:  /usr/lib/nagios/plugins/

Init Script:  /etc/init.d/nagios-nrpe-server

Remind me why people like Ubuntu over other flavors for a server OS again?

Oh, and you'll probably also need:

apt-get install libssl-dev

Tip - Convert Nagios.log Timestamp

If you are like me, you have to dig through some log files to research some errors.  The ../var/nagios.log file has a lot of data and unfortunately the timestamps are not exactly friendly to read.

[1256314960] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname;servicename;0;Service running OK

Using a little perl command line magic we can convert that ugly timestamp into something more readable.

[Fri Oct 23 11:22:40 2009] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname;servicename;0;Service running OK

Just use...

perl -pe 's/(\d+)/localtime($1)/e' nagios.log

Of course that will spew the entire log.  So using grep or tail, you can make it a bit more useful.

perl -pe 's/(\d+)/localtime($1)/e' nagios.log |grep server1

tail nagios.log | perl -pe 's/(\d+)/localtime($1)/e' 
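
If you would rather do the same conversion inside a Python script, here is a rough equivalent.  It is only a sketch: humanize is my own name, and the output format is chosen to resemble what Perl's localtime() prints.

```python
import re
from datetime import datetime, timezone

EPOCH_PREFIX = re.compile(r"^\[(\d+)\]")


def humanize(line, tz=None):
    """Replace a leading [epoch] with a readable timestamp.

    tz=None uses the local time zone, like Perl's localtime().
    """
    def repl(match):
        stamp = datetime.fromtimestamp(int(match.group(1)), tz=tz)
        return stamp.strftime("[%a %b %d %H:%M:%S %Y]")
    return EPOCH_PREFIX.sub(repl, line, count=1)
```

Unlike the Perl one-liner, this only touches a bracketed number at the very start of the line, so digits elsewhere in a log entry are left alone.  Pass tz=timezone.utc if you want UTC instead of local time.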

Nagios SNMP Tools

Just a quick post about trying to get Nagios to use SNMP.  You need to make sure you have the components needed.  My stripped down little VPS I use for testing did not.  I needed to run:

yum install net-snmp-utils

That will install the utils and base package needed for Nagios and various nagios plugins.  I will update more as my testing continues.

Addons, Plugins, Tweaks & Customizations

Below are lists of Nagios Addons/Plugins as well as some common Nagios Tweaks / Customizations I have tried with my various Nagios installations.  If you have some ideas, suggestions, etc. please register and post comments.  Thanks.

Checking Drupal Status with Nagios and WebInject

Summary

A few weeks back I found a post on the Drupal forums about monitoring the status report page with Nagios and WebInject.  Having lots of practice with Nagios and WebInject, I knew this was possible but no one had provided an example.  So I finally got around to creating the WebInject script today and posted it.  Below is the complete process including the Nagios info I used.

WebInject XML Script

<testcases repeat="1">
<testvar varname="BASE_URL">http://www.domain.com/</testvar>
<testvar varname="LOGIN1">username</testvar>
<testvar varname="PASSWD1">password</testvar>
<case
id="1"
description1="Connecting to Login Page"
method="get"
url="${BASE_URL}?q=user"
verifypositive="Enter the password that accompanies your username"
errormessage="Unable to load login page"
/>
<case
id="2"
description1="Authentication"
method="post"
url="${BASE_URL}?q=user"
postbody="name=${LOGIN1}&pass=${PASSWD1}&form_id=user_login&op=Log+in"
verifypositive="${BASE_URL}\?q=users/${LOGIN1}"
errormessage="Login Post Problem"
/>
<case
id="3"
description1="Status Report Page"
method="get"
url="${BASE_URL}?q=admin/reports/status"
verifynegative="Out of date"
verifypositive="Drupal core update status"
errormessage="Status Page Alert!"
/>
</testcases>

Nagios Command File Entry

define command {
  command_name webinject
  command_line /usr/local/nagios/webinject/webinject.pl -c nagios/$ARG1$ nagios/$ARG2$
}

Nagios Check Entry

define service {
  use template1
  host_name server
  service_description  status-report
  check_command webinject!nagios.xml!drupal_status.xml
}

Thoughts

The webinject script looks for something that is "Out of date" on the Status Report and will alert appropriately based on your Nagios configuration. The first step is not necessarily required, but it helps in troubleshooting if the login page for the site is not loading correctly and preventing the check from executing correctly.

AddOn - NRPE / NSClient

NRPE and NSClient allow you to remotely execute either pre-configured tasks or custom scripts to trigger alerts or as the result of an event (e.g. an EventHandler).

NRPE Plugin
http://sourceforge.net/project/showfiles.php?group_id=26589

NRPE allows you to remotely execute Nagios plugins on other Linux/Unix machines. This allows you to monitor remote machine metrics (disk usage, CPU load, etc.). NRPE can also communicate with some of the Windows agent addons, so you can execute scripts and check metrics on remote Windows machines as well. A windows utility called NSClient is also available to accomplish the same thing on Windows hosts.

NSClient Plugin
http://trac.nakednuns.org/nscp/downloads
NSClient++, aka NSCP, aims to be a simple yet powerful and secure monitoring daemon for Windows operating systems. It is built for Nagios, but nothing in the daemon is actually Nagios specific and could probably, with little or no change, be integrated into any monitoring software that supports running user tools for polling.

ERROR: CHECK_NRPE: Socket timeout after 10 seconds.

Several conditions can trigger this error with your Nagios checks.  Many of them are obvious, but this one had me stumped for a while.

Problem

All my nagios checks with NRPE to a given host were failing with the "CHECK_NRPE: Socket timeout after 10 seconds." message.  I logged into the host and made sure NRPE was running, even restarted it.  Double checked the firewall rules to make sure the port was open.  I went to my nagios server, did an NSLOOKUP, PING and TELNET to the port to ensure I was resolving the correct IP address and could connect.  The machine in question was a Virtual Private Server (VPS) so it does sometimes become sluggish and non-responsive, but poking around it all seemed fine.  I tested from the command line of my Nagios server and got the same results.

Solution

What got me looking in the right direction was when I pinged my Nagios server from my host.  It worked fine, but I noticed it took a few seconds to resolve the hostname.  So then I checked the DNS servers of my Linux VPS.  The first server listed was not pingable.  I quickly flip-flopped the servers in my resolv.conf and VOILA!  The command-line check from my Nagios server worked again.
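
A quick way to spot this kind of slow resolver is to time a name lookup on the monitored host.  Here is a minimal Python sketch; time_lookup is my own name, and it simply measures whatever resolver order the system is currently using.

```python
import socket
import time


def time_lookup(hostname):
    """Return (seconds, address) for one name lookup via the system resolver."""
    start = time.monotonic()
    address = socket.gethostbyname(hostname)
    return time.monotonic() - start, address
```

If looking up your Nagios server's name takes several seconds here, a dead first entry in resolv.conf is a likely suspect.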

ERROR: Could not fetch information from server

While setting up several new servers and installing NSCLIENT, I ran into the following error message:

could not fetch information from server

The most logical first step is to re-verify the Nagios server config file.  Check to make sure DNS resolution is correct.  Second, take a look at the NSC.log on the client system.  In my case, I saw:

2009-03-30 10:52:23: error:.\NSClientListener.cpp:307: Unauthorized access from: 172.20.16.182

Well, that could definitely be a problem.  My fault this time was in editing the NSC.ini after installation.  The allowed_hosts line of:

allowed_hosts=172.20.16/23

needed to be like:

allowed_hosts=172.20.16.0/23
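
You can sanity check a network entry like this before putting it in the config.  The sketch below uses Python's standard ipaddress module, which is not what NSClient uses internally, so treat it only as a check that the CIDR notation is well formed and covers the Nagios server's address.  The function name is my own.

```python
import ipaddress


def is_allowed(client_ip, allowed_network):
    """True if client_ip falls inside allowed_network (CIDR notation)."""
    network = ipaddress.ip_network(allowed_network)  # raises ValueError if malformed
    return ipaddress.ip_address(client_ip) in network
```

With the values from my log, is_allowed("172.20.16.182", "172.20.16.0/23") is true, while the shortened "172.20.16/23" form is rejected with a ValueError.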

AddOn - Nagios Event Log aka NagEventLog

NagEventLog is a Windows agent that examines the EventLog, filters it, and forwards passive alerts to Nagios via NSCA. Now with encryption support! Supports Windows 2000 and later.

More information can be found here:

NagEventLog allows you to have Windows event log entries filtered and passed back to your Nagios server.  Two methods I have used are:

  • Report ALL errors in ALL logs and filter select EventIDs we don't need to worry about.
  • Report a -specific- error that we use to trigger an event script.  Eg a "cleanup and restart" process upon a service failure.

Updating NagEventLog Filters via GPO

When you have a lot of Windows servers and would like to add an EventID to the filter, it is a real pain to update on a server by server basis.  So using a GPO, you can control the filters directly from a policy without having to manually update each individual server.

Assumptions

  • You install NagEventLog in a consistent fashion on all servers
  • You want to filter the same items across ALL your servers
  • All your servers are members of the local domain

Instructions

  1. Create a custom administrative policy template.  Below is the "nageventlog.adm" file I used to filter out select Event IDs.
    ; nageventlog.adm
    ;;;;;;;;;;;;;;;;;;;;;
    CLASS MACHINE  ;;;;;;
    ;;;;;;;;;;;;;;;;;;;;;
     
    CATEGORY !!nagiosfilter
    KEYNAME "SOFTWARE\Wow6432Node\Cheshire Cat\Nagios\Filter0"
        POLICY !!changenagiosfilter
            PART !!NotEventID CHECKBOX
                VALUENAME "notID"
                VALUEON NUMERIC 1
                VALUEOFF NUMERIC 0
            END PART
            PART !!ChangeFilter0IDs EDITTEXT REQUIRED
                VALUENAME "ID"
                DEFAULT !!filterdefault    
            END PART
            PART !!changefilter0IDstext TEXT END PART
        END POLICY
    END CATEGORY

    [STRINGS]
    nagiosfilter="Nagios Filtering"
    changenagiosfilter="Change Nagios Filter0"
    ChangeFilter0IDs="Event IDs that are ignored by Nagios"
    changefilter0IDstext="Comma separated list of Event IDs to exclude"
    filterdefault="21293,21248,26020,26009"

  2. Add the new nageventlog.adm file to C:\windows\inf folder of your domain controller.
  3. Next, we need to add the template to our default policy.  Launch the GPO Editor by clicking Start > Run > mmc.   Add the "Group Policy Object Editor" Snap-in, click Browse, and choose the Default Domain Policy.
  4. Right-click "Administrative Templates" and choose Add/Remove templates.  Select the template file, nageventlog.adm, we created.
  5. You should now see an item appear as "Nagios Filtering".  If you select it and the "Change Nagios Filter0" does not appear, click View > Filtering and DE-select the "Only show policy settings that can be fully managed".
  6. Select the "Enabled" option, click the checkbox to enable the EXCLUSION of the IDs, and enter the comma delimited list of EventIDs.
  7. Servers will update automatically with their regular policy refresh.  To force a policy update, you can use "gpupdate" from the command line.

You can use the technique above to do a variety of things and tweak things from a central location across the domain environment.


Windows Server 2008 NagEventLog Compatibility

While the 64bit version of NagEventLog v1.9.1 installed on my 64bit Windows 2008 server, I was unable to use the GUI to configure the filters.  However if you visit Steve Shipway's NagEventLog site directly, you can download replacement executables that allow it to properly run in Server2008.  I replaced the files, restarted the service and then the GUI tool worked correctly.

Addon - Nagios Passive Checks with NSCA

Using Nagios with NSCA, you can configure some complex scripts / tasks to output status codes and messages to be sent to your Nagios server for collection / reporting.  To start, you will need to install the NSCA package on your Nagios server and configure the listening server as outlined in the documentation.

NOTE: You will need libmcrypt and libmcrypt-devel packages installed to compile successfully.

You will most likely want to create a template or two to use with your passive checks.  Below is the example template I created for testing passive checks...

define service{
        name                    passive-service
        use                     generic-service
        check_freshness         1
        passive_checks_enabled  1
        active_checks_enabled   0
        is_volatile             0
        flap_detection_enabled  0
        notification_options    w,u,c,s
        freshness_threshold     57600     ;16hr (57600 seconds)
}

Then configure a service like so...

define service{
     use                     passive-service   
     host_name               localhost
     service_description     test
     check_command           check_dummy!3!"No Data Received"
}

On the remote server, you will need to do the same to compile the components.  You will only need the send_nsca binary and the send_nsca.cfg file.  You will need to tweak your send_nsca config file to match the information you configured on your NSCA server.

Now the fun begins where you can create/modify scripts to send these passive check results to Nagios via the NSCA server.  I used a simple perl script below for my testing.

#!/usr/bin/perl
#############################################################
# RETURN CODES:
# 0-OK, 1-WARNING, 2-CRITICAL, 3-UNKNOWN
#############################################################
#CONFIG FILES
#$debug=1;
$config="/usr/local/nagios/etc/send_nsca.cfg";
# LOCAL SYSTEM CONFIG OPTIONS
$nsca_host="nagios.hubteam.com";
$host="host_name";
$service="service_name";
# DEFAULT RETURNS
$code=3;
$result="WHAT THE HECK?";
# COMMAND LINE
$send_nsca="/usr/local/nagios/bin/send_nsca -c $config -H $nsca_host";
# Start
# INSERT YOUR FUN CODE HERE, Setting a $code and $result value
# End
if ($debug) {print "SENDING:  $host\t$service\t$code\t$result\n";}
open(SEND,"|$send_nsca") || die "Could not run $send_nsca: $!\n";
print SEND "$host\t$service\t$code\t$result\n";
close SEND;
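
If your jobs are written in Python rather than Perl, the same idea can be sketched as below.  The paths are copied from the Perl example and are assumptions about your install; format_result and send_result are names I made up.  The one thing that matters is the tab-delimited host, service, code, message line that send_nsca reads on standard input.

```python
import subprocess

# Assumed paths, copied from the Perl example above; adjust to your install.
SEND_NSCA = "/usr/local/nagios/bin/send_nsca"
CONFIG = "/usr/local/nagios/etc/send_nsca.cfg"


def format_result(host, service, code, message):
    """Build the tab-delimited line send_nsca reads on stdin."""
    if code not in (0, 1, 2, 3):
        raise ValueError("code must be 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)")
    # Tabs and newlines inside the message would break the record format.
    clean = " ".join(message.split())
    return f"{host}\t{service}\t{code}\t{clean}\n"


def send_result(nsca_host, host, service, code, message):
    """Pipe one passive check result to send_nsca; returns its exit status."""
    line = format_result(host, service, code, message)
    proc = subprocess.run(
        [SEND_NSCA, "-c", CONFIG, "-H", nsca_host],
        input=line, text=True, check=False,
    )
    return proc.returncode
```

The host and service names passed in must match the host_name and service_description of the passive service defined on the Nagios server, otherwise the result is dropped.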

There are several points to consider.

  • If the script takes < 10 seconds, you may also consider running checks via NRPE and custom command definitions.
  • You can have multiple scripts report passive check results back to the SAME host/service combo.  E.g., run various nightly jobs and direct any errors to a single "nightly-jobs" monitor.
  • Read the Nagios documentation on passive checks and freshness.

Nagios Custom Object Variables

In large Nagios environments, configuring everything at the host level can be cumbersome.  Nagios has nice grouping / templating features that make deploying checks a lot faster as well as easier to manage.  Sometimes you may need to "customize" the check to the specific host.  For example, specify the database name on the individual database server to query.  This is where Nagios "Custom Object Variables" come into play.

As always, you can find some very useful information in the Nagios documentation.

In my case, we will start by defining the custom object variable on the host object by adding a line in the "define host {" block like so:

define host{
        use             server-template
        host_name       dbserver1
        alias           DB Server 1
        address         dbserver1.domain.local
        _DATABASE1              DB01
        }

I have a hostgroup definition for "Database Servers" and a list of common checks for each database server.  You can see how I have my Nagios check configured to use the local variable in the hostgroup definition....

define hostgroup{
        hostgroup_name  database-servers
        alias           Database Servers
        members         dbserver1,dbserver2,dbserver3
        }
define service{
        use                     template
        hostgroup_name          database-servers
        service_description     database-test
        check_command           check_mssql!username!password!-p 1433 -D \
                                $_HOSTDATABASE1$ -w 3 -c 5 -q "exec \
                                $_HOSTDATABASE1$.dbo.sp_test" -s -W 10 -C 20
        }

Note that the backslashes are only for readability here, and the check is a single line in my definition.

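Remember that custom variable names always start with an underscore in the object definition, and the macro that references them is prefixed with the object type: HOST, SERVICE, CONTACT, etc.  For example, _DATABASE1 defined on a host is referenced as $_HOSTDATABASE1$.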
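
The naming rule is easy to get wrong, so here it is as a tiny Python sketch.  custom_macro is my own helper name and is not part of Nagios; it only shows how the name in the definition maps to the macro you use in a command.

```python
def custom_macro(object_type, variable):
    """Return the macro name Nagios exposes for a custom object variable.

    object_type: 'host', 'service' or 'contact'
    variable:    the name as written in the definition, e.g. '_DATABASE1'
    """
    if not variable.startswith("_"):
        raise ValueError("custom variables must start with an underscore")
    prefix = object_type.upper()
    if prefix not in ("HOST", "SERVICE", "CONTACT"):
        raise ValueError(f"unsupported object type: {object_type}")
    # Nagios upper-cases the variable name when building the macro.
    return f"$_{prefix}{variable[1:].upper()}$"
```

So custom_macro("host", "_DATABASE1") gives the $_HOSTDATABASE1$ macro used in the service definition above.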

Nagios Event Handler - Restart Remote Service

I wrote a few quick posts on using Nagios Event_Handlers to restart a service on the local system.  Mostly I followed the example from the Nagios documentation, but it was a little tricky using SUDO to restart a service.  Once I solved that, the logical next step was to be able to restart a service on a REMOTE system with the event_handlers and NRPE.

NAGIOSSVR runs nagios and monitors itself and WEBSVR.  I use the "check_linux_procs" script which is also known as "check_system_procs".  On the remote server WEBSVR, the script configuration lines look something like:

# Processes to check
PROCLIST_RED="httpd sendmail nrpe"
PROCLIST_YELLOW="crond"

# Ports to check
PORTLIST="25 80 5666"

The check_linux_procs is executed on the remote server via NRPE.  We can use NRPE to remotely execute event handlers as well as service checks.  Setup is a bit more complex than a local host configuration. 

Proper SUDO configuration is required on the remote system, WEBSVR.   Read my other post on the Nagios Local Service Restart with Event_Handlers for more information on the SUDO settings.

On WEBSVR I created a very simple script that uses sudo to restart the services.  Something like:

#!/bin/sh
#
/usr/bin/sudo /sbin/service httpd restart
/usr/bin/sudo /sbin/service sendmail restart
exit 0

NRPE is not listed because... well, if NRPE crashes the event_handler cannot run since it uses NRPE to connect and execute the script.  Do not forget to add your script to your nrpe.cfg file like below and restart NRPE on WEBSVR.

command[remote_restart]=/usr/local/nagios/libexec/eventhandlers/remote-restart

On NAGIOSSVR, this service check has a max_check_attempts of 3.  So I had to tweak the script I used before.  The trick here is passing the right variables through.  In my Nagios commands.cfg I added the $HOSTADDRESS$ value to the end of the line like so:

command_line    $USER1$/eventhandlers/restart-services-remote \
$SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$

And the local "event_handler" script on NAGIOSSVR looks similar to the localhost example, but the "sudo" restart command is replaced with:

/usr/local/nagios/libexec/check_nrpe -H $4 -c remote_restart

Don't forget to change the case logic if you need to adjust for a different max_check_attempts value in your config.
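
The decision the case logic has to make can be sketched like this in Python.  This is not my actual script, only an illustration of the usual pattern; it assumes you want one restart on the last SOFT attempt and another once the state goes HARD, and should_restart is a made-up name.

```python
def should_restart(state, state_type, attempt, max_check_attempts=3):
    """Decide whether the event handler should trigger a restart.

    Arguments mirror $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$.
    """
    if state != "CRITICAL":
        return False                      # OK, WARNING, UNKNOWN: do nothing
    if state_type == "SOFT":
        # Assumption: act once, on the last soft attempt before the state goes HARD.
        return int(attempt) == max_check_attempts - 1
    if state_type == "HARD":
        # Restart again as a last resort once the problem is confirmed.
        return True
    return False
```

With max_check_attempts of 3 the soft attempts are 1 and 2, so the restart fires on attempt 2 and again when the state turns HARD.  Change the default if your service uses a different value.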

NOTES: 

  • You could break this into two checks and event_handlers, but I just restart both services to keep it as simple as possible. 
  • You may also try using key-based SSH w/o a password as an alternative to NRPE.  That may be my next tweak to work around NRPE itself crashing.

HINTS: 

  • Always double check script ownership and permissions.  I had forgotten to make the script executable on WEBSVR and that held me up for a while trying to sort it out.

Nagios Event Handler - Restarting a Local Service

Using Nagios Event Handlers you can perform an action based on the results of a Nagios check.  A very straightforward example would be to restart a service.  However it is not as simple as you might think.

I use the "check_system_procs" on the localhost of my nagios server itself to check a few services and restart them all should one no longer be running.  Since my nagios server is a VPS with limited resources, it sometimes runs out of memory and well... things die.

We need to configure the check and the check's event-handler like so:

define service{
        use                             local-service
        host_name                       localhost
        service_description             daemons
        check_command                   check_nrpe!check_daemons
        event_handler                   restart-services
}

In your nagios.cfg, make sure you have "enable_event_handlers=1" to enable the event handlers.  There are several other values in the config file you may wish to alter such as the event_handler_timeout.

In your commands.cfg file, make sure you have event_handler defined something like:

define command{
        command_name    restart-services
        command_line    /usr/local/nagios/libexec/eventhandlers/restart-services \
        $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
        }

The problem we have is that the event_handler runs as the Nagios user, which typically will not be able to restart a service.  To test this, just "su - nagios" and try to restart sendmail or apache.  We can work around this by using SUDO.   Edit the SUDOERS file (visudo) and add something like the lines below to the end of the file.

User_Alias NAGIOS = nagios,nagcmd
Cmnd_Alias NAGIOSCOMMANDS = /sbin/service
Defaults:NAGIOS !requiretty
NAGIOS    ALL=(ALL)    NOPASSWD: NAGIOSCOMMANDS

Essentially we're defining users and commands that can be run via SUDO, without a password, and without a session. 

Attached is the script I use (found it on the web) for the scenario described above.  Do not forget to make sure the script has the appropriate ownership / permissions.  Try executing the script as the nagios user to test it prior to setting up the event_handler in Nagios.

Attachment: event_handler_script.txt (2.98 KB)

Nagios, NagVis and PNP4Nagios Example


A vanilla out of the box example of the Nagios/NagVis/PNP4Nagios integration. There were the usual installation pains with all the dependencies required for the packages, but setup was not too difficult following the documentation. I created a simple hardware diagram in Visio for this example. I added icons for HOST status and the CPU Load and Root Partition service checks. I updated the "hover" template for NagVis to show the PNP4Nagios graphs for the services.

As you can see, you have the ability to create some slick visuals. You can create a high level dashboard and drill down to more detailed maps. Of course this all works much better when your hardware and logical layouts are relatively static. In a very dynamic environment Nagios can be an administrative pain and this only increases the complexity.

Nagios - Switch Interface Traffic

I recently wanted to start monitoring some ports on my switch stack. Specifically several uplink ports and several trunk ports. Doing a little research I found the best plugin was the "check_iftraffic3" plugin available from the Nagios Plugin Exchange. Ref: http://exchange.nagios.org/directory/Plugins/Network-Connections%2C-Stats-and-Bandwidth/check_iftraffic3/details

I modified the perl script slightly to format the output a bit differently. The biggest trick is determining the interface ID. Using SNMPWALK on my Nagios server I was able to look at the various interfaces in my switching environment.

snmpwalk -v 2c -c public aaa.bbb.ccc.ddd ifTable

Configure a new check command in the standard fashion and off you go! Oh, I had tweaked the output slightly and created a PNP4NAGIOS template to better display the IN/OUT data on the same graph vs. individual graphs where the "scale" of the graph could be misleading. I'll attach that info as a TXT file.

Attachment: check_traffic3_php.txt (1.38 KB)

Plugin: check_dns_secondary - Checking NS Servers

Ref: http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1948.html;d=1

NOTE: May require installation of additional perl modules.

I renamed the check to "check_ns_servers" on my install to be a little more obvious as to its function.  After several DNS hosting provider outages which would manifest a wide array of errors over odd periods of time thanks to DNS caching, I wanted a Nagios plugin to check to make sure our DNS was working correctly.  Sure there is "check_dns" which I also use to check the resolution of a name to correct IP, but I wanted something a bit more powerful.

"check_dns_secondary" will query for the name servers of the provided domain.  Each NS server is queried individually for the SOA record of the domain.  An error is generated if any server is not functioning, or not authoritative.  A warning is generated if any server lags the others in serial-number.

Plugin: check_http_requisites - Page Size, Files, and Loadtime

Summary

A Nagios module written in Python that downloads the page and embedded elements using 'wget' to measure a more realistic total page load time value.  The total number and size of elements as well as the time it took to load them is returned.  A warn/critical alert can be triggered by the total load time.

Usage Example

Ideally you want your total page load time to be less than a few seconds.  This means making sure your images are sized correctly and in the correct format, e.g. a JPEG vs. a large BMP file.  It also means making sure any "embedded" objects, like externally referenced image/media files, do not slow your site down.  Or perhaps your website is under load and just not responding in a timely fashion.

Using this plugin we can monitor the relative load time of select sites and/or pages within a site.  Note that the check is somewhat dependent on the system executing the check and its network bandwidth.  While you are unlikely to be running Nagios over dialup, bandwidth limitations and other traffic could definitely affect the total page load time.  The check could also alert you to a sub-optimal routing, latency, and/or packet loss issue: what I like to call the "TII", aka Transient Internet Issue.

Adding to NagiosGraph

Of course graphs are always visually pleasing and allow you to make your point about what happens when Marketing uploads 1meg BMP files instead of the recommended JPEGs.  Here is the NagiosGraph map file entry I added.


# Service type: check_http_complete
# output: OK - Downloaded: 149K bytes in 8 files in 0.83 seconds
# perfdata: time=0.83;size=149K;number=8
/perfdata:.*time=([.0-9]+);size=(\d+)K;number=(\d+)/ and
push @s, [ http_complete,
[ sec, GAUGE, $1 ],
[ KB, GAUGE, $2 ],
[ files, GAUGE, $3 ] ];
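
To test the pattern outside of NagiosGraph, the same capture can be tried in Python. This is just a sketch; the function name and the returned dictionary are my own choices for illustration, and only the regular expression mirrors the map entry above.

```python
import re

# Same capture groups as the map entry: seconds, size in K, number of files.
PERF_RE = re.compile(r"time=([.0-9]+);size=(\d+)K;number=(\d+)")

def parse_http_perfdata(perfdata):
    """Return the three graphed values, or None if the line does not match."""
    match = PERF_RE.search(perfdata)
    if match is None:
        return None
    return {
        "sec": float(match.group(1)),
        "KB": int(match.group(2)),
        "files": int(match.group(3)),
    }
```

If this returns None for your plugin output, the map entry will not match either and no graph data will be stored.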

Available Here : http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1352.html;d=1

Plugin: check_mem - Linux Memory Usage

A plugin written in Perl to monitor and check thresholds for memory based on the output of the 'free -mt' command.

Ref: Nagios Exchange - check_mem page

Installation

  1. Copy the file to the /usr/local/nagios/libexec directory of the host you are monitoring
  2. Set the file mode to 755
  3. Add the following line to your nrpe.cfg file and restart the NRPE service:
    command[check_mem]=/usr/local/nagios/libexec/check_mem -w 80,20 -c 95,50
  4. Add the NRPE check to the appropriate configuration file on your Nagios server like:
define service{
        use                     servicetemplate2
        hostgroup_name          linux-servers
        service_description     Memory Usage
        check_command           check_nrpe!check_mem
}

Command Line Syntax

# /usr/local/nagios/libexec/check_mem -w 50,20 -c 80,50
<b>WARNING: Memory Usage (W> 50, C> 80): 72% <br>Swap Usage (W> 20, C> 50): 0%</b> \
|MemUsed=72%;50;80 SwapUsed=0%;20;50
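
The part after the "|" is the performance data, in the form label=value;warning;critical.  As a rough illustration of how those thresholds relate to the reported state, here is a small Python sketch.  The function name is hypothetical, and the strict "greater than" comparison is an assumption based on the "W> 50, C> 80" text in the output above.

```python
import re

# Matches items such as "MemUsed=72%;50;80" (label, value, warning, critical).
ITEM_RE = re.compile(r"(\w+)=(\d+)%;(\d+);(\d+)")

def status_from_perfdata(perfdata):
    """Return the worst state found: 0 = OK, 1 = WARNING, 2 = CRITICAL."""
    worst = 0
    for _label, value, warn, crit in ITEM_RE.findall(perfdata):
        value, warn, crit = int(value), int(warn), int(crit)
        if value > crit:
            state = 2
        elif value > warn:
            state = 1
        else:
            state = 0
        worst = max(worst, state)
    return worst
```

For the example above, 72% memory used is over the warning level of 50 but under the critical level of 80, which matches the WARNING in the plugin output.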

Display the Data

To add this to your "map" file for NagiosGraph, append the code to capture the data like below:

# Service type: check_mem
#   check command: check_nrpe!check_mem -w 50,10 -c 80,25
#   output: <b>CRITICAL: Memory Usage (W> 80, C> 95): 100% <br>Swap Usage (W> 20, C> 50): 0%</b>
#   perfdata: MemUsed=100%;80;95 SwapUsed=0%;20;50
/perfdata:.*MemUsed=(\d+)%;(\d+);(\d+).*?SwapUsed=(\d+)%;(\d+);(\d+)/
and push @s, [ memory,
       [ ramuse, GAUGE, $1 ],
       [ swapuse, GAUGE, $4 ] ];

Dumping Linux Buffer Cache

[Screenshot: top output]

A useful command to check Linux system resources is "top".  However, because of the buffer cache, you may see almost no available memory (see attached screenshot).  But how can that be?  All you have running may be a Java app, Apache, and a few other services.  There is no way that should be using ALL of that RAM.  In my case, I have the Nagios "check_mem" plugin querying for available memory and throwing alerts quite regularly.

The "free -mt" command can show you how much memory is cached.  That eases my mind a bit.  While googling about buffer cache, I stumbled upon this article:

http://devcs.blogspot.com/2007/12/linux-buffer-cache-how-to-disable-it.html

This article gives a good rundown of what the buffer cache is all about.  Also it mentions a nice little trick to dump the entire cache.

echo 1 > /proc/sys/vm/drop_caches

VOILA!  Cache dumped and Nagios is happy.  Just dumping the cache shouldn't be taken lightly as it MAY have some adverse effects depending on your server.  However the cache should slowly start to build back up. 

Looking into the actual check_mem script, you can have it exclude the buffers in the calculation of free memory.  Check this line and make sure the value is "1":

my $DONT_INCLUDE_BUFFERS = 1;
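
To see why that setting matters, here is a rough Python sketch of the arithmetic.  It is not the check_mem code: the function and parameter names are made up, and it assumes the older 'free' layout where the "used" column includes buffers and cache.

```python
def mem_used_percent(total_mb, used_mb, buffers_mb, cached_mb,
                     dont_include_buffers=True):
    """Percent of RAM in use, optionally ignoring buffers and cache."""
    if dont_include_buffers:
        # Buffers and cache are reclaimable, so do not count them as used.
        used_mb = used_mb - (buffers_mb + cached_mb)
    return round(100.0 * used_mb / total_mb)
```

With made-up numbers of 4000 MB total, 3880 MB used, 200 MB buffers, and 2680 MB cached, this gives 97 percent when buffers are counted and 25 percent when they are excluded, which is the difference between a constant alert and a quiet check.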

Plugin: check_sql - Check MSSQL and MYSQL servers

Ref: http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1435.html;d=1

Written in Perl.  Requires FreeTDS to be installed.

This plugin can query a Microsoft SQL Server or a MySQL Server. The plugin can also execute specific queries or stored procedures and return the results. The results can then be compared via thresholds for numeric values or via regular expressions for string values.

Here is a good example of how I used the plugin, from my blog: "Count Log Entries Stored Procedure and check by Nagios".

Plugin: check_svn - Check Subversion

Summary

Check_svn is a Nagios check written in Python which will check the availability of your SVN repository from your Nagios server.

Ref: http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1554.html;d=1

ERROR: CHECK_SVN - Error Connecting

Summary

The "check_svn" plugin worked from the command line.  However when I attempted to configure the Nagios check, an error occurred.

Ref: http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1554.html;d=1

SVN CRITICAL: Error connecting to svn server - Can't open file '/root/.subversion/servers': Permission denied .

Solutions

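Apparently this is caused by a Nagios environment issue: when the check runs under Nagios, svn looks for its configuration under /root, which the nagios user cannot read.  The nagiosexchange page recommends one solution.  Alternatively you can modify the check_svn script.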

  1. Edit the command line in the nagios commands.cfg file to export the HOME variable:
    command_line export HOME=/home/nagios && $USER1$/check_svn
  2. Edit the "check_svn" script to pass a config directory on the svn command line.
    Add a variable like:  self.confdir    = "/home/nagios"
    and edit the "cmd" line the script builds, after the if statements for username/password:
            if self.confdir:
                cmd += " --config-dir=%s" % self.confdir
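
The second option amounts to something like the sketch below.  This is not the check_svn source: the function name and the use of "svn info" are assumptions for illustration.  The --username, --password, and --config-dir options are standard svn options.

```python
def build_svn_cmd(url, username=None, password=None, confdir=None):
    """Build an svn command line as a list of arguments (hypothetical helper)."""
    cmd = ["svn", "info", url]  # "info" is an assumed subcommand
    if username:
        cmd += ["--username", username]
    if password:
        cmd += ["--password", password]
    if confdir:
        # Point svn at a directory the nagios user can read.
        cmd += ["--config-dir", confdir]
    return cmd
```

Building the command as a list rather than one long string avoids quoting problems if it is later passed to something like subprocess.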

All should work well now.  I also make sure to use the "-T" option to output the test execution time which I can now graph with nagiosgraph.

Tweak - Nagios Jabber / XMPP Notifications

[Image: Nagios Openfire notification]

I wanted to add the ability to send Nagios alerts as instant messages via Jabber, aka the XMPP protocol.  Our office uses the Openfire platform for corporate instant messaging.  Some googling turned up a Perl script for sending the notifications.

I modified the server connection and user variables to fit my Openfire installation.  However, after running some simple command line tests, I could not get it to work.  The "unable to connect" errors were easy enough to understand and fix, but then I got an unauthorized error.

ERROR: Authorization failed: error - not-authorized

The link below has information on the solution to the error.  OK, so now we assume you CAN send a test message via Jabber/XMPP.  Let's configure it.

There are two primary scenarios for configuring the Jabber/XMPP notifications. 

Scenario 1 - User wants to be contacted for ALL Nagios alerts via Jabber/XMPP. 

Define the contact command something like:

# 'notify-by-jabber' command definition
define command{
        command_name    notify-by-jabber
        command_line    /usr/local/nagios/bin/notify_via_jabber \
             $CONTACTADDRESS1$ "$NOTIFICATIONTYPE$ $HOSTNAME$ \
             $SERVICEDESC$ $SERVICESTATE$ $SERVICEOUTPUT$ $LONGDATETIME$"
        }

*note - I'm using the backslashes for readability only; the command_line is actually a single line.

Notice how I used "$CONTACTADDRESS1$" in the command line.  Now check out my contact definition...

define contact{
        contact_name                    jdoe
        use                             generic-contact
        alias                           John Doe
        email                           jdoe@company.com
        address1                        jdoe
        service_notification_commands   +notify-by-jabber
        }

This will use the generic contact info for the email method -and- add the jabber contact method.  Note the "address1" line I am using for the appropriate jabber_id for the user.  Alternatively you could create a contact template called "jabber-contact" using the notify-by-jabber command and then apply both templates to the user contact definition.

Scenario 2 - User only needs specific Nagios service/host alerts via Jabber/XMPP.

Unfortunately this is not as simple.  You need to define an entirely separate second contact to assign to the specific host/service.  Create a "jdoe-jabber" contact, so that in your Nagios host/service definition you would have a line like:

contacts          jdoe,jdoe-jabber

Maybe with an add-on or future version of Nagios we could define a user/contact method along the lines of:

contacts           jdoe:notify-by-jabber,jdoe:notify-by-email

Anyone listening?  Feel free to send me a note or comment.

Attachment: jabber-xmpp-notications.txt (2.45 KB)

ERROR - Nagios XMPP Notification with Openfire

ERROR: Authorization failed: error - not-authorized

Obviously this means I am connected and communicating with the server.  So I flipped the TLS variable on/off a few times.  I also turned on debug logging on my Openfire server and could see the connections.  With TLS "on" or set to "1", I received this error:

Can't use an undefined value as a HASH reference at /usr/lib/perl5/site_perl/5.8.8\
/XML/Stream.pm line 1165.

So I turned TLS off and started working on the not-authorized error.  I googled around and found a fix that worked.  I had to edit the Protocol.pm Perl module to fix the authentication error.  I found my file here...

/usr/lib/perl5/site_perl/5.8.8/Net/XMPP/Protocol.pm

and just commented out the line:

return $self->AuthSASL(%args);

Now I can annoy myself with all my Nagios alerts via IM as well as email!  Attached is the file with the text used for the commands.cfg entry and the perl script to send the notifications.

Tweak - Nagios SMS Messaging

Want to send SMS messages from Nagios?  Are SMS messages sometimes blocked when you send them via <phone#>@provider.com?  Sending a lot of SMS messages?  Sounds like you need an SMS Gateway Provider.

In our corporate environment we wanted a more reliable/consistent SMS messaging system to work with our Nagios monitoring environment.  A little research quickly led us to Clickatell.  To keep things as simple as possible, we set up Nagios to use the SMS gateway via its SMTP API.

Now we had to configure Nagios.  First off, we needed to create a notification method/command.  Here is the commands.cfg entry we created:

define command{
        command_name    notify-service-by-sms
        command_line    /usr/bin/printf "%b" "api_id:<API_ID> \nuser:<USERNAME> \npassword:<PASSWORD> \nto:$CONTACTPAGER$\nreply:<REPLY_ADDY> \ntext:$NOTIFICATIONTYPE$ $HOSTALIAS$-$SERVICEDESC$ $SERVICESTATE$\ntext:Address-$HOSTADDRESS$\ntext:Additional Info-$SERVICEOUTPUT$" | /bin/mail -s "$HOSTALIAS$-$SERVICEDESC$ is $SERVICESTATE$" $CONTACTEMAIL$
        }

Items in angle brackets must be filled in based on your environment.  Notice how we had to significantly strip down the info sent via SMS text message.  We found the above was simple and communicated the required info.
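
The long printf in that command is easier to read when broken out.  The Python sketch below builds the same mail body layout, one "field:value" per line followed by the "text:" lines.  The function name and arguments are hypothetical, and it only mirrors the format used in the command above; it does not send anything and is not a reference for the gateway's API.

```python
def build_sms_body(api_id, user, password, to, reply, text_lines):
    """Assemble the mail body in the same order as the printf above."""
    lines = [
        "api_id:%s" % api_id,
        "user:%s" % user,
        "password:%s" % password,
        "to:%s" % to,
        "reply:%s" % reply,
    ]
    # Each piece of the alert goes on its own "text:" line.
    lines += ["text:%s" % text for text in text_lines]
    return "\n".join(lines)
```

Printing the result for a sample alert is a quick way to confirm the message stays short enough for a text message before wiring it into Nagios.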

Then let's create a contact to use...

define contact {
        contact_name    sms
        use             generic-contact
        alias           SMS Alert
        email           sms@messaging.clickatell.com
        pager           16665551234,16665554321
        service_notification_commands   notify-service-by-sms
}

The phone numbers to SMS are just a comma separated list.  It is important to note the phone numbers must have a "1" (the country code) before them.
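
If your phone list is kept in mixed formats, a small helper can produce the pager value.  This Python sketch is only an illustration: the function name is made up, and it assumes 10-digit US numbers that just need the leading "1" added.

```python
import re

def to_pager_field(numbers):
    """Return a comma separated list of digits-only numbers with a leading 1."""
    cleaned = []
    for number in numbers:
        digits = re.sub(r"\D", "", number)  # drop spaces, dashes, parentheses
        if len(digits) == 10:
            digits = "1" + digits  # assumption: bare 10-digit US number
        cleaned.append(digits)
    return ",".join(cleaned)
```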

Now simply add the "sms" contact in the service definitions you want to alert by SMS text messages.    Reload Nagios and you should be off and running.

Tweak - check_file_age to check_file_modified

Out of the box the Nagios Plugins package has a check_file_age plugin.  Well, that only checks to see if the file has been modified in the last specified time period and alerts if the file has NOT changed recently.  I needed the exact opposite, to check to see if the file has been modified. 

The "reversal" can be accomplished by changing two ">" symbols to "<" in the comparisons of the function.  Of course, I also changed the few occurrences of the plugin name to a new name.  This makes a good sanity check: when any critical config file that I designate is changed, the IT team is alerted.  Great for catching developers with sudo access touching key config files they should not be, such as the httpd.conf file.
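
In Python terms, the reversed logic looks roughly like the sketch below.  The real plugin is a Perl script; the function names and threshold arguments here are hypothetical and only show the direction of the comparisons.

```python
import os
import time

OK, WARNING, CRITICAL = 0, 1, 2

def file_age(path, now=None):
    """Seconds since the file was last modified."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime

def modified_state(age, warn_age, crit_age):
    """Reversed check: a YOUNG file is the problem, so compare with '<'."""
    if age < crit_age:
        return CRITICAL
    if age < warn_age:
        return WARNING
    return OK
```

Note that with the comparisons reversed, the critical age is the smaller number: for example, critical if changed in the last 5 minutes, warning if changed in the last hour.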

Tweak: Using NagiosGraph's SHOW.CGI

Using NagiosGraph with Nagios can provide valuable information about your environment.  At times, you may want to show something on a different scale or limit the data seen within the graph.  Below are a few basics for manipulating NagiosGraph's show.cgi to customize the graph you are viewing.

Example

We use "check_mssql_monitor 0.9.0" with our Microsoft SQL clusters and graph the results.  My NagiosGraph "map" file graphs the CPU, IO, IDLE, and response time of the checks, and the Nagios notes_url for the check links to the default graph.  However, the scale of the values typically renders the IO line nearly flat, yet a small change in IO can be significant.  So now we want to generate IO-only graphs for each of our clusters to compare them to each other.

Here's the default URL of the graph we are working with: 

http://nagios.domain.com/nagiosgraph/show.cgi?host=servername&service=MSSQLinfo

[Image: mssql_monitor NagiosGraph graph, before]

By manipulating the URL, we can do some handy tweaks.  Let's start with only graphing the IO.  To do that, add the datasource name and value name to the URL with the db option, like this:

http://nagios.domain.com/nagiosgraph/show.cgi?host=servername&service=MSSQLinfo&db=mssql_monitor,io

My favorite option is to make the graph bigger!  After all, bigger is better, right?  Just add the geometry option to the URL like so:

http://nagios.domain.com/nagiosgraph/show.cgi?host=servername&service=MSSQLinfo&db=mssql_monitor,io&geom=700x200

[Image: mssql_monitor NagiosGraph graph, after]

Now we have a much larger graph and are able to see the IO response increasing during the given time period.  Next step is to figure out why!

HINT:  for the db source name, check out the default graph and look for the name immediately under the graph.  Then add the value as it appears in the legend.
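
If you need these links for a lot of hosts, they can be generated.  The Python sketch below builds the same URLs as the examples above; the function name is hypothetical, and only the host, service, db, and geom options shown on this page are used.

```python
from urllib.parse import urlencode

def show_cgi_url(base, host, service, db=None, geom=None):
    """Build a show.cgi link with the optional db and geom options."""
    params = [("host", host), ("service", service)]
    if db:
        params.append(("db", db))      # "datasource,value", e.g. "mssql_monitor,io"
    if geom:
        params.append(("geom", geom))  # "WIDTHxHEIGHT", e.g. "700x200"
    # Keep the comma in the db option unescaped so it matches the examples.
    return base + "?" + urlencode(params, safe=",")
```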

Tweak: check_sql - Allow decimal values

While building another stored procedure that I execute by check_mssql in Nagios, I noticed a little hiccup.  My stored procedure was returning a value like "85.67".  When I executed the check_sql on the command line to run the procedure, I got a strange error...

# ./check_mssql -H mssqlclus1 -U username -P password -p 1433 -D database -w 20 -c 35 \
> -q "exec database.dbo.sp_GetDatabaseFileMetrics database,Used,Log,1" -W 90.00 -C 95.00 -s
CHECK_MSSQL CRITICAL - Result is not numeric with result threshold defined (0.089992 seconds) \
| time=0.089992s;20;35

Now that does not make sense at all.  I removed the -W and -C constraints and got:

CHECK_MSSQL OK - SQL Server result: 98.10 (0.122316 seconds) | time=0.122316s;20;35

I do not know about you, but "98.10" looks like a numeric to me.  So I opened up the perl for the check_mssql and looked for the conditions that triggered the error.  This was the regular expression it was evaluating to determine if the value returned was a numeric instead of a string.

$result =~ /^[-+]?\d+$/

Well, that does not do the trick if I have a value like "98.10".  A value of "98" would have been fine.  I freely admit I am no "code guru" by any means, but I figured I should be able to come up with a fix for this.  I copied the plugin script to 'check_mssql2' and went to work.  I created an OR condition to look for either the integer regular expression or a decimal regular expression.  There may be a better way, but this worked for me.  I changed this:

!($result =~ /^[-+]?\d+$/)) {

to this:

!(($result =~ /^[-+]?\d+$/) || ($result =~ /^[-+]?\d+\.\d+$/))) {

I ran my tests and it worked great!  A really useful link I found was this Regular Expression Validator at http://www.sweeting.org/mark/html/revalid.php.  It also has some very handy reference info on the bottom which I found useful since I do not have the pleasure of writing them on a daily basis.
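
For anyone wanting to experiment with the pattern without touching the plugin, here is an equivalent test in Python.  It is a sketch only; it folds the two expressions into one by making the decimal part optional, which is a variation on the fix above rather than what the modified plugin contains.

```python
import re

# Optional sign, digits, then an optional ".digits" part.
NUMERIC_RE = re.compile(r"^[-+]?\d+(\.\d+)?$")

def is_numeric(result):
    """True for values like "98", "98.10" or "-3.5"; False for other strings."""
    return NUMERIC_RE.match(result) is not None
```

Values such as "98." or ".5" are still rejected, the same as with the two-expression version.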

Common Errors & Fixes

From my experiences with Nagios, NagiosGraph, Webinject and various other plugins and modules...
Please comment and even email me errors / fixes and I will add them and link back to a site, etc.  Or if you want to go for the gold, request to be a contributor to the site.

ERROR - GD, PNG, and/or JPEG libraries could not be located

Running through a new Nagios 3.2.0 install.  Downloaded all the components I'm going to need on my new VPS.  Installed gcc, php, and a few others I knew I would need.  However when I ran the configure I got the following error:

*** GD, PNG, and/or JPEG libraries could not be located... *********

I forgot to add gd-devel to my install checklist.  That was an easy one.

ERROR: "undefined symbol: art_alloc"

Having a problem getting graphs to generate? I found the following error in my /var/log/httpd/error_log:
[Mon Oct 06 16:54:56 2008] [error] [client 204.9.220.36] /usr/bin/perl: symbol lookup error: /usr/local/rrdtool/lib/librrd.so.2: undefined symbol: art_alloc, referer: http://monitor1.server.com/nagiosgraph/show.cgi?host=pg1&service=WebInject&geom=700x200

I had previously installed several versions of libart_lgpl in my initial attempts to get NagiosGraph and RRDTool working. Turns out "art_alloc" is indeed undefined in libart_lgpl 2.3.17, but it is defined in 2.3.19. It may have been the default 2.3.17 rpm that I installed. However, downloading and installing 2.3.19 fixed the issue after I sorted out the path issues.  You could also take a shortcut and symlink any old files to the new install instead of configuring PATHs, which always seem to get messy for me.

Here's how I checked the art_alloc symbol:


[root@monitor1 lib]# nm -D /usr/local/lib/libart_lgpl_2.so.2.3.19 |grep art_alloc
000033d0 T art_alloc
[root@monitor1 lib]# nm -D /usr/lib/libart_lgpl_2.so.2.3.17 |grep art_alloc
[root@monitor1 lib]#

ERROR: libart-2.0 - Missing libart when installing RRDtool

configure: WARNING:
----------------------------------------------------------------------------
* I found a copy of pkgconfig, but there is no libart-2.0.pc file around.
  You may want to set the PKG_CONFIG_PATH variable to point to its
  location.
----------------------------------------------------------------------------

The issue is the RRDtool configure script is looking in /usr/include vs. /usr/local/include for the libart files.  So there are a few ways to tweak this.

  1. You can symlink the files from where it is looking to where they are.  Can be very messy to get working, but you may unwittingly solve future issues where these files are needed.
  2. You could change where you install libart to not use /usr/local
  3. You can set the correct environment variable:
    export CPPFLAGS=' -I/usr/local/include/libart-2.0'

That solved my RRDTool installation issue.

FIX - CHECK_ESX3.PL Script

In my Nagios testing I was trying to work with the CHECK_ESX3.PL script to run some scripts against the ESX hosts in the environment. I installed the required VMware vSphere Perl SDK (latest from VMware's site). I was able to run "./check_esx3.pl" without any errors. However when I tried to run an actual check against a host I received:

CHECK_ESX3.PL CRITICAL - Server version unavailable at 'https://172.16.0.81:443/sdk/vimService.wsdl' at /usr/lib/perl5/5.8.8/VMware/VICommon.pm line 545.

I ran across the threads about how the latest LWP does not like self-signed certificates. So I added this line to the top of the Perl script:
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;

However that did not seem to correct the issue (but it turned out it was one of several issues). I noticed when I rebuilt the VMware vSphere Perl SDK I saw this error:

The following Perl modules were found on the system but may be too old to work
with vSphere CLI:

Compress::Zlib 2.005 or newer
HTML::Parser 3.60 or newer
URI 1.37 or newer
XML::SAX 0.16 or newer

I updated those perl modules as well without seeing a difference. What I eventually did was manually edit the VICommon.pm module. I looked at line 545 which tries to parse the response data. So I added a line to just print out the data prior to that step. That's when I saw the message about a proxy error. Turns out I had http_proxy, ftp_proxy, and https_proxy environment variables set from another idea I had been toying with. I removed the environment variables and I was off and running! So I actually had two issues, the self-signed certificates and the bad proxy environment variables.
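
A quick way to rule out the proxy problem up front is to look for those variables in the environment the check will run in.  The Python sketch below does that; the function name is hypothetical, and it only looks at the three lowercase variable names mentioned above.

```python
import os

PROXY_VARS = ("http_proxy", "https_proxy", "ftp_proxy")

def find_proxy_vars(environ=None):
    """Return the proxy variables that are set and non-empty."""
    environ = os.environ if environ is None else environ
    return {name: environ[name] for name in PROXY_VARS if environ.get(name)}
```

An empty result means none of the three are set; anything else is worth unsetting before blaming the SDK.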

Error - NSClient Counter Errors

***** Nagios *****

Notification Type: PROBLEM

Service: CPU Load
Host: BALLYs DBA-1
Address: ballys.hubteam.local
State: UNKNOWN

Date/Time: Mon Dec 28 12:53:20 CST 2009
Duration: 0d 0h 2m 1s
Additional Info:
NSClient - ERROR: Could not get data for 5 perhaps we dont collect data this far back?

-------------------------------------------------
-------------------------------------------------
***** Nagios *****

Notification Type: PROBLEM

Service: Memory Usage
Host: BALLYs DBA-1
Address: ballys.hubteam.local
State: UNKNOWN

Date/Time: Mon Dec 28 12:52:20 CST 2009
Duration: 0d 0h 1m 31s
Additional Info:
NSClient - ERROR: Failed to get PDH value.

I received several odd errors with the Memory, CPU and Uptime monitors on one of our new servers with NSClient++.  A little quick research led me to the solution below.  Do not forget to restart the NSCLIENT service too after resetting the counters.

lodctr /R

** Make sure the "/R" is an upper case R.