Data centers are growing and administrative staffs are shrinking, necessitating efficient monitoring tools for compute resources. Part 1 of this series discussed the benefits of using Ganglia and Nagios together, then showed you how to install and extend Ganglia with homemade monitoring scripts.
Recall from Part 1 the multiple definitions of monitoring (depending on the implier and the inferrer):
- If you're running applications on the cluster, you think: "When will my job run? When will it be done? And how is it performing compared to last time?"
- If you're the operator in the network operations center, you think: "When will we see a red light that means something needs to be fixed and a service call placed?"
- If you're in the systems engineering group, you think: "How are our machines performing? Are all the services functioning correctly? What trends do we see, and how can we better utilize our compute resources?"
You can find code to monitor exactly what you want to monitor and that code can be of the open source variety. The most difficult part of using open source monitoring tools comes when you attempt to implement an install and puzzle out a configuration that works well for your environment. Two major problems with open source (and commercial) monitoring tools are the following:
- No tool will monitor everything you want the way you want it.
- Much customization could be required to get the tool working in your data center exactly how you want it.
Ganglia is a tool that monitors data centers and is used heavily in high-performance computing environments (but it's attractive for other environments too like clouds, render farms, and hosting centers). It is more concerned with gathering metrics and tracking them over time compared with Nagios's focus as an alerting mechanism. Ganglia used to require an agent to run on every host to gather information from it, but now metrics can be obtained from just about anything through Ganglia's spoofing mechanism. Ganglia doesn't have a built-in notification system, but it was designed to support scalable built-in agents on target hosts.
After reading Part 1, you could install Ganglia, as well as answer the monitoring questions that different user groups tend to ask. You could also configure the basic Ganglia setup, use the Python modules to extend functionality with IPMI (the Intelligent Platform Management Interface), and use Ganglia host spoofing to monitor IPMI.
Now, let's look at Nagios.
This part shows you how to install Nagios and tie Ganglia back into it. We're going to add two features to Nagios that'll help your monitoring efforts in standard clusters, grids, clouds (or whatever your favorite buzzword is for scale-out computing). The two features are all about:
- Monitoring network switches
- Monitoring the resource manager
In this case, we'll be monitoring TORQUE. When we are finished, you'll have a framework to control the monitoring system of your entire data center.
Nagios, like Ganglia, is used heavily in HPC and other environments, but Nagios is more of an alerting mechanism than Ganglia (which is more focused on gathering and tracking metrics). Nagios previously only polled information from its target hosts, but has recently developed plug-ins that allow it to run agents on those hosts. Nagios has a built-in notification system.
Now let's install Nagios and set up a baseline monitoring system of an HPC Linux cluster to address the three different monitoring perspectives:
- The application person can see how full the queues are and see available nodes for running jobs.
- The NOC can be alerted of system failures or see a shiny red error light on the Nagios Web interface. They also get notified via email if nodes go down or temperatures get too high.
- The system engineer can graph data, report on cluster utilization, and make decisions on future hardware acquisitions.
The effort to get Nagios rolling on your machine is well documented on the Internet. Since I tend to install it a lot in different environments, I wrote a script to do it.
First you need to download two packages:
- Nagios (tested with version 3.0.6)
- Nagios-plugins (tested with version 1.4.13)
The add-ons include:
- The Nagios Event Log, which allows for monitoring Windows event logs
- The NRPE, which provides a lot of Ganglia functionality
Get the tarballs and place them in a directory. For example, I have the following three files in /tmp:
- nagios-3.0.6.tar.gz
- nagios-plugins-1.4.13.tar.gz
- naginstall.sh
Listing 1 shows the naginstall.sh install script:
Listing 1. The naginstall.sh script
#!/bin/ksh
# naginstall.sh: build and install Nagios and the Nagios plug-ins from source.
# Run as root from the directory that holds both tarballs.

NAGIOSSRC=nagios-3.0.6
NAGIOSPLUGINSRC=nagios-plugins-1.4.13
NAGIOSCONTACTSCFG=/usr/local/nagios/etc/objects/contacts.cfg
NAGIOSPASSWD=/usr/local/nagios/etc/htpasswd.users
PASSWD=cluster
OS=foo

function buildNagiosPlug {
  if [ -e $NAGIOSPLUGINSRC.tar.gz ]
  then
    echo "found $NAGIOSPLUGINSRC.tar.gz building and installing Nagios Plugins"
  else
    echo "could not find $NAGIOSPLUGINSRC.tar.gz in current directory."
    echo "Please run $0 in the same directory as the source files."
    exit 1
  fi
  echo "Extracting Nagios Plugins..."
  tar zxf $NAGIOSPLUGINSRC.tar.gz
  cd $NAGIOSPLUGINSRC
  echo "Configuring Nagios Plugins..."
  if ./configure --with-nagios-user=nagios --with-nagios-group=nagios \
      --prefix=/usr/local/nagios > config.LOG.$$ 2>&1
  then
    echo "Making Nagios Plugins..."
    if make -j8 > make.LOG.$$ 2>&1
    then
      # append so the build log is not overwritten
      make install >> make.LOG.$$ 2>&1
    else
      echo "Make failed of Nagios plugins. See $NAGIOSPLUGINSRC/make.LOG.$$"
      exit 1
    fi
  else
    echo "configure of Nagios plugins failed. See $NAGIOSPLUGINSRC/config.LOG.$$"
    exit 1
  fi
  echo "Successfully built and installed Nagios Plugins!"
  cd ..
}

function buildNagios {
  if [ -e $NAGIOSSRC.tar.gz ]
  then
    echo "found $NAGIOSSRC.tar.gz building and installing Nagios"
  else
    echo "could not find $NAGIOSSRC.tar.gz in current directory."
    echo "Please run $0 in the same directory as the source files."
    exit 1
  fi
  echo "Extracting Nagios..."
  tar zxf $NAGIOSSRC.tar.gz
  cd $NAGIOSSRC
  echo "Configuring Nagios..."
  if ./configure --with-command-group=nagcmd > config.LOG.$$ 2>&1
  then
    echo "Making Nagios..."
    if make all -j8 > make.LOG.$$ 2>&1
    then
      # append so each install step keeps the earlier log output
      make install >> make.LOG.$$ 2>&1
      make install-init >> make.LOG.$$ 2>&1
      make install-config >> make.LOG.$$ 2>&1
      make install-commandmode >> make.LOG.$$ 2>&1
      make install-webconf >> make.LOG.$$ 2>&1
    else
      echo "make all failed. See log:"
      echo "$NAGIOSSRC/make.LOG.$$"
      exit 1
    fi
  else
    echo "configure of Nagios failed. Please read $NAGIOSSRC/config.LOG.$$ for details."
    exit 1
  fi
  echo "Done Making Nagios!"
  cd ..
}

function configNagios {
  echo "We'll now configure Nagios."
  LOOP=1
  while [[ $LOOP -eq 1 ]]
  do
    echo "You'll need to put in a user name. This should be the person"
    echo "who will be receiving alerts. This person should have an account"
    echo "on this server. "
    print "Type in the userid of the person who will receive alerts (e.g. bob)> \c"
    read NAME
    print "What is ${NAME}'s email?> \c"
    read EMAIL
    echo
    echo
    echo "Nagios alerts will be sent to $NAME at $EMAIL"
    print "Is this correct? [y/N] \c"
    read YN
    if [[ "$YN" = "y" ]]
    then
      LOOP=0
    fi
  done
  if [ -r $NAGIOSCONTACTSCFG ]
  then
    # replace the default contact name and address with the ones just entered
    perl -pi -e "s/nagiosadmin/$NAME/g" $NAGIOSCONTACTSCFG
    EMAIL=$(echo $EMAIL | sed s/\@/\\\\@/g)
    perl -pi -e "s/nagios\@localhost/$EMAIL/g" $NAGIOSCONTACTSCFG
  else
    echo "$NAGIOSCONTACTSCFG does not exist"
    exit 1
  fi
  echo "setting ${NAME}'s password to be '$PASSWD' in Nagios"
  echo " you can change this later by running: "
  echo " htpasswd $NAGIOSPASSWD $NAME"
  htpasswd -bc $NAGIOSPASSWD $NAME $PASSWD
  if [ "$OS" = "rh" ]
  then
    service httpd restart
  fi
}

function preNagios {
  if [ "$OS" = "rh" ]
  then
    echo "making sure prereqs are installed"
    yum -y install httpd gcc glibc glibc-common gd gd-devel perl-TimeDate
    /usr/sbin/useradd -m nagios
    echo $PASSWD | passwd --stdin nagios
    /usr/sbin/groupadd nagcmd
    /usr/sbin/usermod -a -G nagcmd nagios
    /usr/sbin/usermod -a -G nagcmd apache
  fi
}

function postNagios {
  if [ "$OS" = "rh" ]
  then
    chkconfig --add nagios
    chkconfig nagios on
    # touch this file so that if it doesn't exist we won't get errors
    touch /var/www/html/index.html
    service nagios start
  fi
  echo "You may now be able to access Nagios at the URL below:"
  echo "http://localhost/nagios"
}

if [ -e /etc/redhat-release ]
then
  echo "installing monitoring on Red Hat system"
  OS=rh
fi

# make sure you're root:
ID=$(id -u)
if [ "$ID" != "0" ]
then
  echo "Must run this as root!"
  exit 1
fi

preNagios
buildNagios
buildNagiosPlug
configNagios
postNagios
Run the script ./naginstall.sh
This code works on Red Hat systems and should run if you've installed all the dependencies mentioned in Part 1 of this series. While running naginstall.sh, you are prompted for the user that Nagios should send alerts to. You'll be able to add others later. Most organizations have a mail alias that will send to people in a group.
If you have problems installing, take a look at the Nagios Web page (see Resources for a link) and join the mailing list. In my experience, most packages that are as successful as Nagios and Ganglia are relatively easy to install.
So let's pretend the script just worked for you and you installed everything perfectly. Then when the script exited successfully, you should be able to open your Web browser and see that your own local host is being monitored (like in Figure 1):
Figure 1. Screen showing your local host being monitored
By clicking Service Detail, you can see that we are monitoring several services (like Ping, HTTP, load, users, and so on) on the local machine. This was configured by default.
Let's examine the service called Root Partition. This service alerts you when the root partition gets full. You can get a full understanding of how this check is working by examining the configuration files that were generated upon installation.
If you used the naginstall.sh script, then the master configuration file is /usr/local/nagios/etc/nagios.cfg. This file lists several cfg_file entries that point to additional definitions. Among them is the line:
cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
If you examine this file, you will see all of the services for the localhost that are present on the Web view. This is where the default services are being configured. The Root Partition definition appears on line 77.
The hierarchy of how the root partition check is configured is shown in Figure 2.
Figure 2. How the root partition check is configured
First notice the inheritance scheme of Nagios objects. The definition of the Root Partition uses local-service definitions that in turn use the generic-service definitions. This defines how the service is called, how often, and other tunable parameters, etc.
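The inheritance chain described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Nagios itself is implemented: the directive values in TEMPLATES are made up for the example rather than copied from localhost.cfg, and the sketch assumes each object names a single parent in its use directive.

```python
# Hypothetical, trimmed-down templates keyed by name.
# The directive values are illustrative, not copied from localhost.cfg.
TEMPLATES = {
    "generic-service": {
        "max_check_attempts": "3",
        "check_period": "24x7",
        "notifications_enabled": "1",
    },
    "local-service": {
        "use": "generic-service",
        "max_check_attempts": "4",
    },
}

ROOT_PARTITION = {
    "use": "local-service",
    "host_name": "localhost",
    "service_description": "Root Partition",
    "check_command": "check_local_disk!20%!10%!/",
}


def resolve(definition, templates):
    """Merge a definition with its chain of 'use' templates.

    Directives set closer to the concrete object override inherited ones.
    """
    merged = {}
    parent_name = definition.get("use")
    if parent_name is not None:
        merged.update(resolve(templates[parent_name], templates))
    merged.update({k: v for k, v in definition.items() if k != "use"})
    return merged


effective = resolve(ROOT_PARTITION, TEMPLATES)
for directive in sorted(effective):
    print(directive, "=", effective[directive])
```

The Root Partition service ends up with its own directives plus whatever local-service and generic-service supply, with the nearest definition winning when the same directive appears more than once.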
The next important part of the definition is the check commands it uses.
First it uses a command definition called check_local_disk. The parameters it passes are !20%!10%!/. This means that when free space on the partition drops below 20%, the check issues a warning. When free space drops below 10%, you'll get a critical error. The / means that it is checking the "/" partition.
The check_local_disk definition in turn simply calls the check_disk command, which is located in the /usr/local/nagios/libexec directory.
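To make the threshold semantics concrete, here is a small Python sketch of how the 20% and 10% arguments map to a state. It is not the check_disk source: the function names are hypothetical, and the real plug-in's accounting of free space may differ in detail. The return values follow the usual plug-in exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL).

```python
import shutil

OK, WARNING, CRITICAL = 0, 1, 2


def classify_free_space(free_pct, warn_pct, crit_pct):
    """Return a Nagios-style state for a partition's free space.

    The state degrades as free space falls below each threshold.
    """
    if free_pct < crit_pct:
        return CRITICAL
    if free_pct < warn_pct:
        return WARNING
    return OK


def free_percent(path):
    """Percentage of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


if __name__ == "__main__":
    # Same thresholds as the Root Partition service: warn at 20%, critical at 10%.
    print(classify_free_space(free_percent("/"), 20, 10))
```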
This is the basic idea of how configurations are set up. You can use this to create your own services to monitor and tweak any of the parameters you want. For a more in-depth appreciation of what is going on, read the documentation and try setting some of the parameters yourself.
Now that we're all configured, sign up for alerts. We did this already in the beginning, but if you want to change or add users you can modify the /usr/local/nagios/etc/objects/contacts.cfg file. Just change the contact name to your name and the email to your email address. Most basic Linux servers should already be set up to handle mail.
Now let's configure other nodes.
Configure for other nodes in the grid/cloud/cluster
I have a group of nodes in my Dallas data center. I'll create a directory where I'll put all of my configuration files:
mkdir -p /usr/local/nagios/etc/dallas
I need to tell Nagios that my configuration files are going to go in there. I do this by modifying the nagios.cfg file, adding this line:
cfg_dir=/usr/local/nagios/etc/dallas
I'm going to be creating a couple of files here that can be pretty confusing. Figure 3 illustrates the entities and the files they belong to and shows the relationships between objects.
Figure 3. Diagram of entities and their files
Keep this diagram in mind as you move through the rest of this setup and installation.
In the /usr/local/nagios/etc/dallas/nodes.cfg file, I define all the nodes and node groups. I have three types of machines to monitor:
- Network servers (which in my case are Linux servers and have Ganglia running on them)
- Network switches (my switches, including high-speed and Gigabit Ethernet)
- Management devices (like blade management modules, old IBM RSA cards, BMCs, possibly smart PDUs, etc.)
I create three corresponding groups as follows:
define hostgroup {
    hostgroup_name  dallas-cloud-servers
    alias           Dallas Cloud Servers
}

define hostgroup {
    hostgroup_name  dallas-cloud-network
    alias           Dallas Cloud Network Infrastructure
}

define hostgroup {
    hostgroup_name  dallas-cloud-management
    alias           Dallas Cloud Management Devices
}
Next I create three templates with common characteristics for the nodes of these node groups to share:
define host {
    name        dallas-management
    use         linux-server
    hostgroups  dallas-cloud-management
    # TEMPLATE!
    register    0
}

define host {
    name        dallas-server
    use         linux-server
    hostgroups  dallas-cloud-servers
    # TEMPLATE!
    register    0
}

define host {
    name        dallas-network
    use         generic-switch
    hostgroups  dallas-cloud-network
    # TEMPLATE!
    register    0
}
Now my individual node definitions are either dallas-management, dallas-server, or dallas-network. Here is an example of each:
define host {
    use        dallas-server
    host_name  x336001
    address    172.10.11.1
}

define host {
    use        dallas-network
    host_name  smc001
    address    172.10.0.254
}

define host {
    use        dallas-management
    host_name  x346002-rsa
    address    172.10.11.12
}
I wrote a script to go through my list of nodes and populate that file with all the nodes in my Dallas lab. When I restart Nagios, they'll all be checked to see if they're reachable. But I still have to add some other services!
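The generator script itself isn't shown in this article, so the following Python is only a guess at what such a script could look like. The NODES list reuses the three example hosts above; in practice the list would come from wherever you keep your inventory, and the function names are hypothetical.

```python
# (template, host_name, address) for each node; these are the three
# example hosts from the text, standing in for a real inventory.
NODES = [
    ("dallas-server", "x336001", "172.10.11.1"),
    ("dallas-network", "smc001", "172.10.0.254"),
    ("dallas-management", "x346002-rsa", "172.10.11.12"),
]


def host_definition(template, host_name, address):
    """Render one Nagios host object that inherits from `template`."""
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {host_name}\n"
        f"    address    {address}\n"
        "}\n"
    )


def render_hosts(nodes):
    """Render all host objects, separated by blank lines."""
    return "\n".join(host_definition(*node) for node in nodes)


if __name__ == "__main__":
    # Redirect this output into nodes.cfg after the templates and groups.
    print(render_hosts(NODES))
```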
You may want to restart Nagios first to make sure your settings took. If they did, then you should see some groups under the HostGroup Overview view. If you have errors, then run:
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
That will validate your file and help you find any errors.
You can now add some basic services. Following the templates from localhost, an easy one to do is to check for SSH on the dallas-cloud-servers group. Let's start another file for that: /usr/local/nagios/etc/dallas/host-services.cfg. The easiest thing is to copy configs out of the localhost that you want monitored. I did that and added a dependency:
define service {
    use                  generic-service
    hostgroup_name       dallas-cloud-servers
    service_description  SSH
    check_command        check_ssh
}

define service {
    use                  generic-service
    hostgroup_name       dallas-cloud-servers
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
}

define servicedependency {
    hostgroup_name                 dallas-cloud-servers
    service_description            PING
    dependent_hostgroup_name       dallas-cloud-servers
    dependent_service_description  SSH
}
I didn't want SSH tested if PING didn't work. From this point you could add all sorts of things, but this gets us something to look at first. Restart Nagios and test the menus to make sure you see the ping and ssh checks for your nodes:
service nagios reload
All good? Okay, now let's get to the interesting part and integrate Ganglia.
Integrate Nagios to report on Ganglia metrics
Nagios Exchange is another great place to get plug-ins for Nagios. But for our Ganglia plug-in to Nagios, look no further than the tarball you downloaded in Part 1 of this article. Assuming you uncompressed your tarball in the /tmp directory, it is only a matter of copying the check_ganglia.py script that is in the contrib directory:
cp /tmp/ganglia-3.1.1/contrib/check_ganglia.py /usr/local/nagios/libexec/
check_ganglia.py is a cool Python script that you run on the same server where gmetad is running (and in my case, this is the management server where Nagios is running as well). Let's have it query the localhost on port 8649. In this way, you don't expend network traffic by running remote commands: You get the benefits of Ganglia's scaling techniques to do this!
If you run telnet localhost 8649, you'll see a ton of output from data that has been collected on the nodes (provided you have Ganglia up and running as we did in Part 1). Let's monitor a few things that Ganglia has for us.
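The output you see over telnet is XML, so it is easy to pick apart. The Python sketch below reads that dump and indexes metric values by host. It is not the check_ganglia.py source. SAMPLE_XML is a hand-written, heavily trimmed stand-in for real output, and the HOST and METRIC attribute names reflect my understanding of the format, so compare them with what your own port 8649 returns.

```python
import socket
import xml.etree.ElementTree as ET

# Hand-written, trimmed stand-in for the XML served on port 8649.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.1.1" SOURCE="gmond">
  <CLUSTER NAME="dallas" OWNER="unspecified">
    <HOST NAME="x336001" IP="172.10.11.1">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
      <METRIC NAME="disk_free" VAL="118.3" TYPE="double" UNITS="GB"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def parse_metrics(xml_data):
    """Return {host_name: {metric_name: value_string}} from a Ganglia XML dump."""
    root = ET.fromstring(xml_data)
    metrics = {}
    for host in root.iter("HOST"):
        metrics[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL") for metric in host.iter("METRIC")
        }
    return metrics


def fetch_xml(host="localhost", port=8649, timeout=5.0):
    """Read the whole XML dump; the server closes the socket when done."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


if __name__ == "__main__":
    # Swap SAMPLE_XML for fetch_xml() on a machine where Ganglia is running.
    print(parse_metrics(SAMPLE_XML))
```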
Digging in the /var/lib/ganglia/rrds directory, you can see the metrics being measured on each host. Nice graphs are being generated, and you can analyze the metrics over time. We're going to measure the load_one and disk_free metrics, and since we enabled IPMI temperature measurements in Part 1, let's add that measure in as well.
Create the /usr/local/nagios/etc/dallas/ganglia-services.cfg file and add the services to it:
define servicegroup {
    servicegroup_name  ganglia-metrics
    alias              Ganglia Metrics
}

define command {
    command_name  check_ganglia
    command_line  $USER1$/check_ganglia.py -h $HOSTNAME$ -m $ARG1$ -w $ARG2$ -c $ARG3$
}

define service {
    use                    generic-service
    name                   ganglia-service
    hostgroup_name         dallas-cloud-servers
    service_groups         ganglia-metrics
    notifications_enabled  0
    # TEMPLATE!
    register               0
}

define service {
    use                  ganglia-service
    service_description  load_one
    check_command        check_ganglia!load_one!4!5
}

define service {
    use                  ganglia-service
    service_description  ambient_temp
    check_command        check_ganglia!AmbientTemp!20!30
}

define service {
    use                  ganglia-service
    service_description  disk_free
    check_command        check_ganglia!disk_free!10!5
}
When you restart Nagios, you now can do alerts on Ganglia metrics!
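It helps to see how a check_command such as check_ganglia!load_one!4!5 turns into an actual command line. The Python below imitates the macro substitution for this one case only. The expand function is hypothetical, and the value used for $USER1$ assumes the default source-install location used elsewhere in this article.

```python
COMMAND_LINE = (
    "$USER1$/check_ganglia.py -h $HOSTNAME$ -m $ARG1$ -w $ARG2$ -c $ARG3$"
)


def expand(command_line, check_command, hostname,
           user1="/usr/local/nagios/libexec"):
    """Substitute $USER1$, $HOSTNAME$ and $ARGn$ into a command_line.

    `check_command` has the form name!arg1!arg2!...; the name only selects
    which command definition to use, so it is dropped here.
    """
    _name, *args = check_command.split("!")
    macros = {"$USER1$": user1, "$HOSTNAME$": hostname}
    for position, value in enumerate(args, start=1):
        macros[f"$ARG{position}$"] = value
    for macro, value in macros.items():
        command_line = command_line.replace(macro, value)
    return command_line


if __name__ == "__main__":
    print(expand(COMMAND_LINE, "check_ganglia!load_one!4!5", "x336001"))
```

Running the expanded command by hand from a shell is a quick way to debug a service that shows up with an unexpected state.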
One caveat: The check_ganglia.py command only alerts when a value rises above its thresholds. If you want it to alert when a value drops too low (as in the case of disk_free), then you'll need to hack the code. I changed the end of the file to look like so:
# critical above warning: alert when the value climbs too high
if critical > warning:
    if value >= critical:
        print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
        sys.exit(2)
    elif value >= warning:
        print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
        sys.exit(1)
    else:
        print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
        sys.exit(0)
# critical at or below warning: alert when the value drops too low
else:
    if critical >= value:
        print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
        sys.exit(2)
    elif warning >= value:
        print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
        sys.exit(1)
    else:
        print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
        sys.exit(0)
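If you want to try the threshold logic without touching the plug-in, the same comparison can be written as a small Python 3 function that returns the exit code and message instead of exiting. This is a restatement of the patch above for experimentation, with a hypothetical function name, not a replacement for check_ganglia.py.

```python
def evaluate(metric, value, warning, critical):
    """Return (exit_code, message) using the direction-aware thresholds.

    If critical > warning the metric is bad when it is high (load_one);
    otherwise it is bad when it is low (disk_free).
    """
    if critical > warning:
        is_critical = value >= critical
        is_warning = value >= warning
    else:
        is_critical = value <= critical
        is_warning = value <= warning

    if is_critical:
        state, code = "CRITICAL", 2
    elif is_warning:
        state, code = "WARNING", 1
    else:
        state, code = "OK", 0
    return code, "CHECKGANGLIA %s: %s is %.2f" % (state, metric, value)


if __name__ == "__main__":
    # Thresholds taken from the service definitions in ganglia-services.cfg.
    print(evaluate("load_one", 4.5, 4, 5))
    print(evaluate("disk_free", 7, 10, 5))
```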
Now restart Nagios:
service nagios restart
If all goes well, you should see Ganglia data being monitored by Nagios!
Figure 4. Ganglia data monitored by Nagios
With Ganglia and Nagios working together, you can go crazy and monitor just about anything now. You rule the cloud!
Extending Nagios: Monitor network switches
As clouds and virtualization become a part of life, the old boundaries between the "network guys" and the "systems guys" become more blurred. A sysadmin who continues to ignore configuring network switches and understanding network topologies runs the risk of becoming obsolete.
So that you never have to face that fate, I'll show you how to extend Nagios to monitor a network switch. The advantage of using Nagios to monitor a network switch (instead of just relying on the switch vendor's solution) is simple: you can monitor any vendor's switch with Nagios. You've seen ping work; now let's explore SNMP on the switches.
Some switches come with SNMP enabled by default. You can set it up following vendor instructions. To set up SNMP on a Cisco Switch you can follow the example I give below for my switch whose hostname is c2960g:
telnet c2960g
c2960g>enable
c2960g#configure terminal
c2960g(config)#snmp-server host 192.168.15.1 traps SNMPv1
c2960g(config)#snmp-server community public
c2960g(config)#exit
c2960g#copy running-config startup-config
Now to see what you can monitor, run snmpwalk against the switch (redirect the output to a file if you want to keep it):
snmpwalk -v 1 -c public c2960g
If all goes well you should see a ton of stuff passed back. You can then capture this output and look at different places to monitor.
I have another switch that I will use as an example here. When I run the snmpwalk command I see the ports and how they are labeled. I'm interested in getting the following information:
- The MTU (IF-MIB::ifMtu.<port number>).
- The speed the ports are running at (IF-MIB::ifSpeed.<port number>).
- Whether or not the ports are up (IF-MIB::ifOperStatus.<port number>).
To monitor this I'll create a new file, /usr/local/nagios/etc/dallas/switch-services.cfg. I have a map of my network hosts to switches so I know where everything is. You should too if you don't already. If you really want to be a cloud, all resources should have known states.
I'll use node x336001 as an example here. I know it's on port 5. Here is what my file looks like:
define servicegroup {
    servicegroup_name  switch-snmp
    alias              Switch SNMP Services
}

define service {
    use             generic-service
    name            switch-service
    host_name       smc001
    service_groups  switch-snmp
    # TEMPLATE!
    register        0
}

define service {
    use                  switch-service
    service_description  Port5-MTU-x336001
    check_command        check_snmp!-o IF-MIB::ifMtu.5
}

define service {
    use                  switch-service
    service_description  Port5-Speed-x336001
    check_command        check_snmp!-o IF-MIB::ifSpeed.5
}

define service {
    use                  switch-service
    service_description  Port5-Status-x336001
    check_command        check_snmp!-o IF-MIB::ifOperStatus.5
}
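The three port services follow one pattern, so once you have a map of switch ports to hosts you can generate them instead of typing them. The Python below is one possible way to do that. PORT_MAP holds only the single port mentioned in the text, the function name is hypothetical, and the output assumes the switch-service template defined in this file.

```python
# switch port -> attached node; only port 5 is known from the text.
PORT_MAP = {5: "x336001"}

# (label used in service_description, IF-MIB column to query)
CHECKS = [("MTU", "ifMtu"), ("Speed", "ifSpeed"), ("Status", "ifOperStatus")]


def port_services(port, node):
    """Render the MTU, speed and status services for one switch port."""
    blocks = []
    for label, column in CHECKS:
        blocks.append(
            "define service {\n"
            "    use                  switch-service\n"
            f"    service_description  Port{port}-{label}-{node}\n"
            f"    check_command        check_snmp!-o IF-MIB::{column}.{port}\n"
            "}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    for port in sorted(PORT_MAP):
        print(port_services(port, PORT_MAP[port]))
```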
When you're finished, restart Nagios, and the switch entries show up in the Web interface:
Figure 5. Monitoring switches
This is just one example of how to monitor switches. Notice that I did not
set up alerting nor indicate what would constitute a critical action. You
may also note that there are other options in the libexec directory that
can do similar things. The check_ifoperstatus
and others may do the trick as well. With Nagios there are many ways to
accomplish a single task.
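As an illustration of what deciding on a critical condition could look like for the port status check, the Python below turns a line of snmpget-style output for ifOperStatus into a plug-in exit code. It is a sketch, not one of the plug-ins shipped in libexec. The sample line format is an assumption based on typical net-snmp output, and treating every state other than up(1) as critical is a policy choice you may want to change.

```python
import re

OK, CRITICAL, UNKNOWN = 0, 2, 3

# Matches "... = INTEGER: up(1)" as well as a bare "... = INTEGER: 1".
STATUS_RE = re.compile(r"=\s*INTEGER:\s*(?:(\w+)\((\d+)\)|(\d+))\s*$")


def port_state(snmp_line):
    """Map one ifOperStatus output line to (exit_code, message)."""
    match = STATUS_RE.search(snmp_line.strip())
    if match is None:
        return UNKNOWN, "UNKNOWN: could not parse %r" % snmp_line
    label = match.group(1) or "status"
    number = int(match.group(2) or match.group(3))
    if number == 1:  # IF-MIB defines 1 as "up"
        return OK, "OK: port is up"
    return CRITICAL, "CRITICAL: port reports %s(%d)" % (label, number)


if __name__ == "__main__":
    print(port_state("IF-MIB::ifOperStatus.5 = INTEGER: up(1)"))
    print(port_state("IF-MIB::ifOperStatus.5 = INTEGER: down(2)"))
```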
Extending Nagios: Job reporting to monitor TORQUE
There are lots of scripts you can write against TORQUE to determine how this queueing system is running. In this extension, assume you already have TORQUE up and running. TORQUE is a resource manager that works with schedulers like Moab and Maui. Let's look at an open source Nagios plug-in that was written by Colin Morey.
Download this and put it into the /usr/local/nagios/libexec directory and make sure it's executable. I had to modify the code a little bit to match the directory where Nagios was installed, changing use lib "/usr/nagios/libexec"; to use lib "/usr/local/nagios/libexec";. I also had to change my $qstat = '/usr/bin/qstat' ; to point to wherever the qstat command is. Mine looks like this: my $qstat = '/opt/torque/x86_64/bin/qstat' ;
Verify that it works (the queue I use is called dque):
[root@redhouse libexec]# ./check_pbs.pl -Q dque -tw 20 -tm 50
check_pbs.pl Critical: dque on localhost checked, Total number of jobs higher than 50. Total jobs:518, Jobs Queued:518, Jobs Waiting:0, Jobs Halted:0 |exectime=9340us
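Notice the part after the | character in that output. Plug-ins use it to pass performance data alongside the human-readable message. The Python below splits the two, using the output above as its sample. It is a simplified sketch with a hypothetical function name: it handles plain label=value items only and ignores the optional threshold fields and quoted labels that performance data can also carry.

```python
SAMPLE = (
    "check_pbs.pl Critical: dque on localhost checked, "
    "Total number of jobs higher than 50. Total jobs:518, Jobs Queued:518, "
    "Jobs Waiting:0, Jobs Halted:0 |exectime=9340us"
)


def split_plugin_output(line):
    """Split one line of plug-in output into (message, perfdata dict)."""
    text, _, perf = line.partition("|")
    perfdata = {}
    for item in perf.split():
        label, separator, value = item.partition("=")
        if separator:
            perfdata[label] = value
    return text.strip(), perfdata


if __name__ == "__main__":
    message, perfdata = split_plugin_output(SAMPLE)
    print(message)
    print(perfdata)
```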
You can use the -h option to show more things to monitor. Now let's put it into our configuration file /usr/local/nagios/etc/dallas/torque.cfg:
define service {
    use                  generic-service
    host_name            localhost
    service_description  TORQUE Queues
    check_command        check_pbs!20!50
}

define command {
    command_name  check_pbs
    command_line  $USER1$/check_pbs.pl -Q dque -tw $ARG1$ -tm $ARG2$
}
After restarting Nagios, the service shows up under localhost:
Figure 6. TORQUE service appears after Nagios restart
In mine, I get a critical alert because I have 518 jobs queued!
There are obviously more ways to track TORQUE, and more scripts that one could write or that have already been written. You could go as far as writing scripts that use pbsnodes to report the node status. Users would be more concerned with where their jobs are running and how long they have been running. This little example just gives you an idea of what is possible and shows how good you can make your monitoring solution with a little time.
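As a starting point for the pbsnodes idea, the Python below parses a listing into a map of node names to states. SAMPLE is a hand-written stand-in that assumes the usual layout of a node name on its own line followed by indented attribute lines; the node names are hypothetical, so check the layout against the output of your own TORQUE installation before relying on it.

```python
# Hand-written stand-in for a pbsnodes listing; node names are hypothetical.
SAMPLE = """\
node01
     state = free
     np = 8
     ntype = cluster

node02
     state = down,offline
     np = 8
     ntype = cluster
"""


def node_states(listing):
    """Return {node_name: [state, ...]} from a pbsnodes-style listing."""
    states = {}
    current = None
    for line in listing.splitlines():
        if not line.strip():
            current = None  # a blank line ends the current node's block
        elif not line[0].isspace():
            current = line.strip()  # unindented line: a new node name
        elif current and line.strip().startswith("state ="):
            states[current] = line.split("=", 1)[1].strip().split(",")
    return states


if __name__ == "__main__":
    for name, state in sorted(node_states(SAMPLE).items()):
        print(name, state)
```

From here, a plug-in could return a critical status whenever any node reports a state such as down.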
After reading this two-part series, a systems administrator should feel empowered to run Ganglia and Nagios to really monitor his data center as never before. The scope of these two packages is enormous. What we have touched on here though is relevant to a cluster, grid, or cloud infrastructure.
Most of the time setting up this monitoring solution was spent configuring the services you will want to monitor. Many existing alternative solutions are all plumbing and no appliances: in other words, they provide frameworks to allow for plug-ins but seldom come with premade plug-ins. Most of the plug-in work has to be done by an administrator or user and this work is often trivialized when in fact it makes up the bulk of excellent data center monitoring.
Ganglia and Nagios together are more than just the plumbing.
Resources
Learn
- Find more on Nagios in the Nagios 3.x documentation.
- Go to Nagios Exchange for plug-ins.
- TORQUE is an open source resource manager providing control over batch jobs and distributed compute nodes.
Get products and technologies
- Get Nagios and plug-ins. For installation help, see the Nagios Web page and join the mailing list.
- Read Colin Morey's open source Nagios TORQUE plug-in.