a1101 2012-04-26

Ganglia and Nagios, Part 2: Monitor enterprise clusters with Nagios

Install Nagios to effectively monitor a data center; make Ganglia and Nagios work together

Vallard Benincosa, Certified Technical Sales Specialist, IBM

Summary: This is the second article in a two-part series that looks at a hands-on approach to monitoring a data center using the open source tools Ganglia and Nagios. In Part 2, learn how to install and configure Nagios, the popular open source computer system and network monitoring application software that watches hosts and services, alerting users when things go wrong. The article also shows you how to unite Nagios with Ganglia (from Part 1) and add two other features to Nagios for standard clusters, grids, and clouds to help with monitoring network switches and the resource manager.

Tags for this article: monitoring, sysadmin

Date: 25 Mar 2009
Level: Intermediate

Recap of Part 1

Data centers are growing and administrative staffs are shrinking, necessitating efficient monitoring tools for compute resources. Part 1 of this series discussed the benefits of using Ganglia and Nagios together, then showed you how to install and extend Ganglia with homemade monitoring scripts.

Recall from Part 1 that monitoring means different things depending on who is asking:

If you're running applications on the cluster, you think: "When will my job run? When will it be done? And how is it performing compared to last time?"
If you're the operator in the network operations center, you think: "When will we see a red light that means something needs to be fixed and a service call placed?"
If you're in the systems engineering group, you think: "How are our machines performing? Are all the services functioning correctly? What trends do we see, and how can we better utilize our compute resources?"

You can find code to monitor exactly what you want to monitor and that code can be of the open source variety. The most difficult part of using open source monitoring tools comes when you attempt to implement an install and puzzle out a configuration that works well for your environment. Two major problems with open source (and commercial) monitoring tools are the following:

No tool will monitor everything you want the way you want it.
Much customization could be required to get the tool working in your data center exactly how you want it.

Ganglia is a tool that monitors data centers and is used heavily in high-performance computing environments (but it's attractive for other environments too like clouds, render farms, and hosting centers). It is more concerned with gathering metrics and tracking them over time compared with Nagios's focus as an alerting mechanism. Ganglia used to require an agent to run on every host to gather information from it, but now metrics can be obtained from just about anything through Ganglia's spoofing mechanism. Ganglia doesn't have a built-in notification system, but it was designed to support scalable built-in agents on target hosts.

After reading Part 1, you could install Ganglia, as well as answer the monitoring questions that different user groups tend to ask. You could also configure the basic Ganglia setup, use the Python modules to extend functionality with IPMI (the Intelligent Platform Management Interface), and use Ganglia host spoofing to monitor IPMI.

Now, let's look at Nagios.

Introducing Nagios

This part shows you how to install Nagios and tie Ganglia back into it. We're going to add two features to Nagios that'll help your monitoring efforts in standard clusters, grids, and clouds (or whatever your favorite buzzword is for scale-out computing). The two features are all about:

Monitoring network switches
Monitoring the resource manager

In this case, we'll be monitoring TORQUE. When we are finished, you'll have a framework to control the monitoring system of your entire data center.

Nagios, like Ganglia, is used heavily in HPC and other environments, but Nagios is more of an alerting mechanism than Ganglia (which is more focused on gathering and tracking metrics). Nagios previously only polled information from its target hosts, but has recently developed plug-ins that allow it to run agents on those hosts. Nagios has a built-in notification system.

Now let's install Nagios and set up a baseline monitoring system of an HPC Linux cluster to address the three different monitoring perspectives:

The application person can see how full the queues are and see available nodes for running jobs.
The NOC can be alerted of system failures or see a shiny red error light on the Nagios Web interface. They also get notified via email if nodes go down or temperatures get too high.
The system engineer can graph data, report on cluster utilization, and make decisions on future hardware acquisitions.

Installing Nagios

The effort to get Nagios rolling on your machine is well documented on the Internet. Since I tend to install it a lot in different environments, I wrote a script to do it.

First you need to download two packages:

Nagios (tested with version 3.0.6)
Nagios-plugins (tested with version 1.4.13)

The add-ons include:

The Nagios Event Log, which allows for monitoring Windows event logs
The NRPE, which provides a lot of Ganglia functionality

Get the tarballs and place them in a directory. For example, I have the following three files in /tmp:

nagios-3.0.6.tar.gz
nagios-plugins-1.4.13.tar.gz
naginstall.sh

Listing 1 shows the naginstall.sh install script:

Listing 1. The naginstall.sh script

#!/bin/ksh

NAGIOSSRC=nagios-3.0.6
NAGIOSPLUGINSRC=nagios-plugins-1.4.13
NAGIOSCONTACTSCFG=/usr/local/nagios/etc/objects/contacts.cfg
NAGIOSPASSWD=/usr/local/nagios/etc/htpasswd.users
PASSWD=cluster
OS=foo

function buildNagiosPlug {

  if [ -e $NAGIOSPLUGINSRC.tar.gz ]
  then
    echo "found $NAGIOSPLUGINSRC.tar.gz  building and installing Nagios Plugins"
  else
    echo "could not find $NAGIOSPLUGINSRC.tar.gz in current directory."
    echo "Please run $0 in the same directory as the source files."
    exit 1
  fi
  echo "Extracting Nagios Plugins..."
  tar zxf $NAGIOSPLUGINSRC.tar.gz
  cd $NAGIOSPLUGINSRC
  echo "Configuring Nagios Plugins..."
  if ./configure --with-nagios-user=nagios --with-nagios-group=nagios \
      --prefix=/usr/local/nagios > config.LOG.$$ 2>&1
  then
    echo "Making Nagios Plugins..."
    if make -j8 > make.LOG.$$ 2>&1
    then
      # append so the build log from the previous step is not overwritten
      make install >> make.LOG.$$ 2>&1
    else
      echo "Make failed of Nagios plugins.  See $NAGIOSPLUGINSRC/make.LOG.$$"
      exit 1
    fi
  else
    echo "configure of Nagios plugins failed.  See config.LOG.$$"
    exit 1
  fi
  echo "Successfully built and installed Nagios Plugins!"
  cd ..

}

function buildNagios {
  if [ -e $NAGIOSSRC.tar.gz ]
  then
    echo "found $NAGIOSSRC.tar.gz  building and installing Nagios"
  else
    echo "could not find $NAGIOSSRC.tar.gz in current directory."
    echo "Please run $0 in the same directory as the source files."
    exit 1
  fi
  echo "Extracting Nagios..."
  tar zxf $NAGIOSSRC.tar.gz
  cd $NAGIOSSRC
  echo "Configuring Nagios..."
  if ./configure --with-command-group=nagcmd > config.LOG.$$ 2>&1
  then
    echo "Making Nagios..."
    if make all -j8 > make.LOG.$$ 2>&1
    then
      # append so that each step keeps the output of the earlier ones
      make install >> make.LOG.$$ 2>&1
      make install-init >> make.LOG.$$ 2>&1
      make install-config >> make.LOG.$$ 2>&1
      make install-commandmode >> make.LOG.$$ 2>&1
      make install-webconf >> make.LOG.$$ 2>&1
    else
      echo "make all failed.  See log:"
      echo "$NAGIOSSRC/make.LOG.$$"
      exit 1
    fi
  else
    echo "configure of Nagios failed.  Please read $NAGIOSSRC/config.LOG.$$ for details."
    exit 1
  fi
  echo "Done Making Nagios!"
  cd ..
}


function configNagios {
  echo "We'll now configure Nagios."
  LOOP=1
  while [[ $LOOP -eq 1 ]]
  do
    echo "You'll need to put in a user name.  This should be the person"
    echo "who will be receiving alerts.  This person should have an account"
    echo "on this server.  "
    print "Type in the userid of the person who will receive alerts (e.g. bob)> \c"
    read NAME
    print "What is ${NAME}'s email?> \c"
    read EMAIL
    echo
    echo
    echo "Nagios alerts will be sent to $NAME at $EMAIL"
    print "Is this correct? [y/N] \c"
    read YN
    if [[ "$YN" = "y" ]]
    then
      LOOP=0
    fi
  done
  if [ -r $NAGIOSCONTACTSCFG ]
  then
    perl -pi -e "s/nagiosadmin/$NAME/g" $NAGIOSCONTACTSCFG
    EMAIL=$(echo $EMAIL | sed s/\@/\\\\@/g)
    perl -pi -e "s/nagios\@localhost/$EMAIL/g" $NAGIOSCONTACTSCFG
  else
    echo "$NAGIOSCONTACTSCFG does not exist"
    exit 1
  fi

  echo "setting ${NAME}'s password to be '$PASSWD' in Nagios"
  echo "    you can change this later by running: "
  echo "    htpasswd $NAGIOSPASSWD $NAME"
  htpasswd -bc $NAGIOSPASSWD $NAME $PASSWD
  if [ "$OS" = "rh" ]
  then
    service httpd restart
  fi

}


function preNagios {

  if [ "$OS" = "rh" ]
  then
    echo "making sure prereqs are installed"
    yum -y install httpd gcc glibc glibc-common gd gd-devel perl-TimeDate
    /usr/sbin/useradd -m nagios
    echo $PASSWD | passwd --stdin nagios
    /usr/sbin/groupadd nagcmd
    /usr/sbin/usermod -a -G nagcmd nagios
    /usr/sbin/usermod -a -G nagcmd apache
  fi

}
function postNagios {
  if [ "$OS" = "rh" ]
  then
    chkconfig --add nagios
    chkconfig nagios on
    # touch this file so that if it doesn't exist we won't get errors
    touch /var/www/html/index.html
    service nagios start
  fi
  echo "You may now be able to access Nagios at the URL below:"
  echo "http://localhost/nagios"

}



if [ -e /etc/redhat-release ]
then
  echo "installing monitoring on Red Hat system"
  OS=rh
fi

# make sure you're root:
ID=$(id -u)
if [ "$ID" != "0" ]
then
  echo "Must run this as root!"
  exit 1
fi

preNagios
buildNagios
buildNagiosPlug
configNagios
postNagios

Run the script ./naginstall.sh

This code works on Red Hat systems and should run if you've installed all the dependencies mentioned in Part 1 of this series. While running naginstall.sh, you are prompted for the user that Nagios should send alerts to. You'll be able to add others later. Most organizations have a mail alias that will send to people in a group.

If you have problems installing, take a look at the Nagios Web page (see Resources for a link) and join the mailing list. In my experience, most packages that are as successful as Nagios and Ganglia are relatively easy to install.

Configuring Nagios

So let's pretend the script just worked for you and you installed everything perfectly. Then when the script exited successfully, you should be able to open your Web browser and see that your own local host is being monitored (like in Figure 1):

Figure 1. Screen showing your local host being monitored

By clicking Service Detail, you can see that we are monitoring several services (like Ping, HTTP, load, users, etc.) on the local machine. This was configured by default.

Let's examine the service called Root Partition. This service alerts you when the root partition gets full. You can get a full understanding of how this check is working by examining the configuration files that were generated upon installation.

The master configuration file

If you used the naginstall.sh script, then the master configuration file is /usr/local/nagios/etc/nagios.cfg. This file lists several cfg_file entries that point to additional definitions. Among them is the line:

cfg_file=/usr/local/nagios/etc/objects/localhost.cfg

If you examine this file, you will see all of the services for the localhost that are present on the Web view. This is where the default services are being configured. The Root Partition definition appears on line 77.

The hierarchy of how the root partition check is configured is shown in Figure 2.

Figure 2. How the root partition check is configured

First notice the inheritance scheme of Nagios objects. The definition of the Root Partition uses local-service definitions that in turn use the generic-service definitions. This defines how the service is called, how often, and other tunable parameters, etc.
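
One way to picture this inheritance is as a chain of dictionary merges. The following is a minimal Python sketch of the idea, not how Nagios itself is implemented; the directive values in TEMPLATES are simplified, illustrative assumptions rather than the defaults shipped with Nagios.

```python
# Illustrative model of Nagios object inheritance through the "use" directive.
# The directive values are simplified assumptions, not the shipped defaults.
TEMPLATES = {
    "generic-service": {"check_period": "24x7", "max_check_attempts": "3"},
    "local-service": {"use": "generic-service", "max_check_attempts": "4"},
}


def resolve(definition, templates):
    """Merge a definition with its chain of templates; the child always wins."""
    merged = {}
    parent_name = definition.get("use")
    if parent_name is not None:
        merged.update(resolve(templates[parent_name], templates))
    merged.update({k: v for k, v in definition.items() if k != "use"})
    return merged


root_partition = {
    "use": "local-service",
    "service_description": "Root Partition",
    "check_command": "check_local_disk!20%!10%!/",
}

resolved = resolve(root_partition, TEMPLATES)
for key in sorted(resolved):
    print(key, "=", resolved[key])
```

In this sketch the value set in local-service overrides the one from generic-service, while anything local-service does not mention is taken from generic-service unchanged.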

The next important part of the definition is the check command it uses. The service calls a command definition named check_local_disk and passes it the arguments !20%!10%!/. These mean that Nagios issues a warning when free space on the partition drops below 20% and a critical error when it drops below 10%. The / means that it is checking the "/" partition. The check_local_disk definition in turn simply calls the check_disk plug-in, which is located in the /usr/local/nagios/libexec directory.
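
To make the threshold arguments concrete, here is a small Python sketch of how a free-space percentage maps onto the Nagios states. It is a simplified model under the assumption that only the two percentage thresholds matter; the real check_disk plug-in supports many more options. The function name disk_state is made up for this illustration.

```python
# Simplified model of a free-space check with -w 20% and -c 10%.
# Nagios plug-ins report state through these exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def disk_state(free_percent, warn_percent=20.0, crit_percent=10.0):
    """Return the Nagios state for a given percentage of free space."""
    if free_percent < crit_percent:
        return CRITICAL
    if free_percent < warn_percent:
        return WARNING
    return OK


for free in (45.0, 15.0, 5.0):
    print(free, disk_state(free))
```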

This is the basic idea of how configurations are set up. You can use this to create your own services to monitor and tweak any of the parameters you want. For a more in-depth appreciation of what is going on, read the documentation and try setting some of the parameters yourself.

Now that we're all configured, sign up for alerts. We did this already in the beginning, but if you want to change or add users you can modify the /usr/local/nagios/etc/objects/contacts.cfg file. Just change the contact name to your name and the email to your email address. Most basic Linux servers should already be set up to handle mail.

Now let's configure other nodes.

Configure for other nodes in the grid/cloud/cluster

I have a group of nodes in my Dallas data center. I'll create a directory where I'll put all of my configuration files:

mkdir -p /usr/local/nagios/etc/dallas

I need to tell Nagios that my configuration files are going to go in there. I do this by modifying the nagios.cfg file, adding this line:

cfg_dir=/usr/local/nagios/etc/dallas
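
If you script your setup, it helps to add that line only once. The following Python sketch shows one way to do this; the helper name ensure_cfg_dir is hypothetical, and the demonstration runs against a scratch file rather than the live nagios.cfg.

```python
import os
import tempfile


def ensure_cfg_dir(cfg_path, directory):
    """Append a cfg_dir line to the file unless it is already present."""
    wanted = "cfg_dir=" + directory
    with open(cfg_path) as handle:
        existing = [line.strip() for line in handle]
    if wanted in existing:
        return False
    with open(cfg_path, "a") as handle:
        handle.write(wanted + "\n")
    return True


# Demonstrate on a scratch copy instead of the real configuration file.
with tempfile.NamedTemporaryFile("w", suffix=".cfg", delete=False) as scratch:
    scratch.write("cfg_file=/usr/local/nagios/etc/objects/localhost.cfg\n")

print(ensure_cfg_dir(scratch.name, "/usr/local/nagios/etc/dallas"))
print(ensure_cfg_dir(scratch.name, "/usr/local/nagios/etc/dallas"))
os.remove(scratch.name)
```

The first call adds the line and returns True; the second finds it already there and returns False.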

I'm going to be creating a couple of files here that can be pretty confusing. Figure 3 illustrates the entities and the files they belong to and shows the relationships between objects.

Figure 3. Diagram of entities and their files

Keep this diagram in mind as you move through the rest of this setup and installation.

In the /usr/local/nagios/etc/dallas/nodes.cfg file, I define all the nodes and node groups. I have three types of machines to monitor:

Network servers (which in my case are Linux servers and have Ganglia running on them)
Network switches (my switches, including high-speed and Gigabit Ethernet)
Management devices (like blade management modules, old IBM RSA cards, BMCs, possibly smart PDUs, etc.)

I create three corresponding groups as follows:

define hostgroup {
 hostgroup_name dallas-cloud-servers
 alias Dallas Cloud Servers
}

define hostgroup {
 hostgroup_name dallas-cloud-network
 alias Dallas Cloud Network Infrastructure
}

define hostgroup {
 hostgroup_name dallas-cloud-management
 alias Dallas Cloud Management Devices
}

Next I create three templates with common characteristics for the nodes of these node groups to share:

define host {
        name dallas-management
        use linux-server
        hostgroups dallas-cloud-management
        # TEMPLATE!
        register 0
}


define host {
        name dallas-server
        use linux-server
        hostgroups dallas-cloud-servers
        # TEMPLATE!
        register 0
}

define host {
        name dallas-network
        use generic-switch
        hostgroups dallas-cloud-network
        # TEMPLATE!
        register 0
}

Now my individual node definitions are either dallas-management, dallas-server, or dallas-network. Here is an example of each:

define host {
 use dallas-server
 host_name x336001
 address 172.10.11.1
}
define host {
 use dallas-network
 host_name smc001
 address 172.10.0.254
}
define host {
 use dallas-management
 host_name x346002-rsa
 address 172.10.11.12
}

I generated a script to go through my list of nodes and completely populate that file with the nodes in my Dallas lab. When I restart Nagios, they'll all be checked to see if they're reachable. But I still have to add some other services!
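
The script itself is not shown here. As a rough idea of what such a generator might look like (this is a sketch, not the script I used), the following Python renders host definitions from a node list. The NODES list simply reuses the three example hosts above; in practice you would load it from your own inventory.

```python
# Example inventory: (host_name, address, template). Replace with your own.
NODES = [
    ("x336001", "172.10.11.1", "dallas-server"),
    ("smc001", "172.10.0.254", "dallas-network"),
    ("x346002-rsa", "172.10.11.12", "dallas-management"),
]


def host_definition(host_name, address, template):
    """Render one Nagios host object that inherits from the given template."""
    return (
        "define host {\n"
        " use %s\n"
        " host_name %s\n"
        " address %s\n"
        "}\n" % (template, host_name, address)
    )


def render_nodes(nodes):
    """Concatenate the host definitions for every node in the list."""
    return "".join(host_definition(*node) for node in nodes)


print(render_nodes(NODES))
```

Redirecting the output into /usr/local/nagios/etc/dallas/nodes.cfg (after the hostgroup and template definitions) gives Nagios one host object per node.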

You may want to restart Nagios first to make sure your settings took. If they did, then you should see some groups under the HostGroup Overview view. If you have errors, then run:

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

That will validate your file and help you find any errors.

You can now add some basic services. Following the templates from localhost, an easy one to do is to check for SSH on the dallas-cloud-servers group. Let's start another file for that: /usr/local/nagios/etc/dallas/host-services.cfg. The easiest thing is to copy configs out of the localhost that you want monitored. I did that and added a dependency:

define service{
        use                             generic-service
        hostgroup_name                  dallas-cloud-servers
        service_description             SSH
        check_command                   check_ssh
        }

define service{
        use                             generic-service
        hostgroup_name                  dallas-cloud-servers
        service_description             PING
        check_command                   check_ping!100.0,20%!500.0,60%
        }

define servicedependency{
        hostgroup_name                  dallas-cloud-servers
        service_description             PING
        dependent_hostgroup_name        dallas-cloud-servers
        dependent_service_description   SSH
}

I didn't want SSH tested if PING didn't work. From this point you could add all sorts of things, but this gets us something to look at first. Restart Nagios and test the menus to make sure you see the ping and ssh checks for your nodes:

service nagios reload

All good? Okay, now let's get to the interesting part and integrate Ganglia.

Integrate Nagios to report on Ganglia metrics

Nagios Exchange is another great place to get plug-ins for Nagios. But for our Ganglia plug-in to Nagios, look no further than the tarball you downloaded in Part 1 of this series. Assuming you uncompressed your tarball in the /tmp directory, it is only a matter of copying the check_ganglia.py script that is in the contrib directory:

cp /tmp/ganglia-3.1.1/contrib/check_ganglia.py /usr/local/nagios/libexec/

check_ganglia is a cool Python script that you run on the same server where gmetad is running (and in my case, this is the management server where Nagios is running as well). Let's have it query the localhost on port 8649. In this way, you don't expend network traffic by running remote commands: You get the benefits of Ganglia's scaling techniques to do this!

If you run telnet localhost 8649, you'll see a ton of XML output describing the data that has been collected from the nodes (provided you have Ganglia up and running as we did in Part 1). Let's monitor a few things that Ganglia has for us.
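
To give a feel for what a plug-in does with that output, here is a minimal Python sketch that reads the XML dump from port 8649 and pulls out one metric for one host. It is not the check_ganglia.py code; the function names are made up, and SAMPLE is a trimmed, invented document in the general shape of the real output (HOST elements containing METRIC elements with NAME and VAL attributes).

```python
import socket
import xml.etree.ElementTree as ET


def read_ganglia_xml(host="localhost", port=8649, timeout=5.0):
    """Read the full XML dump that the daemon sends when a client connects."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(8192)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def metric_value(xml_bytes, host_name, metric_name):
    """Return the VAL attribute of one metric on one host, or None if absent."""
    root = ET.fromstring(xml_bytes)
    for host in root.iter("HOST"):
        if host.get("NAME") != host_name:
            continue
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                return metric.get("VAL")
    return None


# Trimmed, made-up sample so the parser can be tried without a running daemon.
SAMPLE = b"""<GANGLIA_XML VERSION="3.1.1" SOURCE="gmond">
<CLUSTER NAME="dallas">
<HOST NAME="x336001" IP="172.10.11.1">
<METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
</HOST>
</CLUSTER>
</GANGLIA_XML>"""

print(metric_value(SAMPLE, "x336001", "load_one"))
```

On a live system you would pass the result of read_ganglia_xml() to metric_value() instead of SAMPLE.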

Digging in the /var/lib/ganglia/rrds directory, you can see the metrics being measured on each host. Nice graphs are being generated, and you can analyze the metrics over time. We're going to measure load_one and disk_free, and since we enabled IPMI temperature measurements in Part 1, let's add that measure in as well.

Create the /usr/local/nagios/etc/dallas/ganglia-services.cfg file and add the services to it:

define servicegroup {
  servicegroup_name ganglia-metrics
  alias Ganglia Metrics
}

define command {
  command_name check_ganglia
  command_line $USER1$/check_ganglia.py -h $HOSTNAME$ -m $ARG1$ -w $ARG2$ -c $ARG3$
}

define service {
  use generic-service
  name ganglia-service
  hostgroup_name dallas-cloud-servers
  service_groups ganglia-metrics
  notifications_enabled 0
  # TEMPLATE!
  register 0
}


define service {
  use ganglia-service
  service_description load_one
  check_command check_ganglia!load_one!4!5
}


define service {
  use ganglia-service
  service_description ambient_temp
  check_command check_ganglia!AmbientTemp!20!30
}

define service {
  use ganglia-service
  service_description disk_free
  check_command check_ganglia!disk_free!10!5
}

When you restart Nagios, you now can do alerts on Ganglia metrics!

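One caveat: The check_ganglia.py command only alerts when a value rises above its thresholds. If you want it to alert when a value falls too low (as in the case of disk_free), then you'll need to hack the code. I changed the end of the file so that the order of the two thresholds decides the direction: when the critical threshold is higher than the warning threshold, high values alert; otherwise, low values alert. It now looks like so: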

  if critical > warning:
    if value >= critical:
      print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
      sys.exit(2)
    elif value >= warning:
      print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
      sys.exit(1)
    else:
      print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
      sys.exit(0)
  else:
    if critical >= value:
      print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
      sys.exit(2)
    elif warning >= value:
      print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
      sys.exit(1)
    else:
      print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
      sys.exit(0)
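
The same decision logic can be written as a small function that returns the state instead of exiting, which makes it easy to try out with different thresholds. This is a Python 3 restatement for illustration only; the name ganglia_state is made up and is not part of check_ganglia.py.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def ganglia_state(value, warning, critical):
    """Pick the alert direction from the order of the two thresholds."""
    if critical > warning:
        # Higher is worse, as with load_one (warning 4, critical 5).
        if value >= critical:
            return CRITICAL
        if value >= warning:
            return WARNING
        return OK
    # Lower is worse, as with disk_free (warning 10, critical 5).
    if value <= critical:
        return CRITICAL
    if value <= warning:
        return WARNING
    return OK


print(ganglia_state(4.5, 4, 5))
print(ganglia_state(7, 10, 5))
print(ganglia_state(3, 10, 5))
```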

Now restart Nagios:

service nagios restart

If all goes well, you should see Ganglia data being monitored by Nagios!

Figure 4. Ganglia data monitored by Nagios

With Ganglia and Nagios working together, you can go crazy and monitor just about anything now. You rule the cloud!

Extending Nagios: Monitor network switches

As clouds and virtualization become a part of life, the old boundaries between the "network guys" and the "systems guys" become more blurred. A sysadmin who continues to ignore configuring network switches and understanding network topologies runs the risk of becoming obsolete.

So you never have to face incompleteness, I'll show you how to extend Nagios to monitor a network switch. The advantage of using Nagios to monitor a network switch (instead of just relying on the switch vendor's solution) is simple - you can monitor any vendor's switch with Nagios. You've seen ping work, now let's explore SNMP on the switches.

Some switches come with SNMP enabled by default. You can set it up following vendor instructions. To set up SNMP on a Cisco Switch you can follow the example I give below for my switch whose hostname is c2960g:

telnet c2960g
c2960g>enable
c2960g#configure terminal
c2960g(config)#snmp-server host 192.168.15.1 traps SNMPv1
c2960g(config)#snmp-server community public
c2960g(config)#exit
c2960g#copy running-config startup-config

Now to see what you can monitor, run snmpwalk like this:

snmpwalk -v 1 -c public c2960g

If all goes well you should see a ton of stuff passed back. You can then capture this output and look at different places to monitor.

I have another switch that I will use as an example here. When I run the snmpwalk command I see the ports and how they are labeled. I'm interested in getting the following information:

The MTU (IF-MIB::ifMtu.<port number>).
The speed the ports are running at (IF-MIB::ifSpeed.<port number>).
Whether or not the ports are up (IF-MIB::ifOperStatus.<port number>).
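
If you saved the snmpwalk output, a few lines of Python can group these three objects by port. The sketch below is only an illustration: SAMPLE_WALK contains made-up values in the shape that snmpwalk typically prints for IF-MIB objects, and port_table is a hypothetical helper name.

```python
import re

# Made-up lines in the shape snmpwalk prints for IF-MIB objects.
SAMPLE_WALK = """\
IF-MIB::ifMtu.5 = INTEGER: 1500
IF-MIB::ifSpeed.5 = Gauge32: 1000000000
IF-MIB::ifOperStatus.5 = INTEGER: up(1)
IF-MIB::ifOperStatus.6 = INTEGER: down(2)
"""

# Object name, port number, and value of one "IF-MIB::name.port = TYPE: value" line.
LINE = re.compile(r"^IF-MIB::(\w+)\.(\d+) = \w+: (.+)$")


def port_table(walk_text):
    """Group IF-MIB values by port number: {port: {object_name: value}}."""
    ports = {}
    for line in walk_text.splitlines():
        match = LINE.match(line.strip())
        if match:
            name, port, value = match.groups()
            ports.setdefault(int(port), {})[name] = value
    return ports


for port, values in sorted(port_table(SAMPLE_WALK).items()):
    print(port, values)
```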

To monitor this I'll create a new file, /usr/local/nagios/etc/dallas/switch-services.cfg. I have a map of my network hosts to switches so I know where everything is. You should too if you don't already. If you really want to be a cloud, all resources should have known states.

I'll use node x336001 as an example here. I know it's on port 5. Here is what my file looks like:

define servicegroup {
  servicegroup_name switch-snmp
  alias Switch SNMP Services
}

define service {
  use generic-service
  name switch-service
  host_name smc001
  service_groups switch-snmp
  # TEMPLATE!
  register 0
}

define service {
  use switch-service
  service_description Port5-MTU-x336001
  check_command check_snmp!-o IF-MIB::ifMtu.5
}
define service {
  use switch-service
  service_description Port5-Speed-x336001
  check_command check_snmp!-o IF-MIB::ifSpeed.5
}

define service {
  use switch-service
  service_description Port5-Status-x336001
  check_command check_snmp!-o IF-MIB::ifOperStatus.5
}
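
Writing these three services by hand for every port gets tedious. Assuming you keep a map of switch ports to nodes, a short Python sketch like the following could generate them. PORT_MAP is hypothetical (x336002 on port 6 is invented for the example), and the output follows the same naming pattern as the definitions above.

```python
# Hypothetical map of switch port number to the node cabled to it.
PORT_MAP = {5: "x336001", 6: "x336002"}

# Label used in service_description, paired with the IF-MIB object to query.
CHECKS = [("MTU", "ifMtu"), ("Speed", "ifSpeed"), ("Status", "ifOperStatus")]


def switch_services(port_map):
    """Render one service definition per port and per IF-MIB object."""
    blocks = []
    for port, node in sorted(port_map.items()):
        for label, mib_object in CHECKS:
            blocks.append(
                "define service {\n"
                "  use switch-service\n"
                "  service_description Port%d-%s-%s\n"
                "  check_command check_snmp!-o IF-MIB::%s.%d\n"
                "}\n" % (port, label, node, mib_object, port)
            )
    return "\n".join(blocks)


print(switch_services(PORT_MAP))
```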

When finished, restart Nagios and you can view the switch entries:

Figure 5. Monitoring switches

This is just one example of how to monitor switches. Notice that I did not set up alerting nor indicate what would constitute a critical action. You may also note that there are other options in the libexec directory that can do similar things. The check_ifoperstatus and others may do the trick as well. With Nagios there are many ways to accomplish a single task.

Extending Nagios: Job reporting to monitor TORQUE

There are lots of scripts you can write against TORQUE to determine how this queueing system is running. In this extension, assume you already have TORQUE up and running. TORQUE is a resource manager that works with schedulers like Moab and Maui. Let's look at an open source Nagios plug-in that was written by Colin Morey.

Download this and put it into the /usr/local/nagios/libexec directory and make sure it's executable. I had to modify the code a little to match where Nagios was installed, changing use lib "/usr/nagios/libexec"; to use lib "/usr/local/nagios/libexec";. I also had to change my $qstat = '/usr/bin/qstat' ; to wherever the qstat command is. Mine looks like this: my $qstat = '/opt/torque/x86_64/bin/qstat' ;.

Verify that it works (the queue I use is called dque):

[root@redhouse libexec]# ./check_pbs.pl -Q dque -tw 20 -tm 50
check_pbs.pl Critical: dque on localhost checked, Total number of jobs 
higher than 50.  Total jobs:518, Jobs Queued:518, Jobs Waiting:0, Jobs 
Halted:0 |exectime=9340us
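
Notice the shape of that output: human-readable status text, then a | character, then performance data in label=value form. The following Python sketch splits the two parts. It is a simplified illustration (it does not handle quoted labels that contain spaces), and parse_plugin_output is a made-up helper name.

```python
def parse_plugin_output(line):
    """Split a plug-in's output line into status text and perfdata pairs."""
    text, _, perf = line.partition("|")
    perfdata = {}
    for item in perf.split():
        label, sep, value = item.partition("=")
        if sep:
            perfdata[label] = value
    return text.strip(), perfdata


OUTPUT = (
    "check_pbs.pl Critical: dque on localhost checked, Total number of jobs "
    "higher than 50.  Total jobs:518, Jobs Queued:518, Jobs Waiting:0, Jobs "
    "Halted:0 |exectime=9340us"
)

text, perfdata = parse_plugin_output(OUTPUT)
print(text)
print(perfdata)
```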

You can use the -h option to show more things to monitor. Now let's put it into our configuration file /usr/local/nagios/etc/dallas/torque.cfg:

define service {
        use                             generic-service
        host_name                       localhost
        service_description             TORQUE Queues
        check_command                   check_pbs!20!50
}

define command {
        command_name                    check_pbs
        command_line                    $USER1$/check_pbs.pl -Q dque -tw $ARG1$ -tm $ARG2$
}

After restarting Nagios, the service shows up under localhost:

Figure 6. TORQUE service appears after Nagios restart

In mine, I get a critical alert because I have 518 jobs queued!

There are obviously more ways to track TORQUE, both scripts that one could write and scripts that have already been written. You could go as far as writing scripts that use pbsnodes to report node status. Users would be more concerned with where their jobs are running and how long each job has been running. This little example just gives you an idea of what is possible and shows how good you can make your monitoring solution with a little time.

Conclusion

After reading this two-part series, a systems administrator should feel empowered to run Ganglia and Nagios to really monitor his data center as never before. The scope of these two packages is enormous. What we have touched on here though is relevant to a cluster, grid, or cloud infrastructure.

Most of the time setting up this monitoring solution was spent configuring the services you will want to monitor. Many existing alternative solutions are all plumbing and no appliances - in other words, they provide frameworks to allow for plug-ins but seldom come with premade plug-ins. Most of the plug-in work has to be done by an administrator or user and this work is often trivialized when in fact it makes up the bulk of excellent data center monitoring.

Ganglia and Nagios together are more than just the plumbing.

Resources

Learn

Find more on Nagios in the Nagios 3.x documentation.
Go to Nagios Exchange for plug-ins.
TORQUE is an open source resource manager providing control over batch jobs and distributed compute nodes.
In the developerWorks Linux zone, find more resources for Linux developers, and scan our most popular articles and tutorials.
See all Linux tips and Linux tutorials on developerWorks.
Stay current with developerWorks technical events and Webcasts.

Get products and technologies

Get Nagios and plug-ins. For installation help, see the Nagios Web page and join the mailing list.
Read Colin Morey's open source Nagios TORQUE plug-in.
Some other monitoring tools:
- Cacti.
- Zenoss.
- Zabbix.
- Performance Co-Pilot.
- Clumon.
Order the SEK for Linux, a two-DVD set containing the latest IBM trial software for Linux from DB2, Lotus, Rational, Tivoli, and WebSphere.
With IBM trial software, available for download directly from developerWorks, build your next development project on Linux.

Discuss

Get involved in the developerWorks community through blogs, forums, podcasts, and spaces.

About the author

Vallard Benincosa has been building HPC clusters since 2001. He works on many of the largest compute farms that IBM deploys and has helped design, install, and manage some of IBM's largest Linux clusters, including the ones at the Ohio Super Computing Center and NASA.