Performance graphs stopped updating

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
FLCUISIT
Posts: 93
Joined: Mon Feb 01, 2010 12:09 pm

Re: Performance graphs stopped updating

Post by FLCUISIT

Changed nagios.cfg as requested, and it seems to have cut the load roughly in half. Load averages are now between 20 and 25.

We are running 86 hosts and 1588 services. The vast majority were configured by the wizard, and I believe they all use active checks.

Top 3 processes are mysqld (user mysqld), nagios (user nagios), and check_esx3.pl (user nagios).
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Performance graphs stopped updating

Post by mguthrie

Hmm, something is not right with that plugin if it's consistently in the top 3, because it shouldn't run for more than a couple of seconds. If it's hung or stuck in an endless loop, it could eat a lot of system resources. You don't have so many checks running that it should be using that much processor power. Try:

Code:

killall nagios
service nagios start
Then kill that check plugin as well. Nagios will start it again the next time the check needs to run. Keep an eye on it, though. If the graphs still aren't showing up, try restarting the server.
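Before killing a plugin, it can help to confirm that it really has been running too long. A minimal sketch, assuming a procps `ps` that supports the `etime` output field: the helper below lists processes whose command line contains a given string (for example check_esx3.pl) and whose elapsed time exceeds a threshold. The function names and the 60-second default are illustrative, not part of Nagios.

```shell
#!/bin/sh
# Hypothetical helpers: spot long-running plugin processes.
# Assumes procps `ps` with the etime field; the 60 s default is arbitrary.

# Convert a ps etime value ([[dd-]hh:]mm:ss) to whole seconds.
etime_to_secs() {
  echo "$1" | awk -F'[-:]' '{
    if (NF == 4)      s = $1 * 86400 + $2 * 3600 + $3 * 60 + $4
    else if (NF == 3) s = $1 * 3600 + $2 * 60 + $3
    else              s = $1 * 60 + $2
    print s
  }'
}

# Print "PID SECONDS COMMAND" for each process whose command line
# contains $1 and which has been running longer than $2 seconds.
find_long_running() {
  pattern=$1
  max_secs=${2:-60}
  ps -eo pid=,etime=,args= | while read -r pid etime args; do
    case "$args" in
      *"$pattern"*) ;;
      *) continue ;;
    esac
    secs=$(etime_to_secs "$etime")
    if [ "$secs" -gt "$max_secs" ]; then
      echo "$pid $secs $args"
    fi
  done
}
```

For example, `find_long_running check_esx3.pl 60` prints the PID, elapsed seconds and command line of each match; any PID listed can then be stopped with `kill PID`.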
FLCUISIT

Re: Performance graphs stopped updating

Post by FLCUISIT

The graphs are updating, but they aren't keeping up. The data in the graphs tends to be 1-3 hours behind the current time. Running the killall command helps, but the graphs never fully catch up. This morning the monitoring engine had stopped and needed to be restarted; the restart spiked the load averages to 53, 34, and 25 for the 1-, 5-, and 15-minute intervals respectively.

Any other thoughts?

mysqld is now consistently near the top of the process list, usually in the top 2.
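To put a number on how far behind the graphs are, one option is to list the RRD files that have not been written recently. This sketch assumes the RRDs live under /usr/local/nagios/share/perfdata (verify the path on your install) and uses an arbitrary 60-minute cutoff; both are parameters, and the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers: find RRD files not modified in the last N minutes.
# The default directory is an assumption; adjust it to where your RRDs live.
stale_rrds() {
  dir=${1:-/usr/local/nagios/share/perfdata}
  minutes=${2:-60}
  find "$dir" -type f -name '*.rrd' -mmin +"$minutes"
}

# Report stale files versus total so the backlog can be tracked over time.
stale_summary() {
  dir=${1:-/usr/local/nagios/share/perfdata}
  minutes=${2:-60}
  total=$(find "$dir" -type f -name '*.rrd' | wc -l | tr -d ' ')
  stale=$(stale_rrds "$dir" "$minutes" | wc -l | tr -d ' ')
  echo "$stale of $total RRD files not updated in ${minutes} min"
}
```

Running `stale_summary` every few minutes shows whether the backlog is shrinking. For a single graph, `rrdtool last file.rrd` prints the Unix timestamp of its most recent update, which can be compared against `date +%s`.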
mguthrie

Re: Performance graphs stopped updating

Post by mguthrie

Can you copy and paste the output from running top? I'd like to see the full output to get a better sense of where things stand.

Also, can you copy and paste the output from running: /usr/local/nagios/bin/nagiostats
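When posting or tracking nagiostats output, the min / max / avg triplets are usually the interesting part. A small sketch that extracts them from one labeled line of the plain-text output; it assumes the `Label:  min / max / avg unit` layout shown in this thread, and the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read nagiostats output on stdin and print
# "min max avg" for the line starting with the given label, e.g.
# "Active Service Latency:  0.007 / 1202.996 / 28.594 sec".
stat_triplet() {
  awk -v label="$1" '
    index($0, label ":") == 1 {
      sub(/^[^:]*:[ \t]*/, "")   # drop the label and padding
      gsub(/ \/ /, " ")          # "a / b / c" becomes "a b c"
      print $1, $2, $3
    }'
}
```

For example, `/usr/local/nagios/bin/nagiostats | stat_triplet "Active Service Latency"` prints the current min, max and average latency in seconds, which can be logged periodically to see whether latency is climbing.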
FLCUISIT

Re: Performance graphs stopped updating

Post by FLCUISIT

From top:

Code:

top - 12:01:55 up 1 day, 22:32,  1 user,  load average: 42.93, 39.12, 39.31
Tasks: 263 total,  16 running, 239 sleeping,   0 stopped,   8 zombie
Cpu(s): 10.7%us, 88.8%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.5%si,  0.0%st
Mem:   3107560k total,  2295504k used,   812056k free,   283800k buffers
Swap:  1048568k total,        0k used,  1048568k free,  1297388k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
13982 nagios    24   0     0    0    0 R 11.4  0.0   0:00.40 check_ifopersta
14010 nagios    25   0 27100 9072 4880 R 11.4  0.3   0:00.40 php
14056 nagios    25   0 11380 7364 1836 R 10.8  0.2   0:00.38 check_ifopersta
26999 nagios    25   0 15776 3820  880 R 10.5  0.1   4:59.05 nagios
 3123 mysql     15   0  154m  34m 4924 S  8.0  1.1 291:15.60 mysqld
14235 nagios    25   0  6872 2792 1608 R  3.7  0.1   0:00.13 check_ifopersta
13918 nagios    18   0  4572 1100  948 S  2.0  0.0   0:00.07 check_rrdtraf
13964 nagios    18   0  4572 1104  948 S  2.0  0.0   0:00.07 check_rrdtraf
10427 nagios    18   0 29536  14m 5888 S  1.4  0.5   0:01.17 php
13827 nagios    18   0  4572 1104  948 S  1.4  0.0   0:00.07 check_rrdtraf
13925 nagios    18   0  4572 1104  948 S  1.4  0.0   0:00.06 check_rrdtraf
  632 root      11  -5     0    0    0 S  1.1  0.0  18:38.41 kjournald
 6781 nagios    15   0 15928 4080 1112 S  1.1  0.1  81:31.55 nagios
14277 nagios    25   0     4    4    0 R  1.1  0.0   0:00.04 sh
14292 nagios    25   0  4012  576  492 R  1.1  0.0   0:00.04 check_icmp
14339 nagios    25   0  4544  468  392 R  1.1  0.0   0:00.04 check_rrdtraf
25384 root      15   0     0    0    0 S  1.1  0.0   0:45.02 pdflush

From nagiostats:

Code:

Nagios Stats 3.2.3
Copyright (c) 2003-2008 Ethan Galstad (http://www.nagios.org)
Last Modified: 10-03-2010
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /usr/local/nagios/var/status.dat
Status File Age:                        0d 0h 0m 0s
Status File Version:                    3.2.3

Program Running Time:                   0d 2h 2m 31s
Nagios PID:                             26999
Used/High/Total Command Buffers:        0 / 0 / 4096

Total Services:                         1591
Services Checked:                       1591
Services Scheduled:                     1591
Services Actively Checked:              1591
Services Passively Checked:             0
Total Service State Change:             0.000 / 19.470 / 0.035 %
Active Service Latency:                 0.007 / 1202.996 / 28.594 sec
Active Service Execution Time:          0.073 / 27.438 / 4.825 sec
Active Service State Change:            0.000 / 19.470 / 0.035 %
Active Services Last 1/5/15/60 min:     88 / 351 / 834 / 1522
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              1554 / 3 / 0 / 34
Services Flapping:                      2
Services In Downtime:                   0

Total Hosts:                            87
Hosts Checked:                          87
Hosts Scheduled:                        87
Hosts Actively Checked:                 87
Host Passively Checked:                 0
Total Host State Change:                0.000 / 0.000 / 0.000 %
Active Host Latency:                    0.000 / 162.067 / 13.217 sec
Active Host Execution Time:             0.106 / 8.173 / 1.096 sec
Active Host State Change:               0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min:        0 / 8 / 48 / 85
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  87 / 0 / 0
Hosts Flapping:                         0
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     3 / 36 / 132
   Scheduled:                           0 / 27 / 106
   On-demand:                           3 / 9 / 26
   Parallel:                            0 / 27 / 106
   Serial:                              0 / 0 / 0
   Cached:                              3 / 9 / 26
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  193 / 759 / 2280
   Scheduled:                           193 / 759 / 2280
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0
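These figures allow a rough estimate of how many checks are running at once. By Little's law, the average number of checks in flight is about the check rate times the average execution time: 2280 service checks in 15 minutes (900 s) at an average of 4.825 s each gives about 12 concurrent checks. The sketch below does that arithmetic; it ignores host checks and assumes the averages are roughly steady, so treat the result as an approximation.

```shell
#!/bin/sh
# Little's law estimate: concurrent checks ~= (checks / window) * mean exec time.
# Arguments: checks in the window, window length (s), mean execution time (s).
concurrent_checks() {
  awk -v n="$1" -v window="$2" -v exec_t="$3" \
    'BEGIN { printf "%.1f\n", (n / window) * exec_t }'
}

# 15-minute value from "Active Service Checks Last 1/5/15 min" and the
# average from "Active Service Execution Time".
concurrent_checks 2280 900 4.825
```

The 5-minute figures (759 checks in 300 s) give essentially the same result, about 12 concurrent checks.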
mguthrie

Re: Performance graphs stopped updating

Post by mguthrie

The problem with the performance graphs is almost certainly coming from the CPU load issues. The real question is what is causing that much load on your system.
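One way to quantify the CPU load is to divide the load averages by the number of CPUs. On Linux the load average counts tasks that are runnable or in uninterruptible sleep, so with the 4 vCPUs described later in the thread, the top header posted above (42.93, 39.12, 39.31) works out to roughly 10 tasks per CPU. The helper name is hypothetical; the CPU count is passed in.

```shell
#!/bin/sh
# Hypothetical helper: divide each load average in a top header line
# by the CPU count given as $1, printing one decimal per value.
load_per_cpu() {
  awk -v cpus="$1" '
    /load average:/ {
      line = $0
      sub(/.*load average:[ \t]*/, "", line)   # keep "a, b, c"
      n = split(line, v, /,[ \t]*/)
      out = ""
      for (i = 1; i <= n; i++)
        out = out (i > 1 ? " " : "") sprintf("%.1f", v[i] / cpus)
      print out
    }'
}
```

For example: `top -b -n 1 | head -n 1 | load_per_cpu "$(getconf _NPROCESSORS_ONLN)"`. Reading /proc/loadavg directly would work as well.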

Can you give us your system specs:
Hardware setup? (CPU Cores, RAM, Hard Disk setup)

Linux Distro? 32 or 64bit?

VM or Physical Box?
FLCUISIT

Re: Performance graphs stopped updating

Post by FLCUISIT

Hardware setup: the VM originally downloaded from Nagios. I believe it is a 32-bit CentOS install.

It runs on an ESX server with 4 CPUs and 4 GB of RAM assigned to the VM. The hard disk is 16 GB.
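The VM is described here as having 4 GB of RAM, while the top output earlier in the thread reports 3107560k total, which is about 2.96 GiB (3107560 / 1048576). A quick way to check what the guest actually sees is to read MemTotal from /proc/meminfo; the helper below is a sketch and its name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print MemTotal in GiB (1 GiB = 1048576 kB).
# Defaults to /proc/meminfo; another meminfo-style file can be passed in.
mem_total_gib() {
  awk '/^MemTotal:/ { printf "%.2f\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}
```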
tonyyarusso
Posts: 1128
Joined: Wed Mar 03, 2010 12:38 pm
Location: St. Paul, MN, USA

Re: Performance graphs stopped updating

Post by tonyyarusso

He already gave us some of those, Mike. :)

Hardware setup? (CPU Cores, RAM, Hard Disk setup)
4 virtual CPU cores, roughly 2GHz each
3GB RAM
1GB swap

Linux Distro? 32 or 64bit?
CentOS 5, 32-bit

VM or Physical Box?
VMware on ESX.

I'd still like to know the type / number / configuration of disks, since that could be a bottleneck.
FLCUISIT wrote:
Top 3 processes are mysqld (user mysqld), nagios (user nagios), and check_esx3.pl (user nagios).
I totally missed this before: how many check_esx3.pl services do you have? I remember that in testing it used WAY more resources than any normal plugin, since it has to load the entire VMware Perl API separately for each and every check. I don't see it in the sample top output you included, though.
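One way to answer the check_esx3.pl question is to count servicestatus entries in status.dat (/usr/local/nagios/var/status.dat, per the nagiostats output above) whose check_command contains a given string. This is a sketch assuming the Nagios 3 status.dat layout (blocks opened by `servicestatus {` containing `check_command=` lines). Note that check_command holds the Nagios command name and arguments, not necessarily the plugin file name, so the string the ESX wizard uses is an assumption to confirm; a loose pattern such as `esx` may be needed.

```shell
#!/bin/sh
# Hypothetical helper: count servicestatus blocks in status.dat whose
# check_command line contains $1 (matched literally, not as a regex).
count_services_using() {
  pattern=$1
  statusfile=${2:-/usr/local/nagios/var/status.dat}
  awk -v pat="$pattern" '
    /^servicestatus[ \t]*[{]/ { in_svc = 1; next }
    in_svc && /^[ \t]*[}]/    { in_svc = 0; next }
    in_svc && /^[ \t]*check_command=/ && index($0, pat) > 0 { n++ }
    END { print n + 0 }
  ' "$statusfile"
}
```

For example, `count_services_using esx` prints how many services have "esx" anywhere in their check_command.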
Tony Yarusso
Technical Services
___
TIES
Web: http://ties.k12.mn.us/
FLCUISIT

Re: Performance graphs stopped updating

Post by FLCUISIT

Do you want the configuration of the physical ESX server (RAID 5 on 7.2k RPM disks) or additional information from the VM?

I reinstalled the check_esx plugin based on the first replies, and it hasn't appeared among the top processes whenever I've glanced at them since.
FLCUISIT

Re: Performance graphs stopped updating

Post by FLCUISIT

Any other suggestions? We are still having this issue.