Seriously High Load Average

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
chrisp
Posts: 71
Joined: Fri Dec 28, 2012 11:35 am

Post by chrisp »

Hi,

We're seeing seriously high load average on our Nagios server, using Nagios XI.

We are in the middle of migrating service checks from our old Nagios server to a new one. We are still a long way from having ported them all, yet the load on the new system is already very high (right now it's load average: 34.89, 21.68, 20.98). That's bad, but this is worse:

[1358840273] SERVICE ALERT: nagios;Current Load;CRITICAL;HARD;4;CRITICAL - load average: 78.42, 33.55, 23.60

Do you have any clues or advice for us, please? I've captured "top" output on a few occasions when the system looked busy:

Code:

[root@Nagios admin]# top -M 

top - 10:03:04 up 17 days,  9:23,  3 users,  load average: 33.77, 19.58, 18.93
Tasks: 299 total,   2 running, 287 sleeping,   1 stopped,   9 zombie
Cpu0  : 64.8%us,  8.1%sy,  0.0%ni, 26.4%id,  0.7%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 62.3%us, 10.8%sy,  0.0%ni, 26.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 91.6%us,  7.8%sy,  0.0%ni,  0.0%id,  0.3%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu3  : 61.6%us, 11.7%sy,  0.0%ni, 26.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:    15.571G total,   14.520G used, 1077.016M free,  129.570M buffers
Swap:   31.248G total,   24.719M used,   31.224G free, 7359.934M cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                          
18370 apache    20   0  359m  39m 7772 R 30.2  0.2   0:04.94 httpd                                                                             
15410 mysql     20   0 2201m  60m 6140 S  4.4  0.4   7:52.21 mysqld                                                                            
 3276 nagios    20   0  221m  23m 9280 S  2.5  0.1   0:00.08 php                                                                               
 3288 nagios    20   0  214m  21m 7748 S  2.2  0.1   0:00.07 php                                                                               
  720 root      20   0     0    0    0 S  1.9  0.0   0:50.92 kswapd0                                                                           
 3275 nagios    20   0  214m  20m 7096 S  1.9  0.1   0:00.06 php                                                                               
 3282 nagios    20   0  215m  22m 7620 S  1.9  0.1   0:00.06 php                                                                               
 3284 nagios    20   0  215m  22m 7628 S  1.9  0.1   0:00.06 php                                                                               
 3285 nagios    20   0  214m  21m 7588 S  1.9  0.1   0:00.06 php                                                                               
 3289 nagios    20   0  214m  21m 7612 S  1.9  0.1   0:00.06 php                                                                               
  913 apache    20   0  349m  32m 7392 S  1.6  0.2   0:00.24 httpd                                                                             
 1025 apache    20   0  342m  28m 4268 S  1.6  0.2   0:00.16 httpd                                                                             
 1050 apache    20   0  335m  21m 4480 S  1.6  0.1   0:00.24 httpd                                                                             
 1206 nagios    20   0 50664 1968  940 S  1.2  0.0   0:36.97 ndo2db                                                                            
 2562 root      20   0  192m 1844  852 S  1.2  0.0  14:57.38 snmpd                                                                             
    3 root      20   0     0    0    0 S  0.9  0.0   1:07.44 ksoftirqd/0                                                                       
 1209 nagios    20   0 27340 4288  988 S  0.9  0.0   2:22.62 nagios                                                                            
 2510 named     20   0  369m  30m 2536 S  0.6  0.2  27:16.69 named                                                                             
18541 apache    20   0  350m  33m 7668 S  0.6  0.2   0:02.22 httpd                                                                             
   13 root      20   0     0    0    0 S  0.3  0.0   1:11.91 ksoftirqd/2                                                                       
   16 root      20   0     0    0    0 S  0.3  0.0   1:07.64 ksoftirqd/3            

Code:

[root@Nagios admin]# top -M 

top - 10:05:01 up 17 days,  9:24,  3 users,  load average: 12.39, 17.60, 18.40
Tasks: 214 total,   2 running, 211 sleeping,   1 stopped,   0 zombie
Cpu0  :  5.6%us,  2.3%sy,  0.0%ni, 91.4%id,  0.7%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 42.7%us,  0.7%sy,  0.0%ni, 56.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  6.4%us,  0.7%sy,  0.0%ni, 93.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  7.3%us,  1.3%sy,  0.0%ni, 88.0%id,  3.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:    15.571G total,   14.038G used, 1570.555M free,  129.672M buffers
Swap:   31.248G total,   24.715M used,   31.224G free, 7360.020M cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                          
  925 apache    20   0  388m  71m 7600 R 41.9  0.4   0:02.26 httpd                                                                             
15410 mysql     20   0 2201m  60m 6140 S  5.0  0.4   7:53.51 mysqld                                                                            
 1054 apache    20   0  350m  32m 7568 S  2.7  0.2   0:01.05 httpd                                                                             
18366 apache    20   0  350m  33m 7812 S  2.7  0.2   0:04.41 httpd                                                                             
  955 apache    20   0  350m  33m 7592 S  2.3  0.2   0:01.01 httpd                                                                             
 1019 apache    20   0  349m  32m 7712 S  2.3  0.2   0:01.07 httpd                                                                             
 1052 apache    20   0  349m  32m 7684 S  2.3  0.2   0:01.07 httpd                                                                             
 1053 apache    20   0  349m  32m 7596 S  2.3  0.2   0:01.15 httpd                                                                             
 2066 apache    20   0  350m  32m 7560 S  2.3  0.2   0:02.32 httpd                                                                             
 1209 nagios    20   0 27340 4284  988 S  1.0  0.0   2:23.14 nagios                                                                            
 1206 nagios    20   0 50664 1968  940 S  0.7  0.0   0:37.13 ndo2db                                                                            
  926 postgres  20   0  210m 5664 4016 S  0.3  0.0   0:00.03 postmaster                                                                        
 1949 postgres  20   0  210m 5656 4012 S  0.3  0.0   0:00.03 postmaster                                                                        
 2002 postgres  20   0  210m 5636 3992 S  0.3  0.0   0:00.03 postmaster                                                                        
13614 root      20   0 15980 2232  996 S  0.3  0.0   1:39.44 top                                                                               
18367 postgres  20   0  210m 5684 4040 S  0.3  0.0   0:00.11 postmaster                                                                        
    1 root      20   0 19272 1088  896 S  0.0  0.0   0:01.46 init                                                                              
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd                                                                          
    3 root      20   0     0    0    0 S  0.0  0.0   1:07.45 ksoftirqd/0                                                                       
    5 root      20   0     0    0    0 S  0.0  0.0   0:01.32 kworker/u:0                                                                       

Code:

[root@Nagios admin]# top -M 

top - 10:08:11 up 17 days,  9:28,  3 users,  load average: 20.82, 15.36, 17.14
Tasks: 230 total,   1 running, 228 sleeping,   1 stopped,   0 zombie
Cpu0  :  3.0%us,  0.3%sy,  0.0%ni, 94.6%id,  2.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 11.0%us,  0.7%sy,  0.0%ni, 85.0%id,  3.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu2  :  6.3%us,  0.7%sy,  0.0%ni, 93.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 12.9%us,  0.7%sy,  0.0%ni, 86.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:    15.571G total,   13.970G used, 1639.883M free,  129.621M buffers
Swap:   31.248G total,   25.055M used,   31.224G free, 7332.289M cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                          
 1052 apache    20   0  349m  32m 7780 S  2.7  0.2   0:02.46 httpd                                                                             
 2066 apache    20   0  350m  32m 7600 S  2.7  0.2   0:03.88 httpd                                                                             
14570 apache    20   0  349m  31m 7572 S  2.7  0.2   0:00.81 httpd                                                                             
14603 apache    20   0  350m  33m 7692 S  2.7  0.2   0:01.04 httpd                                                                             
14615 apache    20   0  350m  32m 7668 S  2.7  0.2   0:00.73 httpd                                                                             
32421 apache    20   0  349m  32m 7620 S  2.7  0.2   0:02.35 httpd                                                                             
32428 apache    20   0  350m  33m 7640 S  2.7  0.2   0:03.94 httpd                                                                             
 1019 apache    20   0  350m  33m 7744 S  2.3  0.2   0:02.26 httpd                                                                             
 1053 apache    20   0  349m  32m 7768 S  2.3  0.2   0:02.43 httpd                                                                             
 3681 apache    20   0  352m  35m 7864 S  2.3  0.2   0:10.23 httpd                                                                             
14596 apache    20   0  350m  32m 7556 S  2.3  0.2   0:00.93 httpd                                                                             
14597 apache    20   0  350m  32m 7556 S  2.3  0.2   0:00.70 httpd                                                                             
14602 apache    20   0  350m  32m 7588 S  2.0  0.2   0:02.24 httpd                                                                             
15410 mysql     20   0 2201m  60m 6140 S  1.0  0.4   7:55.23 mysqld                                                                            
  554 postgres  20   0  210m 5664 4024 S  0.3  0.0   0:00.09 postmaster                                                                        
 2003 postgres  20   0  210m 5676 4028 S  0.3  0.0   0:00.09 postmaster                                                                        
 2752 postgres  20   0  208m 4592 4356 S  0.3  0.0   1:20.68 postmaster                                                                        
14887 postgres  20   0  210m 5656 4008 S  0.3  0.0   0:00.02 postmaster                                                                        
    1 root      20   0 19272 1088  896 S  0.0  0.0   0:01.46 init                                                                              
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd                                                                          
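To put the load figures above in proportion, a load average is often read relative to the number of CPU cores. The following is a minimal sketch and not part of the original post: `load_per_core` is a hypothetical helper name, and the core count of 4 is taken from the Cpu0 to Cpu3 lines in the top output.

```shell
# Hypothetical helper: divide a load average by the number of CPU cores.
# A value well above 1.00 per core suggests more runnable (or blocked)
# work than the cores can service.
load_per_core() {
    # $1 = load average, $2 = number of cores
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# 1-minute load from the first top capture, on the 4 cores it lists
load_per_core 33.77 4
```

On Linux the load average also counts processes in uninterruptible sleep (typically waiting on disk I/O), so a high figure does not necessarily mean the CPUs themselves are saturated.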
Old System Specs
  • Onsite Dell PowerEdge (model unknown)
    Dual Core i386 CPU
    2GB RAM
    Hardware RAID1
    FreeBSD 4.8 RELEASE (i386)
    Nagios Core 1.2
New System Specs
  • Remote Dedicated Server
    Modern Quad Core x86_64 CPU
    16GB RAM
    Software RAID1
    CentOS 6.3 x86_64
    Nagios Core 3.4.1
    Nagios XI 2012R1.4
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Seriously High Load Average

Post by scottwilkerson »

How many checks are in the new system and at what interval are the checks spaced out?
Former Nagios employee

Re: Seriously High Load Average

Post by chrisp »

Good question!

Our old system has 1300 service checks and is predominately based around a 5 minute check interval. That works out at about 260 checks per minute...

The new system has 1200 service checks and is predominately based around a 1 minute check interval, which works out at about 1200 checks per minute. That maths is fairly simple: it would logically seem to be about 4.6 times more load...
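As a rough sketch of that arithmetic (assuming every check runs at the predominant interval, which is only approximately true here; `checks_per_minute` is an illustrative name, not a Nagios tool):

```shell
# Hypothetical helper: average checks per minute for a number of service
# checks that all share the same check interval (in minutes).
checks_per_minute() {
    awk -v checks="$1" -v interval="$2" 'BEGIN { printf "%.0f\n", checks / interval }'
}

old_rate=$(checks_per_minute 1300 5)   # old system: 1300 checks every 5 minutes
new_rate=$(checks_per_minute 1200 1)   # new system: 1200 checks every minute

echo "old: $old_rate checks/min, new: $new_rate checks/min"

# Ratio of the two rates, to one decimal place
awk -v new="$new_rate" -v old="$old_rate" 'BEGIN { printf "%.1f\n", new / old }'
```

The same helper can be used to estimate the effect of other intervals, for example `checks_per_minute 1200 5`.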

My colleague Gavin has just sent me this (http://assets.nagios.com/downloads/nagi ... giosXI.pdf) which seems like it's probably our only prudent course of action.

Do you agree with this hypothesis or have any further advice?

Re: Seriously High Load Average

Post by scottwilkerson »

Yes, that is a correct hypothesis, and the RAM disk is a good first step. I would implement as much of the following as you can:
http://assets.nagios.com/downloads/nagi ... p#boosting

Also, if you could reduce the check frequency to every 2 minutes, you should roughly cut the load in half.

Additionally, you may have some added load over the previous system as XI uses ndoutils to write a bunch of historical information to the MySQL database. You may or may not have had this on your Core install.

Re: Seriously High Load Average

Post by chrisp »

Thanks for the advice.

I implemented the RAM disk yesterday evening and the drop in load was fairly dramatic. I'd like to have been able to show off the results in graph form, but somehow I have caused the graphs to stop working (example attached):
[Attachment: RAMdiskBrokegraphsMaybe.png]

Also, this morning I changed most of the check and retry intervals to 5 minutes. There are some checks that we have to keep on 1 minute check and retry intervals, but they're the minority; leaving the rest at 1 minute was an oversight on my part when introducing the 5/5 min service checks.

So, we're hugely better off and the system is actually usable now, but fixing the graphs is my main focus (though I'm not exactly sure where to start looking).

The RAM disk document was excellent and I greatly appreciate being able to cut and paste it all, but the pedant in me notes that the "chown" commands should use a colon (:) to separate the user and group, rather than a dot (.), for clarity.
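For illustration, the colon form looks like the sketch below. It is not copied from the RAM disk document: it runs against a throwaway temporary file and the current user's numeric IDs so that it is safe to try without root. On the server the equivalent would presumably use names, along the lines of `chown nagios:nagios <path>` (the user and group names here are an assumption).

```shell
# Sketch: the owner:group form of chown, demonstrated on a temporary file.
tmpfile=$(mktemp)

owner=$(id -u)   # current user's numeric ID
group=$(id -g)   # current user's primary group ID

# The colon separator is the form specified by POSIX. A dot can be
# ambiguous because user names themselves may contain dots.
chown "$owner:$group" "$tmpfile"
chown_status=$?

ls -ln "$tmpfile"   # -n shows numeric owner and group
rm -f "$tmpfile"
```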

Re: Seriously High Load Average

Post by chrisp »

I've just gone through the system config with my colleague Gavin (a fresh pair of eyes, and he understands the graphing setup better than I do) to see if I'd made any mistakes.

He installed "rrdcached" by following the "Using rrdcached with Nagios XI" document.

We tried disabling rrdcached to see if direct graphing would work, but it did not.

The problem is not that the graphs are not drawing, but that there is no data. I am stuck now...

BTW, where that rrdcached document refers to "/usr/local/nagios/etc/pnp/process_perfdata.conf", the file is actually "/usr/local/nagios/etc/pnp/process_perfdata.cfg" (which is a bit misleading).
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Seriously High Load Average

Post by mguthrie »

I would start with retracing the steps for the RAM disk, particularly the section on offloading the performance data processing. There are quite a few steps to that, so it's easy to miss something there.

Re: Seriously High Load Average

Post by scottwilkerson »

Do you see the journal with the following?

Code:

ls -l /tmp/rrd.*

Is rrdcached running?

Code:

service rrdcached status

Also, can you show me the output of the following?

Code:

ls -l /var/rrdtool/rrdcached/rrdcached.sock
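
The socket check can also be scripted. This is a small sketch, assuming a POSIX shell; `check_socket` is a hypothetical helper name, and the path is the one quoted in this post, which may differ if rrdcached was configured with another socket location.

```shell
# Hypothetical helper: report whether a path exists and is a Unix socket.
check_socket() {
    if [ -S "$1" ]; then
        echo "socket present: $1"
    else
        echo "no socket at: $1"
        return 1
    fi
}

# "|| true" keeps the script going when the socket is missing.
check_socket /var/rrdtool/rrdcached/rrdcached.sock || true
```

A missing socket while the service reports "running" would be one hint that the daemon and its clients disagree about where the socket lives.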

Re: Seriously High Load Average

Post by chrisp »

The journal is now in the RAM disk:

Code:

# ll /tmp/rrd.* ; ll /var/nagiosramdisk/tmp/rrd.*
zsh: no matches found: /tmp/rrd.*
-rw-r--r-- 1 nagios nagios 4.0K Jan 23 15:41 /var/nagiosramdisk/tmp/rrd.journal.1358955652.595930

I have a funny feeling that there's been a stuck "rrdcached" process since before the RAM disk was implemented (I recognize PID 7400 from yesterday)...

Code:

# service rrdcached status
rrdcached (pid 27627 7400) is running...
# service rrdcached stop  
rrdcached (pid 27627 7400) is running...
Stopping rrdcached:                                        [  OK  ]
# service rrdcached status
rrdcached (pid 7400) is running...
# service rrdcached stop  
rrdcached (pid 7400) is running...
Stopping rrdcached:                                        [FAILED]
# service rrdcached status
rrdcached (pid 7400) is running...
# killall rrdcached
# service rrdcached status
rrdcached is stopped
# service rrdcached start 
rrdcached is stopped
Starting rrdcached:                                        [  OK  ]
# service rrdcached status
rrdcached (pid 11065) is running...
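
A leftover instance like the one above can be spotted from the status line itself. The sketch below parses text in the format shown in this session; `count_pids` is a hypothetical helper, and treating more than one PID as suspicious is an assumption drawn from this thread rather than a documented rule.

```shell
# Hypothetical helper: count the PIDs in a status line such as
#   rrdcached (pid 27627 7400) is running...
count_pids() {
    # Keep only the text between "(pid " and ")", then count the words.
    printf '%s\n' "$1" | sed -n 's/.*(pid \([0-9 ]*\)).*/\1/p' | wc -w | tr -d ' '
}

status_line='rrdcached (pid 27627 7400) is running...'
if [ "$(count_pids "$status_line")" -gt 1 ]; then
    echo "more than one rrdcached PID reported; an old instance may be stuck"
fi
```

On a live system the status line could be captured with `status_line=$(service rrdcached status)` before calling the helper.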

Code:

# ll /var/rrdtool/rrdcached/rrdcached.sock
srwxr-xr-x 1 nagios users 0 Jan 23 15:40 /var/rrdtool/rrdcached/rrdcached.sock=

I went back through the RAM disk instructions a 3rd time and noticed that I had incorrectly altered the "process-[host|service]-perfdata-file-bulk" commands (I'd altered the 2nd path, but not the 1st one after "/bin/mv").

I was also doing some desperate Googling and found this, which also mentioned the "process-[host|service]-perfdata-file-pnp-bulk" commands.

I have just confirmed that I made an absolute hash of these command entry modifications and have altered them as follows (though somehow it's taken 4 goes at changing them, due to some unbelievably clumsy editing, and I even appear to have fixed it and then broken it again straight after, about 3 hours ago):

Code:

# grep --color=auto "/var/nagiosramdisk" -B1 /usr/local/nagios/etc/commands.cfg
       command_name                             process-host-perfdata-file-bulk
       command_line                             /bin/mv /var/nagiosramdisk/host-perfdata /var/nagiosramdisk/spool/xidpe/$TIMET$.perfdata.host
--
       command_name                             process-host-perfdata-file-pnp-bulk
       command_line                             /bin/mv /var/nagiosramdisk/host-perfdata /var/nagiosramdisk/spool/perfdata/host-perfdata.$TIMET$
--
       command_name                             process-service-perfdata-file-bulk
       command_line                             /bin/mv /var/nagiosramdisk/service-perfdata /var/nagiosramdisk/spool/xidpe/$TIMET$.perfdata.service
--
       command_name                             process-service-perfdata-file-pnp-bulk
       command_line                             /bin/mv /var/nagiosramdisk/service-perfdata /var/nagiosramdisk/spool/perfdata/service-perfdata.$TIMET$
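
The mistake described above (only one of the two "/bin/mv" paths pointing at the RAM disk) can be checked for mechanically. This is a sketch under assumptions: `find_mismatched_mv` is a hypothetical helper, and the old source path in the sample input is illustrative only.

```shell
# Hypothetical helper: read Nagios command definitions on stdin and print
# any "/bin/mv SOURCE DEST" command_line where SOURCE and DEST are not
# both under the given prefix.
find_mismatched_mv() {
    awk -v prefix="$1" '
        $1 == "command_line" && $2 == "/bin/mv" {
            if (index($3, prefix) != 1 || index($4, prefix) != 1) print
        }'
}

# Sample input: the first line has only its destination on the RAM disk.
find_mismatched_mv /var/nagiosramdisk <<'EOF'
command_line /bin/mv /usr/local/nagios/var/host-perfdata /var/nagiosramdisk/spool/xidpe/$TIMET$.perfdata.host
command_line /bin/mv /var/nagiosramdisk/service-perfdata /var/nagiosramdisk/spool/xidpe/$TIMET$.perfdata.service
EOF
```

Against the real file this would be run as `find_mismatched_mv /var/nagiosramdisk < /usr/local/nagios/etc/commands.cfg`. Note that it reports every "/bin/mv" command in the file, so any unrelated ones would need to be ignored by eye.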

Look, I have graphs again:
[Attachment: RAMdiskBrokegraphsBecauseIAmAnIdiot.png]
I've lost a day's worth of data, but I've learned some lessons...

Thanks for your help and advice. Writing this forum post (and not wanting to miss any detail) has been a great help in my diagnostic process.

Re: Seriously High Load Average

Post by scottwilkerson »

Glad you got it working!