Upgrade from 2011 to 2012 Failed

Post by **JulianFDRacing** » Wed Oct 24, 2012 2:44 pm

The ESX infrastructure does appear to be causing issues, and I hope you are correct. I've rattled a few cages today and will hopefully have a dedicated ESX host early next week. I'm currently working on a fresh XI 2012 VM that I can restore our config to; when I tested this process I had no issues, and a couple of other minor issues were resolved. The current production box has always been a bit of an unknown quantity since I started here, so I think this is the best course of action going forward.

npcd has been running consistently, but as the load index on production has been in the mid-20s all day, it's highly likely that it's grinding everything to a slow crawl. I think I've checked everything I can, and everything does seem to work, just eventually. I've noticed a number of times that login and command response times have been poor today, all pointing to the VM performance...

Perhaps the hardware requirements for 2012 are slightly higher and we've experienced the straw that broke the camel's back during the upgrade. Hope so.

Post by **lmiltchev** » Wed Oct 24, 2012 3:04 pm

Here are the official Nagios XI hardware requirements:

http://assets.nagios.com/downloads/nagi ... ements.pdf

What's your system like? Do you meet (or exceed) these requirements?

Post by **JulianFDRacing** » Thu Oct 25, 2012 3:25 am

lmiltchev wrote: Here are the official Nagios XI hardware requirements:

http://assets.nagios.com/downloads/nagi ... ements.pdf

What's your system like? Do you meet (or exceed) these requirements?

4 cores, 4 GB RAM, 40 GB drive. Currently 175 hosts / 2133 services, about 300 of them passive and the rest active. One of the issues we do have is that we are monitoring servers in Australia from our UK base, so latency has always been high.

UPDATE: The graphs have not updated since yesterday afternoon, so we are down about 24 hours of data now. The RRD files have a recent timestamp and NPCD is still running. I'm going to copy a group of RRD files to a test box I've been working on to see how it "looks", but even running on poor hardware I'd expect something to have come through. All the host and service checks appear to be working as normal.

Post by **JulianFDRacing** » Thu Oct 25, 2012 3:55 am

This may give a clue. It seems increasing the timeout value to 15 may have done more harm than good...

Code: Select all

2012-10-24 21:48:42 [3852] [0] *** process_perfdata.pl terminated on signal ALRM
2012-10-24 21:50:19 [4893] [0] *** TIMEOUT: Timeout after 15 secs. ***
2012-10-24 21:50:19 [4893] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2012-10-24 21:50:19 [4893] [0] *** TIMEOUT: Please check your npcd.cfg
2012-10-24 21:50:19 [4893] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1351083622.perfdata.service-PID-4893 deleted
2012-10-24 21:50:19 [4893] [0] *** Timeout while processing Host: "asc-jadedev1.int.ascribe.com" Service: "Page_File_Usage"
2012-10-24 21:50:19 [4893] [0] *** process_perfdata.pl terminated on signal ALRM
2012-10-24 21:52:48 [6520] [0] *** TIMEOUT: Timeout after 15 secs. ***
2012-10-24 21:52:48 [6520] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2012-10-24 21:52:48 [6520] [0] *** TIMEOUT: Please check your npcd.cfg
2012-10-24 21:52:48 [6520] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1351083763.perfdata.service-PID-6520 deleted
2012-10-24 21:52:48 [6520] [0] *** Timeout while processing Host: "CORE-TFS" Service: "Drive_H__Disk_Usage"
2012-10-24 21:52:48 [6520] [0] *** process_perfdata.pl terminated on signal ALRM
2012-10-24 21:53:05 [6620] [0] *** TIMEOUT: Timeout after 15 secs. ***
2012-10-24 21:53:05 [6620] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2012-10-24 21:53:05 [6620] [0] *** TIMEOUT: Please check your npcd.cfg
2012-10-24 21:53:05 [6620] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1351083777.perfdata.service-PID-6620 deleted
2012-10-24 21:53:05 [6620] [0] *** Timeout while processing Host: "localhost" Service: "Avg_HostExecTime"
2012-10-24 21:53:05 [6620] [0] *** process_perfdata.pl terminated on signal ALRM
2012-10-24 21:53:44 [6889] [0] *** TIMEOUT: Timeout after 15 secs. ***
2012-10-24 21:53:44 [6889] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2012-10-24 21:53:44 [6889] [0] *** TIMEOUT: Please check your npcd.cfg
2012-10-24 21:53:44 [6889] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1351083807.perfdata.service-PID-6889 deleted
2012-10-24 21:53:44 [6889] [0] *** Timeout while processing Host: "cnllhrs5.cnl.cnw.co.nz" Service: "Check_Temp_-_ioBoard"
2012-10-24 21:53:44 [6889] [0] *** process_perfdata.pl terminated on signal ALRM
2012-10-24 21:53:44 [6896] [0] *** TIMEOUT: Timeout after 15 secs. ***
2012-10-24 21:53:44 [6896] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2012-10-24 21:53:44 [6896] [0] *** TIMEOUT: Please check your npcd.cfg
2012-10-24 21:53:44 [6896] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1351083824.perfdata.service-PID-6896 deleted
2012-10-24 21:53:44 [6896] [0] *** Timeout while processing Host: "cnllhrs4.cnl.cnw.co.nz" Service: "Memory_Usage"
2012-10-24 21:53:44 [6896] [0] *** process_perfdata.pl terminated on signal ALRM
2012-10-25 07:07:38 [5895] [0] *** TIMEOUT: Timeout after 15 secs. ***
2012-10-25 07:07:38 [5895] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2012-10-25 07:07:38 [5895] [0] *** TIMEOUT: Please check your npcd.cfg
2012-10-25 07:07:38 [5895] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1351107834.perfdata.service-PID-5895 deleted
2012-10-25 07:07:38 [5895] [0] *** Timeout while processing Host: "mel-jadedev2.int.ascribe.com" Service: "Drive_H__Disk_Usage"
2012-10-25 07:07:38 [5895] [0] *** process_perfdata.pl terminated on signal ALRM

I've turned on logging level 1 so I can get some clues as to why it's erroring.
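
To see whether the timeouts cluster on particular hosts or services, the "Timeout while processing" lines in an excerpt like the one above could be tallied. The following is only a sketch: `count_timeouts` is a hypothetical helper name, and the regular expression is based solely on the log format shown in this thread, so it may need adjusting for other versions.

```python
import re
from collections import Counter

# Pattern derived from the "Timeout while processing" lines quoted in this thread.
TIMEOUT_RE = re.compile(
    r'Timeout while processing Host: "(?P<host>[^"]+)" Service: "(?P<service>[^"]+)"'
)


def count_timeouts(log_lines):
    """Return a Counter keyed by (host, service) with one hit per timeout line."""
    counts = Counter()
    for line in log_lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("host"), match.group("service"))] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the excerpt above; in practice, pass an open log file.
    sample = [
        '2012-10-24 21:52:48 [6520] [0] *** Timeout while processing Host: "CORE-TFS" Service: "Drive_H__Disk_Usage"',
        '2012-10-24 21:53:05 [6620] [0] *** Timeout while processing Host: "localhost" Service: "Avg_HostExecTime"',
    ]
    for (host, service), hits in count_timeouts(sample).most_common():
        print(f"{hits:5d}  {host} / {service}")
```

If the same few services dominate the tally, the checks themselves are worth a look; if the hits are spread evenly, that points more towards the box simply being too slow to process each file within the timeout.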

Post by **JulianFDRacing** » Thu Oct 25, 2012 5:19 am

It ran for about an hour and then stalled at this stage. I restarted NPCD and then Nagios, but it is still stalled, and I think it will only start again after a reboot.

Code: Select all

2012-10-25 10:05:09 [1348] [1] Found Performance Data for core-email.int.ascribe.com / Drive_D__Disk_Usage (D:\ Used Space=0.12Gb;11.72;13.18;0.00;14.65)
2012-10-25 10:05:09 [1348] [1] Found Performance Data for gri_acu-fp-10.acute.xglasgow.scot.nhs.uk / CPU_Usage (5 min avg Load=1%;85;95;0;100)
2012-10-25 10:05:12 [1348] [1] Found Performance Data for cnllhrs2.cnl.cnw.co.nz / Drive_D__Ops_Tools (D:\ Used Space=28.04Gb;31.22;35.13;0.00;39.03)
2012-10-25 10:05:17 [1348] [0] *** TIMEOUT: Timeout after 5 secs. ***
2012-10-25 10:05:17 [1348] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2012-10-25 10:05:17 [1348] [0] *** TIMEOUT: Please check your npcd.cfg
2012-10-25 10:05:17 [1348] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1351123012.perfdata.service-PID-1348 deleted
2012-10-25 10:05:17 [1348] [0] *** Timeout while processing Host: "cnllhrs2.cnl.cnw.co.nz" Service: "Drive_D__Ops_Tools"
2012-10-25 10:05:17 [1348] [0] *** process_perfdata.pl terminated on signal ALRM

I also noted that the perfdata.log file timestamp is 10:05, but host-perfdata and service-perfdata have a current timestamp of 11:20.

This may give some clues

Code: Select all

drwxrwxr-x 7 nagios nagios     4096 Oct 25 11:23 .
drwxr-xr-x 9 root   root       4096 Oct  9  2011 ..
drwxrwxr-x 2 nagios nagios    20480 Oct 25 00:00 archives
drwxr-xr-x 2 nagios nagios     4096 Dec 22  2011 archives.old
-rw-r--r-- 1 apache apache    47880 Oct 13 11:07 graphapi.log
-rw-rw-r-- 1 nagios users       248 Oct 25 11:23 host-perfdata
-rw-r--r-- 1 root   root      11060 May 17 08:54 nagios.debug
-rw-r--r-- 1 nagios users         5 Oct 25 11:16 nagios.lock
-rw-r--r-- 1 nagios nagios        5 Oct 25 09:10 ndo2db.lock
-rw-rw-r-- 1 nagios users         0 Oct 25 11:16 ndomod.tmp
srwxr-xr-x 1 nagios nagios        0 Oct 25 09:10 ndo.sock
-rw-r--r-- 1 nagios nagios  7346089 Oct 25 11:23 npcd.log
-rw-r--r-- 1 nagios nagios 10485783 Aug 31 08:11 npcd.log.old
-rw-r--r-- 1 nagios nagios  2145277 Oct 25 11:16 objects.cache
-rw-rw-rw- 1 nagios nagios  7459491 Oct 25 10:05 perfdata.log
-rw------- 1 nagios users   3485074 Oct 25 11:16 retention.dat
drwxrwsr-x 2 nagios nagcmd     4096 Jun 18 09:42 rw
-rw-rw-r-- 1 nagios users      4050 Oct 25 11:23 service-perfdata
drwxr-xr-x 5 nagios nagios     4096 Jan 26  2011 spool
drwxr-xr-x 2 nagios nagios     4096 Oct 25 10:05 stats
-rw-rw-r-- 1 nagios users   3402524 Oct 25 11:23 status.dat
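
The gap in the listing above (perfdata.log last written at 10:05 while service-perfdata is still being updated) is the kind of thing a small check could flag automatically. Here is a rough sketch, assuming verbose logging is on so that a healthy process_perfdata.pl writes to the log regularly; `is_stalled` and the 15-minute default are hypothetical choices, not anything XI ships with.

```python
import os
import time


def seconds_since_modified(path, now=None):
    """Return how many seconds ago `path` was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def is_stalled(log_path, max_age_seconds=900, now=None):
    """True if `log_path` has not been written to within `max_age_seconds`."""
    return seconds_since_modified(log_path, now) > max_age_seconds
```

On this box it would be called as `is_stalled("/usr/local/nagios/var/perfdata.log")`. With the log level back at 0 only errors are written, so a quiet log is then normal and the same check would be better pointed at a frequently updated RRD file instead.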

Post by **JulianFDRacing** » Thu Oct 25, 2012 8:04 am

It's stalled again, but with no errors in the log this time.

Code: Select all

2012-10-25 13:12:06 [1058] [1] Found Performance Data for test-esx.int.ascribe.com / VMware_Host_Current_Datastore_datastore1_Usage (datastore1-free=140877234176B;3;1;0;141465485312 datastore1=588251136B;141465485309;141465485311;0;141465485312)
2012-10-25 13:12:06 [1058] [1] Found Performance Data for ascribe-esx2.int.ascribe.com / VMware_Host_Current_Datastore_vmfs05_Usage (VMFS_05-free=20057161728B;3;1;0;549487378432 VMFS_05=529430216704B;549487378429;549487378431;0;549487378432)
2012-10-25 13:12:06 [1058] [1] Found Performance Data for localhost / PassiveServiceChecks_1mn (Passive_Checks_1mn=0;;;)
2012-10-25 13:12:06 [1058] [1] Found Performance Data for tstesting-web2.elt / IIS_Web_Server_Connections (CurrentConnections=0; _ConnectionAttemptsPersec=0;)
2012-10-25 13:12:06 [1058] [1] 127 lines processed

It's now 14:02. It has caught up to about 3am, but it's never going to catch up completely if these services keep failing, and NagiosXI is none the wiser that it has stalled: XI still says everything is OK and NPCD is running (pid 3567).

How can I restart this process without restarting the whole server?

Post by **JulianFDRacing** » Thu Oct 25, 2012 10:11 am

Code: Select all

2012-10-25 15:43:55 [32332] [1] Found Performance Data for tstesting-db1.elt / CPU_Usage (5 min avg Load=3%;85;95;0;100)
2012-10-25 15:43:55 [32332] [1] Found Performance Data for core-email.int.ascribe.com / Memory_Usage (Memory usage=5770.10Mb;27013.89;30191.99;0.00;31781.05)
2012-10-25 15:43:55 [32332] [1] Found Performance Data for ascribesql.xchristie.nhs.uk / CPU_Usage (5 min avg Load=19%;85;95;0;100)
2012-10-25 15:43:55 [32332] [1] Found Performance Data for cnllhrs4.cnl.cnw.co.nz / CPU_Usage (5 min avg Load=0%;85;95;0;100)
2012-10-25 15:43:55 [32332] [1] Found Performance Data for gri_acu-fp-10.acute.xglasgow.scot.nhs.uk / CPU_Usage (5 min avg Load=1%;85;95;0;100)
2012-10-25 15:43:55 [32332] [1] Found Performance Data for dev-esx.int.ascribe.com / VMware_Host_Current_Datastore_DEV2_Usage (DEV2-free=1058624503808B;3;1;0;1466462896128 DEV2=407838392320B;1466462896125;1466462896127;0;1466462896128)
2012-10-25 15:43:57 [32332] [0] *** TIMEOUT: Timeout after 5 secs. ***
2012-10-25 15:43:57 [32332] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2012-10-25 15:43:57 [32332] [0] *** TIMEOUT: Please check your npcd.cfg
2012-10-25 15:43:57 [32332] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1351139712.perfdata.service-PID-32332 deleted
2012-10-25 15:43:57 [32332] [0] *** Timeout while processing Host: "dev-esx.int.ascribe.com" Service: "VMware_Host_Current_Datastore_DEV2_Usage"
2012-10-25 15:43:57 [32332] [0] *** process_perfdata.pl terminated on signal ALRM

And again. Is anyone looking into this? I can't keep restarting the box for it to keep up. This worked perfectly before the XI upgrade, so I think other NagiosXI users need to be aware of this before they upgrade.

Post by **mguthrie** » Thu Oct 25, 2012 10:22 am

What is your CPU load on the 5 and 15 minute average? By default, NPCD will still pause its performance data processing if the load is over 10. This can be changed in /usr/local/nagios/etc/pnp/npcd.cfg by adjusting the load threshold.
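
As a rough illustration of how that threshold behaves, the sketch below reads the setting out of npcd.cfg text and compares it with a load figure. It assumes the option is spelled `load_threshold` (as in the stock PNP4Nagios npcd.cfg), that a value of 0 disables the check, and that the comparison is made against a single load average; the function names are hypothetical.

```python
import re


def read_load_threshold(cfg_text, default=0.0):
    """Return the value of `load_threshold = <number>` in npcd.cfg text.

    Commented-out lines are ignored; `default` is returned if the key is absent.
    """
    for raw in cfg_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        match = re.fullmatch(r"load_threshold\s*=\s*(\d+(?:\.\d+)?)", line)
        if match:
            return float(match.group(1))
    return default


def npcd_would_pause(load, threshold):
    """Assumed rule: a threshold of 0 disables the check, otherwise pause above it."""
    return threshold > 0 and load > threshold
```

For example, `npcd_would_pause(os.getloadavg()[0], read_load_threshold(open("/usr/local/nagios/etc/pnp/npcd.cfg").read()))` would report whether the current 1-minute load is above the configured limit. With a load in the mid-20s and a threshold of 10, processing would spend most of the day paused, which would fit the symptoms described in this thread.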

Are you seeing a lot of IOwait on the system? Is your ESX server heavily taxed for disk activity? If so, things could be getting backed up while waiting to write to disk.

Can you post the output from the following:

Code: Select all

cd /usr/local/nagios/var/spool/perfdata
ls -f | wc -l

cd /usr/local/nagios/var/spool/xidpe
ls -f | wc -l

Also, the verbose logging does add a substantial increase in disk activity and CPU usage on the system, so be sure to turn it off once it's no longer needed.
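
If the counts need tracking over time, the same check can be scripted. This is a sketch only: it assumes spool files are named `<epoch>.perfdata.host...` or `<epoch>.perfdata.service...`, which is inferred from the paths in the log excerpts earlier in this thread, and `spool_backlog` is a hypothetical helper.

```python
import os
import re
import time

# Assumed naming, inferred from the log excerpts: "<epoch>.perfdata.<host|service>..."
SPOOL_NAME_RE = re.compile(r"^(\d+)\.perfdata\.(?:host|service)")


def spool_backlog(spool_dir, now=None):
    """Return (file_count, age_in_seconds_of_oldest_timestamped_file or None)."""
    now = time.time() if now is None else now
    stamps = []
    count = 0
    with os.scandir(spool_dir) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            count += 1
            match = SPOOL_NAME_RE.match(entry.name)
            if match:
                stamps.append(int(match.group(1)))
    oldest_age = now - min(stamps) if stamps else None
    return count, oldest_age
```

Note that `ls -f | wc -l` also counts the `.` and `..` entries, so the number returned here will be two lower for the same directory. The age of the oldest file gives a rough idea of how far behind processing is.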

Post by **JulianFDRacing** » Thu Oct 25, 2012 10:41 am

[root@NagiosXI perfdata]# cd /usr/local/nagios/var/spool/perfdata
[root@NagiosXI perfdata]# ls -f | wc -l
4687
[root@NagiosXI perfdata]#
[root@NagiosXI perfdata]# cd /usr/local/nagios/var/spool/xidpe
[root@NagiosXI xidpe]# ls -f | wc -l
4

Post by **JulianFDRacing** » Thu Oct 25, 2012 10:45 am

Although I'm not convinced it's a wise idea, I've changed the load threshold to 20 just to see if this keeps running overnight.

It may be worth adding something about this for users who are about to upgrade when they are already pushing the limits of their hardware.
