Checks stop running randomly

Post by **gwakem** » Fri Jun 15, 2012 1:59 pm

We recently upgraded from NagiosXI 2.3 to r3.1. After the upgrade, we noticed that graphs stop updating for quite a few of our hosts' services, at completely random times. Upon investigation, we discovered that this happens because the checks themselves stop running without going critical. They just get "stuck".

For instance, it is 12:56 pm MST right now, and when I look at a check that seems to be stuck (and is in an OK state), I see the following:

Last Check: 2012-06-15 11:13:02
Next Check: 2012-06-15 11:13:02

Forcing a recheck corrects the issue, and fills in the graph with a straight line. However, this is happening across a multitude of hosts, and the only way we know to find them is if we see the perfdata stop being updated. Has anyone noticed this issue with r3.1, or can offer assistance on where I could start looking for the root of the issue?
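One way to find stuck checks without waiting for the perfdata to go stale is to scan the Nagios status file for services whose next check time is already well in the past. The following is a minimal sketch, not a tested fix for this issue. It assumes the status file lives at `/usr/local/nagios/var/status.dat` (the usual location for a source install), that it uses the standard `servicestatus { key=value }` block layout, and that a five-minute grace period is reasonable. The function names and the grace period are made up for illustration.

```python
import time

# Assumed default path for a source install; adjust for your system.
STATUS_FILE = "/usr/local/nagios/var/status.dat"


def parse_service_blocks(text):
    """Yield one dict of key=value pairs per 'servicestatus {' block."""
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "servicestatus {":
            current = {}
        elif line == "}" and current is not None:
            yield current
            current = None
        elif current is not None:
            key, sep, value = line.partition("=")
            if sep:
                current[key] = value


def find_stuck_services(text, now, grace_seconds=300):
    """Return (host, service, next_check) for overdue active checks.

    A service counts as stuck when active checks are enabled and its
    next_check timestamp is more than grace_seconds behind 'now'.
    """
    stuck = []
    for svc in parse_service_blocks(text):
        if svc.get("active_checks_enabled") != "1":
            continue  # passive-only services are never scheduled
        next_check = int(svc.get("next_check", "0"))
        if next_check and now - next_check > grace_seconds:
            stuck.append(
                (svc.get("host_name"), svc.get("service_description"), next_check)
            )
    return stuck


def report(path=STATUS_FILE):
    """Print every overdue service found in the status file at 'path'."""
    now = time.time()
    with open(path) as handle:
        for host, service, next_check in find_stuck_services(handle.read(), now):
            print(f"{host} / {service}: overdue by {int(now - next_check)}s")
```

Calling `report()` on the monitoring server prints one line per overdue service. The grace period should be longer than your normal check latency, otherwise healthy checks that are merely queued will be reported as well.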

Post by **lmiltchev** » Fri Jun 15, 2012 2:20 pm

From the Nagios XI home page, click on both "Process Info" and "Performance" under the "Monitoring Process" menu on the left-hand side, and post screenshots of these two screens.

Post by **gwakem** » Fri Jun 15, 2012 2:37 pm

Screenshots attached.

Post by **lmiltchev** » Fri Jun 15, 2012 3:26 pm

Everything looks normal. So, run the following commands and post the output:

ps -ef | grep bin/nagios
tail /var/log/messages
tail /usr/local/nagios/var/nagios.log

Also run the following commands:

service nagios stop
service ndo2db stop
killall -9 nagios
killall -9 ndo2db
service nagios start
service ndo2db start

See if this fixes your issue.

Post by **gwakem** » Mon Jun 18, 2012 9:34 am

Did you want the logging while Nagios was restarted, or just the snippet of the last few lines in the log before the restart?

[root@sidhqmonm0 services]# ps -ef | grep bin/nagios
nagios 4860 29250 0 14:27 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 4923 9878 0 14:27 pts/8 00:00:00 grep bin/nagios
nagios 29250 1 4 14:19 ? 00:00:22 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

We followed the steps given above, but the next apply we did caused all services to start while the Process State under Monitoring Engine Process remained stopped. Clicking start did nothing but attempt to restart the daemons, which would fail and leave zombie processes, as the main process would die. We eventually got the system back into a running state by rebooting and applying once. Before we applied, we lost the Google Map pinpointing functionality, BPI, etc. Once we got the system stable again, we left, only to find that nearly all of the graphing and checks had stopped over the weekend.

The system isn't overloaded at all, so I'm not sure why the checks would get "stuck" and stop. Perhaps we should roll back to r2.4 until we get this figured out?

Post by **gwakem** » Mon Jun 18, 2012 9:39 am

Given how severe this issue is for us, please let us know if you would prefer that I open a ticket.

Post by **mguthrie** » Mon Jun 18, 2012 9:41 am

What results do you get from the following commands:

service nagios restart
service npcd restart

Can you also access the Admin->System Profile page and send us the output from the text download? (Feel free to remove any host name information if it's public).

Post by **gwakem** » Mon Jun 18, 2012 9:52 am

[root@sidhqmonm0 ~]# service nagios restart
Running configuration check...done.
Stopping nagios: ..done.
Starting nagios: done.
[root@sidhqmonm0 ~]# ps -ef |grep bin/nagios
nagios 24442 1 9 08:51 ? 00:00:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 24725 2835 0 08:52 pts/0 00:00:00 grep bin/nagios
[root@sidhqmonm0 ~]# service npcd restart
NPCD Stopped.
NPCD started.
[root@sidhqmonm0 ~]# ps -ef |grep bin/npcd
nagios 24765 1 0 08:52 ? 00:00:00 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
root 25274 2835 0 08:52 pts/0 00:00:00 grep bin/npcd

Edit: Attached Profile.txt

Post by **gwakem** » Mon Jun 18, 2012 10:03 am

I was able to confirm that a check is stuck at 08:53:43 and it's 08:59 MST right now, so even immediately after the above manual restarts, it still seems to be popping up. I checked Nagios core, thinking maybe it was an issue in postgres and isolated to NagiosXI, but core shows mostly the same thing with one exception:

NagiosXI:

Last Check: 2012-06-18 08:53:43
Next Check: 2012-06-18 08:53:43
Last State Change: 2012-06-08 13:57:54

Core states:

Next Scheduled Check: 06-18-2012 08:53:43
Last State Change: 06-08-2012 13:57:54
Last Update: 06-18-2012 09:03:05 ( 0d 0h 0m 2s ago)

Post by **gwakem** » Mon Jun 18, 2012 10:19 am

I ran a test to verify that the checks that were stuck were actually not checking.

One of the checks that froze is our DNX process check on one of our child servers (which were also upgraded). I manually stopped the process and waited for ten minutes, confirming that the service didn't change state. It still says:

Last Check: 2012-06-18 08:56:27
Next Check: 2012-06-18 08:56:27
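Since forcing a recheck cleared the stuck state earlier in the thread, the same thing can be scripted through the Nagios external command file as a stopgap. This is only a sketch of a workaround and does not address the root cause. It assumes external commands are enabled and that the command pipe is at `/usr/local/nagios/var/rw/nagios.cmd` (the usual location for a source install). The helper names are hypothetical.

```python
import time

# Assumed default command pipe location for a source install.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def forced_check_command(host, service, now=None):
    """Build one SCHEDULE_FORCED_SVC_CHECK line for the external command file.

    The check is scheduled for 'now', so Nagios should run it at the next
    opportunity regardless of the service's normal schedule.
    """
    now = int(time.time() if now is None else now)
    return f"[{now}] SCHEDULE_FORCED_SVC_CHECK;{host};{service};{now}\n"


def submit(commands, command_file=COMMAND_FILE):
    """Write pre-built command lines to the Nagios command pipe."""
    with open(command_file, "w") as pipe:
        for command in commands:
            pipe.write(command)
```

For example, `submit([forced_check_command("childserver01", "DNX Process")])` would queue a forced check for a hypothetical host and service of those names. The script has to run as a user with write access to the command pipe.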
