We recently upgraded from NagiosXI 2.3 to r3.1. After the upgrade, we noticed that graphs would stop updating for quite a few of our hosts' services, at completely random times. Upon investigation, we discovered that this was due to the checks randomly stopping, but not going critical. They just kind of get "stuck".
For instance, it is 12:56pm MST right now, and when I look at a check that seems to have stuck, I see the following (and it's in an OK state):
Last Check: 2012-06-15 11:13:02
Next Check: 2012-06-15 11:13:02
Forcing a recheck corrects the issue, and fills in the graph with a straight line. However, this is happening across a multitude of hosts, and the only way we know to find them is if we see the perfdata stop being updated. Has anyone noticed this issue with r3.1, or can offer assistance on where I could start looking for the root of the issue?
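One way to find stuck checks without waiting for a graph to flatline is to scan the Nagios status file for services whose next check time is already in the past. The following is only a rough sketch, not an official tool: the function name `find_stale_checks`, the 300-second grace period, and the status file path in the usage line are assumptions, and it relies on the `servicestatus { ... }` blocks with `key=value` lines that Nagios Core 3.x writes to `status.dat`.

```shell
#!/bin/sh
# find_stale_checks STATUS_FILE [GRACE_SECONDS] [NOW_EPOCH]
# Hypothetical helper: prints "host;service;seconds_overdue" for every
# service with active checks enabled whose next_check is more than
# GRACE_SECONDS in the past.
find_stale_checks() {
    status_file=$1
    grace=${2:-300}              # assumed tolerance, tune to your check intervals
    now=${3:-$(date +%s)}        # overridable so the function can be tested
    awk -v now="$now" -v grace="$grace" '
        # Start of a service block: reset the fields we collect.
        /^[ \t]*servicestatus[ \t]*[{]/ {
            in_svc = 1; host = ""; svc = ""; due = 0; active = 1
            next
        }
        # End of a service block: report it if it is overdue.
        in_svc && /^[ \t]*[}]/ {
            if (active == 1 && due > 0 && now - due > grace)
                printf "%s;%s;%d\n", host, svc, now - due
            in_svc = 0
            next
        }
        # Inside a service block: split each line on the first "=".
        in_svc {
            line = $0
            sub(/^[ \t]+/, "", line)
            eq = index(line, "=")
            if (eq == 0) next
            key = substr(line, 1, eq - 1)
            val = substr(line, eq + 1)
            if (key == "host_name") host = val
            else if (key == "service_description") svc = val
            else if (key == "next_check") due = val + 0
            else if (key == "active_checks_enabled") active = val + 0
        }
    ' "$status_file"
}

# Usage (path is an assumption; check status_file in nagios.cfg):
# find_stale_checks /usr/local/nagios/var/status.dat 300
```

Run from cron, something like this could flag stuck services by name instead of relying on someone spotting a gap in the perfdata.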
Checks stop running randomly
-
- Posts: 238
- Joined: Mon Jan 23, 2012 2:02 pm
- Location: Asheville, NC
--
Griffin Wakem
-
- Former Nagios Staff
- Posts: 13589
- Joined: Mon May 23, 2011 12:15 pm
Re: Checks stop running randomly
From the Nagios XI Home page, click on both "Process Info" and "Performance" under the "Monitoring Process" menu on the left-hand side, and post screenshots of these two screens.
Be sure to check out our Knowledgebase for helpful articles and solutions!
-
- Posts: 238
- Joined: Mon Jan 23, 2012 2:02 pm
- Location: Asheville, NC
Re: Checks stop running randomly
Screenshots attached.
--
Griffin Wakem
-
- Former Nagios Staff
- Posts: 13589
- Joined: Mon May 23, 2011 12:15 pm
Re: Checks stop running randomly
Everything looks normal. So, run the following commands and post the output:
ps -ef | grep bin/nagios
tail /var/log/messages
tail /usr/local/nagios/var/nagios.log
Also run the following commands:
service nagios stop
service ndo2db stop
killall -9 nagios
killall -9 ndo2db
service nagios start
service ndo2db start
See if this fixes your issue.
-
- Posts: 238
- Joined: Mon Jan 23, 2012 2:02 pm
- Location: Asheville, NC
Re: Checks stop running randomly
Did you want the logging while Nagios was restarted, or just the snippet of the last few lines in the log before the restart?
[root@sidhqmonm0 services]# ps -ef | grep bin/nagios
nagios 4860 29250 0 14:27 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 4923 9878 0 14:27 pts/8 00:00:00 grep bin/nagios
nagios 29250 1 4 14:19 ? 00:00:22 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
We followed the steps given above, but the next apply we did caused all services to start while the Process State under Monitoring Engine Process remained stopped. Clicking Start would do nothing but attempt to restart the daemons, which would fail and leave zombie processes, as the main process would die. We eventually got the system back into a running state by rebooting and applying once. Before we applied, we lost the Google Map pinpointing functionality, BPI, etc. Once we got the system stable again, we left, only to find that nearly all of the graphing and checks had stopped over the weekend.
The system isn't overloaded at all, so I'm not sure why the checks would get "stuck" and stop. Perhaps we should roll back to r2.4 until we get this figured out?
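Since zombie and leftover processes came up, it may help to check how many top-level Nagios daemons are actually running. The sketch below is an assumption-laden helper, not part of Nagios: the name `list_nagios_masters` is made up, and it assumes `ps -ef` column order (UID, PID, PPID, ...) and that a daemonized master has parent PID 1, as PID 29250 does in the output above. Child processes such as 4860, whose parent is the master, are presumably forks and are ignored.

```shell
#!/bin/sh
# list_nagios_masters: reads `ps -ef` output on stdin and prints the PID
# of every nagios daemon whose parent PID is 1 (hypothetical helper).
list_nagios_masters() {
    # $2 = PID, $3 = PPID in `ps -ef` output; match only the daemon command line.
    awk '$3 == 1 && /bin\/nagios -d/ { print $2 }'
}

# Usage:
# ps -ef | list_nagios_masters
```

More than one PID in the result would suggest a stale daemon survived a restart, which is the situation the `killall -9` steps are meant to clear.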
--
Griffin Wakem
-
- Posts: 238
- Joined: Mon Jan 23, 2012 2:02 pm
- Location: Asheville, NC
Re: Checks stop running randomly
Due to the severity of this issue to us, if it is preferred that I open a ticket, please let us know.
--
Griffin Wakem
-
- Posts: 4380
- Joined: Mon Jun 14, 2010 10:21 am
Re: Checks stop running randomly
What results do you get from the following commands:
service nagios restart
service npcd restart
Can you also access the Admin->System Profile page and send us the output from the text download? (Feel free to remove any host name information if it's public).
-
- Posts: 238
- Joined: Mon Jan 23, 2012 2:02 pm
- Location: Asheville, NC
Re: Checks stop running randomly
[root@sidhqmonm0 ~]# service nagios restart
Running configuration check...done.
Stopping nagios: ..done.
Starting nagios: done.
[root@sidhqmonm0 ~]# ps -ef |grep bin/nagios
nagios 24442 1 9 08:51 ? 00:00:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 24725 2835 0 08:52 pts/0 00:00:00 grep bin/nagios
[root@sidhqmonm0 ~]# service npcd restart
NPCD Stopped.
NPCD started.
[root@sidhqmonm0 ~]# ps -ef |grep bin/npcd
nagios 24765 1 0 08:52 ? 00:00:00 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
root 25274 2835 0 08:52 pts/0 00:00:00 grep bin/npcd
Edit: Attached Profile.txt
--
Griffin Wakem
-
- Posts: 238
- Joined: Mon Jan 23, 2012 2:02 pm
- Location: Asheville, NC
Re: Checks stop running randomly
I was able to confirm that a check is stuck at 08:53:43, and it's 08:59 MST right now, so the problem still shows up even immediately after the manual restarts above. I checked Nagios Core, thinking it might be a Postgres issue isolated to NagiosXI, but Core shows mostly the same thing, with one exception:
NagiosXI:
Last Check: 2012-06-18 08:53:43
Next Check: 2012-06-18 08:53:43
Last State Change: 2012-06-08 13:57:54
Core states:
Next Scheduled Check: 06-18-2012 08:53:43
Last State Change: 06-08-2012 13:57:54
Last Update: 06-18-2012 09:03:05 ( 0d 0h 0m 2s ago)
--
Griffin Wakem
-
- Posts: 238
- Joined: Mon Jan 23, 2012 2:02 pm
- Location: Asheville, NC
Re: Checks stop running randomly
I ran a test to verify that the checks that were stuck were actually not checking.
One of the checks that froze is our DNX process check on one of our child servers (which were also upgraded). I manually stopped the process and waited for ten minutes, confirming that the service didn't change state. It still says:
Last Check: 2012-06-18 08:56:27
Next Check: 2012-06-18 08:56:27
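Since forcing a recheck un-sticks a service, the same thing can be done from the shell through the Nagios external command file rather than the web UI. This is a minimal sketch: `SCHEDULE_FORCED_SVC_CHECK` is a standard Nagios Core external command, but the helper name `build_forced_check`, the host and service names, and the command file path in the usage line are assumptions.

```shell
#!/bin/sh
# build_forced_check HOST SERVICE [EPOCH]
# Hypothetical helper: prints one external command line asking Nagios to
# run the given service check immediately, ignoring its normal schedule.
build_forced_check() {
    host=$1
    svc=$2
    now=${3:-$(date +%s)}    # overridable so the output is reproducible
    # Format: [entry_time] SCHEDULE_FORCED_SVC_CHECK;host;service;check_time
    printf '[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n' "$now" "$host" "$svc" "$now"
}

# Usage (path is an assumption; check command_file in nagios.cfg):
# build_forced_check "child01" "DNX Process" > /usr/local/nagios/var/rw/nagios.cmd
```

This only works around the symptom for one service at a time; it does not explain why the scheduler stopped rescheduling the check.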
--
Griffin Wakem