Checks stop running randomly

gwakem
Posts: 238
Joined: Mon Jan 23, 2012 2:02 pm
Location: Asheville, NC

Re: Checks stop running randomly

Post by gwakem »

The attached pictures were taken at 09:25 MST to give you an idea of what we're seeing. We disabled passive checks to rule them out as a cause, but we didn't force a recheck, as there was no guarantee it would freeze again.
--
Griffin Wakem
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Checks stop running randomly

Post by mguthrie »

I'd like to rule out DB corruption as a possibility. Let's go ahead and run the following procedure to make sure that's not causing an issue.

http://assets.nagios.com/downloads/nagi ... tabase.pdf


Also, can you run a live tail on your system log and Nagios log and see whether any ndo2db-related errors show up?

Code: Select all

tail -f /var/log/messages
tail -f /usr/local/nagios/var/nagios.log
gwakem
Posts: 238
Joined: Mon Jan 23, 2012 2:02 pm
Location: Asheville, NC

Re: Checks stop running randomly

Post by gwakem »

We stopped nagios and ndo2db, and ran the repair:

Code: Select all

[root@sidhqmonadm0 ~]# service mysqld stop
Stopping MySQL:                                            [  OK  ]
[root@sidhqmonadm0 ~]# ps ax |grep mysqld
25900 pts/1    S+     0:00 grep mysqld
[root@sidhqmonadm0 ~]# ./repairmysql.sh nagios *
DATABASE: nagios
TABLE:    
/var/lib/mysql/nagios ~
Stopping MySQL:                                            [FAILED]
- recovering (with sort) MyISAM-table 'nagios_acknowledgements.MYI'
Data records: 2458
- Fixing index 1
          
---------

- recovering (with sort) MyISAM-table 'nagios_commands.MYI'
Data records: 156
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_commenthistory.MYI'
Data records: 28805
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_comments.MYI'
Data records: 2061
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_configfiles.MYI'
Data records: 1
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_configfilevariables.MYI'
Data records: 138
- Fixing index 1
          
---------

- recovering (with sort) MyISAM-table 'nagios_conninfo.MYI'
Data records: 1940
- Fixing index 1
          
---------

- recovering (with sort) MyISAM-table 'nagios_contact_addresses.MYI'
Data records: 0
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_contactgroup_members.MYI'
Data records: 108
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_contactgroups.MYI'
Data records: 36
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_contact_notificationcommands.MYI'
Data records: 1056
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_contactnotificationmethods.MYI'
Data records: 75763
- Fixing index 1
- Fixing index 2
- Fixing index 3
          
---------

- recovering (with sort) MyISAM-table 'nagios_contactnotifications.MYI'
Data records: 75763
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
          
---------

- recovering (with sort) MyISAM-table 'nagios_contacts.MYI'
Data records: 88
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_contactstatus.MYI'
Data records: 88
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_customvariables.MYI'
Data records: 2151
- Fixing index 1
- Fixing index 2
- Fixing index 3
          
---------

- recovering (with sort) MyISAM-table 'nagios_customvariablestatus.MYI'
Data records: 2151
- Fixing index 1
- Fixing index 2
- Fixing index 3
          
---------

- recovering (with keycache) MyISAM-table 'nagios_dbversion.MYI'
Data records: 0
          
---------

- recovering (with sort) MyISAM-table 'nagios_downtimehistory.MYI'
Data records: 950
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_eventhandlers.MYI'
Data records: 11
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_externalcommands.MYI'
Data records: 1832
- Fixing index 1
          
---------

- recovering (with sort) MyISAM-table 'nagios_flappinghistory.MYI'
Data records: 6137
- Fixing index 1
          
---------

- recovering (with sort) MyISAM-table 'nagios_hostchecks.MYI'
Data records: 980
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_host_contactgroups.MYI'
Data records: 1986
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_host_contacts.MYI'
Data records: 142
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_hostdependencies.MYI'
Data records: 1424
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_hostescalation_contactgroups.MYI'
Data records: 1024
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_hostescalation_contacts.MYI'
Data records: 1002
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_hostescalations.MYI'
Data records: 466
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_hostgroup_members.MYI'
Data records: 2568
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_hostgroups.MYI'
Data records: 253
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_host_parenthosts.MYI'
Data records: 1470
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_hosts.MYI'
Data records: 1113
- Fixing index 1
- Fixing index 2
- Fixing index 3
          
---------

- recovering (with sort) MyISAM-table 'nagios_hoststatus.MYI'
Data records: 1113
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
- Fixing index 6
- Fixing index 7
- Fixing index 8
- Fixing index 9
- Fixing index 10
- Fixing index 11
- Fixing index 12
- Fixing index 13
- Fixing index 14
- Fixing index 15
- Fixing index 16
- Fixing index 17
- Fixing index 18
- Fixing index 19
          
---------

- recovering (with sort) MyISAM-table 'nagios_instances.MYI'
Data records: 1
- Fixing index 1
          
---------

- recovering (with sort) MyISAM-table 'nagios_logentries.MYI'
Data records: 911988
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
          
---------

- recovering (with sort) MyISAM-table 'nagios_notifications.MYI'
Data records: 120266
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
          
---------

- recovering (with sort) MyISAM-table 'nagios_objects.MYI'
Data records: 11624
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
          
---------

- recovering (with sort) MyISAM-table 'nagios_processevents.MYI'
Data records: 11406
- Fixing index 1
          
---------

- recovering (with sort) MyISAM-table 'nagios_programstatus.MYI'
Data records: 1
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_runtimevariables.MYI'
Data records: 18
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_scheduleddowntime.MYI'
Data records: 338
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_servicechecks.MYI'
Data records: 3034
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
          
---------

- recovering (with sort) MyISAM-table 'nagios_service_contactgroups.MYI'
Data records: 7009
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_service_contacts.MYI'
Data records: 1407
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_servicedependencies.MYI'
Data records: 620
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_serviceescalation_contactgroups.MYI'
Data records: 4508
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_serviceescalation_contacts.MYI'
Data records: 3995
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_serviceescalations.MYI'
Data records: 1756
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_servicegroup_members.MYI'
Data records: 317
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_servicegroups.MYI'
Data records: 51
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_services.MYI'
Data records: 4179
- Fixing index 1
- Fixing index 2
- Fixing index 3
          
---------

- recovering (with sort) MyISAM-table 'nagios_servicestatus.MYI'
Data records: 4179
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
- Fixing index 6
- Fixing index 7
- Fixing index 8
- Fixing index 9
- Fixing index 10
- Fixing index 11
- Fixing index 12
- Fixing index 13
- Fixing index 14
- Fixing index 15
- Fixing index 16
- Fixing index 17
- Fixing index 18
- Fixing index 19
          
---------

- recovering (with sort) MyISAM-table 'nagios_statehistory.MYI'
Data records: 450349
- Fixing index 1
- Fixing index 2
- Fixing index 3
          
---------

- recovering (with sort) MyISAM-table 'nagios_systemcommands.MYI'
Data records: 98
- Fixing index 1
- Fixing index 2
- Fixing index 3
          
---------

- recovering (with sort) MyISAM-table 'nagios_timedeventqueue.MYI'
Data records: 4610
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
- Fixing index 6
          
---------

- recovering (with sort) MyISAM-table 'nagios_timedevents.MYI'
Data records: 0
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
- Fixing index 6
          
---------

- recovering (with sort) MyISAM-table 'nagios_timeperiods.MYI'
Data records: 18
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_timeperiod_timeranges.MYI'
Data records: 108
- Fixing index 1
- Fixing index 2
Starting MySQL:                                            [  OK  ]
~
 
===============
REPAIR COMPLETE
===============
[root@sidhqmonadm0 ~]# service mysqld start
Starting MySQL:                                            [  OK  ]
When we restarted nagios and ndo2db, we found some oddness: host checks were not running, and service checks did not appear underneath their hosts. We confirmed that the services still existed and were enabled. It took an apply (with no changes, just an apply) to bring everything back into place and get it running again. Attaching screenshots of the strangeness.
--
Griffin Wakem
KevinD
Posts: 26
Joined: Thu Mar 29, 2012 10:26 am

Re: Checks stop running randomly

Post by KevinD »

As for the log tails, here is the output. It looks like something happened during the initial startup and was cleared by the apply.

Start after DB repair

Code: Select all

[2012/06/18 09:44:49] Caught SIGTERM, shutting down...
[2012/06/18 09:44:49] Successfully shutdown... (PID=24442)
[2012/06/18 09:44:58] Event broker module '/usr/local/nagios/lib/dnxPlugin.so' deinitialized successfully.
[2012/06/18 09:44:58] ndomod: Shutdown complete.
[2012/06/18 09:44:58] Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
[2012/06/18 09:47:11] Nagios 3.4.1 starting... (PID=8676)
[2012/06/18 09:47:11] Local time is Mon Jun 18 09:47:11 MDT 2012
[2012/06/18 09:47:11] LOG VERSION: 2.0
[2012/06/18 09:47:11] Event broker module '/usr/local/nagios/lib/dnxPlugin.so' initialized successfully.
[2012/06/18 09:47:11] ndomod: NDOMOD 1.5.1 (05-15-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
[2012/06/18 09:47:11] ndomod: Could not open data sink!  I'll keep trying, but some output may get lost...
[2012/06/18 09:47:11] Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
[2012/06/18 09:47:27] ndomod: Successfully connected to data sink.  25715 items lost, 5000 queued items to flush.
[2012/06/18 09:47:27] ndomod: Successfully flushed 5000 queued items to data sink.
After Apply:

Code: Select all

[2012/06/18 09:56:46] Caught SIGTERM, shutting down...
[2012/06/18 09:56:46] Successfully shutdown... (PID=8690)
[2012/06/18 09:56:47] Event broker module '/usr/local/nagios/lib/dnxPlugin.so' deinitialized successfully.
[2012/06/18 09:56:47] ndomod: Shutdown complete.
[2012/06/18 09:56:47] Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
[2012/06/18 09:56:49] Nagios 3.4.1 starting... (PID=17907)
[2012/06/18 09:56:49] Local time is Mon Jun 18 09:56:49 MDT 2012
[2012/06/18 09:56:49] LOG VERSION: 2.0
[2012/06/18 09:56:49] Event broker module '/usr/local/nagios/lib/dnxPlugin.so' initialized successfully.
[2012/06/18 09:56:49] ndomod: NDOMOD 1.5.1 (05-15-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
[2012/06/18 09:56:49] ndomod: Successfully connected to data sink.  0 queued items to flush.
[2012/06/18 09:56:49] Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
However, we started nagios before starting NDO, so this is most likely just because NDO was not running yet (we see this in the log whenever we start up in that order).
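To make it easier to tell a clean start from one where ndomod dropped data, the reconnect lines can be pulled out of the Nagios log. This is only a minimal sketch: the function name `ndomod_lost_items` is made up for illustration, and it assumes the log lines are shaped exactly like the ones quoted above.

```shell
# Hypothetical helper (not part of Nagios or ndoutils): list ndomod
# reconnects that report lost items. The pattern is an assumption based
# on the log lines quoted in this thread.
ndomod_lost_items() {
    # $1: path to a Nagios log file; defaults to the path used in this thread
    log="${1:-/usr/local/nagios/var/nagios.log}"
    # Print "timestamp lost=N queued=M" for every matching reconnect line.
    sed -n 's/^\[\([^]]*\)] ndomod: Successfully connected to data sink\.  *\([0-9][0-9]*\) items lost, \([0-9][0-9]*\) queued items to flush.*/\1 lost=\2 queued=\3/p' "$log"
}
```

Calling `ndomod_lost_items` with no argument reads the default log path. A start where ndo2db was already listening logs no "items lost" figure, so it produces no output.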
gwakem
Posts: 238
Joined: Mon Jan 23, 2012 2:02 pm
Location: Asheville, NC

Re: Checks stop running randomly

Post by gwakem »

Please note that we ran the repair on the nagios database only, and not the nagiosql DB. If you believe we should try that also, we can.
--
Griffin Wakem
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Checks stop running randomly

Post by mguthrie »

Yeah, we have been seeing some temporary oddness with ndoutils after the upgrade to 3.x on larger systems. Your culprit is probably right here:

Code: Select all

[2012/06/18 09:47:27] ndomod: Successfully connected to data sink.  25715 items lost, 5000 queued items to flush.
[2012/06/18 09:47:27] ndomod: Successfully flushed 5000 queued items to data sink.
The latest version of ndoutils uses asynchronous writes, so there may have been some oddness on the system at a pretty low level for a little bit. Are you still seeing the inconsistency in the check times?
gwakem
Posts: 238
Joined: Mon Jan 23, 2012 2:02 pm
Location: Asheville, NC

Re: Checks stop running randomly

Post by gwakem »

We are. Even after running an apply, checks still randomly stop. They will run for a while and then exhibit the "freeze" issue, where last check and next check both show the same time, and that time is in the past.
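For anyone wanting to catch the freeze without watching the UI, overdue checks can be listed straight from the Nagios status file. This is a rough sketch under several assumptions: the status file is at `/usr/local/nagios/var/status.dat`, its `hoststatus` and `servicestatus` blocks carry `last_check` and `next_check` as Unix timestamps, and the function name `stalled_checks` and the 300 second grace period are invented for illustration.

```shell
# Hypothetical helper: list hosts/services whose next_check is already in
# the past by more than a grace period. File path and field names are
# assumptions about the local Nagios install.
stalled_checks() {
    statusfile="${1:-/usr/local/nagios/var/status.dat}"
    now="${2:-$(date +%s)}"      # "now" can be overridden for testing
    grace="${3:-300}"            # seconds of slack before flagging
    awk -v now="$now" -v grace="$grace" '
        # Start of a host or service status block: reset the fields.
        $1 ~ /^(host|service)status$/ && $2 == "{" {
            inblk = 1; host = ""; svc = ""; last = 0; nxt = 0; next
        }
        # End of block: report it if the next check is overdue.
        inblk && $1 == "}" {
            if (nxt > 0 && nxt + grace < now + 0)
                printf "%s;%s last_check=%d next_check=%d\n", host, svc, last, nxt
            inblk = 0; next
        }
        # Inside a block: split "key=value" on the first "=".
        inblk {
            line = $0
            sub(/^[ \t]+/, "", line)
            eq = index(line, "=")
            if (eq == 0) next
            key = substr(line, 1, eq - 1); val = substr(line, eq + 1)
            if (key == "host_name") host = val
            else if (key == "service_description") svc = val
            else if (key == "last_check") last = val + 0
            else if (key == "next_check") nxt = val + 0
        }
    ' "$statusfile"
}
```

Run from cron, the output could be mailed whenever it is non-empty. Entries for hosts print with an empty service name after the semicolon.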
--
Griffin Wakem
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Checks stop running randomly

Post by mguthrie »

Can you send us the grep of your system log?

Code: Select all

grep ndo2db /var/log/messages
If ndoutils is dropping any data, it will show up there. Can you also post what values you have in /etc/sysctl.conf for the following?

Code: Select all

# Controls the maximum size of a message, in bytes
kernel.msgmnb = 131072000

# Controls the default maximum size of a message queue
kernel.msgmax = 131072000
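A quick way to read those two settings out of the file is sketched below. The function name `sysctl_conf_value` is hypothetical, and the parsing assumes plain `key = value` lines with `#` or `;` comments.

```shell
# Hypothetical helper: print the value a sysctl.conf-style file assigns to
# a key. If the key appears more than once, the last assignment is printed.
sysctl_conf_value() {
    # $1: key (e.g. kernel.msgmnb), $2: file (defaults to /etc/sysctl.conf)
    awk -F= -v key="$1" '
        /^[ \t]*[#;]/ { next }                    # skip comment lines
        {
            k = $1; gsub(/[ \t]/, "", k)          # strip spaces around the key
            if (k == key) {
                v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v)
                val = v
            }
        }
        END { if (val != "") print val }
    ' "${2:-/etc/sysctl.conf}"
}
```

The result of `sysctl_conf_value kernel.msgmnb` can then be compared with `sysctl -n kernel.msgmnb`, which prints the value the running kernel is using. A mismatch would suggest the file was edited but never reloaded.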
gwakem
Posts: 238
Joined: Mon Jan 23, 2012 2:02 pm
Location: Asheville, NC

Re: Checks stop running randomly

Post by gwakem »

Output from /var/log/messages for the last few days:

Code: Select all

Jun 12 17:46:56 sidhqmonm0 nagios: ndomod: Please check remote ndo2db log, database connection or SSL Parameters
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: mysql_query() failed for 'DELETE FROM nagios_timedeventqueue WHERE instance_id='1' AND event_type='0' AND scheduled_time=FROM_UNIXTIME(1339549410) AND recurring_event='0' AND object_id='7625'' 
Jun 12 19:03:30 sidhqmonm0 ndo2db: mysql_error: 'MySQL server has gone away' 
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: Connection to MySQL database has been lost! 
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: mysql_query() failed for 'DELETE FROM nagios_timedeventqueue WHERE instance_id='1' AND scheduled_time<FROM_UNIXTIME(1339549410)' 
Jun 12 19:03:30 sidhqmonm0 ndo2db: mysql_error: 'MySQL server has gone away' 
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: Connection to MySQL database has been lost! 
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: mysql_query() failed for 'UPDATE nagios_conninfo SET disconnect_time=NOW(), last_checkin_time=NOW(), data_end_time=FROM_UNIXTIME(0), bytes_processed='59756811', lines_processed='5893057', entries_processed='185956' WHERE conninfo_id='1877'' 
Jun 12 19:03:30 sidhqmonm0 ndo2db: mysql_error: 'MySQL server has gone away' 
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: Connection to MySQL database has been lost! 
Jun 12 19:03:30 sidhqmonm0 nagios: ndomod: Please check remote ndo2db log, database connection or SSL Parameters
Jun 15 14:33:25 sidhqmonm0 ndo2db: Error: queue send error, retrying... 
Jun 15 14:43:38 sidhqmonm0 ndo2db: Error: queue send error, retrying... 
Jun 15 14:46:32 sidhqmonm0 nagios: ndomod: Please check remote ndo2db log, database connection or SSL Parameters
Jun 15 14:48:18 sidhqmonm0 ndo2db: Error: queue send error, retrying... 
Jun 15 14:52:08 sidhqmonm0 ndo2db: Error: queue send error, retrying... 
Jun 15 14:58:28 sidhqmonm0 ndo2db: Error: queue send error, retrying... 
Jun 15 15:05:16 sidhqmonm0 ndo2db: Error: queue send error, retrying... 
Jun 15 15:10:30 sidhqmonm0 ndo2db: Error: queue send error, retrying... 
Jun 15 15:14:45 sidhqmonm0 ndo2db: Error: queue send error, retrying...
Jun 18 07:47:19 sidhqmonm0 ndo2db: Error: mysql_query() failed for 'UPDATE nagios_conninfo SET disconnect_time=NOW(), last_checkin_time=NOW(), data_end_time=FROM_UNIXTIME(0), bytes_processed='0', lines_processed='0', entries_processed='0' WHERE conninfo_id='0'' 
Jun 18 07:47:19 sidhqmonm0 ndo2db: mysql_error: 'MySQL server has gone away' 
Jun 18 07:47:19 sidhqmonm0 ndo2db: Error: Connection to MySQL database has been lost!
The last entries, at 07:47, are from when someone came in, saw that graphs had stopped, and did an apply.
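Since the errors fall into a few repeating types, a per-day count may be easier to read than the raw lines. The sketch below assumes standard syslog lines like the ones above; the function name `ndo2db_error_summary` and the three category labels are invented for illustration.

```shell
# Hypothetical helper: count ndo2db error types per day in a syslog file.
# Only lines logged by ndo2db itself are counted; the matched message
# strings are taken from the log excerpt in this thread.
ndo2db_error_summary() {
    awk '
        / ndo2db: / {
            day = $1 " " $2                       # e.g. "Jun 15"
            if (index($0, "queue send error"))
                kind = "queue_send_error"
            else if (index($0, "MySQL server has gone away"))
                kind = "mysql_gone_away"
            else if (index($0, "Connection to MySQL database has been lost"))
                kind = "mysql_conn_lost"
            else
                next                              # ignore other ndo2db lines
            count[day " " kind]++
        }
        END { for (k in count) print k, count[k] }
    ' "${1:-/var/log/messages}" | LC_ALL=C sort
}
```

Each output line has the form `Mon DD category count`, which makes it easy to see whether the queue errors and the lost MySQL connections cluster on the same days.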

/etc/sysctl.conf

Code: Select all

# Controls the maximum size of a message, in bytes
kernel.msgmnb = 131072000

# Controls the default maxmimum size of a mesage queue
kernel.msgmax = 131072000
--
Griffin Wakem
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Checks stop running randomly

Post by mguthrie »

Try adding the following lines to /etc/my.cnf underneath

Code: Select all

[mysqld]

Code: Select all

max_connections=200
connect_timeout=30
Then restart mysqld

Code: Select all

service mysqld restart
It appears as though you're either hitting the max connections for MySQL or the connections are timing out. Is your MySQL on the same server as XI, or is it offloaded to a second machine? (If it's offloaded, make the above changes on the remote machine.)
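To confirm the two options ended up in the right section of the file, something like the following could be used. It is a sketch only: the function name `mycnf_value` is hypothetical, and it handles just simple `key=value` lines in an INI-style file.

```shell
# Hypothetical helper: print the value of an option inside one section of
# a my.cnf-style file. Dashes in option names are treated as underscores,
# since MySQL accepts both spellings.
mycnf_value() {
    # $1: section name without brackets, $2: option name, $3: file
    awk -v section="$1" -v key="$2" '
        /^[ \t]*\[/ {                             # section header line
            s = $0
            sub(/^[ \t]*\[/, "", s); sub(/\].*$/, "", s)
            cur = s; next
        }
        /^[ \t]*[#;]/ { next }                    # skip comment lines
        cur == section {
            n = index($0, "=")
            if (n == 0) next
            k = substr($0, 1, n - 1); v = substr($0, n + 1)
            gsub(/[ \t]/, "", k); gsub(/-/, "_", k)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) val = v
        }
        END { if (val != "") print val }
    ' "${3:-/etc/my.cnf}"
}
```

After the restart, `mycnf_value mysqld max_connections` should print the configured value. The running server can be cross-checked from the mysql client with `SHOW VARIABLES LIKE 'max_connections';` and `SHOW GLOBAL STATUS LIKE 'Max_used_connections';`.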