Checks stop running randomly
-
- Posts: 238
- Joined: Mon Jan 23, 2012 2:02 pm
- Location: Asheville, NC
Re: Checks stop running randomly
The attached pictures were taken at 09:25 MST to give you an idea of what we're seeing. We disabled passive checks to rule out any possibility of that causing issues, but didn't force a recheck, as there was no guarantee it would freeze again.
--
Griffin Wakem
-
- Posts: 4380
- Joined: Mon Jun 14, 2010 10:21 am
Re: Checks stop running randomly
I'd like to rule out DB corruption as a possibility. Let's go ahead and run the following procedure to make sure that's not causing an issue.
http://assets.nagios.com/downloads/nagi ... tabase.pdf
Also, can you do a running tail on your system log and nagios log and see if there are any ndo2db related errors showing up?
Code: Select all
tail -f /var/log/messages
tail -f /usr/local/nagios/var/nagios.log
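Since both logs are busy on a large system, it may help to follow them together and keep only the NDO-related lines. The following is a minimal bash sketch, not part of the original procedure; the helper name `filter_ndo` is made up for illustration, and the two paths are the ones quoted above.

```shell
# Keep only lines that mention the NDO components
# (the ndo2db daemon or the ndomod broker module).
# filter_ndo is a hypothetical helper name.
filter_ndo() {
    grep --line-buffered -E 'ndo2db|ndomod'
}

# Usage (runs until interrupted with Ctrl-C):
#   tail -f /var/log/messages /usr/local/nagios/var/nagios.log | filter_ndo
```

When `tail` follows more than one file it prints `==> file <==` headers; those are dropped by the filter as well, so the output is just the matching log lines.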
-
- Posts: 238
- Joined: Mon Jan 23, 2012 2:02 pm
- Location: Asheville, NC
Re: Checks stop running randomly
We stopped nagios and ndo2db, and ran the repair (full output below).
When we restarted nagios and ndo2db, we found some oddness: host checks were not running, and service checks were not showing up underneath the hosts. We confirmed that the services still existed and were enabled. It took an apply (with no changes, just an apply) to bring everything back into place and get it running again. Attaching screenshots of the strangeness.
Code: Select all
[root@sidhqmonadm0 ~]# service mysqld stop
Stopping MySQL: [ OK ]
[root@sidhqmonadm0 ~]# ps ax |grep mysqld
25900 pts/1 S+ 0:00 grep mysqld
[root@sidhqmonadm0 ~]# ./repairmysql.sh nagios *
DATABASE: nagios
TABLE:
/var/lib/mysql/nagios ~
Stopping MySQL: [FAILED]
- recovering (with sort) MyISAM-table 'nagios_acknowledgements.MYI'
Data records: 2458
- Fixing index 1
---------
- recovering (with sort) MyISAM-table 'nagios_commands.MYI'
Data records: 156
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_commenthistory.MYI'
Data records: 28805
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_comments.MYI'
Data records: 2061
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_configfiles.MYI'
Data records: 1
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_configfilevariables.MYI'
Data records: 138
- Fixing index 1
---------
- recovering (with sort) MyISAM-table 'nagios_conninfo.MYI'
Data records: 1940
- Fixing index 1
---------
- recovering (with sort) MyISAM-table 'nagios_contact_addresses.MYI'
Data records: 0
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_contactgroup_members.MYI'
Data records: 108
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_contactgroups.MYI'
Data records: 36
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_contact_notificationcommands.MYI'
Data records: 1056
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_contactnotificationmethods.MYI'
Data records: 75763
- Fixing index 1
- Fixing index 2
- Fixing index 3
---------
- recovering (with sort) MyISAM-table 'nagios_contactnotifications.MYI'
Data records: 75763
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
---------
- recovering (with sort) MyISAM-table 'nagios_contacts.MYI'
Data records: 88
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_contactstatus.MYI'
Data records: 88
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_customvariables.MYI'
Data records: 2151
- Fixing index 1
- Fixing index 2
- Fixing index 3
---------
- recovering (with sort) MyISAM-table 'nagios_customvariablestatus.MYI'
Data records: 2151
- Fixing index 1
- Fixing index 2
- Fixing index 3
---------
- recovering (with keycache) MyISAM-table 'nagios_dbversion.MYI'
Data records: 0
---------
- recovering (with sort) MyISAM-table 'nagios_downtimehistory.MYI'
Data records: 950
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_eventhandlers.MYI'
Data records: 11
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_externalcommands.MYI'
Data records: 1832
- Fixing index 1
---------
- recovering (with sort) MyISAM-table 'nagios_flappinghistory.MYI'
Data records: 6137
- Fixing index 1
---------
- recovering (with sort) MyISAM-table 'nagios_hostchecks.MYI'
Data records: 980
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_host_contactgroups.MYI'
Data records: 1986
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_host_contacts.MYI'
Data records: 142
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_hostdependencies.MYI'
Data records: 1424
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_hostescalation_contactgroups.MYI'
Data records: 1024
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_hostescalation_contacts.MYI'
Data records: 1002
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_hostescalations.MYI'
Data records: 466
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_hostgroup_members.MYI'
Data records: 2568
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_hostgroups.MYI'
Data records: 253
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_host_parenthosts.MYI'
Data records: 1470
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_hosts.MYI'
Data records: 1113
- Fixing index 1
- Fixing index 2
- Fixing index 3
---------
- recovering (with sort) MyISAM-table 'nagios_hoststatus.MYI'
Data records: 1113
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
- Fixing index 6
- Fixing index 7
- Fixing index 8
- Fixing index 9
- Fixing index 10
- Fixing index 11
- Fixing index 12
- Fixing index 13
- Fixing index 14
- Fixing index 15
- Fixing index 16
- Fixing index 17
- Fixing index 18
- Fixing index 19
---------
- recovering (with sort) MyISAM-table 'nagios_instances.MYI'
Data records: 1
- Fixing index 1
---------
- recovering (with sort) MyISAM-table 'nagios_logentries.MYI'
Data records: 911988
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
---------
- recovering (with sort) MyISAM-table 'nagios_notifications.MYI'
Data records: 120266
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
---------
- recovering (with sort) MyISAM-table 'nagios_objects.MYI'
Data records: 11624
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
---------
- recovering (with sort) MyISAM-table 'nagios_processevents.MYI'
Data records: 11406
- Fixing index 1
---------
- recovering (with sort) MyISAM-table 'nagios_programstatus.MYI'
Data records: 1
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_runtimevariables.MYI'
Data records: 18
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_scheduleddowntime.MYI'
Data records: 338
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_servicechecks.MYI'
Data records: 3034
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
---------
- recovering (with sort) MyISAM-table 'nagios_service_contactgroups.MYI'
Data records: 7009
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_service_contacts.MYI'
Data records: 1407
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_servicedependencies.MYI'
Data records: 620
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_serviceescalation_contactgroups.MYI'
Data records: 4508
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_serviceescalation_contacts.MYI'
Data records: 3995
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_serviceescalations.MYI'
Data records: 1756
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_servicegroup_members.MYI'
Data records: 317
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_servicegroups.MYI'
Data records: 51
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_services.MYI'
Data records: 4179
- Fixing index 1
- Fixing index 2
- Fixing index 3
---------
- recovering (with sort) MyISAM-table 'nagios_servicestatus.MYI'
Data records: 4179
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
- Fixing index 6
- Fixing index 7
- Fixing index 8
- Fixing index 9
- Fixing index 10
- Fixing index 11
- Fixing index 12
- Fixing index 13
- Fixing index 14
- Fixing index 15
- Fixing index 16
- Fixing index 17
- Fixing index 18
- Fixing index 19
---------
- recovering (with sort) MyISAM-table 'nagios_statehistory.MYI'
Data records: 450349
- Fixing index 1
- Fixing index 2
- Fixing index 3
---------
- recovering (with sort) MyISAM-table 'nagios_systemcommands.MYI'
Data records: 98
- Fixing index 1
- Fixing index 2
- Fixing index 3
---------
- recovering (with sort) MyISAM-table 'nagios_timedeventqueue.MYI'
Data records: 4610
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
- Fixing index 6
---------
- recovering (with sort) MyISAM-table 'nagios_timedevents.MYI'
Data records: 0
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
- Fixing index 6
---------
- recovering (with sort) MyISAM-table 'nagios_timeperiods.MYI'
Data records: 18
- Fixing index 1
- Fixing index 2
---------
- recovering (with sort) MyISAM-table 'nagios_timeperiod_timeranges.MYI'
Data records: 108
- Fixing index 1
- Fixing index 2
Starting MySQL: [ OK ]
~
===============
REPAIR COMPLETE
===============
[root@sidhqmonadm0 ~]# service mysqld start
Starting MySQL: [ OK ]
--
Griffin Wakem
-
- Posts: 26
- Joined: Thu Mar 29, 2012 10:26 am
Re: Checks stop running randomly
As for the log tails, here is the output. It looks like something happened during the initial startup and was cleared by the apply. The first block below is from the start after the DB repair, and the second is from after the apply.
However, we started nagios before starting NDO, so the data sink errors in the first block are probably just because NDO was not running yet (we see this in the log whenever we start up in that order).
Code: Select all
[2012/06/18 09:44:49] Caught SIGTERM, shutting down...
[2012/06/18 09:44:49] Successfully shutdown... (PID=24442)
[2012/06/18 09:44:58] Event broker module '/usr/local/nagios/lib/dnxPlugin.so' deinitialized successfully.
[2012/06/18 09:44:58] ndomod: Shutdown complete.
[2012/06/18 09:44:58] Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
[2012/06/18 09:47:11] Nagios 3.4.1 starting... (PID=8676)
[2012/06/18 09:47:11] Local time is Mon Jun 18 09:47:11 MDT 2012
[2012/06/18 09:47:11] LOG VERSION: 2.0
[2012/06/18 09:47:11] Event broker module '/usr/local/nagios/lib/dnxPlugin.so' initialized successfully.
[2012/06/18 09:47:11] ndomod: NDOMOD 1.5.1 (05-15-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
[2012/06/18 09:47:11] ndomod: Could not open data sink! I'll keep trying, but some output may get lost...
[2012/06/18 09:47:11] Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
[2012/06/18 09:47:27] ndomod: Successfully connected to data sink. 25715 items lost, 5000 queued items to flush.
[2012/06/18 09:47:27] ndomod: Successfully flushed 5000 queued items to data sink.
Code: Select all
[2012/06/18 09:56:46] Caught SIGTERM, shutting down...
[2012/06/18 09:56:46] Successfully shutdown... (PID=8690)
[2012/06/18 09:56:47] Event broker module '/usr/local/nagios/lib/dnxPlugin.so' deinitialized successfully.
[2012/06/18 09:56:47] ndomod: Shutdown complete.
[2012/06/18 09:56:47] Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
[2012/06/18 09:56:49] Nagios 3.4.1 starting... (PID=17907)
[2012/06/18 09:56:49] Local time is Mon Jun 18 09:56:49 MDT 2012
[2012/06/18 09:56:49] LOG VERSION: 2.0
[2012/06/18 09:56:49] Event broker module '/usr/local/nagios/lib/dnxPlugin.so' initialized successfully.
[2012/06/18 09:56:49] ndomod: NDOMOD 1.5.1 (05-15-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
[2012/06/18 09:56:49] ndomod: Successfully connected to data sink. 0 queued items to flush.
[2012/06/18 09:56:49] Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
-
- Posts: 238
- Joined: Mon Jan 23, 2012 2:02 pm
- Location: Asheville, NC
Re: Checks stop running randomly
Please note that we ran the repair on the nagios database only, and not the nagiosql DB. If you believe we should try that also, we can.
--
Griffin Wakem
-
- Posts: 4380
- Joined: Mon Jun 14, 2010 10:21 am
Re: Checks stop running randomly
Yeah, we have been seeing some temporary oddness with ndoutils after the upgrade to 3.x on larger systems. Your culprit is probably right here:
Code: Select all
[2012/06/18 09:47:27] ndomod: Successfully connected to data sink. 25715 items lost, 5000 queued items to flush.
[2012/06/18 09:47:27] ndomod: Successfully flushed 5000 queued items to data sink.
The latest version of ndoutils uses asynchronous writes, so there may have been some oddness on the system at a pretty low level for a little bit. Are you still seeing the inconsistency in the check times?
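To see how often ndomod reconnects with lost items, the "items lost" messages can be pulled out of nagios.log. This is a rough bash sketch that assumes the message wording shown in the log lines quoted in this thread; the function name `ndomod_lost_items` is hypothetical.

```shell
# Read nagios.log lines on stdin and print "<timestamp> <lost count>"
# for every ndomod reconnect that reported lost items.
# Assumes the message text matches the lines quoted in this thread.
ndomod_lost_items() {
    sed -n 's/^\[\([^]]*\)\] ndomod: Successfully connected to data sink\. \([0-9][0-9]*\) items lost.*/\1 \2/p'
}

# Usage:
#   ndomod_lost_items < /usr/local/nagios/var/nagios.log
```

A clean reconnect (such as the "0 queued items to flush" line after the apply) produces no output, so anything printed here marks a startup where data was dropped.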
-
- Posts: 238
- Joined: Mon Jan 23, 2012 2:02 pm
- Location: Asheville, NC
Re: Checks stop running randomly
We are. Even after running an apply, checks still randomly stop. They will run for a while and then exhibit the "freeze" issue, where last check and next check both show the same time, and that time is in the past.
--
Griffin Wakem
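One way to catch the freeze described above without watching the web interface is to scan Nagios's status file for checks whose next_check is well in the past. The sketch below is an illustration only: it assumes the Nagios 3 status.dat layout (`hoststatus {` / `servicestatus {` blocks with `key=value` lines), the function name `stale_checks` and the 300-second grace period are made up, and a next_check that is a few seconds old is normal while a check is queued or running.

```shell
# List hosts/services whose next_check is more than <grace> seconds in the past.
#   $1 = path to status.dat
#   $2 = "now" in epoch seconds (defaults to the current time)
#   $3 = grace period in seconds (defaults to 300, an arbitrary choice)
# stale_checks is a hypothetical helper; requires bash and awk.
stale_checks() {
    local file=$1 now=${2:-$(date +%s)} grace=${3:-300}
    awk -v now="$now" -v grace="$grace" '
        # Start of a host or service status block: reset the collected fields.
        /^(host|service)status [{]/ {
            inblock = 1; host = ""; svc = ""; last = 0; next_t = 0; active = 1
            next
        }
        # End of block: report it if active checks are on and next_check is stale.
        inblock && /^[ \t]*[}]/ {
            if (active == 1 && next_t > 0 && next_t + grace < now)
                printf "%s;%s last_check=%d next_check=%d\n", host, svc, last, next_t
            inblock = 0
            next
        }
        # Inside a block: split each line at the first "=".
        inblock {
            line = $0
            sub(/^[ \t]+/, "", line)
            eq = index(line, "=")
            if (eq == 0) next
            key = substr(line, 1, eq - 1)
            val = substr(line, eq + 1)
            if (key == "host_name") host = val
            else if (key == "service_description") svc = val
            else if (key == "last_check") last = val + 0
            else if (key == "next_check") next_t = val + 0
            else if (key == "active_checks_enabled") active = val + 0
        }
    ' "$file"
}

# Usage (path assumes the default source-install location):
#   stale_checks /usr/local/nagios/var/status.dat
```

Running this from cron and comparing the output over time would show whether the same objects stay frozen or whether the whole scheduler stalls at once.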
-
- Posts: 4380
- Joined: Mon Jun 14, 2010 10:21 am
Re: Checks stop running randomly
Can you send us the grep of your system log?
Code: Select all
cat /var/log/messages | grep ndo2db
If ndoutils is dropping any data it would show up there. Can you also post what values you have in /etc/sysctl.conf for the following?
Code: Select all
# Controls the maximum size of a message, in bytes
kernel.msgmnb = 131072000
# Controls the default maxmimum size of a mesage queue
kernel.msgmax = 131072000
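The values in /etc/sysctl.conf only take effect once they are loaded, so it can be worth comparing the file against what the kernel is actually using. The sketch below reads one key out of a sysctl.conf-style file; `sysctl_conf_get` is a hypothetical helper name, and the comparison against `sysctl -n` is left as a usage comment.

```shell
# Print the value configured for a key in a sysctl.conf-style file.
#   $1 = key, e.g. kernel.msgmnb
#   $2 = file (defaults to /etc/sysctl.conf)
# If the key appears more than once, the last occurrence is printed.
# sysctl_conf_get is a hypothetical helper name.
sysctl_conf_get() {
    awk -F= -v key="$1" '
        /^[ \t]*[#;]/ { next }                 # skip comment lines
        {
            k = $1; v = $2
            gsub(/[ \t]/, "", k)               # strip whitespace around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)     # trim the value
            if (k == key) val = v
        }
        END { if (val != "") print val }
    ' "${2:-/etc/sysctl.conf}"
}

# Usage: compare the configured value with the running one.
#   sysctl_conf_get kernel.msgmnb
#   sysctl -n kernel.msgmnb
```

If the two differ, the file was probably edited without being reloaded; `ipcs -q` additionally shows how full the message queues are at the moment.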
-
- Posts: 238
- Joined: Mon Jan 23, 2012 2:02 pm
- Location: Asheville, NC
Re: Checks stop running randomly
Output from the messages log for the last few days is below. The last entry, at 07:47 on Jun 18, is where someone came in, saw that graphs had stopped, and did an apply.
Code: Select all
Jun 12 17:46:56 sidhqmonm0 nagios: ndomod: Please check remote ndo2db log, database connection or SSL Parameters
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: mysql_query() failed for 'DELETE FROM nagios_timedeventqueue WHERE instance_id='1' AND event_type='0' AND scheduled_time=FROM_UNIXTIME(1339549410) AND recurring_event='0' AND object_id='7625''
Jun 12 19:03:30 sidhqmonm0 ndo2db: mysql_error: 'MySQL server has gone away'
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: Connection to MySQL database has been lost!
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: mysql_query() failed for 'DELETE FROM nagios_timedeventqueue WHERE instance_id='1' AND scheduled_time<FROM_UNIXTIME(1339549410)'
Jun 12 19:03:30 sidhqmonm0 ndo2db: mysql_error: 'MySQL server has gone away'
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: Connection to MySQL database has been lost!
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: mysql_query() failed for 'UPDATE nagios_conninfo SET disconnect_time=NOW(), last_checkin_time=NOW(), data_end_time=FROM_UNIXTIME(0), bytes_processed='59756811', lines_processed='5893057', entries_processed='185956' WHERE conninfo_id='1877''
Jun 12 19:03:30 sidhqmonm0 ndo2db: mysql_error: 'MySQL server has gone away'
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: Connection to MySQL database has been lost!
Jun 12 19:03:30 sidhqmonm0 nagios: ndomod: Please check remote ndo2db log, database connection or SSL Parameters
Jun 15 14:33:25 sidhqmonm0 ndo2db: Error: queue send error, retrying...
Jun 15 14:43:38 sidhqmonm0 ndo2db: Error: queue send error, retrying...
Jun 15 14:46:32 sidhqmonm0 nagios: ndomod: Please check remote ndo2db log, database connection or SSL Parameters
Jun 15 14:48:18 sidhqmonm0 ndo2db: Error: queue send error, retrying...
Jun 15 14:52:08 sidhqmonm0 ndo2db: Error: queue send error, retrying...
Jun 15 14:58:28 sidhqmonm0 ndo2db: Error: queue send error, retrying...
Jun 15 15:05:16 sidhqmonm0 ndo2db: Error: queue send error, retrying...
Jun 15 15:10:30 sidhqmonm0 ndo2db: Error: queue send error, retrying...
Jun 15 15:14:45 sidhqmonm0 ndo2db: Error: queue send error, retrying...
Jun 18 07:47:19 sidhqmonm0 ndo2db: Error: mysql_query() failed for 'UPDATE nagios_conninfo SET disconnect_time=NOW(), last_checkin_time=NOW(), data_end_time=FROM_UNIXTIME(0), bytes_processed='0', lines_processed='0', entries_processed='0' WHERE conninfo_id='0''
Jun 18 07:47:19 sidhqmonm0 ndo2db: mysql_error: 'MySQL server has gone away'
Jun 18 07:47:19 sidhqmonm0 ndo2db: Error: Connection to MySQL database has been lost!
/etc/sysctl.conf
Code: Select all
# Controls the maximum size of a message, in bytes
kernel.msgmnb = 131072000
# Controls the default maxmimum size of a mesage queue
kernel.msgmax = 131072000
--
Griffin Wakem
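With several days of syslog, a per-day count of each error type makes the pattern easier to see than the raw lines. The following bash sketch groups the ndo2db and ndomod messages by the strings that appear in the output above; the function name `ndo_errors_by_day` and the category labels are made up for illustration.

```shell
# Read syslog lines on stdin and print "<Mon> <day> <kind> <count>",
# one line per day and error kind, for ndo2db/ndomod messages only.
# The kinds are matched on message text seen in this thread.
ndo_errors_by_day() {
    awk '
        / ndo2db: / || / ndomod: / {
            kind = "other"
            if ($0 ~ /MySQL server has gone away/) kind = "mysql_gone_away"
            else if ($0 ~ /queue send error/) kind = "queue_send_error"
            else if ($0 ~ /Connection to MySQL database has been lost/) kind = "mysql_connection_lost"
            else if ($0 ~ /mysql_query\(\) failed/) kind = "mysql_query_failed"
            count[$1 " " $2 " " kind]++
        }
        END { for (k in count) print k, count[k] }
    ' | sort
}

# Usage:
#   ndo_errors_by_day < /var/log/messages
```

In the output above, this would separate the Jun 15 run of "queue send error" messages from the "MySQL server has gone away" failures on Jun 12 and Jun 18, which may point at two different problems.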
-
- Posts: 4380
- Joined: Mon Jun 14, 2010 10:21 am
Re: Checks stop running randomly
Try adding the following lines to /etc/my.cnf underneath the section header
Code: Select all
[mysqld]
Code: Select all
max_connections=200
connect_timeout=30
Then restart mysqld
Code: Select all
service mysqld restart
It appears as though you're either hitting the max connections for mysql or the connections are timing out. Is your mysql on the same server as XI, or is it offloaded to a 2nd machine? (If it's offloaded, make the above changes on the remote machine.)
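Settings placed under the wrong section header in my.cnf are silently ignored by the server, so it may be worth confirming that the new lines really sit under [mysqld]. This is a minimal bash sketch for that check; `mycnf_get` is a hypothetical helper, and it only handles simple `key=value` lines.

```shell
# Print the value of a key inside one section of a my.cnf-style file.
#   $1 = section name without brackets, e.g. mysqld
#   $2 = key, e.g. max_connections
#   $3 = file (defaults to /etc/my.cnf)
# mycnf_get is a hypothetical helper name.
mycnf_get() {
    awk -v section="$1" -v key="$2" '
        # Section header: remember which section we are in.
        /^[ \t]*\[/ {
            current = $0
            sub(/^[ \t]*\[/, "", current)
            sub(/\].*$/, "", current)
            next
        }
        /^[ \t]*[#;]/ { next }                 # skip comment lines
        current == section {
            eq = index($0, "=")
            if (eq == 0) next
            k = substr($0, 1, eq - 1)
            v = substr($0, eq + 1)
            gsub(/[ \t]/, "", k)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) val = v
        }
        END { if (val != "") print val }
    ' "${3:-/etc/my.cnf}"
}

# Usage:
#   mycnf_get mysqld max_connections
#   mycnf_get mysqld connect_timeout
```

After the restart, `mysql -e "SHOW VARIABLES LIKE 'max_connections'"` shows what the server is actually running with, and `SHOW STATUS LIKE 'Max_used_connections'` shows how close the server has come to that limit.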