MySQL Database Crashes

mguthrie · Post by **mguthrie** » Wed Jul 18, 2012 9:29 am

Apologies for the delayed response. I asked our lead developer to look at this with us and see if he has any ideas, I haven't seen this particular set of symptoms before.

Just to rule it out as a possibility, could there be any relationship between when the database maintenance runs are being done and that load spike? In the Admin->Performance settings->Database(tab), there are settings for keeping the tables trimmed and how often they run.

Do you still happen to have that log capture that shows what happens during the 4 minutes after midnight?
Anything revealing in either the nagios.log or the syslog around that time?

Post by **gwakem** » Wed Jul 18, 2012 10:36 am

I dont have a dump of the SQL queries that occurred, but I do show at exactly midnight (in both the messages.log and nagios.log,) we get this message:

Code: Select all

Jul 18 00:00:00 sidhqmonm0 nagios: LOG ROTATION: DAILY 
Jul 18 00:00:00 sidhqmonm0 nagios: LOG VERSION: 2.0 
Jul 18 00:00:00 sidhqmonm0 nagios: CURRENT HOST STATE: IMMAHOST;UP;HARD;1;OK - 10.14.1.141: rta 1.594ms, lost 0% 
Jul 18 00:00:00 sidhqmonm0 nagios: CURRENT HOST STATE: METOOMETOO;UP;HARD;1;OK - 10.14.1.142: rta 1.894ms, lost 0%

it takes 3 minutes and 14 seconds for all hosts and services to finish registering "CURRENTHOST/SERVICE STATE", during which time no checks are run, followed immediately by the geaman workers and gearmand reporting stale entries (which makes sense, since no checks were occurring during this time):

Code: Select all

Jul 18 00:03:14 sidhqmonm0 nagios: Warning: The results of service 'Mod_Gearman Workers' on host 'sidhqmonc2' are stale by 0d 0h 2m 24s (threshold=0d 0h 5m 15s).  I'm forcing an immediate check of the service.
Jul 18 00:03:14 sidhqmonm0 nagios: Warning: The results of service 'Mod_Gearman Eventhandlers' on host 'sidhqmonm0' are stale by 0d 0h 2m 24s (threshold=0d 0h 5m 15s).  I'm forcing an immediate check of the service.
Jul 18 00:03:14 sidhqmonm0 nagios: Warning: The results of service 'Mod_Gearman Host Queue' on host 'sidhqmonm0' are stale by 0d 0h 2m 28s (threshold=0d 0h 5m 15s).  I'm forcing an immediate check of the service.
Jul 18 00:03:14 sidhqmonm0 nagios: Warning: The results of service 'Mod_Gearman Service Queue' on host 'sidhqmonm0' are stale by 0d 0h 2m 9s (threshold=0d 0h 5m 15s).  I'm forcing an immediate check of the service.

The Admin > Performance Settings > Database (under NDOUtils Database) are set to:

Code: Select all

Max External Commands Age: 7
Max time in DAYS to keep external commands.
Max Log Entries Age:	 90
Max time in DAYS to keep log entries.
Max Notifications Age: 90
Max time in DAYS to keep notifications.
Max State History Age: 730
Max time in DAYS to keep state history information .
Max Timed Events Age: 5
Max time in minutes to keep timed events.
Max System Commands Age: 5
Max time in minutes to keep system commands.
Max Service Checks Age: 5
Max time in minutes to keep service checks.
Max Host Checks Age: 5
Max time in minutes to keep host checks.
Max Event Handlers Age: 5
Max time in minutes to keep event handlers.
Optimize Interval: 60
Max time in minutes between optimization runs.

I show that the nagios process doesn't restart, so I'm stumped as to why we suddenly have a re-registering of service and host state. We know that this doesn't seem to hit the retention.dat, but makes direct queries to the database for everything, all at once, completely overwhelming it. We found that if we script a mysqld restart when we hit over a 5 minute load of 10 (which occurs in the span of 1 minute,) it clears the thread queue and MySQL suddenly isn't slammed anymore. The load drops, and everything goes great. If we don't restart mysql quickly however, all checks and graphing lock up until we get in and restart mysql and reboot the nagios server itself. Just applying gives us some serious gearman orphaned check alerts, and the scheduler appears funky.

I dont know whats causing the log rotation, as I dont see anything in nagios's crontab, root's crontab, /etc/cron.d/, /etc/dron.daily (which is set to run at 04:02 anyway,) etc. I cant find the culprit of the log rotation to see what its doing and possibly disable it.

mguthrie · Post by **mguthrie** » Wed Jul 18, 2012 11:02 am

Looks like log rotation for is actually handled by the nagios daemon itself, and can be configured in the main nagios.cfg:

http://nagios.sourceforge.net/docs/3_0/ ... ion_method

I'm guessing that it's some sort of blocking process while it runs, but that still doesn't make sense to me why all host/service states are being rechecked at midnight...

mguthrie · Post by **mguthrie** » Wed Jul 18, 2012 11:04 am

Do you see anything odd in the mod gearman logs around that time?

Post by **gwakem** » Wed Jul 18, 2012 12:47 pm

/facepalm.. I cant believe I overlooked that. I disabled the log rotation so I can see the results tonight.

gearmand is set to only log errors, and it doesn't show any, but I'll throw on debug for tonight and see what we get. I show that at midnight the gearman_worker restarted on all children, and didnt receive any new jobs until after the services and hosts stopped registering at the 3:17 mark. The checks they had previously were most likely wiped by the restart, leading to the "gearman stale" log entries. The gearman_worker did have debug on (not super debug, but basic debug,) and that was all it showed.

If we get no load tonight, we know the load on MySQL is due to the log rotation. If we still get it, then at least we have some debug from gearmand. Tomorrow night, if the problem still exists, Ill enable debug on the nagios daemon before midnight and see if that gets us any info.

Post by **gwakem** » Thu Jul 19, 2012 9:19 am

We isolated the issue.. the load spike seems to only happen after an internal logrotate. I disabled it last night in the nagios.cfg and am attaching appropriately named screenshots:

Before:

Before.jpg

After:

After.jpg

We will test to see what happens if we rotate the log through other means (eg. logrotate) but while it would surprise me, do you know if this rotation does anything to the retention.dat?

scottwilkerson · Post by **scottwilkerson** » Thu Jul 19, 2012 10:02 am

gwakem wrote: We will test to see what happens if we rotate the log through other means (eg. logrotate) but while it would surprise me, do you know if this rotation does anything to the retention.dat?

I don't believe so but you could have been experiencing such a high IO that nothing else could get access to the drive...

Post by **gwakem** » Thu Jul 19, 2012 10:48 am

We monitor I/O on the Nagios Parent via a check executed from the children, and IO was actually lower than during the day. I double checked with SAR and it confirmed the same, but SAR checks at 10 minute intervals, while the nagios check happens at 5 minute intervals, so its entirely possible both checks could have missed an I/O load spike, though I haven't seen any evidence that such a spike is occurring.

But even so, we isolated the reason for the load spike on the DB to the re-registering of items triggered by the log rotation. So even if there was some kind of a spike preventing access to files, it would only have been triggered by the fact that nagios didn't hit the retention.dat first, and instead flooded the database.

mguthrie · Post by **mguthrie** » Thu Jul 19, 2012 12:56 pm

I think I've got it: log_initial_states
http://nagios.sourceforge.net/docs/3_0/ ... ial_states

When the log gets rotated, there's a function that dumps/checks all current states to the log file if this option is enabled. I'm betting that this is bombing both Nagios and Ndoutils.

Post by **gwakem** » Thu Jul 19, 2012 1:05 pm

I was looking at that earlier, but it's set to 0.

Code: Select all

log_archive_path=/usr/local/nagios/var/archives
log_external_commands=1   <----- We just changed this an hour ago for unrelated purposes (Pesky users disabling checks! For shame!)
log_file=/usr/local/nagios/var/nagios.log
log_host_retries=1
log_initial_states=0  <----- We haven't changed this
log_notifications=1
log_passive_checks=0
log_rotation_method=n  <------ Changed yesterday
log_service_retries=1

Nagios Support Forum

MySQL Database Crashes

Re: MySQL Database Crashes

Re: MySQL Database Crashes

Re: MySQL Database Crashes

Re: MySQL Database Crashes

Re: MySQL Database Crashes

Re: MySQL Database Crashes

Re: MySQL Database Crashes

Re: MySQL Database Crashes

Re: MySQL Database Crashes

Re: MySQL Database Crashes