We seem to be getting this alert from our Nagios xi Jobs check created by the wizard:
Database Maintenance (dbmaint) stale (629 seconds old)
This happens pretty regularly, but then clears. While the alert is in effect, several parts of the web interface go unresponsive or stale (NagVis shows errors connecting to NDO, etc.). I see we get MySQL long-running processes graphed, but those are limited to about 1 and also clear quickly. I don't know if these are related.
Any idea of where one would start looking?
Nagios xi Jobs Critical
-
- Posts: 238
- Joined: Mon Jan 23, 2012 2:02 pm
- Location: Asheville, NC
--
Griffin Wakem
-
- Posts: 4380
- Joined: Mon Jun 14, 2010 10:21 am
Re: Nagios xi Jobs Critical
Try manually running the following script and see what output you get.
This script automatically trims tables based on the parameters defined in the Admin->Performance menu.
Code: Select all
/usr/local/nagiosxi/cron/dbmaint.php
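Purely as an illustration of what "trimming" means here (these are not the script's literal queries, and the 30-day cutoff just stands in for whatever retention is configured under Admin->Performance), the work boils down to something like:
Code: Select all
-- Illustration only: remove rows older than the configured retention window,
-- then rebuild the tables to reclaim the freed space (the slow step).
DELETE FROM nagios_logentries WHERE logentry_time < NOW() - INTERVAL 30 DAY;
DELETE FROM nagios_statehistory WHERE state_time < NOW() - INTERVAL 30 DAY;
OPTIMIZE TABLE nagios_logentries, nagios_statehistory;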
-
- Posts: 238
- Joined: Mon Jan 23, 2012 2:02 pm
- Location: Asheville, NC
Re: Nagios xi Jobs Critical
That ran with no issues. Perhaps I should lower the optimize interval? I get the feeling this is a symptom of a larger issue, but I can't begin to think of where. Our load average is now super low (2.5, 7.2, 7.5) post-BPI upgrade, and resources seem to be great on the box. Similarly, on the MySQL server, things are all rainbows and unicorns. It's a MySQL utopia over there.
--
Griffin Wakem
-
- Posts: 4380
- Joined: Mon Jun 14, 2010 10:21 am
Re: Nagios xi Jobs Critical
For larger systems that might be a good idea. Generally, the more you keep your tables trimmed and optimized, the better the system will perform as you scale.
It's possible that the optimization run was occasionally timing out on the cron, which may have held up DB requests from other scripts. Hard to say for sure though.
The logs from the XI subsystem cron scripts are all located at /usr/local/nagiosxi/var/<scriptname>.log, so if you see this issue recur, you can do a running tail on dbmaint.log and see the debug output of what's going on.
-
- Posts: 238
- Joined: Mon Jan 23, 2012 2:02 pm
- Location: Asheville, NC
Re: Nagios xi Jobs Critical
Ahh, goldmine, thanks. On a side note, it appears these logs are being overwritten each time the script runs, not appended to. We altered that in our cron to get an idea of what's happening, but I didn't know whether that was intended or not.
--
Griffin Wakem
-
- Posts: 26
- Joined: Thu Mar 29, 2012 10:26 am
Re: Nagios xi Jobs Critical
Mr Guthrie,
Thank you for the head start; it gave us what to look for.
So this is due to the table optimization in MySQL.
It's stalling out when deleting from nagios_logentries (910K rows) & nagios_statehistory (446K rows).
This is because it copies the whole table to tmp space, deletes the rows, then copies everything back row by row to prevent fragmentation.
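You can watch that phase from the MySQL side while dbmaint runs (a quick check, assuming a shell on the DB server; Data_free shows how much dead space the deletes leave behind for the rebuild to reclaim):
Code: Select all
-- The long-running dbmaint thread and its State (the table-copy/rebuild phase)
-- show up here while the pass is running.
SHOW FULL PROCESSLIST;
-- For MyISAM tables, Data_free is the space left behind by deleted rows.
SHOW TABLE STATUS LIKE 'nagios_logentries';
SHOW TABLE STATUS LIKE 'nagios_statehistory';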
This is further complicated by the indexes on those tables (which don't make much sense to me).
nagios_logentries
We have 8 different indexes on this table with no relation to each other. If they are being used, great, but if not, this is going to hamper updates and deletes at this scale, since every index has to be maintained on each write.
Code: Select all
+-------------------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------------------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+
| nagios_logentries | 0 | PRIMARY | 1 | logentry_id | A | 916508 | NULL | NULL | | BTREE | |
| nagios_logentries | 1 | logentry_time | 1 | logentry_time | A | 916508 | NULL | NULL | | BTREE | |
| nagios_logentries | 1 | logentry_data | 1 | logentry_data | A | 916508 | NULL | NULL | | BTREE | |
| nagios_logentries | 1 | instance_id | 1 | instance_id | A | 18 | NULL | NULL | | BTREE | |
| nagios_logentries | 1 | instance_id_2 | 1 | instance_id | A | 18 | NULL | NULL | | BTREE | |
| nagios_logentries | 1 | logentry_time_2 | 1 | logentry_time | A | 916508 | NULL | NULL | | BTREE | |
| nagios_logentries | 1 | entry_time | 1 | entry_time | A | 916508 | NULL | NULL | | BTREE | |
| nagios_logentries | 1 | entry_time_usec | 1 | entry_time_usec | A | 916508 | NULL | NULL | | BTREE | |
+-------------------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+
instance_id & instance_id_2 are identical, down to the col & row count.
As are logentry_time and logentry_time_2.
Whilst I don't know what these were originally built for, logentry_time & logentry_data look to be subsets of data, and should probably be one index.
nagios_statehistory
This table only has 5 separate indexes:
Code: Select all
+---------------------+------------+--------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+---------------------+------------+--------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+
| nagios_statehistory | 0 | PRIMARY | 1 | statehistory_id | A | 455008 | NULL | NULL | | BTREE | |
| nagios_statehistory | 1 | state_time | 1 | state_time | A | 455008 | NULL | NULL | | BTREE | |
| nagios_statehistory | 1 | object_id | 1 | object_id | A | 5986 | NULL | NULL | | BTREE | |
| nagios_statehistory | 1 | instance_id | 1 | instance_id | A | 16 | NULL | NULL | | BTREE | |
| nagios_statehistory | 1 | instance_id | 2 | object_id | A | 15166 | NULL | NULL | | BTREE | |
| nagios_statehistory | 1 | state_time_2 | 1 | state_time | A | 455008 | NULL | NULL | | BTREE | |
| nagios_statehistory | 1 | state_time_2 | 2 | state_time_usec | A | 455008 | NULL | NULL | | BTREE | |
+---------------------+------------+--------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+
state_time & state_time_2 look like the latter was meant to be a replacement for the former.
This may be an artifact from the DB update issue we had in the previous ticket; it does not look like the old indexes were dropped before the new ones were created.
If you guys sign off on it, I would like to drop the duplicate indexes, which should give us some speed back as well as just clean up after ourselves.
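For the record, this is roughly what I have in mind, based on the key names above (a sketch pending your sign-off; the names may differ on other installs):
Code: Select all
-- Drop the duplicated indexes; the original copies stay in place.
ALTER TABLE nagios_logentries
  DROP INDEX instance_id_2,
  DROP INDEX logentry_time_2;
-- state_time_2 already covers (state_time, state_time_usec), so its leftmost
-- prefix can serve the same lookups as the single-column state_time index.
ALTER TABLE nagios_statehistory
  DROP INDEX state_time;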
-
- Posts: 4380
- Joined: Mon Jun 14, 2010 10:21 am
Re: Nagios xi Jobs Critical
Yeah, I think something went haywire somewhere in that update process. I'd consider just dropping the whole table and recreating it. Here's the original table-creation query used to build that table on a clean install:
Code: Select all
CREATE TABLE IF NOT EXISTS `nagios_logentries` (
`logentry_id` int(11) NOT NULL auto_increment,
`instance_id` int(11) NOT NULL default '0',
`logentry_time` datetime NOT NULL default '0000-00-00 00:00:00',
`entry_time` datetime NOT NULL default '0000-00-00 00:00:00',
`entry_time_usec` int(11) NOT NULL default '0',
`logentry_type` int(11) NOT NULL default '0',
`logentry_data` varchar(255) character set latin1 NOT NULL default '',
`realtime_data` smallint(6) NOT NULL default '0',
`inferred_data_extracted` smallint(6) NOT NULL default '0',
PRIMARY KEY (`logentry_id`)
) ENGINE=MyISAM COMMENT='Historical record of log entries';
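If you want to keep the existing history while you rebuild, one way to do it (just a sketch, assuming the stock column list above) is to rename the old table out of the way, run the CREATE above, and copy the rows back:
Code: Select all
-- Sketch: rebuild nagios_logentries with clean indexes without losing rows.
RENAME TABLE nagios_logentries TO nagios_logentries_old;
-- ...run the CREATE TABLE statement above to recreate nagios_logentries...
INSERT INTO nagios_logentries
SELECT logentry_id, instance_id, logentry_time, entry_time, entry_time_usec,
       logentry_type, logentry_data, realtime_data, inferred_data_extracted
FROM nagios_logentries_old;
DROP TABLE nagios_logentries_old;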
-
- Posts: 4380
- Joined: Mon Jun 14, 2010 10:21 am
Re: Nagios xi Jobs Critical
It appears this has uncovered a bug. Upon closer inspection, there are certain circumstances where some ALTER TABLE queries can be run against the logentries table, and if indexes on those columns already exist, MySQL will happily add duplicate indexes with a numbered suffix instead of erroring out. Apparently it's a MySQL "feature." Nice catch! We'll get that updated for the 3.2 release.
[Edit]: Looks like this could show up in a few other tables as well. If you see these duplicate indexes, go ahead and delete them. We'll work on finding them and fixing this for the next upgrade.
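If it helps hunting these down, a query along these lines (a sketch; it assumes the NDO database is named 'nagios', so substitute yours) lists any indexes on the same table that cover exactly the same columns:
Code: Select all
-- Sketch: find indexes on a table that cover an identical column list.
SELECT table_name, cols, GROUP_CONCAT(index_name) AS duplicate_indexes
FROM (
    SELECT TABLE_NAME AS table_name,
           INDEX_NAME AS index_name,
           GROUP_CONCAT(COLUMN_NAME ORDER BY SEQ_IN_INDEX) AS cols
    FROM information_schema.STATISTICS
    WHERE TABLE_SCHEMA = 'nagios'
    GROUP BY TABLE_NAME, INDEX_NAME
) AS idx
GROUP BY table_name, cols
HAVING COUNT(*) > 1;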