Nagios XI 5.8.1 Mysqld higher load and unstable checks
-
- Posts: 25
- Joined: Tue Jun 05, 2018 9:19 am
Nagios XI 5.8.1 Mysqld higher load and unstable checks
Hi,
A few days ago we upgraded to Nagios XI 5.8.1, and since then we have been seeing higher CPU load, mainly from the mysqld process.
I wonder if this is normal and whether other customers are having the same experience.
On the other hand, it seems that Nagios XI is now using less memory. It's also lower during night hours, so this could be related to the console being open or not.
But the main concern is that the number of active checks has been less stable since the upgrade: drops are normal while a configuration is being applied, but it now seems to take longer to recover to the normal level.
Also, we are not experiencing any slowness in the console, so that's OK.
We have 2400 host checks and 18400 service checks.
Kind regards,
Peter
Last edited by PeterDK on Wed Jan 20, 2021 10:50 am, edited 1 time in total.
-
- Posts: 5324
- Joined: Wed Aug 22, 2018 4:39 pm
- Location: saint paul
Re: Nagios XI 5.8.1 Mysqld higher load and unstable checks
Hi Peter,
Thanks for posting those graphs. Given that it's higher during the day, how many users are usually logged into the server during working hours? And are these users logged in as admins or user accounts with set permissions?
A few more questions:
Have you made any performance modifications to date? If so, which ones?
Can you download a fresh system profile and send it as a PM? Ideally, download the profile during high CPU usage so we can take a look at the running processes and the top command output. Thanks, Benjamin
To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Be sure to check out our Knowledgebase for helpful articles and solutions!
-
- Posts: 2320
- Joined: Wed Mar 20, 2013 5:49 am
- Location: Ghent
Re: Nagios XI 5.8.1 Mysqld higher load and unstable checks
Hi Benjamin,
PeterDK is a colleague of mine and is unavailable today. I'll send you a PM with the profile.
We have about 20 to 25 unique users at peak moments. Yesterday we had some major network issues, resulting in a one-time peak of about 35 concurrent users. The result was that Nagios had a really hard time, which was very annoying, as we needed it to troubleshoot the network issues... Like you say, there seems to be a relation between the active check latency and the number of users using the Nagios XI GUI.
Therefore I decided to increase the dashlet refresh multiplier in the performance settings from 1500 to 3000 (during the network issue yesterday at 9:05 GMT+1). After making that change, the load on Nagios halved and the active check latency decreased. But there is clearly a performance decrease since upgrading to 5.8.1, and I would prefer to keep the dashlet refresh multiplier at 1500 at most.
We applied all or most of the performance-enhancing recommendations (ramdisk, gearman workers, PHP tuning and other tips in https://assets.nagios.com/downloads/nag ... ios-XI.pdf).
Grtz
Willem
Nagios XI 5.8.1
https://outsideit.net
-
- Posts: 5324
- Joined: Wed Aug 22, 2018 4:39 pm
- Location: saint paul
Re: Nagios XI 5.8.1 Mysqld higher load and unstable checks
Hi @WillemDH,
Thanks for the system profile. There are a number of these entries in the nagios.log, and this could be the result of the high CPU usage by the database:
[1611303744] NDO-3: The following query failed while MySQL appears to be connected:
[1611303744] NDO-3: INSERT INTO nagios_scheduleddowntime (instance_id, downtime_type, object_id, entry_time, author_name, comment_data, internal_downtime_id, triggered_by_id, is_fixed, duration, scheduled_start_time, scheduled_end_time) VALUES (1,1,60900,FROM_UNIXTIME(1554368169),'Claeys Stephen','server for citrix PVS golden image',915527,0,1,-1778430201,FROM_UNIXTIME(1554368106),FROM_UNIXTIME(-224062095)) ON DUPLICATE KEY UPDATE instance_id = VALUES(instance_id), downtime_type = VALUES(downtime_type), object_id = VALUES(object_id), entry_time = VALUES(entry_time), author_name = VALUES(author_name), comment_data = VALUES(comment_data), internal_downtime_id = VALUES(internal_downtime_id), triggered_by_id = VALUES(triggered_by_id), is_fixed = VALUES(is_fixed), duration = VALUES(duration), scheduled_start_time = VALUES(scheduled_start_time), scheduled_end_time = VALUES(scheduled_end_time)
Open up a shell and run the following commands, then attach or PM the info.txt file.
Code: Select all
mysql -u root -pnagiosxi -e "show global status like '%used_connections%'; show variables like 'max_connections';" >/tmp/info.txt
echo "SELECT table_schema as 'Database', table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES ORDER BY (data_length + index_length) DESC;" |mysql -t -u root -pnagiosxi >>/tmp/info.txt
Also, run the repair script and let me know if you notice any improvement.
Code: Select all
mysqlcheck -f -r -u root -pnagiosxi --databases nagios --use-frm
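Once /tmp/info.txt has been written by the first command, the two connection figures in it can be compared directly. The following is only a sketch: it assumes mysql wrote its default tab-separated batch output to the file, and the function name connection_usage is made up for illustration.

```shell
#!/bin/sh
# Sketch: report peak connection usage from the info.txt written above.
# Assumes tab-separated "name<TAB>value" lines, which is what the mysql
# client prints when its output is redirected to a file.
connection_usage() {
    awk -F'\t' '
        tolower($1) == "max_used_connections" { used = $2 }
        tolower($1) == "max_connections"      { max = $2 }
        END {
            if (max > 0)
                printf "%d of %d connections used at peak (%.0f%%)\n", used, max, 100 * used / max
        }
    ' "$1"
}

# Usage:
# connection_usage /tmp/info.txt
```

The table-size listing appended to the same file uses the -t table layout, so its lines contain no tabs and are simply ignored by this filter.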
-
- Posts: 25
- Joined: Tue Jun 05, 2018 9:19 am
Re: Nagios XI 5.8.1 Mysqld higher load and unstable checks
Hi everyone,
I've attached two txt files with the output of those mysql commands before and after the repair.
After the repair, which fixed some indexes, the output of this command was indeed cleaner:
Code: Select all
mysqlcheck -f -r -u root -pnagiosxi --databases nagios --use-frm
That is, there were no more "found block outside..." and "deleted links..." messages (I can't remember exactly what the lines were).
Every line now had status OK.
But the load is still high and the NDO-3 failed queries still occur after every apply.
I also created a separate forum thread about those NDO-3 log entries, as I didn't think they were related, but the Nagios process has crashed more often after applies lately:
https://support.nagios.com/forum/viewto ... 16&t=61396
I also tested a backup, which completes in the same time as before. But it seems to take a while before the number of active checks is stable again. That isn't a real problem, as this backup normally runs during the night.
But "apply configuration" happens more often, of course.
-
- Dreams In Code
- Posts: 7682
- Joined: Wed Feb 11, 2015 12:54 pm
Re: Nagios XI 5.8.1 Mysqld higher load and unstable checks
For the failing queries: is it only the downtime ones with an end date in 2038 or some really high end year that are failing, or other ones as well?
What is consuming the load?
Code: Select all
ps aux
top -c -n3
-
- Posts: 25
- Joined: Tue Jun 05, 2018 9:19 am
Re: Nagios XI 5.8.1 Mysqld higher load and unstable checks
Hi,
I think the downtimes were set to end in 2099, but I can't find those objects anymore.
So we are wondering where they are coming from.
As mentioned, see also https://support.nagios.com/forum/viewto ... 16&t=61396
It seems that MariaDB does not support dates after 2038, but the object IDs, for example, don't exist anymore. So why is Nagios still trying to insert them?
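The negative numbers in the failed INSERT quoted earlier in this thread are consistent with a 2099 end time that was stored in a signed 32-bit integer, which wraps around once the epoch passes 2147483647 (19 January 2038). A small sketch that undoes the wrap for the values from that log entry; the function name unwrap_int32 is hypothetical and the final date call assumes GNU date:

```shell
#!/bin/sh
# Sketch: recover the intended epoch from a value that wrapped around
# in a signed 32-bit integer (anything past 2147483647 turns negative).
unwrap_int32() {
    if [ "$1" -lt 0 ]; then
        echo $(( $1 + 4294967296 ))   # add 2^32 to undo the wrap
    else
        echo "$1"
    fi
}

end_time=$(unwrap_int32 -224062095)     # scheduled_end_time from the log entry
duration=$(unwrap_int32 -1778430201)    # duration from the log entry
echo "end_time: $end_time"
echo "duration: $duration"
# scheduled_start_time (1554368106) plus duration should give the same end time
echo "start + duration: $(( 1554368106 + duration ))"
# Assumes GNU date: print the end time as a UTC calendar date
date -u -d "@$end_time" '+%Y-%m-%d %H:%M:%S' 2>/dev/null || true
```

Both the end time and start plus duration come out at 4070905201, which is 2098-12-31 23:00:01 UTC, or one second past midnight on 1 January 2099 in GMT+1. That matches a downtime set to end in 2099.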
The load is mostly coming from mysqld, although I have to say that since we ran the repair databases script again it seems to be doing better.
Before that mysqld was running at 100% and more constantly.
Today after the apply config at 13:00, the Nagios process crashed again and after the restart, mysqld is back at 100% and more.
As you can see in the graph below, the average CPU usage/load has been a lot higher since the upgrade (18/1). Our main concern now is the stability of the active checks.
This graph covers the last 24 hours. The first question mark is at 12:30, which is 30 minutes before a scheduled apply config.
The green mark is expected: it was during our NDO downgrade, which failed.
This morning, the Nagios scheduled backup started at 3:00 and takes about one hour.
But it seems to take a while after that before the active checks are stable again.
The drop at 7:00 could have been caused by another auto apply config.
Other load issues
We have also noticed more load on worker nodes since the upgrade:
- we have two nodes doing VMware checks (box293 Monitoring System Worker). These checks are initiated by the Nagios XI server via "check_by_ssh".
- and we also have a gearman node doing network checks. These checks run locally on a mod-gearman worker node.
As you can see, both graphs show higher CPU load since the upgrade.
It seems that Nagios is asking more from these nodes, or running more checks at the same time instead of spreading them over time.
For the VMware checks, for example, we already raised the number of concurrent checks to 60 and we still hit that limit sometimes.
Before that it was set to 45, and we only hit that limit after a reboot of the Nagios server itself.
Kind regards,
Peter
-
- Dreams In Code
- Posts: 7682
- Joined: Wed Feb 11, 2015 12:54 pm
Re: Nagios XI 5.8.1 Mysqld higher load and unstable checks
I believe these are taken from /usr/local/nagios/var/retention.dat, so you should be able to:
Code: Select all
systemctl stop nagios
Then delete the downtime(s) that are above 2038 in your /usr/local/nagios/var/retention.dat (they're at the bottom) and then start nagios back up:
Code: Select all
systemctl start nagios
Then re-add the downtime in the web interface with an end date before 2038.
Development would need to change the way the code is written to not use FROM_UNIXTIME.
Do you see those errors over and over? I'm wondering if that's the reason that you're seeing the mysqld high load.
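To find which downtimes to delete from retention.dat, something like the following could list the candidates first. This is a sketch under assumptions: it expects the downtimes to be stored as "hostdowntime {" and "servicedowntime {" blocks of key=value lines with downtime_id, host_name and end_time keys, and the function name is invented here. It only reads the file.

```shell
#!/bin/sh
# Sketch: list downtimes in retention.dat that end after 19 January 2038.
# Assumes "hostdowntime {" / "servicedowntime {" blocks containing
# key=value lines such as downtime_id=, host_name= and end_time=.
list_post_2038_downtimes() {
    awk -F= '
        /^(host|service)downtime [{]/ { inside = 1; id = ""; host = ""; endt = ""; next }
        inside {
            key = $1
            sub(/^[ \t]+/, "", key)             # tolerate indented keys
            if (key == "downtime_id")    id = $2
            else if (key == "host_name") host = $2
            else if (key == "end_time")  endt = $2
            else if (key ~ /^[}]/) {            # end of block: report and reset
                if (endt + 0 > 2147483647) print id, host, endt
                inside = 0
            }
        }
    ' "$1"
}

# Usage:
# list_post_2038_downtimes /usr/local/nagios/var/retention.dat
```

Each output line gives the downtime ID, host name and end time of one entry, which makes the matching blocks easier to locate once Nagios has been stopped.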
-
- Posts: 25
- Joined: Tue Jun 05, 2018 9:19 am
Re: Nagios XI 5.8.1 Mysqld higher load and unstable checks
Hi,
Continuing on the post from @ssax in https://support.nagios.com/forum/viewto ... 16&t=61396
We managed to downgrade to NDO 2, but even running the following as root failed:
Code: Select all
systemctl stop nagios
cd /tmp
rm -rf /tmp/nagiosxi
wget https://assets.nagios.com/downloads/nagiosxi/5/xi-5.6.14.tar.gz
tar zxf xi-5.6.14.tar.gz
cd /tmp/nagiosxi
./init.sh
cd /tmp/nagiosxi/subcomponents/ndoutils
./install
systemctl enable ndo2db
The issue on our server was that "./install" runs ./configure, which generates a "Makefile" from "Makefile.in", I suppose.
On line 49, there were 2 extra "xinetd" entries under INETD_TYPE=xinetd.
After removing those 2 lines, the subsequent commands "make all" and "make install-init" worked fine.
So I did the rest of the install script manually as well.
NDO seems to be running fine now, but we will keep following it up until tomorrow.
Two remarks already:
1. We now see some scheduled downtimes twice, so a cleanup was needed.
2. The main reason for mysqld's high CPU usage with NDO 3 was 2 heavy queries that ran after each apply or reboot.
In mysql, run the following command:
Code: Select all
show processlist;
Those two queries took 200% CPU.
UPDATE nagios_commenthistory ran the longest, and when UPDATE nagios_downtimehistory stopped, the CPU dropped to 100%.
When counting the entries in the commenthistory table, there were more than 2 million records, against only about 800 records in the comments table.
For most of the history entries, the comment or even the object_id didn't exist anymore.
I'm wondering if a cleanup of the history table would speed up the update queries.
But why is the update of commenthistory needed, anyway?
I will try this cleanup on our DR environment with a backup from production.
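For anyone wanting to spot such queries without reading the full process list by hand, here is a rough sketch that filters it by runtime. It assumes the tab-separated batch output of SHOW FULL PROCESSLIST with the usual column order (Id, User, Host, db, Command, Time, State, Info); the function name and the 60-second threshold are only examples.

```shell
#!/bin/sh
# Sketch: keep only statements that have been running for at least N seconds.
# Reads tab-separated SHOW FULL PROCESSLIST output on stdin; assumes the
# column order Id, User, Host, db, Command, Time, State, Info.
long_running_queries() {
    awk -F'\t' -v min="$1" '
        NR > 1 && $5 == "Query" && $6 + 0 >= min + 0 { print $6 "s\t" $8 }
    '
}

# Usage:
# mysql -uroot -pnagiosxi -B -e "SHOW FULL PROCESSLIST" | long_running_queries 60
```

Sleeping connections and short statements are dropped, so right after an apply only the long-running UPDATE statements should remain.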
Kind regards,
Peter
-
- Dreams In Code
- Posts: 7682
- Joined: Wed Feb 11, 2015 12:54 pm
Re: Nagios XI 5.8.1 Mysqld higher load and unstable checks
Glad you got that sorted; it's the first time I've seen that error with the inetd stuff.
For the queries:
The tables aren't really that large, but if for whatever reason it's not fast enough you can clean them up. See this FAQ I wrote:
FAQ: Can I truncate the tables first before proceeding with database repair (if I have crashed tables)?
You can truncate before repairing the DB; it's up to you. If you want to back it up first, you'll need to repair it. If you don't care, or already have a backup, truncate it first, as that will speed up the DB repair process.
NOTE: You may need to adjust the -h 127.0.0.1, the -uroot, and -pnagiosxi in the commands if your DB is housed/stored/offloaded/contained on a different server and/or you've changed the root mysql password.
If you don't care about the data, or already have a backup, you can just truncate the tables, which will essentially drop and recreate each table with zero data in it (removing all historical data for the respective reports):
nagios_logentries - Impacts Event Log report length
Code: Select all
mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_logentries;'
nagios_statehistory - Impacts the State History report length
Code: Select all
mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_statehistory;'
nagios_notifications - Impacts the Notifications report length
Code: Select all
mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_notifications;'
nagios_commenthistory - Impacts the comment history
Code: Select all
mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_commenthistory;'
These should technically work to clean the DB tables up manually (if the tables aren't crashed; if they ARE crashed, you will need to repair the database FIRST in order to run these queries):
nagios_logentries - Impacts Event Log report length
Code: Select all
mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_logentries WHERE logentry_time <= (NOW() - INTERVAL 6 MONTH);'
nagios_statehistory - Impacts the State History report length
Code: Select all
mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_statehistory WHERE state_time <= (NOW() - INTERVAL 6 MONTH);'
nagios_notifications - Impacts the Notifications report length
Code: Select all
mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_notifications WHERE start_time <= (NOW() - INTERVAL 6 MONTH);'
nagios_commenthistory - Impacts the comment history
Code: Select all
mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_commenthistory WHERE start_time <= (NOW() - INTERVAL 6 MONTH);'
Then you should go to Admin > Performance Settings > Databases tab and adjust ALL of the retention intervals to meet your business data policy standards. These settings control the retention on those DB tables, so they will keep them cleaned up.
I would lower them to the smallest possible level and use the XI backup/restore process and the Admin > Scheduled Backups process to offload the backups to another server. Since these XI backups contain database backups, you can spin them up to grab the data and report on it if needed.
See here for more information:
https://assets.nagios.com/downloads/nag ... os-XI.pdf
And here:
https://assets.nagios.com/downloads/nag ... abase.pdf
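A cautious way to apply the DELETE statements above is to count the affected rows first. The helper below is only a sketch: purge_sql is a made-up name, it just prints the statement for a given table, time column and age in months, and the result still has to be piped into mysql with the credentials that apply on your server.

```shell
#!/bin/sh
# Sketch: build the cleanup statement for one table so it can be reviewed,
# or run as a COUNT first, before anything is deleted.
# Arguments: mode (count|delete), table, time column, age in months.
purge_sql() {
    case "$1" in
        count)  verb="SELECT COUNT(*)" ;;
        delete) verb="DELETE" ;;
        *) echo "usage: purge_sql count|delete TABLE COLUMN MONTHS" >&2; return 1 ;;
    esac
    printf '%s FROM %s WHERE %s <= (NOW() - INTERVAL %d MONTH);\n' "$verb" "$2" "$3" "$4"
}

# Usage: see how many rows a 6-month cutoff would remove, then remove them.
# purge_sql count  nagios_logentries logentry_time 6 | mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios
# purge_sql delete nagios_logentries logentry_time 6 | mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios
```

Running the count variant first shows how much data a given cutoff would remove, which helps when picking retention intervals that match your data policy.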