Nagios XI: Active Service Checks Performance SLOW

dgianetti · Post by **dgianetti** » Fri Jun 11, 2021 9:41 am

We are currently running Nagios XI 5.7.5 on RHEL 6.10. I know, I know, it's an old OS. We are looking to upgrade it sometime soon.

Anyhow, a while back, a teammate noticed the server was extremely slow and checks were failing randomly. When we stopped Nagios XI, performance went back to normal. We upgraded and migrated everything from PostgreSQL to MySQL, and things were better for a time. Recently, the issue returned. Again, stopping Nagios XI makes the server perform normally. I narrowed it down further: performance returned immediately when I stopped the nagios service itself. I then focused on that and found that when I stop Active Service Checks within the Nagios XI Monitoring Engine Process pane, everything works fine (except that active checks are disabled). It's a relatively small implementation of 15 hosts and 320 services, so it's not a capacity issue.

One thing I noticed was that the process to stop/start Nagios XI mentions 'service ndo2db stop' and start. When I run this command I get an unrecognized service error from RHEL. After digging into ndo2db, it seems that might be the piece that's at fault, but I'm stuck trying to troubleshoot further. Am I way off base? What else can I check?

benjaminsmith · Post by **benjaminsmith** » Fri Jun 11, 2021 12:39 pm

Hi @dgianetti,

Starting in 5.7.x, the backend database application that writes the check results to a MySQL database was completely rewritten and no longer runs as a separate service.

Given that this is an older, legacy system and the overall check load (host + services) is small, I believe it would be better to use the other version of the database application. It's a relatively simple process to change from one to the other. However, I would recommend taking a full backup or VM snapshot before making any changes. Please try the following steps and let me know if you notice an improvement.

Code: Select all

# RHEL 6 uses SysV init, so use service/chkconfig rather than systemctl
service nagios stop
cd /tmp
rm -rf /tmp/nagiosxi
wget https://assets.nagios.com/downloads/nagiosxi/5/xi-5.6.14.tar.gz
tar zxf xi-5.6.14.tar.gz
cd /tmp/nagiosxi/subcomponents/ndoutils
./install
chkconfig ndo2db on

Then edit your /usr/local/nagios/etc/nagios.cfg and make sure this line is uncommented:

Code: Select all

broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg

Make sure this line is commented:

Code: Select all

#broker_module=/usr/local/nagios/bin/ndo.so /usr/local/nagios/etc/ndo.cfg
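
To double-check which broker module Nagios will actually load after the edit, it can help to list only the uncommented broker_module lines. This is a minimal sketch: the function name active_broker_modules is hypothetical (not part of Nagios), and the default path is the nagios.cfg location given above.

```shell
#!/bin/bash
# List the uncommented broker_module lines in a Nagios config file.
# active_broker_modules is a hypothetical helper name, not a Nagios tool.
active_broker_modules() {
    local cfg="${1:-/usr/local/nagios/etc/nagios.cfg}"
    # Match lines that begin (after optional whitespace) with broker_module=
    # so that lines starting with "#" are skipped.
    grep -E '^[[:space:]]*broker_module=' "$cfg"
}
```

Running active_broker_modules with no argument reads the default nagios.cfg; after the change described above, only the ndomod.o line should be printed.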

Then start the nagios service:

Code: Select all

service nagios start

If you continue to have issues, please send over the system profile to help us troubleshoot. Also, just so you know, RHEL 6 is EOL and not supported. Let us know if you have questions about migrating this instance to RHEL 7 or 8 (or another supported distro).

To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.

--Benjamin

dgianetti · Post by **dgianetti** » Mon Jun 14, 2021 1:40 pm

Thanks very much, Benjamin. It appears things are working again, but we used to see OK performance for a few hours before it would start degrading again. Everything would take minutes to process... simple commands like su or htop, or even waiting for the command prompt to appear. When the Nagios service was shut down, everything would go back to normal. I also noticed the service execution time (max and avg) and the service check latency would skyrocket to 1200 or 1300 seconds!

As of right now, I reinstalled ndo2db, updated the config and started things back up. It appears checks are again running and the AVG/MAX times are going down to what we'd expect.

Thanks for the reminder about RHEL 6.x. We are aware and trying to get some buy-in from our management to upgrade the server and reinstall nagios. This is our oldest (original) installation and so has been through many upgrades. A fresh start is in order.

Thanks again for your help. I'll update if anything goes awry.

Dave

dgianetti · Post by **dgianetti** » Mon Jun 14, 2021 3:50 pm

It's been a few hours and unfortunately, the system is back to its old tricks. Initially, I watched as the max and avg times fell for service check execution time. I just went in to check on it and I'm seeing the same sluggishness on the system when trying to do anything at all. Sure enough, the service execution time has crept back up to 36 sec avg and 316 sec max. The service check latency is back up to 127 sec avg and 419 sec max! It seems something still isn't correct.
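
For tracking these numbers from the command line rather than the GUI, the figures can be pulled from nagiostats. The following is a sketch under assumptions: it assumes nagiostats lives at the default /usr/local/nagios/bin path and that the MRTG variables shown report milliseconds, one value per line. The function names are hypothetical.

```shell
#!/bin/bash
# Convert millisecond values (one per line on stdin) to seconds.
ms_to_seconds() {
    awk '{ printf "%.3f\n", $1 / 1000 }'
}

# Print avg/max active service check latency, then avg/max execution time,
# in seconds. Assumes the default nagiostats path; pass another as $1.
service_check_timings() {
    local stats="${1:-/usr/local/nagios/bin/nagiostats}"
    "$stats" --mrtg \
        --data=AVGACTSVCLAT,MAXACTSVCLAT,AVGACTSVCEXT,MAXACTSVCEXT | ms_to_seconds
}
```

Calling service_check_timings every few minutes (for example from a loop with sleep) gives a simple record of when latency starts to climb.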

You mentioned sending the System Profile. How should I get that to you? PM? Email? I have created it just now.

Thanks again!

benjaminsmith · Post by **benjaminsmith** » Tue Jun 15, 2021 9:46 am

Hi,

The best way would be to send it over in a private message; click the PM icon below my name. Try doing a full-stack restart and let me know if the issue goes away (at least temporarily). The latency shouldn't be an issue on a system of this size. Also, when the system is loaded down, please run a top command and post the output to the thread as well. Thanks, Benjamin

Code: Select all

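# Stop cron and the Nagios XI services
service crond stop
service npcd stop
service nagios stop
service ndo2db stop
# Kill any remaining processes owned by the nagios user
pkill -9 -u nagios
# Remove leftover IPC message queues that mention nagios
for i in $(ipcs -q | grep nagios | awk '{print $2}'); do ipcrm -q "$i"; done
# Remove stale lock files
rm -rf /usr/local/nagiosxi/var/dbmaint.lock
rm -rf /usr/local/nagiosxi/var/event_handler.lock
rm -rf /usr/local/nagiosxi/scripts/reconfigure_nagios.lock
# Restart the database, then bring the services back up
service mysqld restart
service ndo2db start
service nagios start
service npcd start
service crond start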

To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.

benjaminsmith · Post by **benjaminsmith** » Wed Jun 16, 2021 5:16 pm

Hi,

Thanks for the profile. I'm still seeing a few database-related errors:

/usr/libexec/mysqld: Forcing close of thread 2128990 user: 'ndoutils'

Please follow the steps in the KB article below to increase the max connections on the database and let me know if you notice an improvement.

Nagios XI - MySQL/MariaDB - Max Connections

The system is small, but it has only 2 CPUs. I would recommend increasing this to 4 for better performance.

Nagios XI - Hardware Requirements
https://support.nagios.com/kb/article.p ... ategory=83

Benjamin

dgianetti · Post by **dgianetti** » Thu Jun 17, 2021 2:01 pm

I just ran the commands in the article you linked. It looks like max DB connections is set to the default (151) and the max used was 78. That doesn't seem to be too many. I can increase it if you like; please let me know.
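
As a rough way to read those two numbers, the peak can be expressed as a percentage of the limit. This is a minimal sketch: the function names are hypothetical, and the mysql host and user are placeholders that would need to match the local setup.

```shell
#!/bin/bash
# Peak connections as a whole-number percentage of max_connections.
# Usage: conn_usage_pct <max_used_connections> <max_connections>
conn_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Show the live values. Host and user are placeholders; -p prompts for
# the password instead of putting it on the command line.
show_connection_limits() {
    mysql -h 127.0.0.1 -uroot -p -e \
        "SHOW VARIABLES LIKE 'max_connections'; SHOW GLOBAL STATUS LIKE 'Max_used_connections';"
}
```

With the values above, conn_usage_pct 78 151 prints 51, so the peak was about half of the configured limit.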

The issue seems to occur when I enable the active service checks. Through process of elimination, we began stopping and starting services relating to Nagios (when the issue first occurred). First we found stopping Nagios returned the server to normal response time. Then we found disabling Active Service Checks via the Nagios XI GUI seemed to solve the issue. I'm trying to isolate that further, but that's where I'm getting stuck. Is there some kind of debug that can be enabled to see if there is a particular check that's hanging, or something similar? It's maddening for sure.
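
One way to look for a single slow or hanging check without enabling debug logging is to rank services by their last recorded execution time. The sketch below assumes the status file is at the default /usr/local/nagios/var/status.dat and that each servicestatus block carries host_name, service_description, check_execution_time and check_latency fields (in seconds). The function name is hypothetical.

```shell
#!/bin/bash
# Print the N services with the highest last check_execution_time.
# Output columns (tab-separated): exec_time, latency, host, service.
# Usage: slowest_service_checks [status_file] [count]
slowest_service_checks() {
    local status="${1:-/usr/local/nagios/var/status.dat}" count="${2:-10}"
    awk '
        /^servicestatus [{]/ { in_svc = 1; host = svc = ""; exec_t = lat = 0; next }
        in_svc && /^[ \t]*host_name=/            { sub(/^[ \t]*host_name=/, "");            host = $0 }
        in_svc && /^[ \t]*service_description=/  { sub(/^[ \t]*service_description=/, "");  svc = $0 }
        in_svc && /^[ \t]*check_execution_time=/ { sub(/^[ \t]*check_execution_time=/, ""); exec_t = $0 }
        in_svc && /^[ \t]*check_latency=/        { sub(/^[ \t]*check_latency=/, "");        lat = $0 }
        # End of a servicestatus block: emit one row and reset the flag.
        in_svc && /^[ \t]*[}]/ { printf "%s\t%s\t%s\t%s\n", exec_t, lat, host, svc; in_svc = 0 }
    ' "$status" | sort -rn | head -n "$count"
}
```

A check that sits near its timeout on every run will show up at the top of this list, which makes it a candidate for temporarily disabling.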

benjaminsmith · Post by **benjaminsmith** » Fri Jun 18, 2021 10:02 am

Hi,

Another team member and I went through the profile, and the only other concern was the following entry.

[1623695088] WARNING: RLIMIT_NPROC is 4096, total max estimated processes is 4106! You should increase your limits (ulimit -u, or limits.conf)

On a system of this size the limits are usually not a concern, but please log in as the nagios user and post the limits so we can check.

Code: Select all

su - nagios
ulimit -a

Here's a tutorial on how to increase the nproc limits.
https://www.thegeekdiary.com/how-to-set ... -rhel-567/
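
To see how close the nagios user really is to its limit, the number of running tasks can be compared with the value from ulimit -u. This is a sketch with hypothetical function names; it assumes a procps ps that supports -L (threads count toward nproc on Linux).

```shell
#!/bin/bash
# Headroom left under a process limit; a negative result means over the limit.
# Usage: nproc_headroom <limit> <in_use>
nproc_headroom() {
    echo $(( $1 - $2 ))
}

# Count processes and threads owned by a user (default: nagios).
user_task_count() {
    ps -L -u "${1:-nagios}" -o pid= | wc -l | tr -d ' '
}
```

Run as the nagios user, nproc_headroom "$(ulimit -u)" "$(user_task_count nagios)" gives the remaining headroom (this fails if ulimit -u reports "unlimited"). With the numbers from the warning above, nproc_headroom 4096 4106 prints -10.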

Keep an eye on the nagios.log and database logs for any errors. You might want to disable checks temporarily for any hosts/services that are unreachable (timing out) and see if that helps.

Code: Select all

/usr/local/nagios/var/nagios.log
/var/log/mysqld.log
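
Entries in nagios.log start with an epoch timestamp in square brackets, which makes it hard to line warnings up with the time the slowdown began. The sketch below prints only lines whose message starts with "Warning:" or "Error:" (any case) and converts the timestamp to UTC. It assumes GNU date, and the function name is hypothetical.

```shell
#!/bin/bash
# Print warning/error lines from a Nagios log with the leading [epoch]
# replaced by a readable UTC time. Requires GNU date (-d @epoch).
readable_nagios_warnings() {
    local log="${1:-/usr/local/nagios/var/nagios.log}"
    grep -Ei '^\[[0-9]+\] (warning|error):' "$log" | while IFS= read -r line; do
        ts="${line#\[}"      # drop the leading "["
        ts="${ts%%\]*}"      # keep only the digits before the first "]"
        msg="${line#*\] }"   # message text after "] "
        printf '%s %s\n' "$(date -u -d "@$ts" '+%Y-%m-%d %H:%M:%S')" "$msg"
    done
}
```

Service alerts that merely contain the word WARNING as a state are not matched, since the pattern is anchored directly after the timestamp.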

--Benjamin

dgianetti · Post by **dgianetti** » Sat Jun 19, 2021 11:34 am

Thanks for the additional information. I'm actually out of office this week, but will pass this along to a teammate. Doesn't hitting the 4096 proc limit on our relatively small implementation seem strange? We have a much larger implementation in another network and that is humming along fine with the same limits.

How much larger than 4096 can you go? I was searching for answers to this, but I haven't seen anything to indicate what the practical limit is.

ssax · Post by **ssax** » Mon Jun 21, 2021 11:16 am

10000 should be good.

When your system is having issues, does the top command show a lot of IO wait? It's the wa column below:

[root@xid ~]# top
top - 09:14:47 up 4 days, 23:14, 2 users, load average: 0.43, 0.61, 0.52
Tasks: 171 total, 1 running, 170 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.1 us, 3.1 sy, 0.0 ni, 93.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
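
To capture that value repeatedly while the slowdown is happening, the wa figure can be pulled out of the CPU summary line. This is a minimal sketch with a hypothetical function name; it tries to handle both the "0.0 wa" layout shown above and the older "0.0%wa" layout used by the top shipped with RHEL 6.

```shell
#!/bin/bash
# Extract the I/O wait percentage ("wa") from a top CPU summary line on stdin.
iowait_from_top() {
    awk 'match($0, /[0-9.]+[ %]*wa/) {
        s = substr($0, RSTART, RLENGTH)   # e.g. "0.0 wa" or "63.0%wa"
        sub(/[ %]*wa$/, "", s)            # strip the separator and "wa"
        print s
        exit
    }'
}
```

For example, top -bn2 | grep 'Cpu(s)' | tail -n 1 | iowait_from_top prints the I/O wait from the second sample, which reflects current activity rather than the average since boot.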

What is the output of this command?

Code: Select all

sar -A

Additionally, please send the output of this command:
- NOTE: You may need to adjust the -h 127.0.0.1, the -uroot, and -pnagiosxi in the command if your DB is offloaded to another server and/or you've changed the root mysql password

Code: Select all

echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -h 127.0.0.1 -uroot -pnagiosxi --table

Please attach these files:

Code: Select all

/etc/php.ini
/etc/my.cnf

Nagios Support Forum
