Service Check Timeouts

Dusan.Mandic · Post by **Dusan.Mandic** » Wed Jan 05, 2022 7:13 pm

Hello all,

I'm having a seemingly recurring issue with service checks timing out and causing notifications. This seems to be occurring on multiple hosts. Our load is also in the 30s (16-core VM), which is probably contributing to the problem.

A system profile is attached.

Moderator's Note: The profile has been shared with the support team but has been removed from the public forum.

ssax · Post by **ssax** » Thu Jan 06, 2022 1:45 pm

I see this:

[Wed Jan 05 03:01:50.206051 2022] [:error] [pid 596] [client X.X.X.X:59566] PHP Warning: mysqli::mysqli(): (08004/1040): Too many connections in /usr/local/nagiosxi/html/includes/components/opscreen/merlin.php on line 25, referer: https://XXXXXXX/nagiosxi/includes/compo ... screen.php

Please add these under the [mysqld] section of your /etc/my.cnf:

[mysqld]
max_allowed_packet=512M
max_connections=1000

Then restart these services:

systemctl restart mariadb nagios httpd crond
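To see whether the connection limit is still being hit after the change, the error can be counted per day in the Apache error log. This is only a rough sketch: `count_conn_errors` is a made-up helper name, and the log path in the example comment is an assumption that may differ on your system.

```shell
# Count "Too many connections" errors per calendar day.
# Reads an Apache error log on stdin and prints "YEAR-MON-DAY COUNT".
count_conn_errors() {
  grep -F 'Too many connections' |
    awk '{ gsub(/\]/, "", $5); print $5 "-" $2 "-" $3 }' |
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Example (path is an assumption, adjust as needed):
# count_conn_errors < /var/log/httpd/error_log
```

Running `SHOW VARIABLES LIKE 'max_connections';` in the mysql client after the restart reports the limit that is actually in effect.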

If that doesn't alleviate it, the problem may be related to Trend Micro, which is using quite a bit of CPU (305.6%). I would disable it as a test and see if that helps resolve your issue:

 6654 root      20   0 9768056 361804  40084 S 305.6  1.1   8087:20 ds_am

It is likely slowing everything down, so processes, jobs, and checks get queued up and time out.
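For spotting other heavy CPU consumers the same way, a small filter over batch-mode `top` output can help. This is a sketch under assumptions: `high_cpu` is a hypothetical helper, and it relies on the default `top` column layout (%CPU in column 9, command in column 12).

```shell
# Print "COMMAND %CPU" for every process at or above a CPU threshold.
# Expects `top -b -n 1` style output on stdin; header lines are skipped
# because their first field is not a numeric PID.
high_cpu() {
  local threshold=${1:-100}
  awk -v t="$threshold" '$1 ~ /^[0-9]+$/ && $9 + 0 >= t { print $12, $9 }'
}

# Example:
# top -b -n 1 | high_cpu 100
```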

Please PM the output of these commands as root/sudo:

sar -A
ulimit -a
su -s /bin/bash -c 'ulimit -a' nagios
su -s /bin/bash -c 'ulimit -a' mysql
su -s /bin/bash -c 'ulimit -a' apache

Additionally, please send the output of this command:
- NOTE: You may need to adjust the -uroot and -pnagiosxi in the command if you've changed the root MySQL password

echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -uroot -pnagiosxi --table

Dusan.Mandic · Post by **Dusan.Mandic** » Wed Jan 12, 2022 12:50 pm

It seems to only time out on certain service checks from 10.200.247.xxx, 10.200.235.xxx and 10.200.249.xxx. Is there any way to isolate why these timed out?

Post by **pbroste** » Thu Jan 13, 2022 3:01 pm

Hello @Dusan.Mandic

@ssax is out of the office this week, so I wanted to follow up with you on his behalf.

From your previous response it sounds like you want to verify events from within the address ranges listed:

grep -E "10\.200\.(247|235|249)\.[0-9]{1,3}" /usr/local/nagiosxi/var/*.log --color=always | less -SR

Read through the results and let us know if you see anything that stands out or that the entries have in common.
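If the raw output is too long to read through, tallying how often each address appears can show which hosts dominate the logs. The following is a sketch only; `count_by_ip` is a hypothetical helper that hard-codes the three subnets mentioned above.

```shell
# Count occurrences of each address in the three subnets, most frequent first.
# Reads log text on stdin and prints "ADDRESS COUNT".
count_by_ip() {
  grep -Eo '10\.200\.(247|235|249)\.[0-9]{1,3}' |
    sort | uniq -c | sort -rn |
    awk '{ print $2, $1 }'
}

# Example:
# cat /usr/local/nagiosxi/var/*.log | count_by_ip
```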

Thanks,
Perry

Dusan.Mandic · Post by **Dusan.Mandic** » Thu Feb 03, 2022 5:41 pm

Still experiencing timeouts from the same hosts. I was able to pare down our vROPs polling to bring the server load down (API request reduction), so I now know it's not proc cycles.

Would you like another profile sent, @ssax?

Post by **pbroste** » Fri Feb 04, 2022 3:54 pm

Hello @Dusan.Mandic

Thanks for following up. I will ping @ssax and let him know that you are going to send an updated System Profile to his Private Message inbox.

Thanks,
Perry

ssax · Post by **ssax** » Tue Feb 08, 2022 7:40 pm

If it's the same hosts that are timing out, does it occur in a consistently periodic fashion, at around the same times?

Is there any consistency to the timing of them failing?
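One rough way to check for a pattern is to bucket the timeout entries by hour of day. This sketch assumes the Nagios log lines start with an epoch timestamp in square brackets and contain the words "timed out"; both the exact message wording and the helper name `timeouts_by_hour` are assumptions, so adjust the match to whatever your log actually shows.

```shell
# Histogram of "timed out" log entries by hour of day (UTC).
# Reads log lines such as "[1641351710] Warning: ... timed out ..." on stdin
# and prints "HOUR COUNT" for each hour that has at least one entry.
timeouts_by_hour() {
  awk '
    /timed out/ && match($0, /^\[[0-9]+\]/) {
      ts = substr($0, 2, RLENGTH - 2)        # digits between the brackets
      hour = int((ts % 86400) / 3600)        # seconds into the day -> hour
      count[hour]++
    }
    END {
      for (h = 0; h < 24; h++)
        if (count[h]) printf "%02d %d\n", h, count[h]
    }
  '
}

# Example (path is an assumption):
# timeouts_by_hour < /usr/local/nagios/var/nagios.log
```

A spike in one or two hours would point toward scheduled jobs; an even spread would suggest something else.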

Since it's the same ones in the same subnets, are you checking them over a VPN tunnel that could be having issues or routing problems? (VPN tunnels can be set up by subnet or by host, so if a subnet-based tunnel dropped and re-established, it would take down all hosts in that tunnel, as an example.)

If it happens at the same times, check backups, vMotions, off-server jobs, vulnerability scanning, etc. that could be causing the systems OR the tunnel/interfaces in the network path to overload and drop packets. I've seen all of those take down systems like that in a network. I worked at a place with an old router where, once we implemented vulnerability scanning, scanning the remote systems would overload the router interface (too much data for the old hardware) and then cause connectivity issues.

Those are some good places to start. I would also check the network statistics on the network device interfaces in the path; you may occasionally get an invalid route or asymmetric routing if you're using a routing protocol such as BGP/EIGRP/OSPF.

If it were a network issue global to the XI server, you'd have other hosts/services with the same issues, so the cause is likely external to the XI server.

You can generally increase the plugin timeouts to account for it, but you may need to investigate the network path to determine where the failure is coming from if it's impacting entire subnets.
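Before raising a timeout, it can help to measure how long the check actually takes when run by hand several times. A minimal sketch, assuming GNU coreutils `timeout` is available; `time_check` is a hypothetical helper, and the plugin command in the example is a placeholder to replace with your real check command.

```shell
# Run a command repeatedly under a time limit and report each run.
# Usage: time_check <limit_secs> <runs> <command> [args...]
# Prints "run=N rc=EXIT secs=ELAPSED"; rc=124 means the limit was hit.
time_check() {
  local limit=$1 runs=$2 i=1 start end rc
  shift 2
  while [ "$i" -le "$runs" ]; do
    start=$(date +%s)
    rc=0
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    end=$(date +%s)
    echo "run=$i rc=$rc secs=$((end - start))"
    i=$((i + 1))
  done
}

# Example (placeholder command, substitute your own plugin and arguments):
# time_check 60 10 /path/to/your_plugin --your-args
```

If the elapsed times sit just under the limit, a longer timeout may be enough; if they vary wildly, the path to the host is the more likely culprit.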

Dusan.Mandic · Post by **Dusan.Mandic** » Tue Feb 15, 2022 11:37 am

There's no correlation in timing that I can see, but it seems to be the check_Unanswered Messages in MSG Queue (check AS400 msg plugin) across those 4 hosts. I would think that if it were a network issue, we would see different service checks dropping, not just that one. Can someone please look into the profile I sent for service check timeouts concerning that service?

Our network is all internal, and most of the timeouts occur without rhyme or reason. The firewall is all open, and I don't see any drops anywhere.

ssax · Post by **ssax** » Wed Feb 16, 2022 11:02 am

I see this consuming a lot of CPU:

6654 root 20 0 9768056 361804 40084 S 305.6 1.1 8087:20 ds_am

Please try disabling that Deep Security agent and see whether it is slowing down your checks and causing them to hit a limit and time out. The assumption is that everything Nagios is doing is slowed down by the agent scanning for threats. That would be my first guess based on what you're saying. If that resolves it, you would need to either contact the agent vendor and ask what can be done or increase the timeouts on your checks.

I see these as well (will cause gaps in your graphs):

[01-05-2022 18:05:30] NPCD: WARN: MAX load reached: load 48.040000/10.000000 at i=1
[01-05-2022 18:05:45] NPCD: WARN: MAX load reached: load 49.040000/10.000000 at i=1
[01-05-2022 18:06:00] NPCD: WARN: MAX load reached: load 47.240000/10.000000 at i=1
[01-05-2022 18:06:15] NPCD: WARN: MAX load reached: load 50.650000/10.000000 at i=1

2022-01-01 19:59:13 [10780] [0] *** TIMEOUT: Timeout after 20 secs. ***
2022-01-01 19:59:13 [10780] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2022-01-01 19:59:13 [10780] [0] *** TIMEOUT: Please check your npcd.cfg

Please follow this guide to set your load_threshold to 80.0 and your TIMEOUT to 40:

https://support.nagios.com/kb/article.php?id=9
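To judge whether a new threshold leaves enough headroom, the "MAX load reached" warnings can be summarized from the NPCD log. This is a sketch only: `npcd_load_summary` is a hypothetical helper, it assumes the warning format shown above, and the log path in the example is an assumption.

```shell
# Summarize NPCD "MAX load reached" warnings.
# Reads the NPCD log on stdin and prints the number of warnings, the highest
# load seen, and the threshold reported in the last warning.
npcd_load_summary() {
  awk '
    /MAX load reached/ {
      for (i = 1; i < NF; i++) {
        # The value after the second "load" looks like 48.040000/10.000000
        if ($i == "load" && $(i + 1) ~ /\//) {
          split($(i + 1), parts, "/")
          if (parts[1] + 0 > peak) peak = parts[1] + 0
          threshold = parts[2] + 0
          events++
        }
      }
    }
    END { printf "events=%d peak=%.2f threshold=%.2f\n", events, peak, threshold }
  '
}

# Example (path is an assumption):
# npcd_load_summary < /usr/local/nagios/var/npcd.log
```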

Please send the output of these commands:

ulimit -a
su -s /bin/bash -c 'ulimit -a' nagios
su -s /bin/bash -c 'ulimit -a' mysql
su -s /bin/bash -c 'ulimit -a' apache
netstat -s
ethtool -S eth0
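
If you want to skim the `ethtool -S` output yourself first, the non-zero drop and error counters are usually the interesting ones. A rough filter, assuming the usual "name: value" layout; `nonzero_errors` is a hypothetical helper, and counter names vary by NIC driver, so the keyword list is only a guess.

```shell
# Show only non-zero counters whose names suggest drops, errors or discards.
# Reads `ethtool -S <iface>` output on stdin and prints "counter=value".
nonzero_errors() {
  awk -F': *' 'tolower($1) ~ /(drop|err|discard)/ && $2 + 0 > 0 {
    gsub(/^[ \t]+/, "", $1)
    print $1 "=" $2
  }'
}

# Example:
# ethtool -S eth0 | nonzero_errors
```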

Additionally, please send the output of this command:
- NOTE: You may need to adjust the -uroot and -pnagiosxi in the command if you've changed the root mysql password

echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -uroot -pnagiosxi --table

Dusan.Mandic · Post by **Dusan.Mandic** » Wed Feb 16, 2022 4:15 pm

Are you using the new profile I sent in PM? The load issue has been resolved.

Confirmed with networking that we're not hitting our threshold limits; same with IOPS for storage.

Please confirm you are using the latest profile, created 2/8/2022.

Nagios Support Forum
