Hello all,
We're having a recurring issue with service checks timing out and triggering notifications. It is occurring on multiple hosts. Our load is also in the 30s (16-core VM), which is probably contributing to the problem.
Attached is our profile.
Moderator's Note: The profile has been shared with the support team but has been removed from the public forum.
Service Check Timeouts
-
- Dreams In Code
- Posts: 7682
- Joined: Wed Feb 11, 2015 12:54 pm
Re: Service Check Timeouts
I see this:
[Wed Jan 05 03:01:50.206051 2022] [:error] [pid 596] [client X.X.X.X:59566] PHP Warning: mysqli::mysqli(): (08004/1040): Too many connections in /usr/local/nagiosxi/html/includes/components/opscreen/merlin.php on line 25, referer: https://XXXXXXX/nagiosxi/includes/compo ... screen.php
Please add these under the [mysqld] section of your /etc/my.cnf:
Code: Select all
[mysqld]
max_allowed_packet=512M
max_connections=1000
Then restart these services:
Code: Select all
systemctl restart mariadb nagios httpd crond
If that doesn't alleviate it, it may be related to Trend Micro, that's quite a bit of CPU being used by it (305.6%), I would disable it as a test and see if that helps resolve your issue:
Code: Select all
6654 root 20 0 9768056 361804 40084 S 305.6 1.1 8087:20 ds_am
It is likely interfering with how fast things need to go and processes/jobs/checks are getting queued up and timing out.
Please PM the output of these commands as root/sudo:
Code: Select all
sar -A
ulimit -a
su -s /bin/bash -c 'ulimit -a' nagios
su -s /bin/bash -c 'ulimit -a' mysql
su -s /bin/bash -c 'ulimit -a' apache
Additionally, please send the output of this command:
- NOTE: You may need to adjust the -uroot, and -pnagiosxi in the command if you've changed the root mysql password
Code: Select all
echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -uroot -pnagiosxi --table
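To judge whether the connection limit is really what is being hit, it may help to count how often the "Too many connections" warning shows up per hour. This is only a rough sketch: the function name is made up for illustration, and the log path in the usage note is an assumption (the RHEL/CentOS default for httpd), so adjust it for your system.

```shell
# Count "Too many connections" warnings per hour in one or more Apache error logs.
# Hypothetical helper; expects timestamps like: [Wed Jan 05 03:01:50.206051 2022]
too_many_connections_by_hour() {
    grep -h 'Too many connections' "$@" |
        # Reduce each line to "Mon DD HH:00" so identical hours collapse together.
        sed -E 's/^\[[A-Za-z]+ ([A-Za-z]+ [0-9]+) ([0-9]{2}):[^]]*\].*/\1 \2:00/' |
        sort | uniq -c
}
```

Usage would be something like `too_many_connections_by_hour /var/log/httpd/error_log`. If the counts cluster around particular hours, that points at a scheduled job or a burst of dashboard traffic rather than a steady shortage of connections.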
-
- Posts: 60
- Joined: Mon Apr 06, 2020 2:30 pm
Re: Service Check Timeouts
It seems to only time out on certain service checks from 10.200.247.xxx, 10.200.235.xxx and 10.200.249.xxx. Is there any way to isolate why these timed out?
-
- Posts: 1288
- Joined: Tue Jun 01, 2021 1:27 pm
Re: Service Check Timeouts
Hello @Dusan.Mandic
@ssax is out of the office this week, and I want to follow up with you on this on his behalf.
From your previous response it sounds like you want to verify events from within the address ranges listed:
Code: Select all
grep -Ei "10\.200\.247\.[0-9]{1,3}|10\.200\.235\.[0-9]{1,3}|10\.200\.249\.[0-9]{1,3}" /usr/local/nagiosxi/var/*.log --color=always | less -SR
Read through the results and let us know if you see anything that sticks out or that the entries have in common.
Thanks,
Perry
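If the grep output is too long to read through, a per-host tally can show whether a handful of addresses account for most of the log entries. A minimal sketch, assuming the same three subnets and the same log directory as the command above; the function name is hypothetical.

```shell
# Tally log lines per IP address within the three subnets, busiest host first.
# Hypothetical helper; pass it one or more log files.
count_by_host() {
    grep -Eoh '10\.200\.(247|235|249)\.[0-9]{1,3}' "$@" |
        sort | uniq -c | sort -rn
}
```

For example, `count_by_host /usr/local/nagiosxi/var/*.log` prints one line per address with its count in the first column.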
-
- Posts: 60
- Joined: Mon Apr 06, 2020 2:30 pm
Re: Service Check Timeouts
Still experiencing timeouts from the same hosts. I was able to pare down our vROPs polling (API request reduction) to bring the server load down, so I now know it's not processor cycles.
Would you like another profile sent, @ssax?
-
- Posts: 1288
- Joined: Tue Jun 01, 2021 1:27 pm
Re: Service Check Timeouts
Hello @Dusan.Mandic
Thanks for following up, I will ping @ssax, and let him know that you are going to send an updated System Profile to his Private Message inbox.
Thanks,
Perry
-
- Dreams In Code
- Posts: 7682
- Joined: Wed Feb 11, 2015 12:54 pm
Re: Service Check Timeouts
If it's the same hosts that are timing out, does it occur in a consistently periodic fashion, at around the same times?
Is there any consistency to the timing of them failing?
Since it's the same ones in the same subnets, are you checking them over a VPN tunnel that could be having connectivity or routing issues? (VPN tunnels can be set up by subnet or by host, so if it's a subnet-based VPN tunnel and it dropped/re-established, it would take down all hosts in that tunnel, as an example.)
If it happens during the same times, check backups, vMotions, off-server jobs, vulnerability scanning, etc. that could be causing the systems OR the tunnel/interfaces in the network path to overload and drop packets. I've seen all of those take down systems like that in a network. I worked at a place with an old router where, once we implemented vulnerability scanning and it scanned the remote systems, it would overload the router interface (too much data for the old hardware) and then cause connectivity issues.
Those are some good places to start. I would also check the network statistics on the network device interfaces in the path; you can sometimes get an invalid route or asymmetric routing if you're using a routing protocol such as BGP/EIGRP/OSPF.
If it was network issues globally with the XI server you'd be having other hosts/services with the same issues so it's likely external to the XI server causing it.
You can generally increase the plugin timeouts to account for it but you may need to investigate the network path to determine where the failure is coming from if it's impacting entire subnets.
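Before picking a new timeout value, it can be useful to measure how long the affected check actually takes when run by hand several times. The following is only a sketch: `time_check` is a made-up helper name, and the plugin path in the usage note is an assumption based on the usual Nagios plugin directory.

```shell
# Run a command N times and print the wall-clock seconds and exit code of each run.
# Usage: time_check <runs> <command> [args...]
time_check() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)
        rc=0
        # Discard plugin output; only the duration and exit code matter here.
        "$@" >/dev/null 2>&1 || rc=$?
        end=$(date +%s)
        echo "run=$i seconds=$((end - start)) exit=$rc"
        i=$((i + 1))
    done
}
```

A typical call would look like `time_check 10 /usr/local/nagios/libexec/<plugin> <args>`, run as the nagios user with the same arguments the service uses. If the slowest runs sit close to the configured limit (for example `service_check_timeout` in nagios.cfg), raising the timeout is a reasonable stopgap; if they are far over it, the delay is more likely on the remote side or in the network path.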
-
- Posts: 60
- Joined: Mon Apr 06, 2020 2:30 pm
Re: Service Check Timeouts
There's no correlation in timing that I can see, but it seems to be the check_Unanswered Messages in MSG Queue service (check AS400 msg plugin) across those 4 hosts. I would think that if it were a network issue, we would see different service checks dropping, not just that one. Can someone please look into the profile I sent for service check timeouts concerning that service?
Our network is all internal, and most of the timeouts occur without rhyme or reason. The firewall is all open, and I don't see any drops anywhere.
-
- Dreams In Code
- Posts: 7682
- Joined: Wed Feb 11, 2015 12:54 pm
Re: Service Check Timeouts
I see this consuming a lot of CPU:
6654 root 20 0 9768056 361804 40084 S 305.6 1.1 8087:20 ds_am
Please try disabling that deep security agent and see if that is slowing down your checks and causing them to hit a limit and timeout. The assumption is that everything that nagios is doing is slowed down by the agent scanning for threats. That would be my first guess based on what you're saying. If that resolves it you would either need to contact the agent vendor and ask them what can be done or increase the timeouts on your checks.
I see these as well (will cause gaps in your graphs):
[01-05-2022 18:05:30] NPCD: WARN: MAX load reached: load 48.040000/10.000000 at i=1
[01-05-2022 18:05:45] NPCD: WARN: MAX load reached: load 49.040000/10.000000 at i=1
[01-05-2022 18:06:00] NPCD: WARN: MAX load reached: load 47.240000/10.000000 at i=1
[01-05-2022 18:06:15] NPCD: WARN: MAX load reached: load 50.650000/10.000000 at i=1
2022-01-01 19:59:13 [10780] [0] *** TIMEOUT: Timeout after 20 secs. ***
2022-01-01 19:59:13 [10780] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2022-01-01 19:59:13 [10780] [0] *** TIMEOUT: Please check your npcd.cfg
Please follow this guide to set your load_threshold to 80.0 and your TIMEOUT to 40:
https://support.nagios.com/kb/article.php?id=9
Please send the output of these commands:
Code: Select all
ulimit -a
su -s /bin/bash -c 'ulimit -a' nagios
su -s /bin/bash -c 'ulimit -a' mysql
su -s /bin/bash -c 'ulimit -a' apache
netstat -s
ethtool -S eth0
Additionally, please send the output of this command:
- NOTE: You may need to adjust the -uroot and -pnagiosxi in the command if you've changed the root mysql password
Code: Select all
echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -uroot -pnagiosxi --table
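As a sanity check on the load_threshold value, it may help to pull the highest load that NPCD has reported out of its log and compare it against the number you plan to set. A small sketch under stated assumptions: the function name is hypothetical, and the log location in the usage note is assumed rather than taken from this thread.

```shell
# Print the highest load value found in NPCD "MAX load reached" warnings.
# Hypothetical helper; pass it one or more NPCD log files.
npcd_peak_load() {
    awk '
        /MAX load reached/ {
            # Look for the "current/threshold" field, e.g. 48.040000/10.000000.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9.]+\/[0-9.]+$/) {
                    split($i, parts, "/")
                    if (parts[1] + 0 > max) max = parts[1] + 0
                }
            }
        }
        END { if (max != "") printf "%.2f\n", max }
    ' "$@"
}
```

Something like `npcd_peak_load /usr/local/nagios/var/npcd.log` would print a single number, or nothing if no warnings were logged. A peak that stays well under the configured threshold suggests NPCD will no longer skip perfdata processing because of load.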
-
- Posts: 60
- Joined: Mon Apr 06, 2020 2:30 pm
Re: Service Check Timeouts
Are you using the new profile I sent in PM? The load issue has been resolved.
Confirmed with networking that we're not capping out our threshold limits; same with IOPS for storage.
Please confirm you are using the latest profile, created 2/8/2022.