Mod_Gearman causing higher CPU load on NagiosXI

Post by **nms** » Mon Mar 22, 2021 5:19 am

Dear Support,

We have a Nagios XI 5.7.3 installation running on CentOS 6.10 (for now).
This instance has 656 hosts with 19k services.

We recently installed Mod-Gearman, mainly to reduce the CPU load on the Nagios server.
We use a remote worker that takes load off Nagios through a service group configuration.
The service group (WORKER_STPGw) holds just 588 services, so that is all the remote worker handles for the moment.

In the Nagios configuration (nagios.cfg), the NEB modules are loaded as follows:

Code: Select all

# Added by NDO 'make install-broker-line' on Wed Sep  9 10:50:48 CEST 2020
broker_module=/usr/local/nagios/bin/ndo.so /usr/local/nagios/etc/ndo.cfg


broker_module=/usr/lib64/mod_gearman/mod_gearman_nagios4.o config=/etc/mod_gearman/module.conf eventhandler=no

Attached is the file "/etc/mod_gearman/module.conf" for your convenience.

module.conf.txt

Instead of reducing the load on the Nagios server, this seems to have caused higher CPU usage than we usually see.
In the attached graph (NagiosCPU.jpg), you can see that when the gearmand daemon was started, the average CPU usage went up instead of down as expected.

NagiosCPU.jpg

Looking at /var/log/gearmand/gearmand.log, I noticed these connection errors:

Code: Select all

ERROR 2021-02-22 09:47:21.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2021-02-22 09:47:21.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109

These errors may come from the worker failing to connect to the Gearman server. The worker is in a different geographical area, but it appears to be working well.

Attached is the worker configuration "/etc/mod_gearman/worker.conf"

worker.conf.txt

The higher CPU usage is clearly tied to the worker setup: when gearmand was stopped, the load on Nagios returned to its previous level.

Looking at gearman_top, I see the following queues, which I believe are correct.

Code: Select all

[root@am1-sha-nagios2-p etc]# gearman_top -b
2021-03-22 10:59:44  -  localhost:4730  -  v0.33

 Queue Name                | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------
 check_results             |               1  |           0  |           0
 servicegroup_WORKER_STPGw |             200  |           0  |           1
 worker_bru-nms-stpgw-p    |               1  |           0  |           0
----------------------------------------------------------------------------
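
For repeated checks, a table like the one above can be screened automatically. The following is a minimal sketch, assuming the `gearman_top -b` output keeps the pipe-separated layout shown here; the function name `queues_needing_attention` is hypothetical and not part of Mod-Gearman. It prints every queue that has no worker available or has jobs waiting.

```shell
#!/bin/bash
# Hypothetical helper: read gearman_top -b style output on stdin and print
# the name of each queue with zero workers available or with jobs waiting.
queues_needing_attention() {
    awk -F'|' 'NF == 4 && $2 ~ /^ *[0-9]+ *$/ {
        # Data rows have four fields and a numeric second column;
        # the header and separator lines are skipped by this test.
        name = $1
        gsub(/^ +| +$/, "", name)      # trim padding around the queue name
        if ($2 + 0 == 0 || $3 + 0 > 0) print name
    }'
}

# Example usage on the Gearman server:
# gearman_top -b | queues_needing_attention
```

With the output shown above, nothing would be printed, since every queue has at least one worker and no jobs waiting.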

What could be causing the higher load while the worker daemon is running?
Is something missing from our server/worker configuration, or does something need adjusting?
Please let us know if you need anything else to follow up on this issue.

Rgds,
Matthew

Post by **vtrac** » Mon Mar 22, 2021 4:47 pm

Hi,
On your Gearman server (Nagios XI), please find the PID of gearmand:

Code: Select all

ps -ef | grep gearmand

Take the PID from the output of that command, then run the command below with it and post the output here:

Code: Select all

prlimit --pid PID | grep NOFILE

Also, please run this on your gearman server as well:

Code: Select all

netstat -anp | grep 4730 | wc -l
or
ss -anp | grep 4730 | wc -l

Can you please upload the "/etc/security/limits.conf" file?

Please also upload the "profile.zip" to this post/ticket.

Regards,
Vinh

Post by **nms** » Tue Mar 23, 2021 3:28 am

Hi Vinh,

Thanks for the follow-up.

Here is what you requested. Note that prlimit is not available on CentOS 6, so I read the limit from /proc/PID/limits instead, as shown below; it should report the same value as prlimit.

Code: Select all

 ps -ef | grep gearmand
gearmand 16124     1  1 Mar22 ?        00:27:09 /usr/sbin/gearmand -d --worker-wakeup=10 --retention-file=/tmp/gearmand.retention -q retention --log-file=/var/log/gearmand/gearmand.log
[root@am1-sha-nagios2-p ~]#
[root@am1-sha-nagios2-p ~]#
[root@am1-sha-nagios2-p ~]# grep "open files" /proc/16124/limits
Max open files            10000                10000                files
[root@am1-sha-nagios2-p ~]#
[root@am1-sha-nagios2-p ~]#
[root@am1-sha-nagios2-p ~]# netstat -anp | grep 4730 | wc -l
321
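
The two figures above can be compared in one step. This is a minimal sketch, assuming the `/proc/PID/limits` format shown in the output above; the helper names `soft_nofile_limit` and `below_limit` are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the soft "Max open files" limit from text in
# /proc/<pid>/limits format, read on stdin.
soft_nofile_limit() {
    awk '/^Max open files/ { print $4 }'
}

# Hypothetical helper: succeed if the connection count ($1) is below the limit ($2).
below_limit() {
    [ "$1" -lt "$2" ]
}

# Example usage with the values from the output above:
# limit=$(soft_nofile_limit < /proc/16124/limits)
# below_limit 321 "$limit" && echo "connection count is within the open-files limit"
```

With 321 connections against a limit of 10000, the check succeeds, so the open-files limit does not look like the bottleneck in this snapshot.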

Limits file and the profile are attached.

limits.conf.txt

profile.zip

Rgds,

Post by **vtrac** » Tue Mar 23, 2021 4:25 pm

Hi,
Looking at the profile.zip I noticed a few things.

Output of the "top" command (below) showed:
- A large number of tasks (799 total)
- Your load average is very high (13.12, 10.72, 12.50)
- mysqld is running at 124.8% CPU and has accumulated a lot of CPU time (TIME+ 7552:42, i.e. about 7552 minutes)
- A large number of "check_by_ssh" processes were running

Code: Select all

top - 09:27:19 up 166 days, 19:47,  3 users,  load average: 13.12, 10.72, 12.50
Tasks: 799 total,   2 running, 797 sleeping,   0 stopped,   0 zombie
Cpu(s): 46.6%us, 14.9%sy,  0.0%ni, 34.6%id,  3.2%wa,  0.0%hi,  0.7%si,  0.0%st
Mem:  16466356k total, 15361780k used,  1104576k free,   107620k buffers
Swap:  8241148k total,    92188k used,  8148960k free, 12958576k cached

  PID USER      PR  NI  VIRT  RES  SHR  S %CPU %MEM    TIME+   COMMAND            
 2948 mysql     20   0 4314m 141m 5084 S 124.8  0.9     7552:42  mysqld

I also noticed many warnings like the following in your nagios.log (please check):

Code: Select all

[1616488038] Warning: Return code of 127 for service '210_S-ipops-BFX014-DIAMETER-DetailledTrafficPerConnection-Requests Received by BFX to ORANGE-NN-BXLHSS02' on host 'bru-owf-hlrdra01-p_v-ncc' may indicate this plugin doesn't exist.
[1616488038] Warning: Return code of 127 for service '210_S-ipops-BFX013-DIAMETER-DetailledTrafficPerConnection-Requests Send by DRA to JPU-Broker-A' on host 'vip-jpu-hlrdra01-p_v-ncc' may indicate this plugin doesn't exist.
[1616488038] Warning: Return code of 127 for service '210_S-ncc-BFX012-DIAMETER-TrafficPerConnection am1-rtcg01-vfmt-temp05-p' on host 'bru-int-ggsndra01-p_v-ncc' may indicate this plugin doesn't exist.
[1616488038] Warning: Return code of 127 for service '210_S-ncc-BFX012-DIAMETER-TrafficPerConnection am1-vfie-slc02-01-p-i' on host 'bru-int-ggsndra01-p_v-ncc' may indicate this plugin doesn't exist.
[1616488038] Warning: Return code of 127 for service '210_S-ipops-BFX013-DIAMETER-DetailledTrafficPerConnection-Requests Send by DRA to TIS-MILAN' on host 'bru-tis-hlrdra01-p_v-ncc' may indicate this plugin doesn't exist.
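
To see which hosts are affected most, the warnings can be tallied per host. This is a minimal sketch, assuming the log line format shown above and host names without spaces; the function name `hosts_with_rc127` is hypothetical, and the log path in the usage line is an assumption about a default install.

```shell
#!/bin/bash
# Hypothetical helper: read nagios.log lines on stdin and print
# "<count> <host>" for each host with "Return code of 127" warnings,
# highest count first.
hosts_with_rc127() {
    sed -n "s/.*Return code of 127 for service '.*' on host '\([^']*\)' may indicate.*/\1/p" \
        | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Example usage (path is an assumption; adjust to your installation):
# hosts_with_rc127 < /usr/local/nagios/var/nagios.log
```

Exit status 127 is what a shell returns when a command is not found, which is why Nagios suggests the plugin may not exist on the machine that ran the check.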

Here are my recommendations:

Run the below script to repair your database:

Code: Select all

/usr/local/nagiosxi/scripts/repair_databases.sh

Configure the worker with the following settings:

Code: Select all

max-worker=1000
max-jobs=1000
spawn-rate=50

Reboot your machine, as something might be hanging:

Code: Select all

reboot

You might also want to increase the "ulimit" if the issue is still there after the reboot.

Regards,
Vinh

Post by **vtrac** » Tue Mar 23, 2021 4:34 pm

Hi,
Based on the "top" output, I don't think your server is in good condition.

It might be better to reboot your server first with the "reboot" command, since everything will be very slow and the database repair might fail.

Regards,
Vinh
