Performance graph gaps

Post by **nosajche** » Fri Oct 09, 2020 10:27 am

Hello,

We are running into some issues with gaps in our performance graphs.

All services are showing green, but for hours at a time chunks of data are missing from our performance graphs, typically outside regular working hours. We have followed the instructions in the documentation and made a few changes, but still have not pinned down what is going on.

We took the following actions:

1. Upped the verbosity of both NPCD and perfdata
2. Confirmed the nagios account has not expired
3. Noted errors about the load threshold, raised the load_threshold of NPCD to 20, and restarted NPCD.

Here are the spooled file counts. Neither reaches the 20k figure cited in the article.

Code: Select all

$ ls /usr/local/nagios/var/spool/perfdata/ | wc -l
2
$ ls /usr/local/nagios/var/spool/xidpe/ | wc -l
4707

perfdata.log stops being written to at exactly the time the missing data starts in the GUI.

From npcd.log, we are seeing the following for every check:

Code: Select all

[10-09-2020 11:17:32] NPCD: ThreadCounter 0/5 File is 1599774829.perfdata.service-PID-15586
[10-09-2020 11:17:32] NPCD: File '1599774829.perfdata.service-PID-15586' is an already in process PNP file. Leaving it untouched.
[10-09-2020 11:17:32] NPCD: DEBUG: load 1.970000/20.000000
[10-09-2020 11:17:32] NPCD: ThreadCounter 0/5 File is 1600195788.perfdata.host-PID-20283
[10-09-2020 11:17:32] NPCD: File '1600195788.perfdata.host-PID-20283' is an already in process PNP file. Leaving it untouched.
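
The numeric prefix of each spool file name appears to be a Unix epoch timestamp (the mv commands in messages.log use the same pattern, at times that match the log entries). Assuming that holds, and assuming GNU date is available, a minimal sketch like the following decodes the prefix to show how old a stuck file is. The helper name spool_file_date is hypothetical.

```shell
#!/usr/bin/env bash
# Print the UTC time encoded in a spool file name such as
# 1599774829.perfdata.service-PID-15586.
# Assumption: the leading number is seconds since the Unix epoch.
spool_file_date() {
    local name epoch
    name=$(basename "$1")
    epoch=${name%%.*}                                  # keep text before the first dot
    date -u -d "@${epoch}" '+%Y-%m-%d %H:%M:%S UTC'    # GNU date syntax
}

spool_file_date 1599774829.perfdata.service-PID-15586
spool_file_date 1600195788.perfdata.host-PID-20283
```

Both names decode to mid-September 2020, weeks before the log lines that mention them, which suggests these are stale leftovers rather than fresh data.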

Additionally, we saw some of the following errors in messages.log:

Code: Select all

Oct  9 06:36:57 dltfanxi1 nagios: Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1602239817.perfdata.host" - errno: Cannot allocate memory
Oct  9 06:37:11 dltfanxi1 nagios: Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1602239831.perfdata.service" - errno: Cannot allocate memory
Oct  9 06:37:12 dltfanxi1 nagios: Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1602239831.perfdata.host" - errno: Cannot allocate memory
Oct  9 06:37:27 dltfanxi1 nagios: Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1602239847.perfdata.service" - errno: Cannot allocate memory
Oct  9 06:37:27 dltfanxi1 nagios: Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1602239847.perfdata.host" - errno: Cannot allocate memory

However, we have confirmed that memory on the ESXi host was not under pressure. For reference, the VM has 4 CPUs and 8 GB of RAM.
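
A fork() that fails with "Cannot allocate memory" is reported by the guest kernel, so it can occur even when the hypervisor has memory to spare. One thing worth ruling out (an assumption, not a diagnosis) is memory or commit pressure inside the VM itself. A rough sketch for a Linux guest follows; meminfo_summary is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print selected fields from a meminfo-style file (default: /proc/meminfo),
# converted from kB to whole MiB.
meminfo_summary() {
    local file=${1:-/proc/meminfo}
    awk '/^(MemTotal|MemAvailable|SwapFree|CommitLimit|Committed_AS):/ {
             printf "%s %d MiB\n", $1, $2 / 1024
         }' "$file"
}

if [ -r /proc/meminfo ]; then
    meminfo_summary
fi

# Overcommit policy: 0 = heuristic, 1 = always allow, 2 = strict accounting
if [ -r /proc/sys/vm/overcommit_memory ]; then
    cat /proc/sys/vm/overcommit_memory
fi
```

Running this at a time when the fork() warnings appear, rather than during working hours, is more likely to show whatever is consuming memory.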

Any ideas on where we should look to resolve this?

Thanks,

Post by **nosajche** » Fri Oct 09, 2020 2:20 pm

Hello,

Did some more digging and found some PHP errors due to low memory_limit:

Code: Select all

PHP Fatal error:  Allowed memory size of 268435456 bytes exhausted (tried to allocate 79 bytes) in /data/local/nagiosxi/cron/perfdataproc.php on line 550
PHP Fatal error:  Allowed memory size of 268435456 bytes exhausted (tried to allocate 354 bytes) in /data/local/nagiosxi/cron/perfdataproc.php on line 550

We did the following:

1. Deleted all files in the /usr/local/nagios/var/spool/perfdata folder.
2. Changed the php.ini memory_limit to 1024 MB.
3. Restarted the httpd service.
4. Restarted perfdataproc.php as the nagios user.

After these steps, data started graphing again.

A few follow-up questions:
-- What causes the perfdata pile-up in that location, and is there a way to detect it?
-- Is there general guidance on how to optimize Nagios XI for deployments with a larger number of services/hosts?
-- Are there any dependency settings other than PHP that need to be tweaked for the VM and intended environment size?

Thanks.

Post by **cdienger** » Fri Oct 09, 2020 4:45 pm

The process that moves files from the xidpe folder to the perfdata folder is a PHP job, so hitting a PHP limit would explain why it fails to move things out of that directory. That likely impacts the next step, in which process_perfdata.pl processes the contents of the perfdata directory. I've attached a chart showing the flow of performance data.

Tweaking the PHP limits is a common recommendation. Check out https://support.nagios.com/kb/article/n ... e-611.html which covers increasing the memory limit as well as a few more settings in the php.ini.

https://assets.nagios.com/downloads/nag ... ios-XI.pdf covers some other performance tweaks for the XI system. I usually recommend at least following the steps to add a ramdisk for perfdata.
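
On the question of detecting a pile-up, one option is to watch the file counts in the two spool directories. The following is a rough sketch in the style of a Nagios plugin (exit codes 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), not an official XI check. The function names and the thresholds are hypothetical; the path in the example comment is the one from the first post.

```shell
#!/usr/bin/env bash
# Count regular files directly inside a directory.
count_spool_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Compare the file count against warning and critical thresholds and
# report in Nagios plugin format, with the count as performance data.
check_spool() {
    local dir=$1 warn=$2 crit=$3 n
    if [ ! -d "$dir" ]; then
        echo "UNKNOWN - $dir not found"
        return 3
    fi
    n=$(count_spool_files "$dir")
    if [ "$n" -ge "$crit" ]; then
        echo "CRITICAL - $n files in $dir | files=$n;$warn;$crit"
        return 2
    elif [ "$n" -ge "$warn" ]; then
        echo "WARNING - $n files in $dir | files=$n;$warn;$crit"
        return 1
    fi
    echo "OK - $n files in $dir | files=$n;$warn;$crit"
    return 0
}

# Example call (thresholds are illustrative only):
# check_spool /usr/local/nagios/var/spool/xidpe 1000 5000
```

Graphing the count over time would also show whether the backlog grows steadily or jumps when perfdataproc.php stops.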

Post by **nosajche** » Sat Oct 10, 2020 12:05 am

Thanks for this info.

Everything was working fine for a few hours, but then the graphs stopped generating again. This time, however, the logs do not mention any PHP errors and there are no files in the xidpe or perfdata folders.

There is only the following message in the messages log:

Code: Select all

Oct 10 01:00:39 dltfanxi1 nagios: Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1602306038.perfdata.host" - errno: Cannot allocate memory
Oct 10 01:00:39 dltfanxi1 nagios: Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1602306039.perfdata.service" - errno: Cannot allocate memory

The httpd log does not have any errors, but I increased the PHP memory limit and restarted httpd anyway. I am still getting the same problem:

Code: Select all

PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 250 bytes) in /data/local/nagiosxi/cron/perfdataproc.php on line 541
PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 81 bytes) in /data/local/nagiosxi/cron/perfdataproc.php on line 551
PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 79 bytes) in /data/local/nagiosxi/cron/perfdataproc.php on line 550
PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 79 bytes) in /data/local/nagiosxi/cron/perfdataproc.php on line 550
PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 81 bytes) in /data/local/nagiosxi/cron/perfdataproc.php on line 551
PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 79 bytes) in /data/local/nagiosxi/cron/perfdataproc.php on line 550
PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 79 bytes) in /data/local/nagiosxi/cron/perfdataproc.php on line 550
PHP Fatal error:  Allowed memory size of 2147483648 bytes exhausted (tried to allocate 32 bytes) in /data/local/nagiosxi/cron/perfdataproc.php on line 541
PHP Fatal error:  Allowed memory size of 2147483648 bytes exhausted (tried to allocate 32 bytes) in /data/local/nagiosxi/cron/perfdataproc.php on line 541
PHP Fatal error:  Allowed memory size of 2147483648 bytes exhausted (tried to allocate 32 bytes) in /data/local/nagiosxi/cron/perfdataproc.php on line 541

These errors populate as soon as you try to restart the perfdataproc.php.
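
For reference, the limits quoted in these errors convert to round numbers of MiB, which indicates that each raised memory_limit did take effect and was still exhausted. A quick arithmetic sketch (bytes_to_mib is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# Convert a byte count to whole MiB using shell integer arithmetic.
bytes_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

# The three "Allowed memory size" values from the PHP errors in this thread.
for limit in 268435456 1073741824 2147483648; do
    echo "$limit bytes = $(bytes_to_mib "$limit") MiB"
done
```

This prints 256, 1024, and 2048 MiB, matching the original limit and the two increases.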

Post by **cdienger** » Mon Oct 12, 2020 1:34 pm

How large are the perfdata files under /usr/local/nagios/var/? Try moving them out of the way with:

Code: Select all

systemctl stop nagios
mv /usr/local/nagios/var/host-perfdata ~
mv /usr/local/nagios/var/service-perfdata ~
systemctl start nagios

Post by **nosajche** » Thu Oct 15, 2020 9:18 am

The perfdata files are pretty small:

Code: Select all

$ ll -h | grep perfdata
-rw-r--r-- 1 nagios nagios 6.4K Oct 15 10:10 host-perfdata
-rw-rw-r-- 1 nagios nagios 5.7M Oct 10 14:04 perfdata.log
-rw-r--r-- 1 nagios nagios 116K Oct 15 10:10 service-perfdata

The same cycle keeps happening:

1. PHP runs out of memory (even though the limit has been increased to 2 GB on a VM with 8 GB of memory)
2. perfdataproc.php stops running
3. /var/spool/xidpe/ keeps growing and the files are never processed

Deleting the contents of the xidpe folder and restarting the perfdataproc.php process as the nagios user works for a few hours, and then the cycle repeats.

Post by **tgriep** » Thu Oct 15, 2020 4:41 pm

A ticket is open for this issue, so we'll work through it there. Closing this post.

Nagios Support Forum
