Monitoring engine stops working and graphs not populating

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
nms
Posts: 222
Joined: Wed Sep 28, 2016 9:35 am

Post by nms »

Hi guys,

We are trying to migrate from CentOS 6 to CentOS 7 while keeping the same Nagios XI version (5.7.3).

Following https://support.nagios.com/forum/viewto ... 16&t=62421, we installed a new VM with CentOS 7 and Nagios XI 5.7.3.
We then took a backup on the CentOS 6 server (same Nagios XI version) and restored it on the new VM. Since the restore, the monitoring engine stops working from time to time, which leaves the graphs unpopulated. As per the topic linked above, we did NOT execute the restore_repair.sh script.

Can you kindly guide us on what we need to check in order to solve the issue?

Please also find attached the profile.



Best regards,
nms
vtrac
Posts: 903
Joined: Tue Oct 27, 2020 1:35 pm

Re: Monitoring engine stops working and graphs not populating

Post by vtrac »

Hi nms,
Hope you are having a great day!! ... :-)

I looked at the "profile.zip" and noticed a few things.

You lost connection with your database:
.. <p><pre>SQL Error [nagiosxi] : MySQL server has gone away</pre></p>
<p><pre>SQL Error [nagiosxi] : MySQL server has gone away</pre></p>
<p><pre>SQL Error [nagiosxi] : MySQL server has gone away</pre></p>
. <p><pre>SQL Error [nagiosxi] : MySQL server has gone away</pre></p>
<p><pre>SQL Error [nagiosxi] : MySQL server has gone away</pre></p>


You have lots of INSERT issues:
[1625748999] NDO-3: The following query failed while MySQL appears to be connected:
[1625748999] NDO-3: INSERT INTO nagios_servicechecks (instance_id, start_time, start_time_usec, end_time, end_time_usec, service_object_id, check_type, current_check_attempt, max_check_attempts, state, state_type, timeout, early_timeout, execution_time, latency, return_code, output, long_output, perfdata, command_object_id, command_args, command_line) VALUES (1,FROM_UNIXTIME(1625748996),424887,FROM_UNIXTIME(1625748997),772627,35300,0,1,3,0,1,120,0,1.347740,5.354317,0,'NACK statistics on voicemo for VFNL-WYLS are nack_insf=0:nack_cris=0:nack_nacc=0:nack_nbty=0:nack_nrat=0:nack_wdis=0:nack_tmny=0:nack_nena=0:nack_nbill=0:','','nack_insf=0;nack_cris=0;nack_nacc=0;nack_nbty=0;nack_nrat=0;nack_wdis=0;nack_tmny=0;nack_nena=0;nack_nbill=0;',0,'','') ON DUPLICATE KEY UPDATE instance_id = VALUES(instance_id), start_time = VALUES(start_time), start_time_usec = VALUES(start_time_usec), end_time = VALUES(end_time), end_time_usec = VALUES(end_time_usec), service_object_id = VALUES(service_object_id), check_type = VALUES(check_type), current_check_attempt = VALUES(current_check_attempt), max_check_attempts = VALUES(max_check_attempts), state = VALUES(state), state_type = VALUES(state_type), timeout = VALUES(timeout), early_timeout = VALUES(early_timeout), execution_time = VALUES(execution_time), latency = VALUES(latency), return_code = VALUES(return_code), output = VALUES(output), long_output = VALUES(long_output), perfdata = VALUES(perfdata), command_object_id = VALUES(command_object_id), command_args = VALUES(command_args), command_line = VALUES(command_line)
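The bracketed number at the start of each NDO line is a Unix epoch timestamp, so converting it makes it easier to line these failures up with entries in "/var/log/messages". Here is a minimal sketch; `ndo_ts` is a made-up helper name, and it assumes GNU `date` (for `-d @epoch`), which is what CentOS ships.

```shell
# ndo_ts: read Nagios/NDO log lines on stdin and replace a leading
# "[epoch]" stamp with a readable UTC time. Hypothetical helper name;
# relies on GNU date's "-d @SECONDS" syntax.
ndo_ts() {
    while IFS= read -r line; do
        case $line in
            \[[0-9]*\]*)
                epoch=${line#\[}        # drop the leading "["
                epoch=${epoch%%\]*}     # keep everything before the first "]"
                rest=${line#*\] }       # text after "] "
                printf '%s %s\n' "$(date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S')" "$rest"
                ;;
            *)
                # Lines without an epoch prefix pass through unchanged
                printf '%s\n' "$line"
                ;;
        esac
    done
}
```

For example, `grep 'NDO-3' /usr/local/nagios/var/nagios.log | ndo_ts` (the log path is an assumption about a default install). The stamp 1625748999 from the lines above converts to 2021-07-08 12:56:39 UTC; keep the server's time zone in mind when comparing with the local-time stamps in "/var/log/messages".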


You ran out of memory:
Jul 8 08:56:41 bru-nms-nagios-p kernel: Out of memory: Kill process 20860 (nagios) score 922 or sacrifice child
Jul 8 08:56:41 bru-nms-nagios-p kernel: Killed process 20866 (nagios) total-vm:10844kB, anon-rss:176kB, file-rss:0kB, shmem-rss:0kB
Jul 8 08:56:41 bru-nms-nagios-p kernel: Out of memory: Kill process 20860 (nagios) score 922 or sacrifice child
Jul 8 08:56:41 bru-nms-nagios-p kernel: Killed process 20872 (nagios) total-vm:10844kB, anon-rss:180kB, file-rss:0kB, shmem-rss:0kB
Jul 8 08:56:41 bru-nms-nagios-p kernel: Out of memory: Kill process 20860 (nagios) score 922 or sacrifice child
Jul 8 08:56:41 bru-nms-nagios-p kernel: Killed process 20860 (nagios) total-vm:127009036kB, anon-rss:15040676kB, file-rss:0kB, shmem-rss:0kB


Please try the commands below:

Code: Select all

# Repair the databases, then restart MariaDB
/usr/local/nagiosxi/scripts/repair_databases.sh
systemctl restart mariadb.service

# Stop the web, cron, and monitoring services
systemctl stop httpd
systemctl stop crond
systemctl stop npcd
systemctl stop nagios
# Kill leftover processes and remove stale nagios message queues
pkill -9 -u nagios
pkill -9 -u apache
for i in $(ipcs -q | grep nagios | awk '{print $2}'); do ipcrm -q "$i"; done

# Empty the xi_events, xi_meta, and xi_eventqueue tables
systemctl restart mariadb
echo "truncate table xi_events; truncate table xi_meta; truncate table xi_eventqueue;" | mysql -h 127.0.0.1 -uroot -pnagiosxi nagiosxi

# Bring the services back up
systemctl start nagios
systemctl start npcd
systemctl start crond
systemctl restart httpd

# Remove the stale lock file and run database maintenance
rm -f /usr/local/nagiosxi/var/dbmaint.lock
php /usr/local/nagiosxi/cron/dbmaint.php

As for the out-of-memory issue, I noticed a huge number of checks running at around 08:56 AM today.
You can see this in "/var/log/messages".

Please check why so many checks were running at once. I think that is what caused the system to run out of memory.
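To see which processes the OOM killer actually took out, and how much memory each one held, the "Killed process" lines can be summarized with a small filter. This is only a sketch: `oom_summary` is a made-up helper name, and it assumes the kernel message layout shown in the excerpt above.

```shell
# oom_summary: read kernel log lines on stdin and print, for every
# "Killed process" entry, the PID, the process name, and anon-rss in MB.
# Hypothetical helper; assumes the message format quoted in this thread.
oom_summary() {
    awk '
        /Killed process/ {
            pid = ""; name = ""; rss = ""
            for (i = 1; i <= NF; i++) {
                if ($i == "process") { pid = $(i + 1); name = $(i + 2) }
                if ($i ~ /^anon-rss:/) { rss = $i }
            }
            gsub(/[()]/, "", name)      # "(nagios)" -> "nagios"
            sub(/^anon-rss:/, "", rss)  # strip the field label
            sub(/kB,?$/, "", rss)       # strip the unit and trailing comma
            printf "%s %s %.1f MB\n", pid, name, rss / 1024
        }
    '
}
```

Run it as `oom_summary < /var/log/messages`. On the excerpt above it shows the main nagios process (PID 20860) holding about 14 GB of anonymous memory, while the killed child processes held well under 1 MB each.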


Best Regards,
Vinh
nms
Posts: 222
Joined: Wed Sep 28, 2016 9:35 am

Re: Monitoring engine stops working and graphs not populating

Post by nms »

Hi,

We performed a fresh installation of CentOS 7 with the same Nagios XI version (5.7.3) that runs on the CentOS 6 server.
The restore from backup succeeded and we ran a DB repair, but we ran into these issues:

1. NDO
From the nagios log I see a lot of these messages:

Code: Select all

NDO-3: The following query failed while MySQL appears to be connected:
The database (now MariaDB) runs on the same server; no offloading is involved here.

2. When I checked the Nagios state, it showed as down (killed). I restarted the service, but I see this error:

Code: Select all

: WARNING: RLIMIT_NPROC is 63444, total max estimated processes is 159492! You should increase your limits (ulimit -u, or limits.conf)
3. Although the Nagios process restarts successfully, it still gets killed from time to time, as we can see in the GUI.
4. The bandwidth measurements all dropped to 0. The MRTG cfg file showed the rrdtool Perl library at 5.10.1, but on the new CentOS 7 installation it is 5.16.3. I updated the MRTG cfg file accordingly, but I'm not sure whether I need to restart anything.

5. Graphs are not being populated.

Can you kindly help us solve these issues? We need to make sure Nagios runs fine on CentOS 7 before we move on to upgrading the other instances.

*Note that we have also executed the commands you provided.

Re-attaching the profile.
Rgds,
Matthew
vtrac
Posts: 903
Joined: Tue Oct 27, 2020 1:35 pm

Re: Monitoring engine stops working and graphs not populating

Post by vtrac »

Hi,
Hope you are having a great Monday!! ... :-)

I noticed the following errors:

Code: Select all

<p><pre>SQL Error [nagiosxi] : Table 'nagiosxi.xi_commands' doesn't exist</pre></p>
<p><pre>SQL Error [nagiosxi] : Table 'nagiosxi.xi_sysstat' doesn't exist</pre></p>
<p><pre>SQL Error [nagiosxi] : MySQL server has gone away</pre></p>
I have attached the "xi_573.sql" file.
To load it, assuming you have downloaded "xi_573.sql" and put it under "/tmp":

Code: Select all

cd /tmp
mysql -f -uroot -pnagiosxi nagiosxi < /tmp/xi_573.sql

systemctl restart mariadb

systemctl restart nagios

I also found this post that you can take a look at for increasing the NPROC setting.
https://serverfault.com/questions/62861 ... n-centos-7
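The RLIMIT_NPROC warning compares the current process limit with Nagios's own estimate of how many processes it may need. A quick way to check the gap before and after changing the limit is sketched below; `nproc_gap` is a made-up helper name, and the command in the comment is only one possible way to read the limit of the running daemon.

```shell
# nproc_gap CURRENT NEEDED: report whether the process limit covers the
# estimate printed in the Nagios warning. Hypothetical helper name.
#
# One way to obtain CURRENT for the running daemon (soft limit):
#   awk '/^Max processes/ {print $3}' "/proc/$(pgrep -o -x nagios)/limits"
nproc_gap() {
    current=$1
    needed=$2
    if [ "$current" = "unlimited" ] || [ "$current" -ge "$needed" ]; then
        echo "ok: limit $current covers $needed"
    else
        echo "short by $((needed - current)): limit $current, estimate $needed"
    fi
}
```

With the numbers from your warning, `nproc_gap 63444 159492` reports a shortfall of 96048. If nagios is started by a systemd unit on your system, keep in mind that a `LimitNPROC=` setting in the unit may be what takes effect rather than limits.conf; that part depends on your setup.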


Also, please run the following commands as root and post the output here.

Code: Select all

echo "SELECT table_schema as 'Database', table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES ORDER BY (data_length + index_length) DESC;" |mysql -t -u root -pnagiosxi
Also, please run the command below to check your "max_connections" setting:

Code: Select all

mysql -u root -pnagiosxi -e "show global status like '%used_connections%'; show variables like 'max_connections';"
If your max_connections value is below 151, please see the KB below on how to increase it:
https://support.nagios.com/kb/article/n ... s-513.html
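If you want to turn the two values into a quick headroom check, something like the following could work. It is a sketch only: `conn_headroom` is a made-up helper name, and the `mysql` calls in the comments assume the same root credentials used elsewhere in this thread.

```shell
# conn_headroom MAX USED: warn when max_connections is below 151 and
# show peak usage as a percentage of the limit. Hypothetical helper name.
#
# The inputs could be collected like this (-N suppresses column headers):
#   max=$(mysql -u root -pnagiosxi -N -e "show variables like 'max_connections'" | awk '{print $2}')
#   used=$(mysql -u root -pnagiosxi -N -e "show global status like 'Max_used_connections'" | awk '{print $2}')
conn_headroom() {
    max=$1
    used=$2
    if [ "$max" -lt 151 ]; then
        echo "max_connections=$max is below 151; consider raising it"
    fi
    echo "peak usage: $used of $max ($((used * 100 / max))%)"
}
```

A peak usage close to 100% would suggest the limit itself is being hit, which is one possible reason for lost database connections.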


I also noticed that at around "12:29", you ran out of memory and swap based on the "/var/log/messages":

Code: Select all

Jul 12 12:29:18 bru-nms-nagios-p kernel: Out of memory: Kill process 9305 (nagios) score 874 or sacrifice child
Jul 12 12:29:18 bru-nms-nagios-p kernel: Killed process 9313 (nagios) total-vm:10844kB, anon-rss:220kB, file-rss:0kB, shmem-rss:0kB
Jul 12 12:29:18 bru-nms-nagios-p kernel: systemd-journal invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Jul 12 12:29:18 bru-nms-nagios-p kernel: systemd-journal cpuset=/ mems_allowed=0
It seems that many check_nrpe processes were running. Any idea why they all ran at the same time?
Please check "/var/log/messages" on your Nagios XI system for more details.

If you think you need more memory added, please talk to your system admin.


Best Regards,
Vinh
nms
Posts: 222
Joined: Wed Sep 28, 2016 9:35 am

Re: Monitoring engine stops working and graphs not populating

Post by nms »

Hi,

We found that the OOM events were the Nagios process being killed after it reached 100%. After downgrading NDO-3, everything went back to normal, and the graphs are being updated again.

At this stage, I think we can lock this thread.
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: Monitoring engine stops working and graphs not populating

Post by benjaminsmith »

Hi,
At this stage, I think we can lock this thread.
Sounds good. We'll close this out but feel free to open another post if you have any new questions.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Be sure to check out our Knowledgebase for helpful articles and solutions!