Ahoy folks,
We run multiple instances of Nagios xi to monitor our customer environments.
This includes a split intended to distribute load (e.g. ACTIVE vs PASSIVE checks), where each instance is self-contained.
These were all initially installed on the xi 5.6.x release series, on top of RHEL 7.x virtual machines, with off-box dBs (for the two PROD ones; the third is vendor-provided / on-box).
It is understood that hosting the 3 dBs off-box will add some overhead for various actions (e.g. apply_config will take a little longer, due to the network round-trips involved vs local traffic within a box).
For a variety of reasons we have been keen to upgrade away from xi 5.6.x.
Given this is for enterprise monitoring, we are of course interested in keeping abreast with security updates and bug fixes, to say nothing of benefitting from new features & functionality.
Also, the APP-to-dB interchange subsystem included up to the end of the xi 5.6.x release series has historically led to significant grief and instability (all-day impacts, unhappy users & MGMT, etc.).
Through review of the forums, contact with SUPP (via tickets), and independent research on the KB, we have made extensive attempts at performance tuning to accommodate / optimize the OS and APP for NDO2db throughput.
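For context, the OS-side tuning for NDO2db centers on the SysV message queue limits the broker uses to hand data off to the daemon. A typical sysctl fragment looks like the following (illustrative values only, not our exact settings):

```
# /etc/sysctl.d/99-ndo2db.conf -- illustrative values, tune per your load
kernel.msgmnb = 131072000   # max bytes queued per SysV message queue
kernel.msgmax = 65536       # max size of a single message
kernel.msgmni = 512         # max number of queues system-wide
```

Applied with `sysctl --system` (or a reboot); the queue-size ceiling is usually the one that matters when the broker falls behind.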
It was with no small measure of delight when xi 5.7.x release series was announced (with a much anticipated replacement, NDO3).
Of course xi 5.7.x release series has not been without challenges of its own.
Three attempts were made, with all ultimately failing just short of the finish line, for a variety of reasons.
The accelerated release of xi 5.8.x series had given hope, as previous bugs continued to get fixed and the new sub-systems saw version upgrades (NDO3, CCM, and most recently NRDP).
However, one issue remains throughout that "kills" our upgrade attempts every time.
Specifically, it would appear that the NDO3 sub-system is not scaling well under load in the largest of our PROD deployments (it seems to run fine in our smaller deployments).
Having reviewed the forums, I continue to see references to downgrade back to NDO2db as a recommended "quick fix".
However, I have not observed much in the way of advancement on the topic of improving NDO3 stability / performance (a worthy long-term goal one might think).
Thus, I wished to seek guidance on how to go about tuning / making NDO3 operate better at scale?
In our most recent attempt to upgrade (last night, 5.6.9 --> 5.8.1 as such was available in your [vendor repos](https://repo.nagios.com/?repo=rpm-rhel)), we had challenges but "got there" with regards to xi being upgraded.
This includes the following:
- xi apparently started successfully
- reporting version 5.8.1
- "survived" APP restarts
- dB transactions apparent on (off-box) dB host
- monitor engine running
- monitors scheduled in queue
- monitors updating with results when checks run
However, the following was noted / observed:
- server statistics
- CPU load would spike (~10+), then bottom out (below ~0.5)
- CPU Stats would also indicate spikes
- spikes / troughs coincide with next major point
- monitoring engine check statistics
- monitor queues would process briefly, then "drain"
- for example the "1-min" queues for both ACTIVE HOST && PASSIVE SERVICE checks would "zero out"
- followed by the same happening to the "5-min" queues
- would eventually "self-recover" without intervention, run for a time, then manifest all over again
- monitoring engine process
- held steady (green board) throughout
- system component status
- held steady (green board) throughout
- monitoring engine event queue
- scheduled events over time
- the so-called "banana road" would show peaks and troughs
- would appear to indicate "bursts" or "spurts" of monitors scheduled / executed
- conversely, in xi 5.6.x, this "runs steady" with a generally consistent load average of "250" (not accounting for bursting of events)
- service status
- using test PASSIVE service monitors, it was noted that xi would become extremely "sluggish" and altogether "miss" PASSIVE service state changes
Restarting `nagios` core proc did little to correct the issue.
Nagios, xi, & NDO config files were reviewed, and the settings for the off-box dB were confirmed correct.
Confirmed the NDO3 "broker_module" string was properly defined.
We eventually rolled back the change (reverted to snapshots for the APP && dB nodes, taken prior to the start of the change), and xi was quickly back in PROD service (at xi 5.6.9 levels).
Prior to rolling back, I collected "xi Profile" and full APP dumps (~13 GB tarballs, via `backup_xi.sh`) of both the faulty xi 5.8.1 state and the previous functional xi 5.6.9 state.
In this way, we are preparing now to conduct RCA / post-mortem (as best able) on what went wrong and how we might achieve *lasting success* on the next attempt.
All of this said, what knowledge / solutions / tuning opportunities exist to improve the performance of the NDO3 sub-system for use in a "large" deployment scenario?
---
For clarification, in my particular ENV, the "problem" large instance consists of the following general points:
- VM hosts:
- xi: 6 vCPUs, 32 GB MEM, SAN disks
- dB: 4 vCPUs, 32 GB MEM, SAN disks
- OS: RHEL 7.x without any special customizations, patched regularly
- dB: MariaDB 5.5.64.x series (off-box, managed instance by in-house DBA folks)
- Monitors:
- ACTIVE: ~4,200 HOST monitors (`check-host-alive`, which uses `check_icmp`, for UP/DOWN monitors)
- PASSIVE: ~35,000 SERVICE monitors (`check_dummy`, PASSIVE monitors that receive events from NCPA/NRDP deployed on our monitored hosts plant)
- Notes:
- we are not using `mod_gearman`, as it was found to be ill-suited for our ENV
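For reference, the PASSIVE results above arrive via NRDP. A submission from a monitored host looks roughly like the sketch below (hostname, service name, token, and URL are all placeholders, and exact parameter names can vary by NRDP version; check your NRDP install's docs):

```shell
# Build a hypothetical NRDP check-result payload (all names/values are placeholders).
payload='{"checkresults":[{"checkresult":{"type":"service"},"hostname":"web01","servicename":"disk","state":"0","output":"OK - disk healthy"}]}'
echo "$payload"

# To actually submit it (requires a reachable NRDP endpoint and a valid token):
#   curl -sS -d "token=YOUR_TOKEN" -d "cmd=submitcheck" -d "json=${payload}" "https://xi.example.com/nrdp/"
```

At our volumes (~35,000 PASSIVE services), the aggregate rate of these submissions is what ultimately lands on the broker.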
xi: NDO3 -- instability / poor performance at scale
- Posts: 44
- Joined: Wed Sep 25, 2019 4:17 pm
“And who better understands the Unix-nature?” Master Foo asked.
“Is it he who writes the ten thousand lines, or he who, perceiving the emptiness of the task, gains merit by not coding?”
Master Foo - The ten thousand Lines
Unix Koans of Master Foo
- Dreams In Code
- Posts: 7682
- Joined: Wed Feb 11, 2015 12:54 pm
Re: xi: NDO3 -- instability / poor performance at scale
The new NDO3 has issues on some systems (generally large ones), so I would test xi 5.8.2, which has the latest NDO3 updates in it.

If you still see issues on 5.8.2, then downgrade NDO3 back to NDO2DB (your xi version stays the same) via these instructions, and that should resolve it.

Run these commands as root:

```
systemctl stop nagios
cd /tmp
rm -rf /tmp/nagiosxi
wget https://assets.nagios.com/downloads/nagiosxi/5/xi-5.6.14.tar.gz
tar zxf xi-5.6.14.tar.gz
cd /tmp/nagiosxi
./init.sh
cd /tmp/nagiosxi/subcomponents/ndoutils
./install
systemctl enable ndo2db
```

If you have an offloaded database, you will need to edit your `/usr/local/nagios/etc/ndo2db.cfg` file and update these settings before starting it up (you can get the values from your `/usr/local/nagios/etc/ndo.cfg` or from `/usr/local/nagiosxi/html/config.inc.php`):

```
db_host
db_port
db_user
db_pass
```

Then run this command to start it up:

```
systemctl start ndo2db
```

Then edit your `/usr/local/nagios/etc/nagios.cfg` and make sure this line is uncommented (add it if needed):

```
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
```

Make sure all occurrences of this line are commented:

```
#broker_module=/usr/local/nagios/bin/ndo.so /usr/local/nagios/etc/ndo.cfg
```

Then start the nagios service:

```
systemctl start nagios
```

Then Apply Configuration.
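After switching brokers in either direction, a quick sanity check of which broker module nagios.cfg actually loads can save a restart cycle. A minimal sketch (the path follows the steps above; the script itself is not an xi tool):

```shell
# Report the active (uncommented) broker_module lines in nagios.cfg, if present.
cfg=/usr/local/nagios/etc/nagios.cfg
if [ -f "$cfg" ]; then
  result=$(grep -E '^[[:space:]]*broker_module' "$cfg" || echo "no active broker_module line in nagios.cfg")
else
  result="nagios.cfg not found at $cfg"
fi
echo "$result"
```

Exactly one broker (ndomod.o or ndo.so) should be active at a time; two active `broker_module` lines pointing at different brokers is a common downgrade mistake.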
- Posts: 44
- Joined: Wed Sep 25, 2019 4:17 pm
Re: xi: NDO3 -- instability / poor performance at scale
Thank you for the steps.
FYI, we attempted this upgrade once again (using xi 5.8.2 release for this cycle).
As with prior attempts, all went relatively well until we struck upon our "large" instance (as was described in the initial post).
Initially NDO3 performance seemed to work somewhat better on xi 5.8.2:
- scheduling queues populated / did not zero out as was observed in prior attempts
- nagios core proc cycled noticeably faster than under `ndo2db` (~30 seconds vs ~5 mins)
However, we once again noted "peaks & troughs" in the "Monitor Engine Event Queue / Scheduled Events Over Time" gauge.
We then opted to "downrev" to NDO2DB (via your instructions).
This was not without challenge, as it was found we required two additional packages (gcc, mariadb-devel) & dependencies to complete the build from source.
Once these were in place, we were able to complete the build (it failed at the tail end, since it expected an on-box dB and we use off-box).
All of this said, once "switched over" (as instructed) to the ndo2db broker, the nagios core service entered a state of "perpetual thrashing".
This manifested as follows:
- high CPU load / consumption on the APP node (12.0+ on a 6 vCPU VM)
- large bursting observed in the kernel message queue (203K events that kept recurring, triggering the nagios core proc to SIGTERM and restart, slowly descending over the span of 4+ hours)
- dB appeared largely unimpacted (thread counts did not peak as had been observed in past instances of such load events)
- "flushing" the kernel message queue had little effect
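On the inspection side, the kernel message queues in question are SysV IPC queues, so standard tooling applies (a sketch; queue ids will differ per system):

```shell
# List SysV message queues (ndo2db consumes broker data from one of these).
# Watch the used-bytes / messages columns grow when the daemon falls behind.
ipcs -q

# Removing a stuck queue by id is destructive (drops queued events) -- shown commented:
#   ipcrm -q <msqid>
```

`ipcs -u` also gives a one-line usage summary, which is handy for spotting a queue pinned at its byte limit.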
After many hours of "processing" time, the queue settled and the nagios core came online.
Once it did, the next apply_config set off the same load event all over again.
Switching back to NDO3 broker resolved this almost immediately.
Our users now report "flapping" of the xi-based monitors they have in place.
I suspect this is due to some "uneven" processing of the events queue, as demonstrated by the gauges.
What tuning may we do to stabilize our NDO3 broker on xi 5.8.3?
Attachments:
- xi 5.6.9 "Monitor Engine Event Queue / Scheduled Events Over Time"
- xi 5.8.2 "Monitor Engine Event Queue / Scheduled Events Over Time"
- Dreams In Code
- Posts: 7682
- Joined: Wed Feb 11, 2015 12:54 pm
Re: xi: NDO3 -- instability / poor performance at scale
I would upgrade to xi 5.8.3 and try it out to see if it resolves your issues.
If it doesn't, please create a ticket for this and include a link back to this forum thread so we can get a remote session setup to debug further:
https://support.nagios.com/tickets/
Thank you!