xi instance took a nosedive

cbeattie-unitrends
Posts: 84
Joined: Mon Oct 10, 2016 2:51 pm

Post by cbeattie-unitrends »

Hello,

One of our Nagios xi instances has started acting up. I'm not sure what happened, but there are several symptoms. Nothing shows in the Monitoring Engine Event Queue dashlet, and the Monitoring Engine Check Statistics dashlet shows all zeros. The Monitoring Engine Performance dashlet does show a few values, but the max service check latency is almost an hour (3,445.18 seconds), with an average of 561.66 seconds.

If I restart Nagios (usually with 'systemctl stop nagios && systemctl stop ndo2db && systemctl restart mariadb && systemctl start ndo2db && systemctl start nagios' just to be thorough), the load average can go into the 800s before it calms down. We normally see a flurry of activity on restarts (load average maybe in the 50s or 60s), but they settle down in a few minutes. Our normal load average is 3 to 5 on these 12 vCPU VMs running Nagios xi, and less than 2 on our smaller ones.
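
To put load figures like these in context against the vCPU count, here is a minimal sketch that divides the 1-minute load average by the number of CPUs. The function name `load_per_cpu` is hypothetical, and the parsing assumes the usual English `uptime` wording ("load average: a, b, c").

```shell
#!/usr/bin/env bash
# load_per_cpu CPUS: read `uptime` output on stdin and print the
# 1-minute load average divided by the given CPU count.
# Assumes the "load average: a, b, c" wording of an English-locale uptime.
load_per_cpu() {
    local cpus="$1"
    sed -n 's/.*load average: \([0-9.]*\),.*/\1/p' |
        awk -v n="$cpus" '{ printf "%.2f\n", $1 / (n + 0) }'
}

# Example usage on a live system:
#   uptime | load_per_cpu "$(nproc)"
```

A value well above 1 sustained over several minutes suggests more runnable or blocked work than the CPUs can absorb.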

This xi instance normally monitors about 2,500 hosts and 77,000 services, so I've enabled the large installation tweaks and use a RAM disk. I tuned the kernel message queue parameters, too.

Code:

sysctl -p
kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
kernel.msgmni = 512000
However, after restarts, the queue fills up faster than it can be emptied and eventually it's full.

Code:

ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xbd000040 196609     nagios     600        262144000    256000
I tried doubling the current msgmnb, msgmax, and msgmni values, but all that did was make it take longer for the queue to fill up. I have another instance, about 2/3 the size of this one, with the same installation tweaks. Watching ipcs on that one shows a couple hundred messages, maybe a thousand, and they all get processed within a second or two.
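
For watching how close the queue is to its limit, a rough sketch along these lines may help. It parses `ipcs -q` output and reports used bytes as a percentage of `kernel.msgmnb`. The function name `queue_fill` is hypothetical, and the column positions assume the `ipcs -q` layout shown above.

```shell
#!/usr/bin/env bash
# queue_fill LIMIT: read `ipcs -q` output on stdin and print, for each
# queue, its msqid, message count, and used-bytes as a percentage of
# LIMIT (the kernel.msgmnb value, in bytes).
queue_fill() {
    local limit="$1"
    awk -v limit="$limit" '
        # Data rows start with a hex key such as 0xbd000040;
        # header and separator lines are skipped.
        $1 ~ /^0x[0-9a-fA-F]+$/ {
            printf "%s %d %.0f%%\n", $2, $6, ($5 / (limit + 0)) * 100
        }
    '
}

# Example usage on a live system:
#   ipcs -q | queue_fill "$(sysctl -n kernel.msgmnb)"
```

Running it under `watch` on both instances would show whether the queue drains between bursts or only grows.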

I've tried running the database repair script multiple times. It's not uncommon for the database to need repair in our environment, but not like this. The repair appears to succeed, at least temporarily, but the database log shows the nagios_statehistory table being marked as crashed over and over.

Code:

210428 13:06:03 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.64-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server
210428 16:20:01 [ERROR] mysqld: Table './nagios/nagios_statehistory' is marked as crashed and last (automatic?) repair failed
210428 16:25:01 [ERROR] mysqld: Table './nagios/nagios_statehistory' is marked as crashed and last (automatic?) repair failed
I've tried to limit the databases to keeping no more than 1-7 days ("right now" is more important to us than historical data) to keep them more manageable. But even truncating the statehistory table hasn't fixed this problem. I would be fine with starting over with a new database if I can do that without losing all the configuration.
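
To see at a glance which tables keep getting flagged, a small sketch like the following counts "marked as crashed" errors per table in the MariaDB log. The function name `crashed_tables` is hypothetical, and the pattern assumes the error wording shown in the log excerpt above.

```shell
#!/usr/bin/env bash
# crashed_tables: read a MariaDB error log on stdin and print each table
# reported as "marked as crashed" together with the number of reports.
crashed_tables() {
    # Capture the db/table path from: Table './db/table' is marked as crashed
    sed -n "s/.*Table '\.\/\([^']*\)' is marked as crashed.*/\1/p" |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Example usage (log path as used on this system):
#   crashed_tables < /var/log/mariadb/mariadb.log
```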

This is CentOS 7, so I updated the GRUB options to force an fsck on reboot. Reboots seem to go just fine, so I'm guessing there's no filesystem corruption.

Editing to add: this VM has 10GB of memory assigned to it, and the OOM killer has started taking out mysqld processes. I don't know a lot about databases, but that sounds suboptimal.
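
One way to confirm which processes the OOM killer has been terminating is to summarize the kernel log. This is only a sketch: the function name `oom_victims` is hypothetical, and it assumes the kernel messages contain the wording "Killed process <pid> (<name>)", which should be checked against the actual `dmesg` output on the system.

```shell
#!/usr/bin/env bash
# oom_victims: read kernel log lines on stdin and print each process name
# the OOM killer reported killing, followed by how often it was killed.
# Assumes lines of the form: ... Killed process <pid> (<name>) ...
oom_victims() {
    sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' |
        sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}

# Example usage on a live system:
#   dmesg | oom_victims
#   journalctl -k | oom_victims
```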

I would have attached a profile to this, but this is all this system can produce:

Code:

PROFILE BUILD FAILED
Array
(
)
CODE: 1
If the profile from the other, smaller but otherwise fairly similar installation would be useful, let me know and I'll attach that instead. The struggling system did, eventually, return this when I clicked on View System Info:

Code:

System Profile
A system profile makes it easier for our support techs to understand the system that you are running on. Including a downloaded system profile with your support ticket is always recommended.

 
Nagios xi - System Info
System
Nagios xi version: 5.6.14
Release info: den-nagios.unitrendscloud.com 3.10.0-1062.18.1.el7.x86_64 x86_64
CentOS Linux release 7.7.1908 (Core)
Gnome is not installed
Apache Information
PHP Version: 5.4.16
Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36
Server Name: den-nagios.unitrendscloud.com
Server Address: 10.201.255.14
Server Port: 443
Date/Time
PHP Timezone: UTC
PHP Time: Wed, 28 Apr 2021 17:33:52 +0000
System Time: Wed, 28 Apr 2021 17:33:52 +0000
Nagios xi Data
License ends in: VUVOSM
UUID: f6dc7d1c-b9b0-44f5-892c-d7fd0a105bd1
Install Type: manual/unknown

└─48011 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
└─1599 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
└─47262 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg -f
CPU Load 15: 2.34
Total Hosts: 2496
Total Services: 76811

Function get_base_uri() returns: https://den-nagios.unitrendscloud.com/nagiosxi/
Function get_base_url() returns: https://den-nagios.unitrendscloud.com/nagiosxi/
Function get_backend_url(internal_call=false) returns: https://den-nagios.unitrendscloud.com/nagiosxi/includes/components/profile/profile.php
Function get_backend_url(internal_call=true) returns: https://localhost/nagiosxi/backend/

Ping Test localhost
Running:
/bin/ping -c 3 localhost 2>&1 
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.082 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.047 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.053 ms

--- localhost ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.047/0.060/0.082/0.017 ms
Test wget To localhost
WGET From URL: https://localhost/nagiosxi/includes/components/ccm/
Running:
/usr/bin/wget https://localhost/nagiosxi/includes/components/ccm/ 
--2021-04-28 17:33:54-- https://localhost/nagiosxi/includes/components/ccm/
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:443... connected.
ERROR: cannot verify localhost's certificate, issued by '/C=US/O=DigiCert Inc/CN=DigiCert SHA2 Secure Server CA':
Unable to locally verify the issuer's authority.
ERROR: no certificate subject alternative name matches
requested host name 'localhost'.
To connect to localhost insecurely, use `--no-check-certificate'.
Network Settings
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens192:  mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:ba:3a:4a brd ff:ff:ff:ff:ff:ff
    inet 10.201.255.14/16 brd 10.201.255.255 scope global noprefixroute ens192
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:feba:3a4a/64 scope link
       valid_lft forever preferred_lft forever
3: ens224:  mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:ba:83:03 brd ff:ff:ff:ff:ff:ff
4: ens256:  mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:ba:1c:43 brd ff:ff:ff:ff:ff:ff
    inet 10.163.31.207/23 brd 10.163.31.255 scope global noprefixroute ens256
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:feba:1c43/64 scope link
       valid_lft forever preferred_lft forever

default via 10.201.0.1 dev ens192 proto static metric 100
10.7.3.148 via 10.163.31.1 dev ens256 proto static
10.163.30.0/23 dev ens256 proto kernel scope link src 10.163.31.207 metric 102
10.201.0.0/16 dev ens192 proto kernel scope link src 10.201.255.14 metric 100
10.206.0.0/16 via 10.201.255.230 dev ens192 proto static metric 100


Nagios xi Components
actions	2.2.2
alertcloud	1.2.1
alertstream	2.1.1
autodiscovery	2.2.6
backendapiurl	1.0.5
bandwidthreport	1.8.1
bbmap	1.2.1
birdseye	3.2.4
bulkmodifications	2.2.0
capacityplanning	2.3.0
ccm	3.0.5
custom-includes	1.0.5
customlogin	1.0.0
customlogo	1.2.0
deploydashboard	1.3.0
deploynotification	1.3.3
duo	1.0.2
escalationwizard	1.5.1
freevariabletab	1.1.0
globaleventhandler	1.3.0
googlemap	1.6.2
graphexplorer	2.3.0
helpsystem	2.0.1
highcharts	
homepagemod	1.1.11
hypermap	1.2.1
hypermap_replay	1.2.0
isms	1.2.3
latestalerts	1.2.7
ldap_ad_integration	1.1.2
map	1.0.0
massacknowledge	2.2.2
massimmediatecheck	1.0.2
metrics	1.3.4
minemap	1.2.5
msp	1.2.0
mtr	1.0.2
nagiosbpi	2.8.3
nagioscore	
nagioscorecfg	
nagiosim	2.2.7
nagiosna	1.4.1
nagiosql	
nagvis	2.0.4
nocscreen	1.3.3
nrdsconfigmanager	1.6.8
nxti	1.0.3
opscreen	1.8.0
perfdata	
pingaction	1.1.2
pnp	
profile	1.4.1
proxy	1.1.5
rdp	1.0.5
rename	1.7.0
scheduledbackups	1.2.0
scheduledreporting	
similetimeline	1.5.1
snmptrapsender	1.6.2
statusmap	1.0.3
tracerouteaction	1.1.2
twilio	1.0.0
usermacros	1.1.0
xicore	
Nagios xi Config Wizards
activedirectory	1.3.4
ec2	1.1.3
s3	1.1.2
java_tomcat	1.1.0
autodiscovery	1.4.2
bpiwizard	1.1.5
bulkhostimport	2.1.3
capacity-planning	1.0.1
dhcp	1.1.6
dnsquery	1.1.5
digitalocean	1.0.2
docker	1.1.2
domain_expiration	1.1.6
email-delivery	2.0.5
esensors_websensor	1.1.6
exchange	1.3.3
ftpserver	1.5.7
folder_watch	1.0.6
genericnetdevice	1.0.4
java_glassfish	1.1.0
google-cloud	1.0.2
hyperv	1.0.2
java_jboss	1.1.0
java_jetty	1.1.0
ldapserver	1.3.4
linode	1.0.2
linux_snmp	1.5.8
linux-server	1.5.8
mssql_database	1.6.4
mssql_query	1.6.7
mssql_server	1.9.2
macosx	1.3.3
mailserver	1.2.6
microsoft-azure	1.0.2
mongodb_database	1.1.4
mongodbserver	1.1.4
mountpoint	1.0.3
mysqlquery	1.2.4
mysqlserver	1.3.4
ncpa	2.2.4
nrpe	1.5.3
nagioslogserver	1.0.7
nna	1.0.7
nagiosxiserver	1.3.2
nagiostats	1.2.3
switch	2.5.2
oraclequery	1.3.8
oracleserverspace	1.5.8
oracletablespace	1.5.9
passivecheck	1.2.5
postgresdb	1.5.4
postgresquery	1.2.4
postgresserver	1.3.5
printer	1.1.4
radiusserver	2.0.3
rackspace	1.0.2
sla	1.3.4
snmp	1.6.5
snmp_trap	1.5.4
snmpwalk	2.0.0
sshproxy	1.5.8
solaris	1.3.2
tcpudpport	1.3.4
tftp	1.0.3
passiveobject	1.1.3
vmware	1.7.3
watchguard	1.4.6
webtransaction	1.2.6
java_weblogic	1.1.0
website	1.4.1
website_defacement	1.2.2
websiteurl	1.4.0
windowsdesktop	1.6.4
windowseventlog	2.0.1
windowssnmp	1.5.6
windowsserver	1.6.4
windowswmi	2.2.0
Nagios xi Dashlets
alertcloud	
bbmap	
capacityplanning	
graphexplorer	
hypermap	
latestalerts	
metrics	
metricsguage	
minemap	
xicore_xi_news_feed	
xicore_getting_started	
xicore_admin_tasks	
xicore_eventqueue_chart	
xicore_component_status	
xicore_server_stats	
xicore_monitoring_stats	
xicore_monitoring_perf	
xicore_monitoring_process	
xicore_perfdata_chart	
xicore_host_status_summary	
xicore_service_status_summary	
xicore_comments	
xicore_hostgroup_status_overview	
xicore_hostgroup_status_grid	
xicore_servicegroup_status_overview	
xicore_servicegroup_status_grid	
xicore_hostgroup_status_summary	
xicore_servicegroup_status_summary	
xicore_available_updates	
xicore_network_outages	
xicore_network_outages_summary	
xicore_network_health	
xicore_host_status_tac_summary	
xicore_service_status_tac_summary	
xicore_feature_status_tac_summary	
availability	
custom_dashlet	1.0.6
gauges	1.2.2
googlemapdashlet	1.1.0
internettrafficreport	
rss_dashlet	1.1.3
sansrisingports	2.0
sla	
worldtimeserver	2.0.0
dchurch
Posts: 858
Joined: Wed Oct 07, 2020 12:46 pm
Location: Yo mama

Re: xi instance took a nosedive

Post by dchurch »

What version of Nagios xi are you running? I ask because you mention ndo2db. ndo2db (sometimes called the Database Backend) is no longer needed in newer (>=5.7.0) versions of Nagios xi.

ndo2db is our older technology that listens on a UNIX socket for database inserts, then handles the actual insertion into the database. Its main limit is that it runs into issues when it tries to insert more than the database can handle. In newer versions (Nagios xi 5.7.0 and later), it was replaced by writing directly to the database from the Nagios worker threads. Besides handling more database inserts, this brought an overall performance boost, too.
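
To make the version cutoff concrete, here is a minimal sketch that checks whether a given version falls on the ndo2db side of 5.7.0. Both function names (`version_ge`, `uses_ndo2db`) are hypothetical helpers, not part of Nagios xi, and the comparison relies on the `-V` option of GNU `sort`.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed when dotted version A is greater than or equal to B.
# Relies on the -V (version sort) option of GNU sort.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# uses_ndo2db VERSION: print "yes" for versions older than 5.7.0 (the
# cutoff described in the post above) and "no" otherwise.
uses_ndo2db() {
    if version_ge "$1" "5.7.0"; then
        echo "no"
    else
        echo "yes"
    fi
}

# Example usage:
#   uses_ndo2db "5.6.14"
#   uses_ndo2db "5.8.3"
```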

What's the output from the following command?

Code:

mysql -uroot -pnagiosxi --table <<< 'select * from (select table_name, round(((data_length + index_length) / 1024 / 1024), 2) as sz from information_schema.tables where table_schema like '\''nagios%'\'') as x order by x.sz;'
If you didn't get an 8% raise over the course of the pandemic, you took a pay cut.

Discussion of wages is protected speech under the National Labor Relations Act, and no employer can tell you you can't disclose your pay with your fellow employees.
cbeattie-unitrends
Posts: 84
Joined: Mon Oct 10, 2016 2:51 pm

Re: xi instance took a nosedive

Post by cbeattie-unitrends »

The version is 5.6.14. Here's the output of the command. An update is definitely in order, but the size of the nagios_statehistory table sure does look suspicious right now.

Code:

+--------------------------------------------+--------+
| table_name                                 | sz     |
+--------------------------------------------+--------+
| nagios_statehistory                        |   NULL |
| nagios_programstatus                       |   0.00 |
| tbl_mainmenu                               |   0.00 |
| tbl_lnkContacttemplateToContacttemplate    |   0.00 |
| tbl_lnkHosttemplateToHosttemplate          |   0.00 |
| nagios_contactgroups                       |   0.00 |
| tbl_lnkServiceescalationToContact          |   0.00 |
| tbl_lnkHostescalationToContactgroup        |   0.00 |
| tbl_logbook                                |   0.00 |
| tbl_lnkContacttemplateToContactgroup       |   0.00 |
| tbl_lnkHosttemplateToHostgroup             |   0.00 |
| nagios_contactgroup_members                |   0.00 |
| nagios_hostescalation_contactgroups        |   0.00 |
| tbl_lnkHostescalationToContact             |   0.00 |
| nagios_systemcommands                      |   0.00 |
| nagios_hostdependencies                    |   0.00 |
| tbl_lnkHostdependencyToHostgroup_H         |   0.00 |
| tbl_lnkTimeperiodToTimeperiod              |   0.00 |
| tbl_lnkContacttemplateToCommandService     |   0.00 |
| nagios_timeperiods                         |   0.00 |
| tbl_lnkHosttemplateToHost                  |   0.00 |
| nagios_contact_addresses                   |   0.00 |
| tbl_lnkServicedependencyToService_S        |   0.00 |
| nagios_hostchecks                          |   0.00 |
| tbl_lnkHostdependencyToHostgroup_DH        |   0.00 |
| tbl_lnkServicetemplateToVariabledefinition |   0.00 |
| tbl_lnkContacttemplateToCommandHost        |   0.00 |
| nagios_timeperiod_timeranges               |   0.00 |
| tbl_lnkHosttemplateToContactgroup          |   0.00 |
| tbl_timedefinition                         |   0.00 |
| tbl_lnkServicedependencyToService_DS       |   0.00 |
| nagios_host_parenthosts                    |   0.00 |
| tbl_lnkHostdependencyToHost_H              |   0.00 |
| tbl_lnkContactgroupToContactgroup          |   0.00 |
| nagios_timedevents                         |   0.00 |
| tbl_lnkHosttemplateToContact               |   0.00 |
| tbl_lnkServicedependencyToHostgroup_H      |   0.00 |
| nagios_host_contacts                       |   0.00 |
| tbl_lnkHostdependencyToHost_DH             |   0.00 |
| nagios_servicegroups                       |   0.00 |
| nagios_instances                           |   0.00 |
| tbl_lnkServicetemplateToServicegroup       |   0.00 |
| tbl_lnkContactgroupToContact               |   0.00 |
| nagios_timedeventqueue                     |   0.00 |
| tbl_lnkHostgroupToHostgroup                |   0.00 |
| tbl_submenu                                |   0.00 |
| tbl_lnkHostToVariabledefinition            |   0.00 |
| nagios_servicegroup_members                |   0.00 |
| tbl_lnkServicetemplateToHostgroup          |   0.00 |
| tbl_lnkContactToVariabledefinition         |   0.00 |
| tbl_lnkHostgroupToHost                     |   0.00 |
| nagios_configfiles                         |   0.00 |
| tbl_settings                               |   0.00 |
| tbl_lnkServicedependencyToHostgroup_DH     |   0.00 |
| tbl_lnkServicetemplateToHost               |   0.00 |
| tbl_lnkContactToContacttemplate            |   0.00 |
| nagios_comments                            |   0.00 |
| tbl_session_locks                          |   0.00 |
| tbl_lnkServicedependencyToHost_H           |   0.00 |
| nagios_serviceescalations                  |   0.00 |
| tbl_lnkServicetemplateToContactgroup       |   0.00 |
| tbl_lnkContactToContactgroup               |   0.00 |
| tbl_session                                |   0.00 |
| tbl_lnkServicedependencyToHost_DH          |   0.00 |
| nagios_serviceescalation_contacts          |   0.00 |
| nagios_hostgroups                          |   0.00 |
| tbl_hostextinfo                            |   0.00 |
| nagios_eventhandlers                       |   0.00 |
| tbl_lnkHostToHost                          |   0.00 |
| nagios_serviceescalation_contactgroups     |   0.00 |
| tbl_lnkServicetemplateToContact            |   0.00 |
| tbl_lnkContactToCommandService             |   0.00 |
| tbl_lnkServiceToServicegroup               |   0.00 |
| tbl_lnkHostToContactgroup                  |   0.00 |
| nagios_servicedependencies                 |   0.00 |
| nagios_hostescalations                     |   0.00 |
| tbl_lnkServicegroupToServicegroup          |   0.00 |
| tbl_lnkContactToCommandHost                |   0.00 |
| tbl_lnkServiceToHostgroup                  |   0.00 |
| tbl_serviceextinfo                         |   0.00 |
| nagios_dbversion                           |   0.00 |
| tbl_lnkHostToContact                       |   0.00 |
| nagios_servicechecks                       |   0.00 |
| nagios_hostescalation_contacts             |   0.00 |
| tbl_lnkServicegroupToService               |   0.00 |
| tbl_hostdependency                         |   0.00 |
| nagios_service_parentservices              |   0.00 |
| tbl_lnkServiceescalationToService          |   0.00 |
| tbl_lnkServiceescalationToHostgroup        |   0.00 |
| tbl_lnkServiceToContactgroup               |   0.00 |
| tbl_servicedependency                      |   0.00 |
| nagios_contacts                            |   0.00 |
| tbl_lnkServiceescalationToHost             |   0.00 |
| tbl_lnkHostescalationToHostgroup           |   0.00 |
| nagios_scheduleddowntime                   |   0.00 |
| tbl_lnkServiceToContact                    |   0.00 |
| nagios_contactstatus                       |   0.00 |
| tbl_lnkHostescalationToHost                |   0.00 |
| nagios_runtimevariables                    |   0.00 |
| tbl_lnkContacttemplateToVariabledefinition |   0.00 |
| tbl_lnkHosttemplateToVariabledefinition    |   0.00 |
| tbl_lnkServiceescalationToContactgroup     |   0.00 |
| tbl_contact                                |   0.01 |
| tbl_variabledefinition                     |   0.01 |
| nagios_contact_notificationcommands        |   0.01 |
| tbl_user                                   |   0.01 |
| tbl_timeperiod                             |   0.01 |
| xi_sysstat                                 |   0.01 |
| tbl_lnkServicetemplateToServicetemplate    |   0.01 |
| nagios_conninfo                            |   0.01 |
| nagios_configfilevariables                 |   0.01 |
| xi_users                                   |   0.01 |
| tbl_hostgroup                              |   0.01 |
| nagios_externalcommands                    |   0.01 |
| tbl_lnkServiceToVariabledefinition         |   0.01 |
| tbl_hostescalation                         |   0.01 |
| tbl_servicegroup                           |   0.01 |
| xi_commands                                |   0.01 |
| tbl_hosttemplate                           |   0.01 |
| tbl_serviceescalation                      |   0.01 |
| tbl_domain                                 |   0.01 |
| tbl_contacttemplate                        |   0.01 |
| tbl_contactgroup                           |   0.01 |
| tbl_lnkServicedependencyToServicegroup_S   |   0.02 |
| tbl_lnkServicedependencyToServicegroup_DS  |   0.02 |
| tbl_lnkServiceToServicetemplate            |   0.02 |
| xi_incidents                               |   0.02 |
| tbl_lnkServiceescalationToServicegroup     |   0.02 |
| tbl_lnkServiceToHost                       |   0.02 |
| tbl_permission                             |   0.02 |
| tbl_permission_inactive                    |   0.02 |
| xi_sessions                                |   0.03 |
| nagios_commands                            |   0.03 |
| tbl_servicetemplate                        |   0.03 |
| xi_auth_tokens                             |   0.03 |
| tbl_command                                |   0.04 |
| xi_options                                 |   0.04 |
| xi_mibs                                    |   0.05 |
| tbl_lnkHostToHostgroup                     |   0.06 |
| tbl_lnkHostToHosttemplate                  |   0.07 |
| xi_usermeta                                |   0.07 |
| tbl_service                                |   0.08 |
| nagios_host_contactgroups                  |   0.10 |
| nagios_hostgroup_members                   |   0.10 |
| xi_auditlog                                |   0.11 |
| xi_cmp_trapdata_log                        |   0.11 |
| nagios_contactnotificationmethods          |   0.12 |
| tbl_info                                   |   0.13 |
| nagios_contactnotifications                |   0.13 |
| nagios_customvariables                     |   0.19 |
| nagios_acknowledgements                    |   0.35 |
| nagios_processevents                       |   0.38 |
| tbl_host                                   |   0.38 |
| nagios_customvariablestatus                |   0.51 |
| xi_cmp_trapdata                            |   0.52 |
| nagios_service_contactgroups               |   0.62 |
| nagios_hosts                               |   0.63 |
| xi_events                                  |   0.83 |
| xi_eventqueue                              |   0.83 |
| nagios_service_contacts                    |   0.93 |
| nagios_hoststatus                          |   1.43 |
| nagios_services                            |   8.43 |
| nagios_notifications                       |  11.57 |
| nagios_objects                             |  11.71 |
| nagios_downtimehistory                     |  12.56 |
| xi_meta                                    |  15.68 |
| nagios_flappinghistory                     |  24.88 |
| nagios_servicestatus                       |  41.25 |
| nagios_commenthistory                      |  72.04 |
| nagios_logentries                          | 228.79 |
+--------------------------------------------+--------+
cbeattie-unitrends
Posts: 84
Joined: Mon Oct 10, 2016 2:51 pm

Re: xi instance took a nosedive

Post by cbeattie-unitrends »

I took a snapshot of the VM, stopped Nagios xi, ran a database repair so that nagios_statehistory was 0 instead of NULL, and then ran the script to upgrade to 5.8.3. The upgrade went smoothly. However, within a few minutes, nagios_statehistory has gone back to NULL and ipcs -q just shows the message queue is full again.

Code:

+--------------------------------------------+--------+
| table_name                                 | sz     |
+--------------------------------------------+--------+
| nagios_statehistory                        |   NULL |
| nagios_hostescalation_contactgroups        |   0.00 |
| tbl_lnkContacttemplateToContactgroup       |   0.00 |
dchurch
Posts: 858
Joined: Wed Oct 07, 2020 12:46 pm
Location: Yo mama

Re: xi instance took a nosedive

Post by dchurch »

If you PM me a system profile I can diagnose further. Get one by going to Admin (top menu) => System Profile (in the left menu), then clicking the blue button.

If you're unable to generate the profile through the web interface, please try generating it from the command line by running these commands as root:

Code:

rm -rf /usr/local/nagiosxi/var/components/profile*
/usr/local/nagiosxi/scripts/components/getprofile.sh SUPPORT
Then send me the resulting /usr/local/nagiosxi/var/components/profile.zip file.
If the profile script fails, please include the ENTIRE output.
dchurch
Posts: 858
Joined: Wed Oct 07, 2020 12:46 pm
Location: Yo mama

Re: xi instance took a nosedive

Post by dchurch »

Your /var/nagiosramdisk is 100% full. Try increasing the memory allocated to /var/nagiosramdisk, disabling the RAM disk, or manually removing some of the files there.
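
To catch a full RAM disk before it stalls the system, a check along these lines could be run periodically. This is a sketch only: the function name `mounts_over` is hypothetical, and the column positions assume POSIX-format `df -P` output.

```shell
#!/usr/bin/env bash
# mounts_over THRESHOLD: read `df -P` output on stdin and print the mount
# point and usage of every filesystem at or above THRESHOLD percent.
mounts_over() {
    local threshold="$1"
    awk -v t="$threshold" '
        NR > 1 {                      # skip the df header line
            pct = $5
            sub(/%/, "", pct)         # "100%" -> "100"
            if (pct + 0 >= t + 0) print $6, $5
        }
    '
}

# Example usage on a live system:
#   df -P /var/nagiosramdisk | mounts_over 90
```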

Honestly, you may be able to get better performance on the perfdata inserts if you upgrade to Nagios xi 5.7.0 or later.
cbeattie-unitrends
Posts: 84
Joined: Mon Oct 10, 2016 2:51 pm

Re: xi instance took a nosedive

Post by cbeattie-unitrends »

I doubled the size of the RAM disk to 1GB. The system is still struggling to run:

Code:

[root@den-nagios ~]# df -Ph | grep nagios
tmpfs                1.0G  308M  717M  31% /var/nagiosramdisk
[root@den-nagios ~]# uptime
 18:32:42 up  5:34,  5 users,  load average: 128.47, 457.47, 649.16
This is after running the database repair script, so none of the tables right now are showing up in /var/log/mariadb/mariadb.log as being crashed.

I already upgraded to Nagios xi 5.8.3, as I noted on April 29th. I take it that means ipcs -q should no longer have anything to display? The queue I saw right after the upgrade may have been left over from before it. There is nothing shown by ipcs -q now, and there have been reboots since the upgrade.
dchurch
Posts: 858
Joined: Wed Oct 07, 2020 12:46 pm
Location: Yo mama

Re: xi instance took a nosedive

Post by dchurch »

Looks like mysqld is the culprit here. What's the output from this command?

Code:

mysql -uroot -pnagiosxi -e "show full processlist;"
Something else you should do:

On long-running systems with mucho checks, the database can get bogged down with excessive "paper trail" type data, and the software's database queries don't properly utilize indexes. Lowering the retention thresholds should get performance back where it should be:

Open Admin => Performance Settings, then click on the Databases tab. Change the following settings:
- Max Log Entries Age: change to 10
- Max Audit Log Age: change to 10
- Max State History Age: change to 30

It might take up to a day for the "cleaner" process to run depending on how your system is configured, but it'll eventually run and clean your database of all these for you.
cbeattie-unitrends
Posts: 84
Joined: Mon Oct 10, 2016 2:51 pm

Re: xi instance took a nosedive

Post by cbeattie-unitrends »

I've updated the Max Audit Log Age to 10. The other two logs you mentioned I had already cut down to 7.

The raw output from the command the first time I ran it was about 15K, which seemed like a bit much to paste in here. I've attached it to this update instead.

For no reason that I can discern, the load average has plummeted from the multiple hundreds to less than one recently (like, within a few minutes). The process list is also a lot tidier right now, so here's the output from a moment ago:

Code:

[root@den-nagios ~]# uptime; mysql -uroot -pnagiosxi -e "show full processlist;"
 14:22:56 up 1 day,  1:25,  5 users,  load average: 0.97, 0.94, 19.40
+--------+----------+-----------+----------+---------+------+--------------+-----------------------------------------------------------------------+----------+
| Id     | User     | Host      | db       | Command | Time | State        | Info                                                                  | Progress |
+--------+----------+-----------+----------+---------+------+--------------+-----------------------------------------------------------------------+----------+
| 285399 | nagiosxi | localhost | nagiosxi | Sleep   | 3286 |              | NULL                                                                  |    0.000 |
| 285400 | ndoutils | localhost | nagios   | Sleep   | 3286 |              | NULL                                                                  |    0.000 |
| 285401 | nagiosql | localhost | nagiosql | Sleep   | 3286 |              | NULL                                                                  |    0.000 |
| 286194 | nagiosxi | localhost | nagiosxi | Sleep   |   20 |              | NULL                                                                  |    0.000 |
| 286195 | ndoutils | localhost | nagios   | Sleep   |   20 |              | NULL                                                                  |    0.000 |
| 286196 | nagiosql | localhost | nagiosql | Sleep   |   20 |              | NULL                                                                  |    0.000 |
| 286704 | nagiosxi | localhost | nagiosxi | Sleep   |   29 |              | NULL                                                                  |    0.000 |
| 286705 | ndoutils | localhost | nagios   | Sleep   |   29 |              | NULL                                                                  |    0.000 |
| 286706 | nagiosql | localhost | nagiosql | Sleep   |   29 |              | NULL                                                                  |    0.000 |
| 286809 | nagiosxi | localhost | nagiosxi | Sleep   |   16 |              | NULL                                                                  |    0.000 |
| 286810 | ndoutils | localhost | nagios   | Sleep   |   16 |              | NULL                                                                  |    0.000 |
| 286811 | nagiosql | localhost | nagiosql | Sleep   |   16 |              | NULL                                                                  |    0.000 |
| 287391 | nagiosxi | localhost | nagiosxi | Sleep   |   23 |              | NULL                                                                  |    0.000 |
| 287392 | ndoutils | localhost | nagios   | Sleep   |   23 |              | NULL                                                                  |    0.000 |
| 287393 | nagiosql | localhost | nagiosql | Sleep   |   23 |              | NULL                                                                  |    0.000 |
| 287474 | nagiosxi | localhost | nagiosxi | Sleep   |    5 |              | NULL                                                                  |    0.000 |
| 287475 | ndoutils | localhost | nagios   | Sleep   |    5 |              | NULL                                                                  |    0.000 |
| 287476 | nagiosql | localhost | nagiosql | Sleep   |    5 |              | NULL                                                                  |    0.000 |
| 287568 | nagiosxi | localhost | nagiosxi | Sleep   |    5 |              | NULL                                                                  |    0.000 |
| 287569 | ndoutils | localhost | nagios   | Sleep   |    5 |              | NULL                                                                  |    0.000 |
| 287570 | nagiosql | localhost | nagiosql | Sleep   |    5 |              | NULL                                                                  |    0.000 |
| 287696 | nagiosxi | localhost | nagiosxi | Sleep   |   11 |              | NULL                                                                  |    0.000 |
| 287697 | ndoutils | localhost | nagios   | Sleep   |   11 |              | NULL                                                                  |    0.000 |
| 287698 | nagiosql | localhost | nagiosql | Sleep   |   11 |              | NULL                                                                  |    0.000 |
| 287699 | nagiosxi | localhost | nagiosxi | Sleep   |   20 |              | NULL                                                                  |    0.000 |
| 287700 | ndoutils | localhost | nagios   | Sleep   |   20 |              | NULL                                                                  |    0.000 |
| 287701 | nagiosql | localhost | nagiosql | Sleep   |   20 |              | NULL                                                                  |    0.000 |
| 287752 | nagiosxi | localhost | nagiosxi | Sleep   |   29 |              | NULL                                                                  |    0.000 |
| 287753 | ndoutils | localhost | nagios   | Sleep   |   28 |              | NULL                                                                  |    0.000 |
| 287754 | nagiosql | localhost | nagiosql | Sleep   |   29 |              | NULL                                                                  |    0.000 |
| 287838 | nagiosxi | localhost | nagiosxi | Sleep   |   20 |              | NULL                                                                  |    0.000 |
| 287839 | nagiosxi | localhost | nagiosxi | Sleep   |    7 |              | NULL                                                                  |    0.000 |
| 287840 | ndoutils | localhost | nagios   | Sleep   |    7 |              | NULL                                                                  |    0.000 |
| 287841 | ndoutils | localhost | nagios   | Sleep   |   20 |              | NULL                                                                  |    0.000 |
| 287842 | nagiosql | localhost | nagiosql | Sleep   |    7 |              | NULL                                                                  |    0.000 |
| 287843 | nagiosql | localhost | nagiosql | Sleep   |   20 |              | NULL                                                                  |    0.000 |
| 287852 | nagiosxi | localhost | nagiosxi | Sleep   |    5 |              | NULL                                                                  |    0.000 |
| 287853 | ndoutils | localhost | nagios   | Sleep   |   10 |              | NULL                                                                  |    0.000 |
| 287854 | nagiosql | localhost | nagiosql | Sleep   |   55 |              | NULL                                                                  |    0.000 |
| 287855 | nagiosxi | localhost | nagiosxi | Sleep   |    0 |              | NULL                                                                  |    0.000 |
| 287856 | nagiosxi | localhost | nagiosxi | Sleep   |   13 |              | NULL                                                                  |    0.000 |
| 287857 | ndoutils | localhost | nagios   | Sleep   |   55 |              | NULL                                                                  |    0.000 |
| 287858 | ndoutils | localhost | nagios   | Sleep   |   55 |              | NULL                                                                  |    0.000 |
| 287859 | nagiosql | localhost | nagiosql | Sleep   |   54 |              | NULL                                                                  |    0.000 |
| 287860 | nagiosql | localhost | nagiosql | Sleep   |   55 |              | NULL                                                                  |    0.000 |
| 287861 | nagiosxi | localhost | nagiosxi | Sleep   |    4 |              | NULL                                                                  |    0.000 |
| 287862 | ndoutils | localhost | nagios   | Sleep   |   54 |              | NULL                                                                  |    0.000 |
| 287863 | nagiosql | localhost | nagiosql | Sleep   |   54 |              | NULL                                                                  |    0.000 |
| 287870 | nagiosxi | localhost | nagiosxi | Query   |    0 | Sending data | SELECT * FROM xi_meta WHERE metatype_id='1' AND metaobj_id='45735978' |    0.000 |
| 287871 | ndoutils | localhost | nagios   | Sleep   |   54 |              | NULL                                                                  |    0.000 |
| 287872 | nagiosql | localhost | nagiosql | Sleep   |   54 |              | NULL                                                                  |    0.000 |
| 287873 | nagiosxi | localhost | nagiosxi | Sleep   |    1 |              | NULL                                                                  |    0.000 |
| 287874 | ndoutils | localhost | nagios   | Sleep   |   54 |              | NULL                                                                  |    0.000 |
| 287875 | nagiosql | localhost | nagiosql | Sleep   |   54 |              | NULL                                                                  |    0.000 |
| 287884 | nagiosxi | localhost | nagiosxi | Sleep   |    5 |              | NULL                                                                  |    0.000 |
| 287885 | ndoutils | localhost | nagios   | Sleep   |    5 |              | NULL                                                                  |    0.000 |
| 287886 | nagiosql | localhost | nagiosql | Sleep   |    5 |              | NULL                                                                  |    0.000 |
| 287895 | root     | localhost | NULL     | Query   |    0 | NULL         | show full processlist                                                 |    0.000 |
+--------+----------+-----------+----------+---------+------+--------------+-----------------------------------------------------------------------+----------+
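A listing this long is easier to read as a count per user and command. The following is a minimal sketch, not part of the original post: `summarize_processlist` is a hypothetical helper name, and it assumes the input is the bordered table that the `mysql` client prints for `show full processlist`, with the columns in the order shown above.

```shell
# Hypothetical helper: count processlist rows per (user, command).
# Assumes the bordered table layout printed by the mysql client:
# | Id | User | Host | db | Command | Time | State | Info | Progress |
summarize_processlist() {
  awk -F'|' '
    # Data rows start with "|" and have a numeric Id in the second field;
    # this skips the +---+ borders and the header row.
    /^\|/ && $2 ~ /^ *[0-9]+ *$/ {
      user = $3; cmd = $6
      gsub(/ /, "", user); gsub(/ /, "", cmd)
      count[user " " cmd]++
    }
    END { for (k in count) print count[k], k }
  ' | sort -k2,2 -k3,3
}

# Usage (not run here; prompts for the root password):
#   mysql -uroot -p --table -e 'show full processlist' | summarize_processlist
```

If nearly everything comes back as `Sleep`, as in the listing above, the database is mostly idle rather than blocked on long-running queries.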
The nagios_statehistory table has gone back to size NULL, so the low load average is suspicious to me. The broken state is very responsive, at least!

Code: Select all

[nagios@den-nagios .ssh]$ mysql -uroot -pnagiosxi --table <<< 'select * from (select table_name, round(((data_length + index_length) / 1024 / 1024), 2) as sz from information_schema.tables where table_schema like '\''nagios%'\'') as x order by x.sz;'
+--------------------------------------------+--------+
| table_name                                 | sz     |
+--------------------------------------------+--------+
| nagios_statehistory                        |   NULL |
| tbl_lnkServicedependencyToHost_DH          |   0.00 |
| nagios_customvariables                     |   0.00 |
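A NULL size in `information_schema.tables` often means the server could not read the table's metadata, which is typical of a crashed table, so it can help to list only those rows. This is a rough sketch rather than anything from the thread: `null_size_tables` is a hypothetical helper name, and it assumes the two-column `table_name | sz` output produced by the query above.

```shell
# Hypothetical helper: print the names of tables whose size column is NULL.
# Assumes two-column "| table_name | sz |" output from the mysql client.
null_size_tables() {
  awk -F'|' '
    /^\|/ {
      name = $2; size = $3
      gsub(/ /, "", name); gsub(/ /, "", size)
      # The header row has "sz" here, so only real NULL rows match.
      if (size == "NULL") print name
    }
  '
}

# Usage (not run here): pipe the size query from the post into the helper, e.g.
#   mysql -uroot -p --table < size_query.sql | null_size_tables
# where size_query.sql is an assumed file holding that SELECT statement.
```

Any table it prints is a candidate for `CHECK TABLE`, for example against `nagios.nagios_statehistory`, assuming the table lives in the `nagios` schema.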
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: xi instance took a nosedive

Post by ssax »

Go to Admin > Performance Settings > Databases tab and set all three of the Optimize Intervals to 300.

Repair the DB tables again and run this command to clear out some temporary tables:

Code: Select all

echo "truncate table xi_events; truncate table xi_meta; truncate table xi_eventqueue;" | mysql -h 127.0.0.1 -uroot -pnagiosxi nagiosxi
What is the output of this command?

Code: Select all

ls -lh /var/lib/mysql/nagios