Slow NagiosXI VM

Fred Kroeger · Post by **Fred Kroeger** » Thu Dec 15, 2011 10:05 pm

My NagiosXI VM is performing poorly. Latency is now over a minute for host and service checks.
I've made all the standard config changes to improve performance, but they made no difference.
I'm running the latest XI-1.8 VM on an ESXi 4.0 host.

I had a test XI-1.7 VM installed on the ESX server, then created a new XI-1.8 VM and imported the configs across. Admittedly there were far fewer hosts, but I did notice an increase in load between the two VMs.

We now have 4 CPUs allocated to this VM but, interestingly, it seems to perform the same if I reduce the CPU count to 2 or 3.

The poor performance is most noticeable in Graph Explorer: it takes 45 secs to bring up the first page.

Code: Select all

Nagios Stats 3.2.3
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 10-03-2010
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /var/nagiosramdisk/status.dat
Status File Age:                        0d 0h 0m 9s
Status File Version:                    3.2.3

Program Running Time:                   0d 0h 25m 53s
Nagios PID:                             9434
Used/High/Total Command Buffers:        0 / 0 / 4096

Total Services:                         5322
Services Checked:                       5322
Services Scheduled:                     5322
Services Actively Checked:              5322
Services Passively Checked:             0
Total Service State Change:             0.000 / 42.760 / 0.061 %
Active Service Latency:                 0.357 / 142.563 / 89.404 sec
Active Service Execution Time:          0.048 / 60.023 / 1.257 sec
Active Service State Change:            0.000 / 42.760 / 0.061 %
Active Services Last 1/5/15/60 min:     575 / 3344 / 5251 / 5318
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              5090 / 48 / 67 / 117
Services Flapping:                      4
Services In Downtime:                   0

Total Hosts:                            654
Hosts Checked:                          654
Hosts Scheduled:                        654
Hosts Actively Checked:                 654
Host Passively Checked:                 0
Total Host State Change:                0.000 / 23.680 / 0.062 %
Active Host Latency:                    0.000 / 121.324 / 86.525 sec
Active Host Execution Time:             0.039 / 10.374 / 0.231 sec
Active Host State Change:               0.000 / 23.680 / 0.062 %
Active Hosts Last 1/5/15/60 min:        22 / 282 / 633 / 654
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  649 / 5 / 0
Hosts Flapping:                         1
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     71 / 438 / 1450
   Scheduled:                           44 / 295 / 1006
   On-demand:                           27 / 143 / 444
   Parallel:                            45 / 300 / 1019
   Serial:                              0 / 0 / 0
   Cached:                              26 / 138 / 431
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  765 / 3537 / 11250
   Scheduled:                           765 / 3537 / 11250
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0
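
In the latency lines above, the three values are minimum / maximum / average, so the average service latency here is 89.404 sec. A minimal sketch for pulling that figure out of saved nagiostats output, so it can be compared before and after a change. The helper name `avg_latency` is hypothetical, and it assumes the text layout shown in this post.

```shell
#!/bin/sh
# avg_latency LABEL: read nagiostats text on stdin and print the average
# (third) value of the matching "min / max / avg sec" line.
# Hypothetical helper; assumes the output layout shown in this thread.
avg_latency() {
    awk -v label="$1" -F':' '
        $1 == label {
            # $2 looks like "   0.357 / 142.563 / 89.404 sec"
            n = split($2, parts, "/")
            if (n == 3) {
                sub(/sec/, "", parts[3])      # drop the unit
                gsub(/[ \t]/, "", parts[3])   # drop padding
                print parts[3]
            }
        }'
}
```

Something like `/usr/local/nagios/bin/nagiostats | avg_latency "Active Service Latency"` would then print a single number (the binary path is assumed from the other paths in this thread).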

nagios.cfg

Code: Select all

[root@nagios-wp2 html]# cat /usr/local/nagios/etc/nagios.cfg
# MODIFIED
admin_email=root@localhost
admin_pager=root@localhost
translate_passive_host_checks=1
log_event_handlers=0
use_large_installation_tweaks=1
enable_environment_macros=0


# NDOUtils module
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg

###object_cache_file=/var/nagiosramdisk/objects.cache

# PNP settings - bulk mode with NCPD
process_performance_data=1
# service performance data
#service_perfdata_file=/usr/local/nagios/var/service-perfdata
service_perfdata_file=/var/nagiosramdisk/service-perfdata
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tSERVICEOUTPUT::$SERVICEOUTPUT$
service_perfdata_file_mode=a
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=process-service-perfdata-file-bulk
# host performance data
#host_perfdata_file=/usr/local/nagios/var/host-perfdata
host_perfdata_file=/var/nagiosramdisk/host-perfdata
host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tHOSTOUTPUT::$HOSTOUTPUT$
host_perfdata_file_mode=a
host_perfdata_file_processing_interval=15
host_perfdata_file_processing_command=process-host-perfdata-file-bulk


# OBJECTS - UNMODIFIED
#cfg_file=/usr/local/nagios/etc/objects/commands.cfg
#cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
#cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
#cfg_file=/usr/local/nagios/etc/objects/templates.cfg
#cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg


# STATIC OBJECT DEFINITIONS (THESE DON'T GET EXPORTED/IMPORTED BY NAGIOSQL)
cfg_dir=/usr/local/nagios/etc/static

# OBJECTS EXPORTED FROM NAGIOSQL
cfg_file=/usr/local/nagios/etc/contacttemplates.cfg
cfg_file=/usr/local/nagios/etc/contactgroups.cfg
cfg_file=/usr/local/nagios/etc/contacts.cfg
cfg_file=/usr/local/nagios/etc/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/commands.cfg
cfg_file=/usr/local/nagios/etc/hostgroups.cfg
cfg_file=/usr/local/nagios/etc/servicegroups.cfg
cfg_file=/usr/local/nagios/etc/hosttemplates.cfg
cfg_file=/usr/local/nagios/etc/servicetemplates.cfg
cfg_file=/usr/local/nagios/etc/servicedependencies.cfg
cfg_file=/usr/local/nagios/etc/serviceescalations.cfg
cfg_file=/usr/local/nagios/etc/hostdependencies.cfg
cfg_file=/usr/local/nagios/etc/hostescalations.cfg
cfg_file=/usr/local/nagios/etc/hostextinfo.cfg
cfg_file=/usr/local/nagios/etc/serviceextinfo.cfg
cfg_dir=/usr/local/nagios/etc/hosts
cfg_dir=/usr/local/nagios/etc/services

# GLOBAL EVENT HANDLERS
global_host_event_handler=xi_host_event_handler
global_service_event_handler=xi_service_event_handler



# UNMODIFIED
accept_passive_host_checks=1
accept_passive_service_checks=1
additional_freshness_latency=15
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180
bare_update_check=0
cached_host_check_horizon=30
###cached_host_check_horizon=15
cached_service_check_horizon=15
check_external_commands=1
check_for_orphaned_hosts=1
check_for_orphaned_services=1
#check_for_updates=1
check_for_updates=0
#check_host_freshness=0
check_host_freshness=1
check_result_path=/usr/local/nagios/var/spool/checkresults
#check_result_reaper_frequency=10
check_result_reaper_frequency=5
#check_result_reaper_frequency=5
check_service_freshness=1
command_check_interval=-1
command_file=/usr/local/nagios/var/rw/nagios.cmd
daemon_dumps_core=0
date_format=us
debug_file=/usr/local/nagios/var/nagios.debug
debug_level=0
#debug_verbosity=1
debug_verbosity=0
enable_embedded_perl=1
enable_event_handlers=1
enable_flap_detection=1
enable_notifications=1
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
event_broker_options=-1
event_handler_timeout=30
execute_host_checks=1
execute_service_checks=1
external_command_buffer_slots=4096
high_host_flap_threshold=20.0
high_service_flap_threshold=20.0
host_check_timeout=30
host_freshness_check_interval=60
###host_freshness_check_interval=90
host_inter_check_delay_method=s
illegal_macro_output_chars=`~$&|'"<>
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
interval_length=60
lock_file=/usr/local/nagios/var/nagios.lock
log_archive_path=/usr/local/nagios/var/archives
log_external_commands=0
log_file=/usr/local/nagios/var/nagios.log
log_host_retries=1
log_initial_states=0
log_notifications=1
log_passive_checks=0
log_rotation_method=d
log_service_retries=1
low_host_flap_threshold=5.0
low_service_flap_threshold=5.0
max_check_result_file_age=3600
#max_check_result_reaper_time=30
max_check_result_reaper_time=15
#max_check_result_reaper_time=15
##max_concurrent_checks=0
max_concurrent_checks=0
#max_concurrent_checks=90
max_debug_file_size=1000000
max_host_check_spread=30
max_service_check_spread=30
nagios_group=nagios
nagios_user=nagios
notification_timeout=30
#object_cache_file=/usr/local/nagios/var/objects.cache
object_cache_file=/var/nagiosramdisk/objects.cache
obsess_over_hosts=0
obsess_over_services=0
ocsp_timeout=5
p1_file=/usr/local/nagios/bin/p1.pl
passive_host_checks_are_soft=0
perfdata_timeout=5
###perfdata_timeout=15
precached_object_file=/usr/local/nagios/var/objects.precache
resource_file=/usr/local/nagios/etc/resource.cfg
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
retained_host_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_service_attribute_mask=0
retain_state_information=1
retention_update_interval=60
service_check_timeout=60
service_freshness_check_interval=60
service_inter_check_delay_method=s
service_interleave_factor=s
sleep_time=0.25
soft_state_dependencies=0
state_retention_file=/usr/local/nagios/var/retention.dat
#status_file=/usr/local/nagios/var/status.dat
status_file=/var/nagiosramdisk/status.dat
status_update_interval=10
temp_file=/usr/local/nagios/var/nagios.tmp
temp_path=/tmp
use_aggressive_host_checking=0
use_embedded_perl_implicitly=1
use_regexp_matching=0
use_retained_program_state=1
use_retained_scheduling_info=1
use_syslog=1
use_true_regexp_matching=0

mguthrie · Post by **mguthrie** » Fri Dec 16, 2011 10:36 am

Fred Kroeger wrote: "Most noticeable in poor performance is graph Explorer - it takes 45 secs to bring up the first page."

Ouch! Gonna have to reexamine that code...

First, let's make sure the backends aren't having any problems, since that would cause a HUGE performance hit.
http://assets.nagios.com/downloads/nagi ... tabase.pdf

Then let's clean up Postgres:

Code: Select all

psql nagiosxi nagiosxi
vacuum;
vacuum analyze;
vacuum full;
\q

It looks like you've already done some performance tuning on this machine. Make sure you're using the "unified" dashlets in the Admin->Performance Settings page, and try notching up the "Dashlet refresh multiplier" if you haven't already.

Fred Kroeger · Post by **Fred Kroeger** » Sun Dec 18, 2011 8:35 pm

All pages have already been set to Unified and the dashlet multiplier is set at 2000.

I've run the Repair previously (a few times now) but ran it again just now. It exits OK with a "Repair Complete".

I hadn't run the Postgres cleanup before, so I ran it just now. All vacuum statements report the same warnings:

Code: Select all

# psql nagiosxi nagiosxi
psql (8.4.9)

nagiosxi=> vacuum;
WARNING:  skipping "pg_database" --- only superuser can vacuum it
WARNING:  skipping "pg_authid" --- only superuser can vacuum it
WARNING:  skipping "pg_tablespace" --- only superuser can vacuum it
WARNING:  skipping "pg_pltemplate" --- only superuser can vacuum it
WARNING:  skipping "pg_shdepend" --- only superuser can vacuum it
WARNING:  skipping "pg_shdescription" --- only superuser can vacuum it
WARNING:  skipping "pg_auth_members" --- only superuser can vacuum it
VACUUM
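
Those warnings only say that the shared system catalogs were skipped because `nagiosxi` is not a superuser; the final `VACUUM` line shows the rest of the database was processed. If the catalogs should be covered as well, a sketch along these lines could be run as a database superuser. It assumes the superuser role is named `postgres` and that `vacuumdb` is on the PATH. `vacuum_cmd` is a hypothetical helper that only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# vacuum_cmd DB [ROLE]: print (do not run) a vacuumdb invocation for DB.
# ROLE defaults to "postgres", which is an assumption about this install.
vacuum_cmd() {
    db="$1"
    role="${2:-postgres}"
    printf 'vacuumdb --username=%s --dbname=%s --analyze\n' "$role" "$db"
}

# Show the command for the database used in this thread.
vacuum_cmd nagiosxi
```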

I tried booting the OS to the previous version (2.6.32-71.29.1) and performance improved (especially HTTP). However, after running it over the weekend, I discovered that it wasn't scheduling a lot of the services and wouldn't display the RRD graphs. So I have booted back to the current OS version (2.6.32-131.17.1) and all is running again, but we're back to verrrryyyy slowwww......

BTW, as you can see from the config file, I am running a RAM disk as well:

Code: Select all

# ls -la /var/nagiosramdisk/
total 14584
drwxrwxrwt   2 root   root       120 Dec 19 09:29 .
drwxr-xr-x. 19 root   root      4096 Nov  7 15:54 ..
-rw-rw-r--   1 nagios users     4648 Dec 19 09:30 host-perfdata
-rw-r--r--   1 nagios nagios 5868597 Dec 19 09:09 objects.cache
-rw-rw-r--   1 nagios users    51212 Dec 19 09:30 service-perfdata
-rw-r--r--   1 nagios users  8967847 Dec 19 09:29 status.dat
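
A directory listing alone does not show whether the path is memory-backed or just an ordinary directory on disk. A quick sketch for checking the filesystem type, assuming GNU coreutils `stat` is available; `fs_type` and `is_ramdisk` are hypothetical helper names.

```shell
#!/bin/sh
# fs_type PATH: print the type of the filesystem holding PATH (GNU stat).
fs_type() {
    stat -f -c %T "$1"
}

# is_ramdisk PATH: succeed if PATH lives on tmpfs or ramfs.
is_ramdisk() {
    case "$(fs_type "$1")" in
        tmpfs|ramfs) return 0 ;;
        *)           return 1 ;;
    esac
}
```

For example, `is_ramdisk /var/nagiosramdisk && echo "memory-backed"` prints the message only when the mount is tmpfs or ramfs.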

Getting pretty desperate now, as I need to run a demo with the client and I don't want to show it in its current state or tell them that "it will get better in the next release".

regards... Fred

mguthrie · Post by **mguthrie** » Mon Dec 19, 2011 10:39 am

I just remembered, 1.8 had an updated version of the NPCD daemon underneath. If you've made any config changes under /usr/local/nagios/etc/pnp you may have to update them for the 1.8 install. I would check /usr/local/nagios/var/perfdata.log and /usr/local/nagios/var/npcd.log for any errors related to performance data. If there's a permissions issue, that could create a big hit on CPU load on a larger install. Are your performance graphs updating OK on the 1.8 system?

I would also check the /usr/local/nagios/etc/pnp/process_perfdata.cfg and the /usr/local/nagios/etc/pnp/process_perfdata.cfg files and make sure logging is set to 0.
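
A small sketch for reading a single setting out of one of the PNP config files without opening an editor. `cfg_get` is a hypothetical helper; the key name `LOG_LEVEL` used in the usage note is an assumption based on stock PNP4Nagios configs and should be checked against the actual file.

```shell
#!/bin/sh
# cfg_get KEY FILE: print the value of an uncommented "KEY = value" line.
# Spaces around "=" are tolerated; if KEY appears twice, the last one wins.
cfg_get() {
    awk -v key="$1" '
        /^[ \t]*#/ { next }                    # skip comment lines
        {
            pos = index($0, "=")
            if (pos == 0) next                 # not a key/value line
            k = substr($0, 1, pos - 1)
            v = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)     # trim key
            gsub(/^[ \t]+|[ \t]+$/, "", v)     # trim value
            if (k == key) found = v
        }
        END { if (found != "") print found }' "$2"
}
```

Usage would look like `cfg_get LOG_LEVEL /usr/local/nagios/etc/pnp/process_perfdata.cfg`.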

Fred Kroeger · Post by **Fred Kroeger** » Mon Dec 19, 2011 11:50 pm

I am thinking that most of my issues are a CPU resourcing problem from the ESX host.
Yes the performance graphs are all updating OK.
However, a couple of interesting points. When I made some changes to the VM config and rebooted, everything ran well. Latency was down to 3 secs (previously 210 secs).
I discovered later that the npcd lock file hadn't been deleted when it rebooted, so npcd failed to start. When I restarted npcd, performance dropped back down again.
So if npcd isn't running, Nagios performs quite well! I also noticed that I was getting some errors in perfdata.log:

Code: Select all

2011-12-20 12:30:01 [4993] [0] *** TIMEOUT: Timeout after 10 secs. ***
2011-12-20 12:30:01 [4993] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2011-12-20 12:30:01 [4993] [0] *** TIMEOUT: Please check your npcd.cfg
2011-12-20 12:30:01 [4993] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//service-perfdata.1324355379-PID-4993 deleted
2011-12-20 12:30:01 [4993] [0] *** Timeout while processing Host: "WP-7" Service: "WP-7_-_Check_Availability"
2011-12-20 12:30:01 [4993] [0] *** process_perfdata.pl terminated on signal ALRM

I've changed sleep_time to 10 (instead of 15) in npcd.cfg and those warnings are no longer appearing.
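
To see whether the timeouts cluster on a few services, the "Timeout while processing" lines from perfdata.log can be tallied. A minimal sketch, assuming the log line format shown above; `timeouts_by_service` is a hypothetical helper name.

```shell
#!/bin/sh
# timeouts_by_service: read perfdata.log text on stdin and print
# "COUNT HOST/SERVICE" for each service that hit the timeout, busiest first.
timeouts_by_service() {
    # Splitting on double quotes puts the host in $2 and the service in $4.
    awk -F'"' '
        /Timeout while processing Host:/ { hits[$2 "/" $4]++ }
        END { for (k in hits) print hits[k], k }' | sort -rn
}
```

It could be fed with `timeouts_by_service < /usr/local/nagios/var/perfdata.log`.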

Not sure if you were emphasising the following config file?

mguthrie wrote: "I would also check the /usr/local/nagios/etc/pnp/process_perfdata.cfg and the /usr/local/nagios/etc/pnp/process_perfdata.cfg files and make sure logging is set to 0."

However, logging is set to 0.

BTW, the slow load of the Graph Explorer page seems to be related to the number of hosts/services that need to be displayed.

mguthrie · Post by **mguthrie** » Tue Dec 20, 2011 11:21 am

Fred Kroeger wrote: "I am thinking that most of my issues are a CPU resourcing problem from the ESX host."

With a larger install like yours, the word on the street is that a hardware box will still vastly outperform a VM if you really have to push it. Take a look at the first few slides of this presentation:
http://exchange.nagios.org/directory/Mu ... ny/details

Fred Kroeger wrote: "I've changed the sleep_time=10 (instead of 15) in npcd.cfg and those warnings are no longer appearing"

Good to know for future reference. I'd also suggest keeping an eye on the /usr/local/nagios/var/spool/perfdata directory to make sure those files are being cleaned up regularly. Note that if NPCD has been stopped for a while, it will have some catching up to do while it reaps those files, so you will notice a hit on CPU until it gets caught up.

Fred Kroeger wrote: "BTW the slow load of the graph Explorer page seems to be related to the number of Hosts/Services that need to be displayed"

Yeah, I might have to rethink that first graph that's displayed, and re-examine the data fetch for larger installs. I appreciate the heads up on the load times though, that's one of those things that doesn't show up very clearly in a test environment ; )

Fred Kroeger · Post by **Fred Kroeger** » Wed Dec 21, 2011 2:49 am

I started playing around with the npcd process, trying to renice it, but I inadvertently stopped it from running. While npcd wasn't running, my average latency went down to ~2 secs.
Once I realised what I'd done and fixed it, the latency went back through the roof (>160 secs).

You mentioned previously that you had made changes to the npcd daemon in 1.8? Can you confirm that it is working OK? Perhaps on a lightly loaded box you would not notice this issue?
The difference in times above is quite dramatic.

regards... Fred

mguthrie · Post by **mguthrie** » Wed Dec 21, 2011 11:04 am

Hi Fred,

We did actually test the new pnp update on a test box with about 4000 services running, and I had two other users with larger installs test the new version as well. Larger installs were actually having a problem with npcd crashing because of a memory leak related to directory scans. The latest version is considered stable, so my guess is that the problem lies elsewhere.

I'm still concerned there might be a permissions problem somewhere with the performance graphs. Let's make sure those are taken care of:

Code: Select all

chmod -R +x /usr/local/nagios/share/perfdata
chown -R nagios:nagios /usr/local/nagios/share/perfdata
chown nagios:nagios /usr/local/nagios/var/service-perfdata
chown nagios:nagios /usr/local/nagios/var/host-perfdata
chmod 664 /usr/local/nagios/var/service-perfdata
chmod 664 /usr/local/nagios/var/host-perfdata
chown nagios:nagios /usr/local/nagios/var/spool/perfdata
chmod 775 /usr/local/nagios/var/spool/perfdata

Check the /usr/local/nagios/var/spool/perfdata directory for old files. If you have old/stale files in there I would suggest clearing them, and then make sure the directory is getting cleaned up every 10-20 seconds.
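
One way to check for stale spool files is to count those that have not been touched for a while. A minimal sketch, assuming `find` supports `-maxdepth` and `-mmin` (GNU and BSD find do); `stale_count` is a hypothetical helper name.

```shell
#!/bin/sh
# stale_count DIR MINUTES: count regular files directly inside DIR that
# were last modified more than MINUTES minutes ago.
stale_count() {
    find "$1" -maxdepth 1 -type f -mmin +"$2" | wc -l | tr -d ' '
}
```

If npcd is keeping up, `stale_count /usr/local/nagios/var/spool/perfdata 1` should stay at or near 0. A number that keeps growing suggests files are piling up faster than they are reaped.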

Fred Kroeger · Post by **Fred Kroeger** » Wed Dec 21, 2011 9:26 pm

I haven't had any issues with my performance graphs; however, I ran the commands you listed so that you can be satisfied they are not causing any problems.
There has not been any difference in host and service latency (~70 secs).
Note that I have set the host-perfdata and service-perfdata files to be on a ramdisk. These files are being moved regularly to /usr/local/nagios/share/perfdata and from there the files are being cleared out regularly. There are no old/stale files in /usr/local/nagios/share/perfdata and the rrd files in /usr/local/nagios/share/perfdata are being updated OK.

I accept that hosting the Nagios server as an ESX VM is not optimal, but I find the performance degradation quite dramatic.
The only clear indication I have at the moment is that when I stop npcd, the latency starts dropping down to 1 sec.

mguthrie · Post by **mguthrie** » Thu Dec 22, 2011 11:27 am

I do agree that there is most likely a performance hang-up somewhere, and it's not just a matter of lacking physical hardware. Just for fun, let's try restoring the reaper defaults and see if that makes a difference. Those settings can be a balancing act between saving and burning CPU, and it's possible they don't need to be tuned for your system.

Code: Select all

check_result_reaper_frequency=10
max_check_result_reaper_time=30
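
The nagios.cfg posted earlier carries both commented and active copies of these settings, so it is easy to edit the wrong line by hand. A sketch of a helper that rewrites only the active line and keeps a backup. `set_cfg` is a hypothetical name, and `sed -i` assumes GNU sed.

```shell
#!/bin/sh
# set_cfg KEY VALUE FILE: rewrite the active "KEY=..." line in FILE,
# keeping a FILE.bak copy first. Commented-out lines are left alone.
# KEY is assumed to contain only letters, digits and underscores.
set_cfg() {
    cp -p "$3" "$3.bak" &&
    sed -i "s|^$1=.*|$1=$2|" "$3"
}
```

For example, `set_cfg check_result_reaper_frequency 10 /usr/local/nagios/etc/nagios.cfg`, followed by `/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg` to verify the config before restarting Nagios.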

Do you have any error messages in the nagios.log file that could reveal anything?

If the CPU load isn't through the roof when npcd is running, then there's something that's holding up the main nagios loop and increasing the latency. Do you have large amounts of notifications going out lately?

It might also be worth checking that you don't have multiple instances of Nagios running; if you do, that could cause a big increase in latency.

Code: Select all

service nagios stop
killall -9 nagios
service nagios start
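
Before killing anything, it may help to look for duplicate daemons first. A rough sketch that filters `ps` output for processes named `nagios` whose parent is PID 1. This rests on the assumption that a daemonized Nagios is reparented to init while its check children normally have the daemon as their parent. Transient check processes can occasionally match too, so PIDs that persist across several runs are the ones worth a look. `nagios_daemons` is a hypothetical helper name.

```shell
#!/bin/sh
# nagios_daemons: read "PID PPID COMMAND" rows on stdin and print the PIDs
# of processes named "nagios" whose parent is PID 1.
# Heuristic only: see the assumption stated in the text above.
nagios_daemons() {
    awk '$3 == "nagios" && $2 == 1 { print $1 }'
}
```

Run it as `ps -eo pid,ppid,comm | nagios_daemons`. A single PID, matching the one in /usr/local/nagios/var/nagios.lock, is the expected result.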
