NagiosXI Server High Load

Satyam
Posts: 63
Joined: Mon Oct 24, 2011 8:14 am

Post by Satyam »

Dear Support Team,

We have been running NagiosXI for a year to monitor our entire IT infrastructure, which includes many Windows servers (NT, 2003, 2007), Exchange servers, AD servers, SCCM servers, SharePoint servers, ESX servers and several VM instances, many Unix servers (different flavours: HP-UX, RHEL, AIX), Oracle and MSSQL database servers, many website URLs and portals, network devices such as Cisco routers, switches and network links, and some storage boxes such as EMC CLARiiON.

Infrastructure Details
Total Servers (Windows + Unix) : 489 (includes everything: Oracle, MSSQL, Exchange, AD, SCCM, SharePoint, SAP, ESX, VMs, etc.)
Total URLs : 53
Routers : 167
Switches : 73
Network Links : 305 (as ICMP Ping Service)
Storage Devices : 11


Monitoring Checks Statistics
In total we have:

794 active host checks
7922 active service checks

Server Statistics
All of this runs on a single instance of NagiosXI on a VM host running CentOS 5.6 with a 4-core CPU and 8 GB of RAM.

Tweaks & Performance Enhancement Steps Done
1. I am already using all the large-installation tweaks given in the Nagios and XI documents.
2. Offloaded MySQL to another VM instance running CentOS.
3. rrdcache is already enabled.
4. Using a ramdisk: not feasible for us.
5. Using gearmand to distribute all network host and service checks to another server (all network host and service checks using SNMP execute from there, and the results are submitted back to the main NagiosXI server).

Additional Burden on NagiosXI Server
Apart from these, I am also running the snmptrapd and snmptt daemons to receive traps from network devices and show them as passive check results.

NagiosXI Server Load Statistics
Despite all of this, I am facing a high load on my main NagiosXI server.
Host Check Latency (avg.) : 3000 secs (approx.)
Service Check Latency (avg.) : 3000 secs (approx.)

NagiosXI Server Load Average : 15, 14, 15 (1-, 5- and 15-minute load averages)
CPU system utilization stays at around 80% constantly.
Average memory utilization is around 6 GB out of a total of 8 GB.
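
To put the figures above in per-core terms, here is a minimal Python sketch. It only redoes the arithmetic on the numbers quoted in this post (4 cores, load averages of 15/14/15, roughly 3000 seconds of latency); the function names are illustrative and not part of Nagios XI.

```python
import os

def load_per_core(load_average, cores):
    """Runnable/waiting processes per CPU core; sustained values above 1.0 suggest saturation."""
    return load_average / cores

def seconds_to_minutes(seconds):
    """Convert a check latency in seconds to minutes."""
    return seconds / 60

def current_load_per_core():
    """Same ratio for the live machine (Unix only), using the standard library."""
    one_minute, _, _ = os.getloadavg()
    return load_per_core(one_minute, os.cpu_count())

# Figures quoted in this post
reported = [load_per_core(value, 4) for value in (15, 14, 15)]  # [3.75, 3.5, 3.75]
latency_minutes = seconds_to_minutes(3000)                       # 50.0
```

In other words, each core has between three and four processes queued for it on average, and a check runs about 50 minutes later than it was scheduled.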


Please suggest how I can further optimize my NagiosXI server's performance so that host and service check latency is as low as possible and we can detect incidents and problems proactively.

One more point I have in mind: can we offload the NPCD process for performance graphing to the other server? That could save some critical load. Please share the steps if it can be done.

Thanks,
Sattanathan.S
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: NagiosXI Server High Load

Post by mguthrie »

If it's feasible for you, I would consider moving to a physical hardware box. I think you're running into the limitations of a VM, which has to share a physical disk and CPU. If that's not an option right now, I would strongly recommend implementing RAM disks for at least the status file and the check results.
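
As a rough way to see whether those two locations already sit on a RAM disk, the following Python sketch reads the `status_file` and `check_result_path` directives from the text of a nagios.cfg and looks up the filesystem type of each path in the text of /proc/mounts. The helper names and the sample paths in the comments are hypothetical; this is a sketch under those assumptions, not an official Nagios XI procedure.

```python
# Directives relevant to the RAM disk suggestion; both are standard nagios.cfg settings.
RAMDISK_DIRECTIVES = ("status_file", "check_result_path")

def parse_directives(cfg_text, names=RAMDISK_DIRECTIVES):
    """Return {directive: value} for the requested directives found in nagios.cfg text."""
    found = {}
    for line in cfg_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not key=value
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() in names:
            found[key.strip()] = value.strip()
    return found

def filesystem_type(path, mounts_text):
    """Return the fs type of the longest mount point containing `path`, or None."""
    best_mount, best_type = "", None
    for entry in mounts_text.splitlines():
        fields = entry.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if len(mount_point) > len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type

def ramdisk_report(cfg_text, mounts_text):
    """Map each directive to True when its path is on tmpfs or ramfs."""
    return {
        name: filesystem_type(path, mounts_text) in ("tmpfs", "ramfs")
        for name, path in parse_directives(cfg_text).items()
    }
```

To use it, pass in the contents of your nagios.cfg (commonly /usr/local/nagios/etc/nagios.cfg, though your install may differ) and of /proc/mounts; any directive reported as False is still being written to the shared virtual disk.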