Is this normal?

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
niebais
Posts: 349
Joined: Tue Apr 13, 2010 2:15 pm

Re: Is this normal?

Post by niebais »

daytonjones wrote:I'm currently evaluating XI (2009R1.1E.) and have seen a few issues - not sure if it's just my setup or what.

1) When viewing the "Tactical Overview" and clicking on a problem, nothing shows up. If I then click on the "Status Summary" desklet I can get to the problem resource.
2) If I leave my Nagios Core configs in etc/static, my several hundred hosts and several thousand services appear and seem ok. If I instead import the configs, only about 3/4 show up.
3) Most troubling, I removed the VM and re-installed and tried using the config wizards to create my configs from scratch. I've added 30 switches (nothing else) and the load on the box is steady at 30 (System CPU is >80%)

I set use_large_installation_tweaks=1 and restarted but it appeared to have no effect. Suggestions? What do I need to check?
I already verified this step. I didn't see any effect.
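One way to double-check that the directive is really present and uncommented in the file being read is to parse the config text directly. This is only a minimal sketch, not part of Nagios XI: the helper name `read_directive` is hypothetical, and the sample string stands in for the contents of the real main config file.

```python
def read_directive(cfg_text, name):
    """Return every value assigned to `name` in key=value style config text.

    Blank lines and lines starting with '#' are skipped. A list is returned
    so that a missing directive ([]) or a duplicated one (2+ items) is visible.
    """
    values = []
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, val = line.partition("=")
        if sep and key.strip() == name:
            values.append(val.strip())
    return values


# Sample text standing in for the real config file (an assumption).
sample = """
# performance tuning
use_large_installation_tweaks=1
enable_environment_macros=0
"""

print(read_directive(sample, "use_large_installation_tweaks"))
```

An empty list, or a list with more than one entry, would be worth ruling out before looking elsewhere for the cause.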
niebais
Posts: 349
Joined: Tue Apr 13, 2010 2:15 pm

Re: Nagios XI without VM???

Post by niebais »

mmestnik wrote:The biggest issue is that the load average is not directly related to something worth our attention, like CPU temp or the RPM meter on a car. High values are simply an indicator; by themselves they are meaningless. They indicate that you !!might!! have something needlessly wasting CPU time. If you can't discover the process doing so, you should pat your server on the back for being such a hard worker and let it get back to making you money. Putting some of the load on another server is an option, but these days the reverse is happening and server load is being pooled into a single object.

The /smallest/ load average is 1min, in that time did only 1 out of 11 jobs that were in the run queue actually run? During this minute what were the 11 processes? Were they the same processes for the whole minute or did several CPU heavy jobs stop and different jobs replace them in the run queue during that time?

The biggest question is, aside from a longer login time did you notice any performance problems during your shell session?

One thing to watch out for is an ever-climbing load average, though I've never seen it. This is where the CPU load is somehow self-perpetuating. Once again there is going to be some process or chain of processes causing this.

It's wasteful if the value is not above one for most of the day. Think of CPU time as a human resource: can you afford to have an employee with nothing to do for most of the day?

On most systems several hundred applications can run through the time span of a whole minute, and the Linux kernel might be able to scale to many, many more. You did say this was a quad-core, so the load average per core is actually not 11; it's about 3. 3 is a very manageable number for a single-core machine.

This whole CPU and memory bean counting isn't ever about high or low; it's all about proper allocation of resources. If resources are not being wasted needlessly, then feel better about using the next minute's CPU time for a job that was in the run queue this minute (it's not like we humans can detect a delay of a few hundred clock cycles at 1 GHz) and letting memory exist in slower disk storage (most of the time, more than a small amount of memory does not actively get used).
I think a more useful metric is whether the box slows down while the load is high; that's a pretty good indicator there's a problem. That's exactly what we're experiencing right now. As for the system, we have plenty of resources, memory, etc. Our problem is with the GUI. Things run really well when people aren't logged in.
mmestnik
Posts: 972
Joined: Mon Feb 15, 2010 2:23 pm

Re: Need help with high load

Post by mmestnik »

We have noticed this as well. The current suggestion on the table is to schedule users' Ajax requests serially. This would cause a complete outage for most users until their request came to the top of the queue.

I see this as preferable to the current situation, where everyone gets degraded service and no one is happy, though I'm alone in this thinking currently.

Once this is in place classifying requests into a Fairness Queue à la SFQ
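For readers unfamiliar with SFQ (stochastic fairness queueing), the idea can be sketched roughly as follows: requests are hashed by client into a fixed number of buckets, and the buckets are served round-robin so one busy client cannot starve the rest. This is an illustrative sketch only and not Nagios XI code; the class name `FairQueue`, the `hash_fn` parameter, and the client names are all hypothetical, and the periodic hash perturbation that real SFQ uses is left out.

```python
import zlib
from collections import deque


class FairQueue:
    """Simplified SFQ-style queue: hash clients into buckets, serve round-robin."""

    def __init__(self, buckets=8, hash_fn=None):
        self.queues = [deque() for _ in range(buckets)]
        self.next_bucket = 0
        # Default hash is CRC32 of the client id; injectable for testing.
        self.hash_fn = hash_fn or (lambda client: zlib.crc32(client.encode()))

    def put(self, client_id, request):
        idx = self.hash_fn(client_id) % len(self.queues)
        self.queues[idx].append((client_id, request))

    def get(self):
        """Return the next (client_id, request), or None if everything is empty."""
        n = len(self.queues)
        for offset in range(n):
            idx = (self.next_bucket + offset) % n
            if self.queues[idx]:
                # Resume the scan after this bucket next time.
                self.next_bucket = (idx + 1) % n
                return self.queues[idx].popleft()
        return None


# Demo with a fixed bucket assignment so the order is predictable.
demo = FairQueue(buckets=4, hash_fn=lambda client: {"alice": 0, "bob": 1}[client])
for req in ("a1", "a2", "a3"):
    demo.put("alice", req)
demo.put("bob", "b1")

order = []
while True:
    item = demo.get()
    if item is None:
        break
    order.append(item[1])
print(order)
```

In the demo, the single request from the second client is served after the first client's first request rather than behind all three of them.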
niebais
Posts: 349
Joined: Tue Apr 13, 2010 2:15 pm

Re: Need help with high load

Post by niebais »

Hmm, I don't think I like scheduling Ajax. One of the big benefits of the Nagios XI system is the performance graphs. We were excited about those because they provide information we need about our systems. I consider this problem a bug, and it needs to be addressed. In my mind, there's no need for a page to do that much with the database.
mmestnik
Posts: 972
Joined: Mon Feb 15, 2010 2:23 pm

Re: Need help with high load

Post by mmestnik »

It might be easy to convince me of this. However, we might intend our product to scale to Wall Street stock-ticker environments with billions of Ajax clients. Given this, the issue to tackle now is mitigating high-load situations by enforcing soft and hard limits with throttling. This, in the long run, will provide greater benefit.

Slimming down the code would mean reducing features. The process here would be to identify the features and classify them by resource usage. This alone will take some doing, and things don't get easy from there. On the other hand, there is something that can be done underneath the current solution, thus improving the foundation and allowing for a faster turnaround.
tonyyarusso
Posts: 1128
Joined: Wed Mar 03, 2010 12:38 pm
Location: St. Paul, MN, USA

Re: Need help with high load

Post by tonyyarusso »

Neither of your understandings of load metrics is correct.

Load on a *nix system is a measurement of CPU usage, based on whether a request can run or has to wait first. The best way to understand it is with example numbers:
On a single-core system, a load of 1 means the processor is exactly at capacity.
On a single-core system, a load of 2 means the processor is being asked to do twice as much as it is capable of.
On any system, a load of 0 means all requests were able to run right away, meaning the processor was idle until that request came.
On a quad-core system, a load of 4 means the processor is exactly at capacity.

So, if you have any fewer than 11 cores in your system, a load of 11 is indeed a bad thing and indicates a performance problem that needs attention. In a 12-core system it would indicate that you should keep an eye on things for capacity planning purposes, but would not yet be causing problems.
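The arithmetic behind these examples can be sketched in a few lines of Python. The function names `load_per_core` and `describe` are made up for illustration, and the thresholds simply mirror the examples above. Note that on Linux the load average also counts tasks in uninterruptible sleep (typically waiting on disk I/O), so the ratio is a rough guide rather than a pure CPU measurement.

```python
import os


def load_per_core(load, cores):
    """Load average divided by the number of cores."""
    return load / cores


def describe(load, cores):
    """Classify a load average relative to core count, as in the examples above."""
    ratio = load_per_core(load, cores)
    if ratio < 1.0:
        return "below capacity"
    if ratio == 1.0:
        return "at capacity"
    return "over capacity"


print(describe(11, 4))   # the quad-core case discussed in this thread
print(describe(11, 12))  # the 12-core case

# Live numbers for the current machine, where the platform provides them.
if hasattr(os, "getloadavg"):
    try:
        one_min, five_min, fifteen_min = os.getloadavg()
        cores = os.cpu_count() or 1
        print("1-min load per core:", round(load_per_core(one_min, cores), 2))
    except OSError:
        print("load average not available on this system")
```

For the quad-core box with a load of 11, the per-core figure is 11 / 4 = 2.75, which is where the "about 3" quoted earlier in the thread comes from.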
Tony Yarusso
Technical Services
___
TIES
Web: http://ties.k12.mn.us/
niebais
Posts: 349
Joined: Tue Apr 13, 2010 2:15 pm

Re: Need help with high load

Post by niebais »

tonyyarusso wrote:Neither of your understandings of load metrics is correct.

Load on a *nix system is a measurement of CPU usage, based on whether a request can run or has to wait first. The best way to understand it is with example numbers:
On a single-core system, a load of 1 means the processor is exactly at capacity.
On a single-core system, a load of 2 means the processor is being asked to do twice as much as it is capable of.
On any system, a load of 0 means all requests were able to run right away, meaning the processor was idle until that request came.
On a quad-core system, a load of 4 means the processor is exactly at capacity.

So, if you have any fewer than 11 cores in your system, a load of 11 is indeed a bad thing and indicates a performance problem that needs attention. In a 12-core system it would indicate that you should keep an eye on things for capacity planning purposes, but would not yet be causing problems.
Ok, I don't think you're understanding my problem. The main problem is that the system starts to run slowly when people have the performance graphs page up. Load or not, this is a serious problem. We're only monitoring 37 hosts with 113 services. In this case the "box performance" decreases and we notice some slowness on the system. To us this is a bug that needs to be addressed. I use load average as a metric to show how the system is doing. I realize that a high load doesn't necessarily mean that people will be experiencing problems, but in our case, even with our pretty big system, we are experiencing issues.
niebais
Posts: 349
Joined: Tue Apr 13, 2010 2:15 pm

Re: Need help with high load

Post by niebais »

mmestnik wrote:It might be easy to convince me of this. However, we might intend our product to scale to Wall Street stock-ticker environments with billions of Ajax clients. Given this, the issue to tackle now is mitigating high-load situations by enforcing soft and hard limits with throttling. This, in the long run, will provide greater benefit.

Slimming down the code would mean reducing features. The process here would be to identify the features and classify them by resource usage. This alone will take some doing, and things don't get easy from there. On the other hand, there is something that can be done underneath the current solution, thus improving the foundation and allowing for a faster turnaround.
We would like to get a one-time graph when clicking on performance data. The problem is that one user can slow our entire system down by looking at the performance graphs page. The users here don't care whether the graphs are real-time or not. I suggest giving people the ability to turn off this functionality, so that the page shows one graph and doesn't keep updating from the DB unless you click a "refresh" button, or alternatively have it refresh every 30 seconds.

Maybe let us customize it in one of the .cfg files.
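To make the 30-second idea concrete, here is a rough sketch of a time-based cache that reuses a rendered graph until it is older than a configurable interval, so repeated views do not each hit the database. This is purely illustrative and says nothing about how Nagios XI is actually built: `TTLCache`, `render_graph`, and the key `"host1/ping"` are hypothetical names, and a fake clock is used so the behavior can be seen without waiting.

```python
import time


class TTLCache:
    """Cache values per key and recompute only after `ttl_seconds` have passed."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (time stored, value)

    def get(self, key, compute):
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: skip the expensive work
        value = compute()
        self._entries[key] = (now, value)
        return value


calls = []


def render_graph():
    """Stand-in for an expensive database query plus graph render."""
    calls.append(1)
    return "graph-%d" % len(calls)


fake_now = [0.0]
cache = TTLCache(ttl_seconds=30.0, clock=lambda: fake_now[0])

first = cache.get("host1/ping", render_graph)
fake_now[0] = 10.0
second = cache.get("host1/ping", render_graph)  # within 30 s: reuses the first render
fake_now[0] = 45.0
third = cache.get("host1/ping", render_graph)   # older than 30 s: renders again

print(first, second, third, "renders:", len(calls))
```

The interval is a constructor argument here, which is the kind of value that could be read from a .cfg file as suggested above.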
mmestnik
Posts: 972
Joined: Mon Feb 15, 2010 2:23 pm

Re: Need help with high load

Post by mmestnik »

niebais wrote:We would like to get a one-time graph when clicking on performance data. The problem is that one user can slow our entire system down by looking at the performance graphs page. The users here don't care whether the graphs are real-time or not. I suggest giving people the ability to turn off this functionality, so that the page shows one graph and doesn't keep updating from the DB unless you click a "refresh" button, or alternatively have it refresh every 30 seconds.

Maybe let us customize it in one of the .cfg files.
I could put this in as a feature request; however, I think this is fixed in 1.2. We will also look at this in future releases. Don't misunderstand us: this is a big to-do item.
niebais
Posts: 349
Joined: Tue Apr 13, 2010 2:15 pm

Re: Need help with high load

Post by niebais »

I just upgraded to version 1.2 to make sure I was up to date. I appreciate all that's been done so far. I realize load average problems can be somewhat difficult to troubleshoot.