threshold values for CPU Load Check

vlakshman · Post by **vlakshman** » Thu Dec 20, 2018 8:08 am

Team,

I am trying to set warning and critical CPU load thresholds for a Nagios server (which currently polls about 1000 services at a 1-minute interval) using the check_load Nagios plugin.

Threshold calculation formula:
y = c * p / 100
where y is the Nagios threshold value, c is the number of CPU cores, and p is the expected percent threshold limit.

My Nagios server has 8 processors, each with 8 CPU cores, and hyper-threading is enabled.
For WARN_Load = 0.8, 0.8, 0.75 and CRIT_Load = 0.9, 0.9, 0.85, I calculated the load limits as -w 51.2,51.2,48 -c 57.6,57.6,54.4
But the load check still goes critical!
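
For reference, the formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `load_thresholds` is hypothetical, and the core count of 64 (8 processors x 8 cores) is taken from the post without deciding whether hyper-threads should be counted as well.

```python
def load_thresholds(cores, percents):
    """Apply y = c * p / 100 to each percentage and join for -w / -c."""
    values = [cores * p / 100 for p in percents]
    # check_load takes the three values comma-separated, without spaces.
    return ",".join(f"{v:g}" for v in values)


# Figures from the post: 8 processors x 8 cores = 64 cores (assumption:
# hyper-threads are not counted separately).
CORES = 8 * 8
warn = load_thresholds(CORES, [80, 80, 75])
crit = load_thresholds(CORES, [90, 90, 85])
print(f"check_load -w {warn} -c {crit}")
```

The printed arguments reproduce the limits quoted above, so the arithmetic itself is consistent with the formula.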

Any thoughts on how to handle would be highly appreciated!

bolson · Post by **bolson** » Thu Dec 20, 2018 10:41 am

Hello vlakshman,

It appears that you are entering the warning and critical thresholds as decimal numbers, but the check command is expecting an integer.

That is, 0.9 instead of 90. Try running the service check with integer percent values and see if you get the expected result.

Thank you for visiting the Nagios Support Forum.

vlakshman · Post by **vlakshman** » Mon Dec 24, 2018 9:21 am

Hi bolson,

Thanks for your feedback.

I am using the check_load Nagios plugin to check CPU load.
The manual page suggests that it supports both integer and decimal values (I get results when specifying either an integer or a float value):
https://nagios-plugins.org/doc/man/check_load.html

From the article linked below, I understand the following:

https://support.nagios.com/kb/article/l ... s-771.html

1) If we are checking CPU load per CPU core (in a multi-core environment), the threshold will fall between 0 and 1.
2) If we want to set a threshold for the entire server's CPU load, the threshold can fall anywhere between 0 and infinity.

I am using a c5.xlarge EC2 instance, which has 8 GB of memory and 4 vCPUs with hyper-threading enabled.
https://www.ec2instances.info/?filter=c ... =c5.xlarge

(sudo cat /proc/cpuinfo shows processors 0, 1, 2 and 3, each reporting 2 CPU cores.)

check_load plugin vs. uptime result mismatch:
The following thresholds are set for 90% WARN (1, 5 and 15 min) and 95% CRITICAL (1, 5 and 15 min).

/usr/lib64/nagios/plugins/check_load -w 90,90,90 -c 95,95,95

Output: OK - load average: 6.06, 5.68, 5.66|load1=6.060;90.000;95.000;0; load5=5.680;90.000;95.000;0; load15=5.660;90.000;95.000;0;

 uptime

Output: 14:10:06 up 2 days, 23:03, 1 user, load average: 5.98, 5.67, 5.65

Questions:

1) The check_load result doesn't match the uptime result.
2) Am I setting the right threshold?
3) What do the 4 values shown for load1, load5 and load15 mean? Are they 4 samples of each average?

bolson · Post by **bolson** » Wed Dec 26, 2018 5:06 pm

Hello vlakshman,

To your question 1, I would suggest that your check_load and uptime results match as closely as one would expect when the checks aren't performed at precisely the same time. As in your example, if the 1-minute average is down, the 5- and 15-minute averages will also be down, but by a smaller amount.

To question 2, the "correct" thresholds depend on what you expect the load averages to be on your host. This is best determined by comparing the load average with a CPU utilization check run at a frequent interval, e.g. every minute. Additionally, there is a wealth of information on Linux load average on the internet. I've included a link to my favorite document on the subject.

https://www.teamquest.com/import/pdfs/w ... ldavg1.pdf

To question 3, the four values for each average are 1) the value returned by the check, 2) the warning threshold, 3) the critical threshold, and 4) the minimum possible value (0 here), which can be ignored.
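
To make the field order concrete, here is a minimal Python sketch that splits the performance data into named fields. It assumes the usual Nagios plugin layout of `label=value;warn;crit;min;max` and that the values carry no unit suffix (true for the output in this thread); the function name `parse_perfdata` is hypothetical.

```python
def parse_perfdata(plugin_output):
    """Return {label: {field: float or None}} from one line of plugin output."""
    fields = ("value", "warn", "crit", "min", "max")
    # Performance data is everything after the first "|".
    _, _, perfdata = plugin_output.partition("|")
    parsed = {}
    for token in perfdata.split():
        label, _, numbers = token.partition("=")
        parts = numbers.split(";")
        # Pad so that missing trailing fields become None.
        parts += [""] * (len(fields) - len(parts))
        parsed[label] = {
            name: float(part) if part else None
            for name, part in zip(fields, parts)
        }
    return parsed


# Output line quoted earlier in the thread.
SAMPLE = (
    "OK - load average: 6.06, 5.68, 5.66|"
    "load1=6.060;90.000;95.000;0; "
    "load5=5.680;90.000;95.000;0; "
    "load15=5.660;90.000;95.000;0;"
)
for label, values in parse_perfdata(SAMPLE).items():
    print(label, values)
```

Empty fields, such as the missing maximum here, come back as `None` rather than being guessed.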

Let us know if this answers your questions on this topic. Thank you!

npolovenko · Post by **npolovenko** » Wed Dec 26, 2018 5:10 pm

@vlakshman,

1) The check_load result doesn't match the uptime result.

These outputs look almost identical:

/usr/lib64/nagios/plugins/check_load -w 90,90,90 -c 95,95,95
OK - load average: 6.06, 5.68, 5.66

uptime
load average: 5.98, 5.67, 5.65

1) If we are checking CPU load per CPU core (in a multi-core environment), the threshold will fall between 0 and 1.

1. Ideally, yes, because for a single-core CPU a load of 1 means it is running at full capacity. Technically the load can go over 1, which means the core is overloaded.
http://blog.scoutapp.com/articles/2009/ ... d-averages

2) If we want to set a threshold for the entire server's CPU load, the threshold can fall anywhere between 0 and infinity.

2. Correct. But normally you should calculate the threshold from the number of CPUs on the server. For example, with 4 CPUs a threshold of 4 means that all cores are working at full capacity. You can set the threshold higher than 4, but the server would already be overloaded at that point.
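
One way to compare hosts with different CPU counts is to divide the load average by the number of logical CPUs, so that 1.0 always means "fully busy". The Python sketch below illustrates this; the helper names are hypothetical, and the CPU count of 4 is an assumption based on the /proc/cpuinfo output mentioned earlier in the thread. The check_load manual page also lists a -r/--percpu option that is described as doing this division inside the plugin.

```python
import os


def per_core_load(loads, cpus):
    """Divide each load average by the number of logical CPUs."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return tuple(round(load / cpus, 3) for load in loads)


def current_per_core_load():
    """Read the live 1, 5 and 15 minute averages (Unix-like systems only)."""
    return per_core_load(os.getloadavg(), os.cpu_count() or 1)


# Load averages from the thread, assuming a host with 4 logical CPUs.
print(per_core_load((6.06, 5.68, 5.66), 4))
```

With these numbers every per-core value is above 1, which would suggest the host is already running beyond full capacity even though the check reports OK.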

2) Am I setting the right threshold?
-w 90,90,90 -c 95,95,95

For an 8-core CPU I'd use -w 7,7,7 -c 8,8,8
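
To see how different threshold sets behave against the load averages from this thread, here is a rough Python sketch of the comparison. It is not the plugin's code: it assumes that a state is raised as soon as any of the three averages exceeds its own threshold, and the function names are hypothetical.

```python
def load_state(loads, warn, crit):
    """Return OK, WARNING or CRITICAL for three load averages."""
    state = "OK"
    for load, w, c in zip(loads, warn, crit):
        if load > c:
            return "CRITICAL"
        if load > w:
            state = "WARNING"
    return state


def parse_triple(text):
    """Turn a string such as '7,7,7' into a tuple of floats."""
    return tuple(float(part) for part in text.split(","))


# 1, 5 and 15 minute averages reported by check_load in the thread.
loads = (6.06, 5.68, 5.66)
print(load_state(loads, parse_triple("90,90,90"), parse_triple("95,95,95")))
print(load_state(loads, parse_triple("7,7,7"), parse_triple("8,8,8")))
```

Under this assumption, thresholds of 90 and 95 are absolute load values rather than percentages, so they are unlikely ever to trigger on a small host.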

3) What do the 4 values shown for load1, load5 and load15 mean? Are they 4 samples of each average?
In this output:

Output: OK - load average: 6.06, 5.68, 5.66|load1=6.060;90.000;95.000;0; load5=5.680;90.000;95.000;0; load15=5.660;90.000;95.000;0;

Everything after the | sign is performance data that Nagios uses internally to build performance graphs. Once you import this check into Nagios XI, you will not see the values after the "|".
