Team,
I am trying to set warning and critical CPU load thresholds for a Nagios server (which currently polls some 1000 services at a polling interval of 1 minute) using the check_load Nagios plugin.
Threshold Calculation Formula:
y = c * p /100
where y --> Nagios Value, c --> Number of CPU cores, p --> Percent Threshold limit expected
My Nagios server has 8 processors, each with 8 CPU cores (64 cores in total), and hyper-threading is enabled.
For WARN = 80%, 80%, 75% and CRIT = 90%, 90%, 85%, I calculated the load limits as -w 51.2,51.2,48 -c 57.6,57.6,54.4
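The arithmetic above can be written out as a short Python sketch. The function names `load_threshold` and `check_load_args` are hypothetical, and the sketch assumes a core count of 8 x 8 = 64 and ignores hyper-threading, as the calculation above does.

```python
def load_threshold(cores, percent):
    """Return y = c * p / 100 for c cores and a percent limit p."""
    return cores * percent / 100


def check_load_args(cores, warn_percents, crit_percents):
    """Build a -w/-c argument string (comma-separated, no spaces)."""
    warn = ",".join(f"{load_threshold(cores, p):g}" for p in warn_percents)
    crit = ",".join(f"{load_threshold(cores, p):g}" for p in crit_percents)
    return f"-w {warn} -c {crit}"


# 8 processors x 8 cores each, as described above (an assumption of this sketch).
cores = 8 * 8
print(check_load_args(cores, (80, 80, 75), (90, 90, 85)))
# prints: -w 51.2,51.2,48 -c 57.6,57.6,54.4
```

If the operating system reports a different number of logical CPUs (for example because of hyper-threading), the same function can be called with that count instead.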
But still Load gets Critical!
Any thoughts on how to handle would be highly appreciated!
threshold values for CPU Load Check
Re: threshold values for CPU Load Check
Hello vlakshman,
It appears that you are entering the warning and critical thresholds as decimal numbers, but the check command is expecting an integer,
i.e. 0.9 instead of 90. Try running the service check with integer percent values and see if you get the expected result.
Thank you for visiting the Nagios Support Forum.
Re: threshold values for CPU Load Check
Hi bolson,
Thanks for your feedback.
I am using the check_load Nagios plugin to check CPU load.
According to the manual page, it supports both integer and decimal values (I get results when specifying either an integer or a float value):
https://nagios-plugins.org/doc/man/check_load.html
Following the link below, I understand the following:
https://support.nagios.com/kb/article/l ... s-771.html
1) If we are checking CPU load per CPU core (in a multi-core environment), the threshold will fall between 0 and 1.
2) If we want to set a threshold for the entire server's CPU load, then the threshold can fall between 0 and infinity.
I am using a c5.xlarge EC2 instance, which has 8 GB of memory and 4 vCPUs with hyper-threading enabled.
https://www.ec2instances.info/?filter=c ... =c5.xlarge
(sudo cat /proc/cpuinfo shows processor: 0, 1, 2, 3, and each entry reports 2 cpu cores)
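One way to confirm the CPU count that the load average should be compared against is to count the `processor` entries. The Python sketch below is hypothetical: the sample text is an abbreviated, made-up stand-in for /proc/cpuinfo, and `count_logical_cpus` is not part of any Nagios plugin. As far as I know, the `cpu cores` field describes the physical package and is repeated in every entry, so it should not be multiplied by the number of `processor` entries.

```python
import os


def count_logical_cpus(cpuinfo_text):
    """Count the 'processor : N' entries in /proc/cpuinfo-style text."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, sep, _value = line.partition(":")
        if sep and key.strip() == "processor":
            count += 1
    return count


# Abbreviated, made-up sample shaped like /proc/cpuinfo on a 4-vCPU host.
SAMPLE = """\
processor : 0
cpu cores : 2
processor : 1
cpu cores : 2
processor : 2
cpu cores : 2
processor : 3
cpu cores : 2
"""

print(count_logical_cpus(SAMPLE))  # 4 logical CPUs in the sample
print(os.cpu_count())              # logical CPUs seen by Python on this machine
```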
check_load plugin vs. uptime result mismatch:
The following thresholds are set for 90% WARN (1, 5 and 15 min) and 95% CRITICAL (1, 5 and 15 min).
/usr/lib64/nagios/plugins/check_load -w 90,90,90 -c 95,95,95
Output: OK - load average: 6.06, 5.68, 5.66|load1=6.060;90.000;95.000;0; load5=5.680;90.000;95.000;0; load15=5.660;90.000;95.000;0;
uptime
Output: 14:10:06 up 2 days, 23:03, 1 user, load average: 5.98, 5.67, 5.65
Questions:
1) The check_load result doesn't match the uptime result.
2) Am I setting the right thresholds?
3) What do the 4 values shown for load1, load5 and load15 mean? Are they 4 samples of each average?
Re: threshold values for CPU Load Check
Hello vlakshman,
To your question 1, I would suggest that your load average and uptime results match as closely as one would expect when the checks aren't performed at precisely the same time. And as is the case in your example, if the 1 minute average is down, the 5 and 15 minute averages would also be down, but by a smaller amount.
To question 2, the "correct" thresholds are based on what you expect the load averages to be on your host. This can best be determined by comparing the load average to a CPU utilization check with a frequent interval, e.g. 1 minute. Additionally, there is a wealth of information on Linux load average on the internet. I've included a link to my favorite document on the subject.
https://www.teamquest.com/import/pdfs/w ... ldavg1.pdf
To question 3, the four values for each average are 1) the value returned by the check, 2) the warning threshold, 3) the critical threshold, and 4) the minimum value (0 here), which can be ignored.
Let us know if this answers your questions on this topic. Thank you!
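To make the field order concrete, here is a minimal Python sketch that splits the performance data from the check_load output earlier in this thread into named fields. It assumes the usual Nagios plugin layout of label=value;warn;crit;min;max with no unit suffix on the value; `parse_perfdata` is a hypothetical helper, not part of Nagios.

```python
def parse_perfdata(perfdata):
    """Parse 'label=value;warn;crit;min;max' items separated by spaces."""
    names = ("value", "warn", "crit", "min", "max")
    result = {}
    for item in perfdata.split():
        label, _, rest = item.partition("=")
        fields = rest.split(";")
        # Pad missing trailing fields; empty fields become None.
        fields += [""] * (len(names) - len(fields))
        result[label] = {
            name: (float(field) if field else None)
            for name, field in zip(names, fields)
        }
    return result


line = ("OK - load average: 6.06, 5.68, 5.66|load1=6.060;90.000;95.000;0; "
        "load5=5.680;90.000;95.000;0; load15=5.660;90.000;95.000;0;")
# Everything after the '|' is performance data.
_, _, perfdata = line.partition("|")
for label, fields in parse_perfdata(perfdata).items():
    print(label, fields)
```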
Re: threshold values for CPU Load Check
@vlakshman,
vlakshman wrote: 1) check_load doesn't match with uptime result
These outputs look almost identical:
/usr/lib64/nagios/plugins/check_load -w 90,90,90 -c 95,95,95
OK - load average: 6.06, 5.68, 5.66
uptime
load average: 5.98, 5.67, 5.65
vlakshman wrote: 1) If we are checking CPU load for every CPU core (in a multi-core environment) the threshold will fall between 0 to 1.
1. Ideally, yes, because for a 1-core CPU a threshold of 1 means it is functioning at full capacity. But technically the load can go over 1, which means the core is overloaded.
http://blog.scoutapp.com/articles/2009/ ... d-averages
vlakshman wrote: 2) If we want to set threshold for entire server's CPU load, then threshold can fall between 0 to infinity.
2. Correct. But normally you should set the threshold based on the number of CPUs on the server. For example, for 4 CPUs a threshold of 4 would mean that all cores are working at full capacity. You can set the threshold higher than 4, but the server would already be overloaded at that point.
vlakshman wrote: 2) Am I setting the right threshold?
-w 90,90,90 -c 95,95,95
For an 8-core CPU I'd do -w 7,7,7 -c 8,8,8
vlakshman wrote: 3) What does the 4 samples seen in load1,load5 and load15 avg mean ?? May be the 4 samples of them ??) In this output:
Output: OK - load average: 6.06, 5.68, 5.66|load1=6.060;90.000;95.000;0; load5=5.680;90.000;95.000;0; load15=5.660;90.000;95.000;0;
Everything after the | sign is used internally by Nagios to build performance graphs. Once you import this check in the XI you will not be able to see values after the "|".
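As a rough illustration of why -w 90,90,90 -c 95,95,95 effectively never fires on a small host, the Python sketch below compares the load averages from the output above against different thresholds. It is a simplified model, not the plugin's actual code: it assumes a threshold is breached when a load average is strictly greater than it, and `load_state` is a hypothetical name. The 3.6 and 3.8 thresholds are simply 90% and 95% of 4 logical CPUs.

```python
def load_state(loads, warn, crit):
    """Return 'CRITICAL', 'WARNING' or 'OK' for 1/5/15-minute load averages."""
    if any(load > limit for load, limit in zip(loads, crit)):
        return "CRITICAL"
    if any(load > limit for load, limit in zip(loads, warn)):
        return "WARNING"
    return "OK"


loads = (6.06, 5.68, 5.66)  # from the check_load output above

# Thresholds of 90/95 are absolute load values, not percentages.
print(load_state(loads, (90, 90, 90), (95, 95, 95)))        # OK
# Values suggested above for an 8-core host.
print(load_state(loads, (7, 7, 7), (8, 8, 8)))              # OK
# 90% and 95% of 4 logical CPUs.
print(load_state(loads, (3.6, 3.6, 3.6), (3.8, 3.8, 3.8)))  # CRITICAL
```

The check_load man page linked earlier in the thread also documents a -r (--percpu) option that divides the load averages by the number of CPUs, which would allow per-core thresholds close to 1.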