check_ncpa get wrong alert from CPUs

Post by **sacom01** » Mon May 17, 2021 9:26 pm

Hi,
Our system is very large and runs in real time, so it has to stay online 24/7. That is why we need to monitor it every minute and get alerts in real time.
I believe real-time alerting is a headline feature of Nagios, so it should be well supported.

In our system, average CPU usage is normally about 30-40%; during rush hours it can exceed 70%. We need to receive an alert immediately when usage gets that high.

"CPU changes so fast that it is very hard to get the exact number." --> I know that, and I tested by running the commands in parallel.

A small difference would be fine, but in our case the difference is very large, so we need your support on this.
What can I do now?

Post by **vtrac** » Tue May 18, 2021 3:26 pm

Hi sacom01,
How are you doing? ...

I would suggest that you set 'aggregate=avg', since your system is so large with many CPUs ....

Best Regards,
Vinh

Post by **sacom01** » Tue May 18, 2021 8:36 pm

Hi Vinh,
As I said, we first used avg to get the average across all CPUs, but it returned the wrong number. I then tried max, but that is not exactly what we need.
What can I do to make "avg" return the correct number for our system? Right now it differs too much from the actual number: 60% vs 90%.

Post by **vtrac** » Wed May 19, 2021 11:57 am

Hi,
How are you doing?

CPU spikes up and down very fast. I would suggest changing:
Check interval = 5 minutes
Retry interval = 1 minute
Max check attempts = 5

What that will do is check every five minutes. Once an issue is identified, it will then check every minute, 5 times, before sending out a notification.
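
To see what these settings mean for alert latency, here is a minimal sketch that estimates how long a problem can exist before it reaches a hard state and notifies. It assumes the usual Nagios Core behaviour as I understand it, where the first non-OK result already counts as attempt 1, so only the remaining attempts run at the retry interval. The function name is made up for illustration.

```python
def minutes_until_hard_state(check_interval, retry_interval, max_check_attempts):
    """Return (best, worst) minutes from problem start to a hard state."""
    # Assumption: the first non-OK result counts as attempt 1, so only
    # max_check_attempts - 1 rechecks happen at the retry interval.
    rechecks = max_check_attempts - 1
    confirm = rechecks * retry_interval
    best = confirm                    # problem begins just before a regular check
    worst = check_interval + confirm  # problem begins just after a regular check
    return best, worst


# Suggested settings from this post: 5 / 1 / 5
print(minutes_until_hard_state(5, 1, 5))  # (4, 9)
# Checking every minute and alerting on the first failure: 1 / 1 / 1
print(minutes_until_hard_state(1, 1, 1))  # (0, 1)
```

Under that assumption the suggested settings trade speed for fewer false alarms: a sustained spike is reported after roughly 4 to 9 minutes, while a 1/1/1 setup reports within about a minute but also fires on short spikes.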

In your case, those remote servers are so large that many CPUs are not being used (0%).

What did you get when you used the curl command to get the average, then divide by number of cpu cores?

Does that match what you see in Nagios's check_ncpa.py outputs?
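
For anyone repeating this comparison by hand, a minimal sketch of the arithmetic follows. It assumes the NCPA `cpu/percent` endpoint returns JSON shaped like `{"percent": [[per-core values], "%"]}` when no aggregate is requested; that shape and the sample numbers are assumptions for illustration, not output from the machine in this thread.

```python
import json


def summarize_cpu(payload_text):
    """Return (average, maximum, core_count) from a cpu/percent style payload."""
    data = json.loads(payload_text)
    values, _unit = data["percent"]  # assumed shape: [[per-core values], "%"]
    return sum(values) / len(values), max(values), len(values)


# Made-up sample with 4 cores instead of 216
sample = '{"percent": [[10.0, 30.0, 50.0, 90.0], "%"]}'
avg, peak, cores = summarize_cpu(sample)
print(f"{cores} cores: avg={avg:.2f}% max={peak:.2f}%")
```

The same payload gives very different numbers for avg and max, which is why it matters which aggregate is compared against the value shown on the remote machine.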

By the way, how many CPUs do you have on that remote machine?

Can you please take a screenshot of the performance graph of the one remote machine that you said has the issue?

Best Regards,
Vinh

Post by **sacom01** » Thu May 20, 2021 10:09 pm

Hi Vinh,
In your case, those remote servers are so large that many CPUs are not being used (0%).
--> Actually, this is not related to our problem (I told you about this a few days ago).

What did you get when you used the curl command to get the average, then divide by number of cpu cores?
--> Yes, I ran the command to get the total and divided it by the number of CPUs.

Does that match what you see in Nagios's check_ncpa.py outputs?
--> The avg number matches the NCPA check, but it is very different from what the top and topas commands show when I check on the remote machine.

By the way, how many CPUs do you have on that remote machine?
--> 216 CPUs

Can you please take a screenshot of the performance graph of the one remote machine that you said has the issue?
--> I reproduced the issue like this:
1. Wrote a script that checks CPU with the command "sar 1 1" on the client server, and set a crontab entry to run it.
2. Wrote a script that checks CPU with NCPA for the client server from Nagios XI, and set a crontab entry to run the NCPA check.
The two cron jobs run at the same time on the two servers.
Please find the attached file for details.

Post by **vtrac** » Fri May 21, 2021 11:05 am

Hi sacom01 (Hang),
Hope you are having a good day!! ...

Can you please share the "check_ncpa.py" command used on one of your NCPA remote VMs?

I'm not sure why you get "CRITICAL: Percent was 42.99 %" when your system is only at "42.99%".

Also, please run the "top" command on your NCPA remote VM and share that as well. A screenshot would be nice since it is easier to see ...

Here's an example of my "top" command:

F1.png

As you can see from the picture above (two red circles), it lists "CPU%" and "Load average".
Those are very important, since they tell us how busy the system is and how quickly it responds.

Best Regards,
Vinh

Post by **sacom01** » Mon May 24, 2021 4:11 am

Hi Vinh,
Can you please share the "check_ncpa.py" command used on one of your NCPA remote VMs?
--> ./check_ncpa.py -H 192.168.xxx.x -t token -P 5693 -M cpu/percent -w '20' -c '40' -q 'aggregate=avg'

I'm not sure why you get "CRITICAL: Percent was 42.99 %" when your system is only at "42.99%".
--> That was just for testing purposes; it is not important.

Also, please run the "top" command on your NCPA remote VM and share that as well. A screenshot would be nice since it is easier to see
--> I know the top command, but top and sar serve the same purpose, checking CPU, so they show the same result (already tested).

Thanks.

Post by **vtrac** » Mon May 24, 2021 11:00 am

Hi,
Ok, now I understand ...

Your warning and critical settings, "-w '20' -c '40'", are only for testing purposes.

Best Regards,
Vinh

Post by **sacom01** » Mon May 24, 2021 10:45 pm

You understood, so... what's next?
My issue is not resolved yet.

The -w and -c values are for testing purposes, but NCPA raising the wrong alert is a real case. We need you to focus on this.

Post by **vtrac** » Tue May 25, 2021 3:03 pm

Hi,
How are you doing?

I talked to my team member and he suggested that you try this on your XI command prompt:

cd /usr/local/nagios/libexec

./check_ncpa.py -H 192.168.xxx.x -t token -P 5693 -M cpu/percent -w '20' -c '40' -q 'aggregate=avg&sleep=5'

Since we do not have any AIX machine internally, there is no way for me to test this.
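
The `-q` option passes its value to the NCPA API as query arguments, so the same request can be reproduced outside check_ncpa.py. Below is a minimal sketch of how that URL could be built; the helper name, the example host `192.0.2.10`, and the placeholder token are made up for illustration, and the exact effect of `sleep` on the sampling window is taken from the suggestion above rather than verified on AIX.

```python
from urllib.parse import urlencode


def ncpa_cpu_url(host, token, port=5693, aggregate="avg", sleep=None):
    """Build the cpu/percent API URL that matches the -q arguments."""
    params = {"token": token, "aggregate": aggregate}
    if sleep is not None:
        params["sleep"] = sleep  # sampling window in seconds (assumed meaning)
    return f"https://{host}:{port}/api/cpu/percent?{urlencode(params)}"


# Hypothetical host and token, mirroring 'aggregate=avg&sleep=5'
print(ncpa_cpu_url("192.0.2.10", "token", sleep=5))
```

Fetching that URL with curl or a browser makes it possible to compare the raw API answer with the plugin output for the same parameters.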

If this does not work, I would suggest that you write your own script using either "top" or "sar" or "vmstat" and put that under:

/usr/local/ncpa/plugins

Then call your script as:

./check_ncpa.py -H 192.168.xxx.x -t token -P 5693 -M 'plugins/yourNewScript'
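
As a starting point for such a script, here is a minimal sketch of a plugin built on "sar 1 1". It assumes the sar output has a header row containing a `%idle` column and that the last row holds the values to report; it has not been tested on AIX. The function names and the 70/90 thresholds are hypothetical, and the script would also need an extension that the NCPA agent is configured to run.

```python
import subprocess

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios exit codes


def parse_busy_percent(sar_output):
    """Return 100 - %idle from the last row of `sar 1 1` style output."""
    rows = [line.split() for line in sar_output.splitlines() if line.strip()]
    header = next((row for row in rows if "%idle" in row), None)
    if header is None:
        raise ValueError("no %idle column found")
    # Count the column from the right: the timestamp may take one or two
    # tokens ("Average:" vs "09:26:01 PM"), which shifts left-based indexes.
    from_right = len(header) - header.index("%idle")
    return 100.0 - float(rows[-1][-from_right])


def evaluate(busy, warn, crit):
    """Map a busy percentage to an exit code and a status line with perfdata."""
    if busy >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif busy >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    return code, f"{label}: CPU busy {busy:.2f}% | cpu_busy={busy:.2f}%;{warn};{crit}"


def main(warn=70.0, crit=90.0):
    """Run sar once, print the status line, and return the exit code."""
    try:
        out = subprocess.run(["sar", "1", "1"], capture_output=True,
                             text=True, check=True).stdout
        code, message = evaluate(parse_busy_percent(out), warn, crit)
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError) as exc:
        code, message = UNKNOWN, f"UNKNOWN: {exc}"
    print(message)
    return code
```

When installed as a plugin, the file would end with `raise SystemExit(main())` so that the exit code reaches NCPA. Because the value comes from sar itself, it should line up with what sar reports on the machine, which is the comparison this thread is about.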

You could also check the Nagios Exchange page and see if there are any modules or plugins that would fit your needs.
https://exchange.nagios.org/

Here's the one I found on Nagios Exchange:
https://exchange.nagios.org/directory/P ... IX/details

Best Regards,
Vinh

Nagios Support Forum
