check_ncpa get wrong alert from CPUs

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
sacom01
Posts: 194
Joined: Wed Dec 23, 2020 10:15 pm

Re: check_ncpa get wrong alert from CPUs

Post by sacom01 »

Hi,
Our system is very large and runs in real time. It has to be online 24/7, so we need to check it every minute to get real-time alerts.
I believe real-time alerting is a highlight feature of Nagios, so it should be well supported.

On our system, average CPU usage is normally about 30-40%; during rush hours it can exceed 70%. We need to receive an alert immediately when usage gets that high.

"CPU changes so fast that it is very hard to get the exact number." --> I know, and I have tested this by running the commands in parallel.

A small difference would be acceptable, but in our case the difference is very large, so we need your support on this.
So what can I do now?
vtrac
Posts: 903
Joined: Tue Oct 27, 2020 1:35 pm

Re: check_ncpa get wrong alert from CPUs

Post by vtrac »

Hi sacom01,
How are you doing? ... :-)

I would suggest that you set 'aggregate=avg', since your system is so large with many CPUs .... :-)


Best Regards,
Vinh
sacom01
Posts: 194
Joined: Wed Dec 23, 2020 10:15 pm

Re: check_ncpa get wrong alert from CPUs

Post by sacom01 »

Hi Vinh,
As I said, we first used avg to get the average across all CPUs, but it gave the wrong number, so I tried max, but that is not exactly what we need.
What can I do to make "avg" report the right number for our system? Right now it differs too much from the actual number: 60% vs 90%.
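
To make the difference between the two aggregates concrete, here is a minimal Python sketch. It assumes the per-CPU readings are a plain list of percentages and that `aggregate=avg` and `aggregate=max` reduce that list with a mean and a maximum respectively. The sample numbers are invented (not from this system), and the helper name `aggregate` is hypothetical; this is not NCPA's own code.

```python
from statistics import mean


def aggregate(per_cpu, how):
    """Reduce a list of per-CPU busy percentages to one number.

    `how` mirrors the 'avg' and 'max' values discussed in this thread.
    Illustration only, not NCPA's implementation.
    """
    if not per_cpu:
        raise ValueError("no CPU samples given")
    if how == "avg":
        return mean(per_cpu)
    if how == "max":
        return max(per_cpu)
    raise ValueError(f"unknown aggregate: {how}")


# Invented sample: 8 logical CPUs, two busy and six nearly idle.
sample = [95.0, 90.0, 5.0, 5.0, 5.0, 0.0, 0.0, 0.0]
print(f"avg = {aggregate(sample, 'avg'):.1f} %")
print(f"max = {aggregate(sample, 'max'):.1f} %")
```

The two reductions can differ widely on the same sample, so when comparing tools it matters which reduction each one reports and over which time window it was measured.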
vtrac
Posts: 903
Joined: Tue Oct 27, 2020 1:35 pm

Re: check_ncpa get wrong alert from CPUs

Post by vtrac »

Hi,
How are you doing?

CPU usage spikes up and down very fast. I would suggest changing:
Check interval = 5 minutes
Retry interval = 1 minute
Max check attempts = 5

With these settings Nagios checks every five minutes. Once an issue is identified, it rechecks every minute and sends a notification only if the problem persists through 5 check attempts.
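
As a rough worked example of what these settings mean for alert latency, the Python sketch below estimates how long a sustained problem takes to produce its first notification. It assumes the usual Nagios convention that the first failing check counts as attempt 1, and it ignores check run time and scheduling jitter. The function name `minutes_to_notification` is made up for illustration.

```python
def minutes_to_notification(check_interval, retry_interval, max_check_attempts):
    """Return (best, worst) minutes from problem start to first notification.

    Assumes the first failing check is attempt 1, so reaching a hard state
    needs max_check_attempts - 1 further retries. Ignores check execution
    time and scheduler jitter.
    """
    retries = max_check_attempts - 1
    confirm = retries * retry_interval
    best = confirm                    # problem begins just before a scheduled check
    worst = check_interval + confirm  # problem begins just after a scheduled check
    return best, worst


print(minutes_to_notification(5, 1, 5))  # the settings suggested above
print(minutes_to_notification(1, 1, 1))  # check every minute, alert on first failure
```

Under these assumptions the suggested settings notify roughly 4 to 9 minutes after a problem starts, versus 0 to 1 minute when checking every minute with a single attempt. The choice is a trade-off between fewer alerts on short spikes and faster alerts.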

In your case, those remote servers are so large that many CPUs are not being used (0%).

What did you get when you used the curl command to get the average, then divided by the number of CPU cores?

Does that match what you see in Nagios's check_ncpa.py output?

By the way, how many CPUs do you have on that remote machine?

Can you please take a screenshot of the performance graph of that one remote machine, which you said has the issue?


Best Regards,
Vinh
sacom01
Posts: 194
Joined: Wed Dec 23, 2020 10:15 pm

Re: check_ncpa get wrong alert from CPUs

Post by sacom01 »

Hi Vinh,
In your case, those remote servers are so large that many CPUs are not being used (0%).
--> Actually, this is not related to our problem (I told you about this a few days ago).

What did you get when you used the curl command to get the average, then divided by the number of CPU cores?
--> Yes, I ran the command to get the total and divided it by the number of CPUs.

Does that match what you see in Nagios's check_ncpa.py output?
--> The avg number matches the NCPA check, but it differs too much from the top and topas output when I check on the remote machine.

By the way, how many CPUs do you have on that remote machine?
--> 216 CPUs

Can you please take a screenshot of the performance graph of that one remote machine, which you said has the issue?
--> I reproduced the issue like this:
1. Wrote a script that checks CPU with the command "sar 1 1" on the client server, and set a crontab entry to run it.
2. Wrote a script that checks CPU with NCPA for the client server from Nagios XI, and set a crontab entry to run the NCPA check.
The two crontab entries run at the same time on the 2 servers.
Please find the attached file for details.
vtrac
Posts: 903
Joined: Tue Oct 27, 2020 1:35 pm

Re: check_ncpa get wrong alert from CPUs

Post by vtrac »

Hi sacom01 (Hang),
Hope you are having a good day!! ... :-)

Can you please share the "check_ncpa.py" command used on one of your NCPA remote VMs?

I'm not sure why you get "CRITICAL: Percent was 42.99 %" when your system is only at "42.99%".

Also, please run the "top" command on your NCPA remote VM and share that as well. A screenshot would be nice since it is easier to see ... :-)

Here's an example of my "top" command:
F1.png
As you can see from the picture above (two red circles), it lists "CPU%" and "Load average".
Those are very important since they tell us how busy the system is and how quickly it responds.


Best Regards,
Vinh
sacom01
Posts: 194
Joined: Wed Dec 23, 2020 10:15 pm

Re: check_ncpa get wrong alert from CPUs

Post by sacom01 »

Hi Vinh,
Can you please share the "check_ncpa.py" command used on one of your NCPA remote VMs?
--> ./check_ncpa.py -H 192.168.xxx.x -t token -P 5693 -M cpu/percent -w '20' -c '40' -q 'aggregate=avg'

I'm not sure why you get "CRITICAL: Percent was 42.99 %" when your system is only at "42.99%".
--> That was just for testing purposes; it is not important.

Also, please run the "top" command on your NCPA remote VM and share that as well. A screenshot would be nice since it is easier to see
--> I know the top command, but top and sar serve the same purpose (checking CPU), so they show the same result (already tested).

Thanks.
vtrac
Posts: 903
Joined: Tue Oct 27, 2020 1:35 pm

Re: check_ncpa get wrong alert from CPUs

Post by vtrac »

Hi,
Ok, now I understand ... :-)
Your warning and critical settings, "-w '20' -c '40'", were only for testing purposes, which is why 42.99% was reported as CRITICAL.


Best Regards,
Vinh
sacom01
Posts: 194
Joined: Wed Dec 23, 2020 10:15 pm

Re: check_ncpa get wrong alert from CPUs

Post by sacom01 »

You understood, then... what's next?
My issue is not resolved yet :D

The -w and -c values are for testing purposes, but NCPA raising the wrong alert is a real case. We need you to focus on this.
vtrac
Posts: 903
Joined: Tue Oct 27, 2020 1:35 pm

Re: check_ncpa get wrong alert from CPUs

Post by vtrac »

Hi,
How are you doing?

I talked to a team member, and he suggested that you try this at your XI command prompt:

Code:

cd /usr/local/nagios/libexec
./check_ncpa.py -H 192.168.xxx.x -t token -P 5693 -M cpu/percent -w '20' -c '40' -q 'aggregate=avg&sleep=5'

Since we do not have any AIX machines internally, I have no way to test this.

If this does not work, I would suggest writing your own script using "top", "sar", or "vmstat" and putting it under:

Code:

/usr/local/ncpa/plugins

Then call your script as:

Code:

./check_ncpa.py -H 192.168.xxx.x -t token -P 5693 -M 'plugins/yourNewScript'
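
As a starting point for such a script, here is a minimal Python sketch built around "sar 1 1". It is untested on AIX and rests on several assumptions: that the sar output has a header row containing a `%idle` column, that data rows line up with that header from the right, and that a simple `>=` comparison is good enough for thresholds (it does not implement full Nagios range syntax). The function names and the `cpu_busy` perfdata label are hypothetical. How the warning and critical values reach the script depends on your NCPA setup and is left open here.

```python
#!/usr/bin/env python3
"""Hypothetical NCPA plugin sketch: report CPU busy % from `sar 1 1` output."""
import subprocess
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def busy_from_sar(text):
    """Return 100 - %idle taken from the last data row of sar output.

    Assumption: one header row contains '%idle' and data rows align with
    it from the right. Verify this against your platform's sar output.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header = next((r for r in rows if "%idle" in r), None)
    if header is None:
        raise ValueError("no %idle column found")
    from_right = len(header) - header.index("%idle")
    data = rows[rows.index(header) + 1:]
    if not data:
        raise ValueError("no data row after the header")
    return 100.0 - float(data[-1][-from_right])


def evaluate(busy, warn, crit):
    """Map a busy percentage to (exit_code, status line with perfdata)."""
    if busy >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif busy >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    line = (f"{label}: CPU busy was {busy:.2f} % "
            f"| 'cpu_busy'={busy:.2f}%;{warn};{crit}")
    return code, line


def main(argv):
    """Run sar once and print a Nagios-style status line; return the exit code."""
    try:
        warn, crit = float(argv[1]), float(argv[2])
        out = subprocess.run(["sar", "1", "1"], capture_output=True,
                             text=True, check=True).stdout
        code, line = evaluate(busy_from_sar(out), warn, crit)
    except Exception as exc:  # report any failure as UNKNOWN
        code, line = UNKNOWN, f"UNKNOWN: {exc}"
    print(line)
    return code


# When deployed as a plugin, finish the file with: sys.exit(main(sys.argv))
```

Locating `%idle` by its header position rather than by a fixed column number is a deliberate choice, since sar column layouts differ between platforms. Before relying on it, compare its result against a manual "sar 1 1" run on the same machine.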

You could also check the Nagios Exchange page to see whether any modules or plugins fit your needs:
https://exchange.nagios.org/

Here's the one I found on Nagios Exchange:
https://exchange.nagios.org/directory/P ... IX/details


Best Regards,
Vinh