Help with monitor check settings

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
mcdonamw
Posts: 13
Joined: Mon Dec 04, 2023 5:36 pm

Post by mcdonamw »

I have a CPU service monitor configured to go critical at 100% utilization. I am getting alerts more often than I believe I should, based on what I'm seeing in my performance graphs. Perhaps I'm misunderstanding something.

My settings are:
  • Check Interval: 5
  • Retry Interval: 1
  • Max Check Attempts: 15
With these settings, I would expect the CPU to be at 100% for 15 minutes before an alert is sent. In my most recent alert, though, the performance chart shows the CPU reaching 100% at 9:19am and staying there until 9:32am, which is only 13 minutes. Two days earlier I received an alert at 9:26 that cleared at 9:36, but the chart does not match that period either: it shows only 5 minutes of 100% CPU, and the times are off too.

Am I missing something or is something not working right?

Note: I really find it interesting that almost every instance I look at on the graph, the CPU drops from 100% almost immediately after the last check that triggered the alert.

Image
Image
lgute
Posts: 318
Joined: Mon Apr 06, 2020 2:49 pm

Re: Help with monitor check settings

Post by lgute »

Hi @mcdonamw, thanks for reaching out.

Could you post your service object definition?

You are using these directives:
  • Check Interval: 5
  • Retry Interval: 1
  • Max Check Attempts: 15
These definitions from the Service Definition section of the object definitions documentation (https://assets.nagios.com/downloads/nag ... efinitions) may be helpful.
check_interval: This directive is used to define the number of "time units" to wait before scheduling the next "regular" check of the service. "Regular" checks are those that occur when the service is in an OK state or when the service is in a non-OK state, but has already been rechecked max_check_attempts number of times. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. More information on this value can be found in the check scheduling documentation.

retry_interval: This directive is used to define the number of "time units" to wait before scheduling a re-check of the service. Services are rescheduled at the retry interval when they have changed to a non-OK state. Once the service has been retried max_check_attempts times without a change in its status, it will revert to being scheduled at its "normal" rate as defined by the check_interval value. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. More information on this value can be found in the check scheduling documentation.

max_check_attempts: This directive is used to define the number of times that Nagios will retry the service check command if it returns any state other than an OK state. Setting this value to 1 will cause Nagios to generate an alert without retrying the service check again.

notification_interval: This directive is used to define the number of "time units" to wait before re-notifying a contact that this service is still in a non-OK state. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. If you set this value to 0, Nagios will not re-notify contacts about problems for this service - only one problem notification will be sent out.
So I am curious what notification_interval you are using.

The Check Scheduling documentation may also be of help.
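To make the timing concrete, here is a minimal Python sketch of the retry schedule. It assumes the first non-OK result counts as attempt 1 of max_check_attempts and that every retry runs exactly one retry_interval after the previous one; real checks drift with scheduling latency and plugin execution time. The function name attempt_times and the calendar date are made up for illustration. Only the 9:19 start time and the three settings come from the post above.

```python
from datetime import datetime, timedelta


def attempt_times(first_non_ok, retry_interval=1, max_check_attempts=15,
                  interval_length=60):
    """Return the idealized time of each check attempt, 1..max_check_attempts.

    Assumption: attempt 1 is the regular check that first returned non-OK,
    and each later attempt follows exactly retry_interval time units later.
    """
    step = timedelta(seconds=retry_interval * interval_length)
    return [first_non_ok + i * step for i in range(max_check_attempts)]


# Hypothetical date; 9:19 is taken from the post as the first non-OK check.
first = datetime(2024, 1, 1, 9, 19)
times = attempt_times(first)

hard_state_at = times[-1]  # final attempt, where the state would turn hard
minutes_to_hard = (hard_state_at - first).total_seconds() / 60

print(hard_state_at.strftime("%H:%M"))  # 09:33
print(minutes_to_hard)                  # 14.0
```

Under these assumptions the final attempt lands 14 minutes after the first non-OK check rather than 15, because the first failure already counts as one attempt. Regular checks only run every check_interval, so the CPU may also have been at 100% for up to 5 minutes before that first check noticed it.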
Please let us know if you have any other questions or concerns.

-Laura