Gaps in alerting data and failed alerts

Good morning to some and afternoon to others. I've got yet another quandary.
I have a few alerts set up to look for a specific web request that should be coming in regularly. This is both to verify my web backend is working and to graph the frequency of requests.
So I've got three alerts set up.
The query first checks for the correct type (apache_error OR apache_access),
then for the appropriate host (host:hostname.tld),
then checks the request field for the specific request (/mailbox in one and /refill.php in the other).
Each is set to check every 5 minutes with warning and critical values of 1: apiece.
The /refill.php check has a lookback period of 12 hours, to make sure there has been at least one successful POST request to it in the last 12 hours (there always should be).
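Putting those pieces together, each alert boils down to a single query string along these lines (built from my settings above; the exact field syntax may differ slightly between setups):

```
type:(apache_error OR apache_access) AND host:"hostname.tld" AND request:"/refill.php"
```

For what it's worth, 1: is the standard Nagios range syntax: the check stays OK at 1 or more matching records and trips below 1.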
The problem I have is that starting at 8pm, and continuing until sometimes 8-10am the following day, these checks fail saying 0 records found. If I click to view the alert on a dashboard I can clearly see that there are requests coming in. This morning, on a whim, I changed the lookback period from 12h to 1h and re-ran the alert. It immediately came back with the results I expected and was no longer critical. When I changed the lookback period back to 12h it again failed, returning 0 results. If I changed it to 10 hours it actually had results.

It almost seems like something happens to the search when the index period rolls over: if the search crosses an index boundary, it fails. That rollover appears to happen at 8pm (GMT -04:00), which would be midnight GMT. Based on my reading I'd expect the indexes to roll over then, because Elasticsearch works on GMT.
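To put some numbers behind that theory, here's a little sketch of which daily indices a lookback window would have to touch. The logstash-YYYY.MM.DD per-UTC-day naming is my assumption about the defaults, and the timestamps are just examples:

```python
# Sketch of the rollover theory: logstash-style indices are created per
# UTC day, so any lookback window that starts before the most recent
# 00:00 UTC (8pm EDT here) spans more than one index. The naming pattern
# is my assumption about the defaults, not taken from any docs.
from datetime import datetime, timedelta

def indices_for_lookback(now_utc, lookback_hours):
    """Daily index names a lookback window ending at now_utc would touch."""
    day = (now_utc - timedelta(hours=lookback_hours)).date()
    names = []
    while day <= now_utc.date():
        names.append(day.strftime("logstash-%Y.%m.%d"))
        day += timedelta(days=1)
    return names

# 9pm EDT on Sept 14 = 01:00 UTC Sept 15: a 12h lookback crosses indices,
# while a 1h lookback stays inside the current one.
print(indices_for_lookback(datetime(2015, 9, 15, 1, 0), 12))
# -> ['logstash-2015.09.14', 'logstash-2015.09.15']
print(indices_for_lookback(datetime(2015, 9, 15, 1, 0), 1))
# -> ['logstash-2015.09.15']
```

By that math, a 12h lookback crosses into the previous day's index exactly during the 8pm - 8am (EDT) window where my checks fail.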
Here are some examples of my graphs in NagiosXI that are created from the NRDP data sent from the log server.
Any ideas?
--
Wayne
- Attack Rabbit
Re: Gaps in alerting data and failed alerts
I took a look at this on a few of my lab systems but couldn't reproduce your problem. Could you give me some explicit reproduction steps? This might be something that necessitates a remote session if the behavior continues.
- Wayne
Re: Gaps in alerting data and failed alerts
I can send you my configuration file and some examples, but there's nothing inherently different about this check than most of my others.
I don't want to send them publicly, however, so if you have a separate way to send them I'd appreciate it.
Specifically, I have the normal log-parsing filters in place.
I then created a dashboard that matched exactly the records I wanted to find.
Then I used the "create alert from this dashboard" function to create the alert.
I gave the actual alerts a check interval of 5m and a lookback of 12h.
Those then got warning and critical thresholds of 1:.
The logging-data check got a 5m interval and a 5m lookback with a warning of 1000 and a critical of 2000 (I don't need alerts from it, just interval data).
These are sent via NRDP to my NagiosXI server, where there is a hostname and a passive service matching the ones specified in the alert settings, one for each individual alert (rough sketch below).
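For anyone following along, the NRDP side is just a passive check-result post. Roughly what the log server does for each alert, as a sketch (the URL, token, hostname, and service name here are placeholders for my real settings):

```python
# Minimal sketch of an NRDP passive check-result submission, roughly what
# the log server does for each alert. URL, token, hostname, and service
# name are placeholders.
import requests  # third-party HTTP library

CHECK_XML = """<?xml version='1.0'?>
<checkresults>
  <checkresult type='service' checktype='1'>
    <hostname>hostname.tld</hostname>
    <servicename>refill.php requests</servicename>
    <!-- state: 0 = OK, 1 = WARNING, 2 = CRITICAL -->
    <state>2</state>
    <output>CRITICAL: 0 records found</output>
  </checkresult>
</checkresults>"""

resp = requests.post(
    "https://nagiosxi.example.com/nrdp/",
    data={"token": "MY_NRDP_TOKEN", "cmd": "submitcheck",
          "XMLDATA": CHECK_XML},
)
resp.raise_for_status()
print(resp.text)  # NRDP replies with a small status document
```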
If the check runs with the 12-hour lookback within the 8pm - 8am timeframe, it comes back with a status of "Critical 0 records found"; this is reported to NagiosXI and alerts are sent out to the IT staff.
If I shorten the lookback period to some arbitrary timeframe (one that doesn't reach back past 8pm the previous day), it reports the correct number of records.
If I view a dashboard from the alert tab (the little monitor icon), I can easily see the records there.
It's almost like whatever program is handling the alerts isn't able to search with a lookback beyond the current index, if that makes sense to you.
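One way to test that theory from outside the alert engine would be to run the same query straight against Elasticsearch, once over the full index pattern and once over only today's index, with the same 12h window. A rough sketch, assuming ES on localhost:9200 and the default logstash-* daily indices:

```python
# Compare the same 12h query against all daily indices vs. only today's.
# If the counts differ right after 00:00 UTC, whatever runs the alerts is
# probably only searching the current index. Host/port, index pattern,
# and field names are assumptions based on my setup.
import json
from datetime import datetime
import requests  # third-party HTTP library

QUERY = {
    "query": {
        "bool": {
            "must": [
                {"query_string": {"query":
                    'type:(apache_error OR apache_access) '
                    'AND host:"hostname.tld" AND request:"/refill.php"'}},
                {"range": {"@timestamp": {"gte": "now-12h"}}},
            ]
        }
    },
    "size": 0,  # we only care about the hit count
}

def hit_count(index):
    r = requests.post("http://localhost:9200/%s/_search" % index,
                      data=json.dumps(QUERY))
    r.raise_for_status()
    return r.json()["hits"]["total"]

today = datetime.utcnow().strftime("logstash-%Y.%m.%d")
print("all indices :", hit_count("logstash-*"))
print("today only  :", hit_count(today))
```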
If you need to do a remote session, just let me know and we can set up a time.
- Wayne
Re: Gaps in alerting data and failed alerts
Any further thoughts, Jesse?
- Attack Rabbit
Re: Gaps in alerting data and failed alerts
My apologies - I've been busy preparing for the conference and didn't keep track of this thread. My thoughts are as follows:
If you could PM me your configuration file/examples I would appreciate it; this is something that I need to reproduce in house to troubleshoot effectively, since I haven't seen it before.
That being said, I haven't used the _create alert from this dashboard_ function - I will give it a try now, using your alert settings as described.
The interesting thing is that the alerts subsystem queries Elasticsearch directly, using the same query that you would use - which makes this issue all the more interesting.
You get me those files and I'll attempt reproduction using the alert settings you've described. If we can't get it resolved from there, we'll get a remote session set up for debugging.
Thanks Wayne!
- Attack Rabbit
Re: Gaps in alerting data and failed alerts
I attempted to reproduce this bug on my end without success, even after creating alerts from a particular dashboard.
If you could get me those configuration files, I'd like to look through them.
- Wayne
Re: Gaps in alerting data and failed alerts
Sorry about that. I must not have flagged the notify tag on this thread so I missed your replies yesterday. I'll PM those over to you right away.
--
Wayne
- Attack Rabbit
Re: Gaps in alerting data and failed alerts
Your configurations look very normal as you mentioned, and I don't see anything that could cause this problem in them. I'd like you to send an email to customersupport@nagios.com with a reference to this thread - from there we can set up a remote session and get this taken care of. Thanks!
EDIT: Locking, received email.