We have an integration between Nagios XI and NagiosLS using a custom version of the check_nagioslogserver plugin that passes the Nagios XI host address along as part of the query. Basically, Nagios XI runs periodic queries against NLS for specific phrases and filters the results down to just the address (IP) of the XI host. These checks ultimately open tickets for the NOC staff to work. This works well.
We've started noticing that some of these queries return 0 results for the check when there are in fact log events that should have been picked up. This causes the check to recover, closing out the case on our helpdesk when it's really still an issue. The case is then reopened an hour or so later when the check runs again, putting the case back at the bottom of the queue. I'm trying to trace this down and figure out where the failure is happening. We've had some staffing changes due to Covid, so this could have been happening for some time; we were only just able to get to the issue the same day it occurred, so the false recovery closing out the case might simply not have been noticed in the past.
Is there an API log that would show me when queries were run against NLS?
On Nagios XI I see check_nagioslogserver uses stdout/echo for its error logs; would these errors end up in the root or nagios mailbox?
How would I go about finding out how many queries are being run against NLS within a 1/5/15 minute interval?
Are there any best practice tips or sizing guides for query rate on an NLS cluster? Any metrics specific to this that I should look at to see if we're just hitting sizing issues?
Thanks for any help you can provide!
- Support Tech
Re: API Query Issues
When the plugin runs it will create an entry like this in /var/log/apache2/access.log:
Code:
192.168.55.93:80 192.168.55.20 - - [02/Sep/2020:07:36:47 +1200] "POST /nagioslogserver/index.php/api/check/query HTTP/1.1" 200 559 "-" "BinGet/1.00.A"
Not really much to go on other than seeing when the check was run. To see the actual queries and results you can configure Elasticsearch to log 'slow' queries - where 'slow' is a value you set and is low enough to capture pretty much everything. You can do this by editing /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml and adding this to the bottom:
Code:
index.search.slowlog.threshold.query.trace: 1ms
index.search.slowlog.threshold.fetch.trace: 1ms
index.search.slowlog.threshold.index.trace: 1ms
and then restart Elasticsearch:
Code:
systemctl restart elasticsearch
This will write to /var/log/elasticsearch/<UUID>__index_search_slowlog.log, which has the potential to become a pretty large log, so keep your eye on it.
As far as monitoring, I would suggest setting up the NCPA client on NLS so that XI can monitor the cluster status and other things like drive space, CPU, system memory, and the Java heap space - https://support.nagios.com/kb/article/n ... i-857.html.
A query that returns 0 when it should return something makes me think that shards may not be available or there is an issue with loading results into memory. If you can identify approximately when the query failed, I would see if you can correlate that with anything in the default Elasticsearch log /var/log/elasticsearch/<UUID>.log.
On the XI side, if the query is returning an 'OK' status with 0 results then I wouldn't expect there to be any errors logged or in either mailbox. However, the plugin does return an 'UNKNOWN' status and message if there is an error, which should be part of the plugin output and logged in /usr/local/nagios/var/nagios.log.
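For the question about query rate, the access log entries above have enough to count requests per interval. A rough sketch, assuming the default Apache log format and location shown in the entry above:
Code:
# Count check/query API hits per minute from the Apache access log on the NLS node.
grep 'POST /nagioslogserver/index.php/api/check/query' /var/log/apache2/access.log \
  | awk -F'[][]' '{ print substr($2, 1, 17) }' \
  | sort | uniq -c
The awk call trims the bracketed timestamp down to the minute; summing adjacent minutes gives the 5 and 15 minute figures.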
Re: API Query Issues
Thanks cdienger, this is really helpful information. I'll take a look at the apache log and the slow query log.
- Support Tech
Re: API Query Issues
Keep us posted.
Re: API Query Issues
Current running theory is that we're saturating a 1Gb/s link internally at the same time every night during backups. I was under the assumption that all VMs in our prod environment were living on 10Gb/s interfaces/vswitches, but for some reason XI was on a 1Gb/s interface. That interface is seeing contention at roughly 8PM every night, which is also when these checks are failing. We recently halved the check interval for this particular check, which gave us better insight into when exactly the issue was occurring. We're going to move XI to a 10Gb/s interface and see how it goes over the next few days.
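A simple way to confirm or rule out link saturation during the backup window is to sample interface throughput around 8PM on both the XI and NLS hosts. A minimal sketch, assuming the sysstat package is installed:
Code:
# Sample throughput on all interfaces every 60 seconds for 30 minutes,
# started shortly before 8PM (e.g. from a screen session or a one-off cron entry).
sar -n DEV 60 30 > /tmp/net-$(hostname)-$(date +%F).log
# Afterwards, check rxkB/s and txkB/s for the interface in question;
# sustained values around 120,000 kB/s mean a 1Gb/s link is effectively full.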
Re: API Query Issues
Yeah, that was a long shot it seems and not the issue. I haven't run the query logging yet but will end up doing so tonight.
(Attached graphs: 24 hour and 7 day views.)
Every day, right at 8PM, the queries against NLS start returning fewer results than they should. I've checked NLS and the logs do exist; the checks should be finding them. I've looked over the major cron jobs on both XI and NLS but haven't identified anything running at that time.
I'll report back with the query logging tomorrow.
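A quick way to sweep the usual scheduling locations on each host for anything that fires around 8PM (a sketch, run as root):
Code:
# System-wide cron entries
cat /etc/crontab
ls -l /etc/cron.d/ /etc/cron.hourly/ /etc/cron.daily/
# Per-user crontabs for every account on the box
for u in $(cut -d: -f1 /etc/passwd); do
  echo "== $u =="
  crontab -l -u "$u" 2>/dev/null
done
# On systemd-based hosts, timers are worth checking too
systemctl list-timers --all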
- Support Tech
Re: API Query Issues
How frequently is the maintenance job running on the NLS system? Check it under Admin > System > Command Subsystem. If the check is run during maintenance that could explain the behavior.
Re: API Query Issues
Once a day. It looks like it's going to run in a few minutes, so the time doesn't match up with the issue we're seeing.
Re: API Query Issues
I ran the slow query logging last night and didn't really see any issues when I reviewed the output, but I'm not an expert and just looked at the ms time values.
I did notice a few things:
8:00 - 8:20 PM is the time frame where the issue occurs.
Nagios XI and NagiosLS load was low or normal on all hosts.
The NLS index for the next day is getting created at 8PM, right when this is happening.
Backup jobs on our infrastructure are kicking off right around this same time frame. Nagios XI has a backup job (Veeam, snapshot based) that starts after the issue time frame, at 8:47PM. Looking over the layout of the network path between XI and NLS shows some contention during this time frame. XI is on a 10Gb/s network while the NLS hosts are on a 1Gb/s network. The 1Gb/s network is seeing contention while the 10Gb/s network is fine.
All storage for all VMs is SAN based. The sysadmin isn't seeing any performance issues on the VMs or the SAN, though. The SAN is tiered storage with a 10TB flash cache and 30TB of 10k disks.
Unless cdienger sees anything out of sorts in the logs I sent him via PM, I'm leaning towards a NIC saturation issue at this point, but am open to other ideas. I've approached my sysadmin about delaying backups by an hour to see if the issue persists, but that's a larger change than he wants to take right off the bat.
Are there any other logs or metrics I can pull while this is happening tonight that would help narrow down where the issue is?
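For reference, one set of Elasticsearch-side data points that could be captured on an NLS node while the window is open - a minimal sketch, assuming the API is listening on the default port 9200:
Code:
# Snapshot cluster health and thread pool activity once a minute through the 8:00-8:20 window.
for i in $(seq 1 25); do
  date
  curl -s 'http://localhost:9200/_cluster/health?pretty'
  curl -s 'http://localhost:9200/_cat/thread_pool?v'
  sleep 60
done > /tmp/es-window-$(date +%F).log 2>&1
If shards briefly go unassigned when the new daily index is created, or the search thread pool shows queueing or rejections, it should show up in this output.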
- Support Tech
Re: API Query Issues
Nothing is jumping out from the logs that were provided, but the fact that it happens when the new index is created makes me think it's related to how that is being handled. I'd like to get a profile from each NLS machine to take a look at some of the other logs. This can be done from the command line with:
Code:
/usr/local/nagioslogserver/scripts/profile.sh
This will create /tmp/system-profile.tar.gz. Please run this on each machine.
I'd also like to see the full query that is being run by XI so I can better filter through the logs. Please provide a screenshot of the plugin's settings so we can see all the details.
Lastly, I'd also like to get a copy of the current settings index. This can be gathered by running:
Code:
curl -XPOST http://localhost:9200/nagioslogserver/_export?path=/tmp/nagioslogserver.tar.gz
The file it creates, and that we'd like to see, is /tmp/nagioslogserver.tar.gz. It only needs to be run on one machine in the cluster.