We have an integration between Nagios XI and NagiosLS using a custom version of the check_nagioslogserver plugin that passes the Nagios XI host address along as part of the query. Basically, Nagios XI runs periodic queries against NLS for specific phrases and filters the results down to just the address (IP) of the XI host. These checks ultimately open tickets for the NOC staff to work. This works well.
We've started noticing that some of these queries return 0 results for the check when there are in fact log events that should have been picked up. This causes the check to recover, closing out the case on our helpdesk when it's really still an issue. The case is then reopened an hour or so later when the check runs again, putting the case back at the bottom of the queue. I'm trying to trace this down and figure out where the failure is happening. We've had some staffing changes due to Covid, so this could have been happening for some time; we were only just able to get to the issue the same day it occurred, so the false recovery closing out the case might simply not have been noticed in the past.
Is there an API log that would show me when queries were run against NLS?
On Nagios XI I see check_nagioslogserver uses stdout/echo for its error logs; would these errors end up in the root or nagios mailbox?
How would I go about finding out how many queries are being run against NLS within a 1/5/15 minute interval?
Are there any best practice tips or sizing guides for query rate on an NLS cluster? Any metrics specific to this that I should look at to see if we're just hitting sizing issues?
Thanks for any help you can provide!
- Support Tech
Re: API Query Issues
When the plugin runs it will create an entry like this in /var/log/apache2/access.log:
Code:
192.168.55.93:80 192.168.55.20 - - [02/Sep/2020:07:36:47 +1200] "POST /nagioslogserver/index.php/api/check/query HTTP/1.1" 200 559 "-" "BinGet/1.00.A"
Not really much to go on other than seeing when the check was run. To see the actual queries and results you can configure Elasticsearch to log 'slow' queries - where 'slow' is a value you set and is low enough to capture pretty much everything. You can do this by editing /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml and adding this to the bottom:
Code:
index.search.slowlog.threshold.query.trace: 1ms
index.search.slowlog.threshold.fetch.trace: 1ms
index.search.slowlog.threshold.index.trace: 1ms
and then restart Elasticsearch:
Code:
systemctl restart elasticsearch
This will write to /var/log/elasticsearch/<UUID>__index_search_slowlog.log, which has the potential to become a pretty large log, so keep your eye on it.
As far as monitoring, I would suggest setting up the NCPA client on NLS so that XI can monitor the cluster status and other things like drive space, CPU, system memory, and the Java heap space - https://support.nagios.com/kb/article/n ... i-857.html.
A query that returns 0 when it should return something makes me think that shards may not be available or there is an issue with loading results into memory. If you can identify approximately when the query failed, I would see if you can correlate that with anything in the default Elasticsearch log /var/log/elasticsearch/<UUID>.log.
On the XI side, if the query is returning an 'OK' status with 0 results then I wouldn't expect there to be any errors logged or in either mailbox. However, the plugin does return an 'UNKNOWN' status and message if there is an error, which should be part of the plugin output and logged in /usr/local/nagios/var/nagios.log.
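For the question about query rate, the access log entries above have enough to count requests per interval. A rough sketch, assuming the default Apache log format and location shown in the entry above:
Code:
# Count check/query API hits per minute from the Apache access log on the NLS node.
grep 'POST /nagioslogserver/index.php/api/check/query' /var/log/apache2/access.log \
  | awk -F'[][]' '{ print substr($2, 1, 17) }' \
  | sort | uniq -c
The awk call trims the bracketed timestamp down to the minute; summing adjacent minutes gives the 5 and 15 minute figures.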
Re: API Query Issues
Thanks cdienger, this is really helpful information. I'll take a look at the apache log and the slow query log.
- Support Tech
Re: API Query Issues
Keep us posted.
Re: API Query Issues
Current running theory is that we're saturating a 1Gb/s link internally at the same time every night during backups. I was under the assumption that all VMs in our prod environment were living on 10Gb/s interfaces/vswitches, but for some reason XI was on a 1Gb/s interface. That interface is seeing contention at roughly 8PM every night, which is also when these checks are failing. We recently halved the check interval for this particular check, which gave us better insight into when exactly the issue was occurring. We're going to move XI to a 10Gb/s interface and see how it goes over the next few days.
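A simple way to confirm or rule out link saturation during the backup window is to sample interface throughput around 8PM on both the XI and NLS hosts. A minimal sketch, assuming the sysstat package is installed:
Code:
# Sample throughput on all interfaces every 60 seconds for 30 minutes,
# started shortly before 8PM (e.g. from a screen session or a one-off cron entry).
sar -n DEV 60 30 > /tmp/net-$(hostname)-$(date +%F).log
# Afterwards, check rxkB/s and txkB/s for the interface in question;
# sustained values around 120,000 kB/s mean a 1Gb/s link is effectively full.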
Re: API Query Issues
Yeah, that was a long shot it seems and not the issue. I haven't run the query logging yet but will end up doing so tonight.
(Attached graphs: 24 hour and 7 day views.)
Every day, right at 8PM, the queries against NLS start returning fewer results than they should. I've checked NLS and the logs do exist; the checks should be finding them. I've looked over the major cron jobs on both XI and NLS but haven't identified anything running at that time.
I'll report back with the query logging tomorrow.
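A quick way to sweep the usual scheduling locations on each host for anything that fires around 8PM (a sketch, run as root):
Code:
# System-wide cron entries
cat /etc/crontab
ls -l /etc/cron.d/ /etc/cron.hourly/ /etc/cron.daily/
# Per-user crontabs for every account on the box
for u in $(cut -d: -f1 /etc/passwd); do
  echo "== $u =="
  crontab -l -u "$u" 2>/dev/null
done
# On systemd-based hosts, timers are worth checking too
systemctl list-timers --all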
- Support Tech
Re: API Query Issues
How frequently is the maintenance job running on the NLS system? Check it under Admin > System > Command Subsystem. If the check is run during maintenance that could explain the behavior.
Re: API Query Issues
Once a day. It looks like it's going to run in a few minutes, so the time doesn't match up with the issue we're seeing.
Re: API Query Issues
I ran the slow query logging last night and didn't really see any issues when I reviewed the output, but I'm not an expert and just looked at the ms time values.
I did notice a few things:
8:00 - 8:20 PM is the time frame where the issue occurs.
Nagios XI and NagiosLS load was low or normal on all hosts.
The NLS index for the next day is getting created at 8PM, right when this is happening.
Backup jobs on our infrastructure are kicking off right around this same time frame. Nagios XI has a backup job (Veeam, snapshot based) that starts after the issue time frame, at 8:47PM. Looking over the layout of the network path between XI and NLS shows some contention during this time frame. XI is on a 10Gb/s network while the NLS hosts are on a 1Gb/s network. The 1Gb/s network is seeing contention while the 10Gb/s network is fine.
All storage for all VMs is SAN based. The sysadmin isn't seeing any performance issues on the VMs or the SAN, though. The SAN is tiered storage with a 10TB flash cache and 30TB of 10k disks.
Unless cdienger sees anything out of sorts in the logs I sent him via PM, I'm leaning towards a NIC saturation issue at this point, but am open to other ideas. I've approached my sysadmin about delaying backups by an hour to see if the issue persists, but that's a larger change than he wants to take right off the bat.
Are there any other logs or metrics I can pull while this is happening tonight that would help narrow down where the issue is?
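For reference, one set of Elasticsearch-side data points that could be captured on an NLS node while the window is open - a minimal sketch, assuming the API is listening on the default port 9200:
Code:
# Snapshot cluster health and thread pool activity once a minute through the 8:00-8:20 window.
for i in $(seq 1 25); do
  date
  curl -s 'http://localhost:9200/_cluster/health?pretty'
  curl -s 'http://localhost:9200/_cat/thread_pool?v'
  sleep 60
done > /tmp/es-window-$(date +%F).log 2>&1
If shards briefly go unassigned when the new daily index is created, or the search thread pool shows queueing or rejections, it should show up in this output.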
- Support Tech
Re: API Query Issues
Nothing is jumping out from the logs that were provided, but the fact that it happens when the new index is created makes me think it's related to how that is being handled. I'd like to get a profile from each NLS machine to take a look at some of the other logs. This can be done from the command line with:
Code:
/usr/local/nagioslogserver/scripts/profile.sh
This will create /tmp/system-profile.tar.gz. Please run this on each machine.
I'd also like to see the full query that is being run by XI so I can better filter through the logs. Please provide a screenshot of the plugin's settings so we can see all the details.
Lastly, I'd also like to get a copy of the current settings index. This can be gathered by running:
Code:
curl -XPOST http://localhost:9200/nagioslogserver/_export?path=/tmp/nagioslogserver.tar.gz
The file it creates, and that we'd like to see, is /tmp/nagioslogserver.tar.gz. It only needs to be run on one machine in the cluster.