Nagios user java command using over 200% CPU

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Nagios user java command using over 200% CPU

Post by rferebee »

Good morning,

I'm trying to troubleshoot a potential issue with our main Log Server cluster. The primary node shows the Nagios user account running a java command that is using over 200% CPU. See the attached screenshot.

Is this normal behavior? We have over 500 hosts sending in logs, and we intermittently see performance spikes of over 90% CPU utilization in vSphere. We currently have 36 cores assigned between the two nodes and we're still seeing the spikes.

Thank you.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Nagios user java command using over 200% CPU

Post by cdienger »

There will be two java processes on the NLS machine - one for logstash and one for elasticsearch. You can see more information, including the PID, in a process listing and compare that to the PID shown in the top command to identify which is which. That said, it's not uncommon to see spikes in activity when queries (especially ones that can return a lot of data) are being made or maintenance tasks are run.
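
If it helps to tell them apart, something like this will list the two java processes with their PIDs and command lines (assuming the default logstash and elasticsearch process names), which you can then match against the PIDs shown in top:

ps -eo pid,user,%cpu,args | grep -iE 'logstash|elasticsearch' | grep -v grep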
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Nagios user java command using over 200% CPU

Post by rferebee »

Thank you for your reply.

Does 36 cores seem like overkill for our environment? The Nagios recommended specs indicate far less is necessary. We are taking in IIS logs from 10 different Exchange servers, a change we made last month, but even then we're well under 1,000 hosts.

Or, perhaps the better question: which resource has the greatest effect on system performance, CPU or RAM?

I find it hard to believe that with 36 cores we should be getting anywhere near 100% CPU utilization. See attached for CPU % in the last 24 hours.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Nagios user java command using over 200% CPU

Post by cdienger »

Are the cores split evenly between the two so there are 18 on each machine? Does 'top' show spikes near 1800%? I'm not sure how the graph provided is generated, but it doesn't seem to match what the top command is showing.
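
To double-check the core count each node actually sees, nproc or lscpu will tell you (nothing Nagios-specific here, just a sanity check):

nproc
lscpu | grep '^CPU(s):'

Keep in mind top reports per-process CPU relative to a single core, so a java process on an N-core box can legitimately show up to N x 100%; pressing 1 inside top breaks the usage out per core.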
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Nagios user java command using over 200% CPU

Post by rferebee »

No, each server in the cluster has 36 cores (6 CPUs with 6 cores each).

The biggest spike I've seen using 'top' is just over 400%. Nothing near 1800%... at least that I've seen.

Sorry, the graph was generated in vSphere from the primary server in our main cluster. It's a representation of CPU % over the last 24 hours.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Nagios user java command using over 200% CPU

Post by cdienger »

Those are pretty big machines and probably a bit of overkill, given that the load seen in top doesn't seem to require it. Do you experience actual slowness when using the system, or is this more of a concern about the graph? What is the full check that is generating the graph?
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Nagios user java command using over 200% CPU

Post by rferebee »

Yes, we experience slowness, unresponsiveness and system lockups.

For example, if a snapshot is in progress and I try to navigate to the 'Snapshots & Maintenance' page, the entire system locks up and I have to wait roughly 20-30 minutes before I get control back.

I assumed it was because the system was being taxed while running the snapshot, but now I'm thinking it's something else given your response.

Here's the graph information from VMware: https://docs.vmware.com/en/VMware-vSphe ... 54817.html
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Nagios user java command using over 200% CPU

Post by cdienger »

What time is the snapshots_maintenance job run (Admin > System > Command Subsystem)? Does top show a larger spike during this time? Try setting the snapshots_maintenance job to run during off-peak hours.
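
If you want to confirm whether a snapshot is actually running when the slowness hits, you can ask Elasticsearch directly (assuming it's listening on the default port 9200 on the node):

curl -s localhost:9200/_snapshot/_status

An empty "snapshots" list in the response means nothing is being snapshotted at that moment.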
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Nagios user java command using over 200% CPU

Post by rferebee »

Our snapshots start at 22:30 every night. Sometimes, however, they don't finish until well into the next day.

I had a prior ticket open regarding that and it was suggested I decrease the number of indexes we optimize. I did that, but occasionally it still takes a long time for the snapshots to finish. There is one running from last night as I type this.

Spikes only occur during business hours.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Nagios user java command using over 200% CPU

Post by cdienger »

I assume the slowness is happening throughout the day and occurs even when snapshots are not being taken? Do you have a lot of alerts set up or a lot of users going through dashboards? Large, frequent queries (wide date ranges, wildcards, etc.) can spike things.
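
One quick way to see what Elasticsearch is busy with during a spike (again assuming the default port 9200) is the hot threads API:

curl -s localhost:9200/_nodes/hot_threads

Expensive searches will generally show up there as busy search threads.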

You may want to monitor the CPU of the machine using NCPA:

https://www.nagios.org/ncpa/
https://www.nagios.org/ncpa/getting-started.php#linux
https://www.youtube.com/watch?v=cYUduX5 ... hkqTa3_9io
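
Once the agent is installed, a CPU check against the Log Server node would look something along these lines (the host, token, and thresholds below are just placeholders):

./check_ncpa.py -H <log_server_ip> -t '<your_token>' -M cpu/percent -w 80 -c 90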