NLS stopped working

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
NLS stopped working

Post by WillemDH »

Hello,

It seems that since yesterday at 08:00, no more logs have been visible in my NLS.

Monday evening 09/02 I did a yum update on one node. I did the same yesterday around noon on the other node, but didn't notice afterwards that it had stopped working.

I tried re-applying the configuration. I can't even open dashboards; it seems like my NLS is frozen again.

Code:

11363 nagios    20   0 11.2g 1.5g 358m S 112.7 39.1   1442:40 java
24849 root      39  19 4007m 358m  12m S 24.9  9.0   1:03.66 java
So I initiated a restart of the elasticsearch service on one node (the one with the highest Java CPU usage):

Code:

service elasticsearch restart
Stopping elasticsearch:                                    [  OK  ]
Starting elasticsearch:                                    [  OK  ]
After which it seems some logs are getting processed again. I'll update this thread later to confirm whether the issue was solved or not.

Please advise on how to troubleshoot this. I would like to find out why it became unstable, but I was not able to find any useful info in the elasticsearch or logstash logs. Both NLS servers now have 6 CPUs and 4 GB RAM, with a 2 GB swap file. I have not added any new sources since last week.

As we want to make NLS our primary central logging system and eventually add more and more sources, some of which are very critical, I need to be sure it stays stable and that we get alerted when something goes wrong. Since the elasticsearch and logstash services were still running, how could I make a Nagios XI check to see if anything is going wrong? This has been discussed in another thread too. Imho it's very important that we get alerted if some or, as in this case, all sources' logs stop getting processed.
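Something as simple as counting the documents indexed during the last few minutes might already do as a check. A rough sketch of what I have in mind (host, port and thresholds are just my own assumptions, not verified NLS defaults):

Code:

#!/bin/bash
# Nagios-style check idea: count documents indexed in the last 5 minutes
# and alert when the number drops to (or near) zero.
ES="http://localhost:9200"
COUNT=$(curl -s "$ES/_count" -d '{"query":{"range":{"@timestamp":{"gte":"now-5m"}}}}' \
        | grep -o '"count":[0-9]\+' | grep -o '[0-9]\+')
if [ -z "$COUNT" ]; then echo "UNKNOWN - no answer from elasticsearch"; exit 3; fi
if [ "$COUNT" -eq 0 ]; then echo "CRITICAL - 0 logs indexed in the last 5 minutes"; exit 2; fi
if [ "$COUNT" -lt 50 ]; then echo "WARNING - only $COUNT logs indexed in the last 5 minutes"; exit 1; fi
echo "OK - $COUNT logs indexed in the last 5 minutes"; exit 0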

Even now, on my second log server where I did not restart the elasticsearch service, I see the java process spiking high on CPU (227% CPU?). (EDIT: Saw it even go to 330%, but did not have the time to take a screenshot.)

Code:

 1249 nagios    20   0 11.3g 1.7g 524m S 227.8 44.0  48:47.28 java
29827 root      39  19 3991m 289m  12m S 12.0  7.3   3:12.59 java
EDIT: It seems that after the elasticsearch service restart, logs are being processed again, but I do have a gap of 25 hours. See screenshot. Any advice on how to prevent this is welcome. CPU seems to have calmed down after I restarted the elasticsearch service on the second node.

EDIT 2: Had to restart the elasticsearch service again on both nodes, even after deleting the dashboard that I thought was causing this. There must be something else wrong.
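If the CPU spikes come back, I'll try to capture what elasticsearch is actually doing at that moment. A quick sketch (assuming the default HTTP port 9200 on the node itself):

Code:

# cluster health at a glance (status, number of nodes, unassigned shards)
curl -s 'http://localhost:9200/_cluster/health?pretty'
# which threads are burning CPU right now
curl -s 'http://localhost:9200/_nodes/hot_threads'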

Grtz

Willem

Re: NLS stopped working

Post by WillemDH »

My NLS stopped processing logs again, see screenshot. I managed to kickstart it again by applying configuration.
I suspect a 'service elasticsearch restart' alone (as suggested in http://support.nagios.com/forum/viewtop ... +dashboard) is not enough to solve a hanging / frozen dashboard.
It seems necessary to apply configuration afterwards, or NLS just stops working.

Re: NLS stopped working

Post by scottwilkerson »

I have a feeling you may be hitting a recently discovered bug that causes logstash to freeze if too many files are open. You may be able to verify this by looking at /var/log/logstash/logstash.log.
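If it is that bug, the logstash log typically fills up with flush failures. A quick way to eyeball it (the error string below is just an example of what to look for):

Code:

tail -n 50 /var/log/logstash/logstash.log
# how often the output flush has been failing
grep -c 'Failed to flush outgoing items' /var/log/logstash/logstash.log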

We are working on a fix; it is in final testing and will be available in the next release.

Re: NLS stopped working

Post by WillemDH »

Scott,

I tailed /var/log/logstash/logstash.log.

It seems my log file was full of these (and the timestamps indeed roughly match the time I had issues):

Code:

{:timestamp=>"2015-02-11T11:28:52.570000+0100", :message=>"Failed to flush outgoing items", :outgoing_count=>278, :exception=>#<Errno::EBADF: Bad file descriptor - Bad file descriptor>, :backtrace=>["org/jruby/RubyIO.java:2097:in `close'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:173:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:168:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:156:in `connect'", "org/jruby/RubyArray.java:1613:in `each'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:139:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:406:in `connect'", "org/jruby/RubyProc.java:271:in `call'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/pool.rb:48:in `fetch'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:403:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:319:in `execute'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:217:in `post!'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:106:in `bulk_ftw'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:80:in `bulk'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315:in `flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219:in `buffer_flush'", "org/jruby/RubyHash.java:1339:in `each'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216:in `buffer_flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193:in `buffer_flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:112:in `buffer_initialize'", "org/jruby/RubyKernel.java:1521:in `loop'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:110:in `buffer_initialize'"], :level=>:warn}
{:timestamp=>"2015-02-11T11:28:52.572000+0100", :message=>"Failed to flush outgoing items", :outgoing_count=>877, :exception=>#<Errno::EBADF: Bad file descriptor - Bad file descriptor>, :backtrace=>["org/jruby/RubyIO.java:2097:in `close'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:173:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:168:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:156:in `connect'", "org/jruby/RubyArray.java:1613:in `each'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:139:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:406:in `connect'", "org/jruby/RubyProc.java:271:in `call'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/pool.rb:48:in `fetch'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:403:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:319:in `execute'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:217:in `post!'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:106:in `bulk_ftw'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:80:in `bulk'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315:in `flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219:in `buffer_flush'", "org/jruby/RubyHash.java:1339:in `each'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216:in `buffer_flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193:in `buffer_flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:112:in `buffer_initialize'", "org/jruby/RubyKernel.java:1521:in `loop'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:110:in `buffer_initialize'"], :level=>:warn}
Would these logs imply the bug you were talking about?

Grtz

Re: NLS stopped working

Post by tmcdonald »

Please run the following and show us the output:

Code:

lsof | wc -l
Also, please post a screenshot of your Administration -> Backup & Maintenance page.
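If the totals look high, a per-process count for logstash and elasticsearch is usually more telling than the system-wide number. Roughly (run as root; the pgrep patterns are what I'd expect on an NLS box, not verified):

Code:

for P in $(pgrep -f logstash; pgrep -f elasticsearch); do
    echo "PID $P: $(ls /proc/$P/fd | wc -l) open fds (soft limit: $(awk '/Max open files/ {print $4}' /proc/$P/limits))"
done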

Re: NLS stopped working

Post by WillemDH »

I had more issues today. Dashboard loading was super slow. It's like performance has only gone down since I added 5 CPUs... :(

I had to install lsof first. This command was executed after an elasticsearch restart, as the NLS site was frozen again.

Node01:

Code:

 lsof | wc -l
6148
Node02:

Code:

lsof | wc -l
6410
I'm sorry to say this, but we have had nothing but problems since we started using Nagios Log Server. I've spent multiple days troubleshooting and trying to make NLS stable. On 23/02 I have to give a presentation to our management about the NLS server. If I don't manage to make it stable, I will have to postpone this presentation, as it is just too slow, starts hanging / freezing, or even stops processing logs completely. I have no idea how to explain the time I have invested in NLS, or why it's just not stable enough to process the logs of 34 ESX servers, 1 Infoblox device, one Windows server and 3 Linux servers (Nagios XI + 2 NLS) on 2 NLS servers with 6 CPUs, 4 GB RAM and SSD storage.
It's not like I'm doing any exotic configuration, and I would think our NLS is not receiving anywhere near as many logs as it should be able to handle.

A tail of the logstash log:

Code:

 tail -f /var/log/logstash/logstash.log
{:timestamp=>"2015-02-12T11:37:51.987000+0100", :message=>"Failed to flush outgoing items", :outgoing_count=>634, :exception=>#<Errno::EBADF: Bad file descriptor - Bad file descriptor>, :backtrace=>["org/jruby/RubyIO.java:2097:in `close'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:173:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:168:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:156:in `connect'", "org/jruby/RubyArray.java:1613:in `each'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:139:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:406:in `connect'", "org/jruby/RubyProc.java:271:in `call'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/pool.rb:48:in `fetch'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:403:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:319:in `execute'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:217:in `post!'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:106:in `bulk_ftw'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:80:in `bulk'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315:in `flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219:in `buffer_flush'", "org/jruby/RubyHash.java:1339:in `each'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216:in `buffer_flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193:in `buffer_flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:112:in `buffer_initialize'", "org/jruby/RubyKernel.java:1521:in `loop'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:110:in `buffer_initialize'"], :level=>:warn}
{:timestamp=>"2015-02-12T11:37:52.153000+0100", :message=>"Failed to flush outgoing items", :outgoing_count=>1145, :exception=>#<Errno::EBADF: Bad file descriptor - Bad file descriptor>, :backtrace=>["org/jruby/RubyIO.java:2097:in `close'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:173:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:168:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:156:in `connect'", "org/jruby/RubyArray.java:1613:in `each'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:139:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:406:in `connect'", "org/jruby/RubyProc.java:271:in `call'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/pool.rb:48:in `fetch'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:403:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:319:in `execute'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:217:in `post!'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:106:in `bulk_ftw'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:80:in `bulk'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315:in `flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219:in `buffer_flush'", "org/jruby/RubyHash.java:1339:in `each'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216:in `buffer_flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193:in `buffer_flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:112:in `buffer_initialize'", "org/jruby/RubyKernel.java:1521:in `loop'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:110:in `buffer_initialize'"], :level=>:warn}
{:timestamp=>"2015-02-12T11:37:52.244000+0100", :message=>"Failed to flush outgoing items", :outgoing_count=>5000, :exception=>#<Errno::EBADF: Bad file descriptor - Bad file descriptor>, :backtrace=>["org/jruby/RubyIO.java:2097:in `close'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:173:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:168:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:156:in `connect'", "org/jruby/RubyArray.java:1613:in `each'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:139:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:406:in `connect'", "org/jruby/RubyProc.java:271:in `call'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/pool.rb:48:in `fetch'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:403:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:319:in `execute'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:217:in `post!'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:106:in `bulk_ftw'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:80:in `bulk'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315:in `flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219:in `buffer_flush'", "org/jruby/RubyHash.java:1339:in `each'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216:in `buffer_flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193:in `buffer_flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:159:in `buffer_receive'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:311:in `receive'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/base.rb:86:in `handle'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/base.rb:78:in `worker_setup'"], :level=>:warn}
{:timestamp=>"2015-02-12T11:37:52.266000+0100", :message=>"Failed to flush outgoing items", :outgoing_count=>1588, :exception=>#<Errno::EBADF: Bad file descriptor - Bad file descriptor>, :backtrace=>["org/jruby/RubyIO.java:2097:in `close'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:173:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:168:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:156:in `connect'", "org/jruby/RubyArray.java:1613:in `each'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:139:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:406:in `connect'", "org/jruby/RubyProc.java:271:in `call'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/pool.rb:48:in `fetch'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:403:in `connect'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:319:in `execute'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:217:in `post!'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:106:in `bulk_ftw'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:80:in `bulk'", "/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315:in `flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219:in `buffer_flush'", "org/jruby/RubyHash.java:1339:in `each'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216:in `buffer_flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193:in `buffer_flush'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:112:in `buffer_initialize'", "org/jruby/RubyKernel.java:1521:in `loop'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:110:in `buffer_initialize'"], :level=>:warn}
{:timestamp=>"2015-02-12T11:43:31.399000+0100", :message=>"Using milestone 2 input plugin 'tcp'. This plugin should be stable, but if you see strange behavior, please let us know! For more information on plugin milestones, see http://logstash.net/docs/1.4.2/plugin-milestones", :level=>:warn}
{:timestamp=>"2015-02-12T11:43:31.602000+0100", :message=>"Using milestone 1 input plugin 'syslog'. This plugin should work, but would benefit from use by folks like you. Please let us know if you find bugs or have suggestions on how to improve this plugin.  For more information on plugin milestones, see http://logstash.net/docs/1.4.2/plugin-milestones", :level=>:warn}
{:timestamp=>"2015-02-12T12:00:20.674000+0100", :message=>"syslog udp listener died", :address=>"0.0.0.0:5546", :exception=>#<SocketError: recvfrom: name or service not known>, :backtrace=>["/usr/local/nagioslogserver/logstash/lib/logstash/inputs/syslog.rb:119:in `udp_listener'", "org/jruby/RubyKernel.java:1521:in `loop'", "/usr/local/nagioslogserver/logstash/lib/logstash/inputs/syslog.rb:118:in `udp_listener'", "/usr/local/nagioslogserver/logstash/lib/logstash/inputs/syslog.rb:76:in `run'"], :level=>:warn}
{:timestamp=>"2015-02-12T12:00:30.270000+0100", :message=>"Using milestone 2 input plugin 'tcp'. This plugin should be stable, but if you see strange behavior, please let us know! For more information on plugin milestones, see http://logstash.net/docs/1.4.2/plugin-milestones", :level=>:warn}
{:timestamp=>"2015-02-12T12:00:30.454000+0100", :message=>"Using milestone 1 input plugin 'syslog'. This plugin should work, but would benefit from use by folks like you. Please let us know if you find bugs or have suggestions on how to improve this plugin.  For more information on plugin milestones, see http://logstash.net/docs/1.4.2/plugin-milestones", :level=>:warn}
{:timestamp=>"2015-02-12T12:02:42.256000+0100", :message=>"syslog udp listener died", :address=>"0.0.0.0:5545", :exception=>#<SocketError: recvfrom: name or service not known>, :backtrace=>["/usr/local/nagioslogserver/logstash/lib/logstash/inputs/syslog.rb:119:in `udp_listener'", "org/jruby/RubyKernel.java:1521:in `loop'", "/usr/local/nagioslogserver/logstash/lib/logstash/inputs/syslog.rb:118:in `udp_listener'", "/usr/local/nagioslogserver/logstash/lib/logstash/inputs/syslog.rb:76:in `run'"], :level=>:warn}
EDIT 1: I just had to restart the elasticsearch service again; afterwards I tried applying configuration, and the website is completely frozen...

EDIT 2: After another restart of the elasticsearch service, I can log into the website again, but it seems logs are no longer getting processed. I can't just keep restarting the elasticsearch service and hoping it will suddenly work. When applying config, I get "The apply command hasn't started yet. The instance may not be online or is unreachable."

EDIT 3: After doing a 'service elasticsearch restart' on the node which could not apply the config, and then re-applying configuration, logs are coming in again.

EDIT 4: Just realised I'm monitoring the NLS servers with Nagios, so I attached a graph of open files from the moment I installed them. I hope it helps.

EDIT 5: When I read posts on GitHub, e.g. https://github.com/elasticsearch/logstash/issues/1896, of people with "Errno::EBADF: Bad file descriptor - Bad file descriptor" errors, they are talking about a misconfiguration:
My issue turned out to be a misconfiguration. I had 127.0.0.1 for the elasticsearch output host on my remote nodes, when I should have targeted the proper elasticsearch server in my organization.
Could I have the same misconfiguration? Where can I check this?
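One way I can think of to check where logstash is actually pointing its elasticsearch output (the config path below is a guess on my part based on the paths in the stack traces; the netstat check should work regardless):

Code:

# look for an elasticsearch host in the generated logstash config
grep -ri 'elasticsearch' /usr/local/nagioslogserver/logstash/etc/ 2>/dev/null | grep -i 'host'
# see which hosts the java processes are actually talking to on the elasticsearch ports
netstat -tnp | grep -E ':9200|:9300'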

/etc/rsyslog.d/nagioslogserver.conf on nls01:

Code:

# ### begin forwarding rule ###
#
# NAGIOS LOG SERVER
#
$WorkDirectory /var/lib/rsyslog    # where to place spool files
$ActionQueueFileName fwdRule1      # unique name prefix for spool files
$ActionQueueMaxDiskSpace 1g        # 1gb space limit (use as much as possible)
$ActionQueueSaveOnShutdown on      # save messages to disk on shutdown
$ActionQueueType LinkedList        # run asynchronously
$ActionResumeRetryCount -1         # infinite retries if host is down
*.* @@localhost:5546
#
# ### end of the forwarding rule ###
/etc/rsyslog.d/nagioslogserver.conf on nls02:

Code:

# ### begin forwarding rule ###
#
# NAGIOS LOG SERVER
#
$WorkDirectory /var/lib/rsyslog    # where to place spool files
$ActionQueueFileName fwdRule1      # unique name prefix for spool files
$ActionQueueMaxDiskSpace 1g        # 1gb space limit (use as much as possible)
$ActionQueueSaveOnShutdown on      # save messages to disk on shutdown
$ActionQueueType LinkedList        # run asynchronously
$ActionResumeRetryCount -1         # infinite retries if host is down
*.* @@localhost:5546
#
# ### end of the forwarding rule ###
Could it have something to do with the port I changed to 5546, as discussed in ticket 2015012810000141?
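As far as I understand, the '@@' in the rsyslog rule means the messages are forwarded over TCP (a single '@' would be UDP). To rule the port out, I could at least verify that something is listening on 5546 and that a hand-crafted test message makes it through (the nc options are what I'd try, not verified):

Code:

# is anything listening on the custom syslog port?
netstat -tlnp | grep 5546
# push a single RFC3164-style test message over TCP and then search for it in a dashboard
echo "<13>$(date '+%b %e %H:%M:%S') $(hostname) nls-test: port 5546 check" | nc -w1 localhost 5546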

This is the syslog input I had to define for all my Linux servers, as otherwise I was experiencing date parsing errors:

Code:

syslog {
    type => 'syslog-linux'
    port => 5546
}

Re: NLS stopped working

Post by scottwilkerson »

WillemDH,

We are aware of the "leaking file descriptors" issue and have a fix in place that is going through final testing; we should have a release fixing this issue soon.

Re: NLS stopped working

Post by WillemDH »

Scott,

Are you sure the issues I'm experiencing are the same "leaking file descriptors" issue you are talking about? When I read through other people's threads about the leaking file issue, I don't see any of these:

Code:

Logstash Daemon   Logstash Daemon dead but pid file exists
The problems start when I open a dashboard. The dashboard keeps loading, and after some time (about 60-90 seconds) the NLS website freezes; I can no longer go into other dashboards or do anything until I restart elasticsearch. I have not seen my logstash service stopping.
Then after I restart elasticsearch I have to apply config about one in three times to kickstart log processing again.

I might be wrong, but the leaking file descriptor threads all seem to be about the logstash service stopping...

The dashboards that make things hang vary in complexity. Some have only a few simple queries; see as an example a dashboard that is making things hang this very moment. Every dashlet has the loading circle spinning forever, in this case for 3+ minutes. After a restart of the elasticsearch service, things seem to work for a limited time while browsing dashboards, but after some 10-15 minutes the next dashboard I load makes things hang again.
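Next time a dashboard hangs, I'll try to capture what the elasticsearch queues look like at that moment. A rough sketch (default port assumed):

Code:

# are the search/bulk thread pools backing up or rejecting work?
curl -s 'http://localhost:9200/_cat/thread_pool?v'
# anything stuck in the cluster's pending task queue?
curl -s 'http://localhost:9200/_cat/pending_tasks?v'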

Willem

Re: NLS stopped working

Post by scottwilkerson »

I can't guarantee this is the same problem, but it is likely.

A new version (2015R1.3) came out yesterday that will resolve the issue once it is installed and you then run an Apply Configuration (even if you didn't make any config changes).

Re: NLS stopped working

Post by WillemDH »

Hey Scott,

Just installed 1.3 and did some basic tests for about an hour. I did not see any excessive dashboard loading times or freezes so far. I'll do some more tests on Tuesday and will let you know the results.

Thanks and grtz..

Willem