Rsyslog brings down the house if NLS goes down

polarbear1 · Post by **polarbear1** » Wed May 06, 2015 3:15 pm

Have a bit of a rsyslog problem. In short, if NLS goes down for any reason, the stock behavior of rsyslog on the client machine is to keep buffering and spinning up threads and slowing down the machine to where even a ssh login that took 1/2 second under normal conditions, will take 5+ seconds. From what I read on rsyslog's website this is totally standard behavior, so no big surprised there, but this is not what I want. The fact that rsyslog causes these slowdowns can be easily verified by reconnecting to NLS and giving it a minute to catch up, or simply by killing rsyslog on the client machine.

In the perfect world, I'd like to buffer, disk buffer even (my production application is all in the RAM, so disk performance is not important) for some reasonable time (up to a day or 2) and then start dumping files rather than slowing down my system.

http://b.kl3in.com/2011/10/ubuntu-serve ... esponding/ I considered this, but this method drops all the buffer, potentially causing me a full day of lost records, which is not ideal. I considered switching to UDP, but that presents a similar problem. I tried messing around with the forwarding (near the bottom) section of rsyslog.conf in VM test, but I observed that after it runs out of allocated disk space, it will still start slowing down rather than dumping. NLS also doesn't appear to be using that forwarding configuration, so that would make that setting completely irrelevant anyway.

I'm stuck here - how would I go about having my reasonable buffer, but then dump rather than slow down.

Code: Select all

Red Hat Enterprise Linux Server release 6.5 (Santiago)
Linux schpnag1 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

rsyslog.conf

Code: Select all

# rsyslog v5 configuration file

# For more information see /usr/share/doc/rsyslog-*/rsyslog_conf.html
# If you experience problems, see http://www.rsyslog.com/doc/troubleshoot.html

#### MODULES ####

$ModLoad imuxsock # provides support for local system logging (e.g. via logger command)
$ModLoad imklog   # provides kernel logging support (previously done by rklogd)
#$ModLoad immark  # provides --MARK-- message capability

# Provides UDP syslog reception
#$ModLoad imudp
#$UDPServerRun 514

# Provides TCP syslog reception
#$ModLoad imtcp
#$InputTCPServerRun 514


#### GLOBAL DIRECTIVES ####

# Use default timestamp format
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat

# File syncing capability is disabled by default. This feature is usually not required,
# not useful and an extreme performance hit
#$ActionFileEnableSync on

# Include all config files in /etc/rsyslog.d/
$IncludeConfig /etc/rsyslog.d/*.conf


#### RULES ####

# Log all kernel messages to the console.
# Logging much else clutters up the screen.
#kern.*                                                 /dev/console

# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none                /var/log/messages

# The authpriv file has restricted access.
authpriv.*                                              /var/log/secure

# Log all the mail messages in one place.
mail.*                                                  -/var/log/maillog


# Log cron stuff
cron.*                                                  /var/log/cron

# Everybody gets emergency messages
*.emerg                                                 *

# Save news errors of level crit and higher in a special file.
uucp,news.crit                                          /var/log/spooler

# Save boot messages also to boot.log
local7.*                                                /var/log/boot.log


# ### begin forwarding rule ###
# The statement between the begin ... end define a SINGLE forwarding
# rule. They belong together, do NOT split them. If you create multiple
# forwarding rules, duplicate the whole block!
# Remote Logging (we use TCP for reliable delivery)
#
# An on-disk queue is created for this action. If the remote host is
# down, messages are spooled to disk and sent when it is up again.
#$WorkDirectory /var/lib/rsyslog # where to place spool files
#$ActionQueueFileName fwdRule1 # unique name prefix for spool files
#$ActionQueueMaxDiskSpace 1g   # 1gb space limit (use as much as possible)
#$ActionQueueSaveOnShutdown on # save messages to disk on shutdown
#$ActionQueueType LinkedList   # run asynchronously
#$ActionResumeRetryCount -1    # infinite retries if host is down
# remote host is: name/ip:port, e.g. 192.168.0.1:514, port optional
#*.* @@remote-host:514
# ### end of the forwarding rule ###

# A template to for higher precision timestamps + severity logging
$template SpiceTmpl,"%TIMESTAMP%.%TIMESTAMP:::date-subseconds% %syslogtag% %syslogseverity-text%:%msg:::sp-if-no-1st-sp%%msg:::drop-last-lf%\n"

:programname, startswith, "spice-vdagent"	/var/log/spice-vdagent.log;SpiceTmpl

one of the config files in rsyslog.d - they are all the same (except the file it checks)

Code: Select all

$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /var/lib/rsyslog

# Input for gvsi_news_mt
$InputFileName /home/wombat/feeds/mq_reader/log/gvsi_news_mt.log
$InputFileTag gvsi_news_mt:
$InputFileStateFile nls-state-home_wombat_feeds_mq_reader_log_gvsi_news_mt.log # Must be unique for each file being polled
# Uncomment the folowing line to override the default severity for messages
# from this file.
#$InputFileSeverity info
$InputFilePersistStateInterval 20000
$InputRunFileMonitor

# Forward to Nagios Log Server and then discard, otherwise these messages
# will end up in the syslog file (/var/log/messages) unless there are other
# overriding rules.
if $programname == 'gvsi_news_mt' then @@schpnag1:5544
if $programname == 'gvsi_news_mt' then ~

jolson · Post by **jolson** » Wed May 06, 2015 4:08 pm

Please read the following:

If no queue is setup, logs can't be dropped and have to be preserved my memory or disk.

Let's add some extra configuration to rsyslog that will queue messages if Papertrail becomes unreachable, and time out new syslog() calls in 10ms. The latter prevents stalling if the queue becomes full.

Adding the following just before *.* @@nagioslogserver and restarting rsyslog should do the job:

$ActionResumeInterval 10
$ActionQueueSize 100000
$ActionQueueDiscardMark 97500
$ActionQueueHighWaterMark 80000
$ActionQueueType LinkedList
$ActionQueueFileName papertrailqueue
$ActionQueueCheckpointInterval 100
$ActionQueueMaxDiskSpace 2g
$ActionResumeRetryCount -1
$ActionQueueSaveOnShutdown on
$ActionQueueTimeoutEnqueue 10
$ActionQueueDiscardSeverity 0

I am confident that the above post will help you out - let us know if you have any questions about it!

polarbear1 · Post by **polarbear1** » Tue Jun 02, 2015 11:30 am

Bit late on the response here, but I think the solution provided in the link reflects my initial concern about in the OP.

NLS doesn't appear to be using the forwarding rules as listed in rsyslog.conf

Code: Select all

# rsyslog v5 configuration file

# For more information see /usr/share/doc/rsyslog-*/rsyslog_conf.html
# If you experience problems, see http://www.rsyslog.com/doc/troubleshoot.html

#### MODULES ####

$ModLoad imuxsock # provides support for local system logging (e.g. via logger command)
$ModLoad imklog   # provides kernel logging support (previously done by rklogd)
#$ModLoad immark  # provides --MARK-- message capability

# Provides UDP syslog reception
#$ModLoad imudp
#$UDPServerRun 514

# Provides TCP syslog reception
#$ModLoad imtcp
#$InputTCPServerRun 514


#### GLOBAL DIRECTIVES ####

# Use default timestamp format
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat

# File syncing capability is disabled by default. This feature is usually not required,
# not useful and an extreme performance hit
#$ActionFileEnableSync on

# Include all config files in /etc/rsyslog.d/
$IncludeConfig /etc/rsyslog.d/*.conf


#### RULES ####

# Log all kernel messages to the console.
# Logging much else clutters up the screen.
#kern.*                                                 /dev/console

# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none                /var/log/messages

# The authpriv file has restricted access.
authpriv.*                                              /var/log/secure

# Log all the mail messages in one place.
mail.*                                                  -/var/log/maillog


# Log cron stuff
cron.*                                                  /var/log/cron

# Everybody gets emergency messages
*.emerg                                                 *

# Save news errors of level crit and higher in a special file.
uucp,news.crit                                          /var/log/spooler

# Save boot messages also to boot.log
local7.*                                                /var/log/boot.log


# ### begin forwarding rule ###
# The statement between the begin ... end define a SINGLE forwarding
# rule. They belong together, do NOT split them. If you create multiple
# forwarding rules, duplicate the whole block!
# Remote Logging (we use TCP for reliable delivery)
#
# An on-disk queue is created for this action. If the remote host is
# down, messages are spooled to disk and sent when it is up again.
#$WorkDirectory /var/lib/rsyslog # where to place spool files
#$ActionQueueFileName fwdRule1 # unique name prefix for spool files
#$ActionQueueMaxDiskSpace 1g   # 1gb space limit (use as much as possible)
#$ActionQueueSaveOnShutdown on # save messages to disk on shutdown
#$ActionQueueType LinkedList   # run asynchronously
#$ActionResumeRetryCount -1    # infinite retries if host is down
# remote host is: name/ip:port, e.g. 192.168.0.1:514, port optional
#*.* @@remote-host:514
# ### end of the forwarding rule ###

# A template to for higher precision timestamps + severity logging
$template SpiceTmpl,"%TIMESTAMP%.%TIMESTAMP:::date-subseconds% %syslogtag% %syslogseverity-text%:%msg:::sp-if-no-1st-sp%%msg:::drop-last-lf%\n"

:programname, startswith, "spice-vdagent"	/var/log/spice-vdagent.log;SpiceTmpl

So where would I apply these settings?

Code: Select all

$ActionResumeInterval 10
$ActionQueueSize 100000
$ActionQueueDiscardMark 97500
$ActionQueueHighWaterMark 80000
$ActionQueueType LinkedList
$ActionQueueFileName papertrailqueue
$ActionQueueCheckpointInterval 100
$ActionQueueMaxDiskSpace 2g
$ActionResumeRetryCount -1
$ActionQueueSaveOnShutdown on
$ActionQueueTimeoutEnqueue 10
$ActionQueueDiscardSeverity 0

It looks like I would have to modify every file that the linux-install.sh script creates in /etc/rsysyslog.d/90-nagioslogserver_xxx.conf

Code: Select all

$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /var/lib/rsyslog

# Input for messages
$InputFileName /var/log/messages
$InputFileTag messages:
$InputFileStateFile nls-state-var_log_messages # Must be unique for each file being polled
# Uncomment the folowing line to override the default severity for messages
# from this file.
#$InputFileSeverity info
$InputFilePersistStateInterval 20000
$InputRunFileMonitor
----------INSERT THE ABOVE SETTINGS IN HERE ---------------
# Forward to Nagios Log Server and then discard, otherwise these messages
# will end up in the syslog file (/var/log/messages) unless there are other
# overriding rules.
if $programname == 'messages' then @@schpnag1:5544
if $programname == 'messages' then ~

Sorry for the stupidly basic question, just want to make sure I'm on the right track before I start changing several dozen config files.

Thanks.

jolson · Post by **jolson** » Tue Jun 02, 2015 12:47 pm

You are on the right track - you will need to modify /etc/rsysyslog.d/90-nagioslogserver_xxx.conf to follow suite with the configuration I posted above. Please note that I've edited my post to make the configuration more accessible. Test this on one of your servers first to ensure it's working, and then perhaps try simulating a failure.

polarbear1 · Post by **polarbear1** » Wed Jun 03, 2015 11:23 am

Thanks for the clarification.

Using the settings as provided --

Code: Select all

$ActionResumeInterval 10
$ActionQueueSize 100000
$ActionQueueDiscardMark 97500
$ActionQueueHighWaterMark 80000
$ActionQueueType LinkedList
$ActionQueueFileName papertrailqueue
$ActionQueueCheckpointInterval 100
$ActionQueueMaxDiskSpace 2g
$ActionResumeRetryCount -1
$ActionQueueSaveOnShutdown on
$ActionQueueTimeoutEnqueue 10
$ActionQueueDiscardSeverity 0

See the attached screenshot of what happened after I severed the connection between the server and NLS. Made some notes on it, as well as some questions. I hope I'm reading it correctly.

And some clarifications about the settings --
[*] $ActionQueueFileName needs to be distinct on every log instance I'm monitoring. So if I'm monitoring 9 files (9 configs in /etc/rsyslog.d/) then I have 9 distinct names. This also turns the Queue into a DiskAssisted queue type - for performance sake, over a straight up DiskQueue $ActionQueueType.
[*] $ActionQueueMaxDiskSpace -- My limit is 2g per queue, so if 9 queues, 18g total.

So per my results, I run out of buffer space about 10 minutes? On that note, the disk dumps of the queue go somewhere in the root (/) partition, so I should probably have more than 6gb available?

Still testing to determine where the system begins to stall with the current settings.

jolson · Post by **jolson** » Wed Jun 03, 2015 12:17 pm

The logs that were incoming after you severed the connection to NLS were either cached in logstash or different logs entirely. Where you have written 'Discarded because buffer was full?', I believe what's happening is that rsyslog is queueing the logs up on the client-side. This theory makes sense because when you restored the connection, you had a sudden burst of logs (rsyslog emptying its queue).

And some clarifications about the settings --
[*] $ActionQueueFileName needs to be distinct on every log instance I'm monitoring. So if I'm monitoring 9 files (9 configs in /etc/rsyslog.d/) then I have 9 distinct names. This also turns the Queue into a DiskAssisted queue type - for performance sake, over a straight up DiskQueue $ActionQueueType.
[*] $ActionQueueMaxDiskSpace -- My limit is 2g per queue, so if 9 queues, 18g total.

You are correct about ActionQueueFileName, it will need to be distinct for every configuration. It essentially sets the file name, and tells rsyslog to write to disk.

ActionQueueMaxDiskSpace - the documentation is sparse, but I am under the impression that this queue is limited to the configuration that it's defined in. This means that 9 queues would equal 18g total.

Still testing to determine where the system begins to stall with the current settings.

Let us know what you find out!

polarbear1 · Post by **polarbear1** » Wed Jun 03, 2015 3:14 pm

The results have some variability, but after a few tests I'm satisfied (or rather...very unsatisfied with the results)

So the graph pattern is consistent every time, we get the messages after the connection is restored for all but the last bit of the time we're disconnected, followed by the big spike, before evening back out to normal levels.

Per the 2nd screenshot, there are some logs from that spike, we can see the difference in timestamps between NLS timestamp and the timestamp in the message itself, so I take it those are the messages from that void that are being caught up from some other queue? So the question here is - the logs that come in with the correct timestamps (somewhere between the disconnect and reconnect) vs the logs that come in with the wrong timestamp as part of the burst on the reconnect - which is which (in terms of where they are being buffered) ?

From my tests, the system lasted anywhere from 15 to 30 minutes before performance started to degrade noticeably. This is judging by trying to start up a new ssh instance - if it takes more than 2 seconds to connect, it's stalling. This makes sense. Considering the 80k record limit and my average activity (lets say 2500 every 30 seconds, per the attached graph), that gives us 16 minutes before the buffer is full. But - since we're using a disk assisted queue, after 15 minutes I expect it to start writing some files to disk. I was monitoring disk usage (just doing a #: df -h every few minutes) and there was no increase in at all, when I would have been expecting several gigs written. Also, the system is stalling.

This is making me believe that the queue is not doing what it's supposed to be doing.

In the interest of completeness, here's a sample file out of /etc/rsyslog.d/

Code: Select all

$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /var/lib/rsyslog

# Input for messages
$InputFileName /var/log/messages
$InputFileTag messages:
$InputFileStateFile nls-state-var_log_messages # Must be unique for each file being polled
# Uncomment the folowing line to override the default severity for messages
# from this file.
#$InputFileSeverity info
$InputFilePersistStateInterval 20000
$InputRunFileMonitor
# Forward to Nagios Log Server and then discard, otherwise these messages
# will end up in the syslog file (/var/log/messages) unless there are other
# overriding rules.
#Buffer Settings
$ActionResumeInterval 10
$ActionQueueSize 100000
$ActionQueueDiscardMark 97500
$ActionQueueHighWaterMark 80000
$ActionQueueType LinkedList
$ActionQueueFileName varlogmessages
$ActionQueueCheckpointInterval 100
$ActionQueueMaxDiskSpace 2g
$ActionResumeRetryCount -1
$ActionQueueSaveOnShutdown on
$ActionQueueTimeoutEnqueue 10
$ActionQueueDiscardSeverity 0
if $programname == 'messages' then @@schpnag1:5544
if $programname == 'messages' then ~

Edit - I realized the error of my math. The graph is the total for everything, the 80000 high watermark is per log. So assuming all logs are more or less equally chatty (over 9 logs) .. that's 144 minutes we should be getting. We're nowhere near that. But I just did some testing and I reduced the queue size a 1000 fold...

Code: Select all

$ActionQueueSize 1000
$ActionQueueDiscardMark 975
$ActionQueueHighWaterMark 800

And after a 15 minute disconnect the system began to stall. After reconnecting there was still a large burst at first (17,600 records), and it back-filled only the first 5 minutes at anywhere from 50 to 500 records per 30 seconds. Still nothing going to disk. So going off this test I am inclined to say the buffer size settings are working for us. The discard and the dumping to disk is not. This test was not totally apples to apples in terms of message volume - end of day, the logs as not as chatty as they were earlier.

jolson · Post by **jolson** » Wed Jun 03, 2015 4:40 pm

I have been doing some testing on my end, and this is what I've found.

I haven't yet been able to use the 'imfile' plugin to store cached logs to disk somewhere. I did, however, get /var/log/messages working with the following configuration.

Code: Select all

[root@localhost rsyslog]# cat /etc/rsyslog.d/99-nagioslogserver.conf
### Begin forwarding rule for Nagios Log Server                           NAGIOSLOGSERVER
$WorkDirectory /var/lib/rsyslog # Where spool files will live             NAGIOSLOGSERVER
$ActionResumeInterval 10
$ActionQueueSize 100000
$ActionQueueDiscardMark 97500
$ActionQueueHighWaterMark 80000
$ActionQueueType LinkedList
$ActionQueueFileName varlogmessages
$ActionQueueCheckpointInterval 100
$ActionQueueMaxDiskSpace 2g
$ActionResumeRetryCount -1
$ActionQueueSaveOnShutdown on
$ActionQueueTimeoutEnqueue 10
$ActionQueueDiscardSeverity 0
*.* @@192.168.x.x:5544                                               # NAGIOSLOGSERVER
### End of Nagios Log Server forwarding rule                              NAGIOSLOGSERVER

This will store a file in /var/lib/rsyslog/, and the file increases to roughly 2g in size over time.

I'm still doing testing regarding the imfile module, and haven't gotten it working yet - let me know if you make any headway.

I currently believe that dumping to disk is implemented differently regarding the imfile. I'll let you know what I find out.

polarbear1 · Post by **polarbear1** » Tue Jun 09, 2015 4:48 pm

I am still not getting it to write to disk and I am inclined to believe it might be RedHat / rsyslog version related.

I tested the same configuration in a VM (with much newer software) and it performed exactly as expected. My VM test was performed in Fedora 21 with Rsyslog 7.4.10. My production box I'm trying to get it to run on is

Code: Select all

# cat /etc/redhat-release && rsyslogd -v
Red Hat Enterprise Linux Server release 6.6 (Santiago)
rsyslogd 5.8.10, compiled with:
        FEATURE_REGEXP:                         Yes
        FEATURE_LARGEFILE:                      No
        GSSAPI Kerberos 5 support:              Yes
        FEATURE_DEBUG (debug build, slow code): No
        32bit Atomic operations supported:      Yes
        64bit Atomic operations supported:      Yes
        Runtime Instrumentation (slow code):    No

See http://www.rsyslog.com for more information.

And that's the latest rsyslog available for RHEL6 - so it is what it is. Will open up a ticket with RH to see if they can weigh in on what's happening.

jolson · Post by **jolson** » Wed Jun 10, 2015 9:09 am

Will open up a ticket with RH to see if they can weigh in on what's happening.

That sounds good to me. If you could make us aware of the solution if you find it, I'm sure many people would appreciate it. Thanks!

Jesse

Nagios Support Forum

Rsyslog brings down the house if NLS goes down

Rsyslog brings down the house if NLS goes down

Re: Rsyslog brings down the house if NLS goes down

Re: Rsyslog brings down the house if NLS goes down

Re: Rsyslog brings down the house if NLS goes down

Re: Rsyslog brings down the house if NLS goes down

Re: Rsyslog brings down the house if NLS goes down

Re: Rsyslog brings down the house if NLS goes down

Re: Rsyslog brings down the house if NLS goes down

Re: Rsyslog brings down the house if NLS goes down

Re: Rsyslog brings down the house if NLS goes down