NSCA 2.9 client problem

Post by **tgriep** » Tue Apr 16, 2019 9:09 am

The "Suppressed" messages mean the system is generating a lot of log messages and journald is configured to drop some of them. This is called rate limiting, and it keeps the logging system from being overloaded.
To get all messages for troubleshooting, you need to raise these limits. This can be done by setting the variables RateLimitInterval and RateLimitBurst in the config file /etc/systemd/journald.conf.
To turn off any kind of rate limiting, set either value to 0.
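
As a rough sketch of what that could look like (the values are an illustration of "turn it off completely", not something taken from this system), the relevant section of /etc/systemd/journald.conf would be:

```ini
[Journal]
# Setting either value to 0 turns rate limiting off entirely.
RateLimitInterval=0
RateLimitBurst=0
```

journald reads this file at startup, so the change presumably needs `systemctl restart systemd-journald` before it takes effect. Newer systemd releases document the first option as RateLimitIntervalSec; the older name shown here is the one used above.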

After changing those settings, see if the messages are logged and post them here.

Post by **lhozzan** » Wed Apr 17, 2019 8:03 am

Hello.

Thank you for the advice. I tried it, but nothing unusual was logged.

Code:

Apr 17 12:43:21 localhost nsca[26319]: Caught SIGTERM - shutting down...
Apr 17 12:43:21 localhost systemd[1]: Stopping NSCA for uk cluster...
Apr 17 12:43:21 localhost nsca[26319]: Cannot remove pidfile '/var/run/nsca_uk.pid' - check your privileges.
Apr 17 12:43:21 localhost nsca[26319]: Daemon shutdown
Apr 17 12:43:21 localhost systemd[1]: Stopped NSCA for uk cluster.
Apr 17 12:43:21 localhost systemd[1]: Starting NSCA for uk cluster...
Apr 17 12:43:21 localhost systemd[1]: Started NSCA for uk cluster.
Apr 17 12:43:21 localhost nsca[19077]: Starting up daemon
Apr 17 12:43:43 localhost nagios: job 6192 (pid=19268): read() returned error 11
Apr 17 12:43:54 localhost nagios: job 6192 (pid=19364): read() returned error 11
Apr 17 12:48:43 localhost nagios: job 6201 (pid=21905): read() returned error 11
Apr 17 12:48:53 localhost nagios: job 6201 (pid=21990): read() returned error 11
Apr 17 12:48:54 localhost nagios: job 6201 (pid=22004): read() returned error 11
Apr 17 12:48:57 localhost nagios: job 6201 (pid=22039): read() returned error 11
Apr 17 12:50:01 localhost systemd[1]: Started Session 383 of user root.
Apr 17 12:57:05 localhost nagios: job 6215 (pid=26319): read() returned error 11
Apr 17 12:57:55 localhost nsca[19077]: Caught SIGTERM - shutting down...
Apr 17 12:57:55 localhost systemd[1]: Stopping NSCA for uk cluster...
Apr 17 12:57:55 localhost nsca[19077]: Cannot remove pidfile '/var/run/nsca_uk.pid' - check your privileges.
Apr 17 12:57:55 localhost nsca[19077]: Daemon shutdown
Apr 17 12:57:55 localhost systemd[1]: Stopped NSCA for uk cluster.

Of course, this is the output after filtering with grep (as before).

Do you have any idea what to check next to get this working?

Thank you for your effort.

Post by **tgriep** » Wed Apr 17, 2019 8:56 am

Check the permissions of the directory where the NSCA PID file is created.

Apr 17 12:43:21 localhost nsca[26319]: Cannot remove pidfile '/var/run/nsca_uk.pid' - check your privileges.

Other than that, the logs don't show much other than the daemon starting and stopping.
Question: did you go back to running the NSCA server as a daemon, or did you leave it running under xinetd?

Do this: when there are stuck connections on the Nagios server, note the IP addresses.
Go to the remote systems at those IP addresses and see if the send_nsca application is still running and holding the connection open.
If so, stop it and see if the connection is closed on the Nagios server.

Post by **lhozzan** » Thu Apr 18, 2019 2:45 am

Hello.

tgriep wrote: Check the permissions of the directory where the NSCA PID file is created.

Code:

-rw-r--r--  1 nagios nagios    5 apr 15 07:07 nsca_uk.pid

I think we can ignore this warning. It only occurs during shutdown. The PID file persists after shutdown, but when the process starts again, the correct PID is written to it. If you wish, I can change the unit file and move the PID file to another location.

tgriep wrote: Other than that, the logs don't show much other than the daemon starting and stopping.

Unfortunately, yes. I have no idea what is wrong, why so many connections are stuck in CLOSE_WAIT, or what to try next.

tgriep wrote: Question: did you go back to running the NSCA server as a daemon, or did you leave it running under xinetd?

Yes, it runs as a daemon directly under systemd. Running it under xinetd cost a huge amount of CPU.

tgriep wrote: Do this: when there are stuck connections on the Nagios server, note the IP addresses.
Go to the remote systems at those IP addresses and see if the send_nsca application is still running and holding the connection open.
If so, stop it and see if the connection is closed on the Nagios server.

So you want me to let NSCA use up all available connections and then check whether something on the client side is holding connections open? I am just asking for clarification.

Thank you for your effort.

Post by **tgriep** » Thu Apr 18, 2019 1:37 pm

To your question:
"So you want me to let NSCA use up all available connections and then check whether something on the client side is holding connections open?"
The answer is yes. Set up a client with NSCA 2.9.2 and see if that server causes the issue to happen. If so, check the client's log files to see if there are any errors there.

Post by **lhozzan** » Tue May 07, 2019 11:50 pm

Hello.

My apologies for the delay; I was on vacation.

tgriep wrote: The answer is yes. Set up a client with NSCA 2.9.2 and see if that server causes the issue to happen. If so, check the client's log files to see if there are any errors there.

As I expected, this NSCA process failed. Strangely, it did not fail because of connection problems (see my attachment) but because it hit the limit on open files.

Code:

[operator@server ~]$ service nsca_uk status
Redirecting to /bin/systemctl status nsca_uk.service
● nsca_uk.service - NSCA for uk cluster
   Loaded: loaded (/etc/systemd/system/nsca_uk.service; enabled; vendor preset: disabled)
   Active: active (running) since Ut 2019-04-30 10:41:54 UTC; 1 weeks 0 days ago
 Main PID: 32755 (nsca_uk)
   CGroup: /system.slice/nsca_uk.service
           └─32755 /usr/sbin/nsca_uk -c /etc/nagios/nsca_uk.cfg

may 08 03:37:01 server nsca[32755]: Network server accept failure (24: Too many open files)
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Code:

[operator@server ~]$ cat /proc/32755/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             31824                31824                processes 
Max open files            8000                 8000                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       31824                31824                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us

On the client side, I can see from the collectd logs that connections are not possible:

Code:

may 08 04:26:28 tools_server collectd[30919]: Connection refused by host
may 08 04:26:28 tools_server collectd[30919]: Error: Could not connect to host 45.33.80.18 on port 5660
may 08 04:26:28 tools_server collectd[30919]: Connection refused by host
may 08 04:26:28 tools_server collectd[30919]: Error: Could not connect to host 45.33.80.18 on port 5660
may 08 04:26:28 tools_server collectd[30919]: Connection refused by host
may 08 04:26:28 tools_server collectd[30919]: Error: Could not connect to host 45.33.80.18 on port 5660

So I will raise this limit again (to 16000) and see what happens. Or do you have any other suggestions?

Thank you for your effort.

Post by **tgriep** » Wed May 08, 2019 11:47 am

No, increasing the open files limit is what I would have suggested.
Let us know if this helps.
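
For a daemon started by systemd, the limit normally comes from the unit rather than from /etc/security/limits.conf. A minimal sketch, assuming a drop-in file at /etc/systemd/system/nsca_uk.service.d/limits.conf (the file name is an assumption; the value is the one proposed above):

```ini
[Service]
# Soft and hard limit on open file descriptors for this service
LimitNOFILE=16000
```

After `systemctl daemon-reload` and a restart of nsca_uk.service, `cat /proc/<pid>/limits` should show the new "Max open files" value.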

Post by **lhozzan** » Fri May 10, 2019 1:56 am

Hello.

Today I checked the situation, and it seems the problem is somewhere in the NSCA 2.9 server when it handles connections from NSCA 2.9 clients. After raising the limit on open files, the situation today looks like this:

Code:

[root@server ~]$ service nsca_uk status
Redirecting to /bin/systemctl status nsca_uk.service
● nsca_uk.service - NSCA for de cluster
   Loaded: loaded (/etc/systemd/system/nsca_uk.service; enabled; vendor preset: disabled)
   Active: active (running) since Št 2019-05-09 06:44:17 UTC; 23h ago
 Main PID: 30492 (nsca_uk)
   CGroup: /system.slice/nsca_uk.service
           └─30492 /usr/sbin/nsca_uk -c /etc/nagios/nsca_uk.cfg

may 09 06:44:17 server systemd[1]: Starting NSCA for de cluster...
may 09 06:44:17 server nsca[30492]: Starting up daemon
may 09 06:44:17 server systemd[1]: Started NSCA for de cluster.

[root@server ~]$ lsof -a -p 30492 | wc -l
8097

So NSCA has again reached the previous limit of 8000.
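
Note that `lsof | wc -l` also counts the header line and the cwd, rtd, txt and mem rows, so it slightly overstates the number of descriptors. A minimal sketch for comparing the real descriptor count against the soft limit by reading /proc directly; the function name fd_usage is made up for this example, and reading another user's /proc/<pid>/fd usually requires root:

```shell
# Print how many file descriptors a process holds and its soft limit.
# fd_usage is a hypothetical helper name, not part of NSCA or lsof.
fd_usage() {
    pid=$1
    # Each entry in /proc/<pid>/fd is one open descriptor.
    open=$(ls "/proc/$pid/fd" | wc -l | tr -d ' ')
    # Field 4 of the "Max open files" row is the soft limit.
    soft=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "$open of $soft descriptors in use"
}
```

For the daemon above this would be called as `fd_usage 30492`.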

I tried to investigate which files are open:

Code:

[root@server ~]$ lsof -a -p 30492
COMMAND   PID   USER   FD   TYPE             DEVICE SIZE/OFF       NODE NAME
nsca_uk 30492 nagios  cwd    DIR                8,0     4096          2 /
nsca_uk 30492 nagios  rtd    DIR                8,0     4096          2 /
nsca_uk 30492 nagios  txt    REG                8,0    51464      31798 /usr/sbin/nsca
nsca_uk 30492 nagios  mem    REG                8,0    61624       4665 /usr/lib64/libnss_files-2.17.so
nsca_uk 30492 nagios  mem    REG                8,0  2151672       4647 /usr/lib64/libc-2.17.so
nsca_uk 30492 nagios  mem    REG                8,0   115848       4657 /usr/lib64/libnsl-2.17.so
nsca_uk 30492 nagios  mem    REG                8,0   187952      31794 /usr/lib64/libmcrypt.so.4.4.8
nsca_uk 30492 nagios  mem    REG                8,0   163400       4640 /usr/lib64/ld-2.17.so
nsca_uk 30492 nagios    0r   CHR                1,3      0t0       1061 /dev/null
nsca_uk 30492 nagios    1w   CHR                1,3      0t0       1061 /dev/null
nsca_uk 30492 nagios    2w   CHR                1,3      0t0       1061 /dev/null
nsca_uk 30492 nagios    3u  unix 0x000000007063482c      0t0 1599797206 socket
nsca_uk 30492 nagios    4u  IPv4         1599795103      0t0        TCP *:5660 (LISTEN)
...
nsca_uk 30492 nagios    5u  sock                0,9      0t0 1599814373 protocol: TCP
nsca_uk 30492 nagios    6u  sock                0,9      0t0 1599802987 protocol: TCP
nsca_uk 30492 nagios    7u  sock                0,9      0t0 1599885192 protocol: TCP
nsca_uk 30492 nagios    8u  sock                0,9      0t0 1599800045 protocol: TCP
nsca_uk 30492 nagios    9u  sock                0,9      0t0 1599890841 protocol: TCP
nsca_uk 30492 nagios   10u  sock                0,9      0t0 1599984874 protocol: TCP
...
nsca_uk 30492 nagios 7897u  IPv4         1665232945      0t0        TCP server:5660->li1491-6.members.linode.com:50596 (CLOSE_WAIT)
nsca_uk 30492 nagios 7898u  IPv4         1665239928      0t0        TCP server:5660->li1424-189.members.linode.com:46979 (CLOSE_WAIT)
nsca_uk 30492 nagios 7899u  IPv4         1665230120      0t0        TCP server:5660->li1424-189.members.linode.com:41324 (CLOSE_WAIT)
nsca_uk 30492 nagios 7900u  IPv4         1665247369      0t0        TCP server:5660->li1651-122.members.linode.com:34818 (CLOSE_WAIT)
nsca_uk 30492 nagios 7901u  IPv4         1665242660      0t0        TCP server:5660->li1413-76.members.linode.com:59288 (CLOSE_WAIT)
nsca_uk 30492 nagios 7902u  IPv4         1665256045      0t0        TCP server:5660->i1674-121.members.linode.com:60338 (CLOSE_WAIT)
nsca_uk 30492 nagios 7903u  IPv4         1665247740      0t0        TCP server:5660->li1417-217.members.linode.com:54132 (CLOSE_WAIT)
nsca_uk 30492 nagios 7904u  IPv4         1665248200      0t0        TCP server:5660->li1414-52.members.linode.com:34966 (CLOSE_WAIT)
...

There is a huge number of entries of type sock and IPv4. I think the sock entries are file handles that are wrongly being held open, and the IPv4 ones seem to be as well. The IPv4 entries are in the CLOSE_WAIT state, which I can also see in the graph (see attachment).
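
To put numbers on that impression, the listing can be grouped by its TYPE column. This is only a sketch that reads lsof output on standard input; lsof_type_summary is a hypothetical helper name, and it assumes the column layout shown above, where TYPE is the fifth field:

```shell
# Count lsof rows per TYPE column (DIR, REG, sock, IPv4, ...).
# lsof_type_summary is a hypothetical helper name.
lsof_type_summary() {
    # Skip the header row; TYPE is the fifth whitespace-separated column.
    awk 'NR > 1 { n[$5]++ } END { for (t in n) print n[t], t }' |
        sort -k1,1nr -k2
}
```

It would be used as `lsof -a -p 30492 | lsof_type_summary`, which prints one line per type with the largest count first.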

Let's see what happens by tomorrow, but I am afraid the situation will not change and NSCA will hang at the open-files limit.

If you have any other ideas or things to check, please let me know.

Thank you for your effort.

Post by **tgriep** » Fri May 10, 2019 11:24 am

The only other suggestion I have is to install NSCA from source.
It might be something in how the package you are using was built that is causing the issue.
https://github.com/NagiosEnterprises/nsca

When you check the NSCA server, are the stuck connections coming from a few servers or many servers?
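
One possible way to answer that from the lsof listing is to count CLOSE_WAIT rows per remote host. A minimal sketch, assuming the NAME column has the `local:port->remote:port (STATE)` form shown earlier in this thread; close_wait_by_peer is a hypothetical helper name:

```shell
# Count CLOSE_WAIT connections per remote host from lsof output on stdin.
# close_wait_by_peer is a hypothetical helper name.
close_wait_by_peer() {
    awk '$NF == "(CLOSE_WAIT)" {
        # The second-to-last field looks like local:port->remote:port.
        split($(NF - 1), ends, "->")
        peer = ends[2]
        sub(/:[0-9]+$/, "", peer)   # drop the remote port
        n[peer]++
    }
    END { for (p in n) print n[p], p }' | sort -k1,1nr -k2
}
```

For example, `lsof -nP -a -p 30492 -i TCP:5660 | close_wait_by_peer` would list numeric peer addresses (because of -n and -P) with the busiest one first.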

Post by **lhozzan** » Sat May 11, 2019 2:54 am

Hello.

As I expected, NSCA hung again today with the same problem:

Code:

[root@server ~]$ service nsca_uk status
Redirecting to /bin/systemctl status nsca_uk.service
● nsca_uk.service - NSCA for de cluster
   Loaded: loaded (/etc/systemd/system/nsca_uk.service; enabled; vendor preset: disabled)
   Active: active (running) since Št 2019-05-09 06:44:17 UTC; 2 days ago
 Main PID: 30492 (nsca_uk)
   CGroup: /system.slice/nsca_uk.service
           └─30492 /usr/sbin/nsca_uk -c /etc/nagios/nsca_uk.cfg

may 09 06:44:17 server systemd[1]: Starting NSCA for de cluster...
may 09 06:44:17 server nsca[30492]: Starting up daemon
may 09 06:44:17 server systemd[1]: Started NSCA for de cluster.
may 11 01:29:11 server nsca[30492]: Network server accept failure (24: Too many open files)

Code:

[root@server ~]$ lsof -a -p 30492 | wc -l
15764

Code:

[root@server ~]$ lsof -a -p 30492
COMMAND   PID   USER   FD   TYPE             DEVICE SIZE/OFF       NODE NAME
nsca_uk 30492 nagios  cwd    DIR                8,0     4096          2 /
nsca_uk 30492 nagios  rtd    DIR                8,0     4096          2 /
nsca_uk 30492 nagios  txt    REG                8,0    51464      31798 /usr/sbin/nsca
nsca_uk 30492 nagios  mem    REG                8,0    61624       4665 /usr/lib64/libnss_files-2.17.so
nsca_uk 30492 nagios  mem    REG                8,0  2151672       4647 /usr/lib64/libc-2.17.so
nsca_uk 30492 nagios  mem    REG                8,0   115848       4657 /usr/lib64/libnsl-2.17.so
nsca_uk 30492 nagios  mem    REG                8,0   187952      31794 /usr/lib64/libmcrypt.so.4.4.8
nsca_uk 30492 nagios  mem    REG                8,0   163400       4640 /usr/lib64/ld-2.17.so
nsca_uk 30492 nagios    0r   CHR                1,3      0t0       1061 /dev/null
nsca_uk 30492 nagios    1w   CHR                1,3      0t0       1061 /dev/null
nsca_uk 30492 nagios    2w   CHR                1,3      0t0       1061 /dev/null
nsca_uk 30492 nagios    3u  unix 0x000000007063482c      0t0 1599797206 socket
nsca_uk 30492 nagios    5u  sock                0,9      0t0 1599814373 protocol: TCP
nsca_uk 30492 nagios    6u  sock                0,9      0t0 1599802987 protocol: TCP
nsca_uk 30492 nagios    7u  sock                0,9      0t0 1599885192 protocol: TCP
nsca_uk 30492 nagios    8u  sock                0,9      0t0 1599800045 protocol: TCP
nsca_uk 30492 nagios    9u  sock                0,9      0t0 1599890841 protocol: TCP
nsca_uk 30492 nagios   10u  sock                0,9      0t0 1599984874 protocol: TCP
nsca_uk 30492 nagios   11u  sock                0,9      0t0 1599994314 protocol: TCP
nsca_uk 30492 nagios   12u  sock                0,9      0t0 1600026625 protocol: TCP
...

I can see only a few IPv4-type handles but a huge number of sock-type handles. The problem occurs only when at least one client runs NSCA v2.9.

In the meantime, my colleague tried another solution, NSCA-NG, a fork of NSCA, and it looks very usable. Currently about 200 servers are sending their notifications through it, and no problems have been spotted. I know that this product is not in your portfolio.

Many thanks for your effort. Please consider this case solved and lock this topic. For us, NSCA-NG seems to be a better solution than the newer version of NSCA.

Best regards.

Nagios Support Forum
