NSCA 2.9 client problem
-
- Madmin
- Posts: 9190
- Joined: Thu Oct 30, 2014 9:02 am
Re: NSCA 2.9 client problem
The "Suppressed messages" notice means the system is generating a lot of messages and journald is configured to drop some of them. This is called rate limiting, and it keeps the logging system from being overloaded.
To get all messages for troubleshooting, you need to increase these limits. This can be achieved by setting the variables RateLimitInterval and RateLimitBurst inside the config file /etc/systemd/journald.conf.
To turn off any kind of rate limiting, set either value to 0.
After changing those settings, see if the messages are logged and post them here.
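As a rough sketch of that change (not an official procedure), a small helper like the one below could set the two options. The function name `set_journald_option` is made up for this example, and it assumes GNU sed and that `[Journal]` is the only section in the file.

```shell
# Set KEY=VALUE in a journald.conf-style file.
# An existing or commented-out line for KEY is rewritten in place,
# otherwise a new line is appended at the end of the file.
# Assumptions: GNU sed, and [Journal] is the only section in the file.
set_journald_option() {
  file=$1; key=$2; value=$3
  if grep -Eq "^[#[:space:]]*${key}=" "$file"; then
    sed -i -E "s|^[#[:space:]]*${key}=.*|${key}=${value}|" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}
```

Run as root, it would be called as `set_journald_option /etc/systemd/journald.conf RateLimitInterval 0` and again for `RateLimitBurst`, followed by `systemctl restart systemd-journald`. Newer systemd releases may document the interval option under a slightly different name, so check `man journald.conf` on your system first.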
Be sure to check out our Knowledgebase for helpful articles and solutions!
-
- Posts: 21
- Joined: Wed Mar 20, 2019 10:43 am
Re: NSCA 2.9 client problem
Hello.
Thank you for the advice. I tried it, but nothing strange was logged.
Of course, the output below is filtered with grep, as before.
Code: Select all
Apr 17 12:43:21 localhost nsca[26319]: Caught SIGTERM - shutting down...
Apr 17 12:43:21 localhost systemd[1]: Stopping NSCA for uk cluster...
Apr 17 12:43:21 localhost nsca[26319]: Cannot remove pidfile '/var/run/nsca_uk.pid' - check your privileges.
Apr 17 12:43:21 localhost nsca[26319]: Daemon shutdown
Apr 17 12:43:21 localhost systemd[1]: Stopped NSCA for uk cluster.
Apr 17 12:43:21 localhost systemd[1]: Starting NSCA for uk cluster...
Apr 17 12:43:21 localhost systemd[1]: Started NSCA for uk cluster.
Apr 17 12:43:21 localhost nsca[19077]: Starting up daemon
Apr 17 12:43:43 localhost nagios: job 6192 (pid=19268): read() returned error 11
Apr 17 12:43:54 localhost nagios: job 6192 (pid=19364): read() returned error 11
Apr 17 12:48:43 localhost nagios: job 6201 (pid=21905): read() returned error 11
Apr 17 12:48:53 localhost nagios: job 6201 (pid=21990): read() returned error 11
Apr 17 12:48:54 localhost nagios: job 6201 (pid=22004): read() returned error 11
Apr 17 12:48:57 localhost nagios: job 6201 (pid=22039): read() returned error 11
Apr 17 12:50:01 localhost systemd[1]: Started Session 383 of user root.
Apr 17 12:57:05 localhost nagios: job 6215 (pid=26319): read() returned error 11
Apr 17 12:57:55 localhost nsca[19077]: Caught SIGTERM - shutting down...
Apr 17 12:57:55 localhost systemd[1]: Stopping NSCA for uk cluster...
Apr 17 12:57:55 localhost nsca[19077]: Cannot remove pidfile '/var/run/nsca_uk.pid' - check your privileges.
Apr 17 12:57:55 localhost nsca[19077]: Daemon shutdown
Apr 17 12:57:55 localhost systemd[1]: Stopped NSCA for uk cluster.
Do you have any idea what to check next to find a working solution?
Thank you for your effort.
-
- Madmin
- Posts: 9190
- Joined: Thu Oct 30, 2014 9:02 am
Re: NSCA 2.9 client problem
Apr 17 12:43:21 localhost nsca[26319]: Cannot remove pidfile '/var/run/nsca_uk.pid' - check your privileges.
Check the permissions of where the NSCA PID file is created.
Other than that, the logs don't show much other than the daemon starting and stopping.
Question: did you go back to running the NSCA server as a daemon, or did you leave it running under xinetd?
Do this: when there are stuck connections on the Nagios server, note the IP addresses.
Go to the remote systems at those IP addresses and see if the send_nsca application is still running and holding the connection open.
If so, stop it and see if the connection is closed on the Nagios server.
-
- Posts: 21
- Joined: Wed Mar 20, 2019 10:43 am
Re: NSCA 2.9 client problem
Hello.
tgriep wrote: Check the permissions of where the NSCA PID file is created.
Code: Select all
-rw-r--r-- 1 nagios nagios 5 apr 15 07:07 nsca_uk.pid
I think we can ignore this warning. It only relates to shutting down: when a shutdown occurs, the PID file persists, but when the process starts again, the correct PID is written to it. If you wish, I can change the unit file and move the PID file to another location.
tgriep wrote: Other than that, the logs don't show much other than the daemon starting and stopping.
Unfortunately, yes. I have no idea what is wrong, why so many CLOSE_WAIT connections stay open, or what to try next.
tgriep wrote: Question, did you go back to running the NSCA server as a daemon or left it to run out of xinetd?
Yes, it is running as a daemon directly under systemd. Running it under xinetd cost a huge amount of CPU power.
tgriep wrote: Do this, when there are stuck connections on the Nagios server, note the IP addresses. Go to the remote systems at those IP addresses and see if the send_nsca application is still running and holding open the connection. If so, stop it from running and see if the connection is closed on the Nagios server.
So, you want to let NSCA take all possible connections and then investigate whether some connections are being held open on the client side? Just a question to clarify.
Thank you for your effort.
-
- Madmin
- Posts: 9190
- Joined: Thu Oct 30, 2014 9:02 am
Re: NSCA 2.9 client problem
To answer your question about letting NSCA take all possible connections and then investigating held connections on the client side: yes.
Set up a client with NSCA 2.9.2 and see if that server causes the issue to happen; if so, check the client's log files to see if there are any errors there.
-
- Posts: 21
- Joined: Wed Mar 20, 2019 10:43 am
Re: NSCA 2.9 client problem
Hello.
My apologies for the delay, I was on vacation.
tgriep wrote: Set up a client with NSCA 2.9.2 and see if that server causes the issue to happen; if so, check the client's log files to see if there are any errors there.
As I expected, this NSCA instance failed. Strangely, it failed not on connection problems (see my attachment) but on the open files limit, as the service status and process limits below show. On the client side, the collectd logs show that connections are refused. So I will raise this limit again (to 16000) and see what happens. Or do you have any other suggestions?
Code: Select all
[operator@server ~]$ service nsca_uk status
Redirecting to /bin/systemctl status nsca_uk.service
● nsca_uk.service - NSCA for uk cluster
Loaded: loaded (/etc/systemd/system/nsca_uk.service; enabled; vendor preset: disabled)
Active: active (running) since Ut 2019-04-30 10:41:54 UTC; 1 weeks 0 days ago
Main PID: 32755 (nsca_uk)
CGroup: /system.slice/nsca_uk.service
└─32755 /usr/sbin/nsca_uk -c /etc/nagios/nsca_uk.cfg
may 08 03:37:01 server nsca[32755]: Network server accept failure (24: Too many open files)
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Code: Select all
[operator@server ~]$ cat /proc/32755/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 31824 31824 processes
Max open files 8000 8000 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 31824 31824 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
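For anyone repeating this check, the soft limit can be pulled out of that `/proc/<pid>/limits` text and compared with the number of descriptors actually open. This is only a sketch; the function name `soft_nofile_limit` is invented for the example.

```shell
# Print the soft "Max open files" limit from /proc/<pid>/limits text on stdin.
# The line looks like: "Max open files  8000  8000  files",
# so the soft limit is the fourth whitespace-separated field.
soft_nofile_limit() {
  awk '/^Max open files/ { print $4 }'
}
```

With the PID from the status output it would be used as `soft_nofile_limit < /proc/32755/limits`, and the current descriptor count can be read with `ls /proc/32755/fd | wc -l` (as root or the nagios user).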
The collectd logs on the client side show that connections are refused:
Code: Select all
may 08 04:26:28 tools_server collectd[30919]: Connection refused by host
may 08 04:26:28 tools_server collectd[30919]: Error: Could not connect to host 45.33.80.18 on port 5660
may 08 04:26:28 tools_server collectd[30919]: Connection refused by host
may 08 04:26:28 tools_server collectd[30919]: Error: Could not connect to host 45.33.80.18 on port 5660
may 08 04:26:28 tools_server collectd[30919]: Connection refused by host
may 08 04:26:28 tools_server collectd[30919]: Error: Could not connect to host 45.33.80.18 on port 5660
Thank you for your effort.
-
- Madmin
- Posts: 9190
- Joined: Thu Oct 30, 2014 9:02 am
Re: NSCA 2.9 client problem
No, increasing the open files limit is what I would have suggested.
Let us know if this helps.
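Since the service in this thread runs directly under systemd, one way to raise the limit is a unit drop-in with `LimitNOFILE`. The helper below is a minimal sketch under that assumption; the function name `write_nofile_dropin` and the file name `limits.conf` are arbitrary choices for the example, not something NSCA ships.

```shell
# Write a systemd drop-in that sets LimitNOFILE for one unit.
# $1 = systemd config root (normally /etc/systemd/system)
# $2 = unit name, e.g. nsca_uk.service
# $3 = new open files limit, e.g. 16000
write_nofile_dropin() {
  root=$1; unit=$2; limit=$3
  dir="$root/$unit.d"
  mkdir -p "$dir" || return 1
  printf '[Service]\nLimitNOFILE=%s\n' "$limit" > "$dir/limits.conf"
}
```

As root this would be `write_nofile_dropin /etc/systemd/system nsca_uk.service 16000`, then `systemctl daemon-reload` and `systemctl restart nsca_uk.service`. The new value should then show up under "Max open files" in `/proc/<pid>/limits` of the restarted process.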
-
- Posts: 21
- Joined: Wed Mar 20, 2019 10:43 am
Re: NSCA 2.9 client problem
Hello.
Today I checked the situation, and it seems the problem is somewhere in NSCA server 2.9 when it handles connections from NSCA client 2.9. After I raised the open files limit, the situation looks as shown in the status output below: NSCA has again reached the previous limit of 8000.
I investigated which files are open (see the lsof listing below). There is a huge number of sock and IPv4 type entries. I think the sock entries are file handles that are wrongly held open, and it seems the IPv4 ones are too. The IPv4 entries are in the CLOSE_WAIT state, which I can also see in the graph (see attachment).
Code: Select all
[root@server ~]$ service nsca_uk status
Redirecting to /bin/systemctl status nsca_uk.service
● nsca_uk.service - NSCA for de cluster
Loaded: loaded (/etc/systemd/system/nsca_uk.service; enabled; vendor preset: disabled)
Active: active (running) since Št 2019-05-09 06:44:17 UTC; 23h ago
Main PID: 30492 (nsca_uk)
CGroup: /system.slice/nsca_uk.service
└─30492 /usr/sbin/nsca_uk -c /etc/nagios/nsca_uk.cfg
may 09 06:44:17 server systemd[1]: Starting NSCA for de cluster...
may 09 06:44:17 server nsca[30492]: Starting up daemon
may 09 06:44:17 server systemd[1]: Started NSCA for de cluster.
[root@server ~]$ lsof -a -p 30492 | wc -l
8097
I tried to find out which files are open:
Code: Select all
[root@server ~]$ lsof -a -p 30492
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
nsca_uk 30492 nagios cwd DIR 8,0 4096 2 /
nsca_uk 30492 nagios rtd DIR 8,0 4096 2 /
nsca_uk 30492 nagios txt REG 8,0 51464 31798 /usr/sbin/nsca
nsca_uk 30492 nagios mem REG 8,0 61624 4665 /usr/lib64/libnss_files-2.17.so
nsca_uk 30492 nagios mem REG 8,0 2151672 4647 /usr/lib64/libc-2.17.so
nsca_uk 30492 nagios mem REG 8,0 115848 4657 /usr/lib64/libnsl-2.17.so
nsca_uk 30492 nagios mem REG 8,0 187952 31794 /usr/lib64/libmcrypt.so.4.4.8
nsca_uk 30492 nagios mem REG 8,0 163400 4640 /usr/lib64/ld-2.17.so
nsca_uk 30492 nagios 0r CHR 1,3 0t0 1061 /dev/null
nsca_uk 30492 nagios 1w CHR 1,3 0t0 1061 /dev/null
nsca_uk 30492 nagios 2w CHR 1,3 0t0 1061 /dev/null
nsca_uk 30492 nagios 3u unix 0x000000007063482c 0t0 1599797206 socket
nsca_uk 30492 nagios 4u IPv4 1599795103 0t0 TCP *:5660 (LISTEN)
...
nsca_uk 30492 nagios 5u sock 0,9 0t0 1599814373 protocol: TCP
nsca_uk 30492 nagios 6u sock 0,9 0t0 1599802987 protocol: TCP
nsca_uk 30492 nagios 7u sock 0,9 0t0 1599885192 protocol: TCP
nsca_uk 30492 nagios 8u sock 0,9 0t0 1599800045 protocol: TCP
nsca_uk 30492 nagios 9u sock 0,9 0t0 1599890841 protocol: TCP
nsca_uk 30492 nagios 10u sock 0,9 0t0 1599984874 protocol: TCP
...
nsca_uk 30492 nagios 7897u IPv4 1665232945 0t0 TCP server:5660->li1491-6.members.linode.com:50596 (CLOSE_WAIT)
nsca_uk 30492 nagios 7898u IPv4 1665239928 0t0 TCP server:5660->li1424-189.members.linode.com:46979 (CLOSE_WAIT)
nsca_uk 30492 nagios 7899u IPv4 1665230120 0t0 TCP server:5660->li1424-189.members.linode.com:41324 (CLOSE_WAIT)
nsca_uk 30492 nagios 7900u IPv4 1665247369 0t0 TCP server:5660->li1651-122.members.linode.com:34818 (CLOSE_WAIT)
nsca_uk 30492 nagios 7901u IPv4 1665242660 0t0 TCP server:5660->li1413-76.members.linode.com:59288 (CLOSE_WAIT)
nsca_uk 30492 nagios 7902u IPv4 1665256045 0t0 TCP server:5660->i1674-121.members.linode.com:60338 (CLOSE_WAIT)
nsca_uk 30492 nagios 7903u IPv4 1665247740 0t0 TCP server:5660->li1417-217.members.linode.com:54132 (CLOSE_WAIT)
nsca_uk 30492 nagios 7904u IPv4 1665248200 0t0 TCP server:5660->li1414-52.members.linode.com:34966 (CLOSE_WAIT)
...
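With thousands of lines, a per-type summary of the lsof output is easier to read than the raw listing. The following is just a sketch; the function name `count_fd_types` is invented here, and it assumes the default lsof column layout shown above, where TYPE is the fifth column.

```shell
# Count open descriptors per TYPE (fifth column) from `lsof -a -p PID`
# output on stdin. The header line is skipped, and so are short lines
# such as the "..." elisions in a pasted listing.
count_fd_types() {
  awk 'NR > 1 && NF >= 5 { n[$5]++ } END { for (t in n) print n[t], t }' | sort -rn
}
```

Usage would be `lsof -a -p 30492 | count_fd_types`, which prints one line per type with the largest count first.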
Let's see what happens by tomorrow, but I am afraid the situation will not change and NSCA will hang on the open files limit.
If you have any other ideas or things to check, please let me know.
Thank you for your effort.
-
- Madmin
- Posts: 9190
- Joined: Thu Oct 30, 2014 9:02 am
Re: NSCA 2.9 client problem
The only other suggestion I have is to install NSCA from source.
It might be something in how the package you are using was built that is causing the issue.
https://github.com/NagiosEnterprises/nsca
When you check the NSCA server, are the stuck connections coming from a few servers or many servers?
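To answer the few-versus-many question from lsof output, the CLOSE_WAIT entries can be grouped by remote host. This is a rough sketch; the function name `close_wait_by_host` is made up, and it assumes the `local:port->remote:port (STATE)` layout that lsof prints for TCP sockets.

```shell
# Count CLOSE_WAIT connections per remote host from lsof output on stdin.
# Expects lines ending in: local:port->remote:port (CLOSE_WAIT)
close_wait_by_host() {
  awk '$NF == "(CLOSE_WAIT)" {
         split($(NF-1), ends, "->")   # ends[1] = local side, ends[2] = remote side
         host = ends[2]
         sub(/:[0-9]+$/, "", host)    # drop the remote port
         n[host]++
       }
       END { for (h in n) print n[h], h }' | sort -rn
}
```

It could be fed with `lsof -nP -a -p 30492 -i TCP | close_wait_by_host`; the `-n` and `-P` options make lsof print numeric addresses and ports, so the result lists IP addresses rather than host names.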
-
- Posts: 21
- Joined: Wed Mar 20, 2019 10:43 am
Re: NSCA 2.9 client problem
Hello.
As I expected, today NSCA hung again on the same problem (output below).
I can see only a few IPv4 type handles, but a huge number of sock type handles. This problem occurs only when at least one client runs NSCA v2.9.
Code: Select all
[root@server ~]$ service nsca_uk status
Redirecting to /bin/systemctl status nsca_uk.service
● nsca_uk.service - NSCA for de cluster
Loaded: loaded (/etc/systemd/system/nsca_uk.service; enabled; vendor preset: disabled)
Active: active (running) since Št 2019-05-09 06:44:17 UTC; 2 days ago
Main PID: 30492 (nsca_uk)
CGroup: /system.slice/nsca_uk.service
└─30492 /usr/sbin/nsca_uk -c /etc/nagios/nsca_uk.cfg
may 09 06:44:17 server systemd[1]: Starting NSCA for de cluster...
may 09 06:44:17 server nsca[30492]: Starting up daemon
may 09 06:44:17 server systemd[1]: Started NSCA for de cluster.
may 11 01:29:11 server nsca[30492]: Network server accept failure (24: Too many open files)
Code: Select all
[root@server ~]$ lsof -a -p 30492 | wc -l
15764
Code: Select all
[root@server ~]$ lsof -a -p 30492
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
nsca_uk 30492 nagios cwd DIR 8,0 4096 2 /
nsca_uk 30492 nagios rtd DIR 8,0 4096 2 /
nsca_uk 30492 nagios txt REG 8,0 51464 31798 /usr/sbin/nsca
nsca_uk 30492 nagios mem REG 8,0 61624 4665 /usr/lib64/libnss_files-2.17.so
nsca_uk 30492 nagios mem REG 8,0 2151672 4647 /usr/lib64/libc-2.17.so
nsca_uk 30492 nagios mem REG 8,0 115848 4657 /usr/lib64/libnsl-2.17.so
nsca_uk 30492 nagios mem REG 8,0 187952 31794 /usr/lib64/libmcrypt.so.4.4.8
nsca_uk 30492 nagios mem REG 8,0 163400 4640 /usr/lib64/ld-2.17.so
nsca_uk 30492 nagios 0r CHR 1,3 0t0 1061 /dev/null
nsca_uk 30492 nagios 1w CHR 1,3 0t0 1061 /dev/null
nsca_uk 30492 nagios 2w CHR 1,3 0t0 1061 /dev/null
nsca_uk 30492 nagios 3u unix 0x000000007063482c 0t0 1599797206 socket
nsca_uk 30492 nagios 5u sock 0,9 0t0 1599814373 protocol: TCP
nsca_uk 30492 nagios 6u sock 0,9 0t0 1599802987 protocol: TCP
nsca_uk 30492 nagios 7u sock 0,9 0t0 1599885192 protocol: TCP
nsca_uk 30492 nagios 8u sock 0,9 0t0 1599800045 protocol: TCP
nsca_uk 30492 nagios 9u sock 0,9 0t0 1599890841 protocol: TCP
nsca_uk 30492 nagios 10u sock 0,9 0t0 1599984874 protocol: TCP
nsca_uk 30492 nagios 11u sock 0,9 0t0 1599994314 protocol: TCP
nsca_uk 30492 nagios 12u sock 0,9 0t0 1600026625 protocol: TCP
...
Meanwhile my colleague tried another solution, NSCA-NG, a fork of NSCA that looks very usable. About 200 servers are currently sending their notifications through it and no problem has been spotted. I know this product isn't in your portfolio.
Many thanks for your effort. Please consider this case solved and lock this topic. NSCA-NG, as a newer take on NSCA, seems to be the better solution for us.
Best regards.