No services for ncpa_listener 2.2.1 thru 2.3.1 on SLES15 SP3

Post by **mike-d-howell** » Tue Jul 27, 2021 5:04 pm

Hello all -

I'm having trouble with ncpa_listener after upgrading my servers and the listener client. The servers were on SLES 15 SP2 before the upgrade. The affected client versions are 2.2.1 thru 2.3.1.

I performed an in-place upgrade from SLES 15 SP2 to SP3 on a server running ncpa 2.2.0. Afterwards, the NagiosXI server and web GUI began reporting an error "Referencing node that does not exist". In the web GUI, you can look at the API page (any endpoint) to see it.

I upgraded the ncpa client to 2.3.1 using the picker on the nagios.com/ncpa page and the command "rpm -Uvh package_name.rpm" listed in the documentation.

Now, under the API page, all endpoints have metrics except Services. Services now only displays:

Code:

{
    "services": {
        "": "stopped",
        "*": "stopped"
    }
}

I upgraded other servers to the SP3 group, and tried ncpa versions between 2.2.0 and 2.3.1. The bug appeared for me in 2.2.1.
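For anyone scripting an inventory of agents, here is a minimal sketch of a version check. The range is an assumption taken only from the testing described in this post (first bad release 2.2.1, last release tried 2.3.1, on SLES 15 SP3); it is not an official list of affected versions, and the function names are hypothetical.

```python
def parse_version(text):
    """Turn '2.3.1' into (2, 3, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

# Assumption: bounds come from this thread's testing only.
FIRST_BAD = parse_version("2.2.1")
LAST_TESTED = parse_version("2.3.1")

def in_reported_range(version):
    """True if the version falls inside the range reported in this thread."""
    return FIRST_BAD <= parse_version(version) <= LAST_TESTED

if __name__ == "__main__":
    for candidate in ("2.2.0", "2.2.1", "2.3.1"):
        print(candidate, in_reported_range(candidate))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes once a version component reaches two digits.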

I opened a github issue #791 (https://github.com/NagiosEnterprises/ncpa/issues/791), but progress seems to have stalled. I am hoping to begin preparing my servers for upgrade before SP2 obsolesces later this year. We really rely on Nagios for monitoring, and I would hate to lose that functionality. Any help is greatly appreciated!

Thanks - Mike

PS: The tech who was troubleshooting on the GitHub issue remarked that I had installed the package for the wrong OS version. I stopped testing on that server and moved to a fresh one to continue. The results are the same.

Post by **vtrac** » Wed Jul 28, 2021 1:25 pm

Hi Mike,
Hope you are having a great Wednesday!! ...

Looks like the issue started after you upgraded the OS.

Was the OS upgraded on the Nagios XI machine or the NCPA agent machine?

Could you please show me the command used? ... and the errors?

You can get the command from the "Run check command" button:
Open the Nagios XI GUI > Configure > Core Config Manager > Services
Select the service that you are having an issue with > click the "Run check command" button.

Also, please upload the "ncpa.cfg" file as well.

If you could share a screenshot of the error displayed in the Nagios XI services page, that would help.

What version of Python is used on both your Nagios XI server and the NCPA remote agent?

Best Regards,
Vinh

Post by **mike-d-howell** » Fri Jul 30, 2021 11:41 am

Hi Vinh,
Thanks for the reply. I've tried to answer parenthetically below...

Looks like the issue started after you upgraded the OS.
- Yes. After upgrading one of my SLES 15 SP2 servers to SP3, the 2.2.0 agent was reporting node does not exist.
- I then upgraded the agent to 2.3.1 and although the first error went away, the issue of no services listing appeared.
- In testing, I took a different SLES 15 SP2, upgraded it to SP3, and then installed NCPA 2.3.1 directly. The no-services behavior persists.

Was the OS upgraded on the Nagios XI machine or the NCPA agent machine?
- The upgrade was on a server that is monitored by NCPA agent. The Nagios server is the CentOS appliance 5.8.4.

Could you please show me the command used? ... and the errors?
-- On the first server: rpm -Uvh https://assets.nagios.com/downloads/ncp ... x86_64.rpm
--- I was told this was the wrong agent, so I started over with a second server...
-- On the second server, I downloaded and ran this: rpm -Uvh ncpa-2.3.1.sle15.x86_64.rpm
-- I also tried a third server, and found the services error seems to begin at ncpa 2.2.1 with sles15 sp3.

Code:

UNKNOWN: No services found for service names: postgresql-10.service

You can get the command from the "Run check command" button:
Open the Nagios XI GUI > Configure > Core Config Manager > Services
Select the service that you are having an issue with > click the "Run check command" button.
-- [nagios@scanner.osufpp.org ~]$ /usr/local/nagios/libexec/check_ncpa.py -H drupshop-db-replica-01.osufpp.org -t 'our-lovely-secret-hash' -P 5693 -M 'services' -q 'service=postgresql-10.service,status=running'
-- UNKNOWN: No services found for service names: postgresql-10.service

Also, please upload the "ncpa.cfg" file as well.

Code:

drupshop-db-rep:/usr/local/ncpa/etc # cat ncpa.cfg
#
#   NCPA Main Config File
#   ---------------------
#

#
# -------------------------------
# General Configuration
# -------------------------------
#

[general]

#
# Check logging (in ncpa.db and the interface) is on by default, you can disable it
# if you do not want to record the check requests that are coming in or checks being
# sent over NRDP.
# Default: check_logging = 1
#
check_logging = 1

#
# Check logging time - how long in DAYS you'd like to keep checks in the database.
# Default: 30
#
check_logging_time = 30

#
# Display all mounted disk partitions
# (essentially setting all=True here: https://psutil.readthedocs.io/en/latest/#psutil.disk_partitions)
# Default: 1
#
all_partitions = 1

#
# Excluded file system types removes these fs types from the disk metrics
# (This is mostly only noteable on UNIX systems but also works on Windows if you need it)
# Default: aufs,autofs,binfmt_misc,cifs,cgroup,configfs,debugfs,devpts,devtmpfs,
#          encryptfs,efivarfs,fuse,fusectl,hugetlbfs,mqueue,nfs,overlayfs,proc,pstore,
#          rpc_pipefs,securityfs,selinuxfs,smb,sysfs,tmpfs,tracefs
#
exclude_fs_types = aufs,autofs,binfmt_misc,cifs,cgroup,configfs,debugfs,devpts,devtmpfs,encryptfs,efivarfs,fuse,fusectl,hugetlbfs,mqueue,nfs,overlayfs,proc,pstore,rpc_pipefs,securityfs,selinuxfs,smb,sysfs,tmpfs,tracefs

#
# The default unit to convert bytes (B) into if no unit is specified
# (Gi = 1024 MiB, G = 1000 MB)
#
default_units = Gi

#
# -------------------------------
# Listener Configuration (daemon)
# -------------------------------
#

[listener]

#
# User and group to run plugins as (recommended to use nagios:nagios)
# Default: uid = nagios
# Default: gid = nagios
#
# ** Note - The daemon runs as root, but forks a child process when running a plugin
#    that is defined by the user, for security reasons. However, without the main daemon
#    running as root, much of the system information would be missing. This is typical behavior. **
#
# This is for Unix only (Linux, Mac OS X, etc)
#
uid = root
#nagios
gid = root
#nagios

#
# IP address and port number for the Listener to use for the web GUI and API
#
# :: allows for dual stack (IPv4 and IPv6 on most linux systems) but will only allow
# for IPv6 connections on Windows
# 0.0.0.0 allows for IPv4 connections only on Windows and most linux systems
#
# Default: ip = ::
# Default (Windows): ip = 0.0.0.0
# Default: port = 5693
#
# ip =
ip=172.16.1.22
# port =

#
# SSL connection and certificate config (if an SSL option is not available on some older
# operating systems it will default back to TLSv1)
# ssl_version options: TLSv1, TLSv1_1, TLSv1_2
#
# ssl_ciphers = <list of ciphers>
#
ssl_version = TLSv1_2
certificate = adhoc

#
# Listener logging file level, location, and the PID location
# Default: loglevel = info (debug, info, warning, error)
# Default: logfile = var/log/ncpa_listener.log
# Default: pidfile = var/run/ncpa_listener.pid (leave listener in pid file name)
#
loglevel = info
logfile = var/log/ncpa_listener.log
pidfile = var/run/ncpa_listener.pid

#
# Delay the listener (API & web GUI) from starting in seconds
# Default: 0
#
# delay_start = 30

#
# Allow admin functionality in the web GUI. When this is set to 0, the admin section will not
# be displayed in the header and will not be available to be accessed.
# Default: 1
#
admin_gui_access = 1

#
# Admin password for the admin section in the web GUI, by default there is no admin
# password and the admin section of the GUI can be accessed by anyone if admin_gui_access is set to 1.
# Default: None
#
# Note: Setting this value to 'None' will automatically log you in, setting it empty will allow you to
# log in using a blank password.
#
admin_password = None

#
# Require admin password to access ALL of the web GUI.
# This does not affect API access via token (community_string).
# Default: 0
#
admin_auth_only = 0

#
# Comma separated list of allowed hosts that can access the API (and GUI)
# Supported types: IPv4, IPv4-mapped IPv6, IPv6, hostnames
# Hostname wildcards are not supported.
#
# Exmaple IPv4: 192.168.23.15
# Example IPv4 subnet: 192.168.0.0/28
# Example IPv4-mapped IPv6: ::ffff:192.168.1.15
# Example IPv6: 2001:0db8:85a3:0000:0000:8a2e:0370:7334
# Example hostname: asterisk.mydomain.com
# Example mixed types: 192.168.23.15, 192.168.0.0/28, ::ffff:192.168.1.15, 2001:0db8:85a3:0000:0000:8a2e:0370:7334, asterisk.mydomain.com
#
# allowed_hosts =
allowed_hosts =172.16.0.0/16,172.20.0.0/16

#
# Number of maximum concurrent connections to the NCPA server.
# Use "None" for unlimited. Default is 200.
# Example: 200
#
# max_connections =

#
# Set the URL to use in the X-Frame-Options and Content-Security-Policy headers
# in order to enable the NCPA GUI to be allowed to load into a frame
# Default: None
# Example: mycoolwebsite.com
# Example: *.mycoolwebsite.com
#
# allowed_sources =

#
# The max size allowed for a log file in megabytes.
# When the log becomes larger than this, the log will be rolled over
# and a new log will be started.
# Default: 5
#
# logmaxmb =

#
# The max number of log rollovers that will be kept.
# Default: 5
#
# logbackups =

#
# -------------------------------
# Listener Configuration (API)
# -------------------------------
#

[api]

#
# The token that will be used to log into the basic web GUI (API browser, graphs, top charts, etc)
# and to authenticate requests to the API and requests through check_ncpa.py
#
community_string = our-lovely-secret-hash

#
# -------------------------------
# Passive Configuration (daemon)
# -------------------------------
#

[passive]

#
# Handlers are a comma separated list of what you would like the passive agent to run
# Default: None
# Options:
#   nrdp, kafkaproducer
#
# Example:
# handlers = nrdp,kafkaproducer
#
handlers = None

#
# User and group to run passive checks as (Recommended to use nagios:nagios)
# Default: uid = nagios
# Default: gid = nagios
#
# This is for Unix only (Linux, Mac OS X, etc)
#
uid = root
#nagios
gid = root
#nagios

#
# Passive check interval - the amount in seconds to wait between each passive check by default,
# this can be overwritten by adding on a "|<duration>" in seconds to the passive check config
# Default: 300 (5 minutes)
#
sleep = 300

#
# Passive logging file level, location, and the PID location
# Default: loglevel = info (debug, info, warning, error)
# Default: logfile = var/log/ncpa_passive.log
# Default: pidfile = var/run/ncpa_passive.pid (leave passive in pid file name)
#
loglevel = info
logfile = var/log/ncpa_passive.log
pidfile = var/run/ncpa_passive.pid

#
# Delay passive checks from starting in seconds
# Default: 0
#
# delay_start = 30

#
# The max size allowed for a log file in megabytes.
# When the log becomes larger than this, the log will be rolled over
# and a new log will be started.
# Default: 5
#
# logmaxmb =

#
# The max number of log rollovers that will be kept.
# Default: 5
#
# logbackups =

#
# -------------------------------
# Passive Configuration (NRDP)
# -------------------------------
#

[nrdp]

#
# Connection settings to the NRDP server
# parent = NRDP server location (ex: http://<address>/nrdp)
# token = NRDP server token used to send NRDP results
#
parent =
token =

#
# The hostname that will replace %HOSTNAME% in the check definitions and will be
# sent to NRDP with the check name as the service description (service name)
#
hostname = NCPA 2

#
# -------------------------------
# Passive Configuration (Kafka)
# -------------------------------
#

[kafkaproducer]

hostname = None
servers = localhost:9092
clientname = NCPA-Kafka
topic = ncpa

#
# -------------------------------
# Plugin Configuration
# -------------------------------
#

[plugin directives]

#
# Plugin path where all plugins will be ran from.
#
plugin_path = plugins/

#
# Follow symlinks located in the plugin path
#
# This is for Unix only (Linux, Mac OS X, etc)
#
follow_symlinks = 0

#
# Plugin execution timeout in seconds. Different than the check_ncpa.py timeout, which is
# normally for network connection issues. Will return a CRITICAL value and error when the plugin
# reaches the defined max execution timeout and kills the process.
# Default: 60
#
# plugin_timeout = 60

#
# Comma separated list of plugins to run through sudo. Note: You will need to update your sudoers
# configuration for these plugins to work when called with sudo.
#
# Example: check_special,check_root_files
# (Command line: sudo /<plugin_absolute_path>/check_special <arguments>)
#
# This is for Unix only (Linux, Mac OS X, etc)
#
# run_with_sudo =

#
# Extensions for plugins
# ----------------------
# The extension for the plugin denotes how NCPA will try to run the plugin. Use this
# for setting how you want to run the plugin in the command line.
#
# NOTE: Plugins without an extension will be ran in the cmdline as follows:
#       $plugin_name $plugin_args
#
# Defaults:
# .sh = /bin/sh $plugin_name $plugin_args
# .py = python $plugin_name $plugin_args
# .ps1 = powershell -ExecutionPolicy Bypass -File $plugin_name $plugin_args
# .vbs = cscript $plugin_name $plugin_args //NoLogo
# .bat = cmd /c $plugin_name $plugin_args
#
# Since windows NCPA is 32-bit, if you need to use 64-bit powershell, try the following for
# the powershell plugin definition:
# .ps1 = c:\windows\sysnative\windowspowershell\v1.0\powershell.exe -ExecutionPolicy Unrestricted -File $plugin_name $plugin_args
#

# Linux / Mac OS X
.sh = /bin/sh $plugin_name $plugin_args
.py = python $plugin_name $plugin_args

# Windows
.ps1 = powershell -ExecutionPolicy Bypass -File $plugin_name $plugin_args
.vbs = cscript $plugin_name $plugin_args //NoLogo
.wsf = cscript $plugin_name $plugin_args //NoLogo
.bat = cmd /c $plugin_name $plugin_args
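When comparing agents, it can help to pull the few listener settings that differ from the defaults out of a file like the one above. The following is a minimal sketch using Python's standard configparser and ipaddress modules. The inline sample copies only the [listener] values shown in this config; the function names and the two test addresses are hypothetical, and the sketch does not model how NCPA itself evaluates allowed_hosts (hostname entries are simply skipped).

```python
import configparser
import ipaddress

# Trimmed copy of the [listener] values from the ncpa.cfg in this post.
SAMPLE = """
[listener]
uid = root
gid = root
ip=172.16.1.22
allowed_hosts =172.16.0.0/16,172.20.0.0/16
"""

def load_listener(text):
    """Parse ncpa.cfg text and return its [listener] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser["listener"]

def host_allowed(listener, address):
    """Check an IP address against the IP/CIDR entries in allowed_hosts."""
    addr = ipaddress.ip_address(address)
    for entry in listener.get("allowed_hosts", "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        try:
            network = ipaddress.ip_network(entry, strict=False)
        except ValueError:
            continue  # not an IP or subnet (e.g. a hostname); skipped here
        if addr in network:
            return True
    return False

if __name__ == "__main__":
    listener = load_listener(SAMPLE)
    print("uid/gid:", listener["uid"], listener["gid"])
    print("bind ip:", listener["ip"])
    # Hypothetical monitoring-server addresses, for illustration only.
    print(host_allowed(listener, "172.16.1.50"))
    print(host_allowed(listener, "10.0.0.1"))
```

In this thread the API does answer requests, so allowed_hosts is unlikely to be the cause; the sketch is only a quick way to rule the listener settings out.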

If you could share a screenshot of the error displayed in the Nagios XI services page, that would help.
I will attach below

What version of Python is used on both your Nagios XI server and the NCPA remote agent?
Nagios XI Server: Python 2.7.5
Server with NCPA agent: 3.6.13

Post by **mike-d-howell** » Fri Jul 30, 2021 11:42 am

Screenshot...

Post by **vtrac** » Fri Jul 30, 2021 1:48 pm

Hi Mike,
How are you doing?

Please try the following command ... change 'service=postgresql-10.service,status=running' to 'service=postgresql-10,status=running'
From:

Code:

/usr/local/nagios/libexec/check_ncpa.py -H drupshop-db-replica-01.osufpp.org -t 'our-lovely-secret-hash' -P 5693 -M 'services' -q 'service=postgresql-10.service,status=running'

To:

Code:

/usr/local/nagios/libexec/check_ncpa.py -H drupshop-db-replica-01.osufpp.org -t 'our-lovely-secret-hash' -P 5693 -M 'services' -q 'service=postgresql-10,status=running'
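To avoid retyping the two variants by hand, here is a minimal sketch that generates both command lines, with and without the ".service" suffix. It assumes Python 3.8 or later on whatever machine prints the commands (shlex.join is not available on older interpreters), and the host name and token in the usage loop are placeholders, not values from this thread. Only the flags already shown above (-H, -t, -P, -M, -q) are used.

```python
import shlex

CHECK = "/usr/local/nagios/libexec/check_ncpa.py"

def candidate_names(unit):
    """Return the unit name as given, plus the variant without '.service'."""
    names = [unit]
    if unit.endswith(".service"):
        names.append(unit[: -len(".service")])
    return names

def check_argv(host, token, name, port=5693):
    """Build the argument list for a 'services' check on one service name."""
    return [
        CHECK, "-H", host, "-t", token, "-P", str(port),
        "-M", "services", "-q", f"service={name},status=running",
    ]

if __name__ == "__main__":
    # Placeholder host and token; substitute your own.
    for name in candidate_names("postgresql-10.service"):
        print(shlex.join(check_argv("agent.example.org", "yourNCPAtoken", name)))
```

Building the arguments as a list and quoting them with shlex.join keeps a token with shell-special characters from being mangled when the command is pasted.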

You can also run the below command on your Nagios XI command prompt, which will display all "services" running on your remote NCPA agent:
NOTE: please replace "x.x.x.x" with the remote NCPA agent IP address
and "yourNCPAtoken" with the NCPA token defined in the "ncpa.cfg" file.

Code:

curl -k 'https://x.x.x.x:5693/api/services?token=yourNCPAtoken'
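If the token is long or contains characters such as "&" or spaces, typing it into the URL by hand is error-prone. A minimal sketch, assuming Python 3 is available, that builds the same URL with the token percent-encoded; the function name is hypothetical, and the host and token are the same placeholders as in the curl command above.

```python
from urllib.parse import quote, urlencode

def ncpa_api_url(host, token, endpoint="services", port=5693):
    """Build an NCPA API URL with the token safely percent-encoded."""
    query = urlencode({"token": token}, quote_via=quote)
    return f"https://{host}:{port}/api/{endpoint}?{query}"

if __name__ == "__main__":
    # Placeholders; replace with the agent address and community_string.
    print(ncpa_api_url("x.x.x.x", "yourNCPAtoken"))
```

The printed URL can be passed to curl -k inside single quotes exactly as in the command above.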

Best Regards,
Vinh

Post by **mike-d-howell** » Mon Aug 02, 2021 9:39 am

Hi Vinh,

Results of the first change in XI monitoring string, against the tester server: "UNKNOWN: No services found for service names: postgresql-10"

When I put in the curl command on the XI CLI, I kept getting an incorrect credentials message (it's quite a long string and hard to type, so I'm probably fat-fingering it). So I tried in a browser instead, and what I get is this:

Code:

{
    "services": {
        "": "stopped", 
        "*": "stopped"
    }
}

This matches what I get when I log into the agent on the target server and look at Services. I think this is really a problem in the NCPA agent. Is there a way to escalate?

Post by **vtrac** » Mon Aug 02, 2021 2:47 pm

Hi Mike,
What is the output of the below command:

Code:

/usr/local/nagios/libexec/check_ncpa.py -H drupshop-db-replica-01.osufpp.org -t 'our-lovely-secret-hash' -P 5693 -M 'services' -q 'service=postgresql-10,status=running'

You can cut and paste the token listed inside the "/usr/local/ncpa/etc/ncpa.cfg" file.

Also, please share the output of the following "curl" command.
This command will list out ALL of the running services on your remote NCPA agent machine.
Once we get the outputs, we will find out how to adjust the "check_ncpa.py" command.
Please replace "x.x.x.x" with the IP of your NCPA remote machine.
Also, please replace "yourNCPAtoken" with your NCPA "community_string" value inside the "/usr/local/ncpa/etc/ncpa.cfg" file.

Code:

curl -k 'https://x.x.x.x:5693/api/services?token=yourNCPAtoken'

Please post the outputs of the above two commands. One for "check_ncpa.py" and one for the "curl" command.

Best Regards,
Vinh

Post by **mike-d-howell** » Tue Aug 03, 2021 9:58 am

Hi Vinh,

Okay - I'll do it again... (This time I simplified my token so I can run these commands from the CLI without fat-fingering it. FYI, this does not solve the problem. I have dozens of SLES15 SP2 servers using the original token just fine.)

I manually ran your command from the NagiosXI CLI

Code:

/usr/local/nagios/libexec/check_ncpa.py -H drupshop-db-replica-01.osufpp.org -t 'Yes-I-put-in-the-real-token' -P 5693 -M 'services' -q 'service=postgresql-10.service,status=running'

RESULT:

Code:

"UNKNOWN: No services found for service names: postgresql-10"

I then ran your Curl command from the NagiosXI CLI

Code:

curl -k 'https://IP.OF.MY.SERVER:5693/api/services?token=Yes-I-put-in-the-real-token'

RESULT:

Code:

{
    "services": {
        "": "stopped",
        "*": "stopped"
    }
}
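A response like this can be flagged automatically when checking many agents. The following is a minimal sketch that parses the JSON and drops the empty and "*" entries. Treating those two keys as placeholders rather than real services is an assumption based only on the output in this thread, and the function name is hypothetical.

```python
import json

# The exact response body shown above.
RESPONSE = """{
    "services": {
        "": "stopped",
        "*": "stopped"
    }
}"""

def real_services(payload):
    """Return services from an /api/services response, minus placeholder keys."""
    services = json.loads(payload).get("services", {})
    # Assumption: "" and "*" are not real service names.
    return {
        name: state
        for name, state in services.items()
        if name.strip() not in ("", "*")
    }

if __name__ == "__main__":
    found = real_services(RESPONSE)
    if not found:
        print("agent returned no real services")
    else:
        print(found)
```

An empty result here points at the agent's service enumeration rather than at the check_ncpa.py query string.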

When I log into the Web GUI agent on the target server and look at Services, I get the exact same message. NagiosXI doesn't seem to be the source of the problem--I think this is really a problem in the NCPA agent. Is there a way to escalate?

Post by **mike-d-howell** » Tue Aug 03, 2021 10:03 am

Snap of XI commands on CLI

Post by **vtrac** » Tue Aug 03, 2021 4:18 pm

Hi Mike,
Interesting, usually the "curl" command will list all the services for that NCPA remote agent.

How about if you run the "curl" command (below) without "services":

Code:

curl -k -v 'https://IP.OF.MY.SERVER:5693/api/?token=Yes-I-put-in-the-real-token'

If it works, the above command will list everything: disk, CPU, memory, and more.

Please try it and post the results (a screenshot, if possible).

I have read the comments on your GitHub issue page (below), and it looks like changing the UID to "root" is still not working.
https://github.com/NagiosEnterprises/ncpa/issues/791

I messaged our development team on Slack, and here is what I got:

It is likely a known issue for NCPA, the way it grabs info on services can be a problem on certain systems (either with permissions or with the fact it just can't do it the way it was set to do it),
but it may also be an issue with how NCPA grabs the data, and if that's the case then they'd have to wait for the bug to be fixed

I'm very sorry, but since you said the issue started at v2.2.1, would it be possible for you to downgrade to v2.2.0 for now?

Best Regards,
Vinh
