Check results from same check command but different service.

Post by **Box293** » Wed May 19, 2010 6:46 am

Something else I have discovered is that performance graphs are not always generated successfully when you duplicate an existing service.

For example I have a Windows disk usage check for D:
I duplicate this service and then change all the required parameters so it becomes a disk usage check for drive Q:
I apply the configuration and wait for a graph to appear
When I look at the graph it does not display any data and has nan where values should be.

See the screenshot; it shows this behaviour for a disk usage test. The same also occurred for an avg disk bytes write check.

Examples of duplicated services.png

To resolve the problem I need to delete the relevant .rrd and .xml files. This does not always work first time and I may need to delete the files a couple more times before the graphs start working properly. Actually as I've been testing this I've deleted these files about 8 times without getting it to work.

This problem occurs probably 50% of the time. Sometimes a duplicated service will produce a good graph immediately, other times it doesn't.

I also tried stopping the performance grapher service, restarting the monitoring engine, deleting the files and then starting the performance grapher service. This didn't help either.

I am unable to determine the cause of the problem. I hope there is enough information here to help you reproduce it. I am using the 1.2 dev release.

Post by **tonyyarusso** » Thu May 20, 2010 3:39 pm

Not that it helps much, but I'm assuming "nan" is "Not A Number", which leads me to believe some other character/string is being thrown in along with the values. Now just to find it...

Post by **Box293** » Thu May 20, 2010 11:13 pm

Makes sense.

I don't know how to correctly view an .rrd file but the .xml file does have some information.

This is the .xml contents of a graph that displays properly

Code: Select all

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<NAGIOS>
  <DATASOURCE>
    <TEMPLATE>check_xi_service_nsclient_alt</TEMPLATE>
    <DS>1</DS>
    <NAME>2_M__Average_Disk_Bytes_Write_is_%.f</NAME>
    <UNIT>%%</UNIT>
    <ACT>0.000000</ACT>
    <WARN>30.000000</WARN>
    <WARN_MIN></WARN_MIN>
    <WARN_MAX></WARN_MAX>
    <WARN_RANGE_TYPE></WARN_RANGE_TYPE>
    <CRIT>50.000000</CRIT>
    <CRIT_MIN></CRIT_MIN>
    <CRIT_MAX></CRIT_MAX>
    <CRIT_RANGE_TYPE></CRIT_RANGE_TYPE>
    <MIN></MIN>
    <MAX></MAX>
  </DATASOURCE>
  <RRD>
    <RC>0</RC>
    <TXT>successful updated</TXT>
  </RRD>
  <NAGIOS_DATATYPE>SERVICEPERFDATA</NAGIOS_DATATYPE>
  <NAGIOS_HOSTNAME>2100-vault06</NAGIOS_HOSTNAME>
  <NAGIOS_HOSTSTATE>UP</NAGIOS_HOSTSTATE>
  <NAGIOS_HOSTSTATETYPE>HARD</NAGIOS_HOSTSTATETYPE>
  <NAGIOS_SERVICECHECKCOMMAND>check_xi_service_nsclient_alt!!COUNTER!-l "\\PhysicalDisk(2 M:)\\Avg. Disk Bytes/Write","2 M: Average Disk Bytes Write is %.f "!-w 30!-c 50!!!</NAGIOS_SERVICECHECKCOMMAND>
  <NAGIOS_SERVICEDESC>Disk 2 M: - Average Disk Bytes Write</NAGIOS_SERVICEDESC>
  <NAGIOS_SERVICEPERFDATA>2 M: Average Disk Bytes Write is %.f =0.000000%;30.000000;50.000000;</NAGIOS_SERVICEPERFDATA>
  <NAGIOS_SERVICESTATE>OK</NAGIOS_SERVICESTATE>
  <NAGIOS_SERVICESTATETYPE>HARD</NAGIOS_SERVICESTATETYPE>
  <NAGIOS_TIMET>1274413132</NAGIOS_TIMET>
</NAGIOS>

And this is the .xml contents of a graph that displays nan

Code: Select all

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<NAGIOS>
  <DATASOURCE>
    <TEMPLATE>check_xi_service_nsclient_alt</TEMPLATE>
    <DS>1</DS>
    <NAME>20_E__Average_Disk_Bytes_Write_is_%.f</NAME>
    <UNIT>%%</UNIT>
    <ACT>0.000000</ACT>
    <WARN>30.000000</WARN>
    <WARN_MIN></WARN_MIN>
    <WARN_MAX></WARN_MAX>
    <WARN_RANGE_TYPE></WARN_RANGE_TYPE>
    <CRIT>50.000000</CRIT>
    <CRIT_MIN></CRIT_MIN>
    <CRIT_MAX></CRIT_MAX>
    <CRIT_RANGE_TYPE></CRIT_RANGE_TYPE>
    <MIN></MIN>
    <MAX></MAX>
  </DATASOURCE>
  <RRD>
    <RC>1</RC>
    <TXT>/usr/local/nagios/share/perfdata/2100-vault06/Disk_20_E__-_Average_Disk_Bytes_Write.rrd: not a simple integer: '0.000000'</TXT>
  </RRD>
  <NAGIOS_DATATYPE>SERVICEPERFDATA</NAGIOS_DATATYPE>
  <NAGIOS_HOSTNAME>2100-vault06</NAGIOS_HOSTNAME>
  <NAGIOS_HOSTSTATE>UP</NAGIOS_HOSTSTATE>
  <NAGIOS_HOSTSTATETYPE>HARD</NAGIOS_HOSTSTATETYPE>
  <NAGIOS_SERVICECHECKCOMMAND>check_xi_service_nsclient_alt!!COUNTER!-l "\\PhysicalDisk(20 E:)\\Avg. Disk Bytes/Write","20 E: Average Disk Bytes Write is %.f "!-w 30!-c 50!!!</NAGIOS_SERVICECHECKCOMMAND>
  <NAGIOS_SERVICEDESC>Disk 20 E: - Average Disk Bytes Write</NAGIOS_SERVICEDESC>
  <NAGIOS_SERVICEPERFDATA>20 E: Average Disk Bytes Write is %.f =0.000000%;30.000000;50.000000;</NAGIOS_SERVICEPERFDATA>
  <NAGIOS_SERVICESTATE>OK</NAGIOS_SERVICESTATE>
  <NAGIOS_SERVICESTATETYPE>HARD</NAGIOS_SERVICESTATETYPE>
  <NAGIOS_TIMET>1274413392</NAGIOS_TIMET>
</NAGIOS>

What stands out is the following:

Code: Select all

  <RRD>
    <RC>1</RC>
    <TXT>/usr/local/nagios/share/perfdata/2100-vault06/Disk_20_E__-_Average_Disk_Bytes_Write.rrd: not a simple integer: '0.000000'</TXT>
  </RRD>

I change the check command to use %.2f, i.e. "Average Disk Bytes Write is %.2f ".
I apply the configuration and then force it to perform an immediate check.
Then when I check the .xml file I get:

Code: Select all

  <RRD>
    <RC>0</RC>
    <TXT>successful updated</TXT>
  </RRD>

However the graphs still display nan.
So I delete the .xml and .rrd files.
I then force it to perform an immediate check.
Then when I check the new .xml file I get:

Code: Select all

  <RRD>
    <RC>1</RC>
    <TXT>/usr/local/nagios/share/perfdata/2100-vault06/Disk_20_E__-_Average_Disk_Bytes_Write.rrd: illegal attempt to update using time 1274414893 when last update time is 1274414913 (minimum one second step)</TXT>
  </RRD>

The graph still displays nan.

I then force it to perform an immediate check.

Code: Select all

  <RRD>
    <RC>1</RC>
    <TXT>/usr/local/nagios/share/perfdata/2100-vault06/Disk_20_E__-_Average_Disk_Bytes_Write.rrd: not a simple integer: '0.000000'</TXT>
  </RRD>

Not sure how helpful this information is.
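For anyone who wants to check many of these files at once, here is a minimal Python sketch that pulls the `<RC>` and `<TXT>` values out of the `<RRD>` element. It only assumes the layout of the .xml files quoted above; the function name `rrd_status` and the trimmed `SAMPLE` string are made up for illustration.

```python
import xml.etree.ElementTree as ET

def rrd_status(xml_text):
    """Return (return_code, message) from the <RRD> element of a perfdata .xml file."""
    root = ET.fromstring(xml_text)
    rc = int(root.findtext("RRD/RC", default="-1"))  # -1 means <RC> was missing
    txt = root.findtext("RRD/TXT", default="")
    return rc, txt

# Trimmed copy of the failing file quoted above (hypothetical test input).
SAMPLE = """<NAGIOS>
  <RRD>
    <RC>1</RC>
    <TXT>/usr/local/nagios/share/perfdata/2100-vault06/Disk_20_E__-_Average_Disk_Bytes_Write.rrd: not a simple integer: '0.000000'</TXT>
  </RRD>
  <NAGIOS_SERVICEDESC>Disk 20 E: - Average Disk Bytes Write</NAGIOS_SERVICEDESC>
</NAGIOS>"""

rc, txt = rrd_status(SAMPLE)
if rc != 0:
    print("RRD update failed:", txt)
```

A return code of 0 matches the "successful updated" case, and anything else carries the error text from the last update attempt.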

Post by **mmestnik** » Fri May 21, 2010 12:45 pm

Box293 wrote:Makes sense.

I don't know how to correctly view an .rrd file...

Not that you'd ever want to, but the tool to use is rrdtool. However, please don't post the full output of rrdtool; we likely don't need to know what your ping times were every 5 minutes for the past week.

This is not the first time that a check command has returned data that rrdtool could not handle. rrdtool only works on integers, so if you want precision you need to change the scale. Most rrdtool users use bits, not kbits or bytes; this makes the graphs look good and the averages even better. The value in bits will always be a multiple (8) of the value in bytes, so the scale is larger than the data. This is good because it provides something the graphics industry makes heavy use of, called oversampling.
You'd draw an 8x10 image at 3 times the size, 24x30; then for every pixel you have 9 color values. This helps with shading and diffusion, as a single pixel can sap information from the pixels around it. So in actuality each pixel has 26 or even 49 colors to use to identify itself more accurately.
In rrd it helps to use a larger scale so that the same effect comes into play when calculating averages, but backwards.

Here are the problems, put simply.
1. Check commands currently don't have standardized options. For example, the range operators on some commands use '/' as a min/max delimiter while others use a ':'. My check commands will use a ':' if anyone is looking for a reference of which to use.
2. Check commands return data non-uniformly. We were working with a check command that used Nagios::Plugin. This module refused to put the "OK"/"CRITICAL"/etc. at the beginning of the output. It insists on a "<Name of check command> <return state as string>..." format. As a reference, my check commands will have output like so: printf("%s: %d %s%s", statusascapitalstring, mostimportantmetricasdiget, explinationofmetricandengluishreply, anyotherdata).
The issue here is that the performance data returned is chaotic at best and has no conception that not only should the data be expressed as an integer, but that the data should be upscaled as previously mentioned.

It should be simple to fix up the output of the check commands; for example, one could write a script that calls the existing check command and massages the data returned. Though altering the check commands to produce correct output should be a trivial task for even a novice programmer. The issue is the time and effort this would take.
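As a rough illustration of the wrapper idea, here is a minimal Python sketch that takes a plugin's output line and rewrites the performance data as scaled whole numbers. It assumes the usual `label=value[unit];warn;crit;min;max` perfdata layout after the `|`; the function name `rescale_perfdata` and the factor of 1000 are hypothetical choices. As far as I know, rrdtool reports "not a simple integer" for integer-only data source types such as COUNTER and DERIVE, so whether this is needed at all depends on how the RRD was created.

```python
import re

# One perfdata item: label=value[unit][;warn[;crit[;min[;max]]]]
_ITEM = re.compile(
    r"(?P<label>'[^']+'|[^\s=]+)="
    r"(?P<value>-?\d+(?:\.\d+)?)"
    r"(?P<unit>[^;\s]*)"
    r"(?P<rest>(?:;[^;\s]*)*)"
)
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def _scale(number_text, factor):
    """Scale a decimal string and round it to a whole number."""
    return str(round(float(number_text) * factor))

def rescale_perfdata(plugin_output, factor=1000):
    """Return plugin_output with numeric perfdata fields multiplied by factor."""
    text, sep, perf = plugin_output.partition("|")
    if not sep:
        return plugin_output  # no perfdata section, nothing to do

    def fix(match):
        # Plain numeric thresholds are scaled; ranges such as 10:20 and
        # empty fields are left unchanged in this sketch.
        fields = [
            _scale(f, factor) if _NUMBER.fullmatch(f) else f
            for f in match.group("rest").split(";")[1:]
        ]
        head = match.group("label") + "=" + _scale(match.group("value"), factor)
        return head + match.group("unit") + "".join(";" + f for f in fields)

    return text + sep + _ITEM.sub(fix, perf)

print(rescale_perfdata("OK: load 0.42 | load=0.42;5.0;10.0;0;"))
```

The unit suffix is passed through untouched, so a wrapper like this would also need to change or drop the unit if the scaled value no longer matches it.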

My recommendation: take a look at the check commands you use and patch them up. If everyone fixed one or two check commands then we would be done in no time. Here is a little getting started guide.

First, always use some revision control system and commit your work often; I do every 5 to 15 minutes. I'll demonstrate with rcs, even though we have switched to brz. I just haven't fully adopted this yet.

Code: Select all

yum -y install rcs
ci -l <filename> <<<"This is the check command for..."
# Edit this file as you wish, using whatever editor you like.
ci -l -m"I added some things and removed that mess." <filename>
# Sometimes things are indented incorrectly or there is a space at the end of a line.
# Make sure to keep these edits and code changes separate.
ci -l -m"White space." <filename>
# Every now and then you forget a '.' or a closing brace.
ci -l -m"Syntax." <filename>
# When you're all done and you want to share your work, please do!
rcsdiff -u -r1.1 <filename> > mychanges.patch.txt

Post by **Box293** » Fri May 21, 2010 7:14 pm

Thanks for all this information.

I think it'll take a while for all of this to absorb into my brain ...........

Post by **mmestnik** » Sat May 22, 2010 10:46 pm

Perhaps someone could simplify. Let's see what others have said.
This is why using values in the range of 0 to 3,200,000,000 or -1,600,000,000 to 1,600,000,000 is a good idea.
http://en.wikipedia.org/wiki/Dynamic_range
Your current values (i.e. 0 to 80) are just too insignificant to graph. The solution is,
http://en.wikipedia.org/wiki/Fixed-point_arithmetic
It's almost like scientific notation, but instead you choose a constant exponent(perhaps even a fractional exponent).

This feeds back into dynamic range.

Force your check commands to output using a better range, even if there is not any greater precision (as in the case of bits vs. bytes, where the value is in bytes and converting to bits doesn't add any more granularity), because the graphs will simply look better (due to the math behind the curtain).
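To make the bits vs. bytes point concrete, here is a small Python sketch of the arithmetic. It assumes a store that keeps only whole numbers, which is an illustration of the argument and not a claim about how rrdtool holds averages internally; the sample values and the helper `integer_mean` are made up.

```python
def integer_mean(values):
    """Mean using whole numbers only, as an integer-only store would keep it."""
    return sum(values) // len(values)

samples_bytes = [0, 1, 1]                       # hypothetical readings in bytes
samples_bits = [v * 8 for v in samples_bytes]   # same readings on a larger scale

mean_bytes = integer_mean(samples_bytes)        # 2 // 3 == 0
mean_bits = integer_mean(samples_bits)          # 16 // 3 == 5

print(mean_bytes, mean_bits / 8)                # 0 0.625
```

The readings are identical, but the average kept on the larger scale converts back to 0.625 bytes, while the average kept in bytes collapses to 0.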

I previously outlined how to do this, but I skipped over what to do.

This is the easy method and allows you to use whatever language you want. It's also the most inefficient and would generally be shunned.
http://www.stonehenge.com/merlyn/UnixReview/col10.html

For this task one may wish to call on the mighty arbitrary precision calculator.
http://en.wikipedia.org/wiki/Bc_programming_language

The only other way I know is to make changes to the application directly.

Post by **Box293** » Sun May 23, 2010 5:02 am

Thank you, this is all very helpful information.

Post by **awatch** » Thu May 27, 2010 10:36 am

I am now having a similar problem, with performance graphs displaying NaN. However, this has happened to all of my performance graphs, seemingly overnight. Looking at the .rrd files in the respective perfdata folders, I could see they had stopped being updated. If I go back through the historical data in the interface, the date they stopped being updated is (obviously) the date the nans began. I checked the npcd service and it was no longer running. I restarted the service and the .rrd files started updating again; however, all of my graphs are still only displaying nan. I have made no configuration changes since the npcd process stopped. Any suggestions would be appreciated.
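A quick way to spot this situation is to look for .rrd files whose modification time is old. Here is a minimal Python sketch; the perfdata path is the one that appears in the error messages earlier in this thread, while the 15 minute threshold and the function name `stale_rrd_files` are assumptions to adjust for your own check intervals.

```python
import os
import time

def stale_rrd_files(perfdata_dir, max_age_seconds=900, now=None):
    """Return sorted (path, age_in_seconds) pairs for .rrd files not updated recently."""
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(perfdata_dir):
        for name in filenames:
            if not name.endswith(".rrd"):
                continue
            path = os.path.join(dirpath, name)
            age = now - os.path.getmtime(path)
            if age > max_age_seconds:
                stale.append((path, age))
    return sorted(stale)

# Path taken from the error messages above; change it if your install differs.
for path, age in stale_rrd_files("/usr/local/nagios/share/perfdata"):
    print(f"{path}: last updated {age / 60:.0f} minutes ago")
```

If every file shows up as stale, the process that writes them has probably stopped, which matches the npcd symptom described above.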

Post by **Box293** » Wed Jun 02, 2010 5:35 am

awatch,
I also am having the nan issue but not with all my services.

This thread may shed some light: http://go.nagios.com/forum/438

I also am at a loss as to how to resolve the problem at the moment.

Post by **Box293** » Tue Jul 13, 2010 11:19 pm

tonyyarusso wrote:Not that it helps much, but I'm assuming "nan" is "Not A Number", which leads me to believe some other character/string is being thrown in along with the values. Now just to find it...

Tony,
Any progress with this issue?

This does not seem to affect just services that were duplicated but also new services I've created.

For example:

I created a host group with the list of hosts I want to perform a check on
Apply Config
I created a service that queries a windows performance counter
Service assigned to the host group I created earlier
Apply Config

I wait 1 hour for enough information to gather.
After 1 hour, some of the hosts have correct graphs and some are displayed with nan.
All hosts have this service correctly working with an Ok state.

I am using the check command check_xi_service_nsclient
I am using the COUNTER -l "\\Terminal Services\\Active Sessions","Terminal Services Active Sessions is %.f"

mmestnik wrote: Your current values (i.e. 0 to 80) are just too insignificant to graph.

This I do not understand.

From the graphs that are being created correctly, their values are showing 0, 1, 2, 6, 15, 16, 17, 18, 19, 27, 30.

And then I have graphs with nan showing and the values are showing 0, 2, 5, 6, 11, 14, 17, 26.

So I don't really understand what is going wrong.

Are you able to replicate this problem in house?

All of the hosts using the terminal server active sessions check work OK. The check returns a correct value and the service state is OK. However without being able to see this information in a pretty graph it kind of makes it hard to see what the historical data is.

Being able to view historical data in graphs is one of the key features we like about Nagios XI.

Nagios Support Forum
