Hi again,
Tonight one of our distributed servers took a break, and after it was up
again, I noticed that everything seemed to have halted at our master
server.
The distributed server is responsible for reporting in ~3500 of our
checks, and we have check_freshness enabled for these at the master
server.
So, what seems to have happened is that the freshness checks start to
time out, and each one triggers a check_dummy that sets an UNKNOWN state.
This particular server is apparently only able to do this at about 1 to 4
checks per second -> unusable platform until things stabilize again...
which will probably never happen, as it will spend too much time doing
service alerts and too little time processing external commands (nsca
passive check results) :-/
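
For reference, the kind of definition involved looks roughly like the
sketch below - host name, threshold and check_dummy arguments are just
placeholders, not our actual config: passive services with freshness
checking enabled, falling back to check_dummy to force UNKNOWN when no
result arrives in time.

define service {
        host_name               remote-host-01
        service_description     Some passive check
        active_checks_enabled   0
        passive_checks_enabled  1
        check_freshness         1
        freshness_threshold     900
        check_command           check_dummy!3!"No passive result received in time"
        # (other directives as usual)
}

define command {
        command_name    check_dummy
        command_line    $USER1$/check_dummy $ARG1$ $ARG2$
}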
We are running with retention data enabled, so we figured the only way
to get out of this situation was to delete retention.dat, to allow for a
"fresh" start from PENDING.
So, in short - _everything_ that involves a shell exit, is not
parallelized and hits more than a few checks at a time appears to break
our setup on this platform as is?
I guess this T1000 platform is rather special, it is a "4 cores, 32
threads" kind of thing - could it be that all of this parallelization
has the exact opposite effect of what we were hoping for? Possibly a
synchronization issue on the process spawn?
I guess our best option is to go x86 asap unless someone can enlighten
us on this issue. If anybody else is running Nagios on similar hardware
but without any issues, please speak up.
Best regards,
Steffen Poulsen
BTW: We solved the performance data issue by having Nagios write it to a
file as suggested and putting a simple Perl tail on it:
#!/usr/bin/perl
use strict;
use warnings;
use File::Tail;
use IO::Socket::INET;

my $debug       = 0;
my $logFileName = "/usr/local/nagios/var/service-perfdata";
my $nagServ     = "11.11.11.11";
my $nagPort     = "5667";

# Follow the perfdata file as Nagios appends to it
my $file = File::Tail->new(name => $logFileName, maxinterval => 1);

while (defined(my $line = $file->read)) {
    print "Received: $line \n" if $debug;

    # Fields are tab-separated; we only care about host, service,
    # plugin output and perfdata
    my ($dummy1, $dummy2, $host, $service, $dummy3, $dummy4, $perf,
        $perfdata) = split(/\t/, $line);

    my $send = "$host\t$service\t$perf\t$perfdata";

    # Ship the result to the collector host via UDP
    my $MySocket = IO::Socket::INET->new(
        PeerPort => $nagPort,
        Proto    => 'udp',
        PeerAddr => $nagServ
    );
    $MySocket->send($send);
    $MySocket->close();
}
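
For completeness, the nagios.cfg side of this is roughly as follows; the
template shown is the sample one from the stock config, which is what the
tab-split above assumes, and the path is just what we use - adjust both
if yours differ:

process_performance_data=1
service_perfdata_file=/usr/local/nagios/var/service-perfdata
service_perfdata_file_mode=a
service_perfdata_file_template=[SERVICEPERFDATA]\t$TIMET$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICEEXECUTIONTIME$\t$SERVICELATENCY$\t$SERVICEOUTPUT$\t$SERVICEPERFDATA$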
> -----Original Message-----
> From: nagios-devel-bounces@lists.sourceforge.net
> [mailto:nagios-devel-bounces@lists.sourceforge.net] On behalf
> of Hendrik Bäcker
> Sent: 27 September 2007 14:30
> To: Nagios Developers List
> Subject: Re: [Nagios-devel] Extremely bad performance when
> enabling process_performance_data on Solaris 10?
>
> Andreas Ericsson wrote:
> >
> > I haven't run into it, but I would solve it with a NEB-module that
> > sends the performance data to a graphing server. It's really quite
> > trivial to do, and a send(2) call generally finishes quickly enough.
> >
>
> Might be wrong, but a long time ago I invested some time in writing a
> NEB mod that would call send_nsca to get rid of the blocking ocsp
> command.
> I found out that even a NEB module is blocking, too.
>
> Just want to say that you should keep an eye on how much time your
> NEB module spends doing something.
>
> But as Andreas said: a send() should be faster than the popen() I
> did in the past.
>
> Just my 2 cents.
>
> Hendrik
>
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: step@tdc.dk