upgrade 2.1.4 => 2.1.6 performance issues

CBoekhuis
Posts: 220
Joined: Tue Aug 16, 2011 4:55 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by CBoekhuis »

Ok, I followed the steps as required. I saved all the output into a logfile, but it's a bit lengthy.
In short, the result is: no change, except for the following.
- The messages in the elasticsearch logfile now complain about the 2000 queue limit being reached.
- The load on the nodes is even higher (which kind of makes sense).

Regarding the high load, I've gone through the usual suspects:
- CPU is high, but not at 100% (approx. 75%), and half of that is I/O wait.
- The disks are reading at > 250MB/s (streaming?), but not at 100%. I'm a bit confused, since I'm querying data from the past hour. Shouldn't that be in memory?
- Network traffic is just a couple of MB/s, so that's not it.
- Memory....I've read that NLS assigns 50% of it to the java process, as defined in /etc/sysconfig/elasticsearch. We have 128GB in each node, and the calculation in the startup file gives 64GB as the answer:

Code: Select all

colog3:root:/etc/sysconfig> $(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )
-bash: 64339: command not found
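(Side note: the "command not found" above is just bash trying to execute the computed number. A minimal sketch to print what the startup script would compute, using the same expression as in /etc/sysconfig/elasticsearch, is:)

Code: Select all

# Print (rather than execute) the heap size the startup script computes:
# half of the total memory in MB, same expression as in /etc/sysconfig/elasticsearch.
echo "$(expr $(free -m|awk '/^Mem:/{print $2}') / 2)m"
# On a 128GB node this prints 64339m, which ends up as -Xms/-Xmx for the java process.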
I know (by now) that 64GB is more than the advised 32GB (learning on the fly). However, when I look at the java process with nmon (or ps aux), it shows that it takes up the full 128GB.

Code: Select all

+nmon-14g---------------------Hostname=colog3-------Refresh= 1secs ---15:06.54----------------------------------------------------------------------------+
| CPU Utilisation -------------------------------------------------------------------------------------------------------------------------------------   |
|---------------------------+-------------------------------------------------+                                                                           |
|CPU  User%  Sys% Wait% Idle|0          |25         |50          |75       100|                                                                           |
|  1  83.2   0.0   0.0  16.8|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU        >                                                                           |
|  2   4.0   1.0   0.0  95.0|U                           >                    |                                                                           |
|  3  35.3   0.0   0.0  64.7|UUUUUUUUUUUUUUUUU                                >                                                                           |
|  4   4.0   1.0   0.0  95.0|U                                                >                                                                           |
|  5   1.0   0.0   0.0  99.0|              >                                  |                                                                           |
|  6   5.9   1.0   0.0  93.1|UU                                               >                                                                           |
|  7   1.0   0.0   0.0  99.0|                                               > |                                                                           |
|  8   0.0   0.0   0.0 100.0|                                                 >                                                                           |
|  9   0.0   0.0   0.0 100.0|                                         >       |                                                                           |
| 10   1.0   0.0   0.0  99.0|                          >                      |                                                                           |
| 11   1.0   0.0   0.0  99.0|       >                                         |                                                                           |
| 12   3.0   0.0   0.0  97.0|U     >                                          |                                                                           |
| 13   1.0   0.0   0.0  99.0|        >                                        |                                                                           |
| 14   0.0   0.0   0.0 100.0|                                                 >                                                                           |
| 15   1.0   0.0   0.0  99.0|   >                                             |                                                                           |
| 16   5.0   0.0   0.0  95.0|UU                                               >                                                                           |
|---------------------------+-------------------------------------------------+                                                                           |
|Avg   9.2   0.2   0.0  90.6|UUUU                >                            |                                                                           |
|---------------------------+-------------------------------------------------+                                                                           |
| Top Processes Procs=324 mode=4 (1=Basic, 3=Perf 4=Size 5=I/O)-------------------------------------------------------------------------------------------|
|  PID    %CPU ResSize    Command                                                                                                                         |
|          Used      KB                                                                                                                                   |
|   20831 115.5 124512128 /bin/java -Xms64339m -Xmx64339m -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFrac|
|b   1634  39.5  778176 /bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSIniti|
|/    666   0.0  131776 /usr/lib/systemd/systemd-journald                                                                                       cyFraction|
|n   1506   0.0   75612 /usr/sbin/rsyslogd -n                                                                                                   ome=/usr/l|
|/  19114   0.0   21776 /usr/bin/python -tt /usr/sbin/yum-cron /etc/yum/yum-cron-hourly.conf                                                    r/logstash|
|   20368   0.0   14780 /usr/bin/php -q /var/www/html/nagioslogserver/www/index.php jobs                                                                  |
|   20367   0.0   14776 /usr/bin/php -q /var/www/html/nagioslogserver/www/index.php poller                                                                |
|   20369   0.0   14616 /usr/bin/php -q /var/www/html/nagioslogserver/www/index.php jobs/apache                                                           |
|    8385   0.0   13624 /usr/sbin/httpd -DFOREGROUND                                                                                                      |
|    4315   0.0   13528 /usr/sbin/httpd -DFOREGROUND                                                                                                      |
|    7344   0.0   13512 /usr/sbin/httpd -DFOREGROUND                                                                                                      |
|    8400   0.0   13388 /usr/sbin/httpd -DFOREGROUND                                                                                                      |
|    7906   0.0   13284 /usr/sbin/httpd -DFOREGROUND                                         
I don't know a lot about java, but I do know that heavy java GC activity is something you do not want. It would explain why the load is high, the disks are very busy, and the system even struggles with something as simple as generating the Home Page (it takes minutes and also fills up the search queue).

I would like to start the nodes with the memory fixed at either 32GB as advised, or 64GB as "has always worked", to see if my theory makes sense. Soooooo.....can you provide me with the correct syntax? :roll: :D

As for the output of the steps provided above: if you want it, let me know. I just need a little extra time to make it more readable.

Greetings...Hans
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by cdienger »

The memory is a good catch. I did notice you had more than the recommended max of 64GB, but I thought that had probably been addressed already, since I would have expected problems with the previous version as well. Maybe the planets were just aligned right until now, though.

We recommend 64GB as a max for the system, since Elasticsearch will take half by default, and going above 32GB of memory for the java process can actually impact performance negatively.

To set the memory you can edit /etc/sysconfig/elasticsearch and change this line:

Code: Select all

ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m
to:

Code: Select all

ES_HEAP_SIZE=3100m
(3100m because it's usually a safe number given how some systems round; 3200m may actually be more than 32GB)
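After the edit, a quick way to confirm the new value actually took effect (a minimal sketch; the service name and config path are the ones used elsewhere in this thread, and it assumes the java command line contains the elasticsearch path) is:

Code: Select all

# Restart so the new ES_HEAP_SIZE is picked up
service elasticsearch restart
# The running java process should now show matching -Xms/-Xmx flags
ps -ef | grep '[e]lasticsearch' | grep -o '\-Xm[sx][0-9]*m'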
CBoekhuis
Posts: 220
Joined: Tue Aug 16, 2011 4:55 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by CBoekhuis »

Hi,

here's a recap of what I've done today and what I've noticed.

- We've had a weekend in between, and the first thing I noticed today was that elasticsearch on both nodes was now down to roughly 64GB, which is what we would expect it to be. Apparently the Java GC did what it should have done. Let's keep this in mind.
- I inserted the 3100m parameter into /etc/sysconfig/elasticsearch on both nodes and restarted elasticsearch on one of the nodes (colog3). I watched the cluster status screen; shards were being assigned. Nice!! But then everything stopped. The web interface became non-responsive and the XI queries started failing. I left it like that for another half hour, then reverted the ES_HEAP_SIZE parameter back to 50% of the system memory and restarted elasticsearch. Everything came back to normal. Ok, too bad. However, I noticed that on both nodes the elasticsearch process again had +/- 64GB of memory in use.
- I enabled some of the XI services to see what happens. Again the search queues fill up and the elasticsearch logfile starts to fill up. I quickly disabled the services again; the nodes are very sluggish.
- The XI query services are disabled and the java proc(s) use 64GB. At that point I switched from the cluster status screen to the Home screen. It takes minutes, the search queue fills up, disk I/O runs at 200-300MB/s, and the java proc(s) grow from 64GB towards 128GB. Why? The Home screen only shows a list of dashboards and a graph of logs per 15m; it doesn't seem big or anything. I'm clueless, but hey, let's keep it in mind.
- I decided to close all indices besides the 35 most recent (instead of the 340 I have open). No problem. I released the XI query services again and all goes well!!
- Let's try the ES_HEAP_SIZE parameter again: I edited the file and restarted node colog3. It takes a long time and doesn't seem right......hang on.....3100m isn't 31GB, that's 3.1GB :D :D :D :D . Ok, fixed that (see the sketch after this list) and restarted everything on both nodes.
- Up and running with 35 open indices, XI services enabled and working, the search queue not filling up, and on both nodes the java process claiming +/- 42-44GB (it is using more than the -Xms/-Xmx settings).
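(For reference, a heap line that actually means ~31GB rather than 3.1GB would look like the sketch below; the exact value used isn't stated in this thread, so 31000m is only an illustrative figure just under the 32GB threshold.)

Code: Select all

# /etc/sysconfig/elasticsearch -- fixed heap just below 32GB
# (illustrative value; the exact figure used in this thread isn't stated)
ES_HEAP_SIZE=31000m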

I know this is a long list (and filled with spelling mistakes), but I'm seeing a pattern here. Mind you, I have no knowledge of the internal workings of elasticsearch; I'll leave that to you ;) .
The XI queries are basically: has string y for host x appeared in the past 5 minutes (or 1 hour)? However, it looks like all "OPEN" indices are scanned to find something happening in the very recent past. That would explain why the java process grows so much in memory, the high disk read I/O, etc. Even with the 35 open indices it's only barely keeping up.

In case you wonder why I have 340 open indices: the snapshot backup only snapshots the open indices, so in case of an emergency where the complete cluster is gone, I don't want to lose that much data.
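(If you want to count open versus closed indices from the command line, here is a sketch that should work against the local elasticsearch on the default port 9200, with the caveat that the pretty-printed output format can differ between Elasticsearch versions:)

Code: Select all

# Count open and closed indices via the cluster state metadata
# (grep patterns assume the usual '"state" : "open"' pretty-print spacing)
curl -s 'localhost:9200/_cluster/state/metadata?pretty' | grep -c '"state" : "open"'
curl -s 'localhost:9200/_cluster/state/metadata?pretty' | grep -c '"state" : "close"'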

Can you confirm that an 'in the past 5 minutes' query should dig through all open indices? Because it seems to me it's doing just that.

Kind regards....Hans
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by cdienger »

It sounds like you're on to something here, but due to the backlog of cases from the long weekend I wasn't able to test this today. I will take another look first thing tomorrow.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by cdienger »

I've not been able to reproduce this yet. Can you clarify exactly which services you are enabling to see this behavior?

I'd also be curious to see some debug logging taken when you run the query. To do this you can edit /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml and add this to the bottom:

Code: Select all

index.search.slowlog.threshold.query.trace: 1ms
index.search.slowlog.threshold.fetch.trace: 1ms
index.search.slowlog.threshold.index.trace: 1ms
then restart elasticsearch:

Code: Select all

service elasticsearch restart
and force the query to run again before removing the config and restarting the service again. I'd like to see the /var/log/elasticsearch/<UUID>__index_search_slowlog.log that is produced.
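While the query runs, something like the following can be used to watch the slow log being written (a sketch; the glob stands in for the cluster UUID in the filename):

Code: Select all

tail -f /var/log/elasticsearch/*_index_search_slowlog.log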
CBoekhuis
Posts: 220
Joined: Tue Aug 16, 2011 4:55 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by CBoekhuis »

The services I'm talking about are the Nagios XI service checks. They use the check_nagioslogserver.php script to gather information from LS. (I replaced the API key with X's in the example ;) )

For example:

Code: Select all

/usr/local/nagios/libexec/check_nagioslogserver.php --url='http://colog3.boekhuis.nl/nagioslogserver/' --apikey='XXXXXXXXXXXXXXXXXXXXX' --minutes='60' --warn=':1' --crit='1:' --query='{"query":{"filtered":{"query":{"bool":{"should":[{"query_string":{"query":"message:RsyslogKeepAliveCheck AND program:nagios_test AND logsource:'cobbler'"}}]}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":"now-1h","to":"now"}}}]}}}}}'
OK: 1 matching entries found |logs=1;:1;1:
I restarted elasticsearch on node colog3 with the new parameters you requested. As soon as the cluster was green I manually ran the above query to get something into the logfile. At this point I realize that it might have made more sense if I had also opened more indices, since only 35 are open instead of the 340. Oh well.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by cdienger »

The range is what is throwing it off and causing it to count all records in the db. Because the field is @timestamp, it is expecting an epoch time (or at least some digits; the exact values don't really matter, since the check/query endpoint will receive the "minutes=60" option and replace them with the start and current epoch). Try running the query like this instead:

Code: Select all

/usr/local/nagios/libexec/check_nagioslogserver.php --url='http://colog3.boekhuis.nl/nagioslogserver/' --apikey='XXXXXXXXXXXXXXXXXXXXX' --minutes='60' --warn=':1' --crit='1:' --query='{"query":{"filtered":{"query":{"bool":{"should":[{"query_string":{"query":"message:RsyslogKeepAliveCheck AND program:nagios_test AND logsource:'cobbler'"}}]}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":123,"to":456}}}]}}}}}'
CBoekhuis
Posts: 220
Joined: Tue Aug 16, 2011 4:55 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by CBoekhuis »

Ok, I made the changes in the service checks that I have; they all seem to return the expected results, so that's fine.
I opened 2 extra months' worth of indices to see what the performance impact is with the new change. I did get a lot of 'search queue limit reached' messages, but it stabilized after a little while. I reopened another month of indices, but that was just too much; I had to go back to +/- 35 open indices.
At this moment I've run out of ideas.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by cdienger »

Was the change done to all the checks on the XI side? I count 7 services that use "now-X" in their query.
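(A quick way to hunt those down on the XI server is a recursive grep; a sketch, assuming the service and command definitions live under the usual /usr/local/nagios/etc/ tree:)

Code: Select all

# List config files whose checks still use the relative "now-X" form
grep -rl 'now-' /usr/local/nagios/etc/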
CBoekhuis
Posts: 220
Joined: Tue Aug 16, 2011 4:55 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by CBoekhuis »

Yes, I changed all those services.