Thursday, April 27, 2017

Optimizing GC settings for Sakai

For the last few weeks I've been playing with Java Virtual Machine (JVM) optimization. I've always believed that JVM tuning is somewhat mystical, a kind of dark magic: you can never be completely sure whether a change will improve the current configuration or sink your servers. JVM parametrization is something I try to touch as little as possible, but this time I had to roll up my sleeves and do some experimentation on the production servers.

We recently upgraded our Sakai [1] platform to version 11.3. It was great because we got some interesting new features and an awesome new responsive interface. As part of the upgrade, Java was also bumped to 1.8, so some of the parameters we had set up no longer worked.

Our first move was to remove all the non-working parameters and just define the heap size. We had the following configuration on our servers:

       JAVA_OPTS="-server -d64 -Djava.awt.headless=true 
           -Xms6g -Xmx6g
           -XX:+UseConcMarkSweepGC
           -XX:+CMSParallelRemarkEnabled 
           -Duser.language=ca 
           -Duser.region=ES -Dhttp.agent=Sakai
           -Dorg.apache.jasper.compiler.Parser.STRICT_QUOTE_ESCAPING=false 
           -Dsun.lang.ClassLoader.allowArraySyntax=true
           -Dcom.sun.management.jmxremote"

We run Sakai on a set of servers with 8 GB of RAM, so we set the heap to 6 GB, most of which ends up in the CMS Old Generation (Tenured) space. We obtained a distribution like this:

NON_HEAP: around 750 MB committed.
HEAP:
  • Par Eden + Par Survivor (young generation): around 300 MB committed.
  • CMS Old Gen: around 5.5 GB committed.

That configuration left around 1 GB for the OS, so initially it seemed like a good setup. As you can see, we had two extra parameters to handle the Java Garbage Collector processes:

-XX:+UseConcMarkSweepGC: Enables the use of the CMS (concurrent mark sweep) garbage collector for the old generation.

-XX:+CMSParallelRemarkEnabled: Makes the (stop-the-world) remark phase use multiple threads, which shortens that pause. It's a good option if your server has many cores.
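
As a side note, this memory distribution can also be checked from the command line with jstat, the monitoring tool that ships with the JDK. This is only a rough sketch of how it could be done; PID stands in for the Tomcat process id:

      # committed (capacity) sizes of the young and old generations, in KB
      jstat -gccapacity PID
      # current usage plus young/full GC counts and accumulated times, sampled every 5 seconds
      jstat -gc PID 5s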

Out of Memory and JVM Pauses

This setup seemed to work for us. Garbage collections were triggered automatically by the JVM's own heuristics whenever needed, but we quickly started to see memory problems. Sometimes, when a server's memory occupancy was high, it started to behave dramatically wrong and a list of random errors appeared in the logs.

We use PSI Probe [2] to monitor JVM behavior. It collects information about threads, memory, CPU, data sources and so on. Thanks to it we found out that the Garbage Collector wasn't freeing the CMS Old Gen memory when the errors occurred. At that point we tried to restart Tomcat and couldn't: the shutdown failed with a message saying the heap was out of memory.

Capture from the memory use screen of PSI Probe

After analyzing the logs and seeing how random it was (it happened at different times and on different servers), we figured out that all the errors appeared because the servers were running out of memory. So we started to monitor the GC process in more detail by adding the following lines to JAVA_OPTS:

     -Xloggc:/home/sakai/logs/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps
     -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=2M

Those lines write the GC activity to a log file so we could analyze it. To visualize the data we used GCViewer, an open source project from Tagtraum [3]. It shows a lot of information about memory and the Garbage Collector processes extracted from the logs.
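
GCViewer is distributed as a runnable jar, so opening our log with it looks more or less like this (the exact jar name depends on the version you download, and the log path is the one we set with -Xloggc above):

      java -jar gcviewer-1.36.jar /home/sakai/logs/gc.log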

With this application we saw that when memory usage got close to the limit, the garbage collector pauses increased: there were a lot of them, some lasting up to 30 seconds. That, of course, made the server stop responding.

So our first thought was to prevent memory occupancy from reaching the critical level that made the servers unstable. Our first attempt was to use two options to force the Garbage Collector to free memory periodically:

      -Dsun.rmi.dgc.client.gcInterval=3600000
      -Dsun.rmi.dgc.server.gcInterval=3600000

These properties use RMI's distributed garbage collector to force a full collection every hour (the value is in milliseconds), but it didn't work for us: the CMS Old Gen kept growing. So we tried other parameters, and this time it worked:

      -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75


-XX:+UseCMSInitiatingOccupancyOnly: Enables the use of the occupancy value as the only criterion for initiating the CMS collector.

-XX:CMSInitiatingOccupancyFraction: Sets the percentage of the old generation occupancy (0 to 100) at which to start a CMS collection cycle.
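
In our case, with roughly 5.5 GB committed to the CMS Old Gen, a threshold of 75% means the concurrent collection cycle starts at around 4.1 GB of old generation occupancy, well before the space is exhausted.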

On the other hand, following some recommendations found in Anatoliy Sokolenko's blog post [4], we also tried setting one more parameter to change the marking behavior of the garbage collector:

-XX:+CMSScavengeBeforeRemark: Enables a young generation collection (scavenge) attempt before the CMS remark step, so there is less live young-generation data to scan during the remark pause.

We have been running with these parameters for a few days now and the new behavior looks good: the servers no longer saturate, and the GC pauses taken when a big CMS Old Gen cleanup is performed are very short, under a second.

Capture from GCViewer. It shows the time spent in a Full GC performed on the CMS Old Gen.
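
For completeness, this is roughly how our JAVA_OPTS ended up after all the changes described in this post (take it as a sketch of our particular setup rather than a recommendation; sizes and paths depend on your environment):

      JAVA_OPTS="-server -d64 -Djava.awt.headless=true
          -Xms6g -Xmx6g
          -XX:+UseConcMarkSweepGC
          -XX:+CMSParallelRemarkEnabled
          -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75
          -XX:+CMSScavengeBeforeRemark
          -Xloggc:/home/sakai/logs/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps
          -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=2M
          -Duser.language=ca -Duser.region=ES -Dhttp.agent=Sakai
          -Dorg.apache.jasper.compiler.Parser.STRICT_QUOTE_ESCAPING=false
          -Dsun.lang.ClassLoader.allowArraySyntax=true
          -Dcom.sun.management.jmxremote"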

As you can see, what I have really demonstrated by writing this post is that I'm not good at Java tuning. The intention was simply to show the process we followed. I'm sure I made some wrong assumptions along the way, and there are probably better ways to do this, so I'll really appreciate your comments and suggestions on how to improve it.

References:
[1] Sakai Project: https://sakaiproject.org/
[2] PSI Probe project: https://github.com/psi-probe/psi-probe
[3] Tagtraum, GCViewer web page: http://www.tagtraum.com/gcviewer.html
[4] Anatoliy Sokolenko's blog post: http://blog.sokolenko.me/2014/11/javavm-options-production.html

Other sources I used in my research process:
http://docs.oracle.com/javase/8/docs/technotes/tools/unix/java.html