Simulator 0.4 released!

Internally the Performance/QA team is using a tool called the Simulator to simulate load on a Hazelcast cluster and see how it behaves when we apply this load for hours or even days. This tool helps us to detect performance and stability problems early.

The Simulator can be used on a predefined set of machines, e.g. physical hardware. We use this setup for performance testing, since we need predictability, and we use the machines in our test lab for this. But a lot of tests we run in the cloud, because we need to scale to more machines than we have in the test lab. The Simulator makes use of JClouds to provision machines in the cloud, so in theory the Simulator can run on any cloud provider, but we mostly work with Amazon EC2. From time to time we run with almost 200 machines, since that is currently the limit of our account.
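To give an idea of what JClouds-based provisioning looks like, here is a minimal standalone sketch; it is not the Simulator's actual provisioning code, and the group name ("simulator-agents"), machine count and credentials are placeholders:

    import java.util.Set;
    import org.jclouds.ContextBuilder;
    import org.jclouds.compute.ComputeService;
    import org.jclouds.compute.ComputeServiceContext;
    import org.jclouds.compute.RunNodesException;
    import org.jclouds.compute.domain.NodeMetadata;

    public class ProvisionExample {
        public static void main(String[] args) throws RunNodesException {
            // Build a compute context for EC2; the credentials are placeholders.
            ComputeServiceContext context = ContextBuilder.newBuilder("aws-ec2")
                    .credentials("accessKeyId", "secretAccessKey")
                    .buildView(ComputeServiceContext.class);
            ComputeService compute = context.getComputeService();

            // Provision 4 machines in a named group; because JClouds abstracts
            // the provider, the same code works against other clouds.
            Set<? extends NodeMetadata> nodes =
                    compute.createNodesInGroup("simulator-agents", 4);

            for (NodeMetadata node : nodes) {
                System.out.println("started: " + node.getPublicAddresses());
            }
            context.close();
        }
    }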

Born out of Necessity

What I personally like about the Simulator is that it is a tool directly driven by our needs. We are not adding things hoping they will be useful to the end user; we are adding them because we need them ourselves.

Simulator and Performance Testing

We use the Simulator to do performance testing. We have integrated various profilers so that we can see what is going on in the JVM and OS:

  1. YourKit
  2. Intel VTune
  3. Oracle Flight Recorder
  4. hprof
  5. Linux perf

Each tool has its strong and weak points, so by having a suite of tools to choose from, we can pick the right tool for the job. And the cool thing is that we can switch between profilers with a single change in a property file!

Apart from having all these profilers in place, we also provide integration with different performance measurement tools, most notably HdrHistogram by Gil Tene. This way we get all kinds of relevant statistics such as minimum, maximum and average latency, and last but not least: the latency distribution.
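To make that concrete, here is a minimal standalone sketch of recording and reporting latencies with HdrHistogram; it is not Simulator code, and doOperation() is just a stand-in for whatever operation is being measured:

    import java.util.concurrent.TimeUnit;
    import org.HdrHistogram.Histogram;

    public class LatencyRecordingExample {
        public static void main(String[] args) {
            // Track latencies up to 1 hour, with 3 significant decimal digits.
            Histogram histogram = new Histogram(TimeUnit.HOURS.toNanos(1), 3);

            for (int i = 0; i < 100_000; i++) {
                long started = System.nanoTime();
                doOperation(); // stand-in for e.g. a Hazelcast map.get()
                histogram.recordValue(System.nanoTime() - started);
            }

            System.out.println("min (ns):  " + histogram.getMinValue());
            System.out.println("max (ns):  " + histogram.getMaxValue());
            System.out.println("mean (ns): " + histogram.getMean());
            // Print the full latency distribution, scaled to microseconds.
            histogram.outputPercentileDistribution(System.out, 1000.0);
        }

        private static void doOperation() {
            // Placeholder workload.
            Math.sqrt(Math.random());
        }
    }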

For Simulator 0.5 we plan to have performance measurement integrated into our Continuous Integration environment so we can detect a performance regression very quickly. Some other performance-related features we’ll be adding are:

  1. performance delta: between two Git commits, we want to see any change in performance.
  2. performance regression finder: using a binary search over the commit history, we can find the exact commit where a performance regression was introduced (see the sketch below).
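To illustrate the idea behind the regression finder, here is a hypothetical sketch; none of these names exist in the Simulator, and checkoutAndBenchmark() is a stand-in for checking out a commit, building it and running a benchmark. It is essentially what git bisect does for correctness bugs, applied to a throughput score:

    import java.util.List;

    public class RegressionFinder {
        // Returns the index of the first regressed commit, assuming the first
        // commit is known good, the last is known bad, and performance
        // dropped at exactly one point in between.
        static int findFirstBadCommit(List<String> commits, double baseline) {
            int lo = 0;                   // known good
            int hi = commits.size() - 1;  // known bad
            while (lo + 1 < hi) {
                int mid = (lo + hi) / 2;
                double score = checkoutAndBenchmark(commits.get(mid));
                if (score >= baseline) {
                    lo = mid; // still fast enough: regression is later
                } else {
                    hi = mid; // regressed: regression is here or earlier
                }
            }
            return hi;
        }

        // Hypothetical helper: check out the commit, build it, run the
        // benchmark and return its throughput score.
        static double checkoutAndBenchmark(String commit) {
            throw new UnsupportedOperationException("stand-in");
        }
    }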

Simulator and Continuous Integration

Code reviews, unit/integration tests and static code analysis tools like FindBugs and Checkstyle are a fixed part of our software development process. They help us find bugs and problems as soon as possible and deal with them. But certain types of problems, e.g. memory leaks and race conditions, are very hard to detect this way.

That is why we run the Simulator tests every night for 6 hours using Jenkins@CloudBees. During these nightly tests we spawn an EC2 cluster and run a suite of Simulator tests. If one or more of these tests fail, the build breaks and we deal with the issue in the morning. Even though the feedback loop isn’t as fast as with a unit/integration test, with a feedback loop of one day it is still easy to inspect the changes of the previous day and relatively easy to find the cause.

Simulator and Release Process

Before every release we run a large suite of Simulator tests; a suite that grows with every release, to guarantee that high-quality software is released.

We have defined different types of test clusters:

Cluster   Client Count   Cluster Size
Small     8              4
Medium    20             6
Large     40             10
X-Large   100            25

And the closer we get to the Final release, the larger the clusters and the longer the runs:

Release            Duration   Small   Medium   Large   X-Large
Early Access       6h         YES
Early Access       12h        YES     YES
Release Candidate  12h        YES     YES     YES
Final              48h        YES     YES     YES     YES

As you can see, the Simulator is a key part of our QA and performance process. If you are making use of Hazelcast, feel free to embed the Simulator in your software process as well. Download Simulator here.