Performance Tuning of the Hazelcast SQL Engine
In Hazelcast Platform 5.1 contains two querying APIs:
- Predicate API
- SQL engine
Predicate API is an older Java-based API. Even though it contains
sqlPredicate, which allows using SQL-like syntax for the
WHERE clause, the syntax is non-standard, for example
NULL handling doesn’t support the (in)famous ternary logic. It fetches the results in one batch, which limits the supported result size.
On the other hand, SQL Engine is a more modern engine, it uses standard SQL, a cost-based optimizer, and is available in all programming languages. It also supports
GROUP BY or
UNION operators, which don’t have an equivalent in the Predicate API. It streams the results to the client, so the result size isn’t limited (though this is also a restriction, because it’s not possible to restart the query if it fails mid-way).
In the next major release, we plan to deprecate the Predicate API. For this, we need feature parity and match Predicate API’s performance. The performance is the focus of this blog post: we’ll describe our journey in benchmarking and fixing some performance issues we had.
Here is a list of benchmarks that we did for that comparison:
The fifth benchmark was run with and without index on
value field. For benchmark number three, we used three different serialization methods:
For the rest of the benchmarks, we used
IdentifiedDataSerializable. If you are not familiar with our serialization options check this page.
We didn’t benchmark joins and streaming queries since they cannot be executed with Predicate API. These functionalities are available in the SQL engine only.
All the benchmarks were run in throughput and latency modes since we cared about both. In this post, I will cover only the latency part of those benchmarks.
The first results showed that the SQL was slower in most of them. The next step was to rerun those benchmarks with the attached profiler. I used Async-profiler, and since I was tracing latency issues, I chose wall-clock mode. You can read about different modes in my previous post.
All of those benchmarks were run in our testing lab, so we knew that the results were valid (no noisy-neighbor issue). The setup for the models was:
- Hazelcast cluster size: 4
- Machines with Intel Xeon CPU E5-2687W
- Heap size: 10 GB
How does Hazelcast SQL work?
Each of the cluster members contains two significant modules: IMDG and JET. IMDG is a module where we store our data. JET is a distributed batch and stream processing engine. Additionally, for parsing the SQL queries we use Apache Calcite.
When a client sends a query to a cluster, that query is just a string with parameters. The query is sent to a random member of the cluster. That member is called a coordinator of that query. The coordinator needs to:
- Create a query plan
- Convert the query plan to the JET job
- Distribute the JET job to all the members
After that is done, all the members execute the query, and the coordinator to the client streams the results.
If you do sampling profiling you need to visualize the results. The results are nothing more than a set of stactraces. My favorite way of visualization is a flame graph. The easiest way to understand what flame graphs are is to know how they are created.
First part is to draw a rectangle for each frame of each stactrace. The stactraces are drawn bottom-up and sorted alphabetically. For example, such a graph:
corresponds to set of stactraces:
- 3 samples –
a() -> h()
- 5 samples –
b() -> d() -> e() -> f()
- 2 samples –
b() -> d() -> e() -> g()
- 2 samples –
b() -> d()
- 2 samples –
The next step is joining the rectangles with the same method name to one bar:
The flame graph usually shows you how your application utilizes some resource. The resource is used by the top methods of that graph (visualized with a green bar):
So in this example, method
b() is not utilizing the resource at all; it just invokes methods that do it. Flame graphs are commonly used to present the CPU utilization, but the CPU is just one of the resources that we can visualize this way. If you use wall-clock mode then your resource is time. If you use allocation mode then your resource is a heap.
SELECT __key, this FROM IMap
- Values in IMap have fields:
int, serialized with
- IMap size – 100_000 entries
- Latency test – 24 queries per second
Let’s start with profiling the client side. Here is a wall-clock flame graph of the method that fetches the data on the client’s side:
Let’s look at the bottom of the flame graph, method
testSelect() iterates over the results and for each row, it executes three things:
accept()– that executes the deserialization of the row – each SQL row travels to the client through the network as a byte array, so we need to deserialize it to the object format
ClientIterator.hasNext()– that one returns the next row if it is already fetched or waits to receive the next portion of the result set. In the flame graph, we see the waiting part. At this point the execution is done in the cluster – nothing to improve on the client’s side.
- Some third method, let’s zoom in on that part of the flame graph
Here we can see that the method
getColumnValueForClient() (3rd from the top) executes
LinkedList.get() which consumes a lot of time. Let’s go to the source code:
columns is a field of type
List<List<?>>, and the problematic method executes
LinkedList.get(index) is known to have a bad performance, since it has
O(size) complexity. The instance is not created here; it is passed in the constructor, so we need to debug the application to find the origin of that
LinkedList. Two breakpoints later, I found a place where it was created. One solution is to switch that
ArrayList like this:
But it is not always that easy. The code where that list was created was done many years ago and is used not only by the SQL engine. We need to be careful here. I asked colleagues from the other team if we could do it, and they did their benchmarks and agreed to that change:
How did that affect our performance? Well, it depends on the number of columns and whether we do deserialization on the client’s side (it is done lazily). Here is the latency distribution for two columns and no deserialization:
Was it a mistake to choose
LinkedList in that code? No. It was a coherent decision made by the state of Hazelcast at the moment of creation of that code.
The thing to remember is that performance is a living organism. It evolves with new features. What was efficient yesterday could be a bottleneck tomorrow.
Cluster side – sum()
SELECT sum(”value”) FROM IMap
- Values in IMap have fields: one
Long, and one
in, serialized with
- IMap size – 1_000_000 entries
- Latency test – 6 queries per second
In this benchmark, we need to deserialize all the instances in IMap, get a single field from that deserialized object, and aggregate it. The majority of work is deserialization. It was strange to me that the Predicate API performed differently than SQL.
Here is a flame graph for Predicate API:
And here is one for SQL:
Now let’s do something a bit silly. Open both graphs in separate browser tabs (right mouse click on the graph -> open graphics in a new tab) and then just switch between those two tabs very quickly and try to find a difference. Did you spot it? It is at the top. The SQL graph has a lot of yellow/red code there; the Predicate API does not. Let’s see what it is about:
The frames that we can see in SQL only are
Class.forName(). We already had a class definition cache in Hazelcast, and it was strange that it wasn’t used for SQL. We can highlight all invocations of that method to see how much time it consumes:
So loading of classes consumed ~25% of the deserialization of Java objects. Let’s look at the code of our class loading:
We have four layers of caches here; each
tryLoadClass() checks if the definition is in the cache; if not it loads it. If none of the classloaders loaded the class we called
The difference between Predicate API and SQL was that the contextClassLoader was null in SQL since you cannot use user classes with SQL queries. We decided to add another level of cache:
How did that improve the performance?
The thing to remember is that “If it looks stupid but works, it isn’t stupid” – that also applies to the performance analysis. Comparing two flame graphs by quickly switching the browser tabs looks silly, but hey, it works like a charm. If you want to compare resource utilization, that should look the same.
Cluster side – sum() with JSON
SELECT sum(json_value(this, '$.value' returning integer)) FROM cache
- Values in IMap have two fields:
int, serialized as
- IMap size – 1_000_000 entries
- Latency test – 5 queries per second
Let’s go straight to the flame graph of evaluation of a value from JSON:
The big left part of that graph is acquiring a lock; it utilizes:
~70% of the time of evaluation of the value. That part of the code uses a Guava cache that uses
ReentrantLock internally. The cache contains a mapping from the string JSON path to our object representing that path. The usage of that cache is usually single insert and multiple reads. For such a usage pattern, the
ReentrantLock is not the best choice,
ConcurrentHashMap is better, for example. A simple switch from Guava cache to CHM did this improvement:
In the end, we decided to do two implementations of a cache, one for a single JSON path based on a field, and the second for the rest of the cases.
The thing to remember is that the performance bottleneck may be in 3rd party libraries/frameworks. It doesn’t mean that their code is terrible. Usually we do not know the trade-offs there, and that may hurt us.
Cluster side – scan for a single value by index
SELECT __key, this FROM iMap where col=...; // Index on col
- Values in IMap have two fields:
int, serialized with
- IMap size – 10_000_000 entries
- Latency test – 7500 queries per second
This is a benchmark where we test a distributed deployment overhead of a job on a busy cluster. The part after the deployment is easy; we just need to take an index and fetch a single row from it. Unfortunately, in such queries, we will never be better than Predicate API since the execution plan for SQL is much bigger than for Predicate API, and serialization and deployment takes more time. What we can do is to speed up that part in SQL to be as fast as possible.
When I ran that benchmark’s first instance, the mean latency was around 4,5ms on a stressed cluster. To fight that kind of latency, we need to focus on all the resources that are required for that code to run. Let’s look at the heap allocation:
Over 90% of the recorded allocation (during the creation of the plan) was done in the constructor of
SenderTasklet. That class is responsible for sending computation results to other nodes in the cluster. It created a 32k-byte array as a buffer of data to send. The value 32k was ok for a task manipulating multiple entries, but it was a waste of heap memory for a job that processes only one row.
We went with the solution to have a buffer with two initial sizes:
- Small initial size: 1k
- If that buffer is too small, it expands to 32k immediately
That approach didn’t hit larger tasks, they need to waste a 1k array at the beginning, but the allocation of such an array is much cheaper than 32k since 32k is often allocated outside the TLAB.
That change lowered the allocation rate of that benchmark from 3,8GB/s to 2,3GB/s, but that was not the only change in that part of our engine. My colleague pointed out that in some cases, we created a
SenderTasklet that knew it would never send any data through the network. We can avoid the creation of unnecessary
SenderTasklet. That change lowered the heap allocation rate to 1,5GB/s.
There were plenty of PRs for lowering that deployment overhead. The current status (mean) is:
We lowered the mean latency from 4.5ms to 1.8ms on a stressed cluster, and we still have ideas for improving it.
The thing to remember is that allocation on the heap is fast, but no allocation is more immediate.
Our SQL engine in release 5.2 is currently faster with better throughput than Predicate API in most of the benchmarks. The only kind of query where Predicate API is more immediate is a query that usually takes just milliseconds to execute. As I mentioned in the last example, we still have ideas on how to make the SQL engine faster. We are going to realize those in future releases.
In this post, I focused on four different issues that can teach us something. Let me gather those four things to remember:
- Performance is a living organism. It evolves with new features. What was efficient yesterday could be a bottleneck tomorrow.
- “If it looks stupid but works, it isn’t stupid” – every flame graph analysis technique is good as long as it works for you
- The performance bottleneck may be hidden in 3rd party libraries/frameworks
- Allocation on the heap is fast, but no allocation is faster
One additional “lesson learned” is that the Async-profiler is a handy tool for finding such high-level bottlenecks. You need to remember that there are issues in which such a profiler won’t help. Some performance bottlenecks can be understood after analyzing assembly code or with Top-down Microarchitecture Analysis.
A Real-Time Reference Architecture for Data-Driven Businesses
This real-time architecture paper gives you ideas on what to look for when pursuing a real-time business and gives some guidance on the high-level components of a real-time architecture you need to deliver the most value from your fresh data.
Achieving Real-Time for Competitive Advantage – A Technology Perspective from Silicon Valley
In this webinar, we will discuss how businesses are using data to achieve real-time operations, seen from a Silicon Valley perspective
HazelVision 11: Generic Record
Hazelcast’s distributed compute features have traditionally required developers to upload class definitions to the cluster, making it difficult to evolve classes or define them in non-Java languages. Generic records provide a more flexible way to describe data that avoids these issues.
HazelVision Episode 10: Compact Serialization
In this episode, we look at how serialization greatly affects the performance and functionality of applications deployed on Hazelcast, and yet many developers will default to the standard Java Serialization. Learn about the new Compact Serialization option and how it stacks up against the other options available.
How to Bootstrap a Hazelcast Cluster in Kubernetes
In this blog post, we are happy to work with our community member, Vishesh Ruparelia, a software engineer at one…