Comparing Serialization Options
“Which serialization option is best?”
In this post we’ll explore some of the most common serialization options for Hazelcast: standard Java coding, Hazelcast’s own optimized interfaces, and the external libraries Avro, Kryo and Protobuf. Following our previous posts on how much memory you will need (here and here), which looked at object sizing in Java, we now need to know how long it takes to turn an object into its serialized form and how much space that form uses.
We’ll use benchmark code (here) to measure both the speed and compactness for each, to come out with some indicative numbers.
These are indicative numbers. What is best for one data model may not be best for another data model, but we have to start somewhere.
Our Data Model and the First Warning
For this benchmark, we shall look at this data structure:
Person { String firstName; String lastName; Passport passport { String expiryDate; String issuingCountry; long issuingDate; } }
Or as an example with data in JSON form:
{ "firstName" : "Tom", "lastName" : "Baker", "passport" : { "issuingDate" : 1594312998966, "issuingCountry" : "GBR", "expiryDate" : "2020-07-09" } }
A person has a first name, last name and a passport.
The passport has an expiration date, issuing country and issuing date. We choose to use a string for the expiry date but a number for the issuing date.
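As a concrete starting point, the data model might be coded in Java along these lines. This is a sketch, not the benchmark’s exact classes: the accessor names and the use of `java.io.Serializable` as the baseline are our assumptions.

```java
import java.io.Serializable;

// Sketch of the Person/Passport data model, mirroring the JSON example above.
class Person implements Serializable {
    private final String firstName;
    private final String lastName;
    private final Passport passport; // may be null: only ~50% of test Persons have one

    Person(String firstName, String lastName, Passport passport) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.passport = passport;
    }
    String getFirstName() { return firstName; }
    String getLastName()  { return lastName; }
    Passport getPassport() { return passport; }
}

class Passport implements Serializable {
    private final String expiryDate;     // a String, e.g. "2020-07-09"
    private final String issuingCountry; // e.g. "GBR"
    private final long issuingDate;      // a number, e.g. epoch milliseconds

    Passport(String expiryDate, String issuingCountry, long issuingDate) {
        this.expiryDate = expiryDate;
        this.issuingCountry = issuingCountry;
        this.issuingDate = issuingDate;
    }
    String getExpiryDate() { return expiryDate; }
    String getIssuingCountry() { return issuingCountry; }
    long getIssuingDate() { return issuingDate; }
}
```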
Already our data model is biased. It is heavy on strings and light on numbers. Furthermore, names as strings tend to be quite short, generally under 10 characters, unlike other strings that could be hundreds of characters.
Warning: This skewing may make some serializations appear better for this specific data model than they behave in general.
The Serialization Choices Used
Three styles of serialization are used in this benchmark – Java standard serialization, Hazelcast optimized serialization, and external serialization libraries.
For external serialization libraries, we shall use Avro, Kryo and Protobuf. These are common choices, but by no means the only ones.
We won’t describe the detailed facets of each here, only the key points.
Java standard serialization
Java provides two options:
java.io.Serializable
java.io.Externalizable
Serializable is the easiest to implement, as coding is optional.
Both work only with Java which would limit the usability of the cluster to Java and other JVM-based languages.
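The difference between the two is that Externalizable makes you write each field explicitly, which avoids much of the reflective per-class metadata and produces smaller output. A hedged sketch (the class name PassportExt is ours, not the benchmark’s):

```java
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

// Externalizable variant of the Passport class: the fields are written and
// read by hand, one by one, instead of via Serializable's reflective default.
class PassportExt implements Externalizable {
    private String expiryDate;
    private String issuingCountry;
    private long issuingDate;

    // Externalizable requires a public no-arg constructor for deserialization.
    public PassportExt() { }

    public PassportExt(String expiryDate, String issuingCountry, long issuingDate) {
        this.expiryDate = expiryDate;
        this.issuingCountry = issuingCountry;
        this.issuingDate = issuingDate;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(expiryDate);
        out.writeUTF(issuingCountry);
        out.writeLong(issuingDate);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        expiryDate = in.readUTF();
        issuingCountry = in.readUTF();
        issuingDate = in.readLong();
    }

    String getExpiryDate() { return expiryDate; }
    String getIssuingCountry() { return issuingCountry; }
    long getIssuingDate() { return issuingDate; }
}
```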
Hazelcast optimized serialization
Hazelcast provides five options:
com.hazelcast.nio.serialization.DataSerializable
com.hazelcast.nio.serialization.IdentifiedDataSerializable
com.hazelcast.nio.serialization.Portable
com.hazelcast.nio.serialization.VersionedPortable
com.hazelcast.core.HazelcastJsonValue
These have some optimizations internally specific to Hazelcast.
HazelcastJsonValue is similar to Serializable in that you don’t need to write the serialization logic.
Of these five, IdentifiedDataSerializable, Portable, VersionedPortable and HazelcastJsonValue are interoperable, meaning you can use them from other languages such as Node.js or C++. This could be an important consideration now or in the future.
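To show the flavor of the Hazelcast interfaces, here is a hedged sketch of IdentifiedDataSerializable for the Passport class, based on the Hazelcast 4.x API. The factory and class IDs are arbitrary values of our own, and a matching DataSerializableFactory must be registered in the Hazelcast configuration (not shown).

```java
import com.hazelcast.nio.ObjectDataInput;
import com.hazelcast.nio.ObjectDataOutput;
import com.hazelcast.nio.serialization.IdentifiedDataSerializable;

import java.io.IOException;

public class PassportIds implements IdentifiedDataSerializable {
    // Assumed IDs; they must match a registered DataSerializableFactory.
    static final int FACTORY_ID = 1000;
    static final int CLASS_ID = 1;

    private String expiryDate;
    private String issuingCountry;
    private long issuingDate;

    public PassportIds() { } // needed for deserialization

    @Override
    public int getFactoryId() { return FACTORY_ID; }

    @Override
    public int getClassId() { return CLASS_ID; }

    @Override
    public void writeData(ObjectDataOutput out) throws IOException {
        // No class name or field names are written, only the IDs and values,
        // which is part of why the output is much smaller than Serializable's.
        out.writeString(expiryDate);
        out.writeString(issuingCountry);
        out.writeLong(issuingDate);
    }

    @Override
    public void readData(ObjectDataInput in) throws IOException {
        expiryDate = in.readString();
        issuingCountry = in.readString();
        issuingDate = in.readLong();
    }
}
```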
External serialization libraries
Although there are others, the benchmark uses three common serialization libraries.
- Avro is an Apache project, and a common serialization choice when used with the likes of Apache Kafka.
- Kryo is another popular choice, not as closely aligned with a specific ecosystem as Avro and Protobuf often are.
- Protobuf is a Google-developed protocol, and frequently features alongside the gRPC remote procedure call framework.
These three are interoperable.
Interoperable
Most but not all of the above data serialization choices are described as interoperable. This means you can use them with multiple languages.
So, a C++ client may save some data to the Hazelcast grid which a Node.js client may later read.
In mixed development environments, and particularly with microservices, it is common for teams to use different languages. If this is the case, being able to share data between them is clearly useful.
Results
Hazelcast 4.2 is used to generate 100,000 Person objects, randomly selecting from a pool of first names and last names, with a 50% chance that the Person has a Passport.
The data is sized once since size measurement is deterministic.
Timings are run three times to even out variance should the computer used be doing other things. Three runs isn’t a comprehensive test, but remember these are just indicative numbers for a specific data model. If the test data isn’t representative of the real data, the results will be misleading.
Sizes
Objects are serialized to byte[] and the size measured.
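The measurement idea is straightforward; a sketch using only java.io, not the benchmark’s actual harness:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Serialize an object to byte[] and take the array length as its size.
class SizeOf {
    static byte[] toBytes(Serializable object) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
        }
        return bytes.toByteArray();
    }
}
```

Note that for java.io.Serializable the byte count includes the stream header and class metadata, which is why even small objects serialize to hundreds of bytes.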
Sizes for 100,000 records:
- java.io.Serializable : 32,252,640 bytes (first object is 402 bytes)
- java.io.Externalizable : 15,599,682 bytes (first object is 210 bytes)
- com.hazelcast.nio.serialization.DataSerializable : 15,599,218 bytes (first object is 206 bytes)
- com.hazelcast.nio.serialization.IdentifiedDataSerializable : 6,045,448 bytes (first object is 78 bytes)
- com.hazelcast.nio.serialization.Portable : 15,799,102 bytes (first object is 207 bytes)
- com.hazelcast.nio.serialization.VersionedPortable : 15,799,102 bytes (first object is 207 bytes)
- com.hazelcast.core.HazelcastJsonValue : 9,948,464 bytes (first object is 143 bytes)
- Avro (https://avro.apache.org) : 3,394,462 bytes (first object is 43 bytes)
- Kryo (https://github.com/EsotericSoftware/kryo) : 3,094,346 bytes (first object is 39 bytes)
- Protobuf (https://developers.google.com/protocol-buffers) : 4,144,752 bytes (first object is 53 bytes)
In other words, 100,000 records using java.io.Serializable need around 32 MB when serialized.
All the other formats are at least twice as compact, for this data model.
Timings
Objects are serialized to byte[] and the time measured. The byte[] is then deserialized back into a Java object and the time measured again.
Benchmark          (kindStr)                                                   Mode  Cnt      Score       Error  Units
Timer.deserialize  java.io.Serializable                                        avgt    3  33655.348 ±  6089.926  ms/op
Timer.deserialize  java.io.Externalizable                                      avgt    3   8494.364 ±   212.126  ms/op
Timer.deserialize  com.hazelcast.nio.serialization.DataSerializable            avgt    3   3563.776 ±   135.823  ms/op
Timer.deserialize  com.hazelcast.nio.serialization.IdentifiedDataSerializable  avgt    3   1861.914 ±  1042.587  ms/op
Timer.deserialize  com.hazelcast.nio.serialization.Portable                    avgt    3   3338.154 ±  5395.007  ms/op
Timer.deserialize  com.hazelcast.nio.serialization.VersionedPortable           avgt    3   2920.613 ±   461.255  ms/op
Timer.deserialize  com.hazelcast.core.HazelcastJsonValue                       avgt    3    918.050 ±   677.488  ms/op
Timer.deserialize  https://avro.apache.org                                     avgt    3   8859.665 ±  6992.005  ms/op
Timer.deserialize  https://github.com/EsotericSoftware/kryo                    avgt    3   4887.225 ±  1740.484  ms/op
Timer.deserialize  https://developers.google.com/protocol-buffers              avgt    3   2267.215 ±   365.659  ms/op
Timer.serialize    java.io.Serializable                                        avgt    3  11748.139 ±  4072.223  ms/op
Timer.serialize    java.io.Externalizable                                      avgt    3   4965.973 ±   758.891  ms/op
Timer.serialize    com.hazelcast.nio.serialization.DataSerializable            avgt    3   2312.038 ±   128.996  ms/op
Timer.serialize    com.hazelcast.nio.serialization.IdentifiedDataSerializable  avgt    3   1920.426 ±   129.446  ms/op
Timer.serialize    com.hazelcast.nio.serialization.Portable                    avgt    3   4121.768 ±    76.325  ms/op
Timer.serialize    com.hazelcast.nio.serialization.VersionedPortable           avgt    3   3735.342 ±    65.550  ms/op
Timer.serialize    com.hazelcast.core.HazelcastJsonValue                       avgt    3   1310.166 ±    75.076  ms/op
Timer.serialize    https://avro.apache.org                                     avgt    3   2701.126 ±    72.070  ms/op
Timer.serialize    https://github.com/EsotericSoftware/kryo                    avgt    3   4812.091 ±   620.043  ms/op
Timer.serialize    https://developers.google.com/protocol-buffers              avgt    3   1913.270 ±   406.705  ms/op
The score column tells us that, for the 100,000 objects, java.io.Serializable takes around 33 seconds to deserialize and 11 seconds to serialize. All the others are substantially faster.
Hazelcast’s HazelcastJsonValue is the winner this time, for this data model. JSON is really just a string, where the structure is imposed retrospectively.
More Warnings
The above are the results for some randomly generated test data against a specific data model. Real data, if available, would be better.
There is no clear winner. Smallest is not the fastest.
There are also hidden considerations, such as testing the serialization logic and license restrictions on external libraries.
Remember that Hazelcast is a distributed store, so don’t forget the networking side. The time to serialize/deserialize may be negligible compared to the network transmission time, so optimizing for speed may not bring much benefit here. But if you query a lot, the deserialization cost could be important.
Network transmission time is not linear; data is transferred in blocks. The worst case is the boundary between one and two blocks: if two blocks are needed, transmission time could double. Fortunately, the network buffers are easily tuned.
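For instance, Hazelcast exposes the socket buffer sizes as configuration properties (values in KB). A hedged example; check your version’s reference manual for the exact property names and defaults before relying on them:

```xml
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
    <properties>
        <!-- Assumed property names; buffer sizes are in KB. -->
        <property name="hazelcast.socket.send.buffer.size">256</property>
        <property name="hazelcast.socket.receive.buffer.size">256</property>
    </properties>
</hazelcast>
```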
Summary
The serialization choice affects the size of the data object. Until the serialization choice is confirmed, it is misleading to look at the raw data and try to extrapolate how much memory will be needed.
One serialization choice can easily result in the data size being 10 times that of another choice.
If you have a lot of this type of data, it’s worth benchmarking as this will indicate if the extra development cost is justified for reduced memory usage and transmission time.
But for some data, there aren’t many data records, so the memory cost might be irrelevant and development simplicity is favored. Equally, if a near-cache is added, network transmission costs can become irrelevant.
You should always consider whether non-Java clients will connect, now or in the future, as choices such as java.io.Serializable are not interoperable with other languages.
Find the code for the benchmark here.
Finally
Be very careful. If you optimize for size, speed may suffer for operations such as querying. The serialization choice needs a view of the application’s usage as a whole.