Comparing Serialization Options

“Which serialization option is best?”

In this post we’ll explore some of the most common serialization options for Hazelcast, including the standard Java and Hazelcast approaches and the external libraries Avro, Kryo and Protobuf. Following our previous posts on how much memory you will need (here and here), which looked at object sizing in Java, we now need to know how long it takes to turn an object into its serialized form and how much space that form uses.

We’ll use benchmark code (here) to measure both the speed and compactness for each, to come out with some indicative numbers.

These are indicative numbers. What is best for one data model may not be best for another data model, but we have to start somewhere.

Our Data Model and the First Warning

For this benchmark, we shall look at this data structure:

Person {
    String firstName;
    String lastName;
    Passport passport {
        String expiryDate;
        String issuingCountry;
        long issuingDate;
    }
}

Or as an example with data in JSON form:

{
    "firstName" : "Tom",
    "lastName" : "Baker",
    "passport" : {
        "issuingDate" : 1594312998966,
        "issuingCountry" : "GBR",
        "expiryDate" : "2020-07-09"
    }
}

A person has a first name, last name and a passport.

The passport has an expiration date, issuing country and issuing date. We choose to use a string for the expiry date but a number for the issuing date.

Already our data model is biased. It is heavy on strings and light on numbers. Furthermore, names as strings tend to be quite short, generally under 10 characters, unlike other strings that could be hundreds of characters long.

Warning: This skewing may make some serializations appear better for this specific data model than they behave in general.

The Serialization Choices Used

Three styles of serialization are used in this benchmark – Java standard serialization, Hazelcast optimized serialization, and external serialization libraries.

For external serialization libraries, we shall use Avro, Kryo and Protobuf. These are common choices, but there are others.

We won’t describe the detailed facets of each here, only the key points.

Java standard serialization

Java provides two options:

  • java.io.Serializable
  • java.io.Externalizable

Serializable is the easiest to implement, as writing serialization code is optional.

Both work only with Java, which would limit the usability of the cluster to Java and other JVM-based languages.
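
As a rough sketch of what these two look like for the data model above (an illustration showing both interfaces side by side, not the benchmark’s exact classes):

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.io.Serializable;

// Serializable: the marker interface is enough, the JVM works out the rest via reflection.
class Person implements Serializable {
    String firstName;
    String lastName;
    Passport passport;
}

// Externalizable: the read and write logic is coded by hand.
class Passport implements Externalizable {
    String expiryDate;
    String issuingCountry;
    long issuingDate;

    public Passport() {
        // a public no-arg constructor is required for Externalizable
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(expiryDate);
        out.writeUTF(issuingCountry);
        out.writeLong(issuingDate);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        expiryDate = in.readUTF();
        issuingCountry = in.readUTF();
        issuingDate = in.readLong();
    }
}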

Hazelcast optimized serialization

Hazelcast provides five options:

  • com.hazelcast.nio.serialization.DataSerializable
  • com.hazelcast.nio.serialization.IdentifiedDataSerializable
  • com.hazelcast.nio.serialization.Portable
  • com.hazelcast.nio.serialization.VersionedPortable
  • com.hazelcast.core.HazelcastJsonValue

These have some optimizations internally specific to Hazelcast.

HazelcastJsonValue is similar to Serializable in that you don’t need to write the serialization logic.

Of these five, IdentifiedDataSerializable, Portable, VersionedPortable and HazelcastJsonValue are interoperable, meaning you can use them from other languages such as Node.js or C++. This could be an important consideration now or in the future.
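
To give a flavour, here is a sketch of what IdentifiedDataSerializable could look like for the Passport part of the data model. The factory and class ids below are made-up values for illustration; in real code they must match a DataSerializableFactory registered in the Hazelcast serialization configuration.

import java.io.IOException;

import com.hazelcast.nio.ObjectDataInput;
import com.hazelcast.nio.ObjectDataOutput;
import com.hazelcast.nio.serialization.IdentifiedDataSerializable;

class Passport implements IdentifiedDataSerializable {
    String expiryDate;
    String issuingCountry;
    long issuingDate;

    @Override
    public int getFactoryId() {
        return 1;    // illustrative factory id
    }

    @Override
    public int getClassId() {
        return 2;    // illustrative class id, unique within the factory
    }

    @Override
    public void writeData(ObjectDataOutput out) throws IOException {
        out.writeString(expiryDate);
        out.writeString(issuingCountry);
        out.writeLong(issuingDate);
    }

    @Override
    public void readData(ObjectDataInput in) throws IOException {
        expiryDate = in.readString();
        issuingCountry = in.readString();
        issuingDate = in.readLong();
    }
}

Portable and VersionedPortable follow a similar pattern but write named fields, which is what allows individual fields to be read without deserializing the whole object.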

External serialization libraries

Although there are others, the benchmark uses three common serialization libraries.

  • Avro is an Apache project, and a common serialization choice with the likes of Apache Kafka.
  • Kryo is another popular choice; it is a general-purpose serialization library rather than being tied to a particular ecosystem in the way Avro and Protobuf often are.
  • Protobuf is a Google-developed format, and frequently features alongside the gRPC remote procedure call framework.

These three are interoperable.
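
Exactly how the benchmark wires these libraries in isn’t shown here, but the usual route is a Hazelcast custom serializer that delegates to the library. Below is a minimal sketch for Kryo; the type id, buffer size and registration details are assumptions for illustration, not the benchmark’s exact code.

import java.io.IOException;

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

import com.hazelcast.nio.ObjectDataInput;
import com.hazelcast.nio.ObjectDataOutput;
import com.hazelcast.nio.serialization.StreamSerializer;

class PersonKryoSerializer implements StreamSerializer<Person> {

    // Kryo instances are not thread-safe, so keep one per thread.
    private static final ThreadLocal<Kryo> KRYO = ThreadLocal.withInitial(() -> {
        Kryo kryo = new Kryo();
        kryo.register(Person.class);
        kryo.register(Passport.class);
        return kryo;
    });

    @Override
    public int getTypeId() {
        return 1000;    // illustrative; must be positive and unique across the cluster
    }

    @Override
    public void write(ObjectDataOutput out, Person person) throws IOException {
        Output output = new Output(4096, -1);
        KRYO.get().writeObject(output, person);
        out.writeByteArray(output.toBytes());
    }

    @Override
    public Person read(ObjectDataInput in) throws IOException {
        Input input = new Input(in.readByteArray());
        return KRYO.get().readObject(input, Person.class);
    }

    @Override
    public void destroy() {
        // nothing to release
    }
}

The serializer would then be registered against the Person class with a SerializerConfig in the Hazelcast SerializationConfig. Avro and Protobuf can be plugged in the same way, delegating to their own generated classes and encoders instead of Kryo.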

Interoperable

Most but not all of the above data serialization choices are described as interoperable. This means you can use them with multiple languages.

So, a C++ client may save some data to the Hazelcast grid which a Node.js client may later read.

In mixed development environments, and particularly with microservices, it is common for teams to use different languages. If this is the case, being able to share data between them is clearly useful.

Results

Hazelcast 4.2 is used to generate 100,000 Person objects, randomly selecting from a pool of first names and last names, with a 50% chance that each Person has a Passport.
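
In outline, the generation step looks something like this (a simplified sketch with placeholder name pools, not the benchmark’s exact code):

import java.util.Random;

// Build random Person objects: random names, 50% chance of a Passport.
class TestDataFactory {
    private static final String[] FIRST_NAMES = { "Tom", "Jane", "Ada", "Alan" };
    private static final String[] LAST_NAMES = { "Baker", "Smith", "Jones", "Brown" };

    private final Random random = new Random();

    Person createPerson() {
        Person person = new Person();
        person.firstName = FIRST_NAMES[random.nextInt(FIRST_NAMES.length)];
        person.lastName = LAST_NAMES[random.nextInt(LAST_NAMES.length)];
        if (random.nextBoolean()) {
            Passport passport = new Passport();
            passport.expiryDate = "2020-07-09";
            passport.issuingCountry = "GBR";
            passport.issuingDate = System.currentTimeMillis();
            person.passport = passport;
        }
        return person;
    }
}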

The data is sized once since size measurement is deterministic.

Timings are run three times to even out variance, in case the computer used is doing other things. Three runs isn’t a comprehensive test, but remember these are just indicative numbers for a specific data model. If the test data isn’t representative of the real data, the results will be misleading.

Sizes

Objects are serialized to byte[] and the size measured.
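
For the plain Java options this is as simple as writing to a byte stream and counting the bytes (a minimal sketch; the Hazelcast mechanisms and the external libraries each produce their byte[] through their own APIs):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

final class SizeOf {
    // Serialize one object with standard Java serialization and report the byte count.
    static int sizeOf(Object object) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
        }
        return bytes.size();
    }
}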

Sizes for 100,000 records.
                                        java.io.Serializable :   32,252,640 bytes :   first object is 402 bytes
                                      java.io.Externalizable :   15,599,682 bytes :   first object is 210 bytes
            com.hazelcast.nio.serialization.DataSerializable :   15,599,218 bytes :   first object is 206 bytes
  com.hazelcast.nio.serialization.IdentifiedDataSerializable :    6,045,448 bytes :   first object is  78 bytes
                    com.hazelcast.nio.serialization.Portable :   15,799,102 bytes :   first object is 207 bytes
           com.hazelcast.nio.serialization.VersionedPortable :   15,799,102 bytes :   first object is 207 bytes
                       com.hazelcast.core.HazelcastJsonValue :    9,948,464 bytes :   first object is 143 bytes
                                     https://avro.apache.org :    3,394,462 bytes :   first object is  43 bytes
                    https://github.com/EsotericSoftware/kryo :    3,094,346 bytes :   first object is  39 bytes
              https://developers.google.com/protocol-buffers :    4,144,752 bytes :   first object is  53 bytes

In other words, 100,000 records using java.io.Serializable need about 32 MB when serialized.

All the other formats are at least twice as compact, for this data model.

Timings

Objects are serialized to byte[] and the time measured. The byte[] is then deserialized back into a Java object and the time measured again.
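
The timing table below is JMH output. The benchmark class has roughly this shape (a structural sketch only; the real Timer class in the benchmark repository does the actual serialization work for each kindStr value):

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class Timer {

    // The serialization mechanism under test; appears as the "kindStr" column in the results.
    @Param({"java.io.Serializable", "java.io.Externalizable"})
    public String kindStr;

    @Benchmark
    public void serialize() {
        // serialize all 100,000 Person objects using the mechanism named by kindStr
    }

    @Benchmark
    public void deserialize() {
        // turn the previously produced byte[] back into Person objects
    }
}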

Benchmark                                                           (kindStr)  Mode  Cnt      Score      Error  Units
Timer.deserialize                                        java.io.Serializable  avgt    3  33655.348 ± 6089.926  ms/op
Timer.deserialize                                      java.io.Externalizable  avgt    3   8494.364 ±  212.126  ms/op
Timer.deserialize            com.hazelcast.nio.serialization.DataSerializable  avgt    3   3563.776 ±  135.823  ms/op
Timer.deserialize  com.hazelcast.nio.serialization.IdentifiedDataSerializable  avgt    3   1861.914 ± 1042.587  ms/op
Timer.deserialize                    com.hazelcast.nio.serialization.Portable  avgt    3   3338.154 ± 5395.007  ms/op
Timer.deserialize           com.hazelcast.nio.serialization.VersionedPortable  avgt    3   2920.613 ±  461.255  ms/op
Timer.deserialize                       com.hazelcast.core.HazelcastJsonValue  avgt    3    918.050 ±  677.488  ms/op
Timer.deserialize                                     https://avro.apache.org  avgt    3   8859.665 ± 6992.005  ms/op
Timer.deserialize                    https://github.com/EsotericSoftware/kryo  avgt    3   4887.225 ± 1740.484  ms/op
Timer.deserialize              https://developers.google.com/protocol-buffers  avgt    3   2267.215 ±  365.659  ms/op
Timer.serialize                                          java.io.Serializable  avgt    3  11748.139 ± 4072.223  ms/op
Timer.serialize                                        java.io.Externalizable  avgt    3   4965.973 ±  758.891  ms/op
Timer.serialize              com.hazelcast.nio.serialization.DataSerializable  avgt    3   2312.038 ±  128.996  ms/op
Timer.serialize    com.hazelcast.nio.serialization.IdentifiedDataSerializable  avgt    3   1920.426 ±  129.446  ms/op
Timer.serialize                      com.hazelcast.nio.serialization.Portable  avgt    3   4121.768 ±   76.325  ms/op
Timer.serialize             com.hazelcast.nio.serialization.VersionedPortable  avgt    3   3735.342 ±   65.550  ms/op
Timer.serialize                         com.hazelcast.core.HazelcastJsonValue  avgt    3   1310.166 ±   75.076  ms/op
Timer.serialize                                       https://avro.apache.org  avgt    3   2701.126 ±   72.070  ms/op
Timer.serialize                      https://github.com/EsotericSoftware/kryo  avgt    3   4812.091 ±  620.043  ms/op
Timer.serialize                https://developers.google.com/protocol-buffers  avgt    3   1913.270 ±  406.705  ms/op

The score column tells us that, for the 100,000 objects, java.io.Serializable takes about 12 seconds to serialize and 34 seconds to deserialize. All the others are substantially faster.

Hazelcast’s HazelcastJsonValue is the winner this time, for this data model. JSON is really just a string where the structure is imposed retrospectively.
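
Using it is just a matter of wrapping the JSON string (a minimal sketch; the map name and data are arbitrary):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.HazelcastJsonValue;
import com.hazelcast.map.IMap;

public class JsonExample {
    public static void main(String[] args) {
        HazelcastInstance hazelcast = Hazelcast.newHazelcastInstance();

        IMap<String, HazelcastJsonValue> map = hazelcast.getMap("person");

        // The JSON stays a string; structure is only imposed when it is queried.
        String json = "{ \"firstName\" : \"Tom\", \"lastName\" : \"Baker\" }";
        map.put("tom", new HazelcastJsonValue(json));
    }
}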

More Warnings

The above are the results for randomly generated test data against a specific data model. Real data, if available, would be better.

There is no clear winner. Smallest is not the fastest.

There are also hidden considerations, such as the testing effort for hand-written serialization logic and the license restrictions of external libraries.

Remember that Hazelcast is a distributed store, so don’t forget the networking side. The time to serialize/deserialize may be negligible compared to the network transmission time, so optimizing for speed may not give much benefit here. But if you query a lot, the deserialization cost could be significant.

Network transmission time is not linear; data is transferred in blocks. The worst case is at the boundary between one and two blocks: if two blocks are needed, transmission time could double. Fortunately, the network buffers are easily tuned.
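
For example, the socket buffer sizes can be raised through Hazelcast properties (values are in KB; the numbers below are illustrative, not recommendations):

import com.hazelcast.config.Config;

class BufferTuning {
    static Config tunedConfig() {
        Config config = new Config();
        // Socket buffer sizes, in KB.
        config.setProperty("hazelcast.socket.send.buffer.size", "256");
        config.setProperty("hazelcast.socket.receive.buffer.size", "256");
        return config;
    }
}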

Summary

The serialization choice affects the size of the data object. Until the serialization choice is confirmed, it is misleading to look at the raw data and try to extrapolate how much memory will be needed.

One serialization choice can easily result in the data size being 10 times that of another choice.

If you have a lot of this type of data, it’s worth benchmarking, as this will indicate whether the extra development cost is justified by reduced memory usage and transmission time.

For other data there may not be many records, so the memory cost might be irrelevant and development simplicity is favored. Equally, if a near-cache is added, network transmission costs can become irrelevant.
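
Adding a near-cache is a configuration change on the client rather than a code change (a minimal sketch, assuming a map named "person"):

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.config.NearCacheConfig;
import com.hazelcast.core.HazelcastInstance;

public class NearCacheExample {
    public static void main(String[] args) {
        // Keep a local copy of recently read "person" entries on the client,
        // so repeated reads avoid the network round-trip.
        ClientConfig clientConfig = new ClientConfig();
        clientConfig.addNearCacheConfig(new NearCacheConfig("person"));

        HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
    }
}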

You should always consider whether non-Java clients will connect, now or in the future, as choices such as java.io.Serializable are not interoperable with other languages.

Find the code for the benchmark here.

Finally

Be very careful. If you optimize for size, speed may suffer for the likes of querying. The serialization choice needs a view of the application’s usage as a whole.