Introduction to Compact Serialization

Metin Dumandag | Dec 12, 2022

Hazelcast offers various serialization mechanisms to convert user classes into a series of bytes, which Hazelcast can understand.

With these in-house serialization mechanisms, Hazelcast can run queries, perform indexing, execute user code, and more on top of the data stored in its distributed data structures, more efficiently than with user-provided custom serialization.

However, each of those serialization mechanisms comes with trade-offs: some are optimized for binary size (IdentifiedDataSerializable), some for query speed (Portable), and some for ease of use (HazelcastJsonValue).

With those in mind, we wanted to create a new serialization mechanism that would combine all the good things offered by the existing Hazelcast serialization mechanisms and would be:

  • As compact as possible in the serialized form so that less data is transferred over the network and stored in the memory.
  • Efficient for all sorts of queries and indexing
  • Easy to use

This post will introduce the new format and show how to use it.

Overview of Compact Serialization

The core idea behind the new format is to have a schema that describes the serialized data, instead of duplicating that description in every serialized payload. With this format, the serialized data carries only an 8-byte fingerprint that uniquely identifies the schema. The schema and the data are separate entities, connected by that fingerprint. Introducing Compact Serialization.
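To make the fingerprint idea concrete, here is a minimal plain-Java sketch. The class name, the schema string, and the hash function are all illustrative stand-ins, not Hazelcast's actual algorithm; the point is only that the schema lives once in a registry, while each payload needs just an 8-byte identifier to find it.

```java
import java.util.HashMap;
import java.util.Map;

public class SchemaRegistrySketch {

    // A toy 64-bit fingerprint for this sketch; Hazelcast derives the real
    // one from the schema contents with its own algorithm.
    static long fingerprintOf(String schemaDefinition) {
        long h = 1125899906842597L; // arbitrary seed for this sketch
        for (char c : schemaDefinition.toCharArray()) {
            h = 31 * h + c;
        }
        return h;
    }

    public static void main(String[] args) {
        Map<Long, String> schemaRegistry = new HashMap<>();

        String employeeSchema = "employee{id:int32,name:string}";
        long fingerprint = fingerprintOf(employeeSchema);

        // The schema is registered once...
        schemaRegistry.put(fingerprint, employeeSchema);

        // ...and every serialized payload only needs to carry the 8-byte
        // fingerprint to resolve its schema.
        String resolved = schemaRegistry.get(fingerprint);
        System.out.println(resolved.equals(employeeSchema)); // prints "true"
    }
}
```

Because the fingerprint is derived from the schema itself, two applications that produce the same schema resolve to the same registry entry without any coordination.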

Compact is a massive improvement over Portable Serialization, where each portable binary carries the schema. Depending on the schema and the data, switching to the new Compact Serialization format can save up to 40% of the memory occupied in Hazelcast, or of the bytes transferred over the network, for some workloads.

Similar to Portable, having a schema that describes the data allows Hazelcast to deserialize only the fields used in queries or indexing operations, resulting in massive performance gains compared to formats like IdentifiedDataSerializable, which lack this feature.

Having a schema also allows you to evolve it by adding or removing fields and still be able to read and write data between old and new applications.

The new format currently supports the following types, which can be used as building blocks to serialize all sorts of data:

  • Fixed-size types like boolean, int8, int16, int32, int64, float32, float64
  • Nullable versions of the fixed-size types
  • Variable-size types like string, decimal, time, date, timestamp, timestamp with timezone
  • Arrays of the types listed above
  • Nested Compact serializable objects containing the types listed above, and arrays of them

With the exception of the fixed-size types, all of the others are considered variable-size.
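The fixed/variable distinction matters for the binary layout. A small plain-Java sketch (illustrative only, not Hazelcast's actual encoding) shows why: a fixed-size value always encodes to the same number of bytes regardless of its value, so its offset can be computed directly, while a variable-size value's length depends on its content.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class FieldSizeSketch {

    // An int32 always encodes to exactly 4 bytes, whatever the value is.
    static int encodedInt32Size(int value) {
        return ByteBuffer.allocate(4).putInt(value).array().length;
    }

    // A string's encoded size depends on its content.
    static int encodedStringSize(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(encodedInt32Size(7));            // prints 4
        System.out.println(encodedInt32Size(1_000_000));    // prints 4
        System.out.println(encodedStringSize("Jo"));        // prints 2
        System.out.println(encodedStringSize("Johnathan")); // prints 9
    }
}
```

This is why variable-size fields need per-field offsets in the serialized form, while fixed-size fields do not.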

Also, the new format is entirely language-independent: different client languages can work on the same data.

API

Let’s see how you can use Compact Serialization for your classes.

We will use Java for the code snippets below for demonstration purposes. We tried to mimic the public APIs across all the client languages, so it should be trivial to port these examples to other languages.

Assuming you have the following class:

public class Employee {

   private final int id;
   private final String name;
   private final boolean isActive;
   private final LocalDate hiringDate;


   public Employee(int id, String name, boolean isActive, LocalDate hiringDate) {
       this.id = id;
       this.name = name;
       this.isActive = isActive;
       this.hiringDate = hiringDate;
   }

   public int getId() {
       return id;
   }

   public String getName() {
       return name;
   }

   public boolean isActive() {
       return isActive;
   }

   public LocalDate getHiringDate() {
       return hiringDate;
   }
}

You can write a serializer for it without touching a single line of the class above. The Compact Serialization APIs do not require your classes to implement special interfaces like Portable or IdentifiedDataSerializable, allowing you to use existing legacy classes in your application.

The serializer, not the class, must implement the CompactSerializer interface, as shown below.

public class EmployeeSerializer implements CompactSerializer<Employee> {
   @Override
   public Employee read(CompactReader reader) {
       int id = reader.readInt32("id");
       String name = reader.readString("name");
       boolean isActive = reader.readBoolean("isActive");
       LocalDate hiringDate = reader.readDate("hiringDate");
       return new Employee(id, name, isActive, hiringDate);
   }

   @Override
   public void write(CompactWriter writer, Employee employee) {
       writer.writeInt32("id", employee.getId());
       writer.writeString("name", employee.getName());
       writer.writeBoolean("isActive", employee.isActive());
       writer.writeDate("hiringDate", employee.getHiringDate());
   }

   @Override
   public String getTypeName() {
       return "employee";
   }

   @Override
   public Class<Employee> getCompactClass() {
       return Employee.class;
   }
}

The serializer consists of four parts, and we will go over them one by one.

The read method is the one Hazelcast calls while deserializing the stream of bytes into your classes. In this method, you are expected to read the fields of your objects with the field names and types you used in the write method, and return an instance of your class built from those fields. You don't deal with schemas in the public API; Hazelcast takes care of that logic, and all you do is provide the fields you want to read.

In the write method, you write the fields of your object to a writer, with the types and field names according to your class. The field names provided to the writer methods don't have to match the actual field names, but it is good practice to be consistent. Hazelcast will use only the field names provided to the writer methods when referencing those fields. This method is also used to generate a schema for your class.

The getTypeName method returns a type name, provided by you, that must be unique in the cluster. The type name is the piece of information used to match serializers written in different client languages, and to match evolved versions of serializers written in the same language. It is part of the schema Hazelcast automatically generates from your serializer.

The getCompactClass method returns the class you have written the serializer for, so that Hazelcast can map instances of that class to the serializer.

After writing the serializer, you must register it in your configuration. We will show the programmatic way to configure the serializer, but the declarative configuration is very similar.

After the registration, you can start using instances of your class in everything supported by Hazelcast.

Config config = new Config();
config.getSerializationConfig()
        .getCompactSerializationConfig()
        .addSerializer(new EmployeeSerializer());

HazelcastInstance instance = Hazelcast.newHazelcastInstance(config);
IMap<Integer, Employee> employees = instance.getMap("employees");
employees.put(1, new Employee(1, "John Doe", true, LocalDate.of(2022, 1, 1)));

Employee employee = employees.get(1);

In the upcoming months, we will also be working on a code-generator tool that automatically generates serializers in different languages from schema files, to make writing serializers even easier.

Let us also show how to achieve schema evolution support with the Compact Serialization APIs.

Imagine that we have added two new fields: a department string and an age byte.

private final String department;
private final byte age;

The only thing we need to do to support these fields is to change the read and write methods.

We need to update the write method with those new fields and write them with the correct types and field names.

@Override
public void write(CompactWriter writer, Employee employee) {
  writer.writeInt32("id", employee.getId());
  writer.writeString("name", employee.getName());
  writer.writeBoolean("isActive", employee.isActive());
  writer.writeDate("hiringDate", employee.getHiringDate());
  writer.writeString("department", employee.getDepartment());
  writer.writeInt8("age", employee.getAge());
}

The read method requires a bit more care: we now need to read the new fields conditionally. For this, the Compact reader provides an API to check for the existence of a field by its name and field kind. With its help, we can either read the data from the reader or fall back to an appropriate default value.

@Override
public Employee read(CompactReader reader) {
  int id = reader.readInt32("id");
  String name = reader.readString("name");
  boolean isActive = reader.readBoolean("isActive");
  LocalDate hiringDate = reader.readDate("hiringDate");

  String department;
  if (reader.getFieldKind("department") == FieldKind.STRING) {
      department = reader.readString("department");
  } else {
      department = "N/A";
  }

  byte age;
  if (reader.getFieldKind("age") == FieldKind.INT8) {
      age = reader.readInt8("age");
  } else {
      age = -1;
  }

  return new Employee(id, name, isActive, hiringDate, department, age);
}

There is no explicit versioning on the serializer or the class. Hazelcast automatically recognizes this as a different version of the same class (through the type name) and assigns a new schema and schema ID to it.

Zero-Config

One of the core requirements of Compact Serialization was the ease of use.

We tried to achieve that by requiring no changes to user classes and by simplifying schema evolution and versioning.

We went even further for some of the languages: for Java and C#, we provide a Compact serializer even if the user supplies no configuration or serializer at all. For our other clients, it can be hard to extract the correct type information from instances at runtime.

Previously, when Hazelcast couldn't find a serializer for a type, it threw a HazelcastSerializationException, saying that the given object could not be serialized because there was no applicable serializer for it.

Now, we try one last time to create a serializer for your class, using different mechanisms in different languages. For Java, we rely on reflection.

If the fields of your object are of types supported by zero-config serialization, we use a reflective Compact serializer, generated automatically on the fly.
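As an illustration of the underlying idea (this is not Hazelcast's actual code, and the names are made up for the sketch), plain Java reflection is enough to discover field names and types at runtime, which is what makes building a schema and a serializer on the fly possible:

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public class ReflectiveSchemaSketch {

    // Enumerate the declared fields of a class as "name:type" strings,
    // the raw material a reflective serializer would build a schema from.
    static List<String> describeFields(Class<?> clazz) {
        List<String> fields = new ArrayList<>();
        for (Field field : clazz.getDeclaredFields()) {
            fields.add(field.getName() + ":" + field.getType().getSimpleName());
        }
        return fields;
    }

    // A stand-in user class with no special interfaces or annotations.
    static class Employee {
        int id;
        String name;
    }

    public static void main(String[] args) {
        // Discovers "id:int" and "name:String" without any configuration.
        System.out.println(describeFields(Employee.class));
    }
}
```

Once the field names and types are known, reading and writing them generically is a matter of mapping each type to the corresponding reader/writer call.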

For Java, we even support Java records, so you can serialize/deserialize them without requiring a serializer!

Imagine you have the above Employee class or the following Employee record:

public record Employee(
      int id,
      String name,
      boolean isActive,
      LocalDate hiringDate) {
}

You can use it, with zero configuration!

HazelcastInstance instance = Hazelcast.newHazelcastInstance();
IMap<Integer, Employee> employees = instance.getMap("employees");
employees.put(1, new Employee(1, "John Doe", true, LocalDate.of(2022, 1, 1)));

Employee employee = employees.get(1);

Conclusion

We designed the Compact Serialization format with the majority of Hazelcast users in mind: those with long-living clients/members who use Hazelcast as a data platform where data is manipulated and queried in different ways. For this use case, Compact Serialization is your best bet among the serialization mechanisms in Hazelcast. The only cost the Compact format brings is replicating the schemas across the cluster and to the clients, which happens only once per schema. We believe these operations are amortized in the long run, and that the benefits of smaller data sizes and fast queries are worth it.

We are really looking forward to your feedback and comments about the new format. Don't hesitate to try it out and share your experience with us in our community Slack or GitHub repository.

About the Author

Metin Dumandag

Software Engineer

Metin is a Software Engineer working on the Clients Team. Since joining Hazelcast as an intern, he has been working on various clients such as Python, Node.js, and Java.