How Much Memory Do I Need for My Data?

How much memory do I need for my data?

This is a pretty common question at the start of Hazelcast projects.

“Load it and measure it” is one answer, which is accurate but not exactly popular.

So, let’s take a quick look at why capacity planning isn’t as simple as it seems, by stepping through disks, Java and finally Hazelcast storage details.

We will see that the response “it depends” features a lot.

See also here for a follow-up covering more on this topic.

File system

Before we start looking at Hazelcast, let’s look at some simple data. Imagine a file that holds these two lines.

12
345

How much disk space does this take? We can see 5 symbols, and know there are hidden new lines and/or carriage returns. For convenience, let’s assume there are no spaces at the end of lines.

So we’d guess maybe 7, then use some system utility to check.

Already we may be wrong in two ways, and the true figure could be as much as four times our guess.

Error #1

What we see on screen is 7 symbols displayed, and assume this means 7 bytes.

In a text file “1” needs at least 1 byte. But a symbol from outside the basic ASCII set, an accented letter for example, could need 2 bytes. Each symbol we see in the file might need 1 or 2 bytes.

If we know a block of symbols only need 1 byte for each, perhaps the storage technology can optimize for that. Or perhaps it plays safe, 2 bytes for each just in case.

This may double the size required, but “it depends” on the storage technology.
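As a rough illustration of that doubling, here is a minimal Java sketch that counts the bytes for our two-line file under two encodings. It assumes Unix-style line endings, and the class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class EncodingSize {
    // Bytes needed in UTF-8, which uses 1 byte for each ASCII symbol.
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed in UTF-16 (big-endian, no byte-order mark),
    // which uses 2 bytes for each of these symbols.
    static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String fileContent = "12\n345\n"; // assumption: one "\n" per line
        System.out.println(utf8Bytes(fileContent));  // 7
        System.out.println(utf16Bytes(fileContent)); // 14
    }
}
```

Same seven symbols, two different answers, purely from the choice of encoding.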

Error #2

Most operating systems provide some sort of size-on-disk utility that would allow us to measure, and confirm, the answer above.

There’s a potential error here, too. In a production system, some sort of disk mirroring will be in place.

The disk utility may report 7 bytes for that file on the disk, but it will be another 7 bytes on each mirror disk.

This may also double the size required, but “it depends” on the mirroring level.

Java

Now, the real fun begins. Our file contains symbols, they happen to be digits, but do we store them as numbers?

We’ll look at three obvious choices, “int”, “Integer” and “String”.

There are other choices, such as “BigDecimal”, or we could be more relaxed in our definition of “number” to include telephone numbers which have other characters (e.g. “(123)-456-7890” has dashes and brackets).

Primitive int

Java has a primitive intrinsic type, “int”.

Other languages call it “int32” or similar, which gives more of a clue to the size.

What it means is 4 consecutive bytes (32 bits of memory).

Assuming no signs and the least significant bits at the right, our “12” could be:

00000000000000000000000000001100 

Our “123” could be:

00000000000000000000000001111011 

Java provides a method “Integer.toBinaryString(i)” to print this for us.

The good news for sizing is that the size is fixed. It is 4 bytes, whether it is the number “12”, “1234” or “1234556”.
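One thing to watch is that “Integer.toBinaryString(i)” drops leading zeros, so it doesn’t show all 32 bits by itself. A small sketch that pads the output to full width could look like this; the class and method names are hypothetical.

```java
class IntBits {
    // Integer.toBinaryString omits leading zeros, so pad to the full 32 bits.
    static String bits(int value) {
        String raw = Integer.toBinaryString(value);
        return String.format("%" + Integer.SIZE + "s", raw).replace(' ', '0');
    }

    public static void main(String[] args) {
        System.out.println(bits(12));       // 28 zeros followed by 1100
        System.out.println(bits(123));      // 25 zeros followed by 1111011
        System.out.println(Integer.BYTES);  // 4, whatever the value held
    }
}
```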

Integer

Next we will glance at “Integer”, a Java object type that can hold such an “int” but also can be “null” to indicate that it doesn’t.

It’s just a glance though, as it’s down to the internals of Java what happens. JOL (Java Object Layout) is a pretty good place to look if you want to know more.

When we have “int” we have one section of memory holding that number.

When we have “Integer” we have two. We have a reference to a section of memory, and the section of memory itself. The size needed is the sum of the two.

Integer reference

Firstly, we have the object pointer, which is like an index. The first piece of memory, the object pointer, holds the location of the second piece of memory, the object itself.

If our JVM is 6GB in size (6,442,450,944 bytes), for example, then the object being pointed to must be stored somewhere in that memory.

Consequently, the object pointer needs in principle to be able to hold a value between 0 and 6,442,450,943. In practice, the JVM will organize that memory into sections, so some of the possible values won’t be applicable.

Storing a value between 0 and 6,442,450,943 needs 33 bits. Java likes 4-byte multiples, so will actually use 4 bytes (32 bits) or 8 bytes (64 bits) for this.

Which gets used depends on the size of the JVM, and whether truncation is possible for the object reference. Addresses that need 35 bits can be truncated to fit in 4 bytes without loss of accuracy, but this option may not be enabled or possible.
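The bit counts above can be checked with a little arithmetic. The sketch below is only an illustration, with made-up names, and it assumes objects sit on 8-byte boundaries so that the three low-order bits of every address are zero and need not be stored. On HotSpot this kind of truncation is known as compressed ordinary object pointers, but whether it applies to your JVM is something to verify rather than assume.

```java
class ReferenceBits {
    // Minimum number of bits needed to hold the given non-negative address.
    static int bitsNeeded(long maxAddress) {
        return Long.SIZE - Long.numberOfLeadingZeros(maxAddress);
    }

    // Bits needed when every address is a multiple of `alignment`
    // (a power of two), so the low-order zero bits can be dropped.
    static int bitsNeededAligned(long maxAddress, int alignment) {
        int droppedBits = Integer.numberOfTrailingZeros(alignment);
        return bitsNeeded(maxAddress >>> droppedBits);
    }

    public static void main(String[] args) {
        long sixGb = 6L * 1024 * 1024 * 1024;        // 6,442,450,944
        long thirtyTwoGb = 32L * 1024 * 1024 * 1024;

        System.out.println(bitsNeeded(sixGb - 1));                  // 33
        System.out.println(bitsNeededAligned(sixGb - 1, 8));        // 30
        System.out.println(bitsNeeded(thirtyTwoGb - 1));            // 35
        System.out.println(bitsNeededAligned(thirtyTwoGb - 1, 8));  // 32
    }
}
```

Under that assumption a 35-bit address just fits in a 4-byte reference, which is why the heap size matters so much for reference size.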

Integer object

Secondly, there is the object itself. This has a metadata header added by the JVM, plus finally the actual integer value.

The metadata header depends on the JVM implementation, but it’s almost always 12 bytes, so that’s a good enough value to use as if it was invariant.

The integer value we know by definition to be 4 bytes.

A grand total of 16 bytes for the Integer object: 12 for the header plus 4 for the value.

Recap for Integer

To store “12” or “123” in an “Integer” needs 4 or 8 bytes for the object reference and probably 16 bytes for the object itself.

20 or 24 bytes to hold “123” as a number.

A null pointer

For Integer and any objects, we have an object reference holding the address of an object that is elsewhere in the JVM’s memory.

A “null pointer”, the thing that gives us NullPointerException, is just a reference that doesn’t hold the address of an object.

If we wanted to size these, it is only the reference to count, 4 or 8 bytes. There is no object being referred to that needs to be counted, but that doesn’t make the object reference any smaller.

String

Strings are objects, so share similarities with Integer objects.

When we have a “String” in our code, it’s actually a reference to an area of memory holding the String object.

The reference to the area of memory is the same as for an Integer reference, 4 or 8 bytes.

That’s a good start, but from here Java interferes in an implementation-dependent way, so it’s difficult to give an exact answer.

Strings are constants

Strings in Java do not change, even if they look like they do.

An operation like i = i + 1 changes the value stored in an “int” variable.

An operation like s = s + "1" creates a new String. There are now two Strings in memory, at least until the Garbage Collector determines that the old String can be removed.

This gives the JVM flexibility for an operation such as s.substring(1). For the input String “123” the result is “23”. However, the JVM has a choice on how to do this, as it can exploit the fact that the strings are constants. It can create a new String “23” by copying the data in the first String, or it can make the second String an offset into the first String.

Which it does will depend on the JVM version and any runtime flags given to it. Clearly this doesn’t help with predicting size.
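To see the “two Strings in memory” effect for ourselves, here is a minimal sketch. The class and method names are invented for the example, and it shows only the visible behaviour, not which internal strategy the JVM picks.

```java
class StringConstants {
    // Concatenation at runtime produces a new String object.
    static String appendOne(String s) {
        return s + "1";
    }

    public static void main(String[] args) {
        String s = "123";
        String original = s;   // a second reference to the same String
        s = appendOne(s);      // s now refers to a different String

        System.out.println(original);               // 123, unchanged
        System.out.println(s);                      // 1231
        System.out.println(s == original);          // false, two objects
        System.out.println(original.substring(1));  // 23
    }
}
```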

String contains bytes

Fundamentally, a String is a series of characters, and in Java, characters can be stored in bytes. The rest is up to the JVM, and this makes our attempts to predict a size difficult.

Strings provide the constructor new String(byte[]) amongst other constructors.

An obvious, common but not required choice is merely to make the String object a wrapper around the byte[] object.

So if it was that, the String object would contain at least the usual Java object header of 12 bytes, plus a reference (4 or 8 bytes) to the byte[] object. That’s 20 bytes perhaps.

Similarly, the byte[] object would also have the Java object header of perhaps 12 bytes. Array objects have a length, another 4 bytes. Finally, the bytes themselves. 2 bytes for “12” or 3 bytes for “123”. That is 18 or 19 bytes.

All told, perhaps 40 bytes. In reality, often some more metadata for Strings, so maybe nearer 48 bytes.
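The running total is easy to lose track of, so here it is as a back-of-envelope calculation. This is only a sketch of the arithmetic in this section: the 12-byte header is the typical figure quoted earlier rather than a guarantee, the names are made up, and it ignores both padding and any extra fields a real String carries.

```java
class StringFootprint {
    static final int OBJECT_HEADER = 12; // assumption: typical header size
    static final int ARRAY_LENGTH = 4;   // length field of an array object

    // String object: header plus one reference to its byte[].
    static int stringObject(int referenceSize) {
        return OBJECT_HEADER + referenceSize;
    }

    // byte[] object: header, length field, then one byte per character.
    static int byteArray(int characters) {
        return OBJECT_HEADER + ARRAY_LENGTH + characters;
    }

    static int estimate(int characters, int referenceSize) {
        return stringObject(referenceSize) + byteArray(characters);
    }

    public static void main(String[] args) {
        System.out.println(estimate(2, 8)); // "12"  with 8-byte references: 38
        System.out.println(estimate(3, 8)); // "123" with 8-byte references: 39
        System.out.println(estimate(3, 4)); // "123" with 4-byte references: 35
    }
}
```

The raw sum lands just under 40, and real-world overheads push it up from there.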

Recap for String

47 bytes to hold “12” as text, 48 bytes to hold “123” as text, maybe.

Double counting

In our code, when we have “Integer” or “String” what we actually have is a reference to an area of memory in the JVM.

It’s quite possible to have two references to the same area of memory, as this is the difference between “==” and “equals()” tests for objects.

If we’re sizing and have two references to the same area of memory, it’s important not to count it twice, which the “==” test will confirm.
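One way to avoid double counting, sketched below, is to collect references in a set that compares by identity (“==”) rather than by “equals()”. The JDK’s IdentityHashMap does exactly that; the class and method names here are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

class DistinctObjects {
    // Counts distinct objects by identity (==), ignoring null references.
    static int countDistinct(List<?> references) {
        Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Object reference : references) {
            if (reference != null) {
                seen.add(reference);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        String a = new String("123");
        String b = a;                  // same object as a
        String c = new String("123");  // equal content, different object

        System.out.println(a.equals(c));                           // true
        System.out.println(a == c);                                // false
        System.out.println(countDistinct(Arrays.asList(a, b, c))); // 2
    }
}
```

Three references, but only two areas of memory to count.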

Alignment!

When we put data in memory, we don’t just put it anywhere.

On a Java virtual machine, data is frequently stored with 32-bit (4 byte) or 64-bit (8 byte) alignment. This speeds up data access due to the hardware connections to the actual CPU cores.

What this means in practice is unused space in memory.

We might imagine that if we stored a 2-byte data record it is stored starting at memory address 0, the very beginning of memory.

If we then stored another 2-byte data record, it could be stored 32-bits further on, starting at memory address 4. Or it could be stored 64-bits further on, starting at memory address 8. Either way, there’s nothing in memory address 2 and memory address 3, space that is sacrificed to speed up memory access.

If our objects are large, say hundreds or thousands of bytes, then the relative impact of wasting a few bytes for alignment reduces.

Ultimately this is storage we have to pay for, even if we can’t use it.

How much memory is wasted because of such constraints? “It depends” on how big our data records are, and if the alignment is to 32-bits, 64-bits or something else.
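The padding is simple to estimate once an alignment is chosen. The sketch below rounds a size up to the next multiple of the alignment; the alignment values are assumptions to experiment with, and the names are invented for this example.

```java
class Alignment {
    // Rounds size up to the next multiple of alignment.
    static int align(int size, int alignment) {
        return (size + alignment - 1) / alignment * alignment;
    }

    // Bytes left unused after the data, up to the next boundary.
    static int padding(int size, int alignment) {
        return align(size, alignment) - size;
    }

    public static void main(String[] args) {
        System.out.println(align(2, 4));      // 4, so 2 bytes unused
        System.out.println(align(2, 8));      // 8, so 6 bytes unused
        System.out.println(align(19, 8));     // 24, so 5 bytes unused
        System.out.println(padding(1001, 8)); // 7 bytes on a 1001-byte record
    }
}
```

For the 2-byte record most of the space is padding, while for the 1001-byte record it is under 1%.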

Inside data records

The section above notes that there may be gaps in memory between data objects, due to alignment. This can also occur inside objects, so that the fields in objects are similarly aligned.

What did we forget?

References!

We’ve put our digits into memory in Java, and later we will do the same in Hazelcast. We need to know where in memory.

We’ll ignore this for the rest of this article, other than to note that keeping track of where our data is needs some sort of meta-data. And meta-data, like ordinary data, needs space for storage. How much storage? “It depends”!

Hazelcast

Now it’s time for Hazelcast. How much space does this data take up?

“It depends”, of course. But on what?

On in-memory-format and backup-count.

in-memory-format

When you store data in Hazelcast, you can configure how it is stored.

Recall that a Java object exists in JVM memory, but when you send it from a client to a server it has to be serialized into a stream of bytes to transmit across the network.

The in-memory-format options control whether the receiving server should deserialize it back into Java or not.

If you select in-memory-format=OBJECT then the receiving Hazelcast server turns the byte stream back into a Java object, and from the above we know that can be a bit tricky to determine.

The other options, the default in-memory-format=BINARY and the off-heap equivalent in-memory-format=NATIVE, dictate that Hazelcast stores the data in the format in which it is transmitted. Serialized, in other words.

It turns out to be useful to know the size in memory serialized, as this is also the size for transferring across the network. At some point we may also be interested in the network transmission time, and this will be affected by how big the data is when moving across the network. The size is the same for in-memory-format=BINARY and in-memory-format=NATIVE so we only need consider one.

in-memory-format=BINARY

You could have coded your Java storage for “12” and “123” with “int”, “Integer” or “String”. How much space do these take up serialized?

We get off to a good start here. Hazelcast is an object store, amongst other things. So despite what it may look like in an IDE, you can’t put a Java primitive like “int” into Hazelcast. There are several reasons: for instance, Hazelcast does not just work with Java, and a primitive is not an object.

If you code anything like this:

int i = 123;
hazelcastInstance.getQueue("name").offer(i);

Behind the scenes, the compiler will quietly change “int” to “Integer”. The code will actually be more like this:

int i = 123;
hazelcastInstance.getQueue("name").offer(Integer.valueOf(i));

For sizing at least, this is good news. We only have to think about “Integer” and “String”.

How big is “Integer” when serialized?

This is easy. When serializing an “Integer”, you have to allow for it being null. Apart from that, we know by definition that an integer takes up exactly 4 bytes.

Hazelcast adds an 8-byte header so that the receiver knows what kind of thing it is going to receive. Adding 4 bytes for the “Integer” gives us a size of 12.

Serialized, “Integer” (and “int”) needs 12 bytes.

How big is “String” when serialized?

This is a bit trickier than serializing an “Integer”, but not by much. Again, you have to consider that the object may be null, and also that a String has a variable length.

Again there is the 8-byte header added by Hazelcast for the same reason, and since this is a variable-length item, 4 bytes for the length of the item to follow.

For “12” and “123”, each character can be transferred as a single byte, and so it is. “12” is the header and length plus 2 bytes, and “123” is the header and length plus 3 bytes.

In other words, “12” takes 14 bytes and “123” takes 15. If our String was “1234” this would be 16. Nice and easy.

String or Integer?

Serialized, 12345678 would take 12 bytes as “Integer” as this is a fixed size, and 20 as a “String”.

The maximum value that can be stored as an integer is 2147483647. This is still 12 bytes when serialized as an “Integer” and 22 bytes when serialized as a “String”. The latter is larger but not dramatically. For sizing, the choice of “Integer” or “String” would suggest the former but it’s not as radically different as you might imagine.

If we were to consider 9223372036854775807, the maximum “Long” value, this is 16 bytes serialized as a “Long” and 31 as a “String”. Serialized, strings aren’t so bad.
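The figures in this section all follow the same pattern, so they can be reproduced with a few lines of arithmetic. This sketch uses only the numbers quoted above (an 8-byte header and a 4-byte length for variable-length items) and assumes every character fits in one byte, as digits do. It does not call Hazelcast, and the names are invented, so treat it as a planning aid rather than a measurement.

```java
class SerializedSizeEstimate {
    static final int HEADER = 8;       // header size quoted in the text
    static final int LENGTH_FIELD = 4; // length prefix for variable-length items

    static int integerSize() {
        return HEADER + Integer.BYTES;
    }

    static int longSize() {
        return HEADER + Long.BYTES;
    }

    // Assumption: one byte per character, which holds for digits.
    static int stringSize(String s) {
        return HEADER + LENGTH_FIELD + s.length();
    }

    public static void main(String[] args) {
        System.out.println(integerSize());                                 // 12
        System.out.println(stringSize("123"));                             // 15
        System.out.println(stringSize("12345678"));                        // 20
        System.out.println(stringSize(String.valueOf(Integer.MAX_VALUE))); // 22
        System.out.println(longSize());                                    // 16
        System.out.println(stringSize(String.valueOf(Long.MAX_VALUE)));    // 31
    }
}
```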

backup-count

This config flag controls how many copies of your data record to keep in the cluster.

One server in the cluster keeps the primary copy of your data record, the target for read and write operations. (It’s not the same server for every data record.)

If you set `backup-count=1`, or leave it to default to 1, there will be one other server in the cluster with a backup copy of your data record. This is for resilience, the backup isn’t ordinarily used for much. However, if the server with the primary copy fails for some reason, the existence of the backup copy means the data isn’t lost.

You can configure for a higher number. For example, `backup-count=2` means 2 backup copies, making 3 copies in total including the primary copy.

It’s fairly obvious how this setting affects the memory needed. If you have somehow determined one copy of your data needs 100MB, then three copies will need 300MB. All copies are the same size, backups aren’t compressed or in any way handled differently.
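As a quick sanity check on that multiplication, here is a tiny sketch. The method name is made up, and it assumes, as stated above, that every copy is the same size.

```java
class BackupSizing {
    // Total bytes across the cluster: one primary plus backupCount backups.
    static long totalBytes(long primaryBytes, int backupCount) {
        return primaryBytes * (1L + backupCount);
    }

    public static void main(String[] args) {
        long oneCopy = 100L * 1024 * 1024; // 100MB for a single copy

        System.out.println(totalBytes(oneCopy, 0) / (1024 * 1024)); // 100
        System.out.println(totalBytes(oneCopy, 1) / (1024 * 1024)); // 200
        System.out.println(totalBytes(oneCopy, 2) / (1024 * 1024)); // 300
    }
}
```

Going from one backup to two takes the total from 200MB to 300MB, an increase of half again.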

This setting is similar to disk mirroring. More copies increase the resilience, at the expense of storage costs, network traffic and slower performance on writes with more places to write.

What’s the catch?

In many cases, your data model won’t be simple things like “String” or “Integer”, but more likely compound objects such as:

    class Person {
        private String firstName;
        private String lastName;
        private int age;
    }

This is a questionable data model. We would ideally hold the date of birth and derive the age, but let’s assume we don’t care about that for now or that it is updated daily.

How big is this?

We know how to size a “String” and an “int”, so it sounds like it should be easy.

Unfortunately, this turns out to depend on the choice of serialization framework. There are several of these, which makes it too big a topic to include in this discussion. We’ll cover it separately in another post to follow.

Conclusion

Sizing is an exact science, but you need a detailed understanding of the factors involved. You need to understand the details of disk mirroring, JVM tuning and Hazelcast configuration.

A simple configuration flag such as “backup-count” could easily be changed on a running project from 1 to 2, and the memory usage would jump by 50%.

It’s important therefore to keep tight control on who can change such flags.

But it’s more important to keep an eye on actual memory usage.

Security policies may force the application of a Java patch, even if the same patch changes storage internals in a way that has a negative effect on your system.

Similarly, the data may change over time. A simple change like adding the international country code to domestic telephone numbers makes the data bigger.

Such changes are the things to be careful of. Size may drift upwards from your predictions.

For more, follow the successor post here.