Testing distributed resilient applications powered by Hazelcast
Applications powered by Hazelcast and that use it to drive business logic need tests that go beyond happy-path validation. Serialization, partition-aware routing, failover handling, listener wiring, retry policies, and MapStore error translation all need to be validated under realistic conditions, including edge cases like member loss, backup promotion, and client reconnection.
Testing this against an externally managed cluster is brittle. The cluster becomes shared mutable state across multiple test suite runs: tests interfere through leftover data, partition layout shifts between runs, and cluster availability becomes a precondition for your build to pass. You can’t control failure timing, so resilience paths go untested or get covered by assertions that pass by accident.
Hazelcast’s test support solves this by letting you run real cluster behavior entirely in-JVM using a mock network stack. Unlike normal embedded Hazelcast, there are no real sockets, no port allocation, no external infrastructure. Each test controls its own topology, its own data, and its own failure schedule. Failover, backup promotion, retries, and listener delivery become part of your functional contract, verified on every commit. The result is a testing model that works with Test-Driven Development, not against it. These are the same utilities Hazelcast uses in its own test suite and they’re fully battle-tested.
This post walks through what testing distributed applications powered by Hazelcast looks like in practice: unit tests validate local policy; in-JVM cluster tests validate distributed semantics; full system tests are reserved for network, security, and performance concerns. The code snippets are from a GitHub repository with executable samples. For full context, refer to the repo itself and to the official Hazelcast docs on this subject.
Setup
Hazelcast distributes the test support classes via a separate JAR made available via the tests classifier:
<dependency>
<groupId>com.hazelcast</groupId>
<artifactId>hazelcast</artifactId>
<version>{hz.version}</version>
<classifier>tests</classifier>
</dependency>
For the full dependency setup, please refer to the set up dependencies documentation.
Start with a Real Cluster Minus the Network
The foundation is TestHazelcastFactory. It creates fully functional members and clients using a mock network stack. Everything above the transport layer behaves as it would in production.
A minimal multi-member + client setup looks like this:
// preallocate resources for 2 members TestHazelcastFactory factory = new TestHazelcastFactory(2); // create two members and a client instance HazelcastInstance member1 = factory.newHazelcastInstance();HazelcastInstance member2 = factory.newHazelcastInstance();HazelcastInstance client = factory.newHazelcastClient();
This is not “embedded Hazelcast” in the usual sense. You still get partition ownership, replication, client/server separation, and async visibility guarantees. What you skip is socket binding and discovery, and that’s what makes these tests fast enough to run continuously.
Setup and Teardown
Every test that creates a TestHazelcastFactory must shut it down. Leaked instances accumulate memory, hold partition state, and interfere with subsequent tests. The factory itself is lightweight but you should create it once per class and call shutdownAll() when done:
private static TestHazelcastFactory factory;
@BeforeAll
static void initFactory() {
factory = new TestHazelcastFactory();
}
@AfterAll
static void shutdownFactory() {
factory.shutdownAll();
}
Where you create Hazelcast instances depends on what you’re testing:
// Per class — when tests share state or the instance is expensive to configure
@BeforeAll
static void setup() {
Config config = new Config();
config.setClusterName(randomName());
hz = factory.newHazelcastInstance(config);
}
// Per test — when tests need a clean instance to avoid interference or different config per test
@BeforeEach
void setup() {
Config config = new Config();
config.setClusterName(randomName());
hz = factory.newHazelcastInstance(config);
}
// Inline — when the test needs to control lifecycle directly
@Test
void failoverPromotesBackup() {
Config config = new Config();
config.setClusterName(randomName());
HazelcastInstance member1 = factory.newHazelcastInstance(config);
HazelcastInstance member2 = factory.newHazelcastInstance(config);
member1.shutdown();
assertClusterSizeEventually(1, member2);
}
Note the use of com.hazelcast.test.HazelcastTestSupport.randomName() for cluster names. Instances with different cluster names form separate clusters within the same factory, which means tests can run in parallel without cross-contamination. This is especially important when tests run concurrently in CI: without distinct cluster names, instances from unrelated tests will merge into the same cluster and produce unpredictable failures.
Assert Asynchronous Reality Explicitly
Hazelcast operations on AP data structures (maps, queues, caches, topics) are inherently asynchronous across the cluster, and client-observed state is eventually consistent by design. CP subsystem primitives (FencedLock, AtomicLong, AtomicReference) follow a different model: they provide linearizable guarantees backed by Raft consensus and are out of scope here.
For AP structures, tests that assume immediate visibility are brittle. The test utilities encode this reality directly:
assertClusterSizeEventually(2, member1); assertTrueEventually(() -> assertEquals("value1", client.getMap("map").get("key1")));
These assertions reflect how the system actually behaves: cluster formation takes time, replication is eventual, and clients observe state asynchronously.
Mocking Hazelcast API Is a Last Resort (and the Code Shows Why)
Mocking should be deliberate. As argued in Growing Object-Oriented Software, Guided by Tests, if you mock objects you don’t own, you’re effectively re-implementing their behavior in your test suite. That’s how you get green tests and broken systems.
A valid use case is failure injection at your integration boundary. For example, verifying that a MapStore exception is translated into a domain-level ServiceException. Reproducing a real “DB down” scenario can be slow and unreliable. A controlled mock gives you a deterministic failure:
MapStore<String, Customer> failingMapStore =
(MapStore<String, Customer>) mock(MapStore.class);
when(failingMapStore.load("c1"))
.thenThrow(new HazelcastSqlException("Injected failure",
new SQLException("downstream DB error")));
HazelcastInstance hz = // real instance with wired MapStore
CustomerService service = new HzCustomerService(hz);
ServiceException ex = assertThrows(ServiceException.class,
() -> service.findCustomer("c1"));
assertEquals("Find customer failed", ex.getMessage());
assertEquals("Injected failure", ex.getCause().getMessage());
This proves that boundary failures are translated consistently, the public error contract holds, and the original cause is preserved for diagnostics.
Notice that the runtime itself is not mocked and a real instance is started; only the boundary is swapped. By mocking the infrastructure APIs directly, you lose the behaviors that usually fail in production: lifecycle semantics, concurrency effects, serialization, retries, listener execution.
Rules of thumb
- Mock your code around Hazelcast (services, adapters, mappers, your MapStore/EntryProcessor logic), not Hazelcast APIs (HazelcastInstance, IMap, IQueue, Jet, CP primitives) by default.
- Use mocks mainly to inject failures at the edges: load() throwing, listener callbacks failing, downstream DB/timeouts, so you can assert your retry/wrap/telemetry policy deterministically.
- Use a real in-JVM Hazelcast cluster (TestHazelcastFactory) whenever correctness depends on Hazelcast semantics: serialization, partitioning, backups, split/merge behavior, eventual visibility, listener delivery, eviction/near-cache, retries/timeouts.
- Keep unit tests quick, but treat in-JVM cluster tests as your “truth layer” for distributed behavior; reserve full system tests for real network/TLS/WAN/performance.
Testing Failure Is Straightforward and Deterministic
Member failure is not an edge case in Hazelcast. The test support makes it explicit and testable.
From the repo:
member1.shutdown(); assertClusterSizeEventually(1, client);assertEqualsEventually(() -> client.getMap("testMap").get("key1"), "value1");
A single shutdown call validates that the cluster membership updates correctly, the client remains functional across the failure, and data survives the member loss.
Where listeners come in is testing how your application reacts to these events. The repo extends the example by registering a mocked MembershipListener on the client before shutdown, then verifying that the removal event fires with the expected member attributes. That’s wiring verification: confirming the listener is registered on the right lifecycle hook.
The more interesting case is when your application does something in response to failure. Say your service switches to a fallback data source when cluster capacity drops. You’d use your real listener with mocked dependencies and let the cluster deliver a real membership event via member1.shutdown(). Your test asserts on what your code did in response: did it call the fallback? Did it emit the right metric? The cluster does the real work. Mockito only appears at your application boundary.
Event Listeners: Test Wiring, Not Internals
The approach discussed in the previous paragraph is valid for any other type of listener.
map.addEntryListener(mockListener, true); map.put("key1", "initial");map.put("key1", "updated"); verify(mockListener, timeout(1000)) .entryUpdated(any(EntryEvent.class));
The cluster is real and the listener logic is isolated — what’s being tested is not Hazelcast’s event system but your wiring: that the listener is registered correctly, that your code handles at-least-once delivery, and that it does so within expected time bounds.
Streaming Pipelines: Test Semantics, Not Sequences
Jet pipelines are tested using the same in-JVM approach. The key difference is semantics: Streaming jobs can restart and reprocess data depending on configuration, so tests should assert semantics that remain true under retries and restarts.
The tests reflect that explicitly.
Batch pipeline:
Pipeline p = Pipeline.create();p.readFrom(TestSources.items(1, 2, 3)) .writeTo(Sinks.list("out")); jet.newJob(p).join();
Streaming pipeline with automatic termination:
pipeline .readFrom(TestSources.itemStream(50, generator)) .withoutTimestamps() .apply(assertCollectedEventually(5, items -> assertTrue(items.size() >= 10) ));
Notice what isn’t asserted: there’s no strict ordering, no exact counts, no reliance on job completion. The test encodes correctness in terms that survive restarts and retries.
Integration Tests Without Infrastructure
The repo also includes full service-level integration tests:
CustomerService customerService = new HzCustomerService(instance);OrderService orderService = new HzOrderService(instance); customerService.save(new Customer("c1", "Alice"));orderService.placeOrder(new Order("o1", "c1", "Laptop"));
This validates cross-service interaction through shared distributed state, using a single in-JVM member. When the services under test don’t depend on partition-aware behavior, one member is enough. If they do, you can wire in multiple members using the same TestHazelcastFactory setup shown earlier.
This fills a gap that neither unit tests nor system tests cover well. Mocks can verify your failover handler is called but not that it’s called under the right conditions. System tests can validate the full chain but failures are hard to trigger deterministically and expensive to run on every commit. The in-JVM approach gives you real failure semantics with the speed of a unit test, without needing production infrastructure to provoke them.
When You Actually Need More
None of this replaces system-level testing. WAN replication, TLS validation, network partition detection, split-brain merge, and real latency behavior all depend on actual TCP connections and physical (or virtual) network topology. The mock network stack doesn’t simulate any of that, by design.
For performance benchmarking and production-realistic failure testing, Hazelcast provides Hazelcast Simulator. It deploys clusters across multiple machines, generates controlled workloads, and lets you inject failures (member crashes, network partitions, resource exhaustion) while measuring throughput, latency percentiles, and recovery time. It answers the questions in-JVM tests can’t: how does your cluster behave at 50,000 ops/sec when a member goes down? How long does split-brain recovery take with your merge policy and your data volume?
The in-JVM tests prove your logic is correct. Simulator proves it holds up under load. By the time you get there, you’re testing capacity and operational behavior, not chasing logic bugs.
The Real Value Proposition
What makes this work is that the in-JVM cluster isn’t a simulation. It’s the same execution model you rely on in production, minus the physical network. Partitions are owned, backups are promoted, listeners fire, jobs restart. The mock network stack removes infrastructure cost without removing distributed behavior.
The practical consequence is that resilience becomes something you test-drive during normal development. Failover, retry behavior, and state recovery are validated in code on every commit, not discovered in staging or deferred to a quarterly resilience exercise.
For complete examples, edge cases, and variants (JUnit 4/5, client vs member, batch vs streaming), refer to the repository.
Join our community and get started
Stay engaged with Hazelcast—join our Community Slack channel to connect with peers, share insights, and influence future features. Sign up for a free Enterprise Trial License to try all our enterprise features. Build scalable, resilient distributed systems today!