An Experiment in Streaming: Bytecode Continuous Deployment

Once one starts their journey in data streaming, one discovers a lot of applications beyond the standard Extract-Transform-Load pattern.

The traditional model to deliver a new version of a Java application is to stop the process, deploy the new JAR/WAR, and start the process again. This results in downtime, and in this day and age, most companies frown upon downtime as it directly translates into a loss of revenue.

In this blog post, we are going to use streaming to continuously deploy bytecode to a running JVM without needing to stop it.

Thanks to Neil Stevenson, who was behind the idea.

The Attach API

Before diving into the streaming part, we need a solid understanding of a couple of APIs offered by the Java platform – regardless of the actual language (Java, Kotlin, Scala, Clojure, etc.).

Among them stands the Attach API. It’s not very well-known, but it allows us to change the code of a class inside a running JVM. Interestingly enough, it doesn’t require the target JVM to run in debug mode, as HotSpot class reloading does.

There are only two requirements:

  1. The attaching JVM must know the PID of the JVM which runs the to-be-modified code
  2. Both must run on the same machine (whether virtual or physical)

The Attach API is a double-edged sword: while it can allow the continuous deployment of bytecode, it’s also a security issue. If you can update the bytecode, anybody who has access to the system can. Hence, it’s mandatory to carefully evaluate the benefits-risks ratio and to put security checks in place to manage the risks. That being said, it’s an extremely powerful tool in the JVM ecosystem.

The (simplified) design of the Attach API is quite straightforward:

Attach API class diagram

The first step is to get a handle on a VirtualMachine instance. This is the goal of the attach() method, which requires the PID of the Java process to attach to.
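As a minimal sketch, the whole interaction fits in a few lines of Kotlin (the PID and the agent path are placeholders; on Java 9+, the API lives in the jdk.attach module):

import com.sun.tools.attach.VirtualMachine

fun main() {
    // Attach to the target JVM identified by its PID
    val vm = VirtualMachine.attach("12345")
    try {
        // Second step: load the agent JAR into the target JVM (see next section)
        vm.loadAgent("/path/to/agent.jar")
    } finally {
        vm.detach()
    }
}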

The second step is to actually load the Java agent, the loadAgent() call in the sketch above. This requires a bit of an introduction.

Quick Introduction to Java Agents

In short, a Java agent is a JAR with specific data in its manifest that allows it to change classes when they are loaded by a class loader. It achieves this by using the Instrumentation API (one more API to know about).

Agents can be categorized into two different buckets:

  • Static agents are set once and for all during the launch of a JVM via the -javaagent parameter
  • Dynamic agents are set via the Attach API (see above section)

Here is a summary of the main differences:

Static agents
  • Entry point: public static void premain(String agentArgs, Instrumentation inst), or public static void premain(String agentArgs)
  • Manifest entry: Premain-Class

Dynamic agents
  • Entry point: public static void agentmain(String agentArgs, Instrumentation inst), or public static void agentmain(String agentArgs)
  • Manifest entry: Agent-Class

Note that nothing prevents a specific agent from being available as both a static and a dynamic agent, provided it fulfills the requirements of both categories.
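As an illustrative sketch, such a dual-purpose agent could look like this in Kotlin (the DualAgent name is made up):

import java.lang.instrument.Instrumentation

object DualAgent {

    // Static entry point, invoked before main() when started with -javaagent
    @JvmStatic
    fun premain(agentArgs: String?, inst: Instrumentation) = install(inst)

    // Dynamic entry point, invoked when loaded through the Attach API
    @JvmStatic
    fun agentmain(agentArgs: String?, inst: Instrumentation) = install(inst)

    private fun install(inst: Instrumentation) {
        // Keep a reference to the Instrumentation instance for later use
    }
}

The corresponding manifest would then reference the same class twice, and additionally allow class redefinition:

Premain-Class: DualAgent
Agent-Class: DualAgent
Can-Redefine-Classes: true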

A simplified diagram of the Instrumentation API looks like this:

Instrumentation API class diagram

The entry-point into the Instrumentation API is the Instrumentation interface itself. The way to get a reference to such an instance is to use the aforementioned premain() or agentmain() static methods, which have an Instrumentation parameter. The JVM calls that method and provides the instance, which can be used “on the spot” or stored for later usage.

Once one has an Instrumentation instance, it can be used to:

  • Change existing class implementations (with some limitations)
  • Apply automatic transformation of bytecode
  • etc.
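For example, changing an existing class implementation boils down to a single call. A minimal sketch, assuming the Instrumentation instance was stored by the agent’s entry point:

import java.lang.instrument.ClassDefinition
import java.lang.instrument.Instrumentation

fun redefine(inst: Instrumentation, className: String, bytecode: ByteArray) {
    // Look up the currently loaded class by its fully-qualified name
    val clazz = Class.forName(className)
    // Swap its implementation for the new bytecode
    inst.redefineClasses(ClassDefinition(clazz, bytecode))
}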

Note that it’s possible to do that even within the boundaries of the Java Platform Module System.

Continuous Deployment

Continuous Deployment is the process of delivering code from development to production in a fully-automated way, following a series of steps. Those steps are specific to each tech stack, organization, and sometimes even team. Here are some common ones:

  • Compilation
  • Unit testing
  • Packaging
  • Integration testing
  • Containerization
  • Storing in a registry
  • Deployment in any number of environments
  • Canary deployments
  • etc.

The traditional deployment model of Java applications is to:

  1. Stop the existing running JVM
  2. Deploy the new JAR (or WAR if inside an application server)
  3. Start the JVM again

An additional property of continuous deployment is the lack of downtime. Therefore, the process described above is not compatible with it. However, if we skip packaging and directly load the newly compiled bytecode into the running JVM, no downtime occurs. Both the Attach API and the Instrumentation API can help us in this regard.

Here are the involved components:

  1. A JVM that runs the class. This is the production environment.
  2. A Java agent JAR that is able to read an updated class from a specific location and to update the running class (1)
  3. A JVM that loads the agent (2) into the production JVM (1)
  4. A way to produce an updated class and make it accessible to the agent (2)

We now need to focus on the last part, as the rest is in place.

Bytecode Streaming

At this point, we need a way to read updated bytecode and deliver it so it can be read by the production JVM (that has been updated with agent code – cf. above section). This should be done continuously: this is the textbook definition of streaming. Hazelcast Jet fits this requirement very well.

From the Jet site:

Hazelcast Jet allows you to write modern Java code that focuses purely on data transformation while it does all the heavy lifting of getting the data flowing and computation running across a cluster of nodes. It supports working with both bounded (batch) and unbounded (streaming) data.

Jet excels at traditional distributed Extract-Transform-Load pipelines:

  • Extract: read from a variety of sources: databases, files, Kafka, etc. If a source doesn’t exist, the Jet API allows you to write your own
  • Transform: one can filter, map, flat map, etc. the data. Not only does Jet provide the primitives from Java’s Stream, it also comes with additional stateless and stateful transforms, such as hash-joins, aggregates, etc.
  • Load: Jet can write the processed data into a lot of different out-of-the-box components. The same API lets you also write your own.

Now, bytecode is just a specific kind of data. Let’s create a Jet job that:

  1. Reads the bytecode generated by the compilation of the source code on a developer’s computer
  2. Checks whether it’s the same as already existing bytecode
  3. Makes it available in a map on a Hazelcast instance with the key as the class name and the value as the bytecode

The agent just needs to register for updates on this map. Every time the compiler creates new bytecode, it will be made available in the map by the Jet job, and the production JVM will redefine the class with the updated bytecode.
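As a sketch, the agent-side registration could look like this, assuming the agent stored the Instrumentation instance and obtained the map from a Hazelcast client (BytecodeListener and register() are made-up names):

import com.hazelcast.core.EntryEvent
import com.hazelcast.map.IMap
import com.hazelcast.map.listener.EntryAddedListener
import com.hazelcast.map.listener.EntryUpdatedListener
import java.lang.instrument.ClassDefinition
import java.lang.instrument.Instrumentation

class BytecodeListener(private val inst: Instrumentation) :
    EntryAddedListener<String, ByteArray>, EntryUpdatedListener<String, ByteArray> {

    override fun entryAdded(event: EntryEvent<String, ByteArray>) = redefine(event)
    override fun entryUpdated(event: EntryEvent<String, ByteArray>) = redefine(event)

    private fun redefine(event: EntryEvent<String, ByteArray>) {
        // Replace the running class with the bytecode received from the map
        val clazz = Class.forName(event.key)
        inst.redefineClasses(ClassDefinition(clazz, event.value))
    }
}

fun register(map: IMap<String, ByteArray>, inst: Instrumentation) {
    // true: deliver the value (the bytecode) along with each event
    map.addEntryListener(BytecodeListener(inst), true)
}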

This is an overview:

Component diagram

Note that only the Production JVM and the Injection JVM need to be co-hosted on the same system. The rest of the components have no location requirements.

While Jet provides the API in Java, Kotlin makes it more engaging. The following snippets will be written in Kotlin. The complete source code can be found on GitHub.

The first step requires reading the actual bytecode. There’s no such source provided out-of-the-box, but it’s straightforward to create one:

// A custom streaming source that emits (class name, bytecode) pairs
fun classes() = SourceBuilder
    .stream("classes", TargetPathContext())
    .fillBufferFn(ClassPathReader())
    .build()
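Jet calls the function passed to fillBufferFn() repeatedly to poll for new items, while the function passed to stream() creates the context object shared by those calls.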

The TargetPathContext class is a supplier that stores… the target path:

class TargetPathContext : FunctionEx<Context, ContextHolder> {
    // Called once at source creation to supply the shared context object
    override fun applyEx(ctx: Context) = ContextHolder()
}

// Resolves the classes directory from the "target" system property
class ContextHolder(val classesDirectory: Path = Paths.get(System.getProperty("target")))
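Note that the target system property is expected to point at the compiler’s output directory, e.g. target/classes for a Maven build.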

The ClassPathReader class is where the reading magic really happens:

class ClassPathReader : BiConsumerEx<ContextHolder, SourceBuffer<Pair<String, ByteArray>>> {

    override fun acceptEx(ctx: ContextHolder, buffer: SourceBuffer<Pair<String, ByteArray>>) {
        val toClassName = PathToClassName(ctx.classesDirectory)
        val visitor = ClassFileVisitor()
        // Walk the output directory and collect all .class files
        Files.walkFileTree(ctx.classesDirectory, visitor)
        visitor.classes.forEach {
            // Emit a (fully-qualified class name, bytecode) pair into the source's buffer
            val name = toClassName.applyEx(it)
            val content = Files.readAllBytes(it)
            buffer.add(name to content)
        }
    }
}

The final touch is the pipeline itself:

fun pipeline() = Pipeline.create().apply {
    val mapName = "bytecode"
    readFrom(classes())
        .withIngestionTimestamps()
        // Log the class name of each item
        .peek { it.first }
        // Let through only bytecode that differs from what the map already holds
        .filterUsingService(
            ServiceFactories.iMapService(mapName),
            CheckChange()
        )
        // Convert the Pair into a Map.Entry for the map sink
        .map { entry(it.first, it.second) }
        .writeTo(Sinks.map(mapName))
}

Note the filterUsingService() call: it’s used to filter out bytecode that hasn’t changed.

class CheckChange : BiPredicateEx<IMap<String, ByteArray>, Pair<String, ByteArray>> {

    // Keep the pair if the class is new or its bytecode differs from the stored one
    override fun testEx(map: IMap<String, ByteArray>, pair: Pair<String, ByteArray>) =
        map[pair.first] == null || !pair.second.contentEquals(map[pair.first]!!)
}
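Note that contentEquals() compares the two byte arrays element by element, whereas == on arrays would only compare references.

Finally, the pipeline needs to be submitted to a Jet instance to start flowing. A minimal sketch, using an embedded instance:

import com.hazelcast.jet.Jet

fun main() {
    // Start an embedded Jet member and submit the job; join() blocks while the job runs
    val jet = Jet.newJetInstance()
    jet.newJob(pipeline()).join()
}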

The final flow looks like this:

Sequence diagram

A word on limitations

This proof-of-concept has some limitations:

  1. First and foremost, compared to traditional deployment, the code that runs in production is not tagged with a specific version. This is not specific to bytecode streaming, but inherent to any incremental deployment approach.
    In the real world, however, the source of the data pipeline is not a developer’s machine but the result of a build job on a continuous integration server (e.g. Jenkins). Additional metadata can be set through the build (such as a timestamp, build number, version number, etc.) and streamed along to be used by the pipeline.
  2. This prototype handles only bytecode changes, not new classes. Loading new classes requires additional class-loading magic that goes well beyond the scope of this post and its associated demo code.
  3. By definition, redefineClasses() “must not add, remove or rename fields or methods, change the signatures of methods, or change inheritance”

Conclusion

When one thinks about stream processing, one generally thinks about reading data from somewhere, transforming it, and storing it somewhere else. That view is pretty narrow: the concept of data is much larger than it appears.

In this post, we showed that bytecode is data that can be processed like any other data. If one scratches beneath the surface, there is a lot of data around that is just waiting to benefit from streaming and Jet.

The complete source code can be found on GitHub. Happy streaming!