LockSupport.parkNanos() Under the Hood and the Curious Case of Parking, Part II: Windows

In the previous post we saw how LockSupport.parkNanos() is implemented on Linux, what behavior to expect and how we can tune it. The post was well received and a few people asked how parkNanos() behaves on Windows. I’ve used Linux as a daily driver for over a decade and I didn’t feel like exploring Windows. It’s a closed-source operating system, which makes it hard to explore. Moreover, my knowledge of the Windows API is virtually non-existent and I felt the learning curve would be too steep to justify the time investment.

However, Tomasz Gawęda re-ran my experiment on Windows 10 and shared his results on Twitter: LockSupport.parkNanos(55_000) took about 1.5 ms on his Windows 10 box. Just a reminder: it took about 100 μs on my Linux box. That’s an order of magnitude difference, and it caught my attention!

Reproducer in Java

It is the same code as last time:

@Override
public void run() {
  long startNanos = System.nanoTime();
  for (long i = 0; i < iterationCount; i++) {
    LockSupport.parkNanos(1);
  }
  long durationNanos = System.nanoTime() - startNanos;
  long durationMillis = NANOSECONDS.toMillis(durationNanos);
  System.out.println(iterationCount + " iterations in " + durationMillis + " ms.");

  long microsPerIteration = NANOSECONDS.toMicros(durationNanos) / iterationCount;
  System.out.println("This means each iteration took " + microsPerIteration + " microseconds");
}

I ran the very same experiment on my 2-in-1 Dell XPS with Windows 10. The results were certainly interesting: LockSupport.parkNanos(1) took close to 12 ms!

$ java -jar ./target/testvdso-1.0-SNAPSHOT.jar PARK 1000
1000 iterations in 11609 ms.
This means each iteration took 11609 microseconds

Now we are talking about a 2-3 orders of magnitude difference. What is going on? Let’s start by exploring how parkNanos() is implemented on Windows. This is the C++ code the JDK ultimately calls. I skimmed the method body and line 5,246 looked interesting: WaitForSingleObject(_ParkEvent, time); This is what the Windows API documentation says about it: “[…] Waits until the specified object is in the signaled state or the time-out interval elapses. […]”

This appears to be somewhat similar to the pthread_cond_timedwait() we know from POSIX threads and discussed in the previous post. There are some differences though:

  1. The Windows function receives a relative timeout while POSIX works with an absolute deadline
  2. The time is in milliseconds. There is no struct/parameter to pass microseconds or nanoseconds.

Difference #1 is not terribly interesting, but #2 is. LockSupport.parkNanos() receives nanoseconds, so how on earth could it be implemented via a system call with millisecond granularity? I read the function once again and then I saw it:

time /= 1000000;  // Must coarsen from nanos to millis
if (time == 0) {  // Wait for the minimal time unit if zero
  time = 1;
}
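
In plain Java, the conversion amounts to this little helper. This is my own illustration of the arithmetic, not code from the JDK:

// Illustration only: the same coarsening the JDK performs before calling WaitForSingleObject()
static long toWindowsTimeoutMillis(long nanos) {
    long millis = nanos / 1_000_000; // coarsen from nanos to millis
    return millis == 0 ? 1 : millis; // wait at least the minimal unit
}

// toWindowsTimeoutMillis(1)         -> 1  (what parkNanos(1) effectively asks for)
// toWindowsTimeoutMillis(55_000)    -> 1  (55 microseconds also becomes 1 ms)
// toWindowsTimeoutMillis(5_000_000) -> 5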

In other words: the JDK rounds everything up to milliseconds. If you call LockSupport.parkNanos(1) then the JDK asks Windows to wait 1 ms. That’s 6 orders of magnitude difference! Obviously, it’s unreasonable to expect the wait to have nanosecond accuracy, but in the previous part we saw the accuracy on Linux was in the low tens of microseconds. On Windows we are talking about milliseconds. As bad as it sounds, it still does not explain why my Java reproducer waits about 12 ms. Let’s go deeper!

Reproducer in C

I wanted to validate my findings in something closer to the bare metal. This was the part I was rather afraid of – I wrote my last Win32 API code when I was maybe 15. That was a long time ago; just think of it: I used a printed book to learn the API! However, it turned out to be quite easy:

#include <stdio.h>
#include <windows.h>

int main() {
    // Manual-reset event that nobody will ever signal
    HANDLE event = CreateEvent(NULL, TRUE, FALSE, NULL);

    LARGE_INTEGER frequency;
    LARGE_INTEGER start;
    LARGE_INTEGER end;
    double interval;

    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&start);

    for (int i = 0; i < 1000; i++) {
        // Wait for the event with a 1 ms timeout - the same call the JDK uses
        WaitForSingleObject(event, 1);
    }

    QueryPerformanceCounter(&end);
    interval = (double) (end.QuadPart - start.QuadPart) / (frequency.QuadPart / 1000);
    printf("%f ms\n", interval);

    return 0;
}

The Windows documentation turns out to be really good; it has tons of simple examples and it’s a joy to learn the API. Anyway, if you are not familiar with the Windows API, here is what the program above does:

  1. It creates an event. Events are meant to be used for thread notification
  2. There is a bunch of QueryPerformanceXY calls – these are used just as a timer to calculate the elapsed time
  3. The only real meat is the loop with the WaitForSingleObject() call. This is the same system call the JDK uses. It’s waiting for an event, but there is no other thread to signal it, so it’s effectively just waiting for the timeout to elapse

Let’s try to run it:
$ ./wintimertest.exe
15644.657600 ms

$ ./wintimertest.exe
16028.830300 ms

Bingo! 1,000 iterations take about 16 seconds, which means each iteration took about 16 ms. That’s similar to the result in Java. It also leads to the following realization: Java rounds 1 ns up to 1 ms and Windows rounds it up again to 16 ms or so. Isn’t it great?

Fun with Timers

The very first Google result for the phrase “windows timer frequency” points to this article: Windows Timer Resolution: Megawatts Wasted. I recommend everyone read it to understand the results better, but more importantly, it introduced me to another Windows API function: timeBeginPeriod(). It requests a minimum resolution for periodic timers.

Let’s try it out in our C program. I made two changes:

  1. Added timeBeginPeriod(1) at the beginning of my main() function
  2. Linked the winmm.lib library. If you are using CMake then it’s as simple as target_link_libraries(wintimertest winmm.lib)

main.c:

#include <stdio.h>
#include <windows.h>

int main() {
    timeBeginPeriod(1); // <- the only change

    HANDLE event = CreateEvent(NULL, TRUE, FALSE, NULL);

    LARGE_INTEGER frequency;
    LARGE_INTEGER start;
    LARGE_INTEGER end;
    double interval;

    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&start);

    for (int i = 0; i < 1000; i++) {
        WaitForSingleObject(event, 1);
    }

    QueryPerformanceCounter(&end);
    interval = (double) (end.QuadPart - start.QuadPart) / (frequency.QuadPart / 1000);
    printf("%f ms\n", interval);

    return 0;
}

CMake file:

cmake_minimum_required(VERSION 3.13)
project(wintimertest C)
set(CMAKE_C_STANDARD 11)
add_executable(wintimertest main.c)
target_link_libraries(wintimertest winmm.lib)
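
If you have not driven CMake from the command line before, a typical out-of-source build is just two commands. I used CLion, which generates the cmake-build-debug directory for you; the paths below assume the same layout:

$ cmake -B cmake-build-debug
$ cmake --build cmake-build-debug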

Let’s run it!

dellik@DESKTOP-DTGDFVM MINGW64 ~/CLionProjects/wintimertest
$ ./cmake-build-debug/wintimertest.exe
1678.479700 ms

dellik@DESKTOP-DTGDFVM MINGW64 ~/CLionProjects/wintimertest
$ ./cmake-build-debug/wintimertest.exe
1706.220200 ms

dellik@DESKTOP-DTGDFVM MINGW64 ~/CLionProjects/wintimertest
$ ./cmake-build-debug/wintimertest.exe
1681.011900 ms

You can see 1,000 iterations took roughly 1.6 seconds. This means each WaitForSingleObject() call took about 1.6 ms. That’s 10x less than without the timeBeginPeriod() call! Moreover, the results are now roughly the same as Tomasz Gawęda reported on Twitter. However, Tomasz did not call timeBeginPeriod(). So how come he observed lower values out of the box? I have two hypotheses:

  1. Different Windows versions behave differently and use a different timer frequency
  2. The timer is global and its configuration affects every process running on the same box. It takes a single application to change the timer and everyone else is affected too. The Windows Timer Resolution: Megawatts Wasted post shows quite a few applications do that, hence it’s very likely Tomasz had at least one such application running.

The second hypothesis seems more likely to me. I noticed that when Google Chrome was running and I was scrolling pages up and down, my application behaved similarly to when the high-resolution timer was activated.

Back to Java

Now we know Windows provides a function to increase the timer frequency. There is one obvious question: is there a way to call this function from Java? Obviously one can always use JNI: you could distribute a Windows-specific native library along with your Java application, perhaps embed the library inside the JAR and load it when needed. However, this would complicate the build procedure and JNI is always a bit awkward to use; a rough sketch of what that could look like follows.
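
For illustration only, the Java side of such a JNI binding could look roughly like this. Neither the class nor the "timerhack" library exists; you would still have to write and compile the small native side that forwards the call to timeBeginPeriod():

// Hypothetical sketch - "timerhack" is a native library you would have to build yourself.
public final class WindowsTimer {
    static {
        System.loadLibrary("timerhack"); // would load timerhack.dll on Windows
    }

    // The C implementation would simply forward to the winmm timeBeginPeriod(period)
    // call and return its result.
    public static native int timeBeginPeriod(int period);
}

Is there a simpler way? Perhaps the JDK itself already calls this function? Let’s see: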

14:52 $ cd /home/jara/devel/oss/jdk/src
14:52 $ grep timeBeginPeriod * -R
hotspot/os/windows/os_windows.cpp:      timeBeginPeriod(1L);
hotspot/os/windows/os_windows.cpp:  // Buf if ms is large, say 500, or 503, we should avoid the call to timeBeginPeriod().
hotspot/os/windows/os_windows.cpp:  // timeBeginPeriod() if the relative error exceeded some threshold.
hotspot/os/windows/os_windows.cpp:  // timeBeginPeriod() has been linked to problems with clock drift on win32 systems and
hotspot/os/windows/os_windows.cpp:  // (a) calls to timeBeginPeriod() and timeEndPeriod() and (b) time spent with high
hotspot/os/windows/os_windows.cpp:      MMRESULT result = timeBeginPeriod(1L);
java.desktop/windows/native/libjsound/PLATFORM_API_WinOS_MidiOut.c:    timeBeginPeriod(1);

Quite a few references to our function – this looks promising! os_windows.cpp contains this snippet:

BOOL WINAPI DllMain(HINSTANCE hinst, DWORD reason, LPVOID reserved) {
  switch (reason) {
  case DLL_PROCESS_ATTACH:
    vm_lib_handle = hinst;
    if (ForceTimeHighResolution) {
      timeBeginPeriod(1L);
    }
    WindowsDbgHelp::pre_initialize();
    SymbolEngine::pre_initialize();
    break;
  case DLL_PROCESS_DETACH:
    if (ForceTimeHighResolution) {
      timeEndPeriod(1L);
    }
    break;
  default:
    break;
  }
  return true;
}

I don’t exactly understand what this code is doing or when exactly it is called, but it appears to check the JVM flag ForceTimeHighResolution and, if it is set, use the timeBeginPeriod() function to request the high-resolution timer. The flag description in globals.hpp looks very promising too:

product(bool, ForceTimeHighResolution, false, "Using high time resolution (for Win32 only)")
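
Boolean product flags like this one are switched on with the standard -XX syntax, so in theory a single extra argument should be enough:

$ java -XX:+ForceTimeHighResolution -jar ./target/testvdso-1.0-SNAPSHOT.jar PARK 1000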

Is this it? Just a single flag and my JVM will use high-resolution timers? Well, it turns out it’s not that simple. This flag has been reported to be broken since 2006! However, the very same bug report suggests a very interesting workaround:

“Do not use ForceTimeHighResolution but instead, at the start of the application create and start a daemon Thread that simply sleeps for a very long time (that isn’t a multiple of 10ms) as this will set the timer period to be 1ms for the duration of that sleep – which in this case is the lifetime of the VM:

new Thread() {
  { this.setDaemon(true); this.start(); }
  public void run() {
    while (true) {
      try {
        Thread.sleep(Integer.MAX_VALUE);
      } catch (InterruptedException ex) {
      }
    }
  }
};

Note that as the sleep never terminates the timer period is never restored by the VM […]”

What the R$#@#@ have I just read? Let me rephrase it: “If you want a high-resolution timer then just start a new thread and let it sleep forever”. That’s simply hilarious! Let’s give it a try:

Add the workaround proposed in the Java bug report into our Utils class:

public static void windowsTimerHack() {
    Thread t = new Thread(() -> {
        try {
            Thread.sleep(Long.MAX_VALUE);
        } catch (InterruptedException e) { // a delicious interrupt, omm, omm
        }
    });
    t.setDaemon(true);
    t.start();
}

Then call the hack from our experiment:

@Override 
public void run() { 
  Utils.windowsTimerHack(); 
  long startNanos = System.nanoTime(); 
  for (long i = 0; i < iterations; i++) { 
    LockSupport.parkNanos(PARK_NANOS);
  }
[...]

And here are the results:

$ java -jar ./target/testvdso-1.0-SNAPSHOT.jar PARK 1000 
1000 iterations in 1706 ms. This means each iteration took 1706 microseconds

$ java -jar ./target/testvdso-1.0-SNAPSHOT.jar PARK 1000 
1000 iterations in 1734 ms. This means each iteration took 1734 microseconds

$ java -jar ./target/testvdso-1.0-SNAPSHOT.jar PARK 1000 
1000 iterations in 1732 ms. This means each iteration took 1732 microseconds

One iteration took around 1.7 ms, while without the hack it was around 16 ms. The hack works! The explanation of why it works is simple: there is a class HighResolutionInterval in os_windows.cpp which calls timeBeginPeriod(1L) in its constructor and timeEndPeriod(1L) in its destructor. The class is instantiated around the Windows implementation of Thread.sleep(). So when you call Thread.sleep(), the HighResolutionInterval enables the high-resolution timer. Unless the sleeping interval is divisible by 10. ¯\_(ツ)_/¯

So it works, but does it feel good? Not really; starting a new thread almost never feels good. Is there a better way? Let’s have another look at where timeBeginPeriod() is called from:

$ grep timeBeginPeriod * -R
hotspot/os/windows/os_windows.cpp: timeBeginPeriod(1L); 
hotspot/os/windows/os_windows.cpp: // Buf if ms is large, say 500, or 503, we should avoid the call to timeBeginPeriod(). 
hotspot/os/windows/os_windows.cpp: // timeBeginPeriod() if the relative error exceeded some threshold. 
hotspot/os/windows/os_windows.cpp: // timeBeginPeriod() has been linked to problems with clock drift on win32 systems and 
hotspot/os/windows/os_windows.cpp: // (a) calls to timeBeginPeriod() and timeEndPeriod() and (b) time spent with high 
hotspot/os/windows/os_windows.cpp: MMRESULT result = timeBeginPeriod(1L); 
java.desktop/windows/native/libjsound/PLATFORM_API_WinOS_MidiOut.c: timeBeginPeriod(1);

Let’s focus on the last line: something-something-MidiOut.c. Apparently, the Java Sound API also relies on high-resolution timers. It’s not very surprising, as music requires accurate timing and I can imagine 16 ms jitter would be very noticeable. Here is the relevant part of the source code: when a MIDI out device is opened, it enables the high-resolution timer.

Let’s try to (ab)use it and add a new hack into our Utils class:

// Note: MidiOutDeviceProvider comes from the internal com.sun.media.sound package,
// MidiDevice and MidiUnavailableException from javax.sound.midi
public static void windowsTimerHack_midi() {
    MidiOutDeviceProvider provider = new MidiOutDeviceProvider();
    MidiDevice.Info[] deviceInfo = provider.getDeviceInfo();
    if (deviceInfo.length == 0) { // no MIDI, no hack
        return;
    }
    try {
        provider.getDevice(deviceInfo[0]).open();
    } catch (MidiUnavailableException e) {
        // ignored, hack is not applied
    }
}

Add it to the experiment and run it:

$ java -jar ./target/testvdso-1.0-SNAPSHOT.jar PARK 1000 
1000 iterations in 1638 ms. This means each iteration took 1638 microseconds

$ java -jar ./target/testvdso-1.0-SNAPSHOT.jar PARK 1000 
1000 iterations in 1642 ms. This means each iteration took 1642 microseconds

$ java -jar ./target/testvdso-1.0-SNAPSHOT.jar PARK 1000 
1000 iterations in 1638 ms. This means each iteration took 1638 microseconds

It worked too! Again, each iteration is about 1.6 ms, 10x less than by default.

Now the big question is: which hack is better? Or perhaps: which hack is less awful? Starting a thread seems wasteful – it allocates memory for the stack, there is some cost due to OS and JVM accounting, etc. However, I can at least reason about the possible impact. On the other hand, the MIDI hack is a whole new world for me. I have never used sound in Java; actually, I have never programmed anything with MIDI at all. Hence I cannot really reason about possible side effects. Thus I feel somewhat less uncomfortable with the Thread hack. Obviously, in a real project one has to make sure to properly stop the thread when it’s no longer needed to prevent leaks; a sketch of a stoppable variant follows below. And as with every performance optimization: you always have to measure the impact on your application.
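
Here is one possible shape for such a stoppable variant, building on the Utils method above. This is just my sketch, not code from the JDK or the bug report; it relies on the behavior described earlier, namely that the JVM restores the timer period once the sleep ends:

public final class WindowsTimerHack {
    private static Thread sleeper;

    public static synchronized void enable() {
        if (sleeper != null) {
            return; // already active
        }
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException e) {
                // interrupted by disable(): the sleep ends and the JVM calls
                // timeEndPeriod(), restoring the default timer period
            }
        }, "windows-timer-hack");
        t.setDaemon(true);
        t.start();
        sleeper = t;
    }

    public static synchronized void disable() {
        if (sleeper != null) {
            sleeper.interrupt();
            sleeper = null;
        }
    }
}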

Practical Applicability

When I was researching the same topic on Linux it was a really cool exercise, but I had doubts about its practical usability. The default granularity on Linux is just 50-60 microseconds and tuning it further felt like something reserved for very special occasions. However, Windows with its 16 ms granularity is a very different beast. It’s not uncommon for applications to use an exponential backoff strategy for various worker threads: if there is no work available then at first a thread does a little bit of busy-spinning, then perhaps something like Thread.yield() and eventually LockSupport.parkNanos(), roughly as in the sketch below. However, many of these backoff strategies are based on the assumption that LockSupport.parkNanos() behaves reasonably. While 50 microsecond granularity on Linux is usually somewhat expected, 16 ms on Windows can be very surprising. It certainly was surprising to me!
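
This is a simplified sketch of the general pattern, not the actual implementation of Jet or any other library; the thresholds, names and the 100 μs park value are made up for illustration:

import java.util.concurrent.locks.LockSupport;

// Simplified sketch of an exponential backoff idle strategy for a worker thread.
final class BackoffWorker implements Runnable {
    private volatile boolean running = true;

    @Override
    public void run() {
        long idleCount = 0;
        while (running) {
            if (pollWork()) {
                idleCount = 0;           // found work, reset the backoff
                continue;
            }
            idleCount++;
            if (idleCount < 100) {
                Thread.onSpinWait();     // 1) busy-spin for a short while (Java 9+)
            } else if (idleCount < 200) {
                Thread.yield();          // 2) then give up the CPU slice
            } else {
                // 3) then park "briefly" - the code hopes for ~100 microseconds,
                //    but on Windows this becomes at least 1 ms (or ~16 ms by default)
                LockSupport.parkNanos(100_000);
            }
        }
    }

    private boolean pollWork() {
        return false;                    // placeholder: this sketch never finds work
    }

    void stop() {
        running = false;
    }
}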

Here is an example: Hazelcast Jet is an open source in-memory stream processing engine. At its core it’s optimized for low latency; it can process millions of events per second without breaking a sweat. It’s also important to say that Hazelcast Jet internally uses green threads with the exponential backoff strategy described above. Most Jet developers use macOS, and while we also test on Windows, our performance tests are centered around Linux – as this is the most common operating system for production Jet deployments. Naturally, while doing this research I was wondering how all this impacts Hazelcast Jet when running on Windows. So please bear with me for yet another experiment.

Jet in Action

Jet exposes a high-level API to define stream processing pipelines. Here is an example I will be using in my experiment:

package info.jerrinot.jetexper;

import com.hazelcast.jet.*;
import com.hazelcast.jet.config.JetConfig;
import com.hazelcast.jet.datamodel.TimestampedItem;
import com.hazelcast.jet.function.DistributedFunction;
import com.hazelcast.jet.pipeline.*;


import static com.hazelcast.jet.aggregate.AggregateOperations.counting;
import static com.hazelcast.jet.pipeline.Sinks.logger;
import static com.hazelcast.jet.pipeline.WindowDefinition.tumbling;
import static com.hazelcast.jet.impl.util.Util.toLocalTime;//dont do this at home
import static java.lang.String.format;
import static java.util.concurrent.TimeUnit.SECONDS;

public class PerfTest {
    private static final long TEST_DURATION_MILLIS = SECONDS.toMillis(240);
    private static final DistributedFunction<TimestampedItem, String> FORMATTING_LAMBDA = PerfTest::formatOutput;

    private static String formatOutput(TimestampedItem tsItem) {
        return "---------- "
                + toLocalTime(tsItem.timestamp())
                + ", " + format("%,d", tsItem.item())
                + " -------";
    }


    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();
        pipeline.drawFrom(dummySource())
                .withIngestionTimestamps()
                .window(tumbling(SECONDS.toMillis(10)))
                .aggregate(counting())
                .drainTo(logger(FORMATTING_LAMBDA));

        JetConfig config = new JetConfig();
        config.getProperties().setProperty("hazelcast.logging.type", "slf4j");
        JetInstance jetInstance = Jet.newJetInstance(config);
        jetInstance.newJob(pipeline).join();
        jetInstance.shutdown();
    }

    private static StreamSource<Long> dummySource() {
        return SourceBuilder.stream("dummySource", (c) -> System.currentTimeMillis() + TEST_DURATION_MILLIS)
                .<Long>fillBufferFn((s, b) -> {
                    long now = System.currentTimeMillis();
                    if (now < s) {
                        b.add(0L);
                    } else {
                        b.close();
                    }
                })
                .build();
    }
}

The pipeline is easy:

  1. It reads from a dummy source which always emits the number 0
  2. Jet attaches an ingestion timestamp to each emitted item
  3. The timeline is split into 10-second tumbling windows and items are collected into them
  4. The counting() aggregation runs on each window to calculate how many items were collected
  5. Finally, the output is written to a log

It looks complicated, but it isn’t. Basically, every 10 seconds it spits out a number showing how many items were generated in the last 10 seconds. This is how the output looks on my Linux laptop:

[...] 
---------- 17:22:40.000, 64,544,125 ------- 
---------- 17:22:50.000, 65,867,834 ------- 
---------- 17:23:00.000, 68,144,723 ------- 
---------- 17:23:10.000, 67,014,642 ------- 
[...]

So that’s about 65,000,000 items generated and processed in 10 seconds, or about 6-7M items per second. Now let’s run the same code on my Windows laptop:

[...] 
---------- 17:45:40.000, 20,391,480 ------- 
---------- 17:45:50.000, 22,054,175 ------- 
---------- 17:46:00.000, 19,506,898 ------- 
---------- 17:46:50.000, 18,348,198 ------- 
[...]

The Windows numbers are about ⅓ of what the same code did on Linux. Now, my Linux laptop is quite a beast while the Windows machine is a portable 2-in-1 device. Still, that cannot fully explain the performance difference. I already told you Hazelcast Jet uses a backoff strategy with parkNanos() for worker threads. So let’s re-test it on Windows, but with the timer hack activated.

[...] 
---------- 17:48:00.000, 42,718,105 ------- 
---------- 17:48:10.000, 40,780,790 ------- 
---------- 17:48:20.000, 38,068,978 ------- 
---------- 17:48:30.000, 38,543,801 ------- 
[...]

That’s quite a bit better! The throughput doubled with the hack! What does this mean?

  1. Abstractions are leaky. Abstractions over operating systems even more so!
  2. We have to run automated performance tests on Windows too

What am I supposed to do with this?

At this point it’s a good time to link the post Windows Timer Resolution: Megawatts Wasted once again. I am certainly not advocating for everyone to go and include this hack in all Java applications. There are good reasons why the default timer frequency is rather conservative, and the article explains them.

That’s all, folks. So far we have looked at LockSupport.parkNanos() on Linux and Windows, so stay tuned, there is more coming!

If you like to poke systems and learn about low-level details then follow me on Twitter!