Intrinsically fast: more JVM performance tinkering

I didn’t expect my last post on JVM perf to be so well received, so I thought I’d carry on digging into why your code does (or doesn’t) run fast! Let’s forget about concurrency for now and instead focus on the executable machine code that the Java Virtual Machine (and particularly HotSpot) generates.

In Java-land it’s pretty common to hear people talk about ‘warmup times’, especially in the context of an incendiary microbenchmark that conclusively proves IO framework x’s Hello World is an order of magnitude quicker than that of framework y. You may also have come across tools like JMH for running these things methodically. Or you may just be confused by some guy on Stack Overflow wondering why his sin function runs slower than the standard Java one.

The good news is that Java 6+ lets you peek under the hood at the stuff HotSpot is actually emitting so you can get concrete answers to these questions. As a motivating example, let’s take a single function bundled with the JDK and write our own implementation: Long.bitCount():

public static int myBitCount(long i) {
    i = i - ((i >>> 1) & 0x5555555555555555L);
    i = (i & 0x3333333333333333L) + ((i >>> 2) & 0x3333333333333333L);
    i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
    i = i + (i >>> 8);
    i = i + (i >>> 16);
    i = i + (i >>> 32);
    return (int)i & 0x7f;
}


If you’re not familiar with bitCount, it’s also called the Hamming weight or population count of a binary value: the number of bits set to 1. The Wikipedia entry gives a decent explanation. Actually, the listing above is the standard pure-Java OpenJDK implementation, so unless and until HotSpot decides to turn it into something better suited to your CPU architecture, that’s what you get (there are arguably more optimal implementations depending on the input).
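To make the definition concrete, here’s a rough sketch that checks the bit-twiddling version against a deliberately naive loop and against Long.bitCount itself. The class name BitCountCheck, the naiveBitCount helper and the sample values are my own picks for illustration and aren’t part of the JDK or the benchmark.

```java
public class BitCountCheck {

    // Same bit-twiddling steps as the listing above: sum bits in pairs,
    // then nibbles, then bytes, then fold the byte sums together.
    public static int myBitCount(long i) {
        i = i - ((i >>> 1) & 0x5555555555555555L);
        i = (i & 0x3333333333333333L) + ((i >>> 2) & 0x3333333333333333L);
        i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
        i = i + (i >>> 8);
        i = i + (i >>> 16);
        i = i + (i >>> 32);
        return (int) i & 0x7f;
    }

    // Naive reference (hypothetical helper): test each of the 64 bits in turn.
    public static int naiveBitCount(long value) {
        int count = 0;
        for (int bit = 0; bit < 64; bit++) {
            count += (int) ((value >>> bit) & 1L);
        }
        return count;
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, 0b1011L, -1L, Long.MIN_VALUE, 0x5555555555555555L};
        for (long sample : samples) {
            int expected = naiveBitCount(sample);
            if (myBitCount(sample) != expected || Long.bitCount(sample) != expected) {
                throw new AssertionError("Mismatch for " + Long.toBinaryString(sample));
            }
        }
        System.out.println("All three implementations agree");
    }
}
```

All three should agree on every input; the only thing that differs is how much work the CPU ends up doing.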

Let’s run a super-naïve benchmark of myBitCount against plain old Long.bitCount to see if there’s any difference in execution time:

package com.logentries.blog;

public class TinyBenchmark {

    private long x = 0;
    public long y = 0;

    public long standardBitCount() {
        y = Long.bitCount(x++);
        return y;
    }

    public long handRolledBitCount() {
        y = myBitCount(x++);
        return y;
    }

    public static int myBitCount(long i) {
        i = i - ((i >>> 1) & 0x5555555555555555L);
        i = (i & 0x3333333333333333L) + ((i >>> 2) & 0x3333333333333333L);
        i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
        i = i + (i >>> 8);
        i = i + (i >>> 16);
        i = i + (i >>> 32);
        return (int)i & 0x7f;
    }

    public static void benchmark(Runnable work) {
        long start = System.nanoTime();

        for (long i = 0; i < 1E10; i++) {
            work.run();
        }

        // Nanoseconds / 1E6 is milliseconds, so read the "seconds" label below as ms.
        double total = (System.nanoTime() - start) / 1E6;
        System.out.println("Took " + total + " seconds");
    }

    public static void main(String[] args) {
        TinyBenchmark m = new TinyBenchmark();

        benchmark(() -> m.standardBitCount());
        // => Took 13156.933093 seconds

        benchmark(() -> m.handRolledBitCount());
        // => Took 33284.156043 seconds
    }
}


Wow, the built-in version is roughly two and a half times faster than our implementation! How is this even possible if ‘my’ implementation is taken straight from OpenJDK’s in the first place? It turns out that the JVM can swap in its own hand-tuned code for certain well-known methods once it finds they’re being invoked often enough. On x86-64 architectures the whole method can actually be performed by a single ‘intrinsic’ instruction, POPCNT, which Intel introduced alongside SSE4.2 (most modern Intel Core-based architectures will have this).
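If you’re curious whether your own JVM is allowed to use POPCNT, one way to check is to ask HotSpot for its UsePopCountInstruction flag through the HotSpotDiagnosticMXBean. This is only a rough sketch and wasn’t part of the benchmark: the class and method names (PopCountFlag, popCountFlagValue) are made up for illustration, and on a JVM that doesn’t expose the flag the lookup fails, which the sketch reports as "unknown".

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

public class PopCountFlag {

    /** Returns the flag's current value, or "unknown" if this JVM doesn't expose it. */
    public static String popCountFlagValue() {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return "unknown";
            }
            // Throws IllegalArgumentException if the VM has no option with this name.
            VMOption option = bean.getVMOption("UsePopCountInstruction");
            return option.getValue();
        } catch (IllegalArgumentException e) {
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("UsePopCountInstruction = " + popCountFlagValue());
    }
}
```

A value of "true" only says HotSpot may emit the instruction; whether it does for a given call site still depends on that code getting hot enough to be compiled.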

Intrinsic?

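Think of an intrinsic as a shorthand version of a bag of CPU instructions: a method the JIT compiler recognizes and replaces with a hand-tuned instruction sequence, sometimes a single instruction. They tend to be massively faster than the ‘long-form’ equivalents because the CPU has far fewer instructions to fetch, decode and execute.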

POPCNT is one of those cases, but can we actually see this happening in our tiny little benchmark? With a little disassembler called hsdis, we can indeed. I’ve described previously how you set this up so I won’t duplicate it here, but suffice to say it’s not that hard. Let’s look at the pertinent assembly for the standard bitCount function:

  0x000000010c57124c: mov    0x10(%rsi),%r10    ;*getfield x
                                                ; - com.logentries.blog.TinyBenchmark::standardBitCount@3 (line 9)

  0x000000010c571250: mov    %r10,%r11
  0x000000010c571253: add    $0x1,%r11
  0x000000010c571257: mov    %r11,0x10(%rsi)    ;*putfield x
                                                ; - com.logentries.blog.TinyBenchmark::standardBitCount@9 (line 9)

  0x000000010c57125b: popcnt %r10,%r10
  0x000000010c571260: movslq %r10d,%rax         ;*i2l  ; - com.logentries.blog.TinyBenchmark::standardBitCount@15 (line 9)

  0x000000010c571263: mov    %rax,0x18(%rsi)    ;*putfield y
                                                ; - com.logentries.blog.TinyBenchmark::standardBitCount@16 (line 9)


Great- between the load / store operations for our fields, we’re clearly calculating bitCount in one shot! How about our hand-rolled contender?

                                                ; - com.logentries.blog.TinyBenchmark::myBitCount@-1 (line 19)

  0x0000000102773c8c: mov    %rdx,%r10
  0x0000000102773c8f: shr    %r10
  0x0000000102773c92: movabs $0xf0f0f0f0f0f0f0f,%r11
  0x0000000102773c9c: movabs $0x5555555555555555,%r8
  0x0000000102773ca6: and    %r8,%r10
  0x0000000102773ca9: sub    %r10,%rdx          ;*lsub
                                                ; - com.logentries.blog.TinyBenchmark::myBitCount@8 (line 19)

  0x0000000102773cac: mov    %rdx,%r10
  0x0000000102773caf: shr    $0x2,%r10
  0x0000000102773cb3: movabs $0x3333333333333333,%r8
  0x0000000102773cbd: and    %r8,%rdx
  0x0000000102773cc0: and    %r8,%r10
  0x0000000102773cc3: add    %r10,%rdx          ;*ladd
                                                ; - com.logentries.blog.TinyBenchmark::myBitCount@22 (line 20)

  0x0000000102773cc6: mov    %rdx,%r10
  0x0000000102773cc9: shr    $0x4,%r10
  0x0000000102773ccd: add    %rdx,%r10          ;*ladd
                                                ; - com.logentries.blog.TinyBenchmark::myBitCount@28 (line 21)

  0x0000000102773cd0: mov    %r10,%r8
  0x0000000102773cd3: and    %r11,%r8
  0x0000000102773cd6: shr    $0x8,%r10


Ugh, I stopped copying & pasting after a couple of lines, but you can clearly see our program has to do a hell of a lot more work: it’s responsible for all the requisite loads / stores plus arithmetic in between.

But wait, both implementations start off with the same Java code, so how come ours gets the rough treatment? When HotSpot kicks in and starts parsing the bytecode, it looks for call sites matching those certain functions I was talking about earlier. The match is made on the class, method name and signature rather than on what the method body does, so Long.bitCount qualifies while our identical copy in TinyBenchmark doesn’t. If you want to know what they are, they live in the OpenJDK vmSymbols.hpp header file. Look for the do_intrinsic macro and you’ll find that a lot of methods for operating on numbers and memory regions are optimized.
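To give a feel for the kind of call sites involved, here’s a small sketch that exercises a few JDK methods which, as far as I know, appear in that do_intrinsic list: some number-crunching ones and System.arraycopy for memory regions. The wrapper class and method names (IntrinsicCandidates, leadingZeros and so on) are hypothetical, and whether a particular JVM build actually intrinsifies each call depends on its version and your CPU.

```java
public class IntrinsicCandidates {

    // Number-crunching candidates: each wrapper contains one call site
    // that names a well-known JDK method.
    public static int leadingZeros(long v) {
        return Long.numberOfLeadingZeros(v);
    }

    public static long swapBytes(long v) {
        return Long.reverseBytes(v);
    }

    public static double root(double v) {
        return Math.sqrt(v);
    }

    // Memory-region candidate: bulk copy between arrays.
    public static int[] copyOf(int[] src) {
        int[] dst = new int[src.length];
        System.arraycopy(src, 0, dst, 0, src.length);
        return dst;
    }

    public static void main(String[] args) {
        System.out.println(leadingZeros(1L));
        System.out.println(Long.toHexString(swapBytes(0x0102030405060708L)));
        System.out.println(root(9.0));
        System.out.println(copyOf(new int[] {1, 2, 3}).length);
    }
}
```

Copy-pasting the body of any of these into your own method gives you the same results, but the call site no longer names the JDK method, so you’re back to the long-form code.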

Remember that ‘warmup period’?

Just to highlight that we’re specifically talking about HotSpot optimizations here, let’s see what happens if we drop the number of loop iterations to 1 and run the standard bitCount again:

CompilerOracle: print *TinyBenchmark.standardBitCount

Java HotSpot(TM) 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output

Took 0.055956 seconds

Process finished with exit code 0

That’s right, we get nothing at all; our loop hasn’t been run through often enough so the JVM is just interpreting the Java byte code on the fly. How about bumping the number slightly, say to 1000?

                                             ; - com.logentries.blog.TinyBenchmark::standardBitCount@9 (line 9)

  0x000000010d58adf6: movabs $0x12612c828,%rbx  ;   {metadata(method data for {method} {0x0000000126119658} 'standardBitCount' '()J' in 'com/logentries/blog/TinyBenchmark')}
  0x000000010d58ae00: addq   $0x1,0x108(%rbx)
  0x000000010d58ae08: mov    %rsi,0x48(%rsp)
  0x000000010d58ae0d: mov    %rdi,%rsi          ;*invokestatic bitCount
                                                ; - com.logentries.blog.TinyBenchmark::standardBitCount@12 (line 9)

  0x000000010d58ae10: nop

(…some nop instructions redacted…)

  0x000000010d58ae17: callq  0x000000010d446420  ; OopMap{[72]=Oop off=156}
                                                ;*invokestatic bitCount
                                                ; - com.logentries.blog.TinyBenchmark::standardBitCount@12 (line 9)
                                                ;   {static_call}
  0x000000010d58ae1c: movslq %eax,%rax
  0x000000010d58ae1f: mov    0x48(%rsp),%rsi
  0x000000010d58ae24: mov    %rax,0x18(%rsi)    ;*putfield y
                                                ; - com.logentries.blog.TinyBenchmark::standardBitCount@16 (line 9)


Heh, it’s started generating executable code, but it hasn’t yet compiled the Long.bitCount method, nor has it been inlined into our standardBitCount(). This is actually part of the tiered compilation feature that was introduced in Java 7 (and switched on by default in Java 8) to mitigate longer startup times when the JVM is running in server mode. It’ll generate JIT’ed code earlier but skips the heavier optimization phase until the method proves hot enough. For bonus points, try adding the -XX:-TieredCompilation VM flag to disable it: a thousand iterations should now produce no generated code for you!
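If you want to watch the warmup effect without a disassembler, a rough approach is to time the same work over several rounds and compare the early rounds with the later ones. The sketch below is a minimal hand-rolled harness under my own assumptions: the names (WarmupBenchmark, timeMillis, sink) and the round and iteration counts are arbitrary, and for anything serious a tool like JMH is the better choice.

```java
import java.util.function.LongSupplier;

public class WarmupBenchmark {

    // Every run adds its results here so the JIT can't discard the work as dead code.
    public static long sink;

    /** Runs the work the given number of times and returns the elapsed time in milliseconds. */
    public static double timeMillis(LongSupplier work, long iterations) {
        long acc = 0;
        long start = System.nanoTime();
        for (long i = 0; i < iterations; i++) {
            acc += work.getAsLong();
        }
        long elapsedNanos = System.nanoTime() - start;
        sink += acc;
        return elapsedNanos / 1e6;
    }

    public static void main(String[] args) {
        long[] x = {0}; // mutable counter the lambda can capture
        LongSupplier work = () -> Long.bitCount(x[0]++);

        // Early rounds include interpretation and compilation time;
        // later rounds are closer to steady state.
        for (int round = 0; round < 5; round++) {
            System.out.printf("round %d: %.3f ms%n", round, timeMillis(work, 1_000_000));
        }
    }
}
```

On most setups I’d expect the first round or two to be noticeably slower than the rest, though the exact numbers will vary with your JVM version, flags and hardware.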

Enough machine code already!

Right- like false sharing, you probably shouldn’t let tricks like this dictate design decisions in your code but if you’ve worked with the JVM for a while, it’s great to be aware of some of the compiler optimizations going on downstairs. At Logentries we have a critical set of performance challenges where experience with low-level I/O and CPU behavior is extremely important, so being able to see how a language runtime ties into this is massively beneficial.
