This article is about a new(ish) way to write software that might become popular in future, but probably not quite yet. It has code, I promise!
Way back in the ancient times of 2015 I wrote about why Kotlin is my next programming language and in 2016 I wrote about Graal and Truffle: two radical compiler research projects that are not only significantly accelerating the performance of languages like Ruby, but also making seamless language interop a reality. They work by replacing the JIT compiler of OpenJDK with a new one that understands how to turn annotated interpreters into state of the art JIT compilers … automatically.
Revisiting these topics in 2019 I want to do three things:
- Show you how to use a small library I wrote to near-seamlessly use NPM modules from inside your Java or Kotlin programs.
- Explain all the good reasons why you might want to do that, even if you feel JavaScript/Java is the worst thing since unsliced bread.
- Briefly explore the concept of vertical architecture, a competitor to microservices oriented design. It sits at the intersection of the very latest GraalVM and OpenJDK versions, combined with the very latest hardware.
Using NPM modules from Java and Kotlin
Three easy steps are all it takes:
- Grab GraalVM. It’s a patchset on top of OpenJDK that drops right in; it can run all your existing JVM bytecode.
- Grab my little tool NodeJVM from github and add it to your path.
- Replace `java` in your command line with `nodejvm`. That's it!
Alright alright. I admit it. I exaggerate for effect, something you will have to put up with multiple times to get through this article. Obviously it’s not quite that easy: you’ve still got to pick a module and use it.
Let’s see what that looks like:
[Screenshot: a Kotlin file in IntelliJ containing an embedded, syntax-highlighted JavaScript string]
Study this image carefully. Yes, that is what it looks like: Kotlin with an embedded multiline string containing JavaScript being autocompleted, statically analysed and syntax highlighted correctly. The same thing works from Java or other JVM languages that IntelliJ understands. To get this you need to flip a switch in the IDE settings (consult the NodeJVM readme for instructions), but after that it works automatically. As long as IntelliJ can figure out via dataflow analysis that your string ends up being passed to the run or eval methods, it’ll be treated as embedded JS.
I’m going to discuss the Kotlin API here because it’s a bit nicer and more convenient than the plain Java API, but everything I describe below can be done from Java too.
There are a few features to observe in the above code:
- To access JavaScript I must use a `nodejs {}` block. This is because JavaScript is single threaded, and thus to run NPM modules we must "enter the Node thread". The `nodejs {}` block does this synchronisation for us, regardless of what thread we're on. So we must always remember: to run any JS code at all, we must be inside such a block. We can re-enter as many times as we like, so it's safe to use one of these anywhere you need one. Any callbacks from JavaScript will run on the Node thread and thus block other threads from entering a `nodejs` block until they're done, so if you care about performance or smooth GUI rendering you probably want to avoid long running operations in callbacks.
- The `var x by bind(SomeObject())` syntax is only available inside a `nodejs` block and lets me connect a Kotlin variable to a variable of the same name in JavaScript's global scope. Changing `x` from Kotlin will change it in JS and vice-versa. Here I'm binding an ordinary Java `File` object into JS world.
- The `eval` method returns … whatever we ask it to return, yet in a statically typed way. It's a generic function, and by simply specifying the type of the thing we're assigning it to, `eval` will automatically cast the JavaScript object to a statically typed Java/Kotlin/Scala/etc class or interface. Although you can't see it above, `MemoryUsage` is a simple interface type I've defined that has `rss()` and `heapTotal()` functions; these map to the JavaScript properties of the same name on what you get back from Node's `process.memoryUsage()` API. Most JS types can be cast to 'reasonable' Java types this way; the precise documentation on what works is available from the GraalVM website. The resulting objects can be stored anywhere, although calling methods on them must be done inside a `nodejs` block of course.
- JavaScript objects can also be treated as simple String→Object maps, which is in many ways what they are. Those String→Object maps can, in turn, then be cast back to some more strongly typed thing, as can be seen inside the callback. Use whichever representation you prefer.
- I can use `require` and it will search for modules in the `node_modules` directories in the usual manner.
The code snippet above uses the DAT protocol to join a P2P network that’s loosely comparable to BitTorrent, and then hunt down peers that are serving a particular file. I use DAT as an example because (a) it’s decentralised thus inherently chic, and (b) for better or worse the reference implementation is written in JavaScript. It’s not a program I could have written JS-free in any kind of reasonable timeframe.
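To make those features concrete, here is a minimal, hypothetical sketch along the same lines. It is not the DAT snippet from the screenshot: the `MemoryUsage` interface mirrors the one described above, the bound `File` and the `left-pad` module are purely illustrative, and the exact imports for `nodejs`, `bind` and `eval` are whatever the NodeJVM readme specifies.

```kotlin
// Assumes NodeJVM's entry points (nodejs, bind, eval) are imported;
// see the NodeJVM readme for the exact package name.
import java.io.File

// A small interface whose methods map onto the properties of the JS object
// returned by process.memoryUsage(); eval() casts the result to it.
interface MemoryUsage {
    fun rss(): Long
    fun heapTotal(): Long
}

fun main() {
    // nodejs {} enters the single Node thread; all JS access happens inside it.
    nodejs {
        // Bind a Kotlin variable to a JS global of the same name ("sharedFile").
        var sharedFile by bind(File("example.txt"))
        println(sharedFile.path)

        // eval is generic: the JS result is auto-cast to the declared Kotlin type.
        val usage: MemoryUsage = eval("process.memoryUsage()")
        println("rss=${usage.rss()} heapTotal=${usage.heapTotal()}")

        // require() resolves modules from node_modules in the usual way.
        // left-pad here is just a stand-in for whatever NPM module you need.
        val padded: String = eval("require('left-pad')('5', 3, '0')")
        println(padded)   // "005"
    }
}
```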
You may also do it from Java. The Java API doesn't give you the nice variable binding and auto-casting of the Kotlin API, but it's serviceable enough: you evaluate the JS, cast the result to (say) a Java int and return it "out" of the Node thread. The main Java thread is not the same as the NodeJS thread, but we're switched to and from it seamlessly.
NodeJVM is a very, very small wrapper on top of GraalVM. It adds a tiny amount of code, so don’t worry about it becoming unmaintained or going away: 99.99% of the heavy lifting is done by the GraalVM team.
There are some obvious ideas for improvement:
- Allow JS modules to import Java modules by Maven coordinate.
- Figure out some best practices for Javifying NPM modules. For instance, can a JAR file contain a node_modules directory? (Quick answer: no, because NodeJS still does its own file IO and doesn't know about zips; long answer: yes, with enough work.)
- More languages: Python and Ruby don't need the same kind of thread synchronisation glue that NodeJS does, so you can just use the ordinary GraalVM Polyglot API (a rough sketch follows this list). But for Kotlin users the casting/extension methods and variable binding APIs would be nice to have for any language.
- Windows support.
- Gradle plugin so programs can have mixed language dependency lists.
- Integration with the `native-image` tool, a.k.a. SubstrateVM, so if you don't need the full runtime performance of HotSpot you can ship small statically linked binaries, Golang style.
- Maybe some kind of TypeScript to Java converter, so you can use DefinitelyTyped to get into static-world faster.
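On the "more languages" point, here is a rough sketch of what the stock GraalVM Polyglot API looks like from Kotlin, assuming the Python and Ruby languages have been installed into your GraalVM (for example with `gu install python ruby`). This is plain GraalVM, not NodeJVM.

```kotlin
import org.graalvm.polyglot.Context

fun main() {
    // One context can host several guest languages side by side.
    Context.newBuilder("python", "ruby")
        .allowAllAccess(true)   // broad permissions, fine for a demo
        .build().use { ctx ->
            // Evaluate some Python and walk the resulting list as a polyglot Value.
            val squares = ctx.eval("python", "[x * x for x in range(5)]")
            val ints = (0 until squares.arraySize).map { squares.getArrayElement(it).asInt() }
            println(ints)   // [0, 1, 4, 9, 16]

            // And some Ruby, in the same process.
            println(ctx.eval("ruby", "'hello from ' + RUBY_ENGINE").asString())
        }
}
```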
Patches welcome.
Why would you do this?
Are you thinking, “yay JavaScript, we devs can now live in mutual love, respect and harmony!”
Quite possibly you’ll be looking more like this:
JavaScript and Java aren’t just languages. They’re cultures, and there’s nothing developers love more than CULTURE WAR!
Here’s why you should at least bookmark this page for future reference, even if the idea of $OTHER_LANG invading your precious ecosystem makes you feel violent:
- If you’re primarily a Java developer, you now have easy access to unique JavaScript modules that may have no JVM equivalent (e.g. the DAT protocol). Love it or loathe it, a lot of people are writing open source NPM modules and some of them are pretty good. You can also easily reuse code that runs in your web frontends without needing language transpilers. And if you have a legacy NodeJS codebase you’d like to incrementally port to Java, suddenly it got a lot easier.
- If you’re primarily a JavaScript developer, you now have easy access to unique JVM libraries that may either have no direct equivalent (e.g. Lucene, Chronicle Map) or have only poorly documented, immature or lower performance equivalents. If you’d like to avoid HTML for your next project you can explore using a real man’s GUI framework. You also have access to lots of other languages, like Ruby and R. JVM objects can be shared between NodeJS workers so you can benefit from shared memory multithreading if your profiling shows you could use it. And if you have a legacy Java codebase you’d like to incrementally port to NodeJS, suddenly it got a lot easier.
- If you’re a student of all languages you can now do polyglot programming. Polyglot programmers aren’t haters, they make friends with the best available code whatever culture it comes from. They’re the Renaissance scholars of our time: English, French, Latin … it’s all the same to them. They mix Java, Kotlin, JavaScript, Scala, Python, Ruby, Lisp, R, Rust, Smalltalk, C/C++ and even FORTRAN libraries together into a seamless whole on top of the GraalVM. They don’t give themselves RSI trying to duplicate existing projects, just because they prefer a different language — they spend the time they save on the beach, or in bed making passionate love to their partner.
Finally, even if you are a happy NodeJS user who doesn't care about other languages at all, you may want to experiment with GraalVM anyway.
NodeJS is based on V8, a virtual machine designed for short lived, single-threaded scripts running on desktops and phones. That’s what Google funds, yet V8 is also being used on servers. OpenJDK has been optimised for servers for decades. The latest versions contain ZGC and Shenandoah, two ultra-low latency garbage collectors that let you consume terabytes of memory whilst experiencing pauses of only a few milliseconds. So you might find you can get cost savings by exploiting the GraalVM’s excellent infrastructure and tools, even if you don’t ever leave the monolingual world.
Vertical architecture
There’s one final topic that’s worth discussing.
Sometimes I tell the above to people and they say, "That's great, but don't microservices already give us this? What's the big deal?" It's hard to explain why I find polyglot programming compelling without also explaining why I find microservices architectures in need of competition.
Firstly, yes, there are times when you need lots of servers running lots of services working together. I spent over 7 years at Google and worked with their Borg container orchestrator almost every day. I wrote ‘microservices’ although we didn’t call them that, and I consumed them. There was no alternative because our workloads required thousands of machines!
But these architectures come with heavy costs:
- Serialisation. That imposes performance penalties but, more importantly, requires you to constantly flatten your (at least somewhat) typed and optimised data structures into simple trees. If you use JSON you lose the ability to do basic things, like having lots of small objects point to a few large objects (you'd need to use custom indexes to avoid repetition; a small sketch follows this list).
- Versioning. This is hard. Universities often don’t teach this kind of difficult-but-mundane software engineering discipline, so even if you think you’ve totally nailed the difference between forwards and backwards compatibility, even if you’re sure you understand multi-phase rollouts, can you guarantee everyone who replaces you will? Are you properly integration testing the different version combinations that can occur during a non-atomic rollout? A significant number of the disasters I’ve seen in distributed architectures boiled down to version skew.
- Consistency. Atomic operations inside a single server are quite easy. Making sure users always see a fully consistent view when multiple machines get into the mix, especially if data is sharded amongst them, well — it gets a lot harder. That’s why historically relational database engines didn’t scale well. Again, even if you’re sure you’ve got it 100% nailed, are you sure everyone you hire will, stretching into the future? I’ll give you a hint: Google’s top engineers have spent decades trying to simplify distributed programming for their teams, by making it look more like traditional programming.
- Reimplementation. Because RPCs are expensive you can't do many of them, so for some kinds of task you have no choice but to reimplement code. Google exposes some libraries across languages by putting them behind RPCs, but others had to be recoded from scratch in each language.
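To illustrate the serialisation point, here is a tiny, contrived Kotlin sketch (the `Product` and `Order` types are made up for the example): two orders that share one large product in memory duplicate it as soon as you flatten them into a JSON tree, unless you layer your own index/ID scheme on top.

```kotlin
data class Product(val id: Int, val description: String)
data class Order(val orderId: Int, val product: Product)

fun main() {
    // One large object referenced by many small ones: cheap in memory.
    val shared = Product(1, "x".repeat(1_000))
    val orders = listOf(Order(100, shared), Order(101, shared))
    println(orders[0].product === orders[1].product)   // true: a single instance

    // Hand-rolled JSON purely for illustration (no library assumed). The shared
    // product is inlined into every order, so the payload roughly doubles and
    // the identity of the shared object is lost on the other side.
    val json = orders.joinToString(prefix = "[", postfix = "]") { o ->
        """{"orderId":${o.orderId},"product":{"id":${o.product.id},"description":"${o.product.description}"}}"""
    }
    println(json.length)
}
```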
So what’s the alternative?
Put simply, really big iron. This may sound absurdly retro but consider that the cost of hardware falls constantly, many workloads are not ‘web scale’ and your intuitions about what is reasonable may be out of date.
Here’s a recent price list from a Canadian vendor:
[Image: vendor price list for a 40-core server with a terabyte of RAM]
A machine with 40 cores, a terabyte of RAM and nearly a terabyte of hard disk goes for about $6k these days. The average salary for a software engineer in New York is about $132,000, or roughly $2,500 a week, so one of these machines costs only a couple of weeks of that person's time. Think about how much time your team will spend over the lifetime of your project on distributed systems issues and what that will cost.
Yeah, but isn’t everything web scale these days?
In a word, no.
The world is full of companies characterised by the following attributes:
- They’re in stable markets.
- They charge money for things.
- As a consequence their customer base is somewhere between the low tens and the tens of millions, not billions.
- Their datasets are mostly about their own customers and products.
A good example of this kind of company would be a bank. Banks do not experience ‘hypergrowth’; they don’t go viral. Their growth is modest and predictable, assuming they’re growing at all (banks are regional and usually in saturated markets). The largest bank in the USA has on the order of 50 million users of its website and it’s not going to double within six months. This isn’t Instagram we’re talking about. So it’s not entirely surprising that many banks still have a mainframe somewhere at their core. Of course the same is true of shipping firms, manufacturing firms, etc. The bread and butter of our economy.
In these sorts of businesses it’s plausible that for any particular app their needs fit entirely in a single big machine and always will. Heck, even public websites that give it away can fit these days. In his entertaining 2015 talk “The website obesity crisis” Maciej Cegłowski observes that whilst his own self-hosted bookmarking website was profitable, his competitor hosted on AWS was unprofitable just due to differing hardware costs and complexity assumptions. In Scaling Up vs Scaling Out it was revealed that PlentyOfFish was running on ~one megaserver (the article dates from 2009 so ignore the quoted hardware prices). The author does some calculations and shows it’s not as dumb as it sounds. Finally, in case you’re thinking about Hadoop and Big Data, this Microsoft Research paper from 2013 shows that many Hadoop workloads from Microsoft, Yahoo and Facebook actually run much faster and more efficiently on one big machine than on a cluster. And that was 6 years ago! The economics have probably shifted even more in favour of scale-up since then.
But the real savings don’t come from hardware costs. The real savings come from optimising the ultra-expensive engineering time it takes to build lots of tiny microservices scaling horizontally with elastic demand management. That type of engineering is risky and time consuming even if you use all the latest toys in the cloud. You might lose SQL, solid transactions, unified profiling and you’ll for sure lose things like cross-system stack traces. Type safety will vanish every time you cross server boundaries. You’ll get function calls that can time out, redundant JIT compiler overheads, unexpected loss of backpressure, complex orchestration engines with bizarre pseudo-programmable config formats and … oh my, this really brings back the memories. It was fun to work on these systems when I had Google’s proprietary infrastructure and large engineering budgets to play with, but I’d only do it these days if I had no other choice.
It used to be infeasible to run very large garbage collected servers because GC technology wasn't good enough, so this topic has been academic for long periods of time — you were gonna run multiple servers no matter what so you may as well embrace it. But with the arrival of ZGC & Shenandoah, terabyte+ heaps running on 80 hypercores in a single process become entirely reasonable: your users won't notice any hiccups. Buy a few big boxes, run your business logic on one, a database server on another and see how far you can get.
So do you really need all your microservices? Or would a careful cost/benefit analysis reveal potential simplifications? Vertical architecture is about bringing old school back: saving money, reusing code and accelerating your team’s velocity by combining the latest cross-language compiler technology with the latest hardware … in traditional ways.
Conclusion
NodeJVM is a modest demo of what can be done with the GraalVM. It lets you require NPM modules from Java/Kotlin code, cast JS objects to interfaces, bind Kotlin local variables to JavaScript variables, use callbacks and run JS at similar speeds to V8.
You can use it to pick’n’mix modules, port JS codebases to Java, port Java codebases to JS or just for starting language flamewars with your coworkers.
Language interop is often obtained via microservice architectures, but GCs with max pause times of 4 milliseconds regardless of heap size can be combined with stunningly cheap hardware to enable scale-up to be revisited as a design pattern for many common workloads. The simplicity and tooling benefits can be significant.