C interpreters underlie many of our most widely used language implementations -- but they're slow. Wouldn't it be great if we could turn them into JIT-compiling VMs? This video shows what happens when we do just that to the normal Lua VM (first) and "yklua" (Lua w/ JIT, second).
-
This isn't just a technique for Lua, though -- it works for any C interpreter compilable with LLVM! More about how and why in this new post, 'Retrofitting JIT Compilers into C Interpreters', which looks at our new 'yk' system: https://tratt.net/laurie/blog/2026/retrofitting_jit_compilers_into_c_interpreters.html
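As a rough sketch of the idea -- this is not yk's actual API, and `control_point` and the opcode names here are hypothetical stand-ins -- the retrofit amounts to adding a hook inside an otherwise ordinary C dispatch loop. In a real meta-tracing system that hook profiles hot locations, records traces, and jumps into compiled code; here it only counts:

```c
/* `control_point` stands in for the hook a meta-tracing JIT asks you to
 * insert at the interpreter's loop head. In a real system it would
 * profile, trace, and transfer control to compiled code; here it just
 * counts how often it is reached. */
static unsigned long control_point_hits = 0;
static void control_point(void) { control_point_hits++; }

enum { OP_INC, OP_JLT, OP_HALT };

/* An ordinary C bytecode dispatch loop, unchanged apart from the hook. */
int run(const int *prog) {
    int acc = 0, pc = 0;
    for (;;) {
        control_point(); /* the only retrofit: one hook per dispatch */
        switch (prog[pc]) {
        case OP_INC:
            acc++; pc++;
            break;
        case OP_JLT: /* jump to prog[pc+2] while acc < prog[pc+1] */
            if (acc < prog[pc + 1]) pc = prog[pc + 2]; else pc += 3;
            break;
        case OP_HALT:
            return acc;
        }
    }
}
```

For the program `{OP_INC, OP_JLT, 5, 0, OP_HALT}`, `run` loops until `acc` reaches 5; the point is that the interpreter's structure is untouched, so the technique applies to any such loop the compiler can see.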
-
I want to say a big thanks to Shopify and the Royal Academy of Engineering who graciously funded this research. I'd like to dedicate this work to the late Chris Seaton, who was an early champion of yk: he is much missed by me and many others.
-
This hopefully makes the trade-off yk offers clear: yklua does not reach the performance peaks of the wonderful, carefully hand-written LuaJIT.
For what it's worth:
In igk, I use sol3, which lets you select the Lua implementation as a build-time option. I don't use any fancy new Lua features in this (64-bit integers are really important for some other things where I've looked at Lua, but not for igk), so I tried both Lua and LuaJIT. There wasn't much difference in terms of performance, though LuaJIT was actually a bit slower than the interpreter.
My guess is that this is primarily because FFI is slower with LuaJIT and my code did a lot of FFI (basically everything it's doing is calling back into C++ to manipulate the text tree).
I presume that yklua uses exactly the same memory layout as the C version, so I'd expect it to be better here.
This is also a problem with a lot of Python JITs: if you make Python faster but make CPython-compatible FFI slower, you generally make Python programs slower.
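The FFI cost being described can be sketched with a toy dispatch loop (illustrative plain C -- not sol3 or the real Lua API; the opcodes and `HostFn` are made up). When an opcode calls out through an opaque host-function pointer, the interpreter must spill its cached working state to the shared state struct before the call and reload it afterwards, because the callee may read or write it. That spill/reload is exactly the optimisation barrier a JIT hits at every FFI boundary:

```c
/* Shared interpreter state, visible to host (e.g. C++) callbacks. */
typedef struct { long acc; } VMState;

/* Opaque host callback: the VM must assume it can read or write any
 * part of the state, so cached values must be written back first. */
typedef void (*HostFn)(VMState *);

/* Stand-in for host-language code manipulating VM-visible state. */
static void host_bump(VMState *st) { st->acc += 1; }

enum { OP_ADD, OP_CALLC, OP_HALT };

long run_with_ffi(const int *prog, HostFn host) {
    VMState st = { 0 };
    long acc = 0;              /* acc cached "in a register" */
    int pc = 0;
    for (;;) {
        switch (prog[pc++]) {
        case OP_ADD:
            acc += prog[pc++]; /* pure VM work: stays in the register */
            break;
        case OP_CALLC:
            st.acc = acc;      /* spill cached state before the FFI call */
            host(&st);
            acc = st.acc;      /* reload: the host may have changed it */
            break;
        case OP_HALT:
            return acc;
        }
    }
}
```

In a JIT, those spills and reloads bracket every foreign call and block optimisations across it, so an FFI-heavy workload can see little or no speedup.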
-
@ltratt the videos in your post are not working on my phone
(Cool work!)
-
@llimllib Which browser? They work on my Android phone's browsers, but video compatibility beyond that is a bit of an unknown to me.
-
There's another approach that's worth mentioning, popularised by Apple's old shader JIT, which looks like a more ad-hoc version of what you've built.
Each operation is written as a function that takes a pointer to the interpreter state and updates it. The interpreter is then a big switch statement calling these functions. These typically all get inlined, so you end up with one massive function that runs in a loop.
To build the JIT, you compile those individual functions to LLVM IR, then JIT compile a function equivalent to the sequence of calls for a run of bytecodes. The normal LLVM optimisers can then inline small or infrequently-used opcode bodies, and optimise across the whole program (or whole function, trace, or whatever else you want to JIT). The JIT'd code operates on the same interpreter state (though it may update it only at the end of a trace - apparently marking it as not-aliasing-anything gets you around 10% extra performance), so you can JIT whatever size fragment makes sense.
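A minimal sketch of that pattern (all names hypothetical, and with a hand-written function standing in for what the JIT would emit): opcode bodies as functions over a state pointer, an interpreter that is just a switch calling them, and a straight-line function equivalent to one bytecode sequence. Compiled as a single unit, LLVM can inline all the bodies in the straight-line version and optimise across the whole sequence:

```c
typedef struct { double x; int pc; } State;

/* Each opcode body is a plain function over the interpreter state. */
static void op_add1(State *s)   { s->x += 1.0; s->pc++; }
static void op_double(State *s) { s->x *= 2.0; s->pc++; }

enum { ADD1, DOUBLE, HALT };

/* The interpreter: a big switch calling the opcode bodies in a loop. */
double interpret(const int *code) {
    State s = { 0.0, 0 };
    for (;;) {
        switch (code[s.pc]) {
        case ADD1:   op_add1(&s);   break;
        case DOUBLE: op_double(&s); break;
        case HALT:   return s.x;
        }
    }
}

/* Stand-in for the JIT's output for the sequence [ADD1, DOUBLE, ADD1]:
 * the same opcode bodies, called straight-line over the same state.
 * Once inlined, the per-opcode dispatch and call overhead disappears. */
double jitted_add1_double_add1(void) {
    State s = { 0.0, 0 };
    op_add1(&s);
    op_double(&s);
    op_add1(&s);
    return s.x;
}
```

Both paths compute the same result from the same state layout, which is what lets the JIT'd fragment be entered from, and fall back to, the interpreter at any bytecode boundary.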
-
@ltratt iPhone, I’m not sure how to get error output so this is a terrible bug report
-
@david_chisnall I assumed that LuaJIT did quite a good job with FFI performance (the API it defined has spread more widely), but I haven't benchmarked it! That said, there are some heuristics in LuaJIT that do not always play well with real-world code.
yklua will just do whatever PUC Lua does, but it will probably inline right up until the FFI call, which might help. That said, right now, you can still hit missing bits that tank performance in any yk interpreter, so it's difficult to say!
-
@david_chisnall A very early prototype of yk used LLVM for these purposes, but the compilation performance was awful (from memory something like 1000x worse than we needed). It's not really LLVM's fault though: we were feeding it an input it never expected to see. [We also encountered multiple threading bugs, but I imagine those have been fixed in the interim.]