Ruby 3 JIT can make Rails faster

I’ve wondered why Rails becomes slower with JIT for a long time. Today, I’m pleased to share my answer to that question in this article, which took me three years to figure out.

RubyKaigi 2018 / The Method JIT Compiler for Ruby 2.6

"MJIT Does Not Improve Rails Performance"

it is still not ready for optimizing workloads like Rails, which often spend time on so many methods and therefore suffer from i-cache misses exacerbated by JIT.

We thought Ruby 3’s JIT, a.k.a. MJIT, does not improve Rails performance: Rails Ruby Bench gets slower with it, not faster. That’s what I observed last year too.

Because of that, and to support environments without a C compiler, Matz is thinking about introducing a JIT that does not invoke an external compiler, such as MIR or YJIT.

Why does Rails become slower on Ruby 3’s JIT?

I mean, I've literally thought about optimizing Rails with MJIT for over three years, and I've introduced several optimizations designed for such workloads. We do rely on GCC to ease our maintenance, but the reason MJIT speeds up some benchmarks is that it implements various Ruby VM-level optimizations that don't really depend on GCC, not merely that it uses GCC. Most of RTL-MJIT's optimizations were ported to, or at least experimented with in, YARV-MJIT after we restructured the code to work with YARV, and YARV-MJIT is now faster than RTL-MJIT thanks to those optimizations and many improvements made since.

Then… why?

Is it because we use C compilers?

First, using a C compiler for code generation makes it hard to implement some low-level techniques, such as calculating a VM program counter from a native instruction pointer, retrieving VM stack values from the native stack, and patching machine code. One such optimization that YJIT implements and MJIT doesn't is dispatching JIT-ed code from direct threading, just like normal VM instructions. I ported the technique to MJIT and even tried generating a whole method with an inline assembler, but it didn't seem to help the situation, and MJIT's dispatch wasn't really slower than that.

Second, there’s overhead in using dlopen as a loader. Whenever you call dlopen, it creates multiple memory pages for different ELF sections, and, more importantly, methods generated by different dlopen calls end up on different pages. We removed the latter problem by periodically recompiling all methods into a single binary, which I call "JIT compaction". Function calls into a shared object are also a bit slow; in fact, Ruby itself becomes a little slower if you compile it with --enable-shared, and the same overhead applies to MJIT, though it's usually just a couple of cycles once the first call has finished dynamic symbol resolution. I also tested loading .o files using shinh/objfcn, but compiling all methods into a single file and using dlopen wasn't slower than loading .o files.
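MJIT does this loading in C, but you can observe the same mechanism from Ruby with Fiddle. The file path and symbol name below are made-up examples, not MJIT's actual output:

```ruby
require 'fiddle'

# Each dlopen call maps separate pages for the .so's ELF sections, so
# methods loaded by different dlopen calls live on different pages.
handle = Fiddle.dlopen('/tmp/_ruby_mjit_p1234u0.so') # hypothetical path
func   = handle['_mjit0'] # hypothetical symbol; returns a raw address
```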

i-cache misses by compiling many methods?

However, Shopify’s YJIT, which compiles every method called more than once, achieved the same performance as the interpreter after compiling 4,000 methods. That doesn't really make sense if you assume that the more methods you JIT, the slower the code runs.

What if we just compile all methods, like YJIT? I had tested that in older versions and it didn't go well, which is why I hadn't tried it again until very recently. But in Ruby 3.0, I implemented an optimization that eliminates most of the code duplication across different methods and significantly reduces code size. With that, could MJIT-generated code perform well just like YJIT's?
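For reference, here's how you can approximate "compile all methods" with the existing MJIT flags; this is a minimal sketch under my assumptions, not the exact setup used for the benchmarks below:

```ruby
# Run as: ruby --jit --jit-min-calls=1 --jit-max-cache=10000 app.rb
#   --jit-min-calls=1  compiles every method on its first call
#   --jit-max-cache    must be large enough to hold all compiled methods
puts RubyVM::MJIT.enabled? # => true when --jit is given
```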

The “compile all” magic

See this gist for benchmark details. Note that I forked some of the benchmarks, for reasons explained later. The following results were measured single-threaded and run at the Rack level.
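As a rough sketch of what "run at the Rack level" means here (the exact harness is in the gist; this is just an assumed minimal version), the benchmark calls the Rack app directly instead of going through a web server:

```ruby
require_relative 'config/environment' # boot the Rails app
require 'rack'

env = Rack::MockRequest.env_for('/')  # build a minimal Rack env
10_000.times do
  status, headers, body = Rails.application.call(env.dup)
  body.close if body.respond_to?(:close) # per the Rack spec
end
```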

Sinatra Benchmark

Guess what? JIT makes it 11% faster instead of slower now.

Rails Simpler Bench

JIT makes it 1.04x faster. The difference is not as significant as Sinatra's, but this is the first time we've seen the JIT make Rails faster, not slower.

Railsbench

Despite its simplicity and convenience as a benchmark, it captures many real-world Rails characteristics; it uses ActiveRecord, for instance.

It's 1.03x faster. Not significant, but it's way better than a slowdown, right?

Discourse

Discourse is clearly a real-world Ruby workload. What matters in this context is that it has its own benchmark harness. Noah Gibbs's Rails Ruby Bench is based on Discourse too.

JIT can make Discourse 1.03x faster! One complication here, however, is that Discourse enables TracePoint, and all JIT-ed code is canceled while a TracePoint is enabled. Because Zeitwerk uses TracePoint, I needed to switch Discourse's autoloader back to the classic autoloader. So the JIT isn't ready to use with Discourse as-is yet, but this proves the JIT is useful for real-world application logic.
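Here's what that cancellation looks like in plain Ruby. As far as I know, the :class event below is what Zeitwerk enables for tracking explicit namespaces; the block body is irrelevant:

```ruby
tp = TracePoint.new(:class) { |t| } # Zeitwerk enables one like this
tp.enable  # as of Ruby 3.0, this cancels all JIT-ed code at once
# ... every method call falls back to the VM interpreter here ...
tp.disable # JIT-ed code is not re-enabled automatically (see below)
```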

Why does compiling everything make it faster?

Let's take a look at how the number of compiled methods impacts the performance with Railsbench.

The benchmark calls roughly 1,200 methods. Pure VM interpretation is faster than most mixed VM/JIT executions, but once you compile everything and spend most of the time in JIT-ed code, it actually becomes faster than the interpreter baseline.

Maxime, the author of YJIT, thought the biggest issue is thrashing between the VM and JIT-ed code: they can evict each other from the i-cache, and the less predictable VM dispatch makes it harder to prefetch code. She also showed with her 30k_methods benchmark that even a large amount of code can run fast as long as execution stays in linear, predictable JIT-ed code. MJIT should share many of these characteristics, even if its code may not be as linear as YJIT's.

I think our significant code size reduction in Ruby 3.0 contributed to making this possible as well.

So, should I use JIT on Rails?

Ruby 3.0.1 bug that's not in 3.0.0

Ruby 3 bug that stops compilation in the middle

Incompatibility with Zeitwerk / TracePoint

However, Byroot has found that having TracePoint enabled doesn't have a significant impact on VM performance nowadays, so the problem only affects the JIT. Given that Zeitwerk disables its TracePoint after eager loading finishes, I'm now thinking about re-enabling all JIT-ed code once every TracePoint is disabled.

Incompatibility with GC.compact

The default value of --jit-max-cache

Scalability of "JIT compaction"

Also, it takes a long time before "JIT compaction" is triggered. See this benchmark result for how long MJIT takes to reach peak performance. If you measure MJIT's performance while GCC is running, or before "JIT compaction" has happened, it will often be slower than the baseline.
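If you want to measure peak performance rather than warmup, you can make MJIT finish its queued compilations before you start timing. A minimal sketch (run it with --jit; the measurement loop is a placeholder):

```ruby
RubyVM::MJIT.pause(wait: true) # block until queued JIT jobs are done
1_000_000.times { 2 + 2 }      # placeholder for your measurement loop
RubyVM::MJIT.resume            # resume compiling new methods afterwards
```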

I hope using MIR instead of GCC will shorten the warmup time in the future.

Next steps

In addition to approaching the problems discussed above, I'm thinking about working on the following things, which have changed slightly since the previous article.

Ruby-based JIT compiler

However, I found that running multiple Ractors has overhead that running only the default Ractor doesn't. So now I'm thinking of continuing to maintain the JIT worker in C, but starting and stopping a Ractor whenever a JIT compilation happens, so that we don't pay the multi-Ractor overhead once peak performance is reached. We'd first need to check that this overhead is insignificant compared to GCC's compilation time, though.
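A minimal sketch of that idea in Ruby; the real worker would stay in C, and compile_to_so is a hypothetical stand-in for the GCC invocation:

```ruby
def compile_to_so(c_path)
  # Hypothetical: the real worker invokes GCC here and returns the
  # path of the resulting shared object.
  c_path.sub(/\.c\z/, '.so')
end

def jit_compile(c_path)
  # Start a Ractor only while a compilation is running, so the
  # multi-Ractor overhead disappears once we stop compiling.
  worker = Ractor.new(c_path) { |path| compile_to_so(path) }
  worker.take # after this returns, the VM is back to a single Ractor
end

jit_compile('/tmp/_ruby_mjit_p1234u0.c') # hypothetical file name
```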

Faster deoptimization

One of the reasons we don't support TracePoint and GC.compact is to avoid adding too many deoptimization branches to the generated code. If deoptimization added no extra overhead, it would be easier to introduce that support.

Lazy stack frame push

Sponsors

https://github.com/sponsors/k0kubun