JIT Compilation: Understanding and Implementation

(Transferred from my personal blog; original post: "JIT Compilation: Understanding and Implementation".)

This article introduces JIT compilation, a fundamental compilation technique, and shows how to quickly build a simple JIT compiler in C++.

About a year ago, the idea of "writing an article on how a JIT compiler works" landed on my TODO list, where it sat untouched; now I finally have time to put it into action. Through this article, I hope you will learn the following:

  • What is JIT compilation, and what are its characteristics?
  • How can we implement a JIT compiler in C++ without relying on any framework?

However, due to space and subject scope, this article will not cover the following:

  • How to write a complete Interpreter/Compiler?
  • Advanced compiler optimization techniques.

Since writing a JIT compiler involves knowledge from many areas, from high-level programming languages such as C/C++ to assembly, computer architecture, and operating systems, I will assume that readers already have basic knowledge in these fields. Where the article actually touches on such content, I will also give a brief introduction.

In the example explained later in this article, for the sake of completeness and to make benchmarking easy, we will implement a simple JIT compiler for a real programming language called Brainfuck. We will also implement a corresponding interpreter for it, so that we can compare the overall code-execution efficiency of JIT compilation against interpretation. For the implementation details of the interpreter, please refer to the source code in the repository where the example lives; due to space limitations, this article will not go into them. Before we officially start, here are a few things you need to know:

  • Code repository: github.com/Becavalier/...
  • The JIT compiler we build targets x86-64 and runs on macOS and Linux;
  • Since the compiler implementation involves a lot of code, this article introduces it selectively; please refer to the repository above for the complete code.

Okay, let's get started.

Brainfuck programming language

Brainfuck is, as its name suggests, a very unusual programming language. It was created by Urban Müller in 1993. The word "brainfuck" itself is a slang term for something so complex and outlandish that it exceeds comprehension, and that is a fair description of the language. For example, the following Brainfuck code, when executed, prints the characters "Hello World!" to the console.

```
++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.
```

As you can see, inspecting the code with the naked eye tells us nothing about the program's intent, which is exactly what the language's name alludes to. Even so, Brainfuck is a genuinely Turing-complete, minimalist programming language. It consists of only 8 instructions, and every Brainfuck program is some sequence of these 8 instructions, executed in order at runtime. Brainfuck's execution model is equally simple: besides the instruction sequence, a running program maintains a one-dimensional byte array of at least 30,000 cells (the "paper tape" for short). When execution begins, all cells are initialized to 0, and a movable "data pointer" points at the first cell. As the program runs, the instructions move the data pointer back and forth and update or read the cell it currently points to. The 8 instructions, their characters, and their meanings are as follows:

| Character | Meaning |
| --- | --- |
| > | Increment the data pointer (move right to the cell after the current one) |
| < | Decrement the data pointer (move left to the cell before the current one) |
| + | Increment the byte in the cell the data pointer points to |
| - | Decrement the byte in the cell the data pointer points to |
| . | Output the byte stored in the cell the data pointer points to |
| , | Read one byte of input and store it in the cell the data pointer points to |
| [ | If the byte at the data pointer is 0, jump forward to the instruction after the matching "]" |
| ] | If the byte at the data pointer is nonzero, jump back to the instruction after the matching "[" |

To deepen our understanding, consider a simple example such as the following Brainfuck code.

```
++[->+<]
```

This code first increments the value in the first cell of the tape twice in a row (++), making it 2. Then the [ instruction checks that the value in the current cell is not 0 (it is 2), so execution continues with the next instruction. The next four instructions, ->+<, first decrement the value in the current cell by one, then move the data pointer right to the second cell, increment the value in that cell by one, and finally move back to the first cell, and so on, until the final ] determines that the value in the first cell is 0 and the program ends. So this program transfers the value in the first cell of the tape to the second cell. You can also use Brainfuck Visualizer to watch the complete, step-by-step execution of this program.
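To make this execution model concrete, here is a minimal sketch of a switch-based Brainfuck interpreter in C++. It is illustrative only (the interpreter in the example repository is structured differently) and assumes a well-formed program:

```cpp
#include <cstdio>
#include <cstdint>
#include <string>
#include <vector>

// A minimal switch-based Brainfuck interpreter, for illustration only.
void interpret(const std::string& code) {
  std::vector<uint8_t> tape(30000, 0);  // the "paper tape", zero-initialized.
  size_t dp = 0;                        // the data pointer.
  for (size_t pc = 0; pc < code.size(); ++pc) {
    switch (code[pc]) {
      case '>': ++dp; break;
      case '<': --dp; break;
      case '+': ++tape[dp]; break;
      case '-': --tape[dp]; break;
      case '.': std::putchar(tape[dp]); break;
      case ',': tape[dp] = static_cast<uint8_t>(std::getchar()); break;
      case '[':
        if (tape[dp] == 0) {  // skip forward past the matching ']'.
          for (int depth = 1; depth > 0;) {
            ++pc;
            if (code[pc] == '[') ++depth;
            else if (code[pc] == ']') --depth;
          }
        }
        break;
      case ']':
        if (tape[dp] != 0) {  // jump back to just after the matching '['.
          for (int depth = 1; depth > 0;) {
            --pc;
            if (code[pc] == ']') ++depth;
            else if (code[pc] == '[') --depth;
          }
        }
        break;
      default: break;  // any other character is treated as a comment.
    }
  }
}

int main() {
  interpret("++[->+<]");  // transfers the value of cell 0 into cell 1.
}
```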

JIT Compilation

Now that we know our target language, let's look at what JIT compilation actually is. Whether you work on front end, back end, or mobile, you have surely heard the term "JIT". JIT compilation is short for "Just-In-Time compilation", and its most notable feature is that the compilation of code happens while the program is executing, rather than before execution. In the field of compilation technology, JIT is usually contrasted with AOT (Ahead-Of-Time) compilation, which everyone is familiar with: common examples include compiling C/C++ code with Clang, compiling ES6 code with Babel, and even compiling JavaScript code into a JS engine's internal IR (Intermediate Representation); all of these can be regarded as particular kinds of AOT compilation. The biggest difference between JIT and AOT is "the point in time at which compilation happens": for JIT, compilation happens at program runtime; for AOT, it happens before the program executes (usually at build time).

Before a traditional JIT compiler actually generates machine code dynamically, it first performs a series of profiling passes over the original code or its corresponding IR. Through this analysis, the compiler finds the "critical code paths" whose performance can be improved by JIT compilation. The key trade-off is that the performance gained by optimizing this code at runtime must outweigh the overhead incurred by the optimization itself. As we will see later, for our example these overheads mainly come from compiling the code at runtime and from the process of OSR (On-Stack Replacement). To keep things easy to follow, the examples in this article do not implement the up-front code analysis that a traditional JIT performs.

Also note that, to cope with the startup-delay problem (the "Startup Time Delay" issue), a JIT compiler is generally used in conjunction with an interpreter. The finer the JIT compiler's code analysis and the more optimizations it applies, the higher the quality of its dynamically generated machine code, but the longer the initial delay before that code executes. Adding an interpreter lets code execution begin right away; meanwhile, the JIT compiler analyzes and optimizes the code in parallel, and at a certain moment execution switches from interpretation to the dynamically generated, optimized machine code. So for JIT compilation itself, one of the key points to balance in an implementation is the trade-off between compilation time and the quality of the generated code. For example, the JVM offers two selectable JIT modes, client and server: the former uses minimal compilation and optimization options to minimize startup delay, while the latter uses maximal compilation and optimization strategies at the cost of the program's startup time.

Implementation details

OK, time for the implementation :). First of all, the JIT compiler we implement for the Brainfuck language serves only as a POC for this article's content and does not aim for production-level completeness in areas such as exception handling, thread safety, profiling, or assembly fine-tuning. Second, the implementation details introduced below focus on three parts of the source code: the function bfJITCompile, the function allocateExecMem, and the class VM. I recommend browsing the source code yourself before reading on.

As we said above, in JIT compilation the code is compiled at program runtime, so you can also see in the source code that a command-line argument (--jit) supplied when running the program decides between JIT compilation-and-execution and direct interpretation. For the "JIT compilation and execution" path, the process can be roughly summarized as follows (a toy end-to-end version of this pipeline follows the list):

  • Read in the source code (including the instruction sequence in ASCII form);
  • Call the bfJITCompile function to compile the source code into machine code ;
  • Call the allocateExecMem function to dynamically allocate the machine code on the executable memory segment ;
  • Call the VM::exec function to transfer the execution flow through OSR ;
  • After the code is executed, transfer back to the main process again;
  • Perform some clean-up work.
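The following self-contained toy program (not from the repository) walks the same steps 2 through 6 in miniature, except that its "compiled" code is a hand-written x86-64 function that simply returns 42:

```cpp
#include <cstdio>
#include <cstdint>
#include <cstring>
#include <vector>
#include <sys/mman.h>

int main() {
  // Step 2 (simplified): "compile" to machine code by hand.
  std::vector<uint8_t> code = {
    0xb8, 0x2a, 0x00, 0x00, 0x00,  // movl $42, %eax
    0xc3,                          // retq
  };
  // Step 3: allocate executable memory (same idea as allocateExecMem below).
  void* mem = mmap(NULL, code.size(), PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) return 1;
  std::memcpy(mem, code.data(), code.size());
  // Steps 4-5: transfer execution into the generated code and back.
  int result = reinterpret_cast<int (*)()>(mem)();
  std::printf("%d\n", result);  // prints 42.
  // Step 6: cleanup.
  munmap(mem, code.size());
}
```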

Next, we will focus on the specific implementation details of the second, third, and fourth items in the above process.

Compiling machine code

In this step, we "extract" the instruction characters from the Brainfuck source code supplied at program startup and, one by one, generate the corresponding binary machine code for them. The generated machine code is accumulated in a std::vector object for later use. To simplify code generation, we identify each instruction character with a switch statement and "return" the x86-64 machine code corresponding to that instruction; the returned bytes are then "spliced" directly onto the vector container that holds the machine-code sequence.
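The splicing step itself is tiny. A plausible definition of the _appendBytecode helper that appears in the snippets below (the repository's actual helper may differ) is:

```cpp
#include <cstdint>
#include <vector>

// Append one instruction's bytes to the accumulated machine code.
// A plausible definition; the repository's helper may differ.
static void _appendBytecode(const std::vector<uint8_t>& byteCode,
                            std::vector<uint8_t>& machineCode) {
  machineCode.insert(machineCode.end(), byteCode.begin(), byteCode.end());
}
```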

Note that, because the returned machine code may contain relative address references (RIP-relative), before it is actually stored in the vector container we also need methods such as _relocateAddrOfPrintFunc to perform "address relocation" on it. Through these methods we can compute the actual values of these relative addresses and patch them into the code.

First, in the definition of the bfJITCompile function we find the following code. It stores the address of the Brainfuck "data pointer" in the register rbx, so that later we can modify or use the value in this register to move the data pointer, or to read and modify the contents of the tape cell it currently points to. The comment "/* mem slot */" marks a position that will be replaced at compile time with the actual referenced memory address: the value of bfState::ptr, converted to little-endian byte order by the helper _resolvePtrAddr.

```cpp
// prologue.
std::vector<uint8_t> machineCode {
  // save dynamic pointer in %rbx.
  0x48, 0xbb, /* mem slot */
};
// ...
```
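For reference, a helper like _resolvePtrAddr only needs to split a 64-bit address into its 8 little-endian bytes. A plausible sketch (an assumption, not copied from the repository):

```cpp
#include <cstdint>
#include <vector>

// Split a 64-bit address into 8 little-endian bytes so it can be embedded
// right after the movabs opcode above. A plausible sketch only.
static std::vector<uint8_t> _resolvePtrAddr(uint64_t addr) {
  std::vector<uint8_t> bytes;
  for (int i = 0; i < 8; ++i) {
    bytes.push_back(static_cast<uint8_t>(addr >> (i * 8)));
  }
  return bytes;
}
```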

Next, as the instruction characters are read in one after another, the bfJITCompile function converts each of them to its corresponding machine-code version. For the four instructions "+ - > <", the corresponding machine instructions only need to manipulate the data-pointer address we previously stored in the rbx register (or the byte it points to) to effect the state change of the Brainfuck abstract machine. Taking the "+" instruction as an example, we find the following code:

```cpp
// ...
case '+': {
  for (n = 0; *tok == '+'; ++n, ++tok);
  const auto ptrBytes = _resolvePtrAddr(ptrAddr);
  std::vector<uint8_t> byteCode {
    0x80, 0x3, static_cast<uint8_t>(n),  // addb $N, (%rbx)
  };
  _appendBytecode(byteCode, machineCode);
  --tok;
  break;
}
// ...
```

This code applies an optimization that is easy to think of: when consecutive "+" instructions are encountered, instead of generating the same machine code once per "+", we first count how many consecutive "+" instructions there are, and then emit a single assembly instruction, addb $N, (%rbx), that applies all of their state changes at once. The same technique applies to the other three instructions, which respectively change the value in the cell the data pointer points to or the value of the data pointer itself.

As for the "," and "." instructions, since they involve I/O, the corresponding machine code performs an operating-system call (System Call). System calls follow a specific calling convention (Calling Convention): generally speaking, the register rax holds the system call number, rdi the first argument, rsi the second, rdx the third, and so on. The system call numbers also differ between macOS and Linux, so we use preprocessor macros to distinguish the two.

```cpp
// ...
case ',': {
  /**
    movl $0x2000003, %eax
    movl $0x0, %edi
    movq %rbx, %rsi
    movl $0x1, %edx
    syscall
  */
  std::vector<uint8_t> byteCode {
#if __APPLE__
    0xb8, 0x3, 0x0, 0x0, 0x2,
#elif __linux__
    0xb8, 0x0, 0x0, 0x0, 0x0,
#endif
    0xbf, 0x0, 0x0, 0x0, 0x0,
    0x48, 0x89, 0xde,
    0xba, 0x1, 0x0, 0x0, 0x0,
    0xf, 0x5,
  };
  _appendBytecode(byteCode, machineCode);
  break;
}
// ...
```

Finally, the "[" and "]" instructions are a bit more involved. Take "[" as an example, shown in the code below. Its semantics map directly to assembly: check whether the value in the cell the data pointer points to is 0; if so, jump to the instruction after the matching "]", otherwise fall through to the next instruction. So we use a cmpb instruction to compare the byte at the address held in rbx against 0, and a je instruction to jump forward when it is 0. The jump target is not known yet at this point: it will be filled in when the matching "]" instruction is processed, so for now we emit four consecutive 0x0 bytes as a placeholder. Also note that, to simplify the implementation, we always use the "near jump" form here.

```cpp
// ...
case '[': {
  /*
    cmpb $0x0, (%rbx)
    je <>
  */
  std::vector<uint8_t> byteCode {
    0x80, 0x3b, 0x0,
    0xf, 0x84, 0x0, 0x0, 0x0, 0x0,  /* near jmp */
  };
  // record the jump relocation pos.
  _appendBytecode(byteCode, machineCode);
  jmpLocIndex.push_back(machineCode.size());
  break;
}
// ...
```
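The matching "]" case is not shown here, but its shape follows from the above: emit a cmpb plus a jne back to the instruction after the "[", then patch the four placeholder bytes whose position was recorded in jmpLocIndex. The following is only a sketch of what that can look like, reusing the names from the snippet above; the repository's actual handling may differ:

```cpp
// ...
case ']': {
  /*
    cmpb $0x0, (%rbx)
    jne <back to just after the matching '[' check>
  */
  std::vector<uint8_t> byteCode {
    0x80, 0x3b, 0x0,
    0xf, 0x85, 0x0, 0x0, 0x0, 0x0,  // jne, rel32 patched below.
  };
  _appendBytecode(byteCode, machineCode);
  const size_t openPos = jmpLocIndex.back();  // byte right after the '[' je.
  jmpLocIndex.pop_back();
  const size_t endPos = machineCode.size();   // byte right after this jne.
  const auto backOff = static_cast<int32_t>(
      static_cast<int64_t>(openPos) - static_cast<int64_t>(endPos));
  const auto fwdOff = static_cast<int32_t>(
      static_cast<int64_t>(endPos) - static_cast<int64_t>(openPos));
  for (int i = 0; i < 4; ++i) {
    // Patch this jne to jump back, and the '[' je placeholder to jump here.
    machineCode[endPos - 4 + i] = static_cast<uint8_t>(backOff >> (i * 8));
    machineCode[openPos - 4 + i] = static_cast<uint8_t>(fwdOff >> (i * 8));
  }
  break;
}
// ...
```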

At this point, the dynamic compilation of machine instructions is complete: our program can now convert an input sequence of Brainfuck instruction characters into the corresponding platform-specific binary machine code. At the end of the bfJITCompile function you can see the following closing code. It is mainly used to output the contents of the stdout buffer before the program exits and to reset rip so that the execution flow returns to the C++ code. We will come back to this part later.

```cpp
// epilogue.
// mainly restoring the previous pc, flushing the stdout buffer.
/**
  cmpq $0, %r11
  je 8
  callq <print>
  jmpq *(%rsp)
*/
std::vector<uint8_t> byteCode {
  0x49, 0x83, 0xfb, 0x0,
  0x74, 0x8,
  0xe8, /* mem slot */
  0xff, 0x24, 0x24,
};
```

Executable memory allocation

Next, our focus shifts from "how to dynamically generate machine code" to "how to dynamically execute machine code". For this part of the implementation, refer to the function named allocateExecMem, shown in the following code.

```cpp
// ...
uint8_t* allocateExecMem(size_t size) {
  return static_cast<uint8_t*>(
    mmap(
      NULL,
      size,
      PROT_READ | PROT_WRITE | PROT_EXEC,
      MAP_PRIVATE | MAP_ANONYMOUS,
      -1,
      0));
}
// ...
```

The call to the function named mmap in this definition is the key to supporting "dynamic execution of machine code". mmap is a POSIX system call (exposed through the C library) with which we can create a mapping in the VAS (Virtual Address Space) of the current process; the mapping can be backed by a specific file or by an anonymous region. One of mmap's best-known uses is memory-mapped file I/O, where the operating system points page-table entries at the appropriate locations of a target file. Here, instead, we use the function to create an "anonymous" mapping that is not backed by any actual file, and we copy the binary machine code compiled in the previous step into this memory region.

Moreover, by specifying the PROT_EXEC attribute in mmap's third parameter, we mark the requested anonymous memory region as "executable", which means the machine instructions stored in it may be executed by the CPU. For detailed configuration information on the function's other parameters, refer to the mmap man page. The actual call to allocateExecMem is placed in the constructor of the VM class, where we simply encapsulate the allocation and release of the resource through RAII.
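A minimal sketch of that RAII shape (hypothetical; the repository's VM class holds more state than shown here):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>
#include <sys/mman.h>

uint8_t* allocateExecMem(size_t size);  // as defined above.

// Hypothetical RAII wrapper mirroring the idea; the real VM class also
// holds the tape, the stdout buffer, and other runtime state.
class VM {
 public:
  explicit VM(const std::vector<uint8_t>& code)
      : size_(code.size()), mem_(allocateExecMem(code.size())) {
    std::memcpy(mem_, code.data(), size_);  // copy the generated code in.
  }
  ~VM() { munmap(mem_, size_); }            // release on destruction.
 private:
  size_t size_;
  uint8_t* mem_;
};
```

Note that mapping a region both writable and executable is convenient for a demo, but production JITs usually alternate between PROT_WRITE and PROT_EXEC via mprotect (the W^X policy) for security.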

OSR (On-Stack Replacement)

After the compiled binary machine code has been placed in the executable anonymous memory segment, the next key question is: how do we transfer the program's instruction execution flow to the start of this memory? For this part of the implementation we need the "C++ inline assembly" feature provided by the Clang/GCC compilers. You can find the answer in the implementation of the VM::exec function, shown below:

```cpp
// ...
void exec() {
  // save the current %rip on stack (by PC-relative).
  // %r10 - stdout buffer entry.
  // %r11 - stdout buffer counter.
  asm(R"(
    pushq %%rax
    pushq %%rbx
    pushq %%r10
    pushq %%r11
    pushq %%r12
    movq %1, %%r10
    xorq %%r11, %%r11
    lea 0xe(%%rip), %%rax
    pushq %%rax
    movq %0, %%rax
    addq %2, %%rax
    jmpq *%%rax
  )" :: "m" (mem), "m" (stdoutBuf), "m" (prependStaticSize));
  // clean the stack.
  asm(R"(
    addq $8, %rsp
    popq %r12
    popq %r11
    popq %r10
    popq %rbx
    popq %rax
  )");
}
// ...
```

In this code we use the asm keyword twice. The purpose of the first inline-assembly block is to transfer the program's execution flow to the machine code we generated earlier. The first 5 push instructions save the values of these registers on the stack, preserving the register state at this moment; the values are restored after the execution flow returns to the C++ code. The movq instruction on line 6 stores the base address of the stdout buffer in register r10; this buffer accumulates the characters output by the "." instruction in order to reduce the number of actual system calls and improve performance. On lines 8-9 we push the address of the next instruction of the normal C++ execution flow onto the stack, so that we can later return normally from the dynamic execution flow. On lines 10-11 we compute the address of the anonymous executable memory segment plus the appropriate offset (skipping past the static subroutine definitions at its start). Finally, the jmpq instruction makes the CPU jump to the address held in the rax register, that is, to the location of the first dynamic instruction we want to execute.
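The buffering idea itself is simple. In C++ terms, the generated <print> subroutine amounts to something like the following (an illustrative equivalent only; the real subroutine is hand-written machine code that takes the buffer base from %r10 and the count from %r11):

```cpp
#include <cstddef>
#include <unistd.h>

// C-level equivalent of the generated <print> subroutine: flush the
// characters accumulated by '.' with a single write(2) system call,
// instead of one syscall per output character.
void flushStdoutBuffer(const char* buf, size_t count) {
  if (count > 0) {
    write(STDOUT_FILENO, buf, count);  // one syscall for many '.' outputs.
  }
}
```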

At this point, the transfer of execution from C++ code into the dynamic instructions is complete. Once all the dynamically generated instructions have executed, we need to transfer the execution flow back to normal C++ code in a similar way. Remember the small piece of "epilogue" assembly mentioned at the end of the "Compiling machine code" section? Looking back at it, you will find that its last instruction, jmpq *(%rsp), transfers the CPU's execution flow to the address stored as a qword at the top of the stack, and that value is exactly the C++ return address we pushed in the previous step. When the execution flow returns to the C++ code, we reach the second asm block, which cleans up the stack and restores the state of the saved registers. With that, the program's execution flow is essentially complete.

Let us turn our attention back to this section's topic, "OSR". OSR stands for "On-Stack Replacement", and with the help of Google we can find a definition of it:

On-stack-replacement (OSR) describes the ability to replace currently executing code with a different version, either a more optimized one (tiered execution) or a more general one (deoptimization to undo speculative optimization).

In fact, we can simply understand OSR as "the process of transitioning from one execution environment to another". In our example, when VM::exec runs, it transfers the execution environment from the C++ code to the dynamically generated machine code, and finally transfers it back the same way; each of these transitions can be regarded as an instance of OSR. The figure below visualizes this OSR process.

Benchmark

So far, we have covered the implementation details of the key pieces of our Brainfuck JIT compiler. So how does this rough JIT compiler actually perform? The project's source code provides two test suites, covering an "IO-intensive" scenario and a "compute-intensive" scenario. One set of results is shown below:

  • IO-intensive case:

```
Benchmark for 10 seconds: (higher score is better)
12950    interpreter
35928    jit (win)
```

  • Compute-intensive case:

```
Benchmark Result: (lower time is better)
13.018s    interpreter
0.885s     jit (win)
```

As you can see, the overall results are quite good. For the IO-intensive case, JIT compilation delivers nearly 3x the performance of pure interpretation. For the compute-intensive scenario, the improvement is even more pronounced: in our "printing the Mandelbrot set" test case, JIT compilation runs about 15x faster than interpretation. Of course, since we did not adopt a rigorous test suite or methodology, these results are for reference only.

More information

Next, let's touch on a few additional topics. Each of them could be expanded into a complete article of its own, so what follows is only a brief extension; for more information, please search on your own.

The Death of Interpretation

"Branch misprediction" is arguably one of the most important reasons interpreters run slowly. Take, for example, the switch-based interpreter implemented in this article's example source code. This interpreter model reads one instruction character at a time from the input source and then updates the interpreter's current state (data pointer, tape contents, and so on) according to that instruction. The problem with this approach is that, at the macro level, when the CPU executes the switch dispatch it cannot know which instruction will arrive next, so branch prediction keeps failing; at the micro level, every unpredictable or mispredicted branch wastes CPU clock cycles (stalling for results, or discarding wrongly speculated work and refilling the pipeline). After a large number of instructions have executed, the latency introduced by these pipeline hazards becomes significant.

Interpreter models such as "Direct Threading" and "Subroutine Threading" handle the branch-prediction problem better, but they bring issues of their own, such as heavy use of jmp instructions and useless stack frames (no inlining), which also significantly limit interpretation performance; a small sketch of direct-threaded dispatch follows below. By contrast, JIT compilation easily avoids these problems through basic strategies such as dynamic machine-code generation and inlining. Moreover, some JIT compilers can even achieve higher runtime performance than AOT-compiled code, mainly because at runtime a JIT can optimize code more dynamically and heuristically based on the actual operating system, the CPU ISA, and code-profiling results.
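To make "Direct Threading" concrete, here is a minimal sketch using the computed-goto extension of GCC/Clang. It dispatches a tiny made-up bytecode rather than Brainfuck (the opcode set and names are invented for illustration); the point is that each handler ends with its own indirect jump, giving the branch predictor many distinct branches to learn instead of one shared switch branch:

```cpp
#include <cstdio>
#include <cstdint>

// Minimal direct-threaded dispatch using the GCC/Clang computed-goto
// extension. Each handler dispatches directly to the next one.
enum Op : uint8_t { OP_INC, OP_PRINT, OP_HALT };

void run(const uint8_t* code) {
  static void* handlers[] = { &&op_inc, &&op_print, &&op_halt };
  int acc = 0;
  const uint8_t* pc = code;
  #define DISPATCH() goto *handlers[*pc++]
  DISPATCH();
op_inc:
  ++acc;
  DISPATCH();
op_print:
  std::printf("%d\n", acc);
  DISPATCH();
op_halt:
  return;
  #undef DISPATCH
}

int main() {
  const uint8_t program[] = { OP_INC, OP_INC, OP_PRINT, OP_HALT };
  run(program);  // prints 2.
}
```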

JIT implementation strategies and methods

Common JIT strategies fall into several categories: Method-based JIT, Trace-based JIT, and Region-based JIT. A Method-based JIT uses the "function" as its independent unit of compilation: the compiler identifies hot functions while the code runs (for example, by counting how many times each function is called) and then replaces them with compiled machine-code versions, as sketched below. This approach is relatively simple to implement, but it has corresponding drawbacks: the JIT granularity is coarse, the hot-code hit rate is low, and time-consuming logic inside a function body (such as a loop) cannot be captured precisely.
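A minimal sketch of such a call-count-based tier-up decision, with every name and the threshold invented for illustration:

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical method-based tier-up: interpret a function until its call
// count crosses a threshold, then compile and cache a machine-code version.
using CompiledFn = void (*)();

struct MethodJIT {
  static constexpr uint64_t kHotThreshold = 1000;     // assumed tuning knob.
  std::unordered_map<uint32_t, uint64_t> callCount;   // funcId -> calls.
  std::unordered_map<uint32_t, CompiledFn> compiled;  // funcId -> code.

  void call(uint32_t funcId) {
    if (auto it = compiled.find(funcId); it != compiled.end()) {
      it->second();                       // hot path: run machine code.
    } else if (++callCount[funcId] >= kHotThreshold) {
      compiled[funcId] = compileToMachineCode(funcId);  // tier up.
      compiled[funcId]();
    } else {
      interpret(funcId);                  // cold path: keep interpreting.
    }
  }

  // Stubs standing in for the real interpreter and compiler.
  CompiledFn compileToMachineCode(uint32_t) { return +[] {}; }
  void interpret(uint32_t) {}
};
```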

By contrast, a Trace-based JIT uses a "trace" as its unit of hot-code compilation. A trace is a hot code path that the program follows at runtime; viewed from the source code, such a path may span multiple functions. What the JIT compiler does is compile and optimize the hot code along this path at runtime.

Finally, a Region-based JIT uses the "Tracelet" as its compilation unit. This scheme is mainly implemented by Facebook's HHVM virtual machine. A Tracelet usually refers to the longest execution path that can be "type-specialized". For more information, refer to the related paper (which the author does not fully understand, so I will not expand on it here).

Beyond these three common JIT implementation strategies, on the implementation side, rather than the "hand-rolled machine code" approach we used in this article, one usually chooses a compilation framework that provides better machine-code generation, such as DynASM, LLVM, or Cranelift. These frameworks usually offer not only basic, platform-specific machine-code emission but also corresponding code-optimization facilities. For example, with a Method-based JIT strategy, many optimizations applicable to static AOT compilation can usually also be used directly by the JIT compiler, and by using a framework such as LLVM we can simply reuse these very mature optimizations rather than reimplementing them.

Reference

  1. solarianprogrammer.com/2018/01/10/ ....
  2. solarianprogrammer.com/2018/01/12/ ....
  3. corsix.github.io/dynasm-doc .
  4. en.wikipedia.org/wiki/Just-i ....
  5. en.wikipedia.org/wiki/Tracin ....
  6. en.wikipedia.org/wiki/Brainf ....
  7. gcc.gnu.org/onlinedocs/ ....
  8. github.com/opensource-... .
  9. en.wikipedia.org/wiki/Ahead-... .
  10. prl.ccs.neu.edu/blog/2019/0 ....
  11. javapapers.com/core-java/j ....