CN116661807A - Binary translation code inline optimization method, binary translation code inline optimization device and storage medium - Google Patents


Info

Publication number
CN116661807A
CN116661807A (application CN202310549871.8A)
Authority
CN
China
Prior art keywords
code
function
optimization
target
helper function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310549871.8A
Other languages
Chinese (zh)
Inventor
林媛
谢汶兵
罗巧玲
黄隽祎
田雪
李欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Advanced Technology Research Institute
Original Assignee
Wuxi Advanced Technology Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Advanced Technology Research Institute filed Critical Wuxi Advanced Technology Research Institute
Priority to CN202310549871.8A priority Critical patent/CN116661807A/en
Publication of CN116661807A publication Critical patent/CN116661807A/en
Pending legal-status Critical Current


Classifications

    • G06F 8/52 Transformation of program code: Binary to binary
    • G06F 8/436 Compilation, checking and contextual analysis: Semantic checking
    • G06F 9/449 Object-oriented method invocation or resolution
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a binary translation code inline optimization method, a binary translation code inline optimization apparatus, and a storage medium. The method comprises the following steps: determining a target optimization helper function through binary translation performance bottleneck analysis, and performing semantic analysis on the target optimization helper function to obtain a semantic analysis result; based on the semantic analysis result, generating static binary code with the translator back-end translation module, the static binary code being functionally equivalent to the target optimization helper function; storing the static binary code in a code cache area to obtain an inline code segment; and optimizing the call procedure of the target optimization helper function. The invention realizes the function of the original helper function through replacement by the inline code segment, thereby significantly reducing the context switches introduced by function calls, reducing the damage to code locality, and effectively improving the performance of the dynamic binary translation system.

Description

Binary translation code inline optimization method, binary translation code inline optimization device and storage medium
Technical Field
The present invention relates to the field of computer technology, and in particular to a binary translation code inline optimization method, apparatus, and storage medium.
Background
Dynamic binary translation essentially converts a set of guest instructions into a set of native instructions while preserving the semantics of the guest instructions. However, substantial semantic differences inevitably exist between the instruction sets of different machine architectures. For guest instructions with extremely complex semantics and functionality, such as floating-point operations, SIMD, eflags emulation, and indirect-jump lookups, the Quick Emulator (QEMU) relies on helper functions written in the high-level C/C++ languages to implement the functionality of these complex instructions. Helper functions effectively reduce the difficulty of developing and maintaining a dynamic binary translation system, but they also introduce a large amount of performance overhead, making them a key bottleneck in binary translation performance.
The performance overhead of helper functions mainly derives from: (1) invoking a helper function requires executing additional parameter-passing instructions and function jump instructions; (2) because natively compiled helper functions and the translated binary code reside in different memory regions of the dynamic binary translation system's virtual address space, invoking a helper function inevitably damages instruction-cache locality; (3) some helper function calls introduce context-switch instructions, which further inflate the translated code.
helper_lookup_tb_ptr, one such helper function, is widely used for indirect-jump target-address lookup, where its context-switch instructions have a non-negligible impact on translation performance.
Because the target address of an indirect jump instruction (such as a return instruction) is not known until run time, QEMU calls helper_lookup_tb_ptr to perform the lookup, which effectively reduces switches of program control flow between the translation thread and the execution thread. However, invoking the function introduces context-switch instructions and damages code locality, further inflating the translated code and reducing the performance of the dynamic binary translation system.
Disclosure of Invention
The invention provides a binary translation code inline optimization method, apparatus, and storage medium to solve at least one of the above technical problems.
In a first aspect, the present invention provides a binary translation code inline optimization method, including:
determining a target optimization helper function according to binary translation performance bottleneck analysis, and carrying out semantic analysis on the target optimization helper function to obtain a semantic analysis result;
based on the semantic analysis result, a translator back-end translation module is adopted to generate a static binary code, and the static binary code is equivalent to the target optimization helper function in function;
storing the static binary code in a code cache area to obtain an inline code segment;
optimizing the calling process of the target optimization helper function.
In this technical scheme, the semantic analysis result of the target optimization helper function is translated into a binary inline code segment, the inline code segment is stored in the code cache (i.e., the codecache), and the call procedure of the target optimization helper function is optimized so that the inline code segment replaces and realizes the function of the original helper function. This significantly reduces the context switches introduced by function calls, reduces the damage to code locality, and effectively improves the performance of the dynamic binary translation system.
Optionally, the target optimization helper function is helper_lookup_tb_ptr.
Optionally, optimizing the call procedure of the target optimization helper function further comprises:
a jump instruction is added before the parameter-passing instruction and function jump instruction translated for the call of the target optimization helper function, wherein the destination address of the jump instruction is the starting address of the inline code segment in the code cache area.
Optionally, the method further comprises:
and adding an optimization flag bit for the selected target optimization helper function.
Optionally, the method further comprises:
determining, according to the optimization flag bit, whether a helper function is the target optimization helper function;
if it is the target optimization helper function, optimizing the call procedure of the target optimization helper function.
In a second aspect, the present invention also provides a binary translation code inline optimization apparatus, including:
the determining module is used for determining a target optimization helper function according to binary translation performance bottleneck analysis, and carrying out semantic analysis on the target optimization helper function to obtain a semantic analysis result;
the generation module is used for generating a static binary code by adopting a translator back-end translation module based on the semantic analysis result, wherein the static binary code is functionally equivalent to the target optimization helper function;
the saving module is used for saving the static binary code to the codecache to obtain an inline code segment;
and the optimizing module is used for optimizing the calling process of the target optimized helper function.
In a third aspect, the present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a binary translation code inline optimization method as described in the first aspect above.
Compared with the prior art, the invention has the beneficial effects that:
the invention generates static binary code from the semantic analysis result of the target optimization helper function, stores the static binary code in the code cache area to obtain an inline code segment, and optimizes the call procedure of the target optimization helper function so that the inline code segment replaces and realizes the function of the original helper function, thereby significantly reducing the context switches introduced by function calls, reducing the damage to code locality, and effectively improving the performance of the dynamic binary translation system.
Drawings
In order to more clearly illustrate the technical solutions of the invention or of the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below illustrate some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow diagram of a binary translation code inline optimization method of the present invention;
FIG. 2 is a functional schematic of the helper_lookup_tb_ptr function of the present invention;
FIG. 3 is a flow chart of the present invention's helper_lookup_tb_ptr function call;
FIG. 4 is a schematic diagram of the binary translation code inline optimization of the present invention;
FIG. 5 is a virtual address space distribution of QEMU in Shenwei platform according to the present invention;
FIG. 6 is a schematic diagram of a binary translation code inline optimizing apparatus of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
As shown in fig. 1, the binary translation code inline optimization method includes the steps of:
and 100, determining a target optimization helper function according to binary translation performance bottleneck analysis, and carrying out semantic analysis on the target optimization helper function to obtain a semantic analysis result.
And step 101, based on semantic analysis results, generating a static binary code by adopting a translator back-end translation module, wherein the static binary code is equivalent to the target optimization helper function.
And 102, storing the static binary codes into a code cache area to obtain an inline code segment.
Step 103, optimizing a calling process of the target optimization helper function.
In particular, QEMU is open-source binary translation software that relies on helper functions, written in a high-level programming language, to implement the functionality of complex instructions. The target optimization helper function is a helper function whose optimization yields a significant improvement in translator performance. Semantic analysis means that a developer converts the target optimization helper function, written in a high-level language (for example, C), into assembly code according to fixed rules; this assembly code is the semantic analysis result. The translator back-end translation module then generates static binary code from the semantic analysis result, and the static binary code is stored in the code cache area to obtain an inline code segment. The function realized by the inline code segment is equivalent to that realized by the target optimization helper function. The code cache area is the codecache in QEMU, a region of memory that QEMU uses to store the binary code generated by translation.
The process by which the translator back-end translation module generates static binary code from the semantic analysis result is illustrated below, taking the Shenwei platform as an example; this example does not limit the invention.
QEMU implements the target_code-to-host_code translation process based on the TCG module. For example:
target_code: mov rax, rcx
This is a mov instruction of the x86 architecture; its semantics are to save the value in register rcx into register rax.
The QEMU back-end translation module calls a back-end translation function according to the instruction semantics:
tcg_out_insn_simpleReg(s, OPC_BIS, rd, rn, TCG_REG_ZERO);
This function generates the host_code: mov r10, r11;
where r10 and r11 are Shenwei platform registers, corresponding to the x86 platform registers rax and rcx respectively. The host_code generated by the translator is stored in the codecache.
The call procedure of the target optimization helper function is then optimized so that the inline code segment replaces and realizes the function of the original helper function.
In this technical scheme, the semantic analysis result of the target optimization helper function is translated into a binary inline code segment and stored in the codecache, and the call procedure of the target optimization helper function is optimized so that the inline code segment replaces the original helper function, significantly reducing the context switches introduced by function calls, reducing the damage to code locality, and effectively improving the performance of the dynamic binary translation system.
Optionally, the target optimization helper function is helper_lookup_tb_ptr.
Specifically, helper_lookup_tb_ptr is a special case among helper functions: it does not implement the function of a guest instruction, but is called whenever an indirect jump instruction to be translated is encountered (an indirect jump instruction is a jump with more than one possible target address, such as an x86 return instruction), and is used to find and determine the final jump target, that is, to determine one address among several candidate target addresses. Indirect jump instructions are widely used in guest applications, so QEMU invokes helper_lookup_tb_ptr frequently when binary-translating them; its context-switch instructions therefore have a non-negligible impact on translation performance.
As shown in FIG. 2, QEMU uses the helper_lookup_tb_ptr function to split the cache lookup into two steps: fast lookup and slow lookup. The fast lookup queries a small cache array that holds frequently executed translation hot code, generating an array index from the source PC value during the query. If the fast lookup hits, the next translation block is executed; if it misses, the slow lookup stage is entered. The slow lookup retrieves, via a hash table with a complex structure, the information of all translation blocks stored in the codecache, and this query takes more time. If the slow lookup hits, the cache array used by the fast lookup is updated with the query result, and the next translation block is then executed. If the slow lookup also misses, control returns to the translation thread, indicating that the block has not yet been translated, and the control flow switches to the translator to continue translation.
Research has found that in most applications the hit rate of the fast lookup exceeds 99%.
Therefore, selecting helper_lookup_tb_ptr as the target optimization helper function can greatly reduce context-switch instructions and reduce program overhead.
Optionally, optimizing the call procedure of the target optimization helper function further comprises:
a jump instruction is added before the parameter-passing instruction and function jump instruction translated for the call of the target optimization helper function, wherein the destination address of the jump instruction is the starting address of the inline code segment in the code cache area.
Specifically, after the target optimization helper function is determined, its call procedure is improved: for each call of the target optimization helper function, the translator back-end translation module emits a jump instruction before translating the parameter-passing instruction and the function jump instruction. The destination address of the jump instruction is the starting address of the inline code segment in the codecache, so that the inline code segment replaces the function of the original helper function.
For example, FIG. 3 and FIG. 4 illustrate the optimization of the call procedure, taking helper_lookup_tb_ptr as the target optimization helper function and the Shenwei platform as an example; the example does not limit the protection scope of the present invention.
As shown in FIG. 3, the translation process for helper_lookup_tb_ptr is as follows:
In the execution thread, QEMU jumps via a function jump instruction to the natively compiled helper_lookup_tb_ptr function to look up the entry address of the next translation block. As can be seen from FIG. 3, for the indirect jump instruction of the guest x86_64 platform:
jmp *%rax
the target jump address is stored in register RAX, and the actual jump address can be determined only at run time. In the x86-to-sw64 translation, the translator translates the semantics of the jmp instruction: it first stores the address in the virtual register of RAX, i.e., 0($r9), into env->eip, i.e., 128($r9); it then translates the parameter-passing instruction and the function jump instruction and calls the helper_lookup_tb_ptr function. The main task of this function is to find the entry address of the next translation block from env->eip, i.e., the source PC value.
As shown in FIG. 4, the translator translates the fast-lookup function of the helper_lookup_tb_ptr function into a static binary code segment in advance and stores it in the codecache to obtain the inline code segment. Compared with FIG. 3, the invention adds a jump instruction before translating the parameter-passing instruction and the function jump instruction for the call of the helper_lookup_tb_ptr function. The target address of this instruction is tb_jmp_cache_addr, the starting address in the codecache of the pre-translated static binary code of the fast-lookup function. When the target address of an indirect jump instruction needs to be looked up, the fast lookup in the inline code segment is invoked first; if the lookup misses, the original helper_lookup_tb_ptr function is then called.
Because the jump instruction is executed before the parameter-passing instructions, the inline code segment requires no frequent context switching, which greatly reduces binary translation performance overhead.
Optionally, the method further comprises:
and adding an optimization flag bit for the selected target optimization helper function.
Specifically, the QEMU translator back end implements the call translation of helper functions in the tcg_reg_alloc_call function, and the call-translation flow is the same for all helper functions. To distinguish helper_lookup_tb_ptr from other helper functions, when the translator front end generates intermediate code, an optimization flag bit is added to the intermediate code that calls the helper_lookup_tb_ptr function; during back-end translation, whether the intermediate code of a helper call carries the optimization flag bit is checked to confirm the target.
Optionally, the method further comprises:
determining, according to the optimization flag bit, whether a helper function is the target optimization helper function;
if it is the target optimization helper function, optimizing the call procedure of the target optimization helper function.
Specifically, in the translator front-end translation, an optimization flag bit is added for the helper_lookup_tb_ptr function. In the translator back-end translation, whether a helper function is the target optimization helper function is judged according to the optimization flag bit. If it is, a jump instruction is added before the parameter-passing and function jump instructions translated for the call of the helper function; the target address of this instruction is the starting address, in the codecache, of the static binary code generated by the translator back-end translation module, so that the inline code segment replaces and realizes the function of the original helper function.
Selecting for optimization the functions with the greatest influence on translation-system performance can reduce the performance overhead of the translation system to a large extent.
The methods provided by the above embodiments of the present invention are illustrated by the following specific example, which is divided into the following steps according to FIG. 4:
S10: determine the memory location where the inline code segment is saved, and record the starting address of that memory region.
This embodiment saves the inline code segment in the codecache. A global variable tb_jmp_cache_addr is defined to record the starting address of the inline code in the codecache.
S20: before a call translation parameter transfer function and a function jump instruction for the upper_lookup_tb_ptr function, a jump instruction with an entry address of tb_jmp_cache_addr is added.
In this embodiment, the fast seek part of the helper_look up_tb_ptr function is inlined.
The fast-lookup algorithm in helper_lookup_tb_ptr is shown in Algorithm 1 below, where hash is the cache-array index derived from the source PC value, cpu->tb_jmp_cache is the cache array holding frequently executed translation hot code, and tb is the next translation block.
Semantic analysis is performed on the fast-lookup function of the helper_lookup_tb_ptr function to obtain a semantic analysis result, which is translated in advance into a static binary code segment by the translator back-end translation module and stored in the codecache, yielding the inline code segment. The helper_lookup_tb_ptr call procedure is then optimized by adding, before the function's parameter-passing instruction, a jump instruction whose entry address is tb_jmp_cache_addr, wherein:
s21: an optimization flag bit is added for the helper_lookup_tb_ptr function.
To distinguish helper_lookup_tb_ptr from other helper functions, when the translator front end generates intermediate code, an optimization flag bit is added to the intermediate code that calls the helper_lookup_tb_ptr function; during back-end translation, whether the intermediate code of a helper call carries the optimization flag bit is checked to confirm the target.
S22: a jump instruction is added with an entry address of tb jmp _ cache _ addr.
A jmptb jmp cache addr is added before the translation calls the helper _ lookup _ tb _ ptr function. When executing, the method jumps to the tb_jmp_cache_addr position in the codemechanism, namely executes the inline code segment in the embodiment, and performs quick search based on the PC. If the search of the inline code segment is not hit, the original helper_lookup_tb_ptr function is continuously called through return instruction, and the subsequent slow search is completed.
S30: based on the back end of the translator, the quick search function of the helper_lookup_tb_ptr is translated into an inline code segment and stored in the codemechanism.
Based on the translation function at the rear end of the translator, the functions of acquiring the cache array index, looking up the table, judging whether the lookup hits or not and the like in the quick lookup algorithm are effectively realized. If the search hits, the next translation block is directly jumped and executed, and the context switching instruction is effectively reduced. If the search is not hit, the original helper function is executed. For most application programs, the hit rate of fast search during translation is up to more than 99%, and in most cases, the cost of context switching caused by the original helper function call can be avoided.
S40: and deleting redundant codes related to quick searching in the helper_lookup_tb_ptr, and avoiding repeated searching.
The quick search function is realized in the inline code segment, so that the helper function is simplified to avoid repeated search, and only the realization of the slowly searched helper function and the related array updating function are reserved.
As shown in FIG. 5, the codecache is used to save the native binary code generated by translation. The helper functions are compiled together with the QEMU translator's source code and therefore reside in QEMU's code segment, while the inline code segment is saved in the codecache.
The binary translation code inline optimization apparatus provided by the invention is described below; the apparatus described below and the binary translation code inline optimization method described above may be referred to in correspondence with each other.
As shown in fig. 6, the apparatus includes:
the determining module 600 is configured to determine a target optimized helper function according to binary translation performance bottleneck analysis, and perform semantic analysis on the target optimized helper function to obtain a semantic analysis result;
the generating module 610 is configured to generate a static binary code by using a translator back-end translation module based on the semantic analysis result, where the static binary code is functionally equivalent to the target optimization helper function;
a saving module 620, configured to save the static binary code to the codecache to obtain an inline code segment;
an optimizing module 630, configured to optimize a calling procedure of the target optimization helper function.
Optionally, the target optimization helper function is helper_lookup_tb_ptr.
Optionally, optimizing the call procedure of the target optimization helper function further comprises:
a jump instruction is added before the parameter-passing instruction and function jump instruction translated for the call of the target optimization helper function, wherein the destination address of the jump instruction is the starting address of the inline code segment in the code cache area.
Optionally, the operations further comprise:
an optimization flag bit is added for the selected target optimization helper function.
Optionally, the operations further comprise:
determining, according to the optimization flag bit, whether a helper function is the target optimization helper function;
if it is the target optimization helper function, optimizing the call procedure of the target optimization helper function.
In yet another aspect, the present invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the binary translation code inline optimization method provided by the methods described above.
The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by hardware. Based on this understanding, the technical solution above, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) that includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A binary translation code inline optimization method, comprising:
determining a target optimization helper function according to binary translation performance bottleneck analysis, and carrying out semantic analysis on the target optimization helper function to obtain a semantic analysis result;
generating a static binary code by a translator back-end translation module based on the semantic analysis result, wherein the static binary code is functionally equivalent to the target optimization helper function;
storing the static binary code into a code cache area to obtain an inline code segment;
optimizing the calling process of the target optimization helper function.
2. The binary translation code inline optimization method of claim 1, wherein the target optimization helper function is helper_lookup_tb_ptr.
3. The binary translation code inline optimization method of claim 2, wherein optimizing the calling process of the target optimization helper function further comprises:
adding a jump instruction before the parameter-passing instruction and the function jump instruction that call the target optimization helper function, wherein the destination address of the jump instruction is the starting address of the inline code segment in the code cache area.
4. The binary translation code inline optimization method of claim 1, further comprising:
adding an optimization flag bit to the selected target optimization helper function.
5. The binary translation code inline optimization method of claim 4, further comprising:
determining, according to the optimization flag bit, whether a helper function is the target optimization helper function;
if it is the target optimization helper function, optimizing the calling process of the target optimization helper function.
6. A binary translation code inline optimization apparatus, comprising:
a determining module, configured to determine a target optimization helper function according to binary translation performance bottleneck analysis, and to perform semantic analysis on the target optimization helper function to obtain a semantic analysis result;
a generating module, configured to generate a static binary code by a translator back-end translation module based on the semantic analysis result, wherein the static binary code is functionally equivalent to the target optimization helper function;
a saving module, configured to save the static binary code to the code cache area to obtain an inline code segment;
an optimizing module, configured to optimize the calling process of the target optimization helper function.
7. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the binary translation code inline optimization method of any one of claims 1 to 5.
CN202310549871.8A 2023-05-16 2023-05-16 Binary translation code inline optimization method, binary translation code inline optimization device and storage medium Pending CN116661807A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310549871.8A CN116661807A (en) 2023-05-16 2023-05-16 Binary translation code inline optimization method, binary translation code inline optimization device and storage medium

Publications (1)

Publication Number Publication Date
CN116661807A true CN116661807A (en) 2023-08-29

Family

ID=87718228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310549871.8A Pending CN116661807A (en) 2023-05-16 2023-05-16 Binary translation code inline optimization method, binary translation code inline optimization device and storage medium

Country Status (1)

Country Link
CN (1) CN116661807A (en)

Similar Documents

Publication Publication Date Title
US7757221B2 (en) Apparatus and method for dynamic binary translator to support precise exceptions with minimal optimization constraints
US6820255B2 (en) Method for fast execution of translated binary code utilizing database cache for low-level code correspondence
US6813705B2 (en) Memory disambiguation scheme for partially redundant load removal
US8627051B2 (en) Dynamically rewriting branch instructions to directly target an instruction cache location
JP5419325B2 (en) Method and apparatus for shared code caching for translating program code
US9600253B2 (en) Arranging binary code based on call graph partitioning
US8713548B2 (en) Rewriting branch instructions using branch stubs
US6185669B1 (en) System for fetching mapped branch target instructions of optimized code placed into a trace memory
US8782381B2 (en) Dynamically rewriting branch instructions in response to cache line eviction
US10534612B2 (en) Hybrid polymorphic inline cache and branch target buffer prediction units for indirect branch prediction for emulation environments
JP3745819B2 (en) Method and apparatus for implementing non-folding load instructions
JP2009524168A (en) Efficient memory hierarchy management
KR101247259B1 (en) Virtualization apparatus and its processing method
JP5261503B2 (en) RISC processor device and instruction address conversion search method
CN101299192A (en) Non-aligning access and storage processing method
CN111625279A (en) Dynamic and static fusion binary translation method and system based on dynamic link library
US9098355B2 (en) Method and apparatus for substituting compiler built-in helper functions with machine instructions
JPH10133884A (en) Method for executing programming code including conjectural code
CN116661807A (en) Binary translation code inline optimization method, binary translation code inline optimization device and storage medium
US7353163B2 (en) Exception handling method and apparatus for use in program code conversion
US20240134666A1 (en) Hybrid just in time load module compiler with performance optimizations
CA3209061A1 (en) Hybrid just in time load module compiler with performance optimizations
CN117234526A (en) Speculation-based x86 flag bit calculation dynamic binary translation method and system
CN114995820A (en) Code translation method, device, equipment, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination