CN114003868A

CN114003868A - Method for processing software code and electronic equipment

Info

Publication number: CN114003868A
Application number: CN202111300044.2A
Authority: CN
Inventors: 王喆; 武成岗; 张培华; 曾凯
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2021-11-04
Filing date: 2021-11-04
Publication date: 2022-02-01

Abstract

Embodiments of the invention provide a method for processing software code and an electronic device. The method comprises: acquiring the software code to be processed and constructing a control flow graph of the software code; and, according to the control flow graph and preset processing logic, splitting and/or aggregating selected functions in the software code to change the original structure of the control flow graph, wherein splitting divides one function into a plurality of functions and converts the control flow relationships within the function into call relationships among those functions, and aggregating merges a plurality of functions into one function. Unlike existing code obfuscation techniques, this scheme does not add a large number of irrelevant instructions and therefore does not seriously degrade program performance; it nevertheless changes the control flow graph and the call relationships of the original software code (which is equivalent to changing the function call graph), so that code comparison techniques based on control flow graph and/or function call graph analysis can be effectively resisted while program performance is preserved.

Description

Method for processing software code and electronic equipment

Technical Field

The present invention relates to the field of software security, in particular to countering code comparison techniques and reverse analysis, and more particularly to a method and an electronic device for processing software code.

Background

With the rapid development of computer technology, software has become part of many aspects of people's lives. Software brings convenience, but it also brings potential security risks. After a software developer releases software, different users care about different things. An ordinary user usually pays attention only to its functions and performance, whereas a malicious user (an attacker) threatens computer security and user privacy by exploiting software vulnerabilities; once a vulnerability is exploited, huge economic losses can result. In addition, the core algorithms and code logic of commercial software, as a form of intellectual property, are often analyzed by reverse analysts and even reused illegally. The importance of software security and intellectual property protection is therefore self-evident.

When the source code of the software is not directly available, the function and behavior of the software are often understood through reverse engineering. Reverse engineering is an important means for reverse analysts to find software bugs and analyze core algorithms, and is a tedious and time-consuming process. The success of the reverse analysis depends on the experience of the reverse analyst and expert knowledge, and the difficulty of the reverse analysis increases as the scale of the software to be reversed increases.

Code comparison techniques can quickly find similarities or differences in disassembled code. Security researchers and engineers may use them to analyze the bug-fix code in patches, or to analyze multiple versions of an executable program and reuse earlier results so that the same or similar executables are not analyzed twice. However, these techniques also give reverse analysts a shortcut for reverse engineering. For example, given a binary file and a repository of already analyzed and annotated code, a reverse analyst can speed up the analysis by applying code reuse detection to the binary file to identify the same or similar code in the repository, and then focus only on the new functionality or components of the binary file.

Existing code comparison techniques fall mainly into two types. The first is based on basic information about functions, such as text comparison, token comparison, metric comparison, abstract syntax trees, call graphs and control flow graphs, execution sequences, and the like; among these, Google BinDiff is a widely used binary comparison tool based on graph analysis. The second applies machine learning to code comparison, for example by converting the decompiled code sequence into word vectors and analyzing code similarity with natural language processing techniques.

To counter code comparison techniques and protect software security and software intellectual property, existing countermeasures fall mainly into two categories. The first is code encryption, also called code packing, which encrypts the code and data in a program to defeat analysis and comparison tools. Specifically, code encryption encrypts the code and data of an executable file and inserts the corresponding decryption logic, which is executed first after the program is loaded in order to decrypt the code and data. Common packers, such as UPX (UPX Shell), encrypt and compress the executable file and provide an unpacking routine that decompresses and decrypts it at runtime. This packing approach has several disadvantages when used against code comparison techniques based on graph analysis. First, it is a stand-alone tool: the user must run the released executable file through the UPX packer before obtaining the protected file, which degrades the user experience. Second, unpacking methods for the various packing techniques are by now well established; for example, the IDA Pro disassembler offers a variety of plug-ins for unpacking files produced by different packers. Even if unpacking fails, the code is decrypted once it has been loaded into memory, so if the whole memory image is dumped at that moment with a memory dump tool, the code is in plaintext form and can still be analyzed with a code comparison tool.

Another common technique is code obfuscation. Take the open-source tool O-LLVM as an example: it is implemented on the LLVM framework and can obfuscate the code of a target program by different means to make reverse analysis harder. O-LLVM offers three obfuscation modes: instruction substitution, bogus control flow, and control flow flattening. Instruction substitution replaces a simple instruction (such as an addition, a subtraction, or a Boolean operator) with a functionally equivalent but more complex instruction sequence. Bogus control flow complicates the control flow graph without changing the execution logic: it adds opaque predicates and conditional jumps that guarantee the true branch is executed, and fills the false branch with randomly selected junk instructions. Control flow flattening moves all basic blocks of the code into a switch statement and converts the original jumps into data-driven jumps. O-LLVM can resist analysis of the control flow graph and the instruction sequence, but it has two drawbacks: it introduces a large number of redundant instructions, so that the runtime overhead of the generated executable is very high, reaching about 25 times; and it does not change the call relationships between functions, so it offers essentially no resistance to function call graph analysis.

It follows that existing code obfuscation techniques have a large impact on program performance and still struggle to counter existing analysis techniques.

Disclosure of Invention

It is therefore an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a method and an electronic device for processing software code.

The purpose of the invention is realized by the following technical scheme:

according to a first aspect of the invention, there is provided a method of processing software code, comprising: acquiring a software code to be processed, and constructing a control flow graph of the software code; according to the control flow graph and preset processing logic, splitting and/or aggregating selected functions in the software codes to change the original structure of the control flow graph, wherein the splitting is to split one function into a plurality of functions and convert the control flow relationship in the function into a call relationship among the functions, and the aggregating is to aggregate a plurality of functions into one function. It should be understood that the specific functions selected in the splitting process and the aggregation process are not required to be exactly the same.

In some embodiments of the present invention, the corresponding function in the software code is split first and then aggregated. It should be understood that when the aggregation processing is performed later, the object related to the corresponding function becomes substantially different from the object related to the corresponding function before the splitting processing due to the splitting processing.

In some embodiments of the invention, the splitting process comprises: splitting an original function into a plurality of functions, wherein one function retains the original function name and each of the other functions is assigned a unique function name; and, based on the retained original function name and the unique function names assigned to the other functions, changing the control flow relationships that existed before splitting into call relationships among the functions.

In some embodiments of the invention, the splitting process comprises: for temporary variables and/or global variables that are undefined in the other split-off functions, adding a caller function to the split functions, through which the required temporary variables and/or global variables are passed at execution time from the function that retains the original function name and corresponds to those split-off functions; and redefining scalars in the other split-off functions according to the values recorded in the corresponding original function.

In some embodiments of the invention, the splitting process comprises: during splitting, obtaining a user-specified splitting granularity so that splitting is performed at the granularity of dominator subtrees, loops, branches, basic blocks, instructions, or a combination thereof.

In some embodiments of the invention, the splitting process comprises: and randomly rearranging the positions of the split functions in the software codes.

In some embodiments of the invention, the aggregation process comprises: collecting the aggregation reference information in the current software code, wherein the aggregation reference information comprises the calling relation among functions and whether the functions are related to recursion or not; and according to the aggregation reference information, performing aggregation processing on the functions in the current software code under the condition of ignoring the functions related to recursion.
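The exclusion of recursion-related functions can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that "related to recursion" means "lies on a cycle of the call graph", and the call graph, the function names, and the helper `recursive_functions` are all hypothetical.

```python
def recursive_functions(call_graph):
    """Return the functions that can reach themselves through calls."""
    def reachable(start):
        seen, stack = set(), list(call_graph.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(call_graph.get(node, ()))
        return seen

    return {f for f in call_graph if f in reachable(f)}


# Hypothetical call graph: caller -> list of callees.
call_graph = {
    "main": ["parse", "fact"],
    "parse": ["emit"],
    "emit": [],
    "fact": ["fact"],   # direct recursion
    "even": ["odd"],
    "odd": ["even"],    # mutual recursion
}

recursive = recursive_functions(call_graph)
# Only functions that are not on a call cycle remain aggregation candidates.
candidates = sorted(f for f in call_graph if f not in recursive)
```

A stricter reading could also exclude functions that merely call a recursive function (such as `main` here); the source does not settle this point.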

In some embodiments of the invention, the aggregation reference information further comprises whether a function is related to a loop, and the aggregation process comprises: analyzing the heat of each function using a hot code analysis technique, and aggregating the loop-related functions whose heat is below a first threshold and the loop-unrelated functions whose heat is below a second threshold, wherein the first threshold is less than or equal to the second threshold.
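The two-threshold selection can be illustrated with a short sketch. The heat values, the function names, and the record layout (`heat`, `in_loop`) are assumptions made for illustration; how heat is actually measured is left to the hot code analysis.

```python
def select_for_aggregation(functions, loop_threshold, other_threshold):
    """Pick functions cold enough to aggregate.

    Loop-related functions are compared against the first (stricter)
    threshold, all other functions against the second one.
    """
    if loop_threshold > other_threshold:
        raise ValueError("first threshold must not exceed the second")
    selected = []
    for name, info in functions.items():
        threshold = loop_threshold if info["in_loop"] else other_threshold
        if info["heat"] < threshold:
            selected.append(name)
    return selected


# Hypothetical profile data: heat is an execution count.
functions = {
    "log_init": {"heat": 2, "in_loop": False},
    "checksum": {"heat": 50, "in_loop": True},
    "fmt": {"heat": 8, "in_loop": True},
    "usage": {"heat": 15, "in_loop": False},
}

selected = select_for_aggregation(functions, loop_threshold=10, other_threshold=20)
# selected == ["log_init", "fmt", "usage"]
```

Using a lower threshold for loop-related functions keeps code that runs inside loops out of the aggregation unless it is very cold, which matches the constraint that the first threshold is less than or equal to the second.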

In some embodiments of the invention, the aggregation process comprises: aggregating at least two functions into an aggregation function, assigning a unique function name to the aggregation function and assigning unique branch labels in the aggregation function to at least two functions before aggregation respectively; and modifying the calling of the function before aggregation in the calling function which needs to directly call a certain branch in the aggregation function into the calling of the aggregation function and the branch thereof.

In some embodiments of the invention, the aggregation process comprises: aggregating at least two functions into an aggregation function, assigning a unique function name to the aggregation function, and assigning each of the at least two pre-aggregation functions a unique branch label within the aggregation function; performing byte alignment on all functions so that some bits of each function pointer are left unused; inserting a preset assignment function into each caller that needs to call a function indirectly, so as to add call control information to the caller's function pointer, wherein the call control information comprises the aggregation function and the branch label within it, and the branch label indicates whether the called function is an aggregation function; and inserting, at all indirect call sites, parsing code that decodes the call control information, wherein, when the parsing code determines that an aggregation function is to be called indirectly, it passes the parameters required for the computation and the branch label recorded in the pointer to the called aggregation function.
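The idea of storing call control information in the unused bits of an aligned function pointer can be simulated with plain integers. The sketch below is only a model: the 16-byte alignment, the convention that label 0 means "not an aggregation function", the simulated addresses, and all function names are assumptions, and real pointers would be handled by the compiler or binary rewriter rather than by a lookup table.

```python
ALIGNMENT = 16             # assumed function alignment in bytes
TAG_MASK = ALIGNMENT - 1   # the low 4 bits of an aligned address are free


def tag_pointer(address, branch_label):
    """Model of the inserted assignment function: pack a label into the pointer."""
    if address & TAG_MASK:
        raise ValueError("address is not aligned")
    if not 0 <= branch_label <= TAG_MASK:
        raise ValueError("branch label does not fit in the free bits")
    return address | branch_label


def untag_pointer(tagged):
    """Model of the parsing code: recover (address, branch_label)."""
    return tagged & ~TAG_MASK, tagged & TAG_MASK


def indirect_call(tagged, code_table, *args):
    """Model of an instrumented indirect call site."""
    address, label = untag_pointer(tagged)
    target = code_table[address]
    if label == 0:                 # assumed: label 0 marks an ordinary function
        return target(*args)
    return target(label, *args)    # aggregation function: pass the branch label


def merged(label, x):
    # Hypothetical aggregation function with two branches.
    if label == 1:
        return x + 1
    if label == 2:
        return x * 2
    raise ValueError("unknown branch label")


def plain(x):
    return -x


# Simulated address space: aligned address -> function body.
code_table = {0x1000: merged, 0x1010: plain}

doubled = indirect_call(tag_pointer(0x1000, 2), code_table, 5)   # 10
negated = indirect_call(tag_pointer(0x1010, 0), code_table, 5)   # -5
```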

According to a second aspect of the present invention, there is provided an electronic apparatus comprising: one or more processors; and a memory, wherein the memory is to store executable instructions; the one or more processors are configured to implement the steps of the method of the first aspect via execution of the executable instructions.

Compared with the prior art, the invention has the advantages that:

The invention provides a method for processing software code, comprising: acquiring the software code to be processed and constructing a control flow graph of the software code; and, according to the control flow graph and preset processing logic, splitting and/or aggregating selected functions in the software code to change the original structure of the control flow graph, wherein splitting divides one function into a plurality of functions and converts the control flow relationships within the function into call relationships among those functions, and aggregating merges a plurality of functions into one function. Unlike existing code obfuscation techniques, this scheme does not add a large number of irrelevant instructions and therefore does not seriously degrade program performance; it nevertheless changes the control flow graph and the call relationships of the original software code (which is equivalent to changing the function call graph), so that code comparison techniques based on control flow graph and/or function call graph analysis can be effectively resisted while program performance is preserved.

Drawings

Embodiments of the invention are further described below with reference to the accompanying drawings, in which:

FIG. 1 is a schematic flow chart of a method of processing software code according to an embodiment of the invention;

FIG. 2 is a schematic diagram of a method for processing software code before and after a splitting process according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the method of processing software code before and after aggregation processing according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of an indirect call made by a method of processing software code according to an embodiment of the invention;

FIG. 5 is a schematic diagram of the effect of inserting opaque predicates according to a method of processing software code of an embodiment of the present invention;

FIG. 6 is a diagram illustrating the effect of combining two alternative ways of deep obfuscation in a method for processing software code according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a system for processing software code implementing a split process according to an embodiment of the invention;

FIG. 8 is a schematic diagram of a system for processing software code implementing an aggregation process according to an embodiment of the invention;

FIG. 9 is a schematic diagram of a system for processing software code according to an embodiment of the present invention, where the system implements a splitting process followed by an aggregation process.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As mentioned in the background section, code comparison techniques are currently countered mainly by code encryption and code obfuscation. Code encryption is easy to defeat because a variety of mature decryption methods and/or tools exist. Code obfuscation works through instruction substitution, bogus control flow, and control flow flattening; although it can resist analysis of the control flow graph and the instruction sequence, it introduces a large number of redundant instructions, so that the runtime overhead of the generated executable is very high, and it does not change the call relationships between functions, so that it offers essentially no resistance to function call graph analysis. Therefore, referring to FIG. 1, the present invention provides a method for processing software code, comprising: acquiring the software code to be processed and constructing a control flow graph of the software code; and, according to the control flow graph and preset processing logic, splitting and/or aggregating selected functions in the software code to change the original structure of the control flow graph, wherein splitting divides one function into a plurality of functions and converts the control flow relationships within the function into call relationships among those functions, and aggregating merges a plurality of functions into one function. Unlike existing code obfuscation techniques, this scheme does not add a large number of irrelevant instructions and therefore does not seriously degrade program performance; it nevertheless changes the control flow graph and the call relationships of the original software code (which is equivalent to changing the function call graph), so that code comparison techniques based on control flow graph and/or function call graph analysis can be effectively resisted while program performance is preserved.

Before describing embodiments of the present invention in detail, some of the terms used therein will be explained as follows:

A basic block is a sequence of statements formed by one or more instructions that execute in order. Control flow can enter a basic block only at its first statement and leave it only at its last statement, with no halting or branching in between.

A control flow graph (CFG) represents the possible execution flow among all basic blocks within a piece of software code (a program) or a procedure. It can reflect how a program or procedure executes at run time. A control flow graph is an abstract representation of a program or procedure: an abstract data structure used in compilers and maintained internally by the compiler, representing all paths that the program may traverse during execution.
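To make the two definitions above concrete, the following sketch partitions a toy instruction list into basic blocks and derives the control flow edges between them using the classic "leader" rule. The instruction format (an opcode plus an optional jump target index) and the opcode names `jmp`, `br`, and `ret` are assumptions made for illustration only.

```python
def split_basic_blocks(instructions):
    """Partition instructions into basic blocks and compute CFG edges.

    instructions: list of (opcode, target_index_or_None).
    Returns (blocks, edges); a block is a (start, end) index range with an
    exclusive end, and an edge connects block start indices.
    """
    leaders = {0}                                # the first instruction starts a block
    for i, (op, target) in enumerate(instructions):
        if op in ("jmp", "br"):
            leaders.add(target)                  # a jump target starts a block
            if i + 1 < len(instructions):
                leaders.add(i + 1)               # so does the instruction after a jump
    starts = sorted(leaders)
    blocks = list(zip(starts, starts[1:] + [len(instructions)]))

    edges = set()
    for start, end in blocks:
        op, target = instructions[end - 1]       # only the last instruction may branch
        if op in ("jmp", "br"):
            edges.add((start, target))
        if op not in ("jmp", "ret") and end < len(instructions):
            edges.add((start, end))              # fall through to the next block
    return blocks, edges


# Hypothetical program: a counting loop.
program = [
    ("set", None),   # 0: i = 0
    ("cmp", None),   # 1: loop header, test i < n
    ("br", 5),       # 2: leave the loop when the test fails
    ("add", None),   # 3: loop body
    ("jmp", 1),      # 4: back edge to the header
    ("ret", None),   # 5: exit
]

blocks, edges = split_basic_blocks(program)
# blocks == [(0, 1), (1, 3), (3, 5), (5, 6)]
# edges  == {(0, 1), (1, 3), (1, 5), (3, 1)}
```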

An intra-function control flow refers to the logical order of execution of instructions within a function in software code.

A function Call Graph (CG) is a directed Graph showing the Call relationship between functions in software code.

A loop refers to a loop structure, that is, a logical structure in software code for repeatedly executing a certain piece of functionality. Count-controlled loops and condition-controlled loops are common.

A branch refers to a branch structure.

A dominator subtree is a subtree of a dominator tree. For example, a subtree extracted from the dominator tree of a function is a dominator subtree of that tree. A dominator tree is a tree composed of basic blocks and the dominance relationships between them.
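As an illustration of these terms, the sketch below computes dominator sets with the standard iterative data-flow algorithm, derives each block's immediate dominator (its parent in the tree), and extracts one subtree. The small CFG and its block names are hypothetical, and the code assumes every block is reachable from the entry block.

```python
def dominators(cfg, entry):
    """Iteratively compute dom[n], the set of blocks that dominate n."""
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def immediate_dominators(dom, entry):
    """Tree parent of each block: its closest strict dominator."""
    idom = {}
    for n, ds in dom.items():
        if n != entry:
            # Strict dominators form a chain; the closest one has the largest set.
            idom[n] = max(ds - {n}, key=lambda d: len(dom[d]))
    return idom


def subtree(idom, root):
    """All blocks in the tree rooted at `root`."""
    members = {root}
    grew = True
    while grew:
        grew = False
        for child, parent in idom.items():
            if parent in members and child not in members:
                members.add(child)
                grew = True
    return members


# Hypothetical CFG of an if/else followed by a join block.
cfg = {
    "entry": ["cond"],
    "cond": ["then", "else"],
    "then": ["join"],
    "else": ["join"],
    "join": ["exit"],
    "exit": [],
}

idom = immediate_dominators(dominators(cfg, "entry"), "entry")
# idom == {"cond": "entry", "then": "cond", "else": "cond",
#          "join": "cond", "exit": "join"}
join_subtree = subtree(idom, "join")   # {"join", "exit"}
```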

The splitting process is to split a function into a plurality of functions and convert the control flow relationship in the function into the call relationship between the functions. In other words, the splitting process is to split a function into a plurality of functions and convert the intra-function control flow relationship into the inter-function call relationship.

The aggregation processing is to aggregate a plurality of functions into one function and convert the call relation between the functions into the control flow relation in the function.
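A minimal sketch of aggregation follows, under the assumption that each pre-aggregation function becomes one labeled branch of the merged function and that direct call sites are rewritten to pass the label. The functions `area`, `perimeter`, and `shape_agg` and the label values are hypothetical.

```python
# Before aggregation: two independent functions and a caller.
def area(w, h):
    return w * h


def perimeter(w, h):
    return 2 * (w + h)


def report_before(w, h):
    return area(w, h), perimeter(w, h)


# After aggregation: one function whose branches are selected by a label.
AREA, PERIMETER = 0, 1   # hypothetical branch labels


def shape_agg(branch, w, h):
    if branch == AREA:
        return w * h
    if branch == PERIMETER:
        return 2 * (w + h)
    raise ValueError("unknown branch label")


def report_after(w, h):
    # Each original call becomes a call to the aggregation function plus its branch.
    return shape_agg(AREA, w, h), shape_agg(PERIMETER, w, h)
```

The call graph shrinks from two callees to one, while the control flow graph of `shape_agg` gains a dispatch on the branch label; the program's results are unchanged.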

In the course of research on code obfuscation, the applicant found that existing code obfuscation techniques have a large impact on program performance, still struggle to resist existing analysis techniques, and therefore do little to protect the intellectual property of program publishers. The applicant therefore conceived a method for processing software code that splits functions in the software code into a plurality of functions through a splitting process and then aggregates the corresponding functions through an aggregation process, thereby changing the structure of both the control flow graph and the function call graph; this resists existing analysis techniques better without reducing program performance much. Each time a program publisher changes the source code of the software (that is, the software code to be processed), the publisher can process the source code with this method before releasing the corresponding program (the processed software code and machine code), so that the control flow graph and function call graph of every released program are obfuscated relative to the source code, which makes analysis harder and better protects the publisher's intellectual property. Alternatively, an implementer may apply only the splitting process, or only the aggregation process, to the software code before releasing the program; this also changes the control flow within functions and the call graph between functions and achieves a certain obfuscation strength.

The following description will be made in order of technical embodiments corresponding to the splitting process, the aggregation process, and a combination of both processes.

According to one embodiment of the invention, obfuscation can be achieved by splitting certain functions in the software code. Preferably, the present invention provides a method of processing software code, comprising: acquiring the software code to be processed and constructing a control flow graph of the software code; and, according to the control flow graph and preset processing logic, splitting selected functions in the software code to change the original structure of the control flow graph, wherein splitting divides one function into a plurality of functions and converts the control flow relationships within the function into call relationships between the functions. The splitting process comprises splitting a function and adding call relationships among the resulting functions to reconstruct data dependencies, so that the functionality of the software remains unchanged. Constructing the control flow graph of the software code comprises: performing lexical analysis and syntactic analysis on the software code to be processed to construct an abstract syntax tree; and constructing the control flow graph according to the node types and connection relationships in the abstract syntax tree. The technical scheme of this embodiment achieves at least the following beneficial technical effect: the core idea of function splitting is to split one function into several and convert intra-function control flow into inter-function calls, which not only changes the original control flow graph but also obscures the call relationships between procedures, thereby effectively resisting code comparison techniques based on control flow graph and/or function call graph analysis and protecting the intellectual property of the software code owner.

According to an embodiment of the present invention, the splitting granularity can be set through user-defined processing logic, so that functions are split selectively to meet different requirements. Preferably, the splitting process comprises: during splitting, obtaining a user-specified splitting granularity so that splitting is performed at the granularity of dominator subtrees, loops, branches, basic blocks, instructions, or a combination thereof. For example, the structural regions of the control flow graph corresponding to branches, loops, basic blocks, or combinations thereof are split. As another example, several dominator subtrees are randomly extracted from the dominator tree derived from the control flow graph, and the code fragments (structural regions) associated with the extracted dominator subtrees are split. When a combination of several granularities is used, the user can define the order of the granularities in the combination, and splitting at each granularity is then carried out in the specified order; if no order is specified, the order is assigned randomly. The technical scheme of this embodiment achieves at least the following beneficial technical effect: the cross-granularity splitting technique effectively resists code comparison techniques while preserving performance, incurs negligible performance overhead for ordinary applications, and offers low obfuscation cost and flexible implementation.
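The ordering rule for combined granularities (use the user's order if given, otherwise pick one at random) can be sketched as below. The granularity identifiers and the helper `splitting_order` are hypothetical names chosen for illustration.

```python
import random

# Hypothetical identifiers for the granularities named in the text.
GRANULARITIES = ("dominator_subtree", "loop", "branch", "basic_block", "instruction")


def splitting_order(selected, user_order=None, seed=None):
    """Return the order in which the selected granularities are applied."""
    unknown = set(selected) - set(GRANULARITIES)
    if unknown:
        raise ValueError(f"unknown granularity: {sorted(unknown)}")
    if user_order is not None:
        # A user-specified order must cover exactly the selected granularities.
        if sorted(user_order) != sorted(selected):
            raise ValueError("user order must be a permutation of the selection")
        return list(user_order)
    order = list(selected)
    random.Random(seed).shuffle(order)   # no order given: assign one randomly
    return order


explicit = splitting_order(["loop", "branch"], user_order=["branch", "loop"])
randomized = splitting_order(["loop", "branch", "basic_block"], seed=1)
```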

According to an embodiment of the invention, during the splitting process, it is not necessary to select to split all the functions, and for the proportion of the functions participating in the splitting process, the proportion can be set according to the self-defined processing logic, so as to avoid that the performance of the program is influenced too much by splitting too many functions. Preferably, the splitting process comprises: and acquiring a function splitting ratio appointed by a user during splitting, and multiplying the total number of the functions currently contained in the software code by the function splitting ratio to obtain the splitting number so as to limit the number of the functions participating in splitting. For example, a batch of functions may be randomly selected according to the number of splits as the functions participating in the splitting process, so as to split the batch of functions.
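The arithmetic described above amounts to the following sketch. Truncating the product to an integer and the helper name `pick_functions_to_split` are assumptions; the source does not say how a fractional result is rounded.

```python
import random


def pick_functions_to_split(function_names, split_ratio, seed=None):
    """Randomly choose total * ratio functions to take part in splitting."""
    if not 0.0 <= split_ratio <= 1.0:
        raise ValueError("split ratio must be between 0 and 1")
    split_count = int(len(function_names) * split_ratio)   # assumed: round down
    return random.Random(seed).sample(function_names, split_count)


# Hypothetical program with 8 functions and a 25% splitting ratio.
names = ["main", "parse", "emit", "load", "save", "hash", "fmt", "usage"]
chosen = pick_functions_to_split(names, 0.25, seed=0)
# len(chosen) == 2, because 8 * 0.25 = 2
```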

According to an embodiment of the present invention, the functions preselected for splitting as described above are not necessarily all processed in practice, because structures within the selected functions differ in execution heat: some structures are executed frequently and are therefore hot, while others are executed rarely. Splitting without distinguishing between them may hurt performance. Therefore, preferably, the splitting process includes: during splitting, analyzing the heat of the code fragments of the corresponding granularity in the functions participating in splitting, and splitting the functions that contain code fragments whose heat is below a splitting heat threshold. This further safeguards program performance. The heat may be defined by the implementer according to the specific situation. For example, as a simplification, the heat of a code fragment of any granularity may be represented by the heat of its entry basic block; that is, the heat of a dominator subtree, and likewise of the code fragment corresponding to a branch, loop, or similar structure, is taken to be the heat of the basic block at its entry.
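Under the simplification that a fragment's heat equals the heat of its entry basic block, the filtering step could look like the sketch below. The block names, the profile counts, and the record layout are hypothetical.

```python
def fragments_to_split(fragments, block_heat, heat_threshold):
    """Keep fragments whose entry basic block is colder than the threshold."""
    return [
        name
        for name, fragment in fragments.items()
        if block_heat[fragment["entry"]] < heat_threshold
    ]


# Hypothetical candidate fragments at different granularities.
fragments = {
    "error_path": {"entry": "bb7", "granularity": "branch"},
    "inner_loop": {"entry": "bb3", "granularity": "loop"},
    "cleanup": {"entry": "bb9", "granularity": "dominator_subtree"},
}
# Hypothetical execution counts per basic block, e.g. from a profiling run.
block_heat = {"bb3": 10000, "bb7": 2, "bb9": 40}

cold = fragments_to_split(fragments, block_heat, heat_threshold=100)
# cold == ["error_path", "cleanup"]; the hot inner loop is left untouched
```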

According to one embodiment of the invention, the splitting process comprises: when splitting processing is carried out, the splitting granularity, the function splitting proportion, the splitting heat threshold value or the combination thereof appointed by a user are obtained, and the function needing splitting is selected according to the splitting granularity, the function splitting proportion, the splitting heat threshold value or the combination thereof appointed by the user. The splitting process includes: and aiming at different splitting granularities, the corresponding function splitting proportion and/or the splitting heat threshold value can be independently set. Therefore, the splitting scheme can be adjusted according to actual experience or requirements in the splitting process so as to realize better program performance and better meet the confusion requirements of different users.

According to an embodiment of the present invention, during the splitting process, each of the other structural regions split from an original function (the code fragments used to generate the other, i.e. new, functions) needs to be assigned a unique function name, while the function containing the head of the original function retains the original function name. To change the structures of the control flow graph and the function call graph, data dependencies are reconstructed based on the retained original function name and the unique function names assigned to the other functions, so that the functionality of the software code remains unchanged. Reconstructing data dependencies includes changing the control flow relationships that existed before splitting into call relationships between functions.

An exemplary function and its split counterpart are given below. The function performs operations such as assigning local variables, checking parameters, opening a file, computing in a loop, and closing the file. After splitting, new function names are assigned to the newly added functions and the control flow relationships are changed into inter-function call relationships:

The original function before splitting is assumed to be:

Another example is shown in Fig. 2, a schematic diagram of function splitting at the dominator-subtree level. Assuming that the original function Bar contains three dominator subtrees that can be split, new functions Bar_2 and Bar_3 are split off, and the control flow relationships inside the function become inter-function call relationships.

According to an embodiment of the present invention, to ensure correct program execution, reconstructing the data dependence after a function is split further includes modifying and reconstructing the affected variables and scalars. For example, because part of the code of the original function is extracted to form a new function, the lifetimes of some registers and stack variables in the original function change, so some variables must be passed and others redefined. Temporary variables and/or global variables that are undefined in a split-off function must be passed, through a Caller function, from the corresponding function that retains the original function name; some scalars (constants and the like) of the original function must be redefined in the split-off function.
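The effect of splitting on data dependence can be illustrated with a small Python sketch. It is not the patent's listing: the function bodies, the name `foo_2`, and the suffix `_split` (used only so that both versions can coexist in one file) are hypothetical, and live values are passed as ordinary arguments rather than through a modeled Caller function.

```python
def foo(values, limit):
    """Original function: parameter check, loop calculation, final step."""
    scale = 3            # scalar (constant) used by the loop region
    total = 0            # local variable that stays live across the loop
    if limit <= 0:
        return -1
    for v in values:
        if v < limit:
            total += v * scale
    return total + 1


def foo_2(values, limit, total):
    """Split-off loop region with a unique name (hypothetical: foo_2)."""
    scale = 3            # the scalar is redefined inside the new function
    for v in values:
        if v < limit:
            total += v * scale
    return total         # the updated live value is handed back


def foo_split(values, limit):
    """Head of the original function; in practice it keeps the name foo."""
    total = 0
    if limit <= 0:
        return -1
    total = foo_2(values, limit, total)  # control flow edge becomes a call
    return total + 1
```

The loop region no longer sees `total` or `scale` from the enclosing function, so the former is passed in and returned and the latter is redefined, which mirrors the two repairs described above.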

According to an embodiment of the invention, the splitting logic may add prologue and epilogue instructions to a split-off function. However, because a split-off function is essentially a code segment extracted from the original function, its prologue and epilogue instructions can be removed to reduce the number of new instructions introduced by function splitting and thus the impact on program performance. The performance of the processed software code is thereby preserved.

According to one embodiment of the invention, each newly generated function is placed in the code layout immediately after the function that retains the original function name. To further increase the degree of obfuscation, the positions of the functions in the software code may be randomly rearranged after the splitting process. Preferably, therefore, the method for processing software code comprises: performing random rearrangement according to the rearrangement scope defined in the preset processing logic, wherein the scope is one of: randomly rearranging only the split-off functions, randomly rearranging all split functions, or randomly rearranging all current functions. Randomly rearranging all split functions means rearranging the positions in the software code of both the functions that retain original function names and the other split-off functions. Random rearrangement after splitting further increases the obfuscation strength: once all sub-function positions are shuffled, the code layout differs greatly from the original, and the scrambled call relationships make analysis considerably harder for an analyst.
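A minimal Python sketch of the three rearrangement scopes follows. The scope labels and the choice to shuffle functions only among the layout slots already held by functions in scope are assumptions for illustration; the text does not specify how positions are reassigned.

```python
import random


def rearrange(layout, split_off, split_heads, scope, rng):
    """Shuffle only the positions held by functions inside the chosen scope."""
    if scope == "split_off_only":      # only the newly generated functions
        movable = set(split_off)
    elif scope == "all_split":         # new functions plus retained heads
        movable = set(split_off) | set(split_heads)
    elif scope == "all_functions":     # every function currently in the code
        movable = set(layout)
    else:
        raise ValueError(scope)
    slots = [i for i, name in enumerate(layout) if name in movable]
    names = [layout[i] for i in slots]
    rng.shuffle(names)
    result = list(layout)
    for slot, name in zip(slots, names):
        result[slot] = name
    return result


layout = ["main", "Bar", "Bar_2", "Bar_3", "Foo", "Foo_2", "util"]
new_layout = rearrange(layout, ["Bar_2", "Bar_3", "Foo_2"], ["Bar", "Foo"],
                       "split_off_only", random.Random(1))
```

With the narrowest scope, functions outside it keep their positions, so the layout change is confined to the split-off functions.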

According to an embodiment of the invention, the corresponding functions may also be aggregated according to the control flow graph, so that the control flow graph and the function call graph are changed through aggregation to achieve obfuscation. The method for processing software code comprises: according to the control flow graph and preset processing logic, aggregating selected functions in the software code to change the original structure of the control flow graph, wherein aggregation merges a plurality of functions into one function and converts the call relationships among those functions into control flow relationships inside the function. During aggregation, aggregation reference information is collected in preparation for function aggregation: all call relationships between functions are collected, and a mapping table is created to record each caller and its called functions. Additional information is also recorded, including whether a call statement is inside a loop and whether it may be recursive. When functions are selected for aggregation, functions involved in recursion are not selected (that is, only functions unrelated to recursion are selected), to avoid creating new recursive calls that would degrade program performance. After the aggregation function is constructed, the code of the original functions that were merged into it is deleted.
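Collecting the aggregation reference information can be sketched in Python as follows. The input format (one tuple per call statement), the function names, and the definition of "recursion-related" as "can reach itself through the call map" are assumptions for illustration.

```python
def collect_reference_info(calls):
    """calls: list of (caller, callee, in_loop) tuples, one per call statement."""
    callees = {}  # mapping table: caller -> set of called functions
    for caller, callee, _in_loop in calls:
        callees.setdefault(caller, set()).add(callee)
        callees.setdefault(callee, set())

    def reachable(start):
        seen, stack = set(), list(callees[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(callees[node])
        return seen

    # A function is treated as recursion-related if it can reach itself.
    recursive = {f for f in callees if f in reachable(f)}
    # A function is treated as loop-related if some call to it sits in a loop.
    loop_related = {callee for _caller, callee, in_loop in calls if in_loop}
    return callees, recursive, loop_related


calls = [("run", "foo", False), ("run", "bar", True),
         ("foo", "fact", False), ("fact", "fact", False),
         ("even", "odd", False), ("odd", "even", False)]
call_map, recursive, loop_related = collect_reference_info(calls)
candidates = sorted(set(call_map) - recursive)
```

Both direct recursion (`fact`) and mutual recursion (`even` and `odd`) are excluded from the candidate list.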

According to an embodiment of the invention, it is not necessary to aggregate all functions unrelated to recursion. The proportion of functions that participate in aggregation can be set in the user-defined processing logic, so that program performance is not affected excessively by aggregating too many functions. Preferably, the aggregation process comprises: obtaining a user-specified function aggregation proportion, and multiplying it by the number of recursion-unrelated functions currently contained in the software code to obtain an aggregation number that limits how many functions participate in aggregation. For example, a batch of functions of that size may be randomly selected from all recursion-unrelated functions and then aggregated.
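As a worked illustration in Python: with 8 recursion-unrelated functions and an aggregation proportion of 0.5, the aggregation number is 4. The function names are placeholders, and rounding the product down to an integer is an assumption, since the text does not say how a fractional result is handled.

```python
import random


def pick_for_aggregation(non_recursive, ratio, rng):
    """Randomly pick ratio * len(non_recursive) functions for aggregation."""
    count = int(ratio * len(non_recursive))  # aggregation number (rounded down)
    return rng.sample(sorted(non_recursive), count)


non_recursive = {"f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8"}
batch = pick_for_aggregation(non_recursive, 0.5, random.Random(7))
```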

According to an embodiment of the present invention, loop-related functions and loop-unrelated functions may be treated differently during aggregation based on their heat. Preferably, the heat of each function is analyzed using a hot code analysis technique, and loop-related functions whose heat is below a first threshold and loop-unrelated functions whose heat is below a second threshold are aggregated, wherein the first threshold is less than or equal to the second threshold. Particularly preferably, the first threshold is less than the second threshold. This reduces the number of loop-related functions that are aggregated and hence the performance impact of aggregating too many of them. For example, for a loop-related function, the number of calls of the function and the execution frequencies of its basic blocks are analyzed by the hot code analysis technique, and functions whose execution frequency is below a certain threshold are placed into the set of functions to be aggregated, which reduces the overhead caused by aggregation. Here the execution frequency represents the heat of the function and is the product of the function's call count and the sum of the execution frequencies of its basic blocks. Similarly, loop-unrelated functions whose heat is below a certain threshold may be selected and placed into the set of functions to be aggregated. A preset number of functions is then randomly selected from that set, the bodies of the selected functions are copied into different branches of a newly created aggregation function, and a control parameter determines which branch is taken when the aggregation function is called.
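The two-threshold selection can be sketched in Python as follows. The profile format and the sample numbers are invented for illustration, and the heat formula (call count multiplied by the sum of basic-block frequencies) is one reading of the description above rather than a confirmed definition.

```python
def heat(call_count, block_frequencies):
    """Assumed heat metric: call count times summed basic-block frequency."""
    return call_count * sum(block_frequencies)


def to_be_aggregated(profile, loop_related, first_threshold, second_threshold):
    """Return functions cold enough to aggregate; loop-related ones face the
    stricter (lower or equal) first threshold."""
    assert first_threshold <= second_threshold
    selected = []
    for name, (call_count, freqs) in sorted(profile.items()):
        limit = first_threshold if name in loop_related else second_threshold
        if heat(call_count, freqs) < limit:
            selected.append(name)
    return selected


profile = {"a": (2, [1.0, 0.5]),   # heat 3.0
           "b": (10, [1.0, 4.0]),  # heat 50.0
           "c": (4, [1.0, 1.0]),   # heat 8.0
           "d": (4, [1.0, 1.0])}   # heat 8.0
result = to_be_aggregated(profile, loop_related={"b", "c"},
                          first_threshold=5.0, second_threshold=10.0)
```

Functions `c` and `d` have the same heat, but only the loop-unrelated `d` is selected, which shows how a lower first threshold keeps loop-related functions out of the set.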

After the aggregation function is constructed, each call point of an original function must be changed into a call to the aggregation function. Function calls are either direct or indirect. A direct call invokes a function explicitly by its function name; the function name denotes the function address, so calling by name is equivalent to accessing the function directly by address. An indirect call invokes a function implicitly through a function pointer, which may undergo multiple levels of propagation through assignment, arithmetic, and memory access, not only within a module but also across modules, as with callback functions. In other words, because an indirect call point calls the function through a function pointer, one must know which function the pointer actually points to in order to replace it with the corresponding aggregation function. If the path traveled by the function pointer is short, for example when one instruction assigns the function pointer and the next instruction performs the indirect call, static analysis can easily determine which function is called. If the path is long and spans functions or even modules, however, pointer analysis becomes difficult. Moreover, the overhead of such static analysis is significant, so achieving aggregation in this way is not worth the cost. The greatest difference between the two, then, is that the target of a direct call can be determined statically, whereas it is often hard to know which function the pointer of an indirect call currently points to.
Function aggregation requires replacing the called function at each call point with an aggregation function, and indirect calls undoubtedly make this harder: if the target of an indirect call cannot be determined, the corresponding aggregation function cannot be found either. In view of this complexity, the present invention handles direct calls and indirect calls separately.

According to an embodiment of the present invention, when the call point is a direct call, the call to the original function is directly replaced by a call to the aggregation function and the corresponding branch. The aggregation function and branch are located by the function name of the called aggregation function and the branch label, which together serve as the control parameter.
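The rewriting of direct call points can be sketched in Python. The names Foo, Bar, and Fusion follow the example used in this description, but the function bodies, signatures, and the integer encoding of the branch label are assumptions for illustration; the original functions are kept here only for comparison, whereas the described method deletes them after aggregation.

```python
def foo(a, b):
    return a + b


def bar(a, b):
    return a * b


def fusion(ctrl, a, b):
    """Aggregation function: ctrl (the branch label) selects the branch."""
    if ctrl == 0:       # branch holding the body of foo
        return a + b
    else:               # branch holding the body of bar
        return a * b


def caller_before(x, y):
    return foo(x, y) + bar(x, y)


def caller_after(x, y):
    # Each direct call is replaced by a call to the aggregation function
    # with the branch label that corresponds to the original callee.
    return fusion(0, x, y) + fusion(1, x, y)
```

Because the callee is known statically at a direct call point, the branch label can be written into the call as a constant.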

An exemplary aggregation process is shown in Fig. 3, in which the function Foo and the function Bar are aggregated into a function Fusion.

According to an embodiment of the present invention, when the call point is an indirect call, the function pointer must be processed to keep the indirect call feasible and preserve program performance. As mentioned above, the difficulty with indirect calls is that the aggregation function to substitute cannot be determined at the indirect call point. The present invention therefore moves the replacement to an earlier point. Although a function pointer may undergo complex computation and long-distance propagation, every function pointer has an initial assignment point; by analyzing the function assigned at that point, the corresponding aggregation function is found and substituted. From then on the function pointer points to the aggregation function however it is propagated. As with direct calls, calling an aggregation function requires a control parameter to determine the control flow inside it. Unlike a direct call, however, the control parameter cannot simply be supplied at the call point, because the call point does not always know which function it calls and so cannot determine the parameter's value. The control parameter information (that is, the call control information) is therefore recorded in the function pointer itself, then parsed out at the call point and passed to the call instruction.

According to one embodiment of the invention, the indirect call problem is solved with a tagged pointer (that is, a function pointer carrying a tag). For example, if all functions are first aligned to 16 bytes, the low 4 bits of every function pointer are idle, so the call control information can be recorded in the low 2 bits, namely bit 0 and bit 1 of the function pointer. Bit 1 is called the aggregation indication bit (or Ctrl Sign, CS bit) and indicates whether the function at that address has been aggregated; bit 0 is called the branch indication bit (or Ctrl Bit, CB bit, which records the branch label) and indicates the branch (position) of the function within the corresponding aggregation function. It should be understood that this is only an illustration: the number of bits occupied by the call control information may be increased, adding branch indication bits to distinguish more than two functions aggregated into one aggregation function.
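The bit layout can be illustrated with integer arithmetic in Python, treating a function pointer as a plain integer address. The helper names combine_ptr, extract_sign, extract_ctrl, and extract_value are those used for Fig. 4 in this description; their signatures and the sample address are assumptions for illustration.

```python
ALIGNMENT = 16          # 16-byte alignment leaves the low 4 address bits idle
CS_BIT = 1 << 1         # bit 1: aggregation indication (CS) bit
CB_BIT = 1 << 0         # bit 0: branch indication (CB) bit
TAG_MASK = CS_BIT | CB_BIT


def combine_ptr(address, aggregated, branch):
    """Record call control information in the idle low bits of the address."""
    assert address % ALIGNMENT == 0, "tagging relies on aligned addresses"
    tag = (CS_BIT if aggregated else 0) | (CB_BIT if branch else 0)
    return address | tag


def extract_sign(ptr):
    """True if the target function has been aggregated (CS bit set)."""
    return (ptr & CS_BIT) != 0


def extract_ctrl(ptr):
    """Branch label inside the aggregation function (CB bit)."""
    return ptr & CB_BIT


def extract_value(ptr):
    """The untagged function address."""
    return ptr & ~TAG_MASK


tagged = combine_ptr(0x401230, aggregated=True, branch=1)
```

The tag survives copies and assignments of the pointer, and stripping it with a mask recovers the original address because alignment guarantees those bits were zero.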

By way of example, an indirect call process is shown in Fig. 4. The combine_ptr() function (corresponding to the preset assignment function) is responsible for adding call control information (or tag information), shown as ctrl_tag in Fig. 4, to the function pointer. The labels 0, 1, and 63 above the function pointer in the upper right corner of Fig. 4 denote the corresponding bit positions in the function pointer.

For example, when the program is to execute the bar function and the corresponding call control information is 11, the call control information added to the function pointer f is propagated along with the pointer through subsequent data flow. Parsing code for extracting the call control information is inserted at all indirect call points. The extract_sign() function in the parsing code checks whether the CS bit is 1; if so, the function to be called has been aggregated, and the right branch of the flowchart in Fig. 4 is taken, in which extract_ctrl() extracts the CB bit (the ctrl parameter) from the function pointer and extract_value() extracts the untagged function pointer. Finally, the function is called through the extracted pointer, and the CB bit (ctrl parameter) and the arguments needed for the computation (here c and d) are passed to the called function.

As another example, when the program is to execute the bar function and the corresponding call control information is 00, the extract_sign() function in the parsing code finds that the CS bit of the indirectly called function pointer is 0, indicating that the called function has not been aggregated, and the left branch of the flowchart in Fig. 4, that is, the original call statement tmp = f(c, d), is executed.
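The two branches of the parsing code can be simulated in Python by modeling memory as a table from aligned addresses to callables. The table, the addresses, and the function bodies are hypothetical; only the CS/CB bit positions and the control flow follow the description of Fig. 4 above.

```python
def plain(c, d):
    """A function that was not aggregated."""
    return c - d


def fusion(ctrl, c, d):
    """An aggregation function with two branches selected by ctrl."""
    return c + d if ctrl == 0 else c * d


# Hypothetical "memory": 16-byte-aligned addresses mapped to callables.
CODE = {0x1000: plain, 0x2000: fusion}


def indirect_call(f, c, d):
    """Parsing code inserted at an indirect call point."""
    if (f >> 1) & 1:             # extract_sign: CS bit set, target aggregated
        ctrl = f & 1             # extract_ctrl: CB bit (branch label)
        val = f & ~0b11          # extract_value: strip the two tag bits
        return CODE[val](ctrl, c, d)
    return CODE[f](c, d)         # original call: tmp = f(c, d)


bar_ptr = 0x2000 | 0b11    # call control information 11: aggregated, branch 1
plain_ptr = 0x1000 | 0b00  # call control information 00: not aggregated
propagated = bar_ptr       # the tag travels with the pointer when it is copied
```

The call point never needs to know which function the pointer denotes; everything it requires is carried in the pointer's low bits.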

According to an embodiment of the present invention, deeper obfuscation may also be applied inside the aggregation function. Preferably, the aggregation process further comprises: adding opaque predicates to basic blocks inside the aggregation function so that unconditional jump instructions become conditional branch instructions built on the opaque predicates (deep obfuscation at the basic block level). For example, Fig. 5 adds opaque predicates to the example of Fig. 3, increasing the degree of obfuscation. The number or proportion of opaque predicates inserted into each aggregation function can be user-defined and added to the processing logic. Another form of deeper obfuscation is also possible. Preferably, the aggregation process further comprises: finding basic blocks in the aggregation function whose instructions can be interleaved without functional interference, and interleaving the instructions of those basic blocks to synthesize a new basic block (also called a harmless basic block; this is deep obfuscation at the instruction level). For example, two basic blocks that belonged to different original functions before aggregation can be merged into one basic block in the aggregation function if they do not interfere with each other and do not alter the result of the program. Likewise, the number or proportion of basic blocks in each aggregation function that undergo instruction-level deep obfuscation can be user-defined and added to the processing logic. The two deep obfuscation schemes can be combined; Fig. 6 shows an exemplary effect of using both.
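The basic-block-level scheme can be illustrated with a Python sketch of an opaque predicate. The particular predicate (the product of two consecutive integers is always even), the function bodies, and the decoy path are assumptions for illustration; the description does not specify which predicates are used.

```python
def opaque_true(x):
    # x * (x + 1) is a product of consecutive integers, so it is always even;
    # the predicate is always true but that is not obvious from its form.
    return (x * (x + 1)) % 2 == 0


def fusion_plain(ctrl, a, b):
    if ctrl == 0:
        result = a + b
    else:
        result = a * b
    return result


def fusion_obfuscated(ctrl, a, b):
    if ctrl == 0:
        result = a + b
    else:
        result = a * b
    # The unconditional transfer to the return is replaced by a conditional
    # branch on the opaque predicate; the false side is never executed.
    if opaque_true(a):
        return result
    return result ^ 0x5A5A  # dead decoy path that only adds graph edges
```

The obfuscated version computes the same results while its control flow graph gains an extra branch and an extra block.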

The invention aims to solve problems of existing techniques for countering code reuse detection, such as poor obfuscation effect, excessive performance overhead, and insufficient resistance to machine learning methods. To better achieve the obfuscation effect and protect the intellectual property of the program publisher, according to one embodiment the splitting process and the aggregation process may be used in combination; for example, splitting may be performed first and aggregation afterwards. The invention splits and aggregates functions dynamically. Function splitting divides one function into a plurality of functions and replaces control flow relationships inside the function with call relationships among functions, thereby obfuscating both the function call graph and the control flow graph. Function aggregation merges at least two functions into one function (the number of pre-aggregation functions per aggregation function can be user-defined) and modifies the call targets at the call points, destroying the existing call relationships more thoroughly. Unlike the prior art, neither process obfuscates by adding a large number of useless instructions, so a better obfuscation effect is achieved with less impact on program performance.

A short piece of software code to be processed is given below to illustrate the effect of the splitting process followed by the aggregation process.

Exemplary software code to be processed is:

Splitting the software code to be processed yields:

After the splitting process, the aggregation process is performed to obtain:

According to an embodiment of the present invention, the method can be implemented at compile time, so that the obfuscation is completed while the high-level language is translated into the corresponding machine language.

The present invention also provides, according to one embodiment thereof, a system for processing software code, comprising: a front-end module for acquiring the software code to be processed and constructing a control flow graph of the software code; a splitting module for selectively splitting corresponding functions in the software code according to the control flow graph and preset processing logic so as to change the original structure of the control flow graph, wherein data dependence is reconstructed by adding call relationships; and an aggregation module for selectively aggregating corresponding functions in the software code according to the control flow graph and preset processing logic so as to change the original structure of the control flow graph, wherein data dependence is reconstructed by adjusting call relationships.

According to one embodiment of the present invention, the above method and system may be implemented using the open-source LLVM (Low Level Virtual Machine) compiler framework. Although the implementation here is based on LLVM, the technical logic can be ported to other compiler frameworks such as GCC, TCC, ICC, LCC, and the domestic compiler LOONGCC.

The LLVM compiler framework is a collection of modular, reusable compiler and toolchain technologies. Conventional compilers, such as GCC (GNU Compiler Collection), are typically divided into a front end (Frontend), a middle-end optimizer (Optimizer), and a back end (Backend). The front end parses the source code, checks syntax-level errors, and builds an Abstract Syntax Tree (AST); the middle end optimizes the AST or another intermediate representation; the back end generates binary code for the target platform. LLVM is also divided into these three phases, but unlike the tightly coupled design of traditional compilers, its front end, middle end, and back end are highly decoupled modules. It provides a unified Intermediate Representation (LLVM IR) for all languages and platforms; the middle end operates only on IR, and the result of processing an IR is denoted IR'.

Most of LLVM's logic handles compilation optimization and code generation, and this functionality is organized as a sequence of Passes. A Pass means one "pass" over the program (also called a pipeline in some places): a Pass traverses the IR, processes the functions in all modules, produces a new IR, and hands it to the next Pass. For example, middle-end optimization runs Passes such as loop unrolling, function convergence, and dead code elimination in sequence, and back-end optimization and code generation run Passes such as instruction selection, register allocation and optimization, and instruction emission in sequence.

According to an example of the present invention, if the splitting process is to be performed and the user-set processing logic is to perform loop splitting, basic block splitting, and random rearrangement in sequence, the splitting module is implemented at the middle end (see Fig. 7) by Passes that perform loop splitting, basic block splitting, and random rearrangement.

According to an example of the present invention, if aggregation processing is to be performed, referring to fig. 8, the aggregation module is implemented by a Pass corresponding to aggregation processing, which collects aggregation reference information.

According to an example of the present invention, the LLVM compiler framework is used to implement splitting and aggregation at compile time, while compiling high-level C/C++ code into low-level X86 ELF/PE binaries. First, a configuration file is supplied that contains the source files of the software code and the preset processing logic. The LLVM compiler framework parses the source files through the C/C++ front end (the front-end module), checks syntax-level errors, and builds the control flow graph. The code then passes through the middle-end optimizer and the X86 back end in sequence to yield the processed software code, namely the obfuscated binary file. The middle-end optimizer integrates the splitting module, which performs the splitting process, and the aggregation module, which performs the aggregation process. In the LLVM compiler, heat can be determined with the hot code analysis technique on the basis of BlockFrequencyInfo.

It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.

The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.

The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.

Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A method of processing software code, comprising:

acquiring a software code to be processed, and constructing a control flow graph of the software code;

according to the control flow graph and preset processing logic, splitting and/or aggregating selected functions in the software codes to change the original structure of the control flow graph, wherein the splitting is to split one function into a plurality of functions and convert the control flow relationship in the function into a call relationship among the functions, and the aggregating is to aggregate a plurality of functions into one function.

2. The method of claim 1, wherein the splitting is performed on the corresponding function in the software code before the aggregation.

3. The method according to claim 1 or 2, wherein the splitting process comprises:

splitting an original function into a plurality of functions, wherein one function retains the original function name, and unique function names are respectively assigned to the other functions;

based on the retained original function name and the unique function names assigned to the other functions, converting the control flow relationship before splitting into the call relationship among the functions.

4. The method of claim 3, wherein the splitting process further comprises:

for temporary variables and/or global variables that are undefined in the other split functions, adding a Caller function to the split functions, and passing, through the Caller function during execution, the required temporary variables and/or global variables from the function that corresponds to the other split functions and retains the original function name; and

redefining the scalars in the other split functions according to the values recorded in the corresponding original function.

5. The method according to claim 1 or 2, wherein the splitting process comprises: obtaining a user-specified splitting granularity during splitting, so as to split at the granularity of a subtree, a loop, a branch, a basic block, an instruction, or a combination thereof.

6. The method according to claim 1 or 2, wherein the splitting process comprises: randomly rearranging the positions of the split functions in the software code.

7. The method according to claim 1 or 2, wherein the aggregation process comprises:

collecting aggregation reference information in the current software code, wherein the aggregation reference information comprises the call relationships among functions and whether each function is related to recursion;

according to the aggregation reference information, aggregating functions in the current software code while ignoring the functions related to recursion.

8. The method of claim 7, wherein the aggregation reference information further comprises whether a function is related to a loop, and the aggregation process comprises:

analyzing the heat of each function using a hot code analysis technique, and aggregating loop-related functions whose heat is below a first threshold and loop-unrelated functions whose heat is below a second threshold, wherein the first threshold is less than or equal to the second threshold.

9. The method of claim 7, wherein the aggregation process comprises:

aggregating at least two functions into an aggregation function, assigning a unique function name to the aggregation function and assigning unique branch labels in the aggregation function to at least two functions before aggregation respectively;

in each calling function that needs to directly call a branch of the aggregation function, changing the call to the pre-aggregation function into a call to the aggregation function and that branch.

10. The method of claim 7, wherein the aggregation process comprises:

performing byte alignment processing on all functions to enable part of bits in a function pointer to be in an idle state;

inserting a preset assignment function into a caller that needs to indirectly call a function, so as to add call control information to the function pointer of the caller, wherein the call control information comprises an aggregation indication and a branch label in the aggregation function, the aggregation indication indicating whether the called function is an aggregation function;

inserting parsing code for parsing the call control information at all indirect call points, wherein, when the parsing determines that an aggregation function is to be called indirectly, the parsing code passes the arguments required for the computation and the branch label recorded in the pointer to the called aggregation function.

11. A computer-readable storage medium, having embodied thereon a computer program, the computer program being executable by a processor to perform the steps of the method of any one of claims 1 to 10.

12. An electronic device, comprising:

one or more processors; and

a memory, wherein the memory is to store executable instructions;

the one or more processors are configured to implement the steps of the method of any one of claims 1-10 via execution of the executable instructions.