CN103440122A

CN103440122A - Novel static function identification method using reverse extension control flow graphs

Info

Publication number: CN103440122A
Application number: CN2013102919410A
Authority: CN
Inventors: 邱景; 苏小红; 马培军; 赵玲玲; 王甜甜
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Nenchuang Digital Technology Co ltd
Priority date: 2013-07-12
Filing date: 2013-07-12
Publication date: 2013-12-11
Anticipated expiration: 2033-07-12
Also published as: CN103440122B

Abstract

The invention discloses a novel static function identification method using reverse extended control flow graphs, which belongs to the field of software reverse engineering. The method comprises the following steps: 1, building the set of reverse extended control flow graphs for a region; 2, denoising the reverse extended control flow graphs by deleting nodes that were found during construction but cannot be generated by a compiler; 3, deleting and merging reverse extended control flow graphs; 4, identifying the function entry in each reverse extended control flow graph; 5, obtaining the identification results for the multiple functions in the specified region. Unlike conventional methods, the method takes function return instructions as the identification feature and uses function return instruction nodes as reverse search starting points to construct the reverse extended control flow graphs. It can therefore identify multiple functions in a specified binary code region, including functions that have no specific header byte features and no cross-references and so cannot be identified by conventional static identification methods.

Description

Novel static function identification method using a reverse extended control flow graph

Technical Field

The invention belongs to the field of software reverse engineering and relates to a static function identification method using a reverse extended control flow graph.

Background

Binary code review is a security audit performed on binary code. Software of any significant size inevitably uses third-party components, and these components often ship without source code; the Microsoft system dynamic link libraries are one example. To review such software, reverse engineering is almost the only option. In addition, malicious code is currently widespread and seriously threatens the security of computer systems, so detecting it is particularly important. Because the source of most malicious code cannot be obtained, reverse engineering is again almost the only means of analysis.

Reverse engineering includes disassembly and the recognition of functions and high-level language elements such as library functions, variables, and structures. Finally, the operational semantics of each function are identified, and the cross-references among functions are analyzed to improve understanding of the semantics of the whole program. These steps show that function identification is a crucial link in the whole reverse engineering process. Conventional static identification methods identify functions from the characteristic bytes at the start of a function and from cross-reference information between functions. However, binary code often contains large numbers of functions with neither distinctive features nor cross-references, and conventional static methods cannot identify such functions effectively.

Disclosure of Invention

In order to identify multiple functions that have no distinctive features and no cross-references in a specified binary code region, the invention provides a novel static function identification method using a reverse extended control flow graph.

The basic idea of the technical scheme adopted by the invention is as follows. For a specified binary code region, every address that matches the characteristics of a function return instruction (generally a RET instruction) is assumed to be a function return address, and a corresponding Reverse Extended Control Flow Graph (RECFG) is constructed bottom-up from each such address. A RECFG is a control flow graph in which nodes represent instructions and edges represent control dependencies between instructions. Unlike a traditional control flow graph, it is constructed in reverse from a function return instruction node, and the predecessors of each node are all the instructions that could possibly precede it. Because the reverse search starts from a function return instruction, the RECFG contains the control flow graph of the function to be identified; the traditional control flow graph is therefore a subgraph of the RECFG.

Any two RECFGs stand in exactly one of three relationships: 1) they are independent, i.e. not connected to each other; 2) they are both subgraphs of one graph, i.e. they belong to the same function; 3) they conflict, i.e. the search starting point (function return instruction) of one graph is part of an operand of an instruction in the other graph. Two graphs in relationship 2) can be merged. For two graphs in conflict, one graph is deleted using the multi-attribute decision ideal point method. In the end, each remaining independent RECFG corresponds to one function.

A function has exactly one entry point, which may be any node in its RECFG. Traversing the RECFG from a node yields a subgraph, which is the control flow graph the function would have if that node were its entry point. The function identification problem is thus transformed into the problem of identifying the function entry point. Finally, the entry point is identified with the multi-attribute decision ideal point method, according to the attributes of the control flow graph corresponding to each candidate node.

The static function identification method using a reverse extended control flow graph according to the invention comprises the following steps:

Step 1: establishing the set of reverse extended control flow graphs for the region;

Step 2: denoising the reverse extended control flow graphs by deleting nodes that were found during RECFG construction but cannot be generated by a compiler;

Step 3: deleting and merging reverse extended control flow graphs;

Step 4: identifying the function entry in each reverse extended control flow graph;

Step 5: obtaining the identification results for the multiple functions in the specified region.

Unlike traditional methods, the method takes the return instruction of a function as the identification feature and uses function return instruction nodes as reverse search starting points to construct reverse extended control flow graphs. It can identify multiple functions in a specified binary code region, and it can effectively identify functions that have no specific header byte features and no cross-references, which traditional static identification methods cannot identify.

Drawings

FIG. 1 is a schematic flow diagram of the method of the invention;

FIG. 2 shows the reverse extended control flow graph construction algorithm;

FIG. 3 shows the reverse search algorithm used within the construction algorithm;

FIG. 4 is a schematic diagram of a reverse extended control flow graph;

FIG. 5 shows the identification result for FIG. 4.

Detailed Description

The technical solution of the present invention is further described below with reference to the accompanying drawings, but the present invention is not limited thereto, and modifications or equivalent substitutions may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

The first embodiment: in the static function identification method of this embodiment, a corresponding reverse extended control flow graph is first constructed, for a specified binary code region, from every address that matches the characteristics of a function return instruction. Three attributes of each graph are then calculated (the total instruction length, the cyclomatic complexity, and the percentage of full-digit instructions), and the multi-attribute decision ideal point method is used to resolve possible conflicts between graphs. Finally, function identification is converted into the problem of identifying the function entry point: the same three attributes are calculated for the subgraph obtained by traversing from each node without predecessors in each graph, and the ideal point method decides the function entry point, giving the final function identification result. The specific steps are as follows:

Step 1: establishing the set of reverse extended control flow graphs for the region:

For a specified code region, a corresponding Reverse Extended Control Flow Graph (RECFG) is constructed bottom-up from every address that matches the characteristics of a function return instruction; together these graphs form the RECFG set of the region.
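As a minimal sketch of how the candidate search starting points might be collected, the following Python snippet scans a byte region for the x86 near-return opcodes 0xC3 (RET) and 0xC2 (RET imm16). The function name `find_return_candidates` and the restriction to these two opcodes are assumptions made for illustration; the text only states that addresses matching the characteristics of a function return instruction are used.

```python
# Hypothetical helper: collect every offset whose bytes look like an x86
# near return. Each hit is only a candidate; it may actually be data or
# part of another instruction's operand.
RET = 0xC3        # RET
RET_IMM16 = 0xC2  # RET imm16 (opcode followed by a 2-byte immediate)


def find_return_candidates(code: bytes, base: int = 0) -> list[int]:
    """Return the addresses in `code` that match a return-instruction pattern."""
    candidates = []
    for offset, byte in enumerate(code):
        if byte == RET:
            candidates.append(base + offset)
        elif byte == RET_IMM16 and offset + 2 < len(code):
            # The two immediate bytes must still lie inside the region.
            candidates.append(base + offset)
    return candidates
```

Every returned address would then serve as the reverse search starting point of one RECFG; false candidates are expected and are dealt with by the later denoising and conflict-resolution steps.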

As shown in FIG. 2, the construction of a RECFG is divided into two major steps. In the first step, the initial graph is constructed in reverse: the search starting point is added to the graph, and then all possible predecessors of each newly added node are searched for repeatedly. FIG. 3 shows the algorithm that searches for all possible predecessors of a node v in the RECFG. It tests whether an instruction of each possible length is a predecessor of v: if the bytes of an instruction are identical to the bytes of the same length immediately before (at lower addresses than) the instruction address of v, the instruction is a predecessor of v. In the second step, for the jump target of each branch instruction in the graph, the control flow graph starting from that target is added to the current RECFG using the conventional control-flow-based recursive traversal method. FIG. 4 shows a schematic diagram of a RECFG.
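The reverse predecessor search can be sketched as follows. This is an illustrative reconstruction rather than the algorithm of FIG. 2 and FIG. 3: `TOY_LENGTHS` is a hypothetical length decoder covering only a few one-byte-opcode x86 instructions (a real implementation would use a full disassembler), only the first construction step is shown, and no filtering of implausible predecessors is attempted.

```python
# Toy length decoder for a handful of x86 instructions (assumption for the
# sketch; a real tool would call a complete disassembler here).
TOY_LENGTHS = {
    0x55: 1,  # push ebp
    0x5D: 1,  # pop ebp
    0x90: 1,  # nop
    0xC3: 1,  # ret
    0x6A: 2,  # push imm8
    0xB8: 5,  # mov eax, imm32
}
MAX_INSN_LEN = 15  # architectural upper bound on x86 instruction length


def decode_length(code, addr):
    """Return the length of the instruction starting at addr, or None."""
    if not 0 <= addr < len(code):
        return None
    length = TOY_LENGTHS.get(code[addr])
    if length is None or addr + length > len(code):
        return None
    return length


def possible_predecessors(code, v):
    """Addresses p such that an instruction decoded at p ends exactly at v."""
    return [v - n for n in range(1, MAX_INSN_LEN + 1)
            if decode_length(code, v - n) == n]


def build_reverse_graph(code, start):
    """Map every reached address to its list of possible predecessors."""
    preds, worklist = {}, [start]
    while worklist:
        v = worklist.pop()
        if v in preds:
            continue  # already expanded
        preds[v] = possible_predecessors(code, v)
        worklist.extend(preds[v])
    return preds
```

For the bytes `55 B8 90 90 90 90 5D C3` and a start at offset 7, the node at offset 6 gets two possible predecessors: the `mov eax, imm32` at offset 1 and a `nop` decoded from inside its immediate at offset 5. This overlap is what makes the graph "extended" compared with a conventional control flow graph.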

The reverse extended control flow graph (RECFG) is an intermediate representation for function identification proposed by the invention. It is a control flow graph in which nodes represent instructions and edges represent control dependencies between instructions; unlike a traditional control flow graph, the predecessors of a node in a RECFG are all the instructions that could possibly precede that node.

Step 2: denoising the reverse extended control flow graph by deleting nodes that were found during RECFG construction but cannot be generated by a compiler:

Step 21: deleting illegal conditional branch instructions, including conditional branch instructions without predecessor nodes, conditional branch instructions whose jump target is an illegal memory address, and conditional branch instructions whose two branches contain instructions with overlapping bytes;

Step 22: deleting paths that do not end in a function return instruction, a jump instruction, or a CALL instruction;

Step 23: checking each path, and deleting every node whose paired instruction is unmatched (such as a PUSHAD with no POPAD on the path), together with all of its predecessors;

Step 24: deleting privileged instructions and all of their predecessors, including interrupt-related instructions, halt instructions, and instructions that operate on special CPU registers;

Step 25: deleting instructions without practical significance, including NOP, breakpoint instructions, additions of 0, subtractions of 0, logical OR with 0, shift instructions with a shift count of 0, and move instructions whose source and destination operands are the same;

Step 26: taking the subgraph that contains the reverse search starting point as the new reverse extended control flow graph;

Step 27: repeating steps 21 to 26 until the graph no longer changes.
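The fixed-point structure of the denoising loop can be sketched as below, using only simplified versions of steps 24 to 27. The graph representation (a dict from node address to the set of its predecessor addresses), the `mnemonic` lookup, and the contents of `PRIVILEGED` and `MEANINGLESS` are all assumptions chosen for illustration, not the patent's data structures.

```python
# Illustrative mnemonic sets; the real lists would be far more complete.
PRIVILEGED = {"hlt", "cli", "sti", "int"}
MEANINGLESS = {"nop", "int3"}


def ancestors(preds, node):
    """All nodes reachable from `node` by following predecessor edges."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        for p in preds.get(n, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def denoise(preds, start, mnemonic):
    """Repeat the pruning passes until the graph stops changing (step 27)."""
    while True:
        doomed = set()
        for n in preds:
            if mnemonic[n] in PRIVILEGED:        # step 24: node and predecessors
                doomed |= {n} | ancestors(preds, n)
            elif mnemonic[n] in MEANINGLESS:     # step 25: node only
                doomed.add(n)
        doomed.discard(start)                    # never delete the search start
        pruned = {n: {p for p in ps if p not in doomed}
                  for n, ps in preds.items() if n not in doomed}
        # Step 26: keep only the subgraph that still contains the start node.
        keep = {start} | ancestors(pruned, start)
        pruned = {n: ps & keep for n, ps in pruned.items() if n in keep}
        if pruned == preds:
            return pruned
        preds = pruned
```

The loop is needed because one deletion can expose another: removing a node may leave other nodes disconnected from the search starting point, and those are only discarded by the subgraph selection of the same or a later pass.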

Step 3: deleting and merging reverse extended control flow graphs:

Step 31: enumerating every pair of RECFGs in the RECFG set of the region, and if the reverse search starting point (function return instruction) of one graph is a node of the other graph, deleting the graph with fewer nodes from the set;

Step 32: enumerating every pair of RECFGs in the RECFG set of the region, and if the reverse search starting point (function return instruction) of one graph is part of the instruction corresponding to some node in the other graph, deleting the worse of the two graphs using the multi-attribute decision ideal point method, according to the sum of instruction lengths, the cyclomatic complexity, and the percentage of full-digit instructions of each graph.
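The two conditions tested in steps 31 and 32 can be sketched as a small classifier. The representation of a graph as a dict with a `start` address and a `nodes` map from instruction address to instruction length is hypothetical, and the function only checks the starting-point conditions; it does not test general connectivity between the graphs.

```python
def relation(g1, g2):
    """Classify two RECFGs as 'same_function', 'conflict' or 'independent'."""
    for a, b in ((g1, g2), (g2, g1)):
        # Step 31 condition: the start of one graph is a node of the other.
        if a["start"] in b["nodes"]:
            return "same_function"
        # Step 32 condition: the start of one graph lies strictly inside the
        # bytes of an instruction of the other graph.
        for addr, length in b["nodes"].items():
            if addr < a["start"] < addr + length:
                return "conflict"
    return "independent"
```

A "same_function" pair would be resolved by dropping the graph with fewer nodes, while a "conflict" pair would be handed to the ideal point method to decide which graph to delete.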

Step 4: identifying the function entry in the reverse extended control flow graph:

Step 41: traversing the graph from each node that has no predecessor, to obtain the control flow graph that takes that node as its entry point;

Step 42: according to the sum of instruction lengths, the cyclomatic complexity, and the percentage of full-digit instructions of each such graph, using the multi-attribute decision ideal point method to decide the optimal control flow graph, which is the function identification result (as shown in FIG. 5).
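The three attributes of a candidate control flow graph might be computed as in the following sketch. The input maps `succs`, `length`, and `full_digit` are hypothetical, and the cyclomatic complexity is taken as E - N + 2 for a single connected graph, which is the usual definition but is not spelled out in the text.

```python
def cfg_attributes(entry, succs, length, full_digit):
    """Return (instruction length sum, cyclomatic complexity, full-digit ratio)
    for the control flow graph reachable from `entry`."""
    seen, stack, edges = {entry}, [entry], 0
    while stack:
        n = stack.pop()
        for s in succs.get(n, ()):
            edges += 1                      # every outgoing edge counts once
            if s not in seen:
                seen.add(s)
                stack.append(s)
    total_length = sum(length[n] for n in seen)
    cyclomatic = edges - len(seen) + 2      # E - N + 2 for one connected graph
    ratio = sum(1 for n in seen if full_digit[n]) / len(seen)
    return total_length, cyclomatic, ratio
```

Calling this once per node without predecessors gives one attribute triple per candidate entry point, which is the input the ideal point method ranks.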

The second embodiment: unlike the first embodiment, this embodiment specifies the multi-attribute decision ideal point method used in step 32 and step 42, which comprises the following steps:

(1) Calculating the decision matrix:

The evaluation indexes are the sum of the instruction lengths, the cyclomatic complexity, and the percentage of full-digit instructions of the graph. A full-digit instruction is an instruction that operates on the maximum number of bits the CPU can process at one time. The weights $w_j$ ($j = 1, 2, 3$) of these 3 evaluation indexes are 0.5, 0.38, and 0.12, respectively.

Suppose there are $m$ alternatives and 3 evaluation indexes; the decision matrix is then

$$D = (x_{ij})_{m \times 3},$$

where the element $x_{ij}$ of $D$ is the value of the $j$-th evaluation index of the $i$-th alternative.

(2) Calculating the normalized decision matrix.

(3) Calculating the weighted normalized decision matrix, where $w_j$ is the weight of the $j$-th evaluation index.

(4) Determining the positive ideal solution $A^*$ and the negative ideal solution $A^-$ from the weighted decision matrix.

(5) Calculating the Euclidean distance from each alternative $i$ to the positive ideal solution $A^*$ and to the negative ideal solution $A^-$.

(6) Calculating the relative closeness $C^*$ of each alternative $i$.

(7) Ranking the alternatives by relative closeness.

In the ranking, the larger the closeness $C^*$, the better the alternative; the alternative with the largest value is the optimal one.
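The steps above follow the general shape of the ideal point (TOPSIS) method, and the sketch below uses its commonly used textbook formulas as a stand-in: vector normalization $r_{ij} = x_{ij} / \sqrt{\sum_i x_{ij}^2}$, weighting $v_{ij} = w_j r_{ij}$, and closeness $C_i^* = S_i^- / (S_i^* + S_i^-)$. These formulas, and the default treatment of all three indexes as "larger is better", are assumptions; only the weights 0.5, 0.38, and 0.12 come from the text.

```python
import math

WEIGHTS = (0.5, 0.38, 0.12)  # weights stated in the text for the three indexes


def topsis(matrix, weights=WEIGHTS, benefit=(True, True, True)):
    """Return the relative closeness of each row (alternative) of `matrix`."""
    cols = range(len(weights))
    # Vector normalization (assumed); a zero column keeps a divisor of 1.0.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0 for j in cols]
    # Weighted normalized matrix: v_ij = w_j * x_ij / norm_j
    v = [[weights[j] * row[j] / norms[j] for j in cols] for row in matrix]
    # Positive and negative ideal solutions, index by index.
    best = [max(r[j] for r in v) if benefit[j] else min(r[j] for r in v)
            for j in cols]
    worst = [min(r[j] for r in v) if benefit[j] else max(r[j] for r in v)
             for j in cols]
    closeness = []
    for r in v:
        d_best = math.sqrt(sum((r[j] - best[j]) ** 2 for j in cols))
        d_worst = math.sqrt(sum((r[j] - worst[j]) ** 2 for j in cols))
        total = d_best + d_worst
        closeness.append(d_worst / total if total else 0.0)
    return closeness
```

Each row of `matrix` holds the three index values of one alternative, for example one candidate control flow graph; the alternative with the largest returned closeness would be selected.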

Claims

1. A new static function identification method using a reverse extended control flow graph, characterized by comprising the following steps:

step 1: establishing the set of reverse extended control flow graphs for a region;

step 2: denoising the reverse extended control flow graphs by deleting nodes that were found during construction of the reverse extended control flow graphs but cannot be generated by a compiler;

step 3: deleting and merging reverse extended control flow graphs;

2. The method according to claim 1, wherein the specific process of step 1 is as follows:

for the specified code region, constructing a corresponding reverse extended control flow graph bottom-up from every address that matches the characteristics of a function return instruction, thereby forming the set of reverse extended control flow graphs of the region.

3. The new static function identification method using a reverse extended control flow graph according to claim 2, characterized in that the construction of the reverse extended control flow graph is divided into two steps:

first, the initial graph is constructed in reverse: the search starting point is added to the graph, and all possible predecessors of each newly added node are searched for repeatedly;

second, for the jump target of each branch instruction in the graph, the control flow graph starting from that jump target is added to the current reverse extended control flow graph using the conventional control-flow-based recursive traversal method.

4. The new static function identification method using a reverse extended control flow graph according to claim 1, wherein step 2, denoising the reverse extended control flow graph, comprises the following steps:

step 23: checking each path, and deleting every node whose paired instruction is unmatched, together with all of its predecessors;

step 27: repeating steps 21 to 26 until the graph no longer changes.

5. The method of claim 1, wherein step 3, deleting and merging reverse extended control flow graphs, comprises the following steps:

step 31: enumerating every pair of reverse extended control flow graphs in the set of the region, and if the reverse search starting point of one graph is a node of the other graph, deleting the graph with fewer nodes from the set;

step 32: enumerating every pair of reverse extended control flow graphs in the set of the region, and if the reverse search starting point of one graph is part of the instruction corresponding to some node in the other graph, deleting the worse of the two graphs using a multi-attribute decision ideal point method, according to the sum of instruction lengths, the cyclomatic complexity, and the percentage of full-digit instructions of each graph.

6. The new static function identification method using a reverse extended control flow graph according to claim 1, wherein step 4, identifying a function entry in a reverse extended control flow graph, comprises the following steps:

step 42: according to the sum of instruction lengths, the cyclomatic complexity, and the percentage of full-digit instructions of each graph, deciding the optimal control flow graph as the function identification result using a multi-attribute decision ideal point method.