CN109002723B

CN109002723B - Sectional type symbol execution method

Info

Publication number: CN109002723B
Application number: CN201810819763.7A
Authority: CN
Inventors: 胡昌振; 马锐; 窦伯文; 王龙; 高浩然
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2018-07-24
Filing date: 2018-07-24
Publication date: 2021-09-07
Anticipated expiration: 2038-07-24
Also published as: CN109002723A

Abstract

The invention uses a segmented symbolic execution method that divides a program into coarse-grained program segments and analyzes the program symbolically by executing each segment independently. This improves the analysis efficiency and accuracy of existing symbolic execution tools on large-scale programs and of existing segmented symbolic execution methods that analyze segments sequentially. The method divides a program into several larger program segments by a clustering method, performs independent symbolic execution on each segment, and then merges the symbolic execution results of the segments to complete the analysis of the whole program.

Description

Segmented symbolic execution method

Technical Field

The invention belongs to the technical field of vulnerability mining in information security, and in particular relates to a segmented symbolic execution method.

Background

Symbolic execution is a software defect detection technique that uses symbolic values in place of concrete values and detects program errors by analyzing path constraints. It has become one of the effective techniques for finding bugs and security vulnerabilities in programs, and major software companies such as Microsoft use it for security testing and quality assurance. Symbolic execution generally tests a program by collecting an execution path and negating the constraints along that path, and aims to improve analysis efficiency and test coverage by computing logical expressions over the program rather than analyzing code manually. Although the path tree of symbolic execution can become very complex and lead to the path explosion problem, the ability to solve path constraints lets it reach paths that other detection techniques, such as fuzz testing, cannot, so it can effectively find errors that are otherwise hard to trigger. Symbolic execution has therefore become an important technique in practice for software error analysis and security vulnerability checking.

Many symbolic execution tools exist, such as angr, KLEE, and JPF.

angr is an automated binary analysis tool developed at the University of California, Santa Barbara. It implements currently popular symbolic execution techniques and supports both dynamic and static symbolic analysis of binary programs. angr was originally used to find backdoors in programs and is now used across the field of software analysis.

KLEE is a tool developed at Stanford University that uses symbolic execution to construct program test cases. While analyzing a program to construct test cases, KLEE also uses symbolic execution and constraint solving at key program points to analyze the value range of each symbol and check whether it lies within a safe range.

JPF (Java PathFinder) is NASA's open-source symbolic execution tool for Java bytecode programs. It provides complete symbolic execution functionality, including symbolization of input variables, generation of basic path constraints, and program path search.

These tools are the popular symbolic execution tools at present and are practical for program analysis, but they share the same disadvantage: when analyzing large programs they all suffer from path explosion and low analysis efficiency, which incurs a very high cost.

Segmented symbolic execution divides a program into several segments for analysis. Researchers have already studied it, and methods similar to the one proposed by the invention include: Xiao Q, Chen Y, Wu C, et al. pbSE: Phase-Based Symbolic Execution [C]// IEEE/IFIP International Conference on Dependable Systems and Networks. IEEE, 2017: 133; and Fang Wenqing. Segmented symbolic execution model and its environment interaction problem [D]. Beijing University of Posts and Telecommunications, 2010. However, these methods have several problems. First, they are mainly intended to solve the analysis of external processes rather than the path explosion problem in symbolic execution. Second, they mainly divide the program by function, which usually produces too many program segments; too many segments severely cut off the data relationships among segments, losing program execution state information and making the symbolic execution results inaccurate. Third, they all execute the program segments symbolically in sequence, so the execution of each segment depends on the preceding segments and their state data, and the efficiency of symbolic execution cannot be improved significantly.

Disclosure of Invention

In view of the above disadvantages, the invention uses a segmented symbolic execution method that divides the program into coarse-grained program segments and analyzes the program symbolically by executing each segment independently, so as to improve the analysis efficiency and accuracy of current symbolic execution tools on large-scale programs and of current segmented symbolic execution methods that analyze segments sequentially.

The invention is realized by the following technical scheme:

A segmented symbolic execution method divides a program into several larger program segments by a clustering method, then performs independent symbolic execution on each program segment, and finally merges the symbolic execution results of the segments to complete the analysis of the whole program.

Further, before the program is divided, a control flow graph is extracted from the program. The nodes of the control flow graph are the basic blocks of the program, and its directed edges are the jumps between basic blocks. The control flow graph is then divided into several control flow subgraphs by a clustering method.

Further, a weight is set for each node in the control flow graph. Each node represents a single basic block, and the number of instructions in the basic block is taken as the node weight, which represents the size of the basic block.

Further, the program control flow graph is divided into several larger control flow subgraphs by a clustering method, specifically as follows:

step one, selecting an edge in the control flow graph according to a clustering algorithm;

step two, deleting the edge selected in step one from the control flow graph;

step three, calculating the modularity of the control flow graph; if the modularity has improved, updating the control flow graph, and otherwise returning to step one;

and step four, once the division of the control flow graph is complete, taking the resulting subgraphs as the result of program segmentation.

Further, the independent symbolic execution of each program segment specifically proceeds as follows:

(1) determining the start node and the terminating node of each program segment;

(2) completing the missing jump information between basic blocks;

(3) traversing each program segment and selecting the corresponding analysis strategy;

(4) analyzing the segment according to the selected analysis strategy.

Further, result merging includes state data merging and constraint merging. Result merging is performed on two connected program segments; once the state data and constraints of all connected program segments have been merged, the symbolic execution result of the whole program is obtained.

The invention has the beneficial effects that:

To address the low efficiency of symbolic execution on large-scale programs and of existing segmented symbolic execution methods that analyze segments sequentially, the invention analyzes the program with segmented symbolic execution in which the program segments are analyzed independently. The program is divided into segments, each segment is symbolically executed independently, and the results are merged, which improves the efficiency of symbolic execution.

To address the inaccurate results of existing segmented symbolic execution, which produces too many segments and severely cuts off the data flow between program segments, the invention divides the program into coarse-grained segments with a clustering algorithm. It keeps the number of segments as small as possible while keeping each segment small enough that path explosion is unlikely, which mitigates the inaccuracy that segmentation introduces into program analysis.

Drawings

FIG. 1 is a flow chart of the segmented symbolic execution method according to the present invention.

Detailed Description

The invention provides a segmented symbolic execution method that addresses the low efficiency of conventional symbolic execution on large-scale programs and the inaccuracy of existing segmented symbolic execution. Unlike traditional segmented symbolic execution, which executes the program segments in sequence, the method analyzes each program segment independently. It divides a program into several larger segments by a clustering method, performs independent symbolic execution on each segment, and then merges the symbolic execution results of the segments to complete the analysis of the whole program. The invention is general with respect to symbolic execution tools: this embodiment is implemented on angr, and the method can also be applied to other symbolic execution tools such as KLEE and JPF.

As shown in FIG. 1, the input of the invention is a program. Control flow analysis generates a control flow graph, which is then divided by the clustering-based program segment division method to obtain the program segments. Next, independent symbolic execution is performed on each program segment, and the jump information lost through segmentation is completed. After symbolic execution of every program segment is finished, the results of the segments are merged, which includes merging the state data and merging the constraints, and yields the symbolic execution result. Control flow analysis, program segment division, symbolic execution of a single program segment, and result merging are described below.

1. Control flow analysis

Control flow analysis first extracts a control flow graph from the program. The nodes of the control flow graph are the basic blocks of the program, and the directed edges are the jumps between basic blocks. In this embodiment, angr is used to extract the control flow graph; in other implementations the control flow graph may be obtained with other tools.

The control flow graph obtained from angr is then modified by adding node weights. The weight of a node is the number of instructions in the corresponding basic block and indicates the size of that basic block. The resulting control flow graph is used for the subsequent program segment division.
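
The weighted control flow graph described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's angr-based code: the `BasicBlock` class, the `build_cfg` helper, and the three-block example program are all hypothetical, and the graph is kept as plain Python dictionaries and sets.

```python
from dataclasses import dataclass, field


@dataclass
class BasicBlock:
    """One CFG node: a straight-line run of instructions (hypothetical type)."""
    addr: int
    instructions: list
    successors: list = field(default_factory=list)  # addresses of jump targets

    @property
    def weight(self) -> int:
        # Node weight as defined in the text: the number of instructions.
        return len(self.instructions)


def build_cfg(blocks):
    """Return (nodes, edges): nodes maps address -> weight, edges is a set of jumps."""
    nodes = {b.addr: b.weight for b in blocks}
    edges = {(b.addr, s) for b in blocks for s in b.successors if s in nodes}
    return nodes, edges


# Hypothetical three-block program: a conditional branch followed by a join.
blocks = [
    BasicBlock(0x1000, ["push rbp", "mov rbp, rsp", "cmp edi, 0", "je 0x1020"],
               [0x1010, 0x1020]),
    BasicBlock(0x1010, ["add edi, 1", "jmp 0x1020"], [0x1020]),
    BasicBlock(0x1020, ["mov eax, edi", "pop rbp", "ret"]),
]
nodes, edges = build_cfg(blocks)
```

Here `nodes` carries the per-block weights and `edges` carries the directed jumps, which is the information the later division step needs.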

2. Program segment partitioning

Next, the program is divided into segments. Specifically, the control flow graph obtained in the previous step is divided with a clustering algorithm, in the following steps:

(1) Select an edge in the control flow graph according to the clustering algorithm.

(2) Delete the edge selected in step (1) from the control flow graph.

(3) Calculate the modularity of the control flow graph; if the modularity has improved, update the control flow graph, otherwise return to step (1).

(4) After the control flow graph has been divided, obtain the divided subgraphs.

The divided control flow subgraphs are the result of program segment division. Each subgraph corresponds to a program segment that can be symbolically executed independently.
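
The four division steps can be sketched as an edge-removal clustering loop. The text does not name the clustering algorithm, so this sketch assumes a Girvan-Newman style procedure: the edge with the highest shortest-path betweenness is selected and deleted, and the partition with the best Newman modularity seen so far is kept. The function names, the undirected treatment of jumps, and the six-node example graph are assumptions made for illustration; node weights are not used here.

```python
from collections import deque


def adjacency(nodes, edges):
    adj = {n: set() for n in nodes}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    return adj


def components(nodes, edges):
    """Connected components of the undirected graph, found by BFS."""
    adj, seen, parts = adjacency(nodes, edges), set(), []
    for start in nodes:
        if start in seen:
            continue
        part, queue = set(), deque([start])
        seen.add(start)
        while queue:
            cur = queue.popleft()
            part.add(cur)
            for nxt in adj[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        parts.append(frozenset(part))
    return parts


def modularity(parts, edges):
    """Newman modularity of a partition, measured on the original edge set."""
    m = len(edges)
    if m == 0:
        return 0.0
    q = 0.0
    for part in parts:
        inside = sum(1 for e in edges if e <= part)
        degree = sum(len(e & part) for e in edges)
        q += inside / m - (degree / (2 * m)) ** 2
    return q


def edge_betweenness(nodes, edges):
    """Brandes-style accumulation of shortest paths crossing each edge."""
    adj = adjacency(nodes, edges)
    score = {e: 0.0 for e in edges}
    for source in nodes:
        dist, paths, preds, order = {source: 0}, {source: 1.0}, {}, []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], paths[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        credit = dict.fromkeys(order, 0.0)
        for w in reversed(order):
            for v in preds.get(w, []):
                share = paths[v] / paths[w] * (1.0 + credit[w])
                score[frozenset((v, w))] += share
                credit[v] += share
    return score


def partition(nodes, directed_edges):
    # Jumps are treated as undirected and self-loops are dropped (assumption).
    original = {frozenset(e) for e in directed_edges if e[0] != e[1]}
    remaining = set(original)
    best_parts = components(nodes, remaining)
    best_q = modularity(best_parts, original)
    while remaining:
        scores = edge_betweenness(nodes, remaining)
        # Step (1): select an edge; ties are broken deterministically.
        victim = max(remaining, key=lambda e: (scores[e], sorted(e)))
        remaining.discard(victim)              # step (2): delete the edge
        parts = components(nodes, remaining)
        q = modularity(parts, original)        # step (3): compute modularity
        if q > best_q:                         # keep the division only if it improved
            best_parts, best_q = parts, q
    return best_parts, best_q                  # step (4): the divided subgraphs


# Hypothetical CFG: two tightly connected groups of blocks joined by one jump.
demo_nodes = [1, 2, 3, 4, 5, 6]
demo_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 6), (6, 4)]
segments, score = partition(demo_nodes, demo_edges)
```

On this example the single connecting jump is removed first, leaving the two groups of three blocks as the program segments.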

3. Single program segment symbolic execution

Symbolic execution is performed independently on each single program segment, in four steps.

(1) determining the start node and the terminating node of each program segment;

(2) completing the missing jump information between basic blocks;

(3) traversing each program segment and selecting the corresponding analysis strategy;

(4) analyzing the segment according to the selected analysis strategy.

In step (2), the division of the program's basic blocks into segments does not take into account direct and indirect addressing in situations such as calls and returns, so the target address may not be found when the program returns. In that case, the corresponding jump information must be completed from the original control flow graph.
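
Steps (1) and (2) can be sketched on plain sets of edges. This is an illustrative sketch only: the text does not define the boundary rule precisely, so the sketch assumes that a start node is one entered from outside the segment (or with no predecessor) and a terminating node is one that leaves the segment (or has no successor). The helper names and the example graph are hypothetical.

```python
def segment_boundaries(segment, cfg_edges):
    """Return (start_nodes, end_nodes) of one segment of the original CFG."""
    preds = {n: set() for n in segment}
    succs = {n: set() for n in segment}
    for src, dst in cfg_edges:
        if dst in segment:
            preds[dst].add(src)
        if src in segment:
            succs[src].add(dst)
    # Entered from outside the segment, or never entered at all.
    starts = {n for n in segment if not preds[n] or preds[n] - segment}
    # Leaves the segment, or has no successor at all.
    ends = {n for n in segment if not succs[n] or succs[n] - segment}
    return starts, ends


def complete_jumps(segment, cfg_edges, resolved_edges):
    """Add every in-segment jump of the original CFG that the segment view lacks."""
    expected = {(s, d) for s, d in cfg_edges if s in segment and d in segment}
    missing = expected - set(resolved_edges)
    return set(resolved_edges) | missing, missing


# Hypothetical CFG and segment; block 2 -> 4 stands for an unresolved return jump.
cfg_edges = {(1, 2), (2, 3), (3, 4), (2, 4), (4, 5)}
segment = {2, 3, 4}
starts, ends = segment_boundaries(segment, cfg_edges)
completed, missing = complete_jumps(segment, cfg_edges, {(2, 3), (3, 4)})
```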

In step (3), an analysis strategy is selected according to the type of the program segment. In this embodiment, a common exploration strategy is selected for a sequential program segment, and a mixed dynamic and static execution strategy is selected for a loop program segment.
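
One way to decide the segment type in step (3) is to check whether the jumps inside the segment form a cycle. The sketch below assumes exactly that criterion, which the text does not spell out; `has_loop`, `select_strategy`, and the strategy labels are hypothetical names.

```python
def has_loop(segment, cfg_edges):
    """Detect a cycle among the jumps that stay inside the segment (iterative DFS)."""
    succs = {n: [] for n in segment}
    for src, dst in cfg_edges:
        if src in segment and dst in segment:
            succs[src].append(dst)
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the DFS stack, finished
    colour = dict.fromkeys(segment, WHITE)
    for root in segment:
        if colour[root] != WHITE:
            continue
        colour[root] = GREY
        stack = [(root, iter(succs[root]))]
        while stack:
            node, pending = stack[-1]
            nxt = next(pending, None)
            if nxt is None:               # all successors handled
                colour[node] = BLACK
                stack.pop()
            elif colour[nxt] == GREY:     # back edge: the segment contains a loop
                return True
            elif colour[nxt] == WHITE:
                colour[nxt] = GREY
                stack.append((nxt, iter(succs[nxt])))
    return False


def select_strategy(segment, cfg_edges):
    """Map the segment type to an analysis strategy label (labels are illustrative)."""
    if has_loop(segment, cfg_edges):
        return "mixed dynamic/static execution"
    return "common exploration"


sequential = select_strategy({1, 2, 3}, {(1, 2), (2, 3)})
looping = select_strategy({1, 2, 3}, {(1, 2), (2, 3), (3, 1)})
```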

In step (4), an existing symbolic execution tool performs symbolic execution from the program segment entry, from each start node to the terminating node, and the state reached at the terminating node is used as the input to the subsequent result merging. This embodiment uses angr for symbolic execution; other symbolic execution tools may replace angr in a specific implementation.

4. Result merging

Result merging is performed on two program segments that are connected in the original control flow graph. Merging the results of all connected program segments yields the symbolic execution result of the whole program.

For two program segments connected by a directed edge, the segment containing the start node of the directed edge is called the upstream program segment, and the segment containing the end node of the directed edge is called the downstream program segment.

Result merging mainly comprises two parts: state data merging and constraint merging.

In this embodiment, state data merging comprises two parts: register merging and memory merging.

Register merging obtains a register list from the program's architecture information and then merges the register information of the two program segments register by register.

The main idea of register merging is to use the state data of the upstream program segment as the input of the downstream program segment. Specifically, when the values of a register in the two states are merged, four cases arise:

(1) the value in the downstream state is a concrete value rather than a symbolic value, and no replacement is needed;

(2) the value in the downstream state is a symbolic value and the corresponding value in the upstream state is a concrete value, so the concrete value is substituted for the symbolic variable;

(3) the values in both the upstream and downstream states are symbolic values, so the upstream symbolic expression is substituted into the downstream symbolic expression;

(4) the register value in the downstream state is uninitialized, so it is set directly to the value of the corresponding register in the upstream state.
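
The four cases can be illustrated with a toy value model. This is a sketch under stated assumptions, not the embodiment's angr state handling: a concrete value is a Python `int`, an uninitialized register is `None`, a symbolic variable is the hypothetical `Sym` class (the register's value at segment entry), and a symbolic expression is a tuple such as `("add", lhs, rhs)`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sym:
    """Symbolic variable standing for a register's value at segment entry."""
    name: str


def substitute(expr, upstream):
    """Replace entry symbols in expr with the upstream segment's final values."""
    if isinstance(expr, Sym):
        value = upstream.get(expr.name)
        return expr if value is None else value
    if isinstance(expr, tuple):                    # ("add", lhs, rhs)
        op, lhs, rhs = expr
        lhs, rhs = substitute(lhs, upstream), substitute(rhs, upstream)
        if op == "add" and isinstance(lhs, int) and isinstance(rhs, int):
            return lhs + rhs                       # fold once everything is concrete
        return (op, lhs, rhs)
    return expr                                    # already a concrete int


def merge_registers(upstream, downstream, register_list):
    merged = {}
    for reg in register_list:
        down = downstream.get(reg)
        if down is None:                           # case (4): uninitialized downstream
            merged[reg] = upstream.get(reg)
        elif isinstance(down, int):                # case (1): concrete, keep as is
            merged[reg] = down
        else:                                      # cases (2) and (3): substitute
            merged[reg] = substitute(down, upstream)
    return merged


# Hypothetical final states of an upstream and a downstream segment.
upstream = {"rax": 5, "rbx": ("add", Sym("rdi"), 1), "rcx": 7}
downstream = {"rax": ("add", Sym("rax"), 2), "rbx": Sym("rbx"), "rdx": 9}
merged = merge_registers(upstream, downstream, ["rax", "rbx", "rcx", "rdx"])
```

In the example, `rax` exercises case (2), `rbx` case (3), `rcx` case (4), and `rdx` case (1).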

Memory merging is similar to register merging but differs in one respect. Because memory is a continuous range of addresses and the length of each read or write is not fixed, the memory data must be obtained by inserting analysis breakpoints, and the address and length of every memory write must be recorded during symbolic execution.
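
The recording of write addresses and lengths can be sketched with a small write log. The `MemoryLog` class and `merge_memory` function are hypothetical and handle only concrete bytes; the `store` method stands in for the breakpoint callback that a symbolic execution tool would invoke on each memory write.

```python
class MemoryLog:
    """Byte-addressed memory that records every write, as a write breakpoint would."""

    def __init__(self):
        self.data = {}      # address -> byte value
        self.writes = []    # (address, length) in execution order

    def store(self, addr, payload: bytes):
        # Record where and how much was written, then apply the write.
        self.writes.append((addr, len(payload)))
        for offset, byte in enumerate(payload):
            self.data[addr + offset] = byte


def merge_memory(upstream: MemoryLog, downstream: MemoryLog) -> dict:
    """Start from upstream memory and replay only the recorded downstream writes."""
    merged = dict(upstream.data)
    for addr, length in downstream.writes:
        for a in range(addr, addr + length):
            merged[a] = downstream.data[a]
    return merged


up, down = MemoryLog(), MemoryLog()
up.store(0x100, b"\x01\x02\x03\x04")   # upstream writes four bytes
down.store(0x102, b"\xff")             # downstream overwrites one of them
merged_memory = merge_memory(up, down)
```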

The constraints in constraint merging are the conditions that must hold for the program to execute from the program segment entry to the program segment exit. Constraint merging comprises two steps: the first replaces the symbolic values in the constraints of the downstream state, in a manner similar to register merging; the second copies the constraints of the upstream state into the downstream state.
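
The two steps of constraint merging can be sketched with constraints written as nested tuples. The representation is an assumption made for illustration: a string names a symbolic variable, an `int` is a concrete value, and a tuple such as `("lt", lhs, rhs)` is an operator applied to two operands. The function names are hypothetical.

```python
def substitute(expr, bindings):
    """Replace symbol names in a nested-tuple expression with bound values."""
    if isinstance(expr, str):
        return bindings.get(expr, expr)
    if isinstance(expr, tuple):
        op, *args = expr                 # the operator name itself is never replaced
        return (op, *(substitute(a, bindings) for a in args))
    return expr                          # concrete value


def merge_constraints(upstream_state, upstream_constraints, downstream_constraints):
    # Step 1: rewrite the downstream constraints in terms of the upstream results.
    rewritten = [substitute(c, upstream_state) for c in downstream_constraints]
    # Step 2: copy the upstream constraints into the merged (downstream) state.
    return list(upstream_constraints) + rewritten


# Hypothetical example: upstream leaves rax = rdi + 1 under the condition rdi > 0,
# and downstream requires that its entry value of rax is below 10.
upstream_state = {"rax": ("add", "rdi", 1)}
upstream_constraints = [("gt", "rdi", 0)]
downstream_constraints = [("lt", "rax", 10)]
path_condition = merge_constraints(
    upstream_state, upstream_constraints, downstream_constraints
)
```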

After all merging is complete, the symbolic execution result of the complete program is obtained, including the state in which execution reaches the end node and the corresponding path condition. Finally, the program state is used to determine whether the state triggers a vulnerability, the path constraints are solved to determine whether the path is feasible (that is, whether the solution set is non-empty), and a test case is generated when the solution set is non-empty.
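
The feasibility check and test case generation can be illustrated with a brute-force search that stands in for a real constraint solver. This is only a sketch: a practical implementation would hand the path condition to an SMT solver, and the nested-tuple constraint format, the operator table, and the 0 to 255 search domain are assumptions made for the example.

```python
import operator
from itertools import product

OPS = {"add": operator.add, "sub": operator.sub,
       "lt": operator.lt, "gt": operator.gt, "eq": operator.eq}


def evaluate(expr, env):
    """Evaluate a nested-tuple expression under a concrete variable assignment."""
    if isinstance(expr, str):
        return env[expr]
    if isinstance(expr, tuple):
        op, lhs, rhs = expr
        return OPS[op](evaluate(lhs, env), evaluate(rhs, env))
    return expr


def symbols(expr):
    """Collect the variable names that occur in an expression."""
    if isinstance(expr, str):
        return {expr}
    if isinstance(expr, tuple):
        return set().union(*(symbols(arg) for arg in expr[1:]))
    return set()


def solve(constraints, domain=range(256)):
    """Return one satisfying assignment (a test case), or None if none exists."""
    names = sorted(set().union(*(symbols(c) for c in constraints)))
    for values in product(domain, repeat=len(names)):
        env = dict(zip(names, values))
        if all(evaluate(c, env) for c in constraints):
            return env
    return None                          # empty solution set: the path is infeasible


test_case = solve([("gt", "rdi", 0), ("lt", ("add", "rdi", 1), 10)])
infeasible = solve([("gt", "rdi", 5), ("lt", "rdi", 3)])
```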

Claims

1. A segmented symbolic execution method, characterized in that a control flow graph is extracted from a program, the nodes of the control flow graph being the basic blocks of the program and the directed edges being the jumps between basic blocks; the program control flow graph is divided into several control flow subgraphs by a clustering method; independent symbolic execution is then performed on each program segment; and finally the symbolic execution results of the program segments are merged to complete the analysis of the whole program; wherein the program control flow graph is divided into several control flow subgraphs by the clustering method specifically as follows:

step one, selecting an edge in the control flow graph according to a clustering algorithm;

step two, deleting the edge selected in step one from the control flow graph;

step three, calculating the modularity of the control flow graph; if the modularity has improved, updating the control flow graph, and otherwise returning to step one;

and step four, once the division of the control flow graph is complete, taking the resulting subgraphs as the result of program segmentation.

2. The segmented symbolic execution method according to claim 1, wherein a weight is set for each node in the control flow graph, each node represents a single basic block, and the number of instructions in each basic block is taken as the weight of the node, which represents the size of the basic block.

3. The segmented symbolic execution method according to claim 1 or 2, wherein the independent symbolic execution of each program segment specifically proceeds as follows:

(1) determining the start node and the terminating node of each program segment;

(2) completing the missing jump information between basic blocks;

(3) traversing each program segment and selecting the corresponding analysis strategy;

(4) analyzing the segment according to the selected analysis strategy.

4. The segmented symbolic execution method according to claim 1 or 2, wherein the result merging includes state data merging and constraint merging, and the symbolic execution result of the whole program is obtained after the state data and constraints of all connected program segments have been merged.