CN103440122A - Novel static function identification method using reverse extension control flow graphs - Google Patents

Publication number
CN103440122A
Authority
CN
China
Prior art keywords
control flow
reverse
graph
flow graph
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013102919410A
Other languages
Chinese (zh)
Other versions
CN103440122B (en)
Inventor
邱景
苏小红
马培军
赵玲玲
王甜甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Nenchuang Digital Technology Co ltd
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN201310291941.0A
Publication of CN103440122A
Application granted
Publication of CN103440122B
Legal status: Active
Anticipated expiration

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a novel static function identification method using reverse extended control flow graphs, which belongs to the field of software reverse engineering. The method comprises the following steps: 1, building the set of reverse extended control flow graphs for a region; 2, denoising the reverse extended control flow graphs by deleting nodes that were found during construction but could not have been generated by a compiler; 3, deleting and merging reverse extended control flow graphs; 4, identifying the function entry in each reverse extended control flow graph; 5, obtaining the identification results for the multiple functions in the specified region. Unlike conventional methods, the method takes function return instructions as the identification feature and uses function return instruction nodes as reverse search starting points to construct the reverse extended control flow graphs. It can therefore identify multiple functions in a specified binary code region, including functions that have no distinctive header bytes and no cross-references and that conventional static identification methods cannot identify.

Description

Novel static function identification method using reverse expansion control flow graph
Technical Field
The invention belongs to the field of software reverse engineering, and relates to a static function identification method using a reverse expansion control flow graph.
Background
Binary code review is a security audit performed on binary code. Software of any significant size inevitably uses third-party components, and third-party components, such as the Microsoft system dynamic link libraries, often come without source code. To review such software, reverse engineering is almost the only option. In addition, malicious code is currently rampant and seriously threatens the security of computer systems, so detecting it is particularly important. Because the source code of most malicious code cannot be obtained, reverse engineering is again almost the only means of analysis.
Reverse engineering comprises disassembly and the identification of functions and high-level language elements such as library functions, variables, and structures. Finally, the operational semantics of each function are identified, and analyzing the cross-references among functions further improves the understanding of the semantics of the whole program. These steps show that function identification is a crucial link in the whole reverse engineering process. Conventional static identification methods identify functions using the characteristic bytes at the start of a function and the cross-reference information between functions. However, binary code often contains a large number of functions with neither distinctive features nor cross-references, and conventional static identification methods cannot identify such functions effectively.
Disclosure of Invention
In order to solve the problem of identifying multiple functions without significant features and cross-references in a specified binary code region, the invention provides a novel static function identification method using a reverse-extended control flow graph.
The basic idea of the technical scheme adopted by the invention to solve this problem is as follows. For a specified binary code region, every address that matches the characteristics of a function return instruction (generally a RET instruction) is assumed to be a function return address, and a corresponding Reverse Extended Control Flow Graph (RECFG) is then constructed from each such address from bottom to top. A RECFG is a control flow graph in which nodes represent instructions and edges represent control dependencies between instructions. Unlike a traditional control flow graph, it is constructed in reverse from a function return instruction node, and the predecessors of a node are all instructions that may exist immediately before it. Because the reverse search starts from a function return instruction, the RECFG contains the control flow graph of the function to be identified; that is, the traditional control flow graph is a subgraph of the RECFG.
Any two RECFGs stand in exactly one of three relationships: 1) independent, i.e., they are not connected to each other; 2) both are subgraphs of the same graph, i.e., they belong to the same function; 3) in conflict, i.e., the search starting point (function return instruction) of one graph is part of the operand of an instruction in the other graph. Two graphs in relationship 2) can be merged. For two graphs in conflict, one graph is deleted using the multi-attribute decision ideal point method to resolve the conflict. In the end, each remaining independent RECFG corresponds to one function.
A function has exactly one entry point, which may be any node in its RECFG. Traversing the RECFG from a node yields a subgraph, which is the control flow graph of the function that has that node as its entry point. The function identification problem is thus converted into the problem of identifying the entry point of the function. Finally, the entry point is identified with the multi-attribute decision ideal point method, based on the attributes of the control flow graph corresponding to each node.
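The first part of this idea, treating every address that looks like a return instruction as a candidate function return address, can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes x86 machine code, where a near RET is encoded as the byte C3 and RET imm16 as C2 followed by a two-byte immediate, and the function name `find_return_candidates` is hypothetical.

```python
# Assumption: x86 encodings. C3 = near RET (1 byte), C2 iw = RET imm16 (3 bytes).
RET_OPCODES = {0xC3: 1, 0xC2: 3}  # opcode byte -> instruction length

def find_return_candidates(code, base=0):
    """Return every address in the region whose bytes look like a return instruction."""
    candidates = []
    for offset, byte in enumerate(code):
        length = RET_OPCODES.get(byte)
        # Keep the candidate only if the whole instruction fits inside the region.
        if length is not None and offset + length <= len(code):
            candidates.append(base + offset)
    return candidates

# push ebp; mov eax, 0xC3; ret
region = bytes([0x55, 0xB8, 0xC3, 0x00, 0x00, 0x00, 0xC3])
candidates = find_return_candidates(region, base=0x1000)
print([hex(a) for a in candidates])
```

The scan reports both 0x1002 and 0x1006. Only 0x1006 is a real return; the C3 byte at 0x1002 is part of the immediate operand of the MOV instruction. This is exactly the kind of candidate that later ends up in a conflict relationship (relationship 3) and has to be eliminated.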
The invention discloses a static function identification method using a reverse expansion control flow graph, which comprises the following steps:
Step 1: establishing a set of regional reverse expansion control flow graphs;
Step 2: denoising the reverse expansion control flow graph, and deleting nodes which are searched in the RECFG construction process and can not be generated by a compiler;
Step 3: deleting and merging the reverse expansion control flow graph;
Step 4: identifying a function entry in a reverse-expansion control flow graph;
Step 5: obtaining the recognition results of the plurality of functions in the designated area.
Different from the traditional method, the method takes the return instruction of the function as the identification characteristic, takes the node of the return instruction of the function as the reverse search starting point to construct the reverse extension control flow graph, can identify a plurality of functions in the specified binary code area, and can effectively identify the functions which can not be identified by the traditional static identification method and have no specific head byte characteristic and no cross reference.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a reverse expansion control flow graph construction algorithm;
FIG. 3 is a reverse search algorithm in a reverse extended control flow graph construction algorithm;
FIG. 4 is a schematic diagram of a reverse expansion control flow graph;
FIG. 5 is the recognition result for FIG. 4.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but the present invention is not limited thereto, and modifications or equivalent substitutions may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.
Embodiment 1: In the static function identification method using the reverse extended control flow graph of this embodiment, first, for a specified binary code region, a corresponding reverse extended control flow graph is constructed from every address that matches the characteristics of a function return instruction. Three attributes of each graph are then computed (the sum of instruction lengths, the cyclomatic complexity, and the percentage of full-width instructions, i.e., instructions that operate on the maximum number of bits the CPU can process at once), and the multi-attribute decision ideal point method is used to resolve possible conflicts between graphs. Finally, function identification is converted into the problem of identifying the function entry point: the same three attributes are computed for the subgraph obtained by traversing from each node without predecessors in each graph, and the ideal point method decides the function entry point, giving the final function identification result. The specific steps are as follows:
Step 1: establishing the set of reverse extended control flow graphs for the region:
For a specified code region, a corresponding Reverse Extended Control Flow Graph (RECFG) is constructed from bottom to top from every address that matches the characteristics of a function return instruction; together these graphs form the RECFG set of the region.
As shown in FIG. 2, the construction of a RECFG is divided into two major steps. First, the initial graph is constructed in reverse: the search starting point is added to the graph, and all possible predecessors of each newly added point are searched for repeatedly. FIG. 3 shows the algorithm that searches for all possible predecessors of a point v in the RECFG. Specifically, it checks instructions of every possible length to see whether each is a predecessor of v: if the bytes of an instruction match the bytes of the same length immediately before (at lower addresses than) the instruction address of v, then that instruction is a predecessor of v. Second, for the jump targets of the branch instructions in the graph, the control flow graphs starting from those targets are added to the current RECFG using the conventional control-flow-based recursive traversal method. FIG. 4 shows a schematic diagram of a RECFG.
The RECFG is an intermediate representation for function identification proposed by the invention. It is a control flow graph in which nodes represent instructions and edges represent control dependencies between instructions; unlike a traditional control flow graph, however, the predecessors of a node in a RECFG are all instructions that may exist immediately before it.
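The reverse search of the first construction step can be sketched as follows. The sketch is an illustration under stated assumptions, not the algorithm of FIG. 2 or FIG. 3: it uses a hypothetical toy instruction set in which the opcode byte alone determines the instruction length (`TOY_LENGTHS`), where a real tool would call a full disassembler, and it omits the second construction step (adding the control flow graphs of jump targets). All function names are hypothetical.

```python
from collections import deque

# Hypothetical toy instruction set: opcode byte -> total instruction length.
TOY_LENGTHS = {0x90: 1, 0x55: 1, 0xC3: 1, 0xEB: 2, 0x74: 2, 0xB8: 5}
MAX_LEN = max(TOY_LENGTHS.values())

def decode_length(code, addr):
    """Length of the instruction starting at addr, or None if nothing decodes there."""
    if not 0 <= addr < len(code):
        return None
    length = TOY_LENGTHS.get(code[addr])
    if length is None or addr + length > len(code):
        return None
    return length

def possible_predecessors(code, v):
    """All addresses p such that an instruction starting at p ends exactly where v begins."""
    preds = []
    for length in range(1, MAX_LEN + 1):   # try every possible instruction length
        p = v - length
        if decode_length(code, p) == length:
            preds.append(p)
    return preds

def build_initial_recfg(code, ret_addr):
    """Reverse search from a return instruction; returns {node: set of predecessors}."""
    graph = {ret_addr: set()}
    work = deque([ret_addr])
    while work:
        v = work.popleft()
        for p in possible_predecessors(code, v):
            graph[v].add(p)
            if p not in graph:             # newly added point: search its predecessors too
                graph[p] = set()
                work.append(p)
    return graph

# B8 90 90 90 90 | C3: either one 5-byte MOV or a run of 1-byte NOPs precedes the RET.
code = bytes([0xB8, 0x90, 0x90, 0x90, 0x90, 0xC3])
recfg = build_initial_recfg(code, 5)
print(recfg)
```

In this example the return instruction at address 5 has two possible predecessors: the NOP at address 4 and the five-byte MOV at address 0. A conventional forward disassembly would commit to one of these readings; the RECFG keeps both until the denoising and decision steps choose between them.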
Step 2: denoising the reverse extended control flow graph by deleting nodes that were found during RECFG construction but could not have been generated by a compiler:
Step 21: deleting illegal conditional branch instructions, including conditional branch instructions without predecessor nodes, conditional branch instructions whose jump target is an illegal memory address, and conditional branch instructions whose two branches contain instructions with overlapping bytes;
Step 22: deleting paths that do not end in a function return instruction, a jump instruction, or a CALL instruction;
Step 23: checking each path, and deleting nodes whose paired instructions are unmatched (such as a PUSHAD with no POPAD on the path), together with all their predecessors;
Step 24: deleting privileged instructions and all their predecessors, including interrupt-related instructions, halt instructions, and instructions that operate on special CPU registers;
Step 25: deleting instructions with no practical effect, including NOP, breakpoint instructions, additions of 0, subtractions of 0, logical OR with 0, shift instructions with a shift count of 0, and move instructions whose source and destination operands are identical;
Step 26: taking the subgraph that contains the reverse search starting point as the new reverse extended control flow graph;
Step 27: repeating steps 21 to 26 until the graph no longer changes.
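The iterate-until-stable structure of the denoising step can be sketched as follows. The sketch covers only simplified versions of steps 24, 25, 26, and 27. The mnemonic sets are assumed x86 examples, the graph is represented as a mapping from each node to its set of predecessors, and step 26 is approximated by keeping the nodes that can still be reached backward from the reverse search starting point. None of these choices comes from the patent text.

```python
PRIVILEGED = {"hlt", "cli", "sti"}   # assumed examples for step 24
MEANINGLESS = {"nop", "int3"}        # assumed examples for step 25

def all_predecessors(preds, node):
    """The node itself plus every node reachable from it along predecessor edges."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(preds.get(n, ()))
    return seen

def denoise(preds, mnemonic, start):
    """Apply the deletion rules repeatedly until the graph stops changing (step 27)."""
    preds = {n: set(p) for n, p in preds.items()}
    while True:
        doomed = set()
        for n in preds:
            if mnemonic[n] in PRIVILEGED:
                doomed |= all_predecessors(preds, n)   # step 24: node and all its predecessors
            elif mnemonic[n] in MEANINGLESS:
                doomed.add(n)                          # step 25: the node itself
        doomed.discard(start)
        survivors = {n: p - doomed for n, p in preds.items() if n not in doomed}
        keep = all_predecessors(survivors, start)      # step 26: part still attached to start
        new = {n: survivors[n] & keep for n in keep}
        if new == preds:
            return new
        preds = new

preds = {10: {9, 8}, 9: {7}, 8: {6}, 7: set(), 6: {5}, 5: set()}
mnemonic = {10: "ret", 9: "nop", 8: "mov", 7: "push", 6: "hlt", 5: "push"}
cleaned = denoise(preds, mnemonic, start=10)
print(cleaned)
```

Here the NOP at node 9 is deleted, which detaches node 7 from the starting point, and the HLT at node 6 is deleted together with its predecessor 5. Only the path through node 8 survives, and a second pass confirms that nothing further changes.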
Step 3: deleting and merging reverse extended control flow graphs:
Step 31: enumerating every pair of RECFGs in the region's RECFG set; if the reverse search starting point (function return instruction) of one graph lies in the other graph, deleting the graph with fewer nodes from the set;
Step 32: enumerating every pair of RECFGs in the region's RECFG set; if the reverse search starting point (function return instruction) of one graph is part of the instruction at some point in the other graph, deleting the worse of the two graphs, as decided by the multi-attribute decision ideal point method from each graph's sum of instruction lengths, cyclomatic complexity, and percentage of full-width instructions.
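The pairwise check behind steps 31 and 32 can be sketched as follows. The representation is an assumption made for illustration: each graph is reduced to a mapping from instruction address to instruction length, and "part of an instruction" is approximated as "falls strictly inside the byte range of an instruction". The function name `classify` and the returned labels are hypothetical.

```python
def classify(graph_a, start_a, graph_b, start_b):
    """Relationship between two RECFGs.

    graph_*: {instruction address: instruction length in bytes}
    start_*: address of the return instruction the graph was built from
    """
    def inside_instruction(start, graph):
        # True if `start` falls strictly inside the bytes of some instruction.
        return any(addr < start < addr + length for addr, length in graph.items())

    if start_a in graph_b or start_b in graph_a:
        return "same function"   # step 31: one start is a node of the other graph
    if inside_instruction(start_a, graph_b) or inside_instruction(start_b, graph_a):
        return "conflict"        # step 32: one start is part of another instruction
    return "independent"

# Graph B reads the bytes as push / mov eax, imm32 / ret;
# graph A was grown from a C3 byte inside that MOV's immediate.
graph_a = {0x1002: 1}
graph_b = {0x1000: 1, 0x1001: 5, 0x1006: 1}
relation = classify(graph_a, 0x1002, graph_b, 0x1006)
print(relation)
```

For a "same function" pair, step 31 then keeps the graph with more nodes; for a "conflict" pair, step 32 hands both graphs to the ideal point method.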
Step 4: identifying the function entry in each reverse extended control flow graph:
Step 41: traversing from each point without predecessors in the graph to obtain the control flow graph that has that point as its entry point;
Step 42: using the multi-attribute decision ideal point method to select the best control flow graph as the function identification result (as shown in FIG. 5), based on each graph's sum of instruction lengths, cyclomatic complexity, and percentage of full-width instructions.
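Step 41 and the attribute calculation that feeds step 42 can be sketched as follows. The sketch assumes the graph is given as a mapping from each node to its set of successors and computes two of the three attributes: the sum of instruction lengths and the cyclomatic complexity, the latter with McCabe's formula E - N + 2 for a single connected graph. The percentage of full-width instructions is omitted because it needs operand-size information from a real disassembler. Function names are hypothetical.

```python
def reachable(succs, entry):
    """All nodes reachable from `entry` by following successor edges."""
    seen, stack = set(), [entry]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(succs.get(n, ()))
    return seen

def candidate_cfgs(succs, length):
    """One (entry, total instruction length, cyclomatic complexity) tuple per candidate entry."""
    has_pred = {t for targets in succs.values() for t in targets}
    results = []
    for entry in sorted(n for n in succs if n not in has_pred):
        nodes = reachable(succs, entry)
        edges = sum(len(succs[n]) for n in nodes)
        total_len = sum(length[n] for n in nodes)
        results.append((entry, total_len, edges - len(nodes) + 2))
    return results

# Node 1 is a conditional branch to 2 or 3, which rejoin at the return node 4.
# Node 5 is a second node without predecessors that falls straight into 4.
succs = {0: {1}, 1: {2, 3}, 2: {4}, 3: {4}, 4: set(), 5: {4}}
length = {0: 1, 1: 2, 2: 5, 3: 5, 4: 1, 5: 1}
candidates = candidate_cfgs(succs, length)
print(candidates)
```

Each tuple describes one alternative for the decision in step 42: entry 0 yields a 14-byte graph with cyclomatic complexity 2, entry 5 a 2-byte graph with complexity 1.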
Step 5: obtaining the recognition results of the plurality of functions in the designated area.
Embodiment 2: This embodiment differs from Embodiment 1 in that it specifies the multi-attribute decision ideal point method used in step 32 and step 42, which comprises the following steps:
(1) Calculating the decision matrix:
The evaluation indexes are the sum of instruction lengths, the cyclomatic complexity, and the percentage of full-width instructions of the graph. A full-width instruction is an instruction that operates on the maximum number of bits the CPU can process at one time. The weights $w_j$ ($j = 1, 2, 3$) of these three evaluation indexes are 0.5, 0.38, and 0.12, respectively.
Suppose there are $m$ alternatives. With the 3 evaluation indexes, the decision matrix is:
$$D = \begin{pmatrix} x_{11} & x_{12} & x_{13} \\ \vdots & \vdots & \vdots \\ x_{m1} & x_{m2} & x_{m3} \end{pmatrix}$$
The element $x_{ij}$ of $D$ is the value of the $j$-th evaluation index of the $i$-th alternative.
(2) Calculating the normalized decision matrix $R = (r_{ij})_{m \times 3}$, here written with the vector normalization customary for the ideal point method:
$$r_{ij} = \frac{x_{ij}}{\sqrt{\sum_{k=1}^{m} x_{kj}^{2}}}$$
(3) Calculating the weighted decision matrix $V = (v_{ij})_{m \times 3}$:
$$v_{ij} = w_j \, r_{ij}$$
where $w_j$ is the weight of the $j$-th evaluation index.
(4) Determining the positive ideal solution and the negative ideal solution from the weighted decision matrix:
The positive ideal solution is
$$A^{*} = (v_1^{*}, v_2^{*}, v_3^{*})$$
The negative ideal solution is
$$A^{-} = (v_1^{-}, v_2^{-}, v_3^{-})$$
where $v_j^{*}$ and $v_j^{-}$ are, respectively, the best and the worst value of the $j$-th weighted index over all alternatives.
(5) Calculating the Euclidean distance from each alternative to the positive and negative ideal solutions:
The distance from alternative $i$ to the positive ideal solution $A^{*}$ is
$$S_i^{*} = \sqrt{\sum_{j=1}^{3} \left(v_{ij} - v_j^{*}\right)^{2}}$$
The distance from alternative $i$ to the negative ideal solution $A^{-}$ is
$$S_i^{-} = \sqrt{\sum_{j=1}^{3} \left(v_{ij} - v_j^{-}\right)^{2}}$$
(6) Calculating the relative closeness $C^{*}$ of each alternative:
The relative closeness of alternative $i$ is
$$C_i^{*} = \frac{S_i^{-}}{S_i^{*} + S_i^{-}}$$
(7) Ordering the alternatives by relative closeness.
In the ranking, the larger the closeness $C^{*}$, the better the alternative; the alternative with the largest value is the optimal one.
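The seven steps above can be sketched in a few lines of Python. The weights are those given in the text; everything else is an assumption made for illustration: the sketch uses vector normalization, treats all three indexes as "larger is better" so that the positive ideal solution is the column-wise maximum, and the function name `ideal_point_closeness` and the sample values are hypothetical.

```python
import math

WEIGHTS = (0.5, 0.38, 0.12)  # weights of the three evaluation indexes, from the text

def ideal_point_closeness(matrix, weights=WEIGHTS):
    """Relative closeness C* of each alternative (one row of `matrix` per alternative)."""
    # Steps (2) and (3): vector-normalize each column, then apply the weights.
    norms = [math.sqrt(sum(x * x for x in col)) or 1.0 for col in zip(*matrix)]
    v = [[w * x / n for x, w, n in zip(row, weights, norms)] for row in matrix]
    # Step (4): assumption, every index is a benefit index (larger is better).
    best = [max(col) for col in zip(*v)]
    worst = [min(col) for col in zip(*v)]
    closeness = []
    for row in v:
        d_best = math.dist(row, best)     # step (5): distance to positive ideal solution
        d_worst = math.dist(row, worst)   #           distance to negative ideal solution
        total = d_best + d_worst
        closeness.append(d_worst / total if total else 0.0)   # step (6)
    return closeness

# Rows: (sum of instruction lengths, cyclomatic complexity, full-width instruction share)
alternatives = [[14, 2, 0.8], [2, 1, 0.5]]
scores = ideal_point_closeness(alternatives)
best_index = max(range(len(scores)), key=scores.__getitem__)   # step (7)
print(scores, best_index)
```

Because the first alternative is at least as good as the second on every index, it coincides with the positive ideal solution and receives closeness 1.0, while the second coincides with the negative ideal solution and receives 0.0.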

Claims (6)

1. A novel static function identification method using a reverse extended control flow graph, characterized by comprising the following steps:
Step 1: establishing the set of reverse extended control flow graphs for a region;
Step 2: denoising the reverse extended control flow graphs by deleting nodes that were found during construction of the reverse extended control flow graphs but could not have been generated by a compiler;
Step 3: deleting and merging reverse extended control flow graphs;
Step 4: identifying the function entry in each reverse extended control flow graph;
Step 5: obtaining the identification results of the plurality of functions in the specified region.
2. The method according to claim 1, wherein the specific process of step 1 is as follows:
for the specified code region, constructing a corresponding reverse extended control flow graph from bottom to top from every address that matches the characteristics of a function return instruction, thereby forming the set of reverse extended control flow graphs for the region.
3. The novel static function identification method using a reverse extended control flow graph according to claim 2, characterized in that the construction of the reverse extended control flow graph is divided into two steps:
first, constructing the initial graph in reverse: adding the search starting point to the graph, and repeatedly searching for all possible predecessors of each newly added point in the graph;
second, for the jump targets of the branch instructions in the graph, adding the control flow graphs starting from those jump targets to the current reverse extended control flow graph using the conventional control-flow-based recursive traversal method.
4. The novel static function identification method using a reverse extended control flow graph according to claim 1, wherein step 2, denoising the reverse extended control flow graph, comprises the following steps:
Step 21: deleting illegal conditional branch instructions, including conditional branch instructions without predecessor nodes, conditional branch instructions whose jump target is an illegal memory address, and conditional branch instructions whose two branches contain instructions with overlapping bytes;
Step 22: deleting paths that do not end in a function return instruction, a jump instruction, or a CALL instruction;
Step 23: checking each path, and deleting nodes whose paired instructions are unmatched, together with all their predecessors;
Step 24: deleting privileged instructions and all their predecessors, including interrupt-related instructions, halt instructions, and instructions that operate on special CPU registers;
Step 25: deleting instructions with no practical effect, including NOP, breakpoint instructions, additions of 0, subtractions of 0, logical OR with 0, shift instructions with a shift count of 0, and move instructions whose source and destination operands are identical;
Step 26: taking the subgraph that contains the reverse search starting point as the new reverse extended control flow graph;
Step 27: repeating steps 21 to 26 until the graph no longer changes.
5. The method of claim 1, wherein step 3, deleting and merging reverse extended control flow graphs, comprises the following steps:
Step 31: enumerating every pair of reverse extended control flow graphs in the region's set of reverse extended control flow graphs, and, if the reverse search starting point of one graph lies in the other graph, deleting the graph with fewer nodes from the set;
Step 32: enumerating every pair of reverse extended control flow graphs in the region's set of reverse extended control flow graphs, and, if the reverse search starting point of one graph is part of the instruction at some point in the other graph, deleting the worse of the two graphs, as decided by the multi-attribute decision ideal point method from each graph's sum of instruction lengths, cyclomatic complexity, and percentage of full-width instructions.
6. The novel static function identification method using a reverse extended control flow graph according to claim 1, wherein step 4, identifying the function entry in a reverse extended control flow graph, comprises the following steps:
Step 41: traversing from each point without predecessors in the graph to obtain the control flow graph that has that point as its entry point;
Step 42: using the multi-attribute decision ideal point method to select the best control flow graph as the function identification result, based on each graph's sum of instruction lengths, cyclomatic complexity, and percentage of full-width instructions.
CN201310291941.0A 2013-07-12 2013-07-12 Static function identification method using reverse extended control flow graph Active CN103440122B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310291941.0A CN103440122B (en) 2013-07-12 2013-07-12 Static function identification method using reverse extended control flow graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310291941.0A CN103440122B (en) 2013-07-12 2013-07-12 Static function identification method using reverse extended control flow graph

Publications (2)

Publication Number Publication Date
CN103440122A true CN103440122A (en) 2013-12-11
CN103440122B CN103440122B (en) 2016-06-08

Family

ID=49693813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310291941.0A Active CN103440122B (en) 2013-07-12 2013-07-12 Static function identification method using reverse extended control flow graph

Country Status (1)

Country Link
CN (1) CN103440122B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095470A (en) * 2016-08-17 2016-11-09 广东工业大学 The program comprehension method and system that the cognitive relative importance value of stream drives are controlled based on flattening
CN107704235A (en) * 2017-09-22 2018-02-16 深圳航天科技创新研究院 The analytic method of data flowchart, system and storage medium in mathematics library
CN113918171A (en) * 2021-10-19 2022-01-11 哈尔滨理工大学 Novel disassembling method using extended control flow graph
CN118502732A (en) * 2024-07-18 2024-08-16 杭州新中大科技股份有限公司 Analysis method, device, equipment and medium of byte code program

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101968766A (en) * 2010-10-21 2011-02-09 上海交通大学 System for detecting software bug triggered during practical running of computer program
US20120159458A1 (en) * 2010-12-17 2012-06-21 Microsoft Corporation Reconstructing program control flow
US20130055221A1 (en) * 2011-08-26 2013-02-28 Fujitsu Limited Detecting Errors in Javascript Software Using a Control Flow Graph

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
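邱景 (Qiu Jing): "Fast library function identification technique based on basic block partitioning" (基于基本块划分的库函数快速识别技术), Computer Engineering (《计算机工程》) *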

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095470A (en) * 2016-08-17 2016-11-09 广东工业大学 The program comprehension method and system that the cognitive relative importance value of stream drives are controlled based on flattening
CN106095470B (en) * 2016-08-17 2019-08-09 广东工业大学 The program comprehension method and system for flowing cognition priority driving are controlled based on flattening
CN107704235A (en) * 2017-09-22 2018-02-16 深圳航天科技创新研究院 The analytic method of data flowchart, system and storage medium in mathematics library
CN113918171A (en) * 2021-10-19 2022-01-11 哈尔滨理工大学 Novel disassembling method using extended control flow graph
CN118502732A (en) * 2024-07-18 2024-08-16 杭州新中大科技股份有限公司 Analysis method, device, equipment and medium of byte code program

Also Published As

Publication number Publication date
CN103440122B (en) 2016-06-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230628

Address after: Building 1, Kechuang headquarters, Shenzhen (Harbin) Industrial Park, 288 Zhigu street, Songbei District, Harbin City, Heilongjiang Province

Patentee after: Harbin Nenchuang Digital Technology Co.,Ltd.

Address before: 150000 No. 92, West Da Zhi street, Nangang District, Harbin, Heilongjiang.

Patentee before: HARBIN INSTITUTE OF TECHNOLOGY