CN110147235B - Semantic comparison method and device between source code and binary code - Google Patents


Publication number
CN110147235B
Authority
CN
China
Prior art keywords
bin
src
node
ast
comparison
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910249283.6A
Other languages
Chinese (zh)
Other versions
CN110147235A (en
Inventor
袁子牧
冯牧玥
班固
肖扬
许家欢
俞晨东
霍玮
邹维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201910249283.6A priority Critical patent/CN110147235B/en
Publication of CN110147235A publication Critical patent/CN110147235A/en
Application granted granted Critical
Publication of CN110147235B publication Critical patent/CN110147235B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/43 Checking; Contextual analysis
    • G06F8/433 Dependency analysis; Data or control flow analysis
    • G06F8/434 Pointers; Aliasing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/43 Checking; Contextual analysis
    • G06F8/436 Semantic checking

Abstract

The invention relates to a semantic comparison method and device between source code and binary code. The method comprises the following steps: 1) specifying a statement ST_src and a key variable V_src therein from given source code, and specifying a statement ST_bin and a key variable V_bin therein from binary code; 2) performing data flow analysis on V_src and V_bin respectively to generate the abstract syntax tree of V_src and the abstract syntax tree of V_bin; 3) comparing the abstract syntax tree of V_src with the abstract syntax tree of V_bin to judge whether V_src and V_bin are semantically consistent; 4) judging the semantic similarity between the source code and the binary code according to the semantic consistency result for the key variables V_src and V_bin. The method does not depend on manual intervention or a compilation process, allows key variables to be specified in any source code function and binary code function for comparison, and can improve the success rate and efficiency of comparison.

Description

Semantic comparison method and device between source code and binary code
Technical Field
The invention belongs to the field of program analysis and relates to a static analysis technique based on semantic comparison. It focuses on static analysis of code semantics, and specifically on checking whether binary code reuses target source code, in particular source code containing an unpatched vulnerability.
Background
Semantic comparison is commonly used to compare, from the perspective of a specific application, how similar source code and binary code are in execution logic. For example, it is used to compare code for similarity in a particular function, such as whether two pieces of code contain the same malicious functionality or use similar encryption and decryption algorithms.
In current research practice, semantic comparison has a wide range of applications, and a specific comparison scheme must be designed for each specific problem. The invention focuses on how to extract the same semantics from source code and binary code, where the semantics should minimize the differences between the two that are introduced by compiler optimization and the removal of symbol information, so that source code and binary code can be compared semantically to judge whether the binary code reuses given target source code. In terms of comparison granularity, the invention works at the function level: it extracts the semantic relations between statements and variables within a function and forms abstract syntax trees for comparison, so as to determine whether a source code function exists in the binary code.
The techniques closest to the invention are the abstract syntax tree (AST), data flow analysis and control flow analysis. For a specific application such as semantic comparison, these techniques must be customized to reach the intended goal; for example, the AST facilities of tools such as Clang and Antlr can parse an entire source code project. The invention instead focuses on the set of variables related to a given key statement: under control flow constraints, it generates the AST of a variable through data flow analysis, and compares the ASTs generated in this way from the source code and from the binary.
From the perspective of code comparison, similar approaches typically compile the source code project into a binary executable, which is then compared with other binaries (a "binary-binary" comparison). However, this requires a complete source code project, and successful compilation usually requires manually setting many compilation parameters and pre-installing dependent libraries. The invention therefore adopts a direct "source code-binary" comparison, which does not depend on manual configuration or a compilation process and can directly improve the success rate and efficiency of comparison. A small number of published works also adopt source code-binary comparison, but they either rely on debugging information (or symbol information) output during compilation, or compare constants such as strings and arrays without analyzing semantic information in depth, which differs markedly from the semantic comparison method of the invention. The invention is also more general: not every binary code function references constants such as strings and arrays, but code that implements a specific function always has corresponding semantic information, and released software usually has debugging information and most symbol information removed.
In summary, for a given key statement to compare, the invention provides a general semantic comparison method between source code and binary code. Compared with existing work, it targets scenarios where no compilation takes place and automatically extracts and compares the semantic features of statements within source code and binary code functions.
Disclosure of Invention
To compare statement-level semantic features within source code and binary code functions automatically, and to overcome the limitation that existing methods depend on a compilation process or require functions to reference constants, the invention provides a method that, for a given key statement, generates ASTs through data flow analysis under control flow constraints and uses them to compare source code with binary code. It is referred to below as the semantic comparison method.
To simplify the description, the invention considers the following comparison process: a statement ST_src and a key variable V_src therein are specified from the given source code, a candidate statement ST_bin and a key variable V_bin therein are specified from the binary code, and the ASTs generated for V_src and V_bin by data flow analysis are compared. That is, the description does not cover how statements and key variables are selected, so as to focus on the semantic feature extraction and comparison techniques.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A semantic comparison method between source code and binary code comprises the following steps:
1) specifying a statement ST_src and a key variable V_src therein from given source code, and specifying a statement ST_bin and a key variable V_bin therein from binary code;
2) performing data flow analysis on V_src and V_bin respectively to generate the abstract syntax tree of V_src and the abstract syntax tree of V_bin;
3) comparing the abstract syntax tree of V_src with the abstract syntax tree of V_bin to judge whether V_src and V_bin are semantically consistent;
4) judging the semantic similarity between the source code and the binary code according to the semantic consistency result for the key variables V_src and V_bin.
Further, step 2) comprises:
(1) For the key variable V (which covers both V_src and V_bin), extract the longest path P from the basic block B containing the statement ST to the function entry basic block B_e. This step extracts the control flow constraint of variable V, defined as follows. Let B_1 < B_2 denote that the statements in basic block B_1 precede those in basic block B_2 in execution order. Let Set_e denote the set of function entry basic blocks B_e: for any B_e ∈ Set_e, there is no basic block B_i with B_i < B_e. Let B-B_e denote the number of basic blocks traversed from basic block B to the entry basic block B_e (including B and B_e). The length of the longest path can then be written as max{B-B_e}, B_e ∈ Set_e, where max{} denotes the maximum value. The basic blocks on the longest path P are collected by backtracking from B and adding blocks to the path in the order in which they are reached; each basic block is added only once, so loops have no effect, and the process stops at the entry basic block B_e.
(2) For the key variable V, trace its data sources backward along the longest path P through assignment and calculation statements, continuing until no further tracing is possible within the function, and form the AST by extraction and abstraction. Concretely, under the constraint of the longest path, the variable V is the root node, assignment operators or calculation operators are the intermediate nodes, and data that cannot be traced further within the function are the leaf nodes.
In terms of origin, a leaf node may contain the following 4 types of data (class 1):
1.1 local variables defined within the function: for example, unsigned len in the source code, or a form such as ebp-0x1C or ebp+var_1C after disassembly of X86 binary machine code;
1.2 incoming parameters of the function: for example, strm, in, in_desc, out and out_desc in the source code int ZEXPORT inflateBack(strm, in, in_desc, out, out_desc), or a form such as ebp+0x8 or ebp+arg_0 after disassembly of X86 binary machine code;
1.3 return values of other functions called within the function: for example, ret in the source code ret = inflate_table(...), or, in X86 disassembly, the value held in the eax register after the call to sub_62E8B2F0, as tested by test eax, eax;
1.4 constants (immediates, strings, addresses, etc.): for example, the numeral 7 in the source code state->bits = 7, or in the X86 disassembly form mov dword ptr [ecx+54h], 7.
In terms of expression, a leaf node may contain the following 2 types of data (class 2):
2.1 pointer quantities (variables or constants): in source code, the definition of or reference to a pointer quantity is marked by symbols such as & or ->; in disassembled binary code it involves dword ptr and square brackets [...], e.g., dword ptr [ecx+54h];
2.2 non-pointer quantities: in contrast to pointer quantities, their definitions and references carry no such symbols in source code, and no dword ptr or square brackets [...] in disassembled code.
Combining class 1 (1.1-1.4) and class 2 (2.1-2.2), leaf nodes contain 4 × 2 = 8 types of data.
(3) Compare the ASTs of the specified key variables V_src and V_bin. First mark the assignment statements or calculation statements represented by the intermediate nodes and the leaf nodes, by checking whether the basic block containing each statement is in a loop:
fuzzy comparison mark: condition i) or ii) is satisfied. Condition i): the basic block is in a loop, in which case the node is marked for fuzzy comparison. Condition ii): an ancestor node of the node carries the fuzzy comparison mark, in which case the node is likewise marked for fuzzy comparison;
exact comparison mark: condition iii) is satisfied. Condition iii): the basic block is not in a loop and the node has no ancestor node carrying the fuzzy comparison mark, in which case the node is marked for exact comparison.
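The leaf classification and the marking rule can be sketched in a few lines of Python. This is only an illustrative sketch: the names `Origin`, `AstNode` and `mark` are hypothetical and do not come from the patent, and the small tree at the end is a made-up example rather than the AST of any figure.

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import product
from typing import List, Optional


class Origin(Enum):
    """Class 1: where the leaf data comes from (hypothetical names)."""
    LOCAL = "local variable"
    PARAM = "incoming parameter"
    CALL_RETURN = "return value of a call"
    CONSTANT = "constant"


@dataclass
class AstNode:
    label: str
    in_loop: bool = False             # is the node's basic block inside a loop?
    origin: Optional[Origin] = None   # class 1, set on leaf nodes only
    is_pointer: bool = False          # class 2: pointer vs non-pointer
    children: List["AstNode"] = field(default_factory=list)
    fuzzy: bool = False               # filled in by mark()


def mark(node: AstNode, ancestor_fuzzy: bool = False) -> None:
    """Fuzzy if the block is in a loop (i) or an ancestor is fuzzy (ii); else exact (iii)."""
    node.fuzzy = node.in_loop or ancestor_fuzzy
    for child in node.children:
        mark(child, node.fuzzy)


# All 4 x 2 = 8 leaf classes obtained by combining class 1 and class 2.
LEAF_CLASSES = list(product(Origin, (True, False)))

# Made-up tree: the constant 0 is stored inside a loop, the parameter is not.
zero = AstNode("0", origin=Origin.CONSTANT)
strm = AstNode("strm", origin=Origin.PARAM, is_pointer=True)
store = AstNode("=", in_loop=True, children=[zero])
root = AstNode("V", children=[store, strm])
mark(root)
```

After `mark(root)`, the leaf `zero` inherits the fuzzy mark from its parent, while `strm` keeps the exact mark because neither it nor any of its ancestors lies in a loop.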
The specific comparison method is as follows:
a) start the comparison from the root nodes of the ASTs of V_src and V_bin;
b) merge the nodes of each AST: if adjacent parent and child nodes on a tree path both contain assignment operators or the same operation operator, merge them into one node, and make the children of the nodes before merging the children of the new merged node;
c) judgment 1: for an intermediate node N_src of the AST of V_src, if its content is an assignment operator, no comparison is made and its child nodes are traversed depth-first; if its content is an operation operator, a node N_bin with the same operation operator is sought by depth-first traversal of the AST of V_bin, forming a matching pair (N_src, N_bin). If no node with the same operation operator can be found, or other operation operators lie in between on the AST path, the ASTs extracted for V_src and V_bin are considered inconsistent;
d) judgment 2: for a leaf node N_src of the AST of V_src, if it carries the exact comparison mark and is of the constant type in class 1 and the non-pointer type in class 2, a leaf node N_bin with the same constant value must be found by depth-first traversal of the AST of V_bin; otherwise, it suffices to find, by depth-first traversal of the AST of V_bin, a leaf node N_bin of the same type in both class 1 and class 2. If no such node can be found, the ASTs extracted for V_src and V_bin are considered inconsistent;
e) judgment 3: for a node N'_src of the AST of V_src whose content is an operation operator, if the node N_bin with the same operation operator found in the AST of V_bin has already formed a matching pair (N_src, N_bin) with another node N_src, and the pairs cannot be reconciled, the ASTs extracted for V_src and V_bin are considered inconsistent; here, "cannot be reconciled" means that no N'_bin exists such that the set of matching pairs {(N_src, N_bin), (N'_src, N'_bin)} or {(N'_src, N_bin), (N_src, N'_bin)} holds;
f) if none of the inconsistencies above arises after the ASTs of V_src and V_bin are compared, V_src and V_bin are considered semantically consistent variables.
In the source code and binary code comparison, the semantic similarity of the target comparison code can be judged according to the semantic consistency judgment of multiple groups of key variables. If multiple sets of key variables are consistent, the source code and binary code are considered semantically similar.
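A simplified Python sketch of the comparison may help. It covers node merging and the leaf check of judgment 2, and interprets the reconciliation requirement of judgment 3 as a one-to-one matching found with augmenting paths. That interpretation, the omission of the operator check of judgment 1, and all names (`Node`, `merge`, `leaves_consistent`) are assumptions made for illustration, not the patent's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    op: Optional[str] = None     # "=", "+", ... on intermediate nodes; None on leaves
    kind: Optional[str] = None   # leaf class 1: "local", "param", "ret" or "const"
    pointer: bool = False        # leaf class 2
    value: object = None         # constant value or name, for illustration
    exact: bool = True           # True = exact comparison mark, False = fuzzy
    children: List["Node"] = field(default_factory=list)


def merge(node: Node) -> Node:
    """Fold a child into its parent when both carry the same operator."""
    merged = []
    for child in node.children:
        merge(child)
        if child.op is not None and child.op == node.op:
            merged.extend(child.children)
        else:
            merged.append(child)
    node.children = merged
    return node


def leaves(node: Node) -> List[Node]:
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def compatible(s: Node, b: Node) -> bool:
    """Same leaf class; exact non-pointer constants must also agree in value."""
    if (s.kind, s.pointer) != (b.kind, b.pointer):
        return False
    if s.exact and s.kind == "const" and not s.pointer:
        return s.value == b.value
    return True


def leaves_consistent(src_root: Node, bin_root: Node) -> bool:
    """Every source leaf needs its own compatible binary leaf (assumed one-to-one)."""
    src, bin_leaves = leaves(src_root), leaves(bin_root)
    match = {}  # index of binary leaf -> index of source leaf

    def try_assign(i: int, seen: set) -> bool:
        for j, b in enumerate(bin_leaves):
            if j in seen or not compatible(src[i], b):
                continue
            seen.add(j)
            # Take a free leaf, or re-pair the current owner elsewhere.
            if j not in match or try_assign(match[j], seen):
                match[j] = i
                return True
        return False

    return all(try_assign(i, set()) for i in range(len(src)))


src_ast = merge(Node(op="=", children=[
    Node(op="=", children=[Node(kind="const", value=0, exact=False),
                           Node(kind="const", value=7, exact=False)]),
    Node(kind="param", pointer=True, value="strm"),
]))
bin_ast = merge(Node(op="=", children=[
    Node(kind="const", value=7), Node(kind="const", value=0),
    Node(kind="param", pointer=True, value="a1"),
]))
print(leaves_consistent(src_ast, bin_ast))
```

The re-pairing step in `try_assign` is what lets an earlier, looser match give way when a later exact constant needs the same binary leaf, which is the role the patent assigns to reconciling matching pairs.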
Correspondingly to the above method, the present invention further provides a semantic comparison device between a source code and a binary code, which includes:
an abstract syntax tree generation module, configured to specify a statement ST_src and a key variable V_src therein from given source code, to specify a statement ST_bin and a key variable V_bin therein from binary code, and to perform data flow analysis on V_src and V_bin respectively to generate the abstract syntax tree of V_src and the abstract syntax tree of V_bin;
a comparison module, configured to compare the abstract syntax tree of V_src with the abstract syntax tree of V_bin to judge whether V_src and V_bin are semantically consistent;
a similarity judgment module, configured to judge the semantic similarity between the source code and the binary code according to the semantic consistency result for the key variables V_src and V_bin.
Further, the abstract syntax tree generation module comprises:
a longest path extraction module, responsible for extracting, for the key variable V, the longest path P from the basic block B containing the statement ST to the function entry basic block B_e, where V covers both V_src and V_bin;
a data dependence analysis module, responsible for tracing the data sources of the key variable V backward along the longest path P through assignment and calculation statements, and finally forming the abstract syntax tree by extraction and abstraction.
The invention also provides a computer comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method described above.
The invention also provides a security detection method for software code, comprising the following steps:
a) collecting binary software samples;
b) using the method described above, comparing the collected binary software samples, by means of key variables, with the source code of an open source library that has a security vulnerability;
c) determining, according to the comparison result of step b), whether a binary software sample contains the security vulnerability.
The invention has the beneficial effects that:
The invention designs a way to extract and compare semantics for source code and binary code, and can infer the degree of semantic similarity between them from whether key variables are semantically consistent. Its semantic design customizes program analysis methods such as AST construction, control flow analysis and data flow analysis. (1) Compared with the approach common in published work, which compiles the source code project into a binary executable and performs a binary-binary comparison, the scheme does not depend on manual intervention or a compilation process and can improve the success rate and efficiency of comparison. (2) Compared with the source code-binary comparison methods of a few published works, the scheme does not depend on debugging information (or symbol information) output during compilation, does not require functions to contain constants such as strings, and allows key variables in any source code function and binary code function to be specified for comparison.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
FIG. 2 shows example source code together with the disassembled and decompiled code of a binary file.
Fig. 3 is a schematic diagram of extraction of the key variable AST in the example of fig. 2.
Fig. 4 is a schematic diagram of the key variable AST in fig. 3 after comparison processing.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.
An embodiment of the invention is shown in FIG. 1. Given source code src and binary code bin, with key variables V_src and V_bin specified in them: (1) first, extract for each of V_src and V_bin the longest path (i.e., the path through the largest number of basic blocks) from the basic block containing its statement to the function entry basic block; (2) second, analyze the data dependences of V_src and V_bin on their longest paths, using the tracing and classification described in the Disclosure of Invention, to form the ASTs; (3) third, judge whether V_src and V_bin are semantically consistent, using the comparison described in the Disclosure of Invention, and infer the degree of semantic similarity between the source code and the binary code from whether the key variables are semantically consistent.
To illustrate the workflow, take zlib, a basic library reused by many software products, as an example. FIG. 2 shows part of the source code of the inflateBack function in version 1.2.8 (panel (a)), together with part of its disassembled code (panel (b)) and decompiled code (panel (c)) from a binary executable. For ease of understanding, the comparison is illustrated mainly with the source code and the decompiled code. The key variables are set to state->lens (i.e., V_src) in the source statement ret = inflate_table(CODES, state->lens, 19, &(state->next), &(state->lenbits), state->work); and v179+112 (i.e., V_bin) in the decompiled statement if (sub_62E8B2F0(0, v179+112, 19, v146, v145, v147)).
For the given key variables V_src = state->lens and V_bin = v179+112, the embodiment proceeds as follows:
The first step: extract the longest path. For source code, basic blocks can be extracted with a tool such as Clang or Antlr; for decompiled code, with a tool such as IDA Pro or Antlr. For example, parsing the inflateBack function with IDA Pro yields 224 basic blocks and their successor relations. After the basic blocks are extracted, an edge with weight -1 is created for each pair of basic blocks in a successor relation, and the minimum-weight path found by the Floyd-Warshall algorithm is the longest path.
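The weight -1 trick can be sketched in Python as follows. The sketch assumes the control flow graph has already been made acyclic (for example by dropping loop back edges, in line with the rule that each basic block is added only once), since Floyd-Warshall is not meaningful on graphs with negative cycles. The function name `longest_path` and the toy block names are hypothetical, and the path is returned in entry-to-target order rather than by backtracking from the target.

```python
from itertools import product


def longest_path(edges, entry, target):
    """Longest entry->target path in an acyclic CFG, via Floyd-Warshall on -1 weights."""
    nodes = sorted({n for edge in edges for n in edge} | {entry, target})
    inf = float("inf")
    dist = {(u, v): (0 if u == v else inf) for u, v in product(nodes, repeat=2)}
    nxt = {}
    for u, v in edges:
        dist[u, v] = -1      # every successor edge weighs -1
        nxt[u, v] = v
    # product() varies its first position slowest, so k is the outermost loop.
    for k, i, j in product(nodes, repeat=3):
        if dist[i, k] + dist[k, j] < dist[i, j]:
            dist[i, j] = dist[i, k] + dist[k, j]
            nxt[i, j] = nxt[i, k]
    if dist[entry, target] == inf:
        return None          # target is unreachable from the entry block
    path = [entry]
    while path[-1] != target:
        path.append(nxt[path[-1], target])
    return path


# Toy CFG with hypothetical block names; "c" holds the key statement.
cfg_edges = [("entry", "a"), ("entry", "b"), ("a", "b"), ("a", "c"), ("b", "c")]
print(longest_path(cfg_edges, "entry", "c"))
```

The number of basic blocks on the path is `len(path)`, which corresponds to the quantity maximized when the longest path is defined. For large functions a topological-order dynamic program would be cheaper than the cubic Floyd-Warshall loop, but the sketch follows the algorithm named in the text.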
The second step: analyze data dependences. Panels (a) and (b) of FIG. 3 show the ASTs extracted for V_src and V_bin respectively. Note that v179+112 is treated here as a single variable rather than as v179 plus the constant 112, because v179 is defined or assigned as a pointer-type variable, e.g., v179 = DWORD(a1+28); the plus sign only fetches the data stored in memory at the given offset and does not correspond to an actual operation in the source code, so it is not treated as an operator. As described in the Disclosure of Invention, once the sources have been traced back through assignment or calculation statements, the ASTs extracted for V_src and V_bin each contain 3 leaf nodes.
The tracing process for V_src is as follows:
√ V_src = state->lens, i.e., its value is given by state->lens;
√ the assignment statement state->lens[order[state->have++]] = 0, i.e., an element of state->lens may equal 0;
√ the assignment statement state->lens[order[state->have++]] = (unsigned short)BITS(3), i.e., an element of state->lens may equal BITS(3), and BITS(3) is the number 7 after macro expansion;
√ the assignment statement state = (struct inflate_state FAR *)strm->state, i.e., state comes from the pointer strm->state;
√ the assignment statements above trace back to 3 leaf nodes: constant-non-pointer 0, constant-non-pointer BITS(3), and incoming parameter-pointer z_stream strm.
The tracing process for V_bin is as follows:
√ V_bin = v179+112, i.e., its value is given by v179+112;
√ in the assignment to WORD(v179+2*v38+112), which stores the number 7, the value of v38 may be 0, so the variable v179+112 may be affected by v179+2*v38+112 and be assigned the number 7;
√ the variable v38 can be traced back through the assignments v38 = (unsigned __int16)word_62E98520[v34++], then v34 = DWORD(v179+104), then DWORD(v179+104) set to 0, i.e., until it is assigned the value 0;
√ the variable v179 can be traced back to the assignment v179 = DWORD(a1+28), and a1 is the incoming parameter a1;
√ the assignment statements above trace back to 3 leaf nodes: constant-non-pointer 0, constant-non-pointer 7, and incoming parameter-pointer a1.
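The backward tracing can be illustrated with a toy Python sketch. It assumes the statements on the longest path have already been flattened into hypothetical `(target, operator, operands)` triples in execution order, and it ignores pointers, array elements and calls, so it is far simpler than the tracing described for V_src and V_bin. The name `trace` and the sample statements are illustrative only.

```python
def trace(var, stmts, params, upto=None):
    """Return (label, children); a leaf has an empty children list."""
    upto = len(stmts) if upto is None else upto
    if isinstance(var, int):
        return (f"const {var}", [])
    defs = [(i, op, args)
            for i, (tgt, op, args) in enumerate(stmts[:upto]) if tgt == var]
    if not defs:
        # Nothing on the path defines it: stop at a parameter or a local.
        kind = "param" if var in params else "local"
        return (f"{kind} {var}", [])
    # One operator child per defining statement; its operands are traced only
    # through earlier statements, so the recursion always terminates.
    children = [(op, [trace(arg, stmts, params, i) for arg in args])
                for i, op, args in defs]
    return (var, children)


def leaf_labels(tree):
    label, children = tree
    if not children:
        return [label]
    return [leaf for child in children for leaf in leaf_labels(child)]


# Made-up statements along a path, in execution order.
path_stmts = [
    ("p", "=", ["arg0"]),      # p = arg0
    ("n", "=", [0]),           # n = 0
    ("n", "+", ["n", 7]),      # n = n + 7
    ("v", "*", ["p", "n"]),    # v = p * n
]
tree = trace("v", path_stmts, params={"arg0"})
print(leaf_labels(tree))
```

Each variable becomes a node whose children are the operators of the statements that define it, and tracing stops at constants, parameters and variables with no definition on the path, which play the role of leaf nodes.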
The third step: comparison. As described in the Disclosure of Invention, panels (a) and (b) of FIG. 4 show the marked and merged ASTs of V_src and V_bin respectively. After comparison by judgments 1, 2 and 3, the set of matching pairs {(incoming parameter-pointer z_stream strm, incoming parameter-pointer a1), (constant-non-pointer 0, constant-non-pointer 0), (constant-non-pointer BITS(3), constant-non-pointer 7)} can be formed. In panels (a) and (b) of FIG. 4, the second text box contains only an equal sign, meaning that what follows it is the source of state->lens or v179+112 in the first text box.
Marking: check whether the basic block containing each node's statement belongs to a loop, and mark the node for fuzzy or exact comparison accordingly, as shown by the leaf node marks in panels (a) and (b) of FIG. 4.
Node merging: if adjacent parent and child nodes in the tree are both assignment statements or carry the same operation operator, merge them into one node.
After the intermediate and leaf nodes are compared by judgments 1, 2 and 3, the matching pair (incoming parameter-pointer z_stream strm, incoming parameter-pointer a1) is obtained, together with either (constant-non-pointer 0, constant-non-pointer 0) and (constant-non-pointer BITS(3), constant-non-pointer 7), or (constant-non-pointer 0, constant-non-pointer 7) and (constant-non-pointer BITS(3), constant-non-pointer 0); none of the semantic inconsistencies described in judgments 1, 2 and 3 occurs.
From this comparison, V_src = state->lens and V_bin = v179+112 can be considered semantically consistent.
Furthermore, more key variables like V_src and V_bin can be specified and compared in order to judge the semantic similarity of the code. Specifically, for source code and binary code to be compared, several key variables are first specified in the source code; if key variables with similar semantics can be found for them in the binary code, the source code and the binary code are considered semantically similar.
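The function-level decision over several key variables can be sketched as follows. The sketch assumes that every source key variable must find at least one consistent binary candidate, which is one reading of the paragraph above; the names `semantically_similar` and `same_signature` are hypothetical, and the leaf-signature comparison merely stands in for the full AST comparison.

```python
def semantically_similar(src_vars, bin_vars, consistent):
    """True if every source key variable has a consistent binary counterpart."""
    src_vars, bin_vars = list(src_vars), list(bin_vars)
    return bool(src_vars) and all(
        any(consistent(s, b) for b in bin_vars) for s in src_vars
    )


def same_signature(src_leaves, bin_leaves):
    """Stand-in check: compare the sorted leaf classes of two key variables."""
    return sorted(src_leaves) == sorted(bin_leaves)


# Each key variable is represented by made-up (class 1, class 2) leaf classes.
src_key_vars = [
    [("const", "non-pointer"), ("const", "non-pointer"), ("param", "pointer")],
    [("local", "non-pointer")],
]
bin_key_vars = [
    [("local", "non-pointer")],
    [("param", "pointer"), ("const", "non-pointer"), ("const", "non-pointer")],
    [("ret", "non-pointer")],
]
print(semantically_similar(src_key_vars, bin_key_vars, same_signature))
```

Passing the consistency check in as a parameter keeps this decision separate from how two ASTs are actually compared.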
Another embodiment of the present invention provides a semantic comparison apparatus between a source code and a binary code, comprising:
an abstract syntax tree generation module, configured to specify a statement ST_src and a key variable V_src therein from given source code, to specify a statement ST_bin and a key variable V_bin therein from binary code, and to perform data flow analysis on V_src and V_bin respectively to generate the abstract syntax tree of V_src and the abstract syntax tree of V_bin;
a comparison module, configured to compare the abstract syntax tree of V_src with the abstract syntax tree of V_bin to judge whether V_src and V_bin are semantically consistent;
a similarity judgment module, configured to judge the semantic similarity between the source code and the binary code according to the semantic consistency result for the key variables V_src and V_bin.
The abstract syntax tree generation module comprises:
a longest path extraction module, responsible for extracting, for the key variable V, the longest path P from the basic block B containing the statement ST to the function entry basic block B_e, where V covers both V_src and V_bin;
a data dependence analysis module, responsible for tracing the data sources of the key variable V backward along the longest path P through assignment and calculation statements, and finally forming the abstract syntax tree by extraction and abstraction.
Another embodiment of the invention provides a computer comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method described above.
After semantic similarity between the source code and the binary code is obtained through comparison, the semantic similarity can be further used for discovering security problems in the code. For example, once an open source library of a certain version is reported to have a security vulnerability, a binary software sample can be further collected and a plurality of key variable comparisons can be performed to find a software sample having the open source library vulnerability.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the principle and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (10)

1. A semantic comparison method between a source code and a binary code is characterized by comprising the following steps:
1) specifying a statement ST_src and a key variable V_src therein from given source code, and specifying a statement ST_bin and a key variable V_bin therein from binary code;
2) performing data flow analysis on V_src and V_bin respectively to generate the abstract syntax tree of V_src and the abstract syntax tree of V_bin;
3) comparing the abstract syntax tree of V_src with the abstract syntax tree of V_bin to judge whether V_src and V_bin are semantically consistent;
4) judging the semantic similarity between the source code and the binary code according to the semantic consistency result for the key variables V_src and V_bin;
wherein step 3) marks the assignment statements or calculation statements represented by the intermediate nodes and the leaf nodes, by checking whether the basic block containing each statement is in a loop:
fuzzy comparison mark: condition i) or ii) is satisfied; wherein condition i) is: the basic block is in a loop, in which case the node is marked for fuzzy comparison; and condition ii) is: an ancestor node of the node carries the fuzzy comparison mark, in which case the node is likewise marked for fuzzy comparison;
exact comparison mark: condition iii) is satisfied; wherein condition iii) is: the basic block is not in a loop and the node has no ancestor node carrying the fuzzy comparison mark, in which case the node is marked for exact comparison.
2. The method of claim 1, wherein step 2) comprises:
2.1) for the key variable V, extracting the longest path P from the basic block B containing the statement ST to the function entry basic block B_e, where V covers both V_src and V_bin;
2.2) for the key variable V, tracing its data sources backward along the longest path P through assignment and calculation statements, and finally forming an abstract syntax tree by extraction and abstraction.
3. The method according to claim 2, characterized in that step 2.1) extracts the control flow constraint of the variable V, defined as follows: let B_1 < B_2 denote that the statements in basic block B_1 precede those in basic block B_2 in execution order; let Set_e denote the set of function entry basic blocks B_e: for any B_e ∈ Set_e, there is no basic block B_i with B_i < B_e; let B-B_e denote the number of basic blocks traversed from basic block B to the entry basic block B_e; the length of the longest path is then expressed as max{B-B_e}, B_e ∈ Set_e, where max{} denotes the maximum value; the basic blocks on the longest path P are collected by backtracking from B and adding blocks to the path in the order in which they are reached, each basic block being added only once so that loops have no effect, until the entry basic block B_e is reached; and step 2.2) traces the data sources backward along the longest path P through assignment and calculation statements until no further tracing is possible within the function, that is, under the constraint of the longest path, the variable V is the root node, assignment operators or calculation operators are the intermediate nodes, and data that cannot be traced further within the function are the leaf nodes.
4. The method of claim 3, wherein the leaf nodes comprise, by source: local variables defined in the function, incoming parameters of the function, return values of other functions called within the function, and constants; and, by form of expression: pointer quantities and non-pointer quantities.
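The data-source tracing of step 2.2) can be illustrated with a small backward walk over the statements on the longest path. This is a sketch under assumptions, not the claimed implementation: statements are modelled as hypothetical `(target, op, operands)` tuples in execution order, `"="` stands for a plain copy, and any name without an earlier definition (a parameter, a constant, an uninitialised local, a call result) becomes a leaf.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Ast:
    label: str
    children: List["Ast"] = field(default_factory=list)


# Hypothetical statement form: target = operands joined by op ("=" is a copy).
Stmt = Tuple[str, str, List[str]]


def trace(var: str, stmts: List[Stmt], upto: int) -> Ast:
    """Expand `var` using its latest definition among stmts[:upto]."""
    for i in range(upto - 1, -1, -1):
        target, op, operands = stmts[i]
        if target == var:
            kids = [trace(o, stmts, i) for o in operands]
            if op == "=":
                return Ast("=", kids)
            return Ast("=", [Ast(op, kids)])
    return Ast(var)                      # cannot be traced further: leaf node


def build_ast(var: str, stmts: List[Stmt]) -> Ast:
    """Build the tree with `var` as root and untraceable data as leaves."""
    definition = trace(var, stmts, len(stmts))
    if not definition.children:          # the variable itself is untraceable
        return definition
    return Ast(var, [definition])


def show(node: Ast) -> str:
    """Render the tree as an S-expression for inspection."""
    if not node.children:
        return node.label
    return "(" + " ".join([node.label] + [show(c) for c in node.children]) + ")"
```

For the statements `a = p`, `b = a + 4`, `v = b * n`, calling `show(build_ast("v", stmts))` yields `(v (= (* (= (+ (= p) 4)) n)))`: the root is the key variable, the intermediate nodes are assignment and calculation operators, and `p`, `4` and `n` are leaves.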
5. The method of claim 1, wherein the aligning of step 3) comprises:
3.1) starting the comparison from the root nodes of the ASTs of V_src and V_bin;
3.2) merging the nodes of each AST: if adjacent parent and child nodes on a path of the tree contain assignment operators or the same operation operator, they are merged into one node, and the child nodes of the nodes before merging become the child nodes of the new merged node;
3.3) judgment 1: for an intermediate node N_src of the AST of V_src, if its content is an assignment operator, no comparison is performed and the depth-first traversal continues to its child nodes; if its content is an operation operator, a node N_bin with the same operation operator is sought by depth-first traversal of the AST of V_bin, forming a matching pair (N_src, N_bin); if no node with the same operation operator can be found, or other operation operators intervene on the AST path, the ASTs extracted for V_src and V_bin are considered inconsistent;
3.4) judgment 2: for a leaf node N_src of the AST of V_src, if it carries the exact alignment mark and is of the constant type in class 1 and the non-pointer type in class 2, a leaf node N_bin with the same constant value must be found by depth-first traversal of the AST of V_bin; otherwise, it suffices to find, by depth-first traversal of the AST of V_bin, a leaf node N_bin belonging to the same class in both class 1 and class 2; if no such node can be found, the ASTs extracted for V_src and V_bin are considered inconsistent;
3.5) judgment 3: for a node N'_src of the AST of V_src whose content is an operation operator, if the node N_bin with the same operation operator found in the AST of V_bin has already formed a matching pair (N_src, N_bin) with another node N_src and the conflict cannot be reconciled, the ASTs extracted for V_src and V_bin are considered inconsistent; here, "cannot be reconciled" means that there is no N'_bin such that the matching pair group {(N_src, N_bin), (N'_src, N'_bin)} or {(N'_src, N_bin), (N_src, N'_bin)} holds;
3.6) if no inconsistency arises after the above comparison of the ASTs of V_src and V_bin, the two are considered to be variables with consistent semantics.
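Steps 3.2) to 3.6) can be approximated in a short sketch. It is deliberately simplified and rests on assumptions: the `Node` fields `source`, `pointer` and `exact` are hypothetical encodings of class 1, class 2 and the alignment mark; operator matching is reduced to a count comparison, which checks that a one-to-one pairing of same-operator nodes exists (judgments 1 and 3) but ignores the condition about other operators intervening on the path; leaf matching does not enforce one-to-one pairing.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    label: str                       # operator symbol, or leaf text
    children: List["Node"] = field(default_factory=list)
    source: str = "local"            # class 1 (assumed names): local, param, call, const
    pointer: bool = False            # class 2: pointer vs. non-pointer
    exact: bool = True               # True for an exact alignment mark


def merge(node: Node) -> Node:
    """Step 3.2: fold a child into its parent when both carry the same operator."""
    kids: List[Node] = []
    for child in map(merge, node.children):
        if child.children and child.label == node.label:
            kids.extend(child.children)      # the child's children move up
        else:
            kids.append(child)
    return Node(node.label, kids, node.source, node.pointer, node.exact)


def walk(node: Node) -> Iterator[Node]:
    yield node
    for child in node.children:
        yield from walk(child)


def consistent(src: Node, bin_: Node) -> bool:
    """Simplified judgments 1 to 3 on two trees rooted at the key variables."""
    s_nodes = list(walk(merge(src)))[1:]     # skip the root variable itself
    b_nodes = list(walk(merge(bin_)))[1:]
    # Judgments 1 and 3: every source operator needs its own binary partner;
    # assignment operators are not compared.
    s_ops = Counter(n.label for n in s_nodes if n.children and n.label != "=")
    b_ops = Counter(n.label for n in b_nodes if n.children and n.label != "=")
    if any(b_ops[op] < count for op, count in s_ops.items()):
        return False
    # Judgment 2: exact non-pointer constants match by value, other leaves by class.
    b_leaves = [n for n in b_nodes if not n.children]
    for leaf in (n for n in s_nodes if not n.children):
        strict = leaf.exact and leaf.source == "const" and not leaf.pointer
        if not any(
            b.source == leaf.source
            and b.pointer == leaf.pointer
            and (not strict or b.label == leaf.label)
            for b in b_leaves
        ):
            return False
    return True
```

With this sketch, a source tree for `v = (a + b) + 4` and a binary tree for a flattened three-operand addition with the same constant compare as consistent, because the nested `+` nodes are merged first and register or stack names on the binary side only need to fall in the same leaf classes.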
6. The method according to claim 1, wherein through the comparison of the multiple groups of key variables in steps 1) to 3), if the multiple key variables in the source code can correspondingly find key variables with similar semantics in the binary code, the source code and the binary code are considered to have semantic similarity.
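The overall decision of claim 6 can be sketched as an aggregation over per-variable results. This is only an illustration: the `consistent` callback stands for whatever per-variable comparison is used, and the `min_ratio` parameter is a hypothetical knob that is not part of the claim; its default of 1.0 requires every source key variable to find a counterpart.

```python
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")
B = TypeVar("B")


def semantically_similar(
    src_vars: Sequence[S],
    bin_vars: Sequence[B],
    consistent: Callable[[S, B], bool],
    min_ratio: float = 1.0,          # hypothetical threshold, not from the claim
) -> bool:
    """Report similarity when enough source key variables find a binary counterpart."""
    if not src_vars:
        return False
    matched = sum(
        1 for s in src_vars if any(consistent(s, b) for b in bin_vars)
    )
    return matched / len(src_vars) >= min_ratio
```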
7. A semantic comparison device between source code and binary code by using the method of any one of claims 1 to 6, comprising:
an abstract syntax tree generating module, configured to specify, from given source code, a statement ST_src and a key variable V_src therein, to specify, from binary code, a statement ST_bin and a key variable V_bin therein, and to perform data flow analysis on V_src and V_bin respectively to generate the abstract syntax tree of V_src and the abstract syntax tree of V_bin;
a comparison module, configured to compare the abstract syntax tree of V_src with the abstract syntax tree of V_bin to judge whether the semantics of V_src and V_bin are consistent;
a similarity judging module, configured to judge the semantic similarity between the source code and the binary code according to the semantic-consistency results for the key variables V_src and V_bin.
8. The apparatus of claim 7, wherein the abstract syntax tree generating module comprises:
a longest path extraction module, responsible for extracting the longest path P from the basic block B of the statement ST in which the key variable V is located to the function-entry basic block B_e, where V includes V_src and V_bin;
a data dependence analysis module, responsible for tracing the data sources of the key variable V along the longest path P through assignment and calculation statements, and finally forming an abstract syntax tree through extraction and abstraction.
9. A computer comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method according to any one of claims 1 to 6.
10. A security detection method for software codes is characterized by comprising the following steps:
a) collecting binary software samples;
b) comparing the collected binary software sample with the source code in the open source library with the security vulnerability by adopting the method of any one of claims 1 to 6;
c) determining whether the binary software sample contains the security vulnerability according to the comparison result of the step b).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910249283.6A CN110147235B (en) 2019-03-29 2019-03-29 Semantic comparison method and device between source code and binary code

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910249283.6A CN110147235B (en) 2019-03-29 2019-03-29 Semantic comparison method and device between source code and binary code

Publications (2)

Publication Number Publication Date
CN110147235A (en) 2019-08-20
CN110147235B (en) 2021-01-01

Family

ID=67588648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910249283.6A Active CN110147235B (en) 2019-03-29 2019-03-29 Semantic comparison method and device between source code and binary code

Country Status (1)

Country Link
CN (1) CN110147235B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111078227B (en) * 2019-12-13 2021-08-31 中国科学院信息工程研究所 Binary code and source code similarity analysis method and device based on code characteristics
CN111176993A (en) * 2019-12-24 2020-05-19 中国科学院电子学研究所苏州研究院 Code static detection method based on abstract syntax tree
CN111488155B (en) * 2020-06-15 2020-09-22 完美世界(北京)软件科技发展有限公司 Coloring language translation method
CN112613040A (en) * 2020-12-14 2021-04-06 中国科学院信息工程研究所 Vulnerability detection method based on binary program and related equipment
CN113468525B (en) * 2021-05-24 2023-06-27 中国科学院信息工程研究所 Similar vulnerability detection method and device for binary program
CN114880023B (en) * 2022-07-11 2022-09-30 山东大学 Technical feature oriented source code comparison method, system and program product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9436452B2 (en) * 2012-11-12 2016-09-06 Keysight Technologies, Inc. Method for enforcing language subsets
CN106709356A (en) * 2016-12-07 2017-05-24 西安电子科技大学 Static taint analysis and symbolic execution-based Android application vulnerability discovery method
CN107977575A (en) * 2017-12-20 2018-05-01 北京关键科技股份有限公司 A kind of code-group based on privately owned cloud platform is into analysis system and method
CN108108622A (en) * 2017-12-13 2018-06-01 上海交通大学 Leakage location based on depth convolutional network and controlling stream graph
CN109408389A (en) * 2018-10-30 2019-03-01 北京理工大学 A kind of aacode defect detection method and device based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577324B (en) * 2013-10-30 2017-01-18 北京邮电大学 Static detection method for privacy information disclosure in mobile applications

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Unknown malicious code detection method for massive software; Chen Kai et al.; Journal of Cyber Security (《信息安全学报》); 2016-01-31; Vol. 1, No. 1; pp. 24-38 *

Also Published As

Publication number Publication date
CN110147235A (en) 2019-08-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant