CN108491228B - Binary vulnerability code clone detection method and system

Info

Publication number: CN108491228B
Application number: CN201810267094.7A
Authority: CN
Inventors: 姜宇; 杨鑫; 高健; 顾明; 孙家广
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2018-03-28
Filing date: 2018-03-28
Publication date: 2020-03-17
Anticipated expiration: 2038-03-28
Also published as: CN108491228A

Abstract

The invention provides a binary vulnerability code clone detection method and a system, wherein the method comprises the following steps: extracting function characteristics of a first function in a binary code to be detected and function characteristics of a second function in a binary vulnerability code, wherein the function characteristics comprise basic block information, control flow information and function call information; respectively inputting the function characteristics of the first function and the function characteristics of the second function into a preset neural network, and calculating the similarity of the first function and the second function by using the preset neural network; and when the similarity reaches a preset threshold value, determining that the clone codes of the binary vulnerability codes exist in the binary codes to be detected. The method and the system solve the problems of incomplete clone type detection, low accuracy, high complexity, difficulty in implementation and the like of the conventional code clone detection method, ensure the comprehensiveness of clone type detection and the accuracy of detection results, and effectively improve the detection efficiency.

Description

Binary vulnerability code clone detection method and system

Technical Field

The invention relates to the technical field of computer program detection, in particular to a binary vulnerability code clone detection method and a binary vulnerability code clone detection system.

Background

Since its birth, the software industry has developed rapidly as the number of computer users has grown, and software now pervades every aspect of work and daily life. A great deal of software source code is openly available on the internet, and searching the internet for the code one needs has become a fast and effective way for developers to work. Copying a piece of code and applying it to a new scenario, either verbatim or with simple modifications, is now common practice in software development. This form of code reuse is referred to as code cloning. Large software projects contain many cloned code segments. Even the X Window System, which is recognized in the computer field as a high-quality system, contains 19% cloned code; the core of Linux contains 15%-25% cloned code; and in open-source software projects, more than 50% of the code is reused. For example, 22.3% of the Linux kernel source code is reused code, and a small portion of the code segments are copied at least 8 times; the Linux kernel package in Debian (version 2.6.6) was confirmed to use 145 unpatched vulnerable code segments. If cloned code is vulnerable, related security problems can break out on a large scale, because a published vulnerability may also exist in other software where it has not been completely fixed. Using information about published vulnerabilities, attackers can exploit unpatched clone code and seriously affect software systems. An accurate and efficient method for detecting cloned vulnerable code is therefore urgently needed.

At present, clone code is generally classified into four types according to the textual and functional similarity of the source code: 1) code segments that are identical except for whitespace and comments; 2) syntactically identical code segments that differ only in identifiers, types, whitespace, and comments; 3) copied code segments in which statements have been added, deleted, or modified; 4) code segments that are functionally identical but syntactically different. Some researchers refer to type 1 as exact clones, types 2 and 3 as approximate clones, and type 4 as semantic clones.

Researchers at home and abroad have proposed many clone detection methods and techniques and have developed corresponding clone detection tools. These methods can be broadly divided into text-based, lexical (token-based), grammar-based (syntax-based), and semantic-based methods.

1) Text-based detection methods. These methods compare the source code of the software system directly (only differences in comments and layout are filtered out), without converting the source code into an intermediate representation. Johnson was the first to propose a text-based clone detection technique: the method first hashes code segments of a fixed number of lines, then uses an incremental hash function to identify segments with the same hash value, which are the clone code, and uses a sliding-window technique to find clone code of different lengths.

2) Lexical-based detection methods. These methods (also known as token-based methods) first use a lexical analysis tool (e.g., lex) to convert each line of the source code into a token sequence and concatenate all the sequences into one token string; the token string is then scanned for similar token subsequences, and the source code corresponding to those similar subsequences is reported as clones.

3) Grammar-based detection methods. These methods rest on the observation that similar code segments should also have similar syntactic structures. The program is parsed into a syntax tree, and the source code fragments corresponding to similar subtrees are the clone code. Baxter et al. were the first to apply the abstract syntax tree (AST) technique to clone detection: the source code is first parsed into an annotated syntax tree, the subtrees are then hashed into N buckets, and the subtrees within the same bucket are compared for similarity to obtain the clone code.

4) Semantic-based detection methods. These techniques are mainly represented by the program dependency graph (PDG) method: given a program, a set of PDGs is built from the data-flow and control-dependency relationships among program statements, and the code segments corresponding to isomorphic subgraphs in the set are the clone code. In recent years, researchers have also used dynamic analysis to detect semantically similar code fragments; for example, Jiang et al. of the University of California feed a set of input data to code fragments and compare their outputs to find semantically similar clone code. Marcus et al. use an information retrieval technique (latent semantic indexing) to statically analyze the source code of a software system and detect semantic clones.
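As a rough illustration of the text-based approach in item 1), the following Python sketch hashes every window of a fixed number of normalized lines and reports windows that occur more than once. It is a simplification under stated assumptions: it uses an ordinary SHA-256 digest rather than an incremental hash, handles only `//` comments, and uses one fixed window size; the names `normalize` and `find_text_clones` are hypothetical.

```python
import hashlib
from collections import defaultdict


def normalize(line):
    """Drop a trailing '//' comment and all whitespace (layout differences)."""
    return "".join(line.split("//")[0].split())


def find_text_clones(files, window=3):
    """Return groups of locations whose `window` normalized lines are identical.

    `files` maps a file name to its source text. Each location is a
    (file name, 1-based line number of the window start) pair.
    """
    buckets = defaultdict(list)
    for name, text in files.items():
        # Keep (original line number, normalized text) for non-empty lines only.
        lines = [(i + 1, normalize(raw)) for i, raw in enumerate(text.splitlines())]
        lines = [(number, body) for number, body in lines if body]
        # Slide a fixed-size window over the normalized lines and hash each one.
        for start in range(len(lines) - window + 1):
            chunk = "\n".join(body for _, body in lines[start:start + window])
            digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
            buckets[digest].append((name, lines[start][0]))
    # A hash seen at more than one location marks a candidate clone.
    return [locations for locations in buckets.values() if len(locations) > 1]
```

Because only exact matches after normalization share a hash, a sketch like this finds type 1 clones only, which is consistent with the limitations of text-based methods noted in this section.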

The token-based method can effectively detect type 1 and type 2 clones, has low time and space complexity, does not need to consider whether the program is syntactically correct, and is independent of the source code, but it produces many false detections when processing type 3 clones. The grammar-based method can effectively detect type 1-3 clones, but its time and space complexity is high because the code must be parsed into an AST and similar subtrees must then be searched for. Compared with the grammar-based method, the PDG-based technique analyzes the source code at a higher level to obtain the semantic information of the program, so it can detect some code segments whose statements are reordered but whose semantics are the same. However, building the PDG and searching for isomorphic subgraphs is also very costly, which makes the technique difficult to apply to large-scale software.

Therefore, the existing detection method for code cloning has the problems of incomplete clone type detection, low accuracy, high complexity, difficulty in implementation and the like.

Disclosure of Invention

The invention provides a binary vulnerability code clone detection method and system, aiming at overcoming the problems of incomplete clone type detection, low accuracy, high complexity, difficulty in implementation and the like of the conventional code clone detection method.

In one aspect, the present invention provides a binary vulnerability code clone detection method, including:

extracting function characteristics of a first function in a binary code to be detected and function characteristics of a second function in a binary vulnerability code, wherein the function characteristics comprise basic block information, control flow information and function calling information;

respectively inputting the function characteristics of the first function and the second function into a preset neural network, and calculating the similarity of the first function and the second function by using the preset neural network;

and when the similarity reaches a preset threshold value, determining that the clone codes of the binary vulnerability codes exist in the binary codes to be detected.

Preferably, when the similarity between every first function in the binary code to be detected and the second function is smaller than the preset threshold, it is determined that no clone code of the binary vulnerability code exists in the binary code to be detected.

Preferably, the calculating the similarity between the first function and the second function by using the preset neural network specifically includes:

respectively obtaining a first sub-feature vector corresponding to each basic block in the first function and a second sub-feature vector corresponding to each basic block in the second function by using the preset neural network according to the basic block information of the first function and the basic block information of the second function;

processing all the first sub-feature vectors according to the control flow information and the function call information of the first function to obtain first feature vectors; processing all the second sub-feature vectors according to the control flow information and the function call information of the second function to obtain second feature vectors;

and calculating the cosine values of the included angles of the first characteristic vector and the second characteristic vector, and determining the cosine values of the included angles as the similarity of the first function and the second function.

Preferably, the basic block information of the first function and the second function respectively includes a starting address of each basic block in the first function and the second function and the number of numerical constants corresponding to each basic block, the number of character constants, the number of branch instructions, the number of function calls, the number of instructions, the number of arithmetic instructions, the number of logic instructions, the betweenness centrality, and the number of child nodes.

Preferably, the control flow information of the first function and the second function includes a dependency relationship between each two basic blocks in the first function and the second function, respectively.

Preferably, the function call information of the first function includes a function start address called by the first function and a function start address calling the first function; the function call information of the second function includes a function start address called by the second function and a function start address calling the second function.
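The three kinds of function characteristics listed above could be held in a structure along the following lines. This is a minimal sketch in Python, not a format defined by the invention: the class names, field names, and the ordering of the numeric vector are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BasicBlock:
    """Basic block information; field names are hypothetical."""
    start_address: int                    # identifies the basic block
    num_numeric_constants: int = 0
    num_char_constants: int = 0
    num_branch_instructions: int = 0
    num_calls: int = 0
    num_instructions: int = 0
    num_arithmetic_instructions: int = 0
    num_logic_instructions: int = 0
    betweenness_centrality: float = 0.0   # structural feature
    num_children: int = 0                 # structural feature

    def as_vector(self) -> List[float]:
        """Return the nine numeric attributes in a fixed (assumed) order."""
        return [
            float(self.num_numeric_constants),
            float(self.num_char_constants),
            float(self.num_branch_instructions),
            float(self.num_calls),
            float(self.num_instructions),
            float(self.num_arithmetic_instructions),
            float(self.num_logic_instructions),
            float(self.betweenness_centrality),
            float(self.num_children),
        ]


@dataclass
class FunctionFeatures:
    """Function characteristics: basic blocks, control flow, and call information."""
    blocks: List[BasicBlock] = field(default_factory=list)
    # Dependencies between basic blocks as (from_address, to_address) pairs.
    control_flow: List[Tuple[int, int]] = field(default_factory=list)
    # Start addresses of functions this function calls, and of its callers.
    callee_addresses: List[int] = field(default_factory=list)
    caller_addresses: List[int] = field(default_factory=list)
```

The start address serves only as an identifier, so it is left out of the numeric vector in this sketch.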

Preferably, the method further comprises:

acquiring a first sample function and a second sample function with marks representing clone/non-clone, and extracting the function characteristics of the first sample function and the function characteristics of the second sample function;

constructing the preset neural network, and setting a target error of the preset neural network;

inputting the function characteristics of the first sample function and the function characteristics of the second sample function into the preset neural network respectively, and training the preset neural network;

and when the difference value between the actual output result and the expected output result of the preset neural network is not greater than the target error, finishing the training of the preset neural network.

Preferably, the obtaining of the first sample function and the second sample function with the marker characterizing clone/non-clone includes:

obtaining a plurality of sample functions, and performing cross compilation of different configurations on each sample function to obtain a plurality of homonymous functions and a plurality of non-homonymous functions;

and combining every two homonymous functions into a first sample function and a second sample function with marks for characterizing clones, and combining every two non-homonymous functions into a first sample function and a second sample function with marks for characterizing non-clones.
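The pairing step above can be sketched as follows. The sketch assumes each compiled sample is represented as a (function name, compiled variant) pair and that clone pairs are labeled +1 and non-clone pairs -1; both the representation and the label values are illustrative assumptions, and `build_training_pairs` is a hypothetical name.

```python
from itertools import combinations


def build_training_pairs(samples):
    """Pair up compiled functions and label each pair as clone or non-clone.

    `samples` is a list of (function_name, compiled_variant) tuples, where
    the same source function compiled under different configurations keeps
    the same name. Returns (variant_a, variant_b, label) triples.
    """
    pairs = []
    for (name_a, variant_a), (name_b, variant_b) in combinations(samples, 2):
        # Homonymous functions form a clone pair; all others a non-clone pair.
        label = 1 if name_a == name_b else -1
        pairs.append((variant_a, variant_b, label))
    return pairs
```

For example, two builds of `memcpy` for different architectures would form a clone pair, while a build of `memcpy` and a build of `strlen` would form a non-clone pair.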

In another aspect, the present invention provides a binary vulnerability code clone detection system, including:

the characteristic extraction module is used for extracting the function characteristics of a first function in the binary code to be detected and the function characteristics of a second function in the binary vulnerability code, wherein the function characteristics comprise basic block information, control flow information and function call information;

the similarity calculation module is used for respectively inputting the function characteristics of the first function and the function characteristics of the second function into a preset neural network and calculating the similarity of the first function and the second function by using the preset neural network;

and the clone detection module is used for determining that the clone codes of the binary vulnerability codes exist in the binary codes to be detected when the similarity reaches a preset threshold value.

In yet another aspect, the present invention provides a device for performing the binary vulnerability code clone detection method, including:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein:

the memory stores program instructions executable by the processor, and the processor is capable of performing any of the methods described above by invoking the program instructions.

The invention provides a binary vulnerability code clone detection method and system, which comprises the steps of extracting the function characteristics of a first function in a binary code to be detected and the function characteristics of a second function in the binary vulnerability code; respectively inputting the function characteristics of the first function and the second function into a preset neural network, and calculating the similarity of the first function and the second function by using the preset neural network; and when the similarity reaches a preset threshold value, determining that the clone codes of the binary vulnerability codes exist in the binary codes to be detected. The method and the system can accurately realize the clone detection of the vulnerability code on the basis of the binary system, do not need to obtain a source code, and have universal applicability; meanwhile, the code is input into the neural network with the basic block as the fine granularity for deep learning, so that the clone detection of the code is realized, the problems of incomplete clone type detection, low accuracy, high complexity, difficulty in realization and the like of the conventional code clone detection method are solved, the comprehensiveness of clone type detection and the accuracy of a detection result are ensured, and the detection efficiency is effectively improved.

Drawings

Fig. 1 is a schematic overall flow chart of a binary vulnerability code clone detection method according to an embodiment of the present invention;

Fig. 2 is a schematic overall flow chart of a function similarity calculation method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram illustrating a calculation process of a feature vector of a function in a neural network according to an embodiment of the present invention;

Fig. 4 is a schematic overall flowchart of a training process of a neural network according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of an overall structure of a binary vulnerability code clone detection system according to an embodiment of the present invention;

Fig. 6 is a schematic structural framework diagram of a device of a binary vulnerability code clone detection method according to an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

It should be noted that most existing research on clone code analysis is based on source code, and comparatively little is based on binary code. In some cases, however, the source code cannot be obtained; for example, most commercial software does not release its source code. Similarity detection on binary files is therefore very important. In view of this, the present invention provides a binary vulnerability code clone detection method, which includes: extracting function characteristics of a first function in a binary code to be detected and function characteristics of a second function in a binary vulnerability code, wherein the function characteristics comprise basic block information, control flow information and function call information; respectively inputting the function characteristics of the first function and the function characteristics of the second function into a preset neural network, and calculating the similarity of the first function and the second function by using the preset neural network; and when the similarity reaches a preset threshold value, determining that the clone codes of the binary vulnerability codes exist in the binary codes to be detected.

Referring to fig. 1, fig. 1 is a schematic overall flow chart of a binary vulnerability code clone detection method according to an embodiment of the present invention, as shown in fig. 1, the binary vulnerability code clone detection method includes:

S101, extracting function characteristics of a first function in a binary code to be detected and function characteristics of a second function in a binary vulnerability code, wherein the function characteristics comprise basic block information, control flow information and function call information;

It should be noted that the detection method in this embodiment operates on binary code; when the code to be detected is source code, it must first be converted into binary code. In addition, existing vulnerability code generally consists of a single function. Therefore, when the code to be detected includes a plurality of functions, the functions should first be separated and then detected one by one. Likewise, because existing public vulnerability code libraries contain many different pieces of vulnerability code, checking the code to be detected against all of them requires each piece of vulnerability code to be checked separately. The following takes the detection process for one function in the code to be detected as an example:

First, the binary code to be detected and the binary vulnerability code are obtained through a compiler. A binary code disassembling tool is used to extract a function from the binary code to be detected, which is taken as the first function, and to extract a function from the binary vulnerability code, which is taken as the second function. The binary code disassembling tool is also used to extract the function characteristics of the first function and of the second function. The function characteristics comprise basic block information, control flow information and function call information: the basic block information describes the basic blocks contained in a function, the control flow information describes the dependency relationships among the basic blocks in a function, and the function call information describes which other functions the function calls and which other functions call it. A function can thus be fully described by its basic block information, control flow information and function call information.

S102, respectively inputting the function characteristics of the first function and the second function into a preset neural network, and calculating the similarity of the first function and the second function by using the preset neural network;

specifically, on the basis of obtaining the function characteristics of the first function and the second function, the function characteristics of the first function and the second function are respectively input into a preset neural network, wherein the preset neural network is trained in advance, the preset neural network calculates the similarity between the first function and the second function according to the function characteristics of the first function and the function characteristics of the second function, and outputs the calculation result of the similarity through an output layer of the preset neural network.

S103, when the similarity reaches a preset threshold value, determining that the clone codes of the binary vulnerability codes exist in the binary codes to be detected.

Specifically, on the basis of the similarity between the first function and the second function obtained by the calculation, the similarity obtained by the calculation is compared with a preset threshold value, and the preset threshold value is a preset critical value of the similarity. When the similarity obtained by calculation reaches a preset threshold value, the clone code of the binary vulnerability code in the binary code to be detected can be determined.

The invention provides a binary vulnerability code clone detection method, which comprises the steps of extracting the function characteristics of a first function in a binary code to be detected and the function characteristics of a second function in the binary vulnerability code; respectively inputting the function characteristics of the first function and the second function into a preset neural network, and calculating the similarity of the first function and the second function by using the preset neural network; and when the similarity reaches a preset threshold value, determining that the clone codes of the binary vulnerability codes exist in the binary codes to be detected. The method can accurately realize the clone detection of the vulnerability code on the basis of the binary system, does not need to obtain a source code, and has universal applicability; meanwhile, the code is input into the neural network with the basic block as the fine granularity for deep learning, so that the clone detection of the code is realized, the problems of incomplete clone type detection, low accuracy, high complexity, difficulty in realization and the like of the conventional code clone detection method are solved, the comprehensiveness of clone type detection and the accuracy of a detection result are ensured, and the detection efficiency is effectively improved.

Based on any one of the above embodiments, a binary vulnerability code clone detection method is provided in which, when the similarity between every first function in the binary code to be detected and the second function is smaller than the preset threshold, it is determined that no clone code of the binary vulnerability code exists in the binary code to be detected.

Specifically, when the binary code to be detected includes a plurality of first functions, the method in the above embodiment should be used to calculate the similarity between each first function in the binary code to be detected and the second function in the binary vulnerability code. When the similarity between every first function in the binary code to be detected and the second function in the binary vulnerability code is smaller than the preset threshold, it is determined that no clone code of the binary vulnerability code exists in the binary code to be detected.
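The decision over a plurality of first functions can be sketched as a simple loop. In this Python sketch the `similarity` callable stands in for the preset neural network, and the default threshold of 0.8 is an arbitrary placeholder; neither the function name `detect_clones` nor these values come from the embodiment.

```python
def detect_clones(first_functions, second_function, similarity, threshold=0.8):
    """Compare every first function against one vulnerable second function.

    `similarity(a, b)` is assumed to return a score for a pair of functions.
    Returns the (function, score) pairs whose score reaches the threshold;
    an empty list means no clone of the vulnerability code was found.
    """
    matches = []
    for func in first_functions:
        score = similarity(func, second_function)
        if score >= threshold:  # the similarity "reaches" the preset threshold
            matches.append((func, score))
    return matches
```

To check against a whole vulnerability code library, the same loop would be repeated once per vulnerable function.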

According to the binary vulnerability code clone detection method provided by the invention, when the similarity between every first function in the binary code to be detected and the second function is smaller than the preset threshold, it is determined that no clone code of the binary vulnerability code exists in the binary code to be detected. The method can accurately realize the clone detection of the vulnerability code on the basis of the binary system, does not need to obtain a source code, and has universal applicability; meanwhile, the code is input into the neural network at the fine granularity of basic blocks for deep learning, so that the clone detection of the code is realized, the problems of incomplete clone type detection, low accuracy, high complexity, difficulty in realization and the like of the conventional code clone detection method are solved, the comprehensiveness of clone type detection and the accuracy of a detection result are ensured, and the detection efficiency is effectively improved.

Based on any of the above embodiments, with reference to fig. 2, the step S102 of calculating the similarity between the first function and the second function by using a preset neural network specifically includes:

S1021, respectively obtaining a first sub-feature vector corresponding to each basic block in the first function and a second sub-feature vector corresponding to each basic block in the second function according to the basic block information of the first function and the basic block information of the second function by using a preset neural network;

specifically, function characteristics of the first function and function characteristics of the second function are input into the preset neural network through an input layer of the preset neural network, wherein the function characteristics comprise basic block information, control flow information and function call information. On the basis, the preset neural network receives basic block information of the first function, the basic block information of the first function comprises a starting address of each basic block in the first function, the starting address is used for identifying each basic block, and the unstructured features and the structured features of each basic block. The preset neural network obtains a first sub-feature vector corresponding to each basic block in the first function according to the basic block information of the first function. Similarly, the preset neural network receives basic block information of the second function, wherein the basic block information of the second function comprises a starting address of each basic block in the second function, the starting address is used for identifying each basic block, and the unstructured feature and the structured feature of each basic block. And the preset neural network obtains a second sub-feature vector corresponding to each basic block in the second function according to the basic block information of the second function.

S1022, processing all the first sub-feature vectors according to the control flow information and the function call information of the first function to obtain first feature vectors; processing all second sub-feature vectors according to the control flow information and the function call information of the second function to obtain second feature vectors;

specifically, the obtained first sub-feature vector corresponding to each basic block in the first function and the obtained second sub-feature vector corresponding to each basic block in the second function are input into a hidden layer and a full connection layer of a preset neural network, and the hidden layer and the full connection layer respectively process all the first sub-feature vectors according to the input control flow information and function call information of the first function to obtain a first feature vector; and the hidden layer and the full connection layer respectively process all the second sub-feature vectors according to the input control flow information and function call information of the second function to obtain second feature vectors.

When processing all the first sub-feature vectors, the neurons of the hidden layer determine the dependency relationships among the first sub-feature vectors according to the control flow information of the first function. If no other sub-feature vector depends on the sub-feature vector corresponding to a given neuron, then, when processing that sub-feature vector, the neuron only needs to multiply the value of its corresponding neuron in the previous hidden layer by a preset first coefficient matrix. If some sub-feature vector does depend on the sub-feature vector corresponding to the neuron, then, in addition to multiplying the value of its corresponding neuron in the previous hidden layer by the preset first coefficient matrix, the neuron adds the product of a preset second coefficient matrix and the value, in the previous hidden layer, of the neuron corresponding to the dependent sub-feature vector.

Based on the principle, after all the first sub-feature vectors are processed by the hidden layer, all the processed first sub-feature vectors are integrated and then input into the full connection layer, the full connection layer determines corresponding function call sub-feature vectors according to function call information of the first function, and finally, all the integrated first sub-feature vectors and the corresponding function call sub-feature vectors are spliced to obtain the first feature vectors corresponding to the first function. Similarly, after all the second sub-feature vectors are processed by the hidden layer, all the processed second sub-feature vectors are integrated and then input into the full connection layer, the full connection layer determines corresponding function call sub-feature vectors according to the function call information of the second function, and finally, all the integrated second sub-feature vectors and the corresponding function call sub-feature vectors are spliced to obtain the second feature vectors corresponding to the second function.
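The layer-by-layer update and the final splicing described above can be sketched in plain Python as follows. This is only one reading of the text: it assumes that "integrating" the processed sub-feature vectors means summing them element-wise, it omits any activation function, and it starts from whatever per-block vectors are supplied. The names `propagate`, `function_vector`, `w1`, and `w2` are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def vadd(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def propagate(u, dependents, w1, w2, layers):
    """Apply the hidden-layer update `layers` times.

    `u` maps a basic block id to its current vector. `dependents[v]` lists
    the blocks whose sub-feature vectors depend on block v (from control flow).
    """
    for _ in range(layers):
        nxt = {}
        for block, vec in u.items():
            # Own value from the previous layer times the first coefficient matrix.
            new = matvec(w1, vec)
            # Plus each dependent block's previous value times the second matrix.
            for other in dependents.get(block, []):
                new = vadd(new, matvec(w2, u[other]))
            nxt[block] = new
        u = nxt
    return u


def function_vector(u, call_vector):
    """Sum all block vectors (assumed integration) and append the call vector."""
    dim = len(next(iter(u.values())))
    pooled = [sum(vec[i] for vec in u.values()) for i in range(dim)]
    return pooled + list(call_vector)
```

Keeping the previous layer's values in `u` while building `nxt` ensures every neuron in a layer is computed from the same previous layer, as the description requires.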

And S1023, calculating the cosine values of the included angles of the first characteristic vector and the second characteristic vector, and determining the cosine values of the included angles as the similarity of the first function and the second function.

Specifically, after the first feature vector corresponding to the first function and the second feature vector corresponding to the second function are obtained, the full connection layer of the preset neural network further calculates the cosine of the angle between the first feature vector and the second feature vector. The calculated cosine value is then output through the output layer of the preset neural network and is taken as the similarity between the first function and the second function. The similarity lies in the range [-1, 1]: when the similarity of the first function and the second function is close to 1, the two functions can be determined to be clone functions, and when the similarity is close to -1, the two functions can be determined to be non-clone functions.
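The cosine of the angle between two feature vectors is the standard quantity dot(a, b) / (|a| |b|). A minimal Python sketch is given below; the function name and the choice to raise an error for a zero vector are assumptions, since the embodiment does not say how that case is handled.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score 1, opposite vectors score -1, and orthogonal vectors score 0, regardless of their lengths.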

The invention provides a binary vulnerability code clone detection method, which comprises the steps of respectively obtaining a first sub-feature vector corresponding to each basic block in a first function and a second sub-feature vector corresponding to each basic block in a second function by utilizing a preset neural network according to basic block information of the first function and basic block information of the second function; processing all the first sub-feature vectors according to the control flow information and the function call information of the first function to obtain first feature vectors; processing all second sub-feature vectors according to the control flow information and the function call information of the second function to obtain second feature vectors; and calculating the cosine value of the included angle between the first characteristic vector and the second characteristic vector, and determining the cosine value of the included angle as the similarity of the first function and the second function. The method aims at deep learning of codes which are input into a neural network with fine granularity of basic blocks, so that the clone detection of the codes is realized, the problems that the clone type detection is not comprehensive, the accuracy is low, the complexity is high, the method is not easy to realize and the like in the conventional code clone detection method are solved, the comprehensiveness of the clone type detection and the accuracy of a detection result are ensured, and the detection efficiency is effectively improved.

To facilitate an understanding of the method in any of the embodiments, the following examples are now described:

Fig. 3 is a schematic structural diagram of the calculation process of the feature vector of a function in the neural network according to an embodiment of the present invention; the figure takes the calculation of the feature vector of a single function as an example. The function in the figure has 3 basic blocks (X₁, X₂, X₃), and the network in the figure includes 7 hidden layers. First, the three neurons μ₁⁰, μ₂⁰, μ₃⁰ in the first hidden layer are initialized, and their initial values are 64-dimensional all-0 vectors. The subscripts of the neurons correspond one to one to the subscripts of the basic blocks; that is, the first-layer hidden neuron corresponding to basic block X₁ is μ₁⁰.

Then, the values of the neurons of the next hidden layer are calculated according to the formula shown below.

[formula not reproduced in the source text]

In the above formula, one symbol (not reproduced in the source text) denotes the subscripts of the nodes on which basic block X_i depends; μ_i^t denotes the t-th layer hidden unit corresponding to basic block X_i; P(i) denotes the nodes that call the function; S(i) denotes the nodes called by the function; W₁, W₂, W₃, P₁, P₂ are parameter matrices of the preset neural network; x_i denotes the sub-feature vector of basic block X_i; and the operator written with the argument (·) is the activation function.

As can be seen from the above, in Fig. 3 the neurons of the second hidden layer are obtained as follows:

[formulas not reproduced in the source text]

The calculation proceeds by analogy, and the μ finally obtained is the feature vector corresponding to the function. In this embodiment, μ is a 64-dimensional feature vector.

Similarly, the feature vector of another function is calculated according to the above calculation method, and finally the cosine value of the included angle between the feature vectors of the two functions is calculated, so that the similarity of the two functions can be determined, and further, whether the two functions are clone functions can be determined.
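The layer-by-layer calculation described above can be sketched as follows. Because the update formula itself is not reproduced in this text, the rule used here is an assumption chosen only for illustration: each block vector becomes the tanh of a projection of its sub-feature vector plus the sum of the previous-layer vectors of the blocks it depends on, and the function vector is the sum of all block vectors. The sketch omits W₂, W₃, P₁, P₂ and the function call term, and the names embed_function, depends_on, w1 and num_updates are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def embed_function(block_features: Dict[int, Vector],
                   depends_on: Dict[int, List[int]],
                   w1: List[Vector],
                   num_updates: int) -> Vector:
    """Propagate block vectors through num_updates layers (assumed update rule)."""
    dim = len(w1)
    # First hidden layer: every neuron starts as an all-zero vector.
    mu = {i: [0.0] * dim for i in block_features}
    for _ in range(num_updates):
        nxt = {}
        for i, x in block_features.items():
            # Sum the previous-layer vectors of the blocks that block i depends on.
            agg = [0.0] * dim
            for j in depends_on.get(i, []):
                agg = [a + m for a, m in zip(agg, mu[j])]
            proj = matvec(w1, x)
            nxt[i] = [math.tanh(p + a) for p, a in zip(proj, agg)]
        mu = nxt
    # Assumption: the function vector is the sum of the final block vectors.
    total = [0.0] * dim
    for vec in mu.values():
        total = [t + v for t, v in zip(total, vec)]
    return total
```

In the embodiment the vectors are 64-dimensional; the sketch takes the dimension from the number of rows of w1 so that small examples stay readable.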

The method for detecting cloning of binary vulnerability codes is provided based on any one of the above embodiments, where the basic block information of the first function and the second function respectively includes the starting address of each basic block in the first function and the second function and the number of numerical constants corresponding to each basic block, the number of character constants, the number of branch instructions, the number of function calls, the number of instructions, the number of arithmetic instructions, the number of logic instructions, the betweenness centrality, and the number of child nodes.

Specifically, the basic block information of the first function includes a start address of each basic block in the first function, and the start address of each basic block is an identifier of the basic block and is used for uniquely determining the basic block. In addition, the basic block information of the first function further includes 7 unstructured features of the number of numerical constants, the number of character constants, the number of branch instructions, the number of function calls, the number of instructions, the number of arithmetic instructions and the number of logic instructions corresponding to each basic block, and 2 structured features of the betweenness centrality and the number of child nodes corresponding to each basic block. The 7 unstructured features and the 2 structured features combine to form a 9-dimensional vector for each basic block.

Similarly, the basic block information of the second function includes a start address of each basic block in the second function, and the start address of each basic block is an identifier of the basic block and is used for uniquely determining the basic block. In addition, the basic block information of the second function further includes 7 unstructured features of the number of numerical constants, the number of character constants, the number of branch instructions, the number of function calls, the number of instructions, the number of arithmetic instructions and the number of logic instructions corresponding to each basic block, and 2 structured features of the betweenness centrality and the number of child nodes corresponding to each basic block. The 7 unstructured features and the 2 structured features combine to form a 9-dimensional vector for each basic block.

As can be seen from the above, the preset neural network can obtain the first sub-feature vector corresponding to each basic block in the first function and the second sub-feature vector corresponding to each basic block in the second function according to the basic block information of the first function and the basic block information of the second function, where the first sub-feature vector and the second sub-feature vector are the 9-dimensional vectors.
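The 9-dimensional basic block vector described above can be sketched as a small record type. The class name BasicBlockInfo, the field names, and the order of the entries in the vector are assumptions made for illustration; the text only fixes which 7 unstructured and 2 structured features are used and that the start address serves as an identifier.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BasicBlockInfo:
    start_address: int             # identifier only, not part of the vector
    numeric_constants: int         # 7 unstructured features follow
    character_constants: int
    branch_instructions: int
    function_calls: int
    instructions: int
    arithmetic_instructions: int
    logic_instructions: int
    betweenness_centrality: float  # 2 structured features follow
    child_nodes: int

    def to_vector(self) -> List[float]:
        """Return the 9-dimensional vector (assumed ordering)."""
        return [
            float(self.numeric_constants),
            float(self.character_constants),
            float(self.branch_instructions),
            float(self.function_calls),
            float(self.instructions),
            float(self.arithmetic_instructions),
            float(self.logic_instructions),
            float(self.betweenness_centrality),
            float(self.child_nodes),
        ]
```

Keeping the start address outside the vector reflects its role as an identifier of the basic block rather than a feature.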

According to the binary vulnerability code clone detection method provided by the invention, the basic block information of the first function and the basic block information of the second function are obtained, the basic blocks are used as fine granularity to be input into the neural network for deep learning, the clone detection of the codes is further realized, the problems of incomplete clone type detection, low accuracy, high complexity, difficulty in realization and the like of the conventional code clone detection method are solved, the comprehensiveness of clone type detection and the accuracy of a detection result are ensured, and the detection efficiency is effectively improved.

Based on any of the above embodiments, a binary vulnerability code clone detection method is provided, where control flow information of a first function and a second function respectively includes a dependency relationship between every two basic blocks in the first function and the second function.

Specifically, the control flow information of the first function includes a dependency relationship between every two basic blocks in the first function, and the control flow information of the second function includes a dependency relationship between every two basic blocks in the second function. The preset neural network respectively performs calculation processing according to the control flow information of the first function and the control flow information of the second function, and finally outputs a first feature vector corresponding to the first function and a second feature vector corresponding to the second function. In addition, the dependency relationship may be represented in a form of a dependency topology diagram or an array, and may be set according to actual requirements, which is not specifically limited herein.
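As one possible array representation of the dependency relationship, the sketch below builds a 0/1 matrix from a list of basic block start addresses and a list of edges. The function names and the convention that an edge (src, dst) means control may pass from block src to block dst are assumptions for illustration.

```python
from typing import List, Tuple


def build_dependency_matrix(block_addresses: List[int],
                            edges: List[Tuple[int, int]]) -> List[List[int]]:
    """Return an n x n matrix with 1 where an edge src -> dst exists."""
    index = {addr: k for k, addr in enumerate(block_addresses)}
    n = len(block_addresses)
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        matrix[index[src]][index[dst]] = 1
    return matrix


def child_counts(matrix: List[List[int]]) -> List[int]:
    """Number of direct successors (child nodes) of each block."""
    return [sum(row) for row in matrix]
```

The row sums of the matrix give the number of child nodes of each basic block, one of the two structured features mentioned for the basic block information.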

According to the binary vulnerability code clone detection method provided by the invention, the preset neural network is facilitated to respectively carry out calculation processing according to the control flow information of the first function and the control flow information of the second function by acquiring the control flow information of the first function and the control flow information of the second function, and finally, the first eigenvector corresponding to the first function and the second eigenvector corresponding to the second function are output, so that the clone detection of the code is realized, the problems that the clone type detection is incomplete, the accuracy is low, the complexity is high, the realization is difficult and the like in the conventional code clone detection method are solved, the comprehensiveness of the clone type detection and the accuracy of the detection result are ensured, and the detection efficiency is effectively improved.

Based on any one of the embodiments, a binary vulnerability code clone detection method is provided, wherein function call information of a first function includes a function start address called by the first function and a function start address calling the first function; the function call information of the second function includes a function start address called by the second function and a function start address calling the second function.

Specifically, the function call information of the first function includes the start address of a function called by the first function and the start address of a function that calls the first function, so that the calling and called relationships between the first function and other functions can be determined. The function call information of the second function includes the start address of a function called by the second function and the start address of a function that calls the second function, so that the calling and called relationships between the second function and other functions can be determined. The start address of a function is the unique identifier of the function and is used for uniquely determining that function. The preset neural network performs calculation processing according to the function call information of the first function and the function call information of the second function respectively, and finally outputs a first feature vector corresponding to the first function and a second feature vector corresponding to the second function.
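The calling and called relationships keyed by function start address can be sketched as two lookup tables. The function name build_call_info and the input format, a list of (caller start address, callee start address) pairs, are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_call_info(call_edges: Iterable[Tuple[int, int]]
                    ) -> Tuple[Dict[int, List[int]], Dict[int, List[int]]]:
    """Return (callees, callers), each mapping a start address to sorted addresses."""
    callees = defaultdict(set)  # function -> functions it calls
    callers = defaultdict(set)  # function -> functions that call it
    for caller_addr, callee_addr in call_edges:
        callees[caller_addr].add(callee_addr)
        callers[callee_addr].add(caller_addr)
    return ({a: sorted(s) for a, s in callees.items()},
            {a: sorted(s) for a, s in callers.items()})
```

For a given function, callees holds the start addresses of the functions it calls and callers holds the start addresses of the functions that call it.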

According to the binary vulnerability code clone detection method provided by the invention, the function call information of the first function and the function call information of the second function are obtained, the preset neural network is facilitated to respectively carry out calculation processing according to the function call information of the first function and the function call information of the second function, and finally, the first feature vector corresponding to the first function and the second feature vector corresponding to the second function are output, so that the clone detection of the code is realized, the problems that the clone type detection is incomplete, the accuracy is low, the complexity is high, the realization is difficult and the like in the conventional code clone detection method are solved, the comprehensiveness of the clone type detection and the accuracy of the detection result are ensured, and the detection efficiency is effectively improved.

Based on any of the above embodiments, a binary vulnerability code clone detection method is provided, referring to fig. 4, the method further includes a step of training a preset neural network, and the specific process is as follows:

S401, acquiring a first sample function and a second sample function with marks representing clone/non-clone, and extracting the function characteristics of the first sample function and the second sample function;

Specifically, obtaining the first sample function and the second sample function with marks representing clone/non-clone includes obtaining a first sample function and a second sample function marked as clones and a first sample function and a second sample function marked as non-clones. A first sample function and a second sample function marked as clones are clone functions of each other; a first sample function and a second sample function marked as non-clones are not clone functions of each other. In this embodiment, a plurality of groups of first sample functions and second sample functions with marks representing clone/non-clone may be obtained to form a plurality of training samples for training the preset neural network. The number of training samples may be set according to actual requirements and is not specifically limited herein.

And extracting the function characteristics of the first sample function and the function characteristics of the second sample function by using a binary code disassembling tool. The function characteristics comprise basic block information, control flow information and function calling information, wherein the basic block information is related information of basic blocks contained in one function, the control flow information is a dependency relationship among the basic blocks in one function, and the function calling information is the condition that one function calls other functions and the function is called by other functions. It can be seen that a function can be fully described by basic block information, control flow information and function call information.

S402, constructing a preset neural network, and setting a target error of the preset neural network;

specifically, a model of the neural network is constructed, and in this embodiment, the model of the neural network includes an input layer, a hidden layer, a fully-connected layer, and an output layer. In other embodiments, the model of the preset neural network may be constructed according to actual requirements, and is not specifically limited herein. On the basis of constructing the preset neural network, a target error of the preset neural network is set, wherein the target error is a target value of an error between an actual output value and an expected output value of the preset neural network. The target error value may be set according to actual requirements, and is not specifically limited herein.

S403, respectively inputting the function characteristics of the first sample function and the function characteristics of the second sample function into a preset neural network, and training the preset neural network;

specifically, on the basis of obtaining the function characteristic of the first sample function and the function characteristic of the second sample function, the function characteristic of the first sample function and the function characteristic of the second sample function are respectively input into the preset neural network, and the preset neural network is trained.

S404, when the difference value between the actual output result and the expected output result of the preset neural network is not larger than the target error, the preset neural network training is finished.

Specifically, in the process of training the preset neural network by using the first sample function and the second sample function, when the difference between the actual output result and the expected output result of the preset neural network is not greater than the target error, the preset neural network training is ended.
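The stopping rule of step S404 can be sketched as a loop that keeps training until the error is no larger than the target error. The callable step, which stands for one training pass returning the current difference between actual and expected output, and the safety cap max_epochs are assumptions added for the example; the text does not specify either.

```python
from typing import Callable, Tuple


def train_until_target(step: Callable[[], float],
                       target_error: float,
                       max_epochs: int = 1000) -> Tuple[int, float]:
    """Run training passes until the error is not larger than target_error."""
    error = float("inf")
    for epoch in range(1, max_epochs + 1):
        error = step()  # one pass over the training samples (hypothetical)
        if error <= target_error:
            # Training ends as soon as the target error is reached.
            return epoch, error
    # Assumption: give up after max_epochs passes if the target is never reached.
    return max_epochs, error
```

The return value reports how many passes were run and the last observed error, which makes it easy to see whether the target error was actually reached.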

The invention provides a binary vulnerability code clone detection method, which comprises the steps of obtaining a first sample function and a second sample function with a mark for representing clone/non-clone, and extracting the function characteristics of the first sample function and the second sample function; constructing a preset neural network, and setting a target error of the preset neural network; respectively inputting the function characteristics of the first sample function and the function characteristics of the second sample function into a preset neural network, and training the preset neural network; and when the difference value between the actual output result and the expected output result of the preset neural network is not larger than the target error, finishing the training of the preset neural network. The method is beneficial to realizing clone detection of the code by utilizing the trained preset neural network by training the preset neural network, solves the problems of incomplete clone type detection, low accuracy, high complexity, difficulty in realization and the like of the traditional code clone detection method, ensures the comprehensiveness of clone type detection and the accuracy of a detection result, and effectively improves the detection efficiency.

Based on any one of the above embodiments, a binary vulnerability code clone detection method is provided, in which a first sample function and a second sample function with a mark representing clone/non-clone are obtained, and the method specifically includes:

every two homonymic functions are combined into a first sample function and a second sample function with the marks representing the clones, and every two non-homonymic functions are combined into a first sample function and a second sample function with the marks representing the non-clones.

It should be noted that, since the binary codes obtained by compiling the same function with different compilers may differ in form, appropriate training samples need to be selected in order to ensure the accuracy of the preset neural network. In this embodiment, the first sample function and the second sample function with marks representing clone/non-clone are obtained as follows:

specifically, in the process of obtaining the first sample function and the second sample function with the mark representing clone/non-clone, a plurality of sample functions are obtained first, and the number of sample functions may be set according to actual requirements, which is not specifically limited herein. And performing cross compilation of different configurations on each sample function to obtain a plurality of homonymous functions and a plurality of non-homonymous functions. Although the binary codes obtained by the same function under compilation by different compilers may differ in form, their corresponding functions have the same name. Namely, when the function names of the two functions are the same, the two functions can be determined to be clone functions; when the function names of the two functions are not the same, the two functions may be determined to be non-clonal functions.

Further, on the basis of obtaining the plurality of homonymous functions and the plurality of non-homonymous functions, every two homonymous functions are combined into a first sample function and a second sample function with marks for representing clones, and every two non-homonymous functions are combined into a first sample function and a second sample function with marks for representing non-clones.
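The pairing of homonymous and non-homonymous functions can be sketched as follows. Each compiled function is represented here only by its name and a label for the build configuration; the labels 1 for a clone pair and -1 for a non-clone pair, the function name build_training_pairs, and the configuration labels are assumptions for illustration.

```python
from itertools import combinations
from typing import List, Tuple

# (function name, build configuration label); both fields are illustrative.
CompiledFunction = Tuple[str, str]


def build_training_pairs(functions: List[CompiledFunction]
                         ) -> List[Tuple[CompiledFunction, CompiledFunction, int]]:
    """Pair every two functions; same name -> clone (1), otherwise non-clone (-1)."""
    pairs = []
    for first, second in combinations(functions, 2):
        label = 1 if first[0] == second[0] else -1
        pairs.append((first, second, label))
    return pairs
```

In practice each entry would also carry the extracted function characteristics; they are left out so that the pairing rule itself stays visible.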

After the obtained first sample function and the second sample function with the mark representing clone/non-clone are adopted to train the preset neural network, even if two binary codes are different forms of binary codes obtained by the same function under the compiling of different compilers, the preset neural network can identify the two binary codes as clone codes.

According to the binary vulnerability code clone detection method provided by the invention, the preset neural network is trained by acquiring the appropriate first sample function and second sample function with the mark representing clone/non-clone, so that the preset neural network can identify different forms of binary codes obtained by the same function under the compiling of different compilers, the accuracy of the preset neural network is favorably ensured, the clone detection of the codes is favorably realized by utilizing the trained preset neural network, the problems that the clone type detection is incomplete, the accuracy is low, the complexity is high, the realization is difficult and the like in the existing code clone detection method are solved, the comprehensiveness of the clone type detection and the accuracy of the detection result are ensured, and the detection efficiency is effectively improved.

Fig. 5 is a schematic diagram of an overall structure of a binary vulnerability code clone detection system according to an embodiment of the present invention, and as shown in fig. 5, the present invention provides a binary vulnerability code clone detection system, which includes a feature extraction module 1, a similarity calculation module 2, and a clone detection module 3, and the binary vulnerability code clone detection method according to any of the above embodiments is implemented through cooperation of the modules, and is specifically implemented as follows:

The feature extraction module 1 is used for extracting function characteristics of a first function in a binary code to be detected and function characteristics of a second function in a binary vulnerability code, wherein the function characteristics comprise basic block information, control flow information and function call information;

Specifically, the binary code to be detected and the binary vulnerability code are first obtained by using a compiler. The feature extraction module 1 then uses a binary code disassembling tool to extract a function from the binary code to be detected, which serves as the first function, and likewise extracts a function from the binary vulnerability code, which serves as the second function. The feature extraction module 1 also uses the binary code disassembling tool to extract the function characteristics of the first function and the function characteristics of the second function. The function characteristics comprise basic block information, control flow information and function call information, wherein the basic block information is related information of the basic blocks contained in a function, the control flow information is the dependency relationship among the basic blocks in a function, and the function call information describes which other functions a function calls and which other functions call it. It can be seen that a function can be fully described by basic block information, control flow information and function call information.

The similarity calculation module 2 is used for respectively inputting the function characteristics of the first function and the function characteristics of the second function into a preset neural network, and calculating the similarity of the first function and the second function by using the preset neural network;

specifically, on the basis of obtaining the function feature of the first function and the function feature of the second function, the similarity calculation module 2 is used for inputting the function feature of the first function and the function feature of the second function into a preset neural network respectively, wherein the preset neural network is trained in advance, the preset neural network calculates the similarity between the first function and the second function according to the function feature of the first function and the function feature of the second function, and outputs the calculation result of the similarity through an output layer of the preset neural network.

And the clone detection module 3 is used for determining that the clone codes of the binary vulnerability codes exist in the binary codes to be detected when the similarity reaches a preset threshold value.

Specifically, on the basis of the similarity between the first function and the second function obtained through the calculation, the similarity obtained through the calculation is compared with a preset threshold value by using the clone detection module 3, wherein the preset threshold value is a preset critical value of the similarity. When the similarity obtained by calculation reaches a preset threshold value, the clone code of the binary vulnerability code in the binary code to be detected can be determined.

It should be noted that, when the binary code to be detected includes a plurality of first functions, the method in the foregoing embodiment should be used to calculate the similarity between each first function in the binary code to be detected and the second function in the binary vulnerability code. When the similarities between all the first functions in the binary code to be detected and the second function in the binary vulnerability code are smaller than the preset threshold value, it is determined that no clone code of the binary vulnerability code exists in the binary code to be detected.
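The threshold decision over a plurality of first functions can be sketched as follows. The input is assumed to be a mapping from the start address of each first function to its similarity with the second function; the function names are hypothetical, and "reaches the threshold" is read here as greater than or equal to it.

```python
from typing import Dict, List


def find_clone_candidates(similarities: Dict[int, float],
                          threshold: float) -> List[int]:
    """Start addresses of first functions whose similarity reaches the threshold."""
    return sorted(addr for addr, sim in similarities.items() if sim >= threshold)


def contains_vulnerability_clone(similarities: Dict[int, float],
                                 threshold: float) -> bool:
    """True if at least one first function reaches the threshold."""
    return len(find_clone_candidates(similarities, threshold)) > 0
```

If every similarity is below the threshold, the candidate list is empty and the binary code to be detected is reported as containing no clone of the vulnerability code.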

The invention provides a binary vulnerability code clone detection system, which extracts the function characteristics of a first function in a binary code to be detected and the function characteristics of a second function in the binary vulnerability code; respectively inputting the function characteristics of the first function and the second function into a preset neural network, and calculating the similarity of the first function and the second function by using the preset neural network; and when the similarity reaches a preset threshold value, determining that the clone codes of the binary vulnerability codes exist in the binary codes to be detected. The system can accurately realize the clone detection of the vulnerability code on the basis of the binary system, does not need to obtain a source code, and has universal applicability; meanwhile, the code is input into the neural network with the basic block as the fine granularity for deep learning, so that the clone detection of the code is realized, the problems of incomplete clone type detection, low accuracy, high complexity, difficulty in realization and the like of the conventional code clone detection method are solved, the comprehensiveness of clone type detection and the accuracy of a detection result are ensured, and the detection efficiency is effectively improved.

Fig. 6 shows a block diagram of a device of a binary vulnerability code clone detection method according to an embodiment of the present invention. Referring to fig. 6, the apparatus of the binary vulnerability code clone detection method includes: a processor (processor)61, a memory (memory)62, and a bus 63; wherein, the processor 61 and the memory 62 complete the communication with each other through the bus 63; the processor 61 is configured to call program instructions in the memory 62 to perform the methods provided by the above-mentioned method embodiments, for example, including: extracting function characteristics of a first function in a binary code to be detected and function characteristics of a second function in a binary vulnerability code, wherein the function characteristics comprise basic block information, control flow information and function call information; respectively inputting the function characteristics of the first function and the function characteristics of the second function into a preset neural network, and calculating the similarity of the first function and the second function by using the preset neural network; and when the similarity reaches a preset threshold value, determining that the clone codes of the binary vulnerability codes exist in the binary codes to be detected.

The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above-mentioned method embodiments, for example, comprising: extracting function characteristics of a first function in a binary code to be detected and function characteristics of a second function in a binary vulnerability code, wherein the function characteristics comprise basic block information, control flow information and function call information; respectively inputting the function characteristics of the first function and the function characteristics of the second function into a preset neural network, and calculating the similarity of the first function and the second function by using the preset neural network; and when the similarity reaches a preset threshold value, determining that the clone codes of the binary vulnerability codes exist in the binary codes to be detected.

The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the above method embodiments, for example, including: extracting function characteristics of a first function in a binary code to be detected and function characteristics of a second function in a binary vulnerability code, wherein the function characteristics comprise basic block information, control flow information and function call information; respectively inputting the function characteristics of the first function and the function characteristics of the second function into a preset neural network, and calculating the similarity of the first function and the second function by using the preset neural network; and when the similarity reaches a preset threshold value, determining that the clone codes of the binary vulnerability codes exist in the binary codes to be detected.

Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

The above-described embodiments of the apparatus and the like of the binary vulnerability code clone detection method are merely illustrative, where the units described as the separate components may or may not be physically separate, and the components displayed as the units may or may not be physical units, that is, may be located in one place, or may also be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, the method described in the present application is only a preferred embodiment and is not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A binary vulnerability code clone detection method, characterized by comprising the following steps:

extracting function characteristics of a first function in a binary code to be detected and function characteristics of a second function in a binary vulnerability code, wherein the function characteristics comprise basic block information, control flow information and function call information; the control flow information of the first function and the second function respectively comprises the dependency relationship between every two basic blocks in the first function and the second function;

when the similarity reaches a preset threshold value, determining that a clone code of the binary vulnerability code exists in the binary code to be detected;

the method further comprises the following steps:

when the difference value between the actual output result and the expected output result of the preset neural network is not larger than the target error, the preset neural network training is finished;

the obtaining of the first sample function and the second sample function with the markers characterizing the clone/non-clone includes:

2. The method according to claim 1, wherein, when the similarity between every first function in the binary code to be detected and the second function is smaller than the preset threshold, it is determined that no clone code of the binary vulnerability code exists in the binary code to be detected.

3. The method according to claim 1, wherein the calculating the similarity between the first function and the second function using the preset neural network specifically comprises:

4. The method according to claim 1 or 3, wherein the basic block information of the first function and the second function respectively comprises the starting address of each basic block in the first function and the second function and, for each basic block, the number of numerical constants, the number of character constants, the number of branch instructions, the number of function calls, the number of arithmetic instructions, the number of logic instructions, the betweenness centrality and the number of child nodes.
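As an illustration of the basic block information listed in claim 4, the following sketch computes the per-block counts from a toy control flow graph. Several points are assumptions not taken from the original text: the x86-like mnemonic classes, the `Instruction` and `BasicBlock` containers, and the use of unnormalized betweenness centrality on the directed graph (computed with Brandes' algorithm). Every successor address is assumed to be a key of the graph.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed x86-like mnemonic classes; the claim does not fix an instruction set.
BRANCH = {"jmp", "je", "jne", "jz", "jnz", "jg", "jl", "ja", "jb"}
CALL = {"call"}
ARITHMETIC = {"add", "sub", "mul", "imul", "div", "idiv", "inc", "dec", "neg"}
LOGIC = {"and", "or", "xor", "not", "shl", "shr"}


@dataclass
class Instruction:  # hypothetical container
    mnemonic: str
    numeric_constants: Tuple[int, ...] = ()
    character_constants: Tuple[str, ...] = ()


@dataclass
class BasicBlock:  # hypothetical container
    instructions: List[Instruction]
    successors: List[int]  # start addresses of child blocks


@dataclass
class BasicBlockFeatures:
    start_address: int
    numeric_constants: int
    character_constants: int
    branch_instructions: int
    function_calls: int
    arithmetic_instructions: int
    logic_instructions: int
    betweenness: float
    children: int


def betweenness_centrality(cfg: Dict[int, BasicBlock]) -> Dict[int, float]:
    """Unnormalized betweenness on the directed CFG (Brandes' algorithm)."""
    centrality = {v: 0.0 for v in cfg}
    for source in cfg:
        order = []                       # nodes in order of discovery
        preds = {v: [] for v in cfg}     # shortest-path predecessors
        sigma = {v: 0 for v in cfg}      # number of shortest paths from source
        dist = {v: -1 for v in cfg}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in cfg[v].successors:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in cfg}
        for w in reversed(order):        # accumulate dependencies back to source
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    return centrality


def extract_block_features(cfg: Dict[int, BasicBlock]) -> Dict[int, BasicBlockFeatures]:
    """Build one feature record per basic block, keyed by start address."""
    centrality = betweenness_centrality(cfg)
    features = {}
    for address, block in cfg.items():
        ins = block.instructions
        features[address] = BasicBlockFeatures(
            start_address=address,
            numeric_constants=sum(len(i.numeric_constants) for i in ins),
            character_constants=sum(len(i.character_constants) for i in ins),
            branch_instructions=sum(i.mnemonic in BRANCH for i in ins),
            function_calls=sum(i.mnemonic in CALL for i in ins),
            arithmetic_instructions=sum(i.mnemonic in ARITHMETIC for i in ins),
            logic_instructions=sum(i.mnemonic in LOGIC for i in ins),
            betweenness=centrality[address],
            children=len(block.successors),
        )
    return features
```

In practice the instruction stream and block boundaries would come from a disassembler; the toy containers here only make the counting rules concrete.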

5. The method according to claim 1 or 3, wherein the function call information of the first function includes the start address of a function called by the first function and the start address of a function that calls the first function; the function call information of the second function includes the start address of a function called by the second function and the start address of a function that calls the second function.
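The function call information of claim 5 can be illustrated with a short sketch that derives, for each function, the start addresses it calls and the start addresses that call it. The input format (a list of caller/callee address pairs) and the name `build_call_info` are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_call_info(
    call_edges: Iterable[Tuple[int, int]],
) -> Dict[int, Dict[str, List[int]]]:
    """Map each function start address to its callees and callers.

    call_edges holds (caller_start_address, callee_start_address) pairs,
    an assumed input format such as a disassembler might export.
    """
    callees = defaultdict(set)
    callers = defaultdict(set)
    for caller, callee in call_edges:
        callees[caller].add(callee)
        callers[callee].add(caller)

    functions = set(callees) | set(callers)
    return {
        address: {
            "calls": sorted(callees.get(address, ())),
            "called_by": sorted(callers.get(address, ())),
        }
        for address in functions
    }
```

For example, `build_call_info([(0x1000, 0x2000)])` records that the function at `0x1000` calls `0x2000` and that `0x2000` is called by `0x1000`.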

6. A binary vulnerability code clone detection system, comprising:

the characteristic extraction module is used for extracting the function characteristics of a first function in the binary code to be detected and the function characteristics of a second function in the binary vulnerability code, wherein the function characteristics comprise basic block information, control flow information and function call information; the control flow information of the first function and the second function respectively comprises the dependency relationship between every two basic blocks in the first function and the second function;

the clone detection module is used for determining that a clone code of the binary vulnerability code exists in the binary code to be detected when the similarity reaches a preset threshold value;

a sample function obtaining module, configured to obtain a first sample function and a second sample function with a marker representing clone/non-clone, and extract a function feature of the first sample function and a function feature of the second sample function;

the neural network construction module is used for constructing the preset neural network and setting a target error of the preset neural network;

the neural network training module is used for respectively inputting the function characteristics of the first sample function and the function characteristics of the second sample function into the preset neural network and training the preset neural network;
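The training flow recited above (labelled sample pairs, a target error, and training that ends once the difference between actual and expected output is not larger than the target error) can be sketched as follows. This is a minimal illustration under stated assumptions, not the network of the invention: a single logistic unit over the absolute feature differences stands in for the preset neural network, the error is measured as mean absolute error, and the names `train`, `predict`, `learning_rate` and `max_epochs` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# (features of first sample function, features of second sample function, label)
# label 1 marks a clone pair, label 0 a non-clone pair.
Sample = Tuple[Sequence[float], Sequence[float], int]


def predict(weights: List[float], bias: float,
            f1: Sequence[float], f2: Sequence[float]) -> float:
    """Similarity in (0, 1) from a single logistic unit (stand-in model)."""
    z = bias + sum(w * abs(a - b) for w, a, b in zip(weights, f1, f2))
    return 1.0 / (1.0 + math.exp(-z))


def train(samples: List[Sample], target_error: float,
          learning_rate: float = 0.5,
          max_epochs: int = 10000) -> Tuple[List[float], float, float]:
    """Gradient descent that stops once the mean absolute difference between
    actual and expected outputs is not larger than target_error.

    Returns (weights, bias, last measured error). If max_epochs is exhausted
    first, the returned error may still exceed target_error.
    """
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    error = float("inf")
    for _ in range(max_epochs):
        outputs = [predict(weights, bias, f1, f2) for f1, f2, _ in samples]
        error = sum(abs(o - y) for o, (_, _, y) in zip(outputs, samples)) / len(samples)
        if error <= target_error:
            break  # training is finished
        # Cross-entropy gradient for the logistic unit.
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for o, (f1, f2, y) in zip(outputs, samples):
            residual = o - y
            grad_b += residual
            for i, (a, b) in enumerate(zip(f1, f2)):
                grad_w[i] += residual * abs(a - b)
        weights = [w - learning_rate * g / len(samples)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias, error
```

The `max_epochs` cap is a safeguard added for the sketch; the original text states only the target-error stopping condition.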

7. A device for the binary vulnerability code clone detection method, characterized by comprising:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein:

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 5.