CN111552969A

CN111552969A - Embedded terminal software code vulnerability detection method and device based on neural network

Info

Publication number: CN111552969A
Application number: CN202010319183.9A
Authority: CN
Inventors: 朱朝阳; 颜秉晶; 周亮; 王海翔; 冀晓宇; 徐文渊; 应欢; 张燕秒; 卢新岱; 韩丽芳; 缪思薇; 朱亚运; 李霁远
Original assignee: Zhejiang University ZJU; State Grid Corp of China SGCC; State Grid Zhejiang Electric Power Co Ltd; China Electric Power Research Institute Co Ltd CEPRI
Current assignee: Zhejiang University ZJU; State Grid Corp of China SGCC; State Grid Zhejiang Electric Power Co Ltd; China Electric Power Research Institute Co Ltd CEPRI
Priority date: 2020-04-21
Filing date: 2020-04-21
Publication date: 2020-08-18

Abstract

The invention provides a neural-network-based method and device for detecting vulnerabilities in embedded terminal software code. The method comprises: first, acquiring the source code of target embedded terminal software and preprocessing it to obtain binary code; then inputting the characteristic function of the binary code into a pre-trained neural network to obtain an attribute control flow graph of the binary code; inputting the attribute control flow graph of the binary code and the attribute control flow graph of vulnerability code into a Siamese network for similarity comparison to obtain a comparison result, the vulnerability code being a known vulnerability in a preset vulnerability library; and finally, determining, based on the comparison result, whether the source code of the embedded terminal software contains a known vulnerability. By converting the source code into binary code through preprocessing, the invention makes detection more widely applicable; by generating the attribute control flow graph with a neural network, it overcomes the limitations of graph matching algorithms and improves detection efficiency.

Description

Embedded terminal software code vulnerability detection method and device based on neural network

Technical Field

The invention relates to the technical field of smart grid security, and in particular to a neural-network-based method and device for detecting vulnerabilities in embedded terminal software code.

Background

Embedded technology plays a vital role in the power system, and the embedded devices (i.e., embedded terminals) deployed there are of many types and built on diverse architectures. These devices provide a wide range of functions but are also prone to security risks. Research has shown that more than 80.4% of embedded device firmware contains multiple N-day vulnerabilities when released by the vendor, some of which have been publicly disclosed for more than 8 years. Because embedded device code is largely homologous and slow to update, a vulnerability at the source code level may propagate to hundreds or more embedded devices with different hardware architectures and software platforms, and may persist in those devices for a long time. Detecting known vulnerabilities in embedded devices is therefore key to securing them. Studying and detecting code similarity across embedded devices on different platforms is an effective way to identify disclosed vulnerabilities.

Currently, there are two main approaches to embedded device code similarity detection. The first scans the firmware of the embedded device for specific character strings or constants; it fails when the vulnerability is complex or when the vulnerability is associated with different character strings or constants. The second is based on graph matching; it depends excessively on the graph matching algorithm, so it cannot make a targeted similarity judgment when similar code appears with different inputs and outputs, and its operating efficiency is low because it is limited by the speed of graph matching.

Disclosure of Invention

The invention aims to provide a neural-network-based method and device for detecting vulnerabilities in embedded terminal software code, so as to solve the technical problems in the prior art that vulnerabilities cannot be correctly identified by scanning character strings or constants, and that excessive dependence on a graph matching algorithm leads to inaccurate judgments and low operating efficiency.

In a first aspect, an embodiment of the present invention provides a neural-network-based method for detecting vulnerabilities in embedded terminal software code, where the method includes: acquiring the source code of target embedded terminal software, and preprocessing the source code to obtain binary code; inputting the characteristic function of the binary code into a pre-trained neural network to obtain an attribute control flow graph of the binary code; comparing the similarity of the attribute control flow graph of the binary code and the attribute control flow graph of vulnerability code to obtain a comparison result, the vulnerability code being a known vulnerability in a preset vulnerability library; and determining, based on the comparison result, whether the source code of the embedded terminal software has a known vulnerability.

Further, comparing the similarity of the attribute control flow graph of the binary code and the attribute control flow graph of the vulnerability code to obtain a comparison result comprises: inputting the attribute control flow graph of the binary code into a Siamese network to obtain a first feature vector, wherein the first feature vector represents structural information of the attribute control flow graph of the binary code; inputting the attribute control flow graph of the vulnerability code into the Siamese network to obtain a second feature vector, wherein the second feature vector represents structural information of the attribute control flow graph of the vulnerability code; and comparing the similarity of the first feature vector and the second feature vector to obtain the comparison result.

Further, performing similarity comparison on the first feature vector and the second feature vector to obtain the comparison result includes: calculating cosine distances of the first feature vector and the second feature vector; determining the comparison result based on the cosine distance.

Further, before the similarity comparison is performed between the attribute control flow graph of the binary code and the attribute control flow graph of the vulnerability code, the method further includes: acquiring vulnerability training samples and/or non-vulnerability training samples, wherein a vulnerability training sample comprises an attribute control flow graph of vulnerability code and a non-vulnerability training sample comprises an attribute control flow graph of binary code; and training a Siamese network based on the vulnerability training samples and/or the non-vulnerability training samples, and optimizing the objective function of the Siamese network by a quasi-Newton method during training to obtain the optimized Siamese network.

Further, obtaining the vulnerability training samples and/or the non-vulnerability training samples includes: obtaining original vulnerability training samples and/or original non-vulnerability training samples; indexing the original vulnerability training samples by a distributed indexing method based on locality-sensitive hashing to obtain clusters of vulnerability training samples, and/or indexing the original non-vulnerability training samples by a distributed indexing method based on locality-sensitive hashing to obtain clusters of non-vulnerability training samples; and acquiring the vulnerability training samples from the clusters of vulnerability training samples, and/or acquiring the non-vulnerability training samples from the clusters of non-vulnerability training samples.

Further, preprocessing the source code to obtain binary code includes: splitting the source code into a number of token streams; parsing the token streams to obtain an analysis result; and when the analysis result indicates that the semantics are correct, generating the binary code corresponding to the source code by using a compiler.

Further, splitting the source code into a number of token streams includes: extracting a plurality of lexemes from the source code by using a scanner, and constructing a corresponding data packet for each lexeme; and constructing the token streams based on the data packets corresponding to all lexemes.

In a second aspect, an embodiment of the present invention provides a neural-network-based apparatus for detecting vulnerabilities in embedded terminal software code, where the apparatus includes: an acquisition and preprocessing module, configured to acquire the source code of target embedded terminal software and preprocess the source code to obtain binary code; a first input module, configured to input the characteristic function of the binary code into a pre-trained neural network to obtain an attribute control flow graph of the binary code; a second input module, configured to compare the similarity of the attribute control flow graph of the binary code and the attribute control flow graph of vulnerability code to obtain a comparison result, the vulnerability code being a known vulnerability in a preset vulnerability library; and a determining module, configured to determine, based on the comparison result, whether the source code of the embedded terminal software has a known vulnerability.

In a third aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, where the memory stores a computer program operable on the processor, and the processor executes the computer program to implement the method according to any one of the above first aspects.

In a fourth aspect, the present invention provides a computer-readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to execute the method according to any one of the above first aspects.

The invention provides a neural-network-based method and device for detecting vulnerabilities in embedded terminal software code. The method comprises: first, acquiring the source code of target embedded terminal software and preprocessing it to obtain binary code; then inputting the characteristic function of the binary code into a pre-trained neural network to obtain an attribute control flow graph of the binary code; inputting the attribute control flow graph of the binary code and the attribute control flow graph of vulnerability code into a Siamese network for similarity comparison to obtain a comparison result, the vulnerability code being a known vulnerability in a preset vulnerability library; and finally, determining, based on the comparison result, whether the source code of the embedded terminal software has a known vulnerability. By converting the source code into binary code through preprocessing, the method can mine complex vulnerabilities efficiently and makes detection more widely applicable; by generating the attribute control flow graph with a neural network, it overcomes the dependence on a graph matching algorithm and improves detection efficiency.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a flowchart of an embedded terminal software code vulnerability detection method based on a neural network according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of generating an attribute control flow graph of binary code;

FIG. 3 is a flowchart of another embedded terminal software code vulnerability detection method based on a neural network according to an embodiment of the present invention;

FIG. 4 is a flowchart of another embedded terminal software code vulnerability detection method based on a neural network according to an embodiment of the present invention;

FIG. 5 is a flowchart of step S103 in FIG. 1;

FIG. 6 is another flowchart of step S103 in FIG. 1;

FIG. 7 is a flowchart of step S203 in FIG. 5;

FIG. 8 is a flowchart of another embedded terminal software code vulnerability detection method based on a neural network according to an embodiment of the present invention;

FIG. 9 is a flowchart of step S401 in FIG. 8;

FIG. 10 is a schematic structural diagram of an embedded terminal software code vulnerability detection apparatus based on a neural network according to an embodiment of the present invention.

Reference numerals:

11 - acquisition and preprocessing module; 12 - first input module; 13 - second input module; 14 - determining module.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Based on the above, the embodiment of the invention provides a neural-network-based method and device for detecting vulnerabilities in embedded terminal software code. By converting the source code into binary code through preprocessing, it can mine complex vulnerabilities efficiently and makes detection more widely applicable; by generating the attribute control flow graph with a neural network, it overcomes the dependence on a graph matching algorithm and improves detection efficiency.

In order to facilitate understanding of the embodiment, a detailed description is first given to an embedded terminal software code vulnerability detection method based on a neural network disclosed in the embodiment of the present invention.

Example 1:

According to an embodiment of the present invention, a neural-network-based method for detecting vulnerabilities in embedded terminal software code is provided. It should be noted that the steps shown in the flowcharts of the figures may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.

FIG. 1 is a flowchart of a neural-network-based method for detecting vulnerabilities in embedded terminal software code according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:

Step S101, acquiring the source code of the target embedded terminal software, and preprocessing the source code to obtain binary code.

In the embodiment of the invention, the vulnerability library currently available contains a mixture of binary code and source code. On the one hand, a binary-level vulnerability is difficult to convert into a source-level vulnerability, whereas a source-level vulnerability is easily converted into a binary-level one; on the other hand, the source code of target embedded terminal software is difficult to obtain, whereas its binary code is easy to obtain. For these two reasons, in this embodiment, when the vulnerability library contains a vulnerability in source code form, it is compiled into binary form, and the source code of the target embedded terminal software is likewise converted into binary code.

Compared with detection on source code, detection on binary code avoids having to handle differences between compilation platforms and programming languages, is more widely applicable, and reduces the time and space cost of detection. The embodiment of the invention therefore takes the binary code of the target embedded terminal software as the object of analysis.

Step S102, inputting a characteristic function of the binary code into a pre-trained neural network to obtain an attribute control flow graph of the binary code;

In the embodiment of the present invention, the binary code may be described by a function call graph or an attribute control flow graph. The function call graph describes the call relationships between functions; the attribute control flow graph describes the internal structure of a function, in which all statements of the function are divided into a number of basic blocks. A basic block is a contiguous sequence of statements that control flow enters only at the beginning and leaves only at the end, with no halt or branch in between.
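As an illustration of the basic block definition above, the following minimal Python sketch splits a linear instruction list into basic blocks using the classic "leader" rule. The toy instruction format (an opcode plus an optional jump target index), the opcode set, and the names `Instr` and `split_basic_blocks` are assumptions made for this example only; they are not part of the disclosed method.

```python
from typing import List, Optional, Tuple

# Hypothetical toy instruction: (opcode, jump target index or None).
Instr = Tuple[str, Optional[int]]

BRANCH_OPS = {"jmp", "jz", "jnz"}  # assumed branch opcodes for this sketch


def split_basic_blocks(code: List[Instr]) -> List[List[int]]:
    """Return the basic blocks of `code` as lists of instruction indices."""
    if not code:
        return []
    leaders = {0}  # the first instruction always starts a block
    for i, (op, target) in enumerate(code):
        if op in BRANCH_OPS:
            if target is not None:
                leaders.add(target)  # a branch target starts a block
            if i + 1 < len(code):
                leaders.add(i + 1)   # the instruction after a branch starts a block
    ordered = sorted(leaders)
    blocks = []
    for k, start in enumerate(ordered):
        end = ordered[k + 1] if k + 1 < len(ordered) else len(code)
        blocks.append(list(range(start, end)))
    return blocks


example = [
    ("mov", None),  # 0
    ("cmp", None),  # 1
    ("jz", 5),      # 2: conditional branch to instruction 5
    ("add", None),  # 3
    ("jmp", 6),     # 4: unconditional branch to instruction 6
    ("sub", None),  # 5
    ("ret", None),  # 6
]
blocks = split_basic_blocks(example)
```

Each resulting block would become one vertex of the control flow graph, with edges following the branch targets and fall-through paths.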

In this embodiment, the binary code is described by an attribute control flow graph. Referring to FIG. 2, the characteristic function x of the binary code is split into three basic blocks x₁, x₂ and x₃, which are input into a pre-trained neural network; after T iterations, the attribute control flow graph μ of the binary code is obtained. Describing the binary code of the target embedded terminal software with an attribute control flow graph (ACFG in FIG. 2) overcomes the limitations of graph matching algorithms and is efficient to generate, so that an accurate comparison result can be obtained.

Step S103, inputting the attribute control flow graph of the binary code and the attribute control flow graph of the vulnerability code into a Siamese network for similarity comparison to obtain a comparison result.

In an embodiment of the present invention, referring to FIG. 6, g₁ and g₂ represent the source code of the target embedded terminal software and the source code of the vulnerability, respectively, and μ₁ and μ₂ represent the attribute control flow graph of the binary code and the attribute control flow graph of the vulnerability code, respectively. After the two attribute control flow graphs are input into the Siamese network, the Siamese network outputs the comparison result Cos(μ₁, μ₂).

Step S104, determining, based on the comparison result, whether the source code of the embedded terminal software has a known vulnerability.

Referring to FIGS. 3, 4 and 6, source code is obtained from a source code library and vulnerability code is obtained from the preset vulnerability library. Feature extraction on the source code yields the attribute control flow graph of the binary code (μ₁ in FIG. 6), and feature extraction on the vulnerability code yields the attribute control flow graph of the vulnerability code (μ₂ in FIG. 6). Both are input into the Siamese network, which outputs the comparison result (Cos(μ₁, μ₂) in FIG. 6). The comparison result Cos(μ₁, μ₂) can be represented by 1 and -1: Cos(μ₁, μ₂) = 1 indicates that the source code of the embedded terminal software has a known vulnerability, and Cos(μ₁, μ₂) = -1 indicates that it has none.

The embodiment of the invention performs similarity detection on the target embedded terminal software code based on the neural network and the Siamese network, and can thereby check for the presence of known vulnerabilities. Specifically, by converting the source code into binary code through preprocessing, the method can mine complex vulnerabilities efficiently and makes detection more widely applicable; by generating the attribute control flow graph with a neural network, it overcomes the dependence on a graph matching algorithm and improves detection efficiency.

The method for detecting the vulnerability of the embedded terminal software code based on the neural network is described below with reference to specific embodiments.

As can be seen from the above description, step S103, in which the attribute control flow graph of the binary code and the attribute control flow graph of the vulnerability code are input into a Siamese network for similarity comparison to obtain a comparison result, is a key link in implementing the embodiment of the present invention; comparing the similarity of attribute control flow graphs overcomes the dependence on a graph matching algorithm.

In an optional embodiment, referring to FIG. 5, step S103, inputting the attribute control flow graph of the binary code and the attribute control flow graph of the vulnerability code into a Siamese network for similarity comparison to obtain a comparison result, includes:

Step S201, inputting the attribute control flow graph of the binary code into the Siamese network to obtain a first feature vector.

In the embodiment of the invention, the first feature vector represents the structural information of the attribute control flow graph of the binary code.

Step S202, inputting the attribute control flow graph of the vulnerability code into the Siamese network to obtain a second feature vector.

In the embodiment of the invention, the second feature vector represents the structural information of the attribute control flow graph of the vulnerability code.

Step S203, comparing the similarity of the first feature vector and the second feature vector to obtain the comparison result.

The Siamese network takes the attribute control flow graph of the binary code and the attribute control flow graph of the vulnerability code as input and outputs a comparison result, which may be a similarity score. That is, in FIG. 6, μ₁ and μ₂ are the inputs and Cos(μ₁, μ₂) is the output.

Specifically, both the attribute control flow graph of the binary code and that of the vulnerability code are embedded using the Structure2vec method, which characterizes the nodes of a graph according to the network structure of the attribute control flow graph and the relationships between its nodes.

Taking the attribute control flow graph of the binary code as an example, the Structure2vec method proceeds as follows. The attribute control flow graph of the binary code is g₁ = (V, E), where V is the vertex set and E is the edge set. Each vertex v in V carries an additional feature x_v corresponding to the basic block features (i.e., basic block attributes such as the number of children and the relevance of the children). The features are aggregated recursively and synchronously according to the graph topology: each round of vertex updates starts only after the previous round has finished, so that the vertex feature x_v is propagated to other vertices through a nonlinear propagation function F. After the iterative updates are complete, the embedding network produces a new feature embedding for each vertex.

First, the graph embedding network computes a p-dimensional feature μ_v for each vertex v in the vertex set V; the embedding vector μ_g of g₁ is then computed as the aggregation of the vertex embeddings, i.e., μ_g := Σ_{v∈V} μ_v.

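The Structure2vec network initializes the embedding μ_v^(0) of every vertex to 0 and then updates it in each iteration as:

μ_v^(t+1) = F(x_v, Σ_{u∈N(v)} μ_u^(t)), for all v ∈ V,

where N(v) denotes the set of neighbors of vertex v, and F is designed as:

F(x_v, Σ_{u∈N(v)} μ_u) = tanh(W₁x_v + σ(Σ_{u∈N(v)} μ_u)),

where x_v is a d-dimensional vector of vertex features, i.e., features at the node or basic block level of the attribute control flow graph, W₁ is a d × p matrix, and p is the embedding size (embedding dimension).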

From these formulas, a synchronous update based on the graph topology is obtained: as the number of update iterations increases, vertex features are propagated to more distant vertices and aggregated nonlinearly. In this way the embedding captures information about the graph topology and each vertex's neighborhood without any manually specified nonlinear parameters; the nonlinear parameters are learned by the neural network from a large amount of training sample data.

σ(·) is a fully connected n-layer neural network:

σ(l) = P₁ × ReLU(P₂ × … ReLU(P_n l)),

where P_i (i = 1, …, n) is a p × p matrix, n is the embedding depth, and ReLU is the rectified linear unit, i.e., ReLU(x) = max{0, x}.

In an alternative embodiment, as shown in FIG. 7, step S203, comparing the similarity of the first feature vector and the second feature vector to obtain a comparison result, includes:

Step S301, calculating the cosine distance between the first feature vector and the second feature vector;

step S302, determining a comparison result based on the cosine distance.
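Steps S301 and S302 can be sketched in a few lines of Python. The source speaks of a cosine distance, but the values 1 and -1 it reports correspond to the cosine of the angle between the two vectors, which is what this sketch computes. The function names and the decision threshold of 0.9 are assumptions for illustration; the source does not state a threshold.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def has_known_vulnerability(first: Sequence[float], second: Sequence[float],
                            threshold: float = 0.9) -> bool:
    """Assumed decision rule: report a match when the cosine reaches the threshold."""
    return cosine_similarity(first, second) >= threshold
```

For example, `has_known_vulnerability([1, 2, 3], [2, 4, 6])` is true because the two vectors point in the same direction, while orthogonal vectors such as `[1, 0]` and `[0, 1]` do not match.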

In the embodiment of the invention, the final output of the Siamese network is the cosine distance of the two vectors. The two embedding networks share the same set of parameters and remain identical to each other throughout training. The Siamese network joins the two embedding networks at the top and outputs the comparison result:

Cos(φ(g), φ(g′)) = ⟨φ(g), φ(g′)⟩ / (‖φ(g)‖ · ‖φ(g′)‖),

where φ(g) is the first feature vector and φ(g′) is the second feature vector. The first feature vector φ(g) is generated as follows: step 1, take the attribute control flow graph g = (V, E) as input; step 2, initialize μ_v^(0) = 0 for each vertex v in the vertex set V; step 3, for each iteration t = 1, …, T and each vertex v, let l_v = Σ_{u∈N(v)} μ_u^(t−1), where N(v) is the set of neighbors of v; step 4, let μ_v^(t) = tanh(W₁x_v + σ(l_v)); step 5, obtain φ(g) := Σ_{v∈V} μ_v^(T).
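The embedding procedure in steps 1 to 5 can be sketched in plain Python as follows. This is a sketch under stated assumptions, not the patented implementation: it assumes that l_v is the sum of the previous-round embeddings of the neighbors of v, that σ is the n-layer ReLU network σ(l) = P₁·ReLU(P₂·…ReLU(P_n·l)), and that the graph embedding is the plain sum of the final vertex embeddings. The matrix W₁ is stored here as p rows of length d so that W₁x_v is p-dimensional. All names and the toy parameter values are hypothetical; in practice the parameters would be learned.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def sigma(layers: List[Matrix], l: Vector) -> Vector:
    """Assumed form: P1 * ReLU(P2 * ... ReLU(Pn * l)) for layers [P1, ..., Pn]."""
    out = matvec(layers[-1], l)
    for p_i in reversed(layers[:-1]):
        out = matvec(p_i, relu(out))
    return out


def embed_graph(features: Dict[int, Vector], neighbors: Dict[int, List[int]],
                w1: Matrix, layers: List[Matrix], iterations: int) -> Vector:
    p = len(w1)
    mu = {v: [0.0] * p for v in features}        # step 2: mu_v^(0) = 0
    for _ in range(iterations):                  # steps 3 and 4, repeated T times
        new_mu = {}
        for v, x_v in features.items():
            l_v = [0.0] * p
            for u in neighbors.get(v, []):       # step 3: sum neighbor embeddings
                l_v = [a + b for a, b in zip(l_v, mu[u])]
            pre = [a + b for a, b in zip(matvec(w1, x_v), sigma(layers, l_v))]
            new_mu[v] = [math.tanh(a) for a in pre]   # step 4
        mu = new_mu                              # synchronous update of all vertices
    phi = [0.0] * p                              # step 5: aggregate vertex embeddings
    for vec in mu.values():
        phi = [a + b for a, b in zip(phi, vec)]
    return phi


# Toy graph with three basic blocks, d = 2 features per block, p = 2.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
w1 = [[0.5, -0.2], [0.1, 0.3]]
layers = [[[0.2, 0.0], [0.0, 0.2]], [[1.0, 0.5], [-0.5, 1.0]]]
phi = embed_graph(features, neighbors, w1, layers, iterations=3)
```

Because every vertex is updated from the previous round's values and the final step is a sum, the resulting vector does not depend on how the vertices are numbered, which is what makes it usable for comparing graphs.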

In an optional embodiment, as shown in FIG. 8, before step S103, inputting the attribute control flow graph of the binary code and the attribute control flow graph of the vulnerability code into a Siamese network for similarity comparison, the method further includes:

Step S401, acquiring vulnerability training samples and/or non-vulnerability training samples;

In the embodiment of the invention, a vulnerability training sample comprises an attribute control flow graph of vulnerability code, and a non-vulnerability training sample comprises an attribute control flow graph of binary code;

Step S402, training the Siamese network based on the vulnerability training samples and/or the non-vulnerability training samples, and optimizing the objective function of the Siamese network by a quasi-Newton method during training to obtain the optimized Siamese network.

In the embodiment of the invention, the Siamese network is optimized by a quasi-Newton method, and a well-performing Siamese network is obtained by using the area under the curve (AUC) as the evaluation metric. The essential idea of the quasi-Newton method is to avoid the drawback of Newton's method, which must compute the inverse of a complex Hessian matrix at every iteration: a positive definite matrix is used to approximate the inverse of the Hessian, which reduces the computational complexity. Like the steepest descent method, a quasi-Newton method only requires the gradient of the objective function at each iteration; by measuring the change in the gradient, it builds a model of the objective function that is good enough to yield superlinear convergence. The AUC is defined as the area enclosed by the receiver operating characteristic (ROC) curve and the coordinate axes. The ROC curve is drawn from a series of different binary classification thresholds (cut-off values or decision thresholds), with the true positive rate (sensitivity) as the ordinate and the false positive rate (1 − specificity) as the abscissa. The AUC value is a probability: when one positive sample and one negative sample are selected at random, it is the probability that the current classification algorithm ranks the positive sample ahead of the negative sample according to the computed score. The larger the AUC value, the more likely the algorithm is to rank positive samples ahead of negative samples, and hence the better the classification.
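The probabilistic reading of the AUC given above translates directly into code: count, over all positive/negative pairs, how often the positive sample receives the higher score. The sketch below does this in plain Python; the function name `auc`, the label convention (1 for positive, anything else for negative) and the handling of ties as half a win are assumptions for illustration.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Two similar pairs (label 1) and two dissimilar pairs (label -1).
example_auc = auc([0.9, 0.4, 0.6, 0.2], [1, 1, -1, -1])
```

Here three of the four positive/negative pairs are ranked correctly, so `example_auc` is 0.75. This pairwise form is quadratic in the number of samples and is meant only to make the definition concrete.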

Because the model for a particular source code base may deviate from the pre-trained default strategy, embodiments of the present invention fine-tune the trained Siamese network using additional data, typically provided by experts in the field, to adjust part of the embedding network parameters. These additional data make the embedding network conform better to the strategy of the specific task. With the data set augmented in this way, the present embodiment can further train the embedding network, sampling the new data more frequently than the old data. After the Siamese network has been trained, domain experts feed back parameter adjustments to the system and the optimized model to improve its adaptability to the specific source code, further improving the accuracy of similarity detection and facilitating the detection of existing known vulnerabilities.

The embodiment of the invention joins the two embedding networks at the top, takes the two attribute control flow graphs as input and produces a similarity score as output, optimizes the objective function by a quasi-Newton method, and obtains a well-performing Siamese network by using the area under the curve (AUC) as the evaluation metric.
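To make the quasi-Newton idea concrete, the following sketch implements BFGS, one common quasi-Newton method, in plain Python. The source does not say which quasi-Newton variant is used, so the choice of BFGS, the backtracking line search, and the toy quadratic objective are all assumptions for illustration. The matrix `h` plays the role of the positive definite approximation to the inverse Hessian and is updated only from changes in the gradient.

```python
from typing import Callable, List

Vector = List[float]


def bfgs(f: Callable[[Vector], float], grad: Callable[[Vector], Vector],
         x0: Vector, max_iter: int = 100, tol: float = 1e-8) -> Vector:
    n = len(x0)
    x = list(x0)
    # h approximates the inverse Hessian; start from the identity matrix.
    h = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = grad(x)
    for _ in range(max_iter):
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break
        d = [-sum(h[i][j] * g[j] for j in range(n)) for i in range(n)]  # d = -h g
        # Backtracking line search with the Armijo sufficient-decrease condition.
        step, fx = 1.0, f(x)
        slope = sum(gi * di for gi, di in zip(g, d))
        while (step > 1e-12 and
               f([xi + step * di for xi, di in zip(x, d)]) > fx + 1e-4 * step * slope):
            step *= 0.5
        x_new = [xi + step * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]   # change in position
        y = [a - b for a, b in zip(g_new, g)]   # change in gradient
        sy = sum(a * b for a, b in zip(s, y))
        if sy > 1e-12:  # curvature condition keeps h positive definite
            rho = 1.0 / sy
            hy = [sum(h[i][j] * y[j] for j in range(n)) for i in range(n)]
            yhy = sum(a * b for a, b in zip(y, hy))
            for i in range(n):
                for j in range(n):
                    h[i][j] += ((1.0 + rho * yhy) * rho * s[i] * s[j]
                                - rho * (hy[i] * s[j] + s[i] * hy[j]))
        x, g = x_new, g_new
    return x


# Toy objective with its minimum at (1, -2).
def objective(v: Vector) -> float:
    return (v[0] - 1.0) ** 2 + 2.0 * (v[1] + 2.0) ** 2


def objective_grad(v: Vector) -> Vector:
    return [2.0 * (v[0] - 1.0), 4.0 * (v[1] + 2.0)]


minimum = bfgs(objective, objective_grad, [0.0, 0.0])
```

For a real Siamese network the objective would be the training loss over sample pairs and the gradient would come from backpropagation; with many parameters a limited-memory variant is usually preferred over storing the full matrix.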

In an alternative embodiment, as shown in FIG. 9, step S401, obtaining the vulnerability training samples and/or the non-vulnerability training samples, includes the following steps:

Step S501, obtaining original vulnerability training samples and/or original non-vulnerability training samples;

Step S502, indexing the original vulnerability training samples by a distributed indexing method based on locality-sensitive hashing to obtain clusters of vulnerability training samples, and/or indexing the original non-vulnerability training samples by a distributed indexing method based on locality-sensitive hashing to obtain clusters of non-vulnerability training samples;

Step S503, acquiring the vulnerability training samples from the clusters of vulnerability training samples, and/or acquiring the non-vulnerability training samples from the clusters of non-vulnerability training samples.
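The indexing in step S502 can be illustrated with random-hyperplane locality-sensitive hashing, in which vectors pointing in similar directions tend to receive the same bit signature and therefore fall into the same bucket. The source does not specify the hash family or the distributed layout, so the hyperplane scheme, the single-process implementation, the sample vectors and all names below are assumptions for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def make_hyperplanes(dim: int, bits: int, seed: int = 0) -> List[List[float]]:
    """Draw `bits` random hyperplane normals in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def lsh_signature(vec: Sequence[float], planes: List[List[float]]) -> Tuple[int, ...]:
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(1 if sum(a * b for a, b in zip(vec, plane)) >= 0.0 else 0
                 for plane in planes)


def cluster_by_lsh(samples: Dict[str, Sequence[float]],
                   planes: List[List[float]]) -> Dict[Tuple[int, ...], List[str]]:
    """Group sample names by signature; each bucket acts as one cluster."""
    buckets = defaultdict(list)
    for name, vec in samples.items():
        buckets[lsh_signature(vec, planes)].append(name)
    return dict(buckets)


# Hypothetical feature vectors for four training samples.
samples = {
    "sample_a": [0.9, 0.1, 0.3],
    "sample_b": [1.8, 0.2, 0.6],   # same direction as sample_a
    "sample_c": [-0.5, 0.7, -0.2],
    "sample_d": [0.1, -0.9, 0.4],
}
planes = make_hyperplanes(dim=3, bits=8)
clusters = cluster_by_lsh(samples, planes)
```

Training samples for step S503 would then be drawn from these buckets. In a distributed setting, the signature could serve as the key that decides which node stores a sample.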

In the embodiment of the present invention, training the Siamese network requires a large number of training samples, but in most cases real data is limited. To solve this problem, the embodiment uses a default strategy based on equivalent functions: from a given set of source code (i.e., the original vulnerability training samples) it generates a large data set (i.e., the clusters of vulnerability training samples), and uses this data set to pre-train a model that is independent of any particular target but reasonably general. This is possible because many different binary functions can be compiled from the same source code. In general, generating a function embedding for each task requires capturing features of the function that are invariant across architectures and compilers. The embodiment of the invention addresses this by constructing the data set with a fixed labeling scheme. Given a single set of source code, the embodiment can compile it to binary code for different architectures, with different compilers and at different optimization levels, thereby obtaining a variety of binary functions. Under the default strategy, two binary functions compiled from the same source code are regarded as similar, and the trained Siamese network is expected to recognize them as such. A training data set is constructed for the binary functions according to this criterion, and new training samples are constructed by sampling.
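The default strategy described above amounts to a simple labeling rule for training pairs: two binaries compiled from the same source function form a similar pair, and any other two form a dissimilar pair. The sketch below applies that rule. The function names, the binary identifiers (architecture, compiler and optimization level) and the labels +1 and -1 are hypothetical and chosen only to mirror the 1 and -1 comparison results used earlier.

```python
import itertools
from typing import Dict, List, Tuple

# Hypothetical data set: source function name -> binaries compiled from it.
compiled: Dict[str, List[str]] = {
    "parse_header": [
        "parse_header@arm-gcc-O0",
        "parse_header@mips-gcc-O2",
        "parse_header@x86-clang-O3",
    ],
    "check_crc": [
        "check_crc@arm-gcc-O0",
        "check_crc@x86-clang-O3",
    ],
}


def build_pairs(groups: Dict[str, List[str]]) -> List[Tuple[str, str, int]]:
    """Label every pair of binaries: +1 if from the same source function, else -1."""
    labelled = [(src, b) for src, bins in groups.items() for b in bins]
    pairs = []
    for (src_a, a), (src_b, b) in itertools.combinations(labelled, 2):
        pairs.append((a, b, 1 if src_a == src_b else -1))
    return pairs


pairs = build_pairs(compiled)
```

With five binaries there are ten pairs in total, four similar and six dissimilar. On a realistic data set the dissimilar pairs vastly outnumber the similar ones, which is one reason new training samples are constructed by sampling rather than by enumerating all pairs.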

The embodiment of the invention indexes the samples with a database based on locality-sensitive hashing (LSH), treats equivalent functions according to the default strategy, performs pre-training, and then fine-tunes a small number of parameters to adapt to a specific task, thereby reducing the running time and computing resources required and improving detection accuracy.
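The hash-based indexing can be illustrated with a small sketch. The embodiment does not state which hash family is used; the random-hyperplane scheme below is one common choice for cosine similarity and is an assumption, as are the function names `make_hyperplanes`, `lsh_signature`, and `build_index` and the use of plain feature vectors as samples.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


def build_index(samples, hyperplanes):
    """Group sample names into buckets (clusters) keyed by their signature."""
    buckets = defaultdict(list)
    for name, vector in samples.items():
        buckets[lsh_signature(vector, hyperplanes)].append(name)
    return buckets
```

Vectors that point in nearly the same direction tend to fall into the same bucket, so candidate training samples can be drawn per bucket instead of by comparing every sample against every other sample.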

In an optional embodiment, in step S101, preprocessing the source code to obtain a binary code includes:

step 1, splitting the source code into a plurality of token streams;

In the embodiment of the present invention, splitting the source code into a plurality of token streams in step 1 includes the following steps: step 11, extracting a plurality of lexemes from the source code by using a scanning program, and constructing a corresponding data packet for each lexeme; and step 12, constructing a plurality of token streams based on the data packets corresponding to all lexemes.

step 2, carrying out semantic analysis on the token streams to obtain a semantic analysis result;

step 3, when the semantic analysis result is that the semantics are correct, generating a binary code corresponding to the source code by using a compiler.

In the embodiment of the invention, the conversion from source code to binary object code mainly passes through the following four stages. The first stage is lexical analysis: the scanner reads the characters and strings in the source file and classifies them into tokens that represent the lexemes of the source file. A small data packet is created for each lexeme, so that each string lexeme is converted into a small token packet; because a token can then be handled as an integer rather than as a character string, processing efficiency is improved. The second stage is parsing: this part of the compiler checks whether the syntax and semantics of the source program are correct and reorganizes the token stream into a more complex data structure that represents the meaning of the program, i.e., its semantics. Building a data structure that represents the source code makes different parts of the program easier to reference and relieves the burden on the code generation and optimization stages. The third stage is intermediate code generation: code is easier to manipulate at this stage, and because many compilers are cross-platform they must generate machine code for different CPU architectures; the source is therefore not translated directly into native machine code but passes through intermediate code first. The fourth stage is optimization: the goal is for the program to occupy as few resources as possible, the main resources being memory (size) and CPU cycles (speed), i.e., an efficiency issue. Because the time required to solve this problem exactly grows exponentially with the input, the compiler uses heuristic and similar algorithms to decide which transformations to apply when generating code.
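The lexical analysis stage can be illustrated by a small scanner. This is a sketch for a toy expression language rather than the scanner of the embodiment; the token kinds, their integer codes, and the names `Token`, `TOKEN_KINDS`, and `tokenize` are assumptions made for the example.

```python
import re
from typing import NamedTuple

# Hypothetical integer codes for the token kinds of a toy language.
TOKEN_KINDS = {"NUMBER": 1, "IDENT": 2, "OP": 3, "PUNCT": 4}

TOKEN_PATTERN = re.compile(
    r"(?P<NUMBER>\d+)"
    r"|(?P<IDENT>[A-Za-z_]\w*)"
    r"|(?P<OP>[+\-*/=<>!]=?)"
    r"|(?P<PUNCT>[(){};,])"
    r"|(?P<SKIP>\s+)"
)


class Token(NamedTuple):
    kind: int    # integer code, cheaper to compare than the raw string
    lexeme: str  # original text taken from the source file


def tokenize(source):
    """Scan the source text and return its token stream as a list."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_PATTERN.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":  # whitespace produces no token
            tokens.append(Token(TOKEN_KINDS[match.lastgroup], match.group()))
        pos = match.end()
    return tokens
```

Each `Token` plays the role of the small data packet built per lexeme: later stages can compare the integer `kind` field instead of re-examining the string.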

The embodiment of the invention provides a neural-network-based method for detecting vulnerabilities in embedded terminal software code, which judges whether a known vulnerability exists in the target embedded terminal software code by detecting the similarity between the binary code of the target embedded terminal software and the binary code of the vulnerability. The embodiment of the invention carries out feature extraction and attribute control flow graph generation on the target embedded terminal software code, builds a neural network and a Siamese network, and trains the Siamese network so that it can detect code similarity. After the Siamese network training is completed, the embodiment of the invention carries out preprocessing and feature extraction on the target embedded software code to be detected, and compares the attribute control flow graph of the binary code of the target embedded software with the attribute control flow graph of the vulnerability code, thereby discovering known vulnerabilities present in the binary code of the target embedded software. The embodiment of the invention can detect code similarity across a plurality of software platforms, suits the complex and diverse nature of embedded terminal software code, and has higher universality.

Therefore, the embodiment of the invention takes the binary code of the target embedded terminal software as the research object. After the code features of the binary code are extracted, the neural network is used to generate an attribute control flow graph for the binary code, and the neural network can likewise be used to generate an attribute control flow graph for the vulnerability code. The objective function of the Siamese network is optimized by a quasi-Newton method, and the Siamese network is then trained. Finally, indexing is performed using a database based on locality-sensitive hashing (LSH); when the graph embedding of the target software code is close to a graph embedding in the vulnerability library, the two are similar, i.e., the known vulnerability exists in the target software code.

Example 2:

the embodiment of the invention also provides an embedded terminal software code vulnerability detection device based on the neural network, which is mainly used for executing the embedded terminal software code vulnerability detection method based on the neural network provided by the embodiment of the invention.

Fig. 10 is a schematic structural diagram of an embedded terminal software code vulnerability detection apparatus based on a neural network according to an embodiment of the present invention. As shown in fig. 10, the device for detecting bugs in embedded terminal software based on a neural network mainly includes: an acquisition preprocessing module 11, a first input module 12, a second input module 13, and a determination module 14, wherein:

the acquisition preprocessing module 11 is used for acquiring a source code of the target embedded terminal software and preprocessing the source code to obtain a binary code;

the first input module 12 is configured to input a feature function of the binary code to a pre-trained neural network to obtain an attribute control flow graph of the binary code;

a second input module 13, configured to input the attribute control flow graph of the binary code and the attribute control flow graph of the bug code to a siamese network for similarity comparison, so as to obtain a comparison result; the vulnerability codes are known vulnerabilities in a preset vulnerability library;

and the determining module 14 is used for determining whether the source code of the embedded terminal software has a known bug or not based on the comparison result.

The embodiment of the invention provides an embedded terminal software code vulnerability detection device based on a neural network, which comprises: firstly, acquiring a source code of target embedded terminal software by using an acquisition preprocessing module 11, and preprocessing the source code to obtain a binary code; then, inputting the characteristic function of the binary code into a pre-trained neural network by using a first input module 12 to obtain an attribute control flow graph of the binary code; then, a second input module 13 is utilized to input the attribute control flow graph of the binary code and the attribute control flow graph of the bug code into a Siamese network for similarity comparison to obtain a comparison result; the vulnerability codes are known vulnerabilities in a preset vulnerability library; and finally, determining whether the source code of the embedded terminal software has a known bug or not by using the determining module 14 based on the comparison result. According to the method, the source code is converted into the binary code through preprocessing, so that the complex vulnerability can be efficiently mined, the detection universality is improved, meanwhile, the dependency on a graph matching algorithm is overcome through a mode of generating an attribute control flow graph based on a neural network, and the detection efficiency is improved.
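The way the four modules hand data to one another can be sketched as a small composition of callables. This is only a structural illustration: the class name `VulnerabilityDetector`, its field names, and the stand-in stages used to make the sketch runnable are hypothetical, and real stages would compile the source code and run the trained networks.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class VulnerabilityDetector:
    preprocess: Callable[[str], bytes]       # module 11: source code -> binary code
    build_acfg: Callable[[bytes], Any]       # module 12: binary code -> attribute control flow graph
    is_similar: Callable[[Any, Any], bool]   # module 13: Siamese similarity comparison
    vulnerability_library: Sequence[Any]     # graphs of known vulnerability code

    def has_known_vulnerability(self, source_code: str) -> bool:
        binary_code = self.preprocess(source_code)
        target_acfg = self.build_acfg(binary_code)
        # module 14: decide from the comparison results
        return any(self.is_similar(target_acfg, v) for v in self.vulnerability_library)


# Stand-in stages (assumptions) so that the sketch runs end to end.
detector = VulnerabilityDetector(
    preprocess=lambda src: src.encode("utf-8"),
    build_acfg=lambda binary: len(binary),
    is_similar=lambda a, b: a == b,
    vulnerability_library=[5, 9],
)
```

Passing the stages in as callables mirrors the point made later in the description that the division into units is a logical one and may be realized differently in an actual implementation.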

Optionally, the second input module 13 comprises the following sub-modules:

the first input submodule is used for inputting the attribute control flow graph of the binary code into a Siamese network to obtain a first characteristic vector; the first feature vector is used for representing the structural information of the attribute control flow graph of the binary code;

the second input submodule is used for inputting the attribute control flow graph of the vulnerability code into a Siamese network to obtain a second characteristic vector; the second feature vector is used for representing structural information of an attribute control flow graph of the vulnerability code;

and the similarity comparison sub-module is used for comparing the similarity of the first characteristic vector and the second characteristic vector to obtain a comparison result.

Optionally, the similarity comparison sub-module comprises:

the computing unit is used for computing the cosine distance between the first characteristic vector and the second characteristic vector;

and the determining unit is used for determining a comparison result based on the cosine distance.
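The computing unit and the determining unit can be sketched as two small functions. The cosine distance is taken here as one minus the cosine similarity, which is a common convention but is not spelled out in this section; the decision threshold `max_distance` and its default value are purely illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Assumed convention: 0 for identical directions, 2 for opposite ones."""
    return 1.0 - cosine_similarity(u, v)


def is_similar(u, v, max_distance=0.1):
    """Comparison result: True when the two vectors are close enough."""
    return cosine_distance(u, v) <= max_distance
```

For example, `is_similar([1, 2, 3], [2, 4, 6])` holds because the two vectors point in the same direction, whereas two orthogonal vectors have a distance of 1 and are rejected under this threshold.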

Optionally, the device for detecting a bug of an embedded terminal software code based on a neural network further includes:

the acquisition module is used for acquiring vulnerability training samples and/or non-vulnerability training samples; the vulnerability training sample comprises an attribute control flow graph of a vulnerability code, and the non-vulnerability training sample comprises an attribute control flow graph of a binary code;

the training optimization module is used for training a Siamese network based on a vulnerability training sample and/or a non-vulnerability training sample, and in the process of training the Siamese network, the target function of the Siamese network is optimized through a quasi-Newton method, and the optimized Siamese network is obtained.
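To make the quasi-Newton step concrete, the following sketch minimizes a toy Siamese objective with a hand-written BFGS loop. Everything specific here is an assumption: this section does not say which quasi-Newton variant is used, so BFGS is chosen as a representative; the twin branch is reduced to a shared element-wise scaling `w`; the objective is the squared error between the cosine similarity and a +1 / -1 label; and gradients are approximated by finite differences.

```python
import math


def embed(w, x):
    # Toy twin branch: both inputs are scaled by the same shared weights w.
    return [wi * xi for wi, xi in zip(w, x)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)  # small constant avoids division by zero


def loss(w, pairs):
    # Squared error between predicted similarity and the +1 / -1 label.
    return sum((cosine(embed(w, a), embed(w, b)) - y) ** 2 for a, b, y in pairs)


def gradient(f, w, h=1e-6):
    # Central finite differences, adequate for a handful of parameters.
    grad = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + h] + w[i + 1:]
        down = w[:i] + [w[i] - h] + w[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def bfgs(f, w, iterations=50):
    n = len(w)
    # H approximates the inverse Hessian; start from the identity matrix.
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = gradient(f, w)
    for _ in range(iterations):
        d = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]
        slope = sum(gi * di for gi, di in zip(g, d))
        step, f_w = 1.0, f(w)
        while step > 1e-10:  # backtracking (Armijo) line search
            candidate = [wi + step * di for wi, di in zip(w, d)]
            if f(candidate) <= f_w + 1e-4 * step * slope:
                break
            step *= 0.5
        else:
            break  # no acceptable step found, stop optimizing
        g_new = gradient(f, candidate)
        s = [a - b for a, b in zip(candidate, w)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = sum(si * yi for si, yi in zip(s, y))
        if sy > 1e-12:  # update only when the curvature condition holds
            rho = 1.0 / sy
            Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
            yHy = sum(yi * hi for yi, hi in zip(y, Hy))
            # H <- (I - rho s y^T) H (I - rho y s^T) + rho s s^T
            H = [[H[i][j] - rho * (s[i] * Hy[j] + Hy[i] * s[j])
                  + (rho * rho * yHy + rho) * s[i] * s[j]
                  for j in range(n)] for i in range(n)]
        w, g = candidate, g_new
    return w


# One similar pair and one dissimilar pair of two-dimensional toy features.
PAIRS = [([1.0, 0.2], [1.0, -0.2], 1), ([1.0, 0.2], [-1.0, 0.2], -1)]


def objective(w):
    return loss(w, PAIRS)


w_start = [1.0, 1.0]
w_opt = bfgs(objective, w_start)
```

Unlike plain gradient descent, the loop reuses curvature information gathered from successive gradients through the matrix `H`, without ever forming second derivatives. A real network would have far too many parameters for a dense `H`, so a limited-memory variant would be the more likely choice in practice.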

Optionally, the obtaining module includes the following sub-modules:

the first obtaining submodule is used for obtaining an original vulnerability training sample and/or an original non-vulnerability training sample;

the indexing submodule is used for indexing the original vulnerability training samples by using a locality-sensitive hashing (LSH) distributed indexing method to obtain clusters of vulnerability training samples, and/or indexing the original non-vulnerability training samples by using the LSH distributed indexing method to obtain clusters of non-vulnerability training samples;

and the second obtaining sub-module is used for obtaining the vulnerability training samples from the clusters of vulnerability training samples and/or obtaining the non-vulnerability training samples from the clusters of non-vulnerability training samples.

Optionally, the acquisition preprocessing module 11 includes the following sub-modules:

a splitting submodule for splitting the source code into a plurality of token streams;

the syntax analysis submodule is used for carrying out syntax analysis on the token streams to obtain a semantic analysis result;

and the generation submodule is used for generating a binary code corresponding to the source code by using the compiler when the semantic analysis result is that the semantic is correct.

Optionally, the split submodule includes the following units:

the extraction unit is used for extracting a plurality of lexemes from the source code by using the scanning program and constructing a corresponding data packet for each lexeme;

and the construction unit is used for constructing a plurality of token streams based on the data packets corresponding to all lexemes.

The device provided by the embodiment of the present invention has the same implementation principle and technical effect as the method embodiments. For brevity, for anything not mentioned in the device embodiment, reference may be made to the corresponding content in the method embodiments.

In another embodiment of the present invention, an electronic device is further provided, which includes a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements the steps of the method of the above method embodiment when executing the computer program.

In yet another embodiment of the invention, a computer-readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the method of the method embodiment is also provided.

The computer program product of the neural-network-based embedded terminal software code vulnerability detection method, device, and electronic device provided by the embodiments of the invention includes a computer-readable storage medium storing program code. The instructions included in the program code can be used to execute the method in the foregoing method embodiments; for the specific implementation, refer to the method embodiments, which are not repeated here.

In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly: for example, a connection may be fixed, removable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific case.

In the description of the present embodiment, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be configured and operated in a specific orientation, and thus, should not be construed as limiting the present embodiment. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in this embodiment, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present embodiment or parts of the technical solution may be essentially implemented in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and they shall be covered by the protection scope of the present invention.

Claims

1. A neural-network-based method for detecting vulnerabilities in embedded terminal software code, characterized by comprising the following steps:

acquiring a source code of target embedded terminal software, and preprocessing the source code to obtain a binary code;

inputting the characteristic function of the binary code into a pre-trained neural network to obtain an attribute control flow graph of the binary code;

comparing the similarity of the attribute control flow graph of the binary code and the attribute control flow graph of the vulnerability code to obtain a comparison result; the vulnerability codes are known vulnerabilities in a preset vulnerability library;

and determining whether the source code of the embedded terminal software has a known bug or not based on the comparison result.

2. The method of claim 1, wherein comparing the similarity of the attribute control flow graph of the binary code and the attribute control flow graph of the vulnerability code to obtain a comparison result comprises:

inputting the attribute control flow graph of the binary code into a Siamese network to obtain a first characteristic vector; wherein the first feature vector is used for representing structural information of an attribute control flow graph of the binary code;

inputting the attribute control flow graph of the vulnerability code into the Siamese network to obtain a second characteristic vector; the second feature vector is used for representing structural information of an attribute control flow graph of the vulnerability code;

and carrying out similarity comparison on the first characteristic vector and the second characteristic vector to obtain a comparison result.

3. The method of claim 2, wherein carrying out similarity comparison on the first feature vector and the second feature vector to obtain the comparison result comprises:

calculating cosine distances of the first feature vector and the second feature vector;

determining the comparison result based on the cosine distance.

4. The method of claim 2, wherein before comparing the similarity of the attribute control flow graph of the binary code and the attribute control flow graph of the vulnerability code, the method further comprises:

acquiring a vulnerability training sample and/or a non-vulnerability training sample; the vulnerability training sample comprises an attribute control flow graph of a vulnerability code, and the non-vulnerability training sample comprises an attribute control flow graph of a binary code;

and training a Siamese network based on the vulnerability training sample and/or the non-vulnerability training sample, and optimizing a target function of the Siamese network by a quasi-Newton method in the process of training the Siamese network to obtain the optimized Siamese network.

5. The method of claim 4, wherein obtaining vulnerability training samples and/or non-vulnerability training samples comprises:

obtaining an original vulnerability training sample and/or an original non-vulnerability training sample;

indexing the original vulnerability training samples by using a locality-sensitive hashing (LSH) distributed indexing method to obtain clusters of vulnerability training samples, and/or indexing the original non-vulnerability training samples by using the LSH distributed indexing method to obtain clusters of non-vulnerability training samples;

and acquiring a vulnerability training sample from the clusters of vulnerability training samples, and/or acquiring a non-vulnerability training sample from the clusters of non-vulnerability training samples.

6. The method of claim 1, wherein preprocessing the source code to obtain binary code comprises:

splitting the source code into a number of token streams;

carrying out grammatical analysis on the token streams to obtain an analysis result;

and when the analysis result is that the semantics are correct, generating a binary code corresponding to the source code by using a compiler.

7. The method of claim 6, wherein splitting the source code into a number of token streams comprises:

extracting a plurality of lexemes from the source code by using a scanning program, and constructing a corresponding data packet for each lexeme;

and constructing a plurality of token streams based on the data packets corresponding to all lexemes.

8. A neural-network-based embedded terminal software code vulnerability detection device, characterized by comprising:

an acquisition preprocessing module, used for acquiring a source code of target embedded terminal software and preprocessing the source code to obtain a binary code;

the first input module is used for inputting the characteristic function of the binary code into a pre-trained neural network to obtain an attribute control flow graph of the binary code;

the second input module is used for comparing the similarity of the attribute control flow graph of the binary code and the attribute control flow graph of the vulnerability code to obtain a comparison result; the vulnerability codes are known vulnerabilities in a preset vulnerability library;

and the determining module is used for determining whether the source code of the embedded terminal software has a known bug or not based on the comparison result.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the method of any of claims 1 to 7.