CN110943981B - Cross-architecture vulnerability mining method based on hierarchical learning - Google Patents

Cross-architecture vulnerability mining method based on hierarchical learning

Info

Publication number
CN110943981B
CN110943981B (application CN201911142076.7A)
Authority
CN
China
Prior art keywords
function
hierarchical
model
learning
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911142076.7A
Other languages
Chinese (zh)
Other versions
CN110943981A
Inventor
吴昊
康绯
卜文娟
Current Assignee
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force filed Critical Information Engineering University of PLA Strategic Support Force
Priority to CN201911142076.7A
Publication of CN110943981A
Application granted
Publication of CN110943981B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00: Network architectures or network communication protocols for network security
    • H04L 63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1433: Vulnerability analysis
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00: Network architectures or network communication protocols for network security
    • H04L 63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L 63/1416: Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of network information security and relates in particular to a cross-architecture vulnerability mining method based on hierarchical learning, comprising the following steps: acquiring training sample data; constructing a hierarchical learning model; cloning the hierarchical learning model to obtain a cloned hierarchical learning model; inputting the feature information of the training-sample function pairs into the hierarchical learning model and its clone respectively for training, computing the similarity of the high-dimensional function-feature vectors produced by the two models, and adjusting the parameters and weights of the hierarchical model against the similarity labels to obtain a trained hierarchical model for target-function vulnerability mining; and, for a target function, extracting the function feature information and function call relations that serve as model input and completing vulnerability mining on the target function with the trained hierarchical model. Through two-level machine learning and rich feature extraction, the method greatly improves the efficiency and precision of large-scale vulnerability search and is of significant value for network information security.

Description

Cross-architecture vulnerability mining method based on hierarchical learning
Technical Field
The invention belongs to the technical field of network information security, and particularly relates to a cross-architecture vulnerability mining method based on hierarchical learning.
Background
Genius learns high-level feature representations from the control flow graph (CFG) and encodes the graph as an embedding (a high-dimensional numerical vector). Genius uses a graph-matching algorithm to cluster similar functions, extracting CFG features that remain robust across architectures and compilation environments, generating a codebook from them, and producing function embeddings from that codebook. It then builds a firmware database and a vulnerability-function database and uses locality-sensitive hashing (LSH) for large-scale vulnerability search. However, its embedding generation is inefficient, and its search accuracy is insufficient for large-scale vulnerability search across millions of firmware images. Gemini proposes generating binary-function embeddings for similarity detection with a deep neural network, which improves accuracy and efficiency to a degree: it extracts robust cross-architecture function features and feeds the extracted basic-block-level features and a representation of the CFG structure to a DNN model; through multiple iterations of Structure2Vec, basic-block node features propagate to the nodes associated with them, and the representations of all basic-block nodes are aggregated into a high-dimensional vector representation of the function. However, this method does not overcome the limitations of Genius's CFG-based matching, does not fully account for the influence of different compilation options on CFG structure, and its search accuracy is still insufficient for large-scale vulnerability search across millions of firmware images.
Disclosure of Invention
Therefore, the invention provides a cross-architecture vulnerability mining method based on hierarchical learning, which improves the accuracy and precision of large-scale vulnerability search through rich feature extraction on top of two-level hierarchical learning.
According to the design scheme provided by the invention, a cross-architecture vulnerability mining method based on hierarchical learning is provided, which comprises the following steps:
acquiring the assembly program corresponding to a binary function through disassembly, extracting the feature information and call relations of the assembly functions in the program through graph description, pairing the assembly functions and attaching similarity labels to form function pairs as training sample data;
constructing a hierarchical learning model comprising an intra-function level learning module built on a deep neural network and an inter-function level learning module built on a graph attention network, and cloning the hierarchical learning model to obtain a cloned hierarchical learning model;
inputting the feature information of the assembly functions in each training-sample function pair into the hierarchical learning model and its clone respectively; training the intra-function level learning modules of the two models with the feature information of each function in the pair to obtain function feature vectors as intermediate embeddings; feeding the call relations and intermediate embedding of each function in the pair to the inter-function feature learning modules of the two models for training, obtaining high-dimensional vector representations of the function features; computing the similarity of the high-dimensional vectors produced by the hierarchical learning model and its clone, and adjusting the parameters and weights of the hierarchical model against the similarity labels to obtain a trained hierarchical model for target-function vulnerability mining;
and, for the target function, acquiring the corresponding assembly program through disassembly, creating a control flow graph for the assembly functions in the program, extracting through graph description the function feature information and call relations that serve as model input, and completing vulnerability mining on the target function with the trained hierarchical model.
In the graph description of this cross-architecture vulnerability mining method, further, a control flow graph, a data flow graph and a function call graph are created for the assembly program. Function instructions are divided into basic blocks, and the nodes of the control flow graph and data flow graph consist of these basic blocks. Data-transfer information is attached to the control flow graph structure: a symbolic label indicates whether data transfer exists between two basic blocks, edges between nodes represent the direction of control flow, and edge labels record whether data transfer exists between the connected basic blocks. The function call graph is a directed graph whose edges represent call relations and whose nodes are represented by each function's control flow graph annotated with data flow information.
In this cross-architecture vulnerability mining method, further, determining the data-transfer information comprises: whether data transfer exists between two basic blocks is determined by checking whether instructions in the two blocks access the same address register.
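The register check above can be sketched as follows. This is a minimal illustration under a simplified instruction representation (each instruction as a mnemonic plus a list of register-name operands); the function names and the toy ARM-style instructions are assumptions, not the patent's implementation, which works on real disassembly.

```python
def accessed_registers(block):
    """Collect the address registers accessed by a block's instructions.

    Each instruction is a (mnemonic, operands) pair, where operands is a
    list of register names -- a simplified stand-in for disassembly output.
    """
    regs = set()
    for _mnemonic, operands in block:
        regs.update(operands)
    return regs

def has_data_transfer(block_a, block_b):
    """Two basic blocks are marked as having a data transfer when their
    instructions access at least one common address register."""
    return bool(accessed_registers(block_a) & accessed_registers(block_b))

# Toy example: both blocks touch register r1, so the CFG edge gets label 1.
bb1 = [("ldr", ["r0", "r1"]), ("add", ["r0", "r2"])]
bb2 = [("str", ["r1", "r3"])]
edge_label = 1 if has_data_transfer(bb1, bb2) else 0
```

The 0/1 edge label computed this way is what the method attaches to CFG edges to enrich the function semantics.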
In this cross-architecture vulnerability mining method, further, in the function call graph, the names of dynamically loaded third-party library functions are obtained by extracting the import table of the binary executable; caller-callee relations between functions are represented as directed unweighted edges; and an adjacency matrix of the call relations is constructed from the function call graph.
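Building the adjacency matrix from the directed, unweighted call edges can be sketched as below. The function tags and the toy call graph are illustrative assumptions; in the method, tags come from the import table (for dynamically loaded library functions) or from local start addresses.

```python
def call_graph_adjacency(functions, calls):
    """Build the adjacency matrix of the function call graph.

    functions: ordered list of function tags (import-table names or
               local start addresses).
    calls:     iterable of (caller, callee) pairs -- directed,
               unweighted edges of the call graph.
    """
    index = {name: i for i, name in enumerate(functions)}
    n = len(functions)
    adj = [[0] * n for _ in range(n)]
    for caller, callee in calls:
        adj[index[caller]][index[callee]] = 1
    return adj

funcs = ["main", "parse", "strcpy"]          # "strcpy" taken from the import table
edges = [("main", "parse"), ("parse", "strcpy")]
A = call_graph_adjacency(funcs, edges)
# A[0][1] == 1 records that main calls parse
```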
In this cross-architecture vulnerability mining method, further, a model-oriented genetic algorithm is used to select the function feature information; in the genetic algorithm, a population represents a subset of function features and a generation represents one iteration round.
In this cross-architecture vulnerability mining method, further, extracting function feature information with the genetic algorithm comprises: initializing the mating set and offspring, and passing function pairs into the model to obtain the population fitness; ranking by fitness, selecting the population with random sampling with replacement, crossover and mutation, updating, and fixing the final feature selection after multiple iterations.
In this cross-architecture vulnerability mining method, further, Structure2Vec is used in the intra-function level learning module: the vertex features of the control flow graph are recursively and nonlinearly aggregated according to the graph topology; each iteration generates a multidimensional embedding containing a vertex's neighborhood information and the features of vertices having data transfers with it, and once an embedding has been generated for every vertex, the embedding vector of the control flow graph is aggregated from them.
In this cross-architecture vulnerability mining method, further, the embeddings are updated in each iteration through a nonlinear transformation defined by a fully connected network; after the iterations, each vertex's features have propagated to the nodes associated with it, and every vertex embedding contains contextual semantics.
In this cross-architecture vulnerability mining method, further, a shared attention mechanism computes importance in the inter-function level learning module: attention coefficients are obtained for the nonlinear transformation that produces the function embedding, and, combined with a parameterized weight matrix shared across feature vertices, the module outputs a function embedding that incorporates the semantics of the called functions.
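As an illustration of the shared attention mechanism, here is a minimal single-layer sketch in the style of a graph attention network. All names (gat_layer, W, a), the tanh output nonlinearity, and the toy dimensions are assumptions for the sketch, not the patent's actual network.

```python
import numpy as np

def gat_layer(H, A, W, a, alpha=0.2):
    """One graph-attention layer (sketch of the shared attention mechanism).

    H: (n, f) node features, A: (n, n) adjacency (1 = call edge),
    W: (f, f2) shared parameterized weight matrix,
    a: (2*f2,) shared attention vector.
    """
    Wh = H @ W                                   # shared linear transform
    n = Wh.shape[0]
    out = np.zeros_like(Wh)
    for i in range(n):
        nbrs = [j for j in range(n) if A[i][j] or j == i]  # keep a self-loop
        # e_ij = LeakyReLU(a . [Wh_i || Wh_j]) for each neighbour j
        e = np.array([np.concatenate([Wh[i], Wh[j]]) @ a for j in nbrs])
        e = np.where(e > 0, e, alpha * e)        # LeakyReLU
        att = np.exp(e - e.max()); att /= att.sum()   # softmax -> attention coefficients
        out[i] = np.tanh(sum(c * Wh[j] for c, j in zip(att, nbrs)))
    return out

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
A = np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]])
emb = gat_layer(H, A, rng.normal(size=(4, 4)), rng.normal(size=(8,)))
```

Each output row is a weighted combination of the node itself and its callees, so the embedding of a function absorbs the semantics of the functions it calls.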
In this cross-architecture vulnerability mining method, further, function similarity is computed with the cosine distance, and model quality is evaluated by comparing this similarity with the similarity labels.
The invention has the following beneficial effects:
Two-level machine learning and rich feature extraction greatly improve the efficiency and precision of large-scale vulnerability search. By introducing the concept of a structured function signature, describing the binary code of a target function with graphs and extracting signature information from them, adding the semantics of function call relations, selecting features with a model-oriented feature selection method, and enriching intra-function semantics with function data-flow information, the accuracy and precision of vulnerability search are significantly improved while the search efficiency remains sufficient for real-world large-scale vulnerability search. This is of significant value for network information security.
Description of the drawings:
FIG. 1 is a schematic diagram of a cross-architecture vulnerability mining method in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a hierarchical model according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an intra-function level learning module according to an embodiment of the present invention;
FIG. 4 is a block diagram of an inter-function level learning module in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of a self-attention mechanism according to an embodiment of the present invention.
Detailed description:
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and technical solutions.
The prior art extracts and uses the characteristics of binary functions to cope with the large differences between binaries produced in different compilation environments. To this end, an embodiment of the present invention provides a cross-architecture vulnerability mining method based on hierarchical learning, as shown in FIG. 1, comprising:
S101, acquiring the assembly program corresponding to a binary function through disassembly, extracting the feature information and call relations of the assembly functions in the program through graph description, pairing the assembly functions and attaching similarity labels to form function pairs as training sample data;
S102, constructing a hierarchical learning model comprising an intra-function level learning module built on a deep neural network and an inter-function level learning module built on a graph attention network, and cloning the hierarchical learning model to obtain a cloned hierarchical learning model;
S103, inputting the feature information of the assembly functions in each training-sample function pair into the hierarchical learning model and its clone respectively; training the intra-function level learning modules of the two models with the feature information of each function in the pair to obtain function feature vectors as intermediate embeddings; feeding the call relations and intermediate embedding of each function in the pair to the inter-function feature learning modules of the two models for training, obtaining high-dimensional vector representations of the function features; computing the similarity of the high-dimensional vectors produced by the two models, and adjusting the parameters and weights of the hierarchical model against the similarity labels to obtain a trained hierarchical model for target-function vulnerability mining;
S104, for the target function, acquiring the corresponding assembly program through disassembly, creating a control flow graph for the assembly functions in the program, extracting through graph description the function feature information and call relations that serve as model input, and completing vulnerability mining on the target function with the trained hierarchical model.
Rich feature extraction on top of two-level hierarchical learning improves both the precision of large-scale vulnerability search and the accuracy of vulnerability detection.
In the graph description of the embodiment, further, a control flow graph, a data flow graph and a function call graph are created for the assembly program. Function instructions are divided into basic blocks, and the nodes of the control flow graph and data flow graph consist of these basic blocks. Data-transfer information is attached to the control flow graph structure: a symbolic label indicates whether data transfer exists between two basic blocks, edges between nodes represent the direction of control flow, and edge labels record whether data transfer exists between the connected basic blocks. The function call graph is a directed graph whose edges represent call relations and whose nodes are represented by each function's control flow graph annotated with data flow information.
Control flow graphs and data flow graphs describe the control and data flow of functions. The instructions of a function are divided into basic blocks, and the nodes of the control flow graph (CFG) and the data flow graph (DFG) consist of these basic blocks. In embodiments, data-transfer information is therefore appended to the CFG structure, with labels 0 and 1 indicating whether data transfer exists between two basic blocks. Edges between CFG nodes represent the direction of control flow, and the label on a CFG edge records whether data transfer exists between the two blocks it connects. Function control flow and data flow are fairly robust across architectures, operating systems and compiler optimization levels, so the influence of these variations can be mitigated by extracting control- and data-flow semantics as function features, i.e., extracting common features that are independent of platform and compilation settings. In an embodiment of the invention, a model-oriented genetic algorithm is further used to select the function features; in the genetic algorithm, a population represents a feature subset and a generation represents one iteration round. The function call graph describes the call relations among the functions under analysis and covers the call relations of the entire binary file. Each node represents a function, and each function node may be represented by the function's CFG with data flow information. Even across compilation environments, the call relations between functions are a very robust feature: the functions called by a given function do not differ between environments.
Therefore, in an embodiment of the invention, the call relations between functions are combined with the idea of community classification in social networks, and the influence of calling functions on function identification is taken into account, so as to generate a more accurate function feature representation.
Describing binary functions with these three graphs and generating signature information from them allows basic-block-level features to be extracted that remain stable across optimization options and architectures. By introducing the concept of a structured function signature, describing the binary code of the target function with graphs, and combining the data-flow-annotated CFG with the calling-function information in the call graph as the function's signature factors, signature information is extracted from the graphs in a form convenient for comparison, improving the precision and efficiency of feature extraction.
Two neural networks are used to learn the two-level features of a function. The intra-function level learning model is based on Gemini, with data flow information added to capture more function semantics. Because the number of function calls is variable and call relations are not well expressed by the Gemini model, handling call relations is harder, and a new mechanism is needed to learn the semantics of function calls. Inspired by social networks, embodiments of the present invention adopt the graph attention network (GAT) to solve this problem.
A deep neural network (DNN) is trained to learn the basic-block-level features of a function, while the GAT is trained to learn the influence of call relations on the function, producing a high-dimensional feature vector with more precise semantics. Finally, the similarity between functions is measured by computing the distance between their feature vectors, from which vulnerable functions are identified.
Existing semantic learning methods rely on a function's CFG, extract features of each basic block, and compare similarity based on those features. Here, features are extracted by first disassembling the binary file into the corresponding assembly program with IDA Pro; a CFG is then created for each assembly function using the IDAPython interface provided by IDA Pro. In embodiments of the invention, the MIASM plug-in of IDA Pro may also be used to determine whether data transfer exists between two basic blocks. The features extracted in Gemini may be adjusted to account for the effects of the compilation environment on the assembly code. Unlike Gemini, which directly adopts the function features that DiscovRE selected for a graph-matching algorithm, an embodiment of the invention designs a model-oriented genetic algorithm and reselects function features better suited to the model. Extracting function features with the genetic algorithm comprises: initializing the mating set and offspring, and passing function pairs into the model to obtain the population fitness; ranking by fitness, selecting the population with random sampling with replacement, crossover and mutation, updating, and fixing the final selection after multiple iterations. For example, 50 candidate function features are extracted and the 9 best-performing features are selected. Algorithm 1 gives the model-oriented genetic algorithm for selecting binary function features:
[Algorithm 1 (image in original): model-oriented genetic algorithm for binary function feature selection]
In the algorithm, a "population" is a selected subset of the function features, and a "generation" is one iteration round. For each population, a mating set and offspring are first initialized; function pairs carrying ground-truth labels are passed into the model, and the fitness of the population is obtained. The populations are then ranked by fitness, and random sampling with replacement, crossover and mutation are used to select the next population, which is then updated. After T generations, the best-performing subset is taken as the final selection. Table 1 lists the initial features of each basic block: 8 statistical features and 1 structural feature. The initial 9-dimensional feature vector of each basic block of a function is input to the model to generate the function's semantic embedding vector.
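The genetic-algorithm loop described above can be sketched as follows. This is a hedged illustration: the fitness function here is a placeholder for training the hierarchical model on labeled function pairs and scoring the result, and the population size, mutation rate, and crossover scheme are assumptions, not values from the patent.

```python
import random

def genetic_select(num_features, subset_size, fitness, generations=20,
                   pop_size=12, mutation_rate=0.1, seed=0):
    """Model-oriented genetic algorithm for feature-subset selection.

    fitness(subset) stands in for passing labeled function pairs through
    the model and measuring how well this feature subset performs.
    """
    rng = random.Random(seed)
    # each population member is a candidate feature subset
    pop = [rng.sample(range(num_features), subset_size) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)   # rank by fitness
        # selection: random sampling with replacement, biased to the fitter half
        parents = [rng.choice(scored[:pop_size // 2]) for _ in range(pop_size)]
        nxt = []
        for i in range(0, pop_size, 2):
            cut = rng.randrange(1, subset_size)           # one-point crossover
            head = parents[i][:cut]
            child = head + [f for f in parents[i + 1] if f not in head]
            child = child[:subset_size]
            if rng.random() < mutation_rate:              # mutation: swap one feature
                child[rng.randrange(subset_size)] = rng.randrange(num_features)
            nxt += [child, parents[i]]
        pop = nxt[:pop_size]                              # update the population
    return max(pop, key=fitness)

# Toy fitness preferring low-index features (a placeholder for model accuracy):
best = genetic_select(50, 9, fitness=lambda s: -sum(s))
```

With 50 candidate features and a subset size of 9, the loop mirrors the paper's setting of keeping the 9 best-performing features after T generations.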
[Table 1 (image in original): initial features of each basic block, 8 statistical features and 1 structural feature]
Data flow is, moreover, a reliable representation: data flow between basic blocks captures data and variable dependencies, and data flow analysis helps locate inlined code. To align with the CFG structure, the data flow information is appended to the basic blocks of the CFG: CFG edges are labeled 0 or 1 to indicate whether data transfer exists between the two basic blocks, enriching the function semantics. Data transfer between two blocks is determined by checking whether instructions in the blocks access the same address register. Because this representation considers both the control structure and the data flow within a function, it effectively mitigates structural changes of the CFG caused by different compilation environments. However, features extracted this way from a binary file cannot capture the interaction between functions, and a large amount of structural information is lost. To recover it, a new structured function signature is introduced that uses both the call relations between functions and each function's internal control flow information as descriptive features. The function call graph (FCG) is a directed graph whose vertices represent functions and whose directed edges represent call relations. When extracting the FCG, the function names of dynamically loaded third-party libraries are obtained as tags by reading the import table of the binary executable; for a local function, its start address may serve as the tag. Caller-callee relations are represented as directed unweighted edges, and finally an adjacency matrix of the call relations is constructed from the FCG.
In the model, function features are extracted considering only the internal functions of the file and statically linked third-party library functions; the function call graph forms an adjacency list representing the call relations of the binary file, and the extracted features constitute the structured signature of the function.
Training the model requires finding a mapping φ that converts a binary function into a high-dimensional feature vector. Given two binary functions f1 and f2, their similarity is judged by a given similarity function Sim(·,·): if the two functions are similar, Sim(φ(f1), φ(f2)) = 1; if they are not similar, Sim(φ(f1), φ(f2)) = -1.
The model attempts to learn how to generate function embeddings, i.e. the mapping φ, through deep learning. Unlike Gemini, it fully considers the two-level features of a function: the features inside the function and the call relations between functions. In this embodiment, a vector carrying richer function semantics is generated by building a learning model of these two-level features and integrating their training effectively, further improving accuracy.
Referring to FIG. 2, the input is the CFG containing data-flow information extracted from the firmware binary code together with the extracted basic-block-level features, where N is the number of functions in the binary code. These inputs are fed into the intra-function feature learning model to obtain intermediate embeddings; 5 iterations are performed in the intra-function feature learning model, with 2 fully connected neural networks per iteration to train the features. The intermediate embeddings are then fed into the inter-function feature learning model which, as shown in FIG. 2, has 3 hidden layers. The dependency relationships between functions are determined from the adjacency matrix, the influence of neighboring nodes on each function node is calculated through an attention mechanism, and a final graph-independent representation containing function-call information is generated. By obtaining a mapping φ, each binary function is converted into a high-dimensional representation using deep learning.

Next, the similarity between functions is described through their high-dimensional representations, and the model is trained in this way. In data preprocessing, the function set is divided into L function pairs. If a pair consists of two identical functions compiled from the same source code in different compilation environments, i.e. Sim(φ(f_i), φ(f_i')), the ground-truth label is assigned y_i = +1; otherwise, for a pair of two different functions, i.e. Sim(φ(f_i), φ(f_j)), the ground-truth label is assigned y_i = −1. Furthermore, the similarity of two functions can be described by the cosine distance as

Sim(φ(f_i), φ(f_j)) = (φ(f_i) · φ(f_j)) / (‖φ(f_i)‖ ‖φ(f_j)‖)

In the training phase, the model evaluates the quality of the mapping by comparing the generated similarity with the ground-truth label of each function pair. The Mean Square Error (MSE) can be used as the metric:

MSE = (1/L) Σ_{i=1}^{L} (Sim(φ(f_i), φ(f_i')) − y_i)²
Then, the shared parameters W_1, W_2, P_1, …, P_n in the intra-function model and the shared parameters W and a in the inter-function model are trained to minimize the MSE above. Furthermore, to improve the generalization capability of the model, L2 regularization and Dropout are added to prevent over-fitting. The MSE is optimized using a stochastic gradient descent algorithm. Finally, once the optimal values of the shared parameters are obtained, any function can easily be converted into a high-dimensional representation for function-similarity comparison.
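As an illustration only, the pairwise training objective described above (cosine similarity of the two siamese embeddings compared against a ±1 ground-truth label under an MSE loss) can be sketched as follows; the function and variable names are hypothetical, and the embeddings stand in for the model outputs φ(f):

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two function embeddings."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pairwise_mse_loss(emb_a, emb_b, labels):
    """MSE between the cosine similarities of L function pairs and their
    ground-truth labels (+1 = same source function, -1 = different)."""
    sims = np.array([cosine_sim(a, b) for a, b in zip(emb_a, emb_b)])
    return float(np.mean((sims - np.asarray(labels)) ** 2))

# Toy example: one identical pair (label +1) and one orthogonal pair (label -1).
emb_a = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
emb_b = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
loss = pairwise_mse_loss(emb_a, emb_b, labels=[1.0, -1.0])  # -> 0.5
```

In the actual model this loss would be minimized by stochastic gradient descent over the shared parameters, with L2 regularization and Dropout applied as described above.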
The intra-function feature learning model is an improvement of Gemini's Structure2Vec model; its main purpose is to obtain a high-dimensional vector representing the function, such as its control flow and data flow, i.e., to generate an embedding. Structure2Vec is inspired by graph-model inference algorithms: the features of the vertices are recursively and non-linearly aggregated according to the graph topology. After sufficient iterations have been performed, each vertex contains information about its neighboring vertices. After extracting the basic-block-level 9-dimensional feature representation of each function in the target binary file, the features are input into the learning model to generate semantic embeddings for similarity computation.
In FIG. 3, (a) is a CFG with data transmission, denoted g, containing 3 basic blocks with basic-block-level features x_v, where V and ξ are the sets of vertices and edges, respectively. After T iteration layers, the DNN model generates a p-dimensional embedding for each vertex v ∈ V, and each iteration produces a p-dimensional vertex feature μ_v containing its neighborhood information and the feature information of the vertices that have data transmission with it. After the embedding μ_v^(T) has been generated for each vertex, the embedding vector μ of g is computed by the aggregation formula

μ = W_2 Σ_{v ∈ V} μ_v^(T)

FIG. 3(b) shows the method of updating the embeddings in each iteration. Further, an embodiment of the present invention provides an embedding-update method, whose specific content may be designed as follows:

μ_v^(t) = tanh(W_1 x_v + σ_c(Σ_{u ∈ N(v)} μ_u^(t−1)) + σ_d(Σ_{u ∈ D(v)} μ_u^(t−1)))
where x_v is the d-dimensional feature vector of each vertex and W_1 is a d × p matrix. N(v) denotes the set of control-flow neighbor vertices of vertex v, and D(v) denotes the set of its neighbor vertices with data transmission. Further, σ_c and σ_d are two non-linear transformations σ(·), each defined by an n-layer fully connected network, so as to realize the process of collecting the features of other vertices, e.g.

σ_c(l) = P_1 · ReLU(P_2 · … ReLU(P_n · l))

σ_d(l) = P_1' · ReLU(P_2' · … ReLU(P_n' · l))
where P_i (i = 1, …, n) and P_i' (i = 1, …, n) are p × p matrices, n is the embedding depth, and ReLU is the activation function. Algorithm 2 outlines the overall algorithm for generating the hierarchical embedding within a function; the specific content is as follows:
[Algorithm 2 is shown as an image in the original publication.]
After T iterations, the features of each vertex have propagated to the other vertices associated with it, and each vertex embedding contains contextual semantics. Compared with existing work such as DiscovRE, Genius and Gemini, the intra-function feature learning model captures more functional semantic information by adding data-flow information.
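A minimal sketch of the iterative embedding update described above, assuming (as in Gemini's Structure2Vec, which this model extends) a tanh combination of the vertex feature with the aggregated control-flow and data-flow messages. All names and shapes are illustrative; note that W_1 is kept here as a p × d matrix so it can left-multiply x_v, and the final W_2 projection of the aggregated embedding is omitted for brevity:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigma(l, Ps):
    """n-layer transform sigma(l) = P1 . ReLU(P2 . ... ReLU(Pn . l))."""
    out = l
    for P in reversed(Ps[1:]):   # apply Pn ... P2, each followed by ReLU
        out = relu(P @ out)
    return Ps[0] @ out           # outermost P1, no activation

def embed_function(X, cfg_nbrs, dfg_nbrs, W1, Pc, Pd, T=5):
    """T rounds of message passing over control-flow neighbors N(v) and
    data-flow neighbors D(v), then summation into a graph embedding."""
    n, p = X.shape[0], W1.shape[0]
    mu = np.zeros((n, p))
    for _ in range(T):
        new_mu = np.zeros_like(mu)
        for v in range(n):
            c = sum((mu[u] for u in cfg_nbrs[v]), np.zeros(p))
            d = sum((mu[u] for u in dfg_nbrs[v]), np.zeros(p))
            new_mu[v] = np.tanh(W1 @ X[v] + sigma(c, Pc) + sigma(d, Pd))
        mu = new_mu
    return mu.sum(axis=0)

# Single-vertex toy CFG with no neighbors: embedding reduces to tanh(W1 @ x).
I = np.eye(2)
emb = embed_function(np.array([[1.0, 0.0]]), {0: []}, {0: []}, I, [I, I], [I, I])
```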
The function-call relation carries strong semantics across different compilation environments. Handling function-call relationships is more difficult because the number of function calls is variable and the relationships between function calls are not well expressed in the Gemini model. Thus, a new mechanism or model is introduced to learn the semantic information of function calls. Ideally, the model should have the following characteristics:
(1) the model may act on the neighborhood of nodes, i.e. function calls;
(2) the model may assign different importance to adjacent nodes of the function;
(3) the model is suitable for generalizing problems and can handle any untrained graph structure.
It should be noted that a function does not have a fixed number of adjacent nodes in the call-relation matrix. Therefore, inspired by work on social networks, the model is built as a modification of the GAT network when constructing the embedding network of the function-call relation. GAT addresses several problems in graph convolution using a hidden self-attention layer: no complex matrix operations or prior knowledge of the graph structure are required. By stacking self-attention layers, different importance is assigned to the different neighboring nodes during the convolution process. In addition, because of its edge mechanism, GAT is independent of the global graph structure and can therefore easily be applied to inductive problems. Inspired by the GAT model, the inter-function learning model proceeds as follows. As shown in FIG. 4, the input is a set of function embeddings, i.e. the high-dimensional representations of the functions generated in the previous stage,

h = {h_1, h_2, …, h_N}, h_i ∈ R^F

where N is the number of functions and F is the dimension of the features of each function. In addition, M denotes the adjacency matrix of the binary file's function calls. Considering the first-order neighbor node set C_i of each function node i, the model generates a final function vector representation

h' = {h_1', h_2', …, h_N'}, h_i' ∈ R^{F'}

according to the attention coefficient α_ij of function node i for each called function node j.
In order to convert the input features into a higher-level function embedding that contains the semantics of the called functions, a way is needed to indicate the importance of each function node that function node i calls. Self-attention over the function nodes, i.e. a shared attention mechanism

a: R^{F'} × R^{F'} → R

is used to calculate the importance; it is expressed as an attention coefficient

e_ij = a(W h_i, W h_j)

where h_i and h_j are the initial features of the two functions and W is a shared parameterized weight matrix applied to every feature vertex for linear transformation. Once the normalized attention coefficients α_ij are obtained, the final function embedding is computed with a non-linear transformation σ(·) as

h_i' = σ(Σ_{j ∈ C_i} α_ij W h_j)

where C_i is the set of functions called by function i.
The weight matrix W ∈ R^{F' × F} is initialized as a linear transformation shared between the function nodes. Each vertex also needs a self-attention mechanism to compute the importance of function vertex j to vertex i, i.e., a way to obtain the importance between two functions from their initial feature vectors.
An overview of the self-attention mechanism is shown in FIG. 5. In this embodiment, the attention mechanism is a single-layer feed-forward neural network parameterized by a weight vector a ∈ R^{2F'}; the importance of a node can be set as

e_ij = LeakyReLU(aᵀ [W h_i ‖ W h_j])

where LeakyReLU is a non-linear activation function, ·ᵀ denotes transposition, and ‖ denotes the concatenation operation. Considering efficiency, only the first-order neighbors are used to calculate e_ij, and Softmax is used to normalize e_ij:

α_ij = softmax_j(e_ij) = exp(e_ij) / Σ_{k ∈ C_i} exp(e_ik)
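The coefficient computation just described (a LeakyReLU score over the concatenated transformed features, followed by a softmax over the called functions) can be sketched as follows; the names and the 0.2 negative slope are assumptions:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def attention_coeffs(h_i, h_called, W, a):
    """alpha_ij = softmax_j(LeakyReLU(a^T [W h_i || W h_j])) over j in C_i."""
    wi = W @ h_i
    e = leaky_relu(np.array(
        [float(a @ np.concatenate([wi, W @ hj])) for hj in h_called]))
    exp_e = np.exp(e - e.max())   # subtract max for numerical stability
    return exp_e / exp_e.sum()

# Two identical called functions receive equal attention.
W, a = np.eye(2), np.ones(4)
alphas = attention_coeffs(np.array([1.0, 0.0]),
                          [np.array([0.0, 1.0]), np.array([0.0, 1.0])], W, a)
# -> [0.5, 0.5]
```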
Now that the attention coefficients have been obtained, the final function representation can be calculated by the following formula:

h_i' = σ(Σ_{j ∈ C_i} α_ij W h_j)

ELU and softmax are chosen as σ(·). In order to ensure the stability of the attention-mechanism learning process, a multi-head mechanism is added: K independent attention mechanisms are executed simultaneously, and the representation of each function is generated by averaging,

h_i' = σ((1/K) Σ_{k=1}^{K} Σ_{j ∈ C_i} α_ij^k W^k h_j)
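Putting the pieces together, here is a minimal sketch of one inter-function layer with K averaged attention heads (a simplified, illustrative variant of a GAT layer; the names, shapes, and the use of ELU are assumptions, and functions with no callees are left as zero vectors):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def gat_layer(H, call_sets, Ws, a_vecs):
    """h_i' = ELU((1/K) * sum_k sum_{j in C_i} alpha_ij^k W^k h_j)."""
    K, N, F_out = len(Ws), H.shape[0], Ws[0].shape[0]
    out = np.zeros((N, F_out))
    for i in range(N):
        C_i = call_sets[i]
        if not C_i:                      # no callees: leave a zero vector
            continue
        acc = np.zeros(F_out)
        for W, a in zip(Ws, a_vecs):     # one pass per attention head
            wi = W @ H[i]
            e = leaky_relu(np.array(
                [a @ np.concatenate([wi, W @ H[j]]) for j in C_i]))
            alphas = np.exp(e - e.max())
            alphas /= alphas.sum()       # softmax over the called functions
            acc += sum(al * (W @ H[j]) for al, j in zip(alphas, C_i))
        out[i] = elu(acc / K)
    return out

# Two functions that call each other, one identity head with zero attention
# weights: every callee gets uniform attention and h_i' = ELU(h_j).
H = np.array([[1.0, 0.0], [0.0, 1.0]])
out = gat_layer(H, {0: [1], 1: [0]}, [np.eye(2)], [np.zeros(4)])
```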
The inter-function feature learning model implementation is presented as Algorithm 3:

[Algorithm 3 is shown as an image in the original publication.]
Unless specifically stated otherwise, the components, the relative arrangement of steps, and the numerical expressions and values set forth in these embodiments do not limit the scope of the present invention.
Based on the foregoing method, an embodiment of the present invention further provides a server, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described above.
Based on the above method, the embodiment of the present invention further provides a computer readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the above method.
The system/apparatus provided by the embodiment of the present invention has the same implementation principle and technical effect as the foregoing method embodiments, and for the sake of brief description, no mention is made in the system/apparatus embodiments, and reference may be made to the corresponding contents in the foregoing method embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the system/apparatus described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
In all examples shown and described herein, any particular value should be construed as merely exemplary, and not as a limitation, and thus other examples of example embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A cross-architecture vulnerability mining method based on hierarchical learning is characterized by comprising the following contents:
acquiring assembly programs corresponding to binary functions through disassembly, extracting function feature information and function call relations of the assembly functions in the assembly programs through graph description, combining the assembly functions in pairs and adding similarity labels to form function pairs as training sample data;
constructing a hierarchical learning model, wherein the hierarchical learning model comprises an intra-function level learning module constructed based on a deep neural network and an inter-function level learning module constructed based on a graph attention network; and cloning the hierarchical learning model to obtain a cloned hierarchical learning model;
respectively inputting the feature information corresponding to the assembly functions of each training-sample function pair into the hierarchical learning model and the cloned hierarchical learning model; training the intra-function level learning modules in the two models with the function feature information of each function in the function pair, and obtaining function feature vectors as intermediate embeddings; feeding the function call relation of each function in the function pair together with the intermediate-embedding function feature vectors into the inter-function feature learning modules in the two models for training, and obtaining high-dimensional vector representations of the function features; and calculating the similarity of the function-feature high-dimensional vector representations obtained by the hierarchical learning model and the cloned hierarchical learning model, and adjusting the parameters and weights of the hierarchical model according to the similarity labels, so as to obtain a trained hierarchical model for target-function vulnerability mining;
and, for a target function, acquiring an assembly program corresponding to the target function through disassembly, creating a control flow graph for the assembly functions in the assembly program, extracting the function feature information and function call relations used as the input of the hierarchical model through graph description, and completing target-function vulnerability mining through the trained hierarchical model.
2. The method for mining the cross-architecture vulnerability based on the hierarchical learning of claim 1, characterized in that in the graph description, a control flow graph, a data flow graph and a function call graph are created for an assembler, function instructions are divided into basic blocks, nodes of the control flow graph and the data flow graph are composed of different basic blocks, data transmission information is attached to the structure of the control flow graph, a symbolic mark for indicating whether data transmission exists between the two basic blocks is added, edges between the nodes in the control flow graph represent the direction of a control flow, and the labels of the edges represent whether data transmission exists between the basic blocks; the function call graph is represented by a directed graph, graph edges represent function call relations, and graph nodes are represented by a function control flow graph with data flow information.
3. The method of claim 2, wherein determining data transmission information further comprises: whether there is a data transfer between two basic blocks is determined by checking whether instructions in the basic blocks access the same address register.
4. The method for mining the cross-architecture vulnerability based on the hierarchical learning of claim 2, wherein in the function call graph, for a dynamically loaded third-party function library, the function name is obtained by extracting an import table of a binary executable file; representing the call-called relation between functions as a directed unweighted edge; and constructing an adjacency matrix of the function calling relation according to the function calling graph.
5. The method for mining the cross-architecture vulnerability based on the hierarchical learning according to any one of claims 1 to 4, characterized in that a model-oriented genetic algorithm is adopted to extract function feature information, wherein in the genetic algorithm, a population is used to represent a function feature subset, and a generation is used to represent the number of iteration rounds.
6. The method of claim 5, wherein the function feature information is extracted by a genetic algorithm, and the method comprises the following steps: initializing a pairing set and descendants, and transmitting function pairs into a model to obtain population fitness; and ranking according to population fitness, selecting a population by using random sampling with substitution, crossing and variation, updating, and determining the finally selected function characteristics after multiple iterations.
7. The method of claim 1, wherein in the intra-function level learning module, Structure2Vec is used, vertex features of the control flow graph are recursively and nonlinearly aggregated according to graph topology, multidimensional embedding including neighborhood information and multidimensional vertex features of vertices having data transmission with the vertices is generated each time of iteration, and control flow graph embedding vectors are aggregated after multidimensional embedding is generated for each vertex.
8. The method of claim 7, wherein in each iteration, updating and embedding in the iteration are performed through nonlinear transformation defined by a fully-connected network; after iteration, each vertex feature will propagate to other nodes associated with it, and each vertex embedding contains context semantics.
9. The method of claim 1, wherein in the inter-function level learning module, a shared attention mechanism is used to calculate importance, and obtain attention coefficients for nonlinear transformation to obtain function embedding, and in combination with each feature vertex shared parameterized weight matrix, function embedding containing called function semantics is obtained for use as module input.
10. The method of claim 1, wherein the cosine distance is used to obtain functional similarity, and model quality is evaluated by comparing similarity with similar label differences.
CN201911142076.7A 2019-11-20 2019-11-20 Cross-architecture vulnerability mining method based on hierarchical learning Active CN110943981B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911142076.7A CN110943981B (en) 2019-11-20 2019-11-20 Cross-architecture vulnerability mining method based on hierarchical learning

Publications (2)

Publication Number Publication Date
CN110943981A CN110943981A (en) 2020-03-31
CN110943981B true CN110943981B (en) 2022-04-08

Family

ID=69907040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911142076.7A Active CN110943981B (en) 2019-11-20 2019-11-20 Cross-architecture vulnerability mining method based on hierarchical learning

Country Status (1)

Country Link
CN (1) CN110943981B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111475622A (en) * 2020-04-08 2020-07-31 广东工业大学 Text classification method, device, terminal and storage medium
CN111562943B (en) * 2020-04-29 2023-07-11 海南大学 Code clone detection method and device based on event embedded tree and GAT network
CN111639344B (en) * 2020-07-31 2020-11-20 中国人民解放军国防科技大学 Vulnerability detection method and device based on neural network
CN112308210B (en) * 2020-10-27 2023-04-07 中国人民解放军战略支援部队信息工程大学 Neural network-based cross-architecture binary function similarity detection method and system
CN113204764B (en) * 2021-04-02 2022-05-17 武汉大学 Unsigned binary indirect control flow identification method based on deep learning
CN113240041B (en) * 2021-05-28 2022-11-08 北京理工大学 Binary function similarity detection method fusing influence factors
CN113688036A (en) * 2021-08-13 2021-11-23 北京灵汐科技有限公司 Data processing method, device, equipment and storage medium
CN113821804B (en) * 2021-11-24 2022-03-15 浙江君同智能科技有限责任公司 Cross-architecture automatic detection method and system for third-party components and security risks thereof
CN116028941B (en) * 2023-03-27 2023-08-04 天聚地合(苏州)科技股份有限公司 Vulnerability detection method and device of interface, storage medium and equipment
CN117574393B (en) * 2024-01-16 2024-03-29 国网浙江省电力有限公司 Method, device, equipment and storage medium for mining loopholes of information terminal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229563A (en) * 2016-03-25 2017-10-03 中国科学院信息工程研究所 A kind of binary program leak function correlating method across framework
US20180082064A1 (en) * 2016-09-20 2018-03-22 Sichuan University Detection method for linux platform malware
CN110287702A (en) * 2019-05-29 2019-09-27 清华大学 A kind of binary vulnerability clone detection method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108622B (en) * 2017-12-13 2021-03-16 上海交通大学 Vulnerability detection system based on deep convolutional network and control flow graph


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"VDNS: A Cross-Platform Firmware Vulnerability Association Algorithm"; Chang Qing, Liu Zhongjin, et al.; Journal of Computer Research and Development; 2016-10-15; full text *

Also Published As

Publication number Publication date
CN110943981A (en) 2020-03-31

Similar Documents

Publication Publication Date Title
CN110943981B (en) Cross-architecture vulnerability mining method based on hierarchical learning
Sun et al. What and how: generalized lifelong spectral clustering via dual memory
Garreta et al. Learning scikit-learn: machine learning in python
Benchaji et al. Using genetic algorithm to improve classification of imbalanced datasets for credit card fraud detection
Li et al. Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation
Zhao et al. Learning to map social network users by unified manifold alignment on hypergraph
CN111506714A (en) Knowledge graph embedding based question answering
Du et al. Graph-based class-imbalance learning with label enhancement
Garreta et al. Scikit-learn: machine learning simplified: implement scikit-learn into every step of the data science pipeline
Zhou et al. Table2Charts: recommending charts by learning shared table representations
Raza et al. Understanding and using rough set based feature selection: concepts, techniques and applications
Zhang et al. Deep unsupervised self-evolutionary hashing for image retrieval
Xu et al. Transductive visual-semantic embedding for zero-shot learning
CN115344863A (en) Malicious software rapid detection method based on graph neural network
KR20230094955A (en) Techniques for retrieving document data
Bao et al. Asymmetry label correlation for multi-label learning
CN112182144B (en) Search term normalization method, computing device, and computer-readable storage medium
CN113535947A (en) Multi-label classification method and device for incomplete data with missing labels
CN111241326B (en) Image visual relationship indication positioning method based on attention pyramid graph network
CN116720519A (en) Seedling medicine named entity identification method
CN116662991A (en) Intelligent contract intention detection method based on artificial intelligence
CN116208399A (en) Network malicious behavior detection method and device based on metagraph
Liang et al. A normalizing flow-based co-embedding model for attributed networks
CN116226404A (en) Knowledge graph construction method and knowledge graph system for intestinal-brain axis
Pandi et al. A novel similarity measure for sequence data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant