CN110943981B - Cross-architecture vulnerability mining method based on hierarchical learning - Google Patents

Cross-architecture vulnerability mining method based on hierarchical learning

Info

Publication number
CN110943981B
CN110943981B (application CN201911142076.7A)
Authority
CN
China
Prior art keywords
function
hierarchical
model
learning
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911142076.7A
Other languages
Chinese (zh)
Other versions
CN110943981A
Inventor
吴昊
康绯
卜文娟
Current Assignee
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force filed Critical Information Engineering University of PLA Strategic Support Force
Priority to CN201911142076.7A
Publication of CN110943981A
Application granted
Publication of CN110943981B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00: Network architectures or network communication protocols for network security
    • H04L 63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1433: Vulnerability analysis
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00: Network architectures or network communication protocols for network security
    • H04L 63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L 63/1416: Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of network information security and relates in particular to a cross-architecture vulnerability mining method based on hierarchical learning, comprising the following steps: acquiring training sample data; constructing a hierarchical learning model; cloning the hierarchical learning model to obtain a cloned hierarchical learning model; inputting the feature information of the training-sample function pairs into the hierarchical learning model and its clone respectively for training, computing the similarity of the high-dimensional function-feature vectors produced by the two models, and adjusting the parameters and weights of the hierarchical model against the similarity labels to obtain a trained hierarchical model for target-function vulnerability mining; and, for a target function, extracting the function feature information and function call relations that serve as model input and completing vulnerability mining on the target function with the trained hierarchical model. Through two-level machine learning and rich feature extraction, the method greatly improves the efficiency and precision of large-scale vulnerability search and is of significant value for network information security.

Description

Cross-architecture vulnerability mining method based on hierarchical learning
Technical Field
The invention belongs to the technical field of network information security, and particularly relates to a cross-architecture vulnerability mining method based on hierarchical learning.
Background
Genius learns high-level feature representations from the control flow graph (CFG) and encodes the graph as an embedding (a high-dimensional numerical vector). Genius uses a graph-matching algorithm to cluster similar functions, extracting CFG features that remain robust across architectures and compilation environments, generating a codebook from them, and producing function embeddings from that codebook. It then builds a firmware database and a vulnerability-function database and uses locality-sensitive hashing (LSH) for large-scale vulnerability search. However, its embedding generation is inefficient, and its search accuracy is insufficient for large-scale vulnerability search across millions of firmware images. Gemini proposes generating binary-function embeddings for similarity detection with a deep neural network, which improves accuracy and efficiency to a degree: it extracts robust cross-architecture function features and feeds the extracted basic-block-level features and a representation of the CFG structure to a DNN model; through multiple iterations of Structure2Vec, basic-block node features propagate to the nodes associated with them, and the representations of all basic-block nodes are aggregated into a high-dimensional vector representation of the function. However, this method does not overcome the limitations of Genius's CFG-based matching, does not fully account for the influence of different compilation options on CFG structure, and its search accuracy is still insufficient for large-scale vulnerability search across millions of firmware images.
Disclosure of Invention
Therefore, the invention provides a cross-architecture vulnerability mining method based on hierarchical learning, which improves the accuracy and precision of large-scale vulnerability search through rich feature extraction on top of two-level hierarchical learning.
According to the design scheme provided by the invention, a cross-architecture vulnerability mining method based on hierarchical learning is provided, which comprises the following steps:
acquiring the assembly program corresponding to a binary function through disassembly, extracting the feature information and call relations of the assembly functions in the program through graph description, pairing the assembly functions and attaching similarity labels to form function pairs as training sample data;
constructing a hierarchical learning model comprising an intra-function level learning module built on a deep neural network and an inter-function level learning module built on a graph attention network, and cloning the hierarchical learning model to obtain a cloned hierarchical learning model;
inputting the feature information of the assembly functions in each training-sample function pair into the hierarchical learning model and its clone respectively; training the intra-function level learning modules of the two models with the feature information of each function in the pair to obtain function feature vectors as intermediate embeddings; feeding the call relations and intermediate embedding of each function in the pair to the inter-function feature learning modules of the two models for training, obtaining high-dimensional vector representations of the function features; computing the similarity of the high-dimensional vectors produced by the hierarchical learning model and its clone, and adjusting the parameters and weights of the hierarchical model against the similarity labels to obtain a trained hierarchical model for target-function vulnerability mining;
and, for the target function, acquiring the corresponding assembly program through disassembly, creating a control flow graph for the assembly functions in the program, extracting through graph description the function feature information and call relations that serve as model input, and completing vulnerability mining on the target function with the trained hierarchical model.
In the graph description of this cross-architecture vulnerability mining method, further, a control flow graph, a data flow graph and a function call graph are created for the assembly program. Function instructions are divided into basic blocks, and the nodes of the control flow graph and data flow graph consist of these basic blocks. Data-transfer information is attached to the control flow graph structure: a symbolic label indicates whether data transfer exists between two basic blocks, edges between nodes represent the direction of control flow, and edge labels record whether data transfer exists between the connected basic blocks. The function call graph is a directed graph whose edges represent call relations and whose nodes are represented by each function's control flow graph annotated with data flow information.
In this cross-architecture vulnerability mining method, further, determining the data-transfer information comprises: whether data transfer exists between two basic blocks is determined by checking whether instructions in the two blocks access the same address register.
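The register check above can be sketched as follows. This is a minimal illustration under a simplified instruction representation (each instruction as a mnemonic plus a list of register-name operands); the function names and the toy ARM-style instructions are assumptions, not the patent's implementation, which works on real disassembly.

```python
def accessed_registers(block):
    """Collect the address registers accessed by a block's instructions.

    Each instruction is a (mnemonic, operands) pair, where operands is a
    list of register names -- a simplified stand-in for disassembly output.
    """
    regs = set()
    for _mnemonic, operands in block:
        regs.update(operands)
    return regs

def has_data_transfer(block_a, block_b):
    """Two basic blocks are marked as having a data transfer when their
    instructions access at least one common address register."""
    return bool(accessed_registers(block_a) & accessed_registers(block_b))

# Toy example: both blocks touch register r1, so the CFG edge gets label 1.
bb1 = [("ldr", ["r0", "r1"]), ("add", ["r0", "r2"])]
bb2 = [("str", ["r1", "r3"])]
edge_label = 1 if has_data_transfer(bb1, bb2) else 0
```

The 0/1 edge label computed this way is what the method attaches to CFG edges to enrich the function semantics.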
In this cross-architecture vulnerability mining method, further, in the function call graph, the names of dynamically loaded third-party library functions are obtained by extracting the import table of the binary executable; caller-callee relations between functions are represented as directed unweighted edges; and an adjacency matrix of the call relations is constructed from the function call graph.
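Building the adjacency matrix from the directed, unweighted call edges can be sketched as below. The function tags and the toy call graph are illustrative assumptions; in the method, tags come from the import table (for dynamically loaded library functions) or from local start addresses.

```python
def call_graph_adjacency(functions, calls):
    """Build the adjacency matrix of the function call graph.

    functions: ordered list of function tags (import-table names or
               local start addresses).
    calls:     iterable of (caller, callee) pairs -- directed,
               unweighted edges of the call graph.
    """
    index = {name: i for i, name in enumerate(functions)}
    n = len(functions)
    adj = [[0] * n for _ in range(n)]
    for caller, callee in calls:
        adj[index[caller]][index[callee]] = 1
    return adj

funcs = ["main", "parse", "strcpy"]          # "strcpy" taken from the import table
edges = [("main", "parse"), ("parse", "strcpy")]
A = call_graph_adjacency(funcs, edges)
# A[0][1] == 1 records that main calls parse
```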
In this cross-architecture vulnerability mining method, further, a model-oriented genetic algorithm is used to select the function feature information; in the genetic algorithm, a population represents a subset of function features and a generation represents one iteration round.
In this cross-architecture vulnerability mining method, further, extracting function feature information with the genetic algorithm comprises: initializing the mating set and offspring, and passing function pairs into the model to obtain the population fitness; ranking by fitness, selecting the population with random sampling with replacement, crossover and mutation, updating, and fixing the final feature selection after multiple iterations.
In this cross-architecture vulnerability mining method, further, Structure2Vec is used in the intra-function level learning module: the vertex features of the control flow graph are recursively and nonlinearly aggregated according to the graph topology; each iteration generates a multidimensional embedding containing a vertex's neighborhood information and the features of vertices having data transfers with it, and once an embedding has been generated for every vertex, the embedding vector of the control flow graph is aggregated from them.
In this cross-architecture vulnerability mining method, further, the embeddings are updated in each iteration through a nonlinear transformation defined by a fully connected network; after the iterations, each vertex's features have propagated to the nodes associated with it, and every vertex embedding contains contextual semantics.
In this cross-architecture vulnerability mining method, further, a shared attention mechanism computes importance in the inter-function level learning module: attention coefficients are obtained for the nonlinear transformation that produces the function embedding, and, combined with a parameterized weight matrix shared across feature vertices, the module outputs a function embedding that incorporates the semantics of the called functions.
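As an illustration of the shared attention mechanism, here is a minimal single-layer sketch in the style of a graph attention network. All names (gat_layer, W, a), the tanh output nonlinearity, and the toy dimensions are assumptions for the sketch, not the patent's actual network.

```python
import numpy as np

def gat_layer(H, A, W, a, alpha=0.2):
    """One graph-attention layer (sketch of the shared attention mechanism).

    H: (n, f) node features, A: (n, n) adjacency (1 = call edge),
    W: (f, f2) shared parameterized weight matrix,
    a: (2*f2,) shared attention vector.
    """
    Wh = H @ W                                   # shared linear transform
    n = Wh.shape[0]
    out = np.zeros_like(Wh)
    for i in range(n):
        nbrs = [j for j in range(n) if A[i][j] or j == i]  # keep a self-loop
        # e_ij = LeakyReLU(a . [Wh_i || Wh_j]) for each neighbour j
        e = np.array([np.concatenate([Wh[i], Wh[j]]) @ a for j in nbrs])
        e = np.where(e > 0, e, alpha * e)        # LeakyReLU
        att = np.exp(e - e.max()); att /= att.sum()   # softmax -> attention coefficients
        out[i] = np.tanh(sum(c * Wh[j] for c, j in zip(att, nbrs)))
    return out

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
A = np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]])
emb = gat_layer(H, A, rng.normal(size=(4, 4)), rng.normal(size=(8,)))
```

Each output row is a weighted combination of the node itself and its callees, so the embedding of a function absorbs the semantics of the functions it calls.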
In this cross-architecture vulnerability mining method, further, function similarity is computed with the cosine distance, and model quality is evaluated by comparing this similarity with the similarity labels.
The invention has the following beneficial effects:
Two-level machine learning and rich feature extraction greatly improve the efficiency and precision of large-scale vulnerability search. By introducing the concept of a structured function signature, describing the binary code of a target function with graphs and extracting signature information from them, adding the semantics of function call relations, selecting features with a model-oriented feature selection method, and enriching intra-function semantics with function data-flow information, the accuracy and precision of vulnerability search are significantly improved while the search efficiency remains sufficient for real-world large-scale vulnerability search. This is of significant value for network information security.
Description of the drawings:
FIG. 1 is a schematic diagram of a cross-architecture vulnerability mining method in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a hierarchical model according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an intra-function level learning module according to an embodiment of the present invention;
FIG. 4 is a block diagram of an inter-function level learning module in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of a self-attention mechanism according to an embodiment of the present invention.
Detailed description:
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and technical solutions.
The prior art extracts and uses the characteristics of binary functions to cope with the large differences between binaries produced in different compilation environments. To this end, an embodiment of the present invention provides a cross-architecture vulnerability mining method based on hierarchical learning, as shown in FIG. 1, comprising:
S101, acquiring the assembly program corresponding to a binary function through disassembly, extracting the feature information and call relations of the assembly functions in the program through graph description, pairing the assembly functions and attaching similarity labels to form function pairs as training sample data;
S102, constructing a hierarchical learning model comprising an intra-function level learning module built on a deep neural network and an inter-function level learning module built on a graph attention network, and cloning the hierarchical learning model to obtain a cloned hierarchical learning model;
S103, inputting the feature information of the assembly functions in each training-sample function pair into the hierarchical learning model and its clone respectively; training the intra-function level learning modules of the two models with the feature information of each function in the pair to obtain function feature vectors as intermediate embeddings; feeding the call relations and intermediate embedding of each function in the pair to the inter-function feature learning modules of the two models for training, obtaining high-dimensional vector representations of the function features; computing the similarity of the high-dimensional vectors produced by the two models, and adjusting the parameters and weights of the hierarchical model against the similarity labels to obtain a trained hierarchical model for target-function vulnerability mining;
S104, for the target function, acquiring the corresponding assembly program through disassembly, creating a control flow graph for the assembly functions in the program, extracting through graph description the function feature information and call relations that serve as model input, and completing vulnerability mining on the target function with the trained hierarchical model.
Rich feature extraction on top of two-level hierarchical learning improves both the precision of large-scale vulnerability search and the accuracy of vulnerability detection.
In the graph description of the embodiment, further, a control flow graph, a data flow graph and a function call graph are created for the assembly program. Function instructions are divided into basic blocks, and the nodes of the control flow graph and data flow graph consist of these basic blocks. Data-transfer information is attached to the control flow graph structure: a symbolic label indicates whether data transfer exists between two basic blocks, edges between nodes represent the direction of control flow, and edge labels record whether data transfer exists between the connected basic blocks. The function call graph is a directed graph whose edges represent call relations and whose nodes are represented by each function's control flow graph annotated with data flow information.
Control flow graphs and data flow graphs describe the control and data flow of functions. The instructions of a function are divided into basic blocks, and the nodes of the control flow graph (CFG) and the data flow graph (DFG) consist of these basic blocks. In embodiments, data-transfer information is therefore appended to the CFG structure, with labels 0 and 1 indicating whether data transfer exists between two basic blocks. Edges between CFG nodes represent the direction of control flow, and the label on a CFG edge records whether data transfer exists between the two blocks it connects. Function control flow and data flow are fairly robust across architectures, operating systems and compiler optimization levels, so the influence of these variations can be mitigated by extracting control- and data-flow semantics as function features, i.e., extracting common features that are independent of platform and compilation settings. In an embodiment of the invention, a model-oriented genetic algorithm is further used to select the function features; in the genetic algorithm, a population represents a feature subset and a generation represents one iteration round. The function call graph describes the call relations among the functions under analysis and covers the call relations of the entire binary file. Each node represents a function, and each function node may be represented by the function's CFG with data flow information. Even across compilation environments, the call relations between functions are a very robust feature: the functions called by a given function do not differ between environments.
Therefore, in an embodiment of the invention, the call relations between functions are combined with the idea of community classification in social networks, and the influence of calling functions on function identification is taken into account, so as to generate a more accurate function feature representation.
Describing binary functions with these three graphs and generating signature information from them allows basic-block-level features to be extracted that remain stable across optimization options and architectures. By introducing the concept of a structured function signature, describing the binary code of the target function with graphs, and combining the data-flow-annotated CFG with the calling-function information in the call graph as the function's signature factors, signature information is extracted from the graphs in a form convenient for comparison, improving the precision and efficiency of feature extraction.
Two neural networks are used to learn the two-level features of a function. The intra-function level learning model is based on Gemini, with data flow information added to capture more function semantics. Because the number of function calls is variable and call relations are not well expressed by the Gemini model, handling call relations is harder, and a new mechanism is needed to learn the semantics of function calls. Inspired by social networks, embodiments of the present invention adopt the graph attention network (GAT) to solve this problem.
A deep neural network (DNN) is trained to learn the basic-block-level features of a function, while the GAT is trained to learn the influence of call relations on the function, producing a high-dimensional feature vector with more precise semantics. Finally, the similarity between functions is measured by computing the distance between their feature vectors, from which vulnerable functions are identified.
Existing semantic learning methods rely on a function's CFG, extract features of each basic block, and compare similarity based on those features. Here, features are extracted by first disassembling the binary file into the corresponding assembly program with IDA Pro; a CFG is then created for each assembly function using the IDAPython interface provided by IDA Pro. In embodiments of the invention, the MIASM plug-in of IDA Pro may also be used to determine whether data transfer exists between two basic blocks. The features extracted in Gemini may be adjusted to account for the effects of the compilation environment on the assembly code. Unlike Gemini, which directly adopts the function features that DiscovRE selected for a graph-matching algorithm, an embodiment of the invention designs a model-oriented genetic algorithm and reselects function features better suited to the model. Extracting function features with the genetic algorithm comprises: initializing the mating set and offspring, and passing function pairs into the model to obtain the population fitness; ranking by fitness, selecting the population with random sampling with replacement, crossover and mutation, updating, and fixing the final selection after multiple iterations. For example, 50 candidate function features are extracted and the 9 best-performing features are selected. Algorithm 1 gives the model-oriented genetic algorithm for selecting binary function features:
[Algorithm 1 (image in original): model-oriented genetic algorithm for binary function feature selection]
In the algorithm, a "population" is a selected subset of the function features, and a "generation" is one iteration round. For each population, a mating set and offspring are first initialized; function pairs carrying ground-truth labels are passed into the model, and the fitness of the population is obtained. The populations are then ranked by fitness, and random sampling with replacement, crossover and mutation are used to select the next population, which is then updated. After T generations, the best-performing subset is taken as the final selection. Table 1 lists the initial features of each basic block: 8 statistical features and 1 structural feature. The initial 9-dimensional feature vector of each basic block of a function is input to the model to generate the function's semantic embedding vector.
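The genetic-algorithm loop described above can be sketched as follows. This is a hedged illustration: the fitness function here is a placeholder for training the hierarchical model on labeled function pairs and scoring the result, and the population size, mutation rate, and crossover scheme are assumptions, not values from the patent.

```python
import random

def genetic_select(num_features, subset_size, fitness, generations=20,
                   pop_size=12, mutation_rate=0.1, seed=0):
    """Model-oriented genetic algorithm for feature-subset selection.

    fitness(subset) stands in for passing labeled function pairs through
    the model and measuring how well this feature subset performs.
    """
    rng = random.Random(seed)
    # each population member is a candidate feature subset
    pop = [rng.sample(range(num_features), subset_size) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)   # rank by fitness
        # selection: random sampling with replacement, biased to the fitter half
        parents = [rng.choice(scored[:pop_size // 2]) for _ in range(pop_size)]
        nxt = []
        for i in range(0, pop_size, 2):
            cut = rng.randrange(1, subset_size)           # one-point crossover
            head = parents[i][:cut]
            child = head + [f for f in parents[i + 1] if f not in head]
            child = child[:subset_size]
            if rng.random() < mutation_rate:              # mutation: swap one feature
                child[rng.randrange(subset_size)] = rng.randrange(num_features)
            nxt += [child, parents[i]]
        pop = nxt[:pop_size]                              # update the population
    return max(pop, key=fitness)

# Toy fitness preferring low-index features (a placeholder for model accuracy):
best = genetic_select(50, 9, fitness=lambda s: -sum(s))
```

With 50 candidate features and a subset size of 9, the loop mirrors the paper's setting of keeping the 9 best-performing features after T generations.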
[Table 1 (image in original): initial features of each basic block, 8 statistical features and 1 structural feature]
Data flow is, moreover, a reliable representation: data flow between basic blocks captures data and variable dependencies, and data flow analysis helps locate inlined code. To align with the CFG structure, the data flow information is appended to the basic blocks of the CFG: CFG edges are labeled 0 or 1 to indicate whether data transfer exists between the two basic blocks, enriching the function semantics. Data transfer between two blocks is determined by checking whether instructions in the blocks access the same address register. Because this representation considers both the control structure and the data flow within a function, it effectively mitigates structural changes of the CFG caused by different compilation environments. However, features extracted this way from a binary file cannot capture the interaction between functions, and a large amount of structural information is lost. To recover it, a new structured function signature is introduced that uses both the call relations between functions and each function's internal control flow information as descriptive features. The function call graph (FCG) is a directed graph whose vertices represent functions and whose directed edges represent call relations. When extracting the FCG, the function names of dynamically loaded third-party libraries are obtained as tags by reading the import table of the binary executable; for a local function, its start address may serve as the tag. Caller-callee relations are represented as directed unweighted edges, and finally an adjacency matrix of the call relations is constructed from the FCG.
In the model, function features are extracted considering only the internal functions of the file and statically linked third-party library functions; the function call graph forms an adjacency list representing the call relations of the binary file, and the extracted features constitute the structured signature of the function.
Training the model requires finding a mapping φ that converts a binary function into a high-dimensional feature vector. Given two binary functions f1 and f2, their similarity is judged by a given similarity function Sim(·,·): if the two functions are similar, Sim(φ(f1), φ(f2)) = 1; if they are not similar, Sim(φ(f1), φ(f2)) = -1.
The model attempts to learn how to generate function embeddings, i.e. the mapping φ, through deep learning. Unlike Gemini, it fully considers the two-level features of a function: the features inside the function and the call relations between functions. In this embodiment, a vector carrying richer function semantics is generated by building a learning model of these two-level features and integrating their training effectively, further improving accuracy.
Referring to FIG. 2, the input is the CFG containing data-flow information extracted from the firmware binary code together with the extracted basic-block-level features, where N is the number of functions in the binary code. These inputs are fed into the intra-function feature learning model to obtain intermediate embeddings; 5 iterations are performed in the intra-function feature learning model, with 2 fully connected neural networks per iteration to train the features. The intermediate embeddings are then fed into the inter-function feature learning model which, as shown in FIG. 2, has 3 hidden layers. The dependency relationships between functions are determined from the adjacency matrix, the influence of neighboring nodes on each function node is calculated through an attention mechanism, and a final graph-independent representation containing function-call information is generated. By obtaining a mapping φ, each binary function is converted into a high-dimensional representation using deep learning.

Next, the similarity between functions is described through their high-dimensional representations, and the model is trained in this way. In data preprocessing, the function set is divided into L function pairs. If a pair consists of two identical functions compiled from the same source code in different compilation environments, i.e. Sim(φ(f_i), φ(f_i')), the ground-truth label is assigned y_i = +1; otherwise, for a pair of two different functions, i.e. Sim(φ(f_i), φ(f_j)), the ground-truth label is assigned y_i = −1. Furthermore, the similarity of two functions can be described by the cosine distance as

Sim(φ(f_i), φ(f_j)) = (φ(f_i) · φ(f_j)) / (‖φ(f_i)‖ ‖φ(f_j)‖)

In the training phase, the model evaluates the quality of the mapping by comparing the generated similarity with the ground-truth label of each function pair. The Mean Square Error (MSE) can be used as the metric:

MSE = (1/L) Σ_{i=1}^{L} (Sim(φ(f_i), φ(f_i')) − y_i)²
Then, the shared parameters W_1, W_2, P_1, …, P_n in the intra-function model and the shared parameters W and a in the inter-function model are trained to minimize the MSE above. Furthermore, to improve the generalization capability of the model, L2 regularization and Dropout are added to prevent over-fitting. The MSE is optimized using a stochastic gradient descent algorithm. Finally, once the optimal values of the shared parameters are obtained, any function can easily be converted into a high-dimensional representation for function-similarity comparison.
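As an illustration only, the pairwise training objective described above (cosine similarity of the two siamese embeddings compared against a ±1 ground-truth label under an MSE loss) can be sketched as follows; the function and variable names are hypothetical, and the embeddings stand in for the model outputs φ(f):

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two function embeddings."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pairwise_mse_loss(emb_a, emb_b, labels):
    """MSE between the cosine similarities of L function pairs and their
    ground-truth labels (+1 = same source function, -1 = different)."""
    sims = np.array([cosine_sim(a, b) for a, b in zip(emb_a, emb_b)])
    return float(np.mean((sims - np.asarray(labels)) ** 2))

# Toy example: one identical pair (label +1) and one orthogonal pair (label -1).
emb_a = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
emb_b = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
loss = pairwise_mse_loss(emb_a, emb_b, labels=[1.0, -1.0])  # -> 0.5
```

In the actual model this loss would be minimized by stochastic gradient descent over the shared parameters, with L2 regularization and Dropout applied as described above.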
The intra-function feature learning model is an improvement of Gemini's Structure2Vec model; its main purpose is to obtain a high-dimensional vector representing the function, such as its control flow and data flow, i.e., to generate an embedding. Structure2Vec is inspired by graph-model inference algorithms: the features of the vertices are recursively and non-linearly aggregated according to the graph topology. After sufficient iterations have been performed, each vertex contains information about its neighboring vertices. After extracting the basic-block-level 9-dimensional feature representation of each function in the target binary file, the features are input into the learning model to generate semantic embeddings for similarity computation.
In FIG. 3, (a) is a CFG with data transmission, denoted g, containing 3 basic blocks with basic-block-level features x_v, where V and ξ are the sets of vertices and edges, respectively. After T iteration layers, the DNN model generates a p-dimensional embedding for each vertex v ∈ V, and each iteration produces a p-dimensional vertex feature μ_v containing its neighborhood information and the feature information of the vertices that have data transmission with it. After the embedding μ_v^(T) has been generated for each vertex, the embedding vector μ of g is computed by the aggregation formula

μ = W_2 Σ_{v ∈ V} μ_v^(T)

FIG. 3(b) shows the method of updating the embeddings in each iteration. Further, an embodiment of the present invention provides an embedding-update method, whose specific content may be designed as follows:

μ_v^(t) = tanh(W_1 x_v + σ_c(Σ_{u ∈ N(v)} μ_u^(t−1)) + σ_d(Σ_{u ∈ D(v)} μ_u^(t−1)))
where x_v is the d-dimensional feature vector of each vertex and W_1 is a d × p matrix. N(v) denotes the set of control-flow neighbor vertices of vertex v, and D(v) denotes the set of its neighbor vertices with data transmission. Further, σ_c and σ_d are two non-linear transformations σ(·), each defined by an n-layer fully connected network, so as to realize the process of collecting the features of other vertices, e.g.

σ_c(l) = P_1 · ReLU(P_2 · … ReLU(P_n · l))

σ_d(l) = P_1' · ReLU(P_2' · … ReLU(P_n' · l))
where P_i (i = 1, …, n) and P_i' (i = 1, …, n) are p × p matrices, n is the embedding depth, and ReLU is the activation function. Algorithm 2 outlines the overall algorithm for generating the hierarchical embedding within a function; the specific content is as follows:
[Algorithm 2 is shown as an image in the original publication.]
After T iterations, the features of each vertex have propagated to the other vertices associated with it, and each vertex embedding contains contextual semantics. Compared with existing work such as DiscovRE, Genius and Gemini, the intra-function feature learning model captures more functional semantic information by adding data-flow information.
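A minimal sketch of the iterative embedding update described above, assuming (as in Gemini's Structure2Vec, which this model extends) a tanh combination of the vertex feature with the aggregated control-flow and data-flow messages. All names and shapes are illustrative; note that W_1 is kept here as a p × d matrix so it can left-multiply x_v, and the final W_2 projection of the aggregated embedding is omitted for brevity:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigma(l, Ps):
    """n-layer transform sigma(l) = P1 . ReLU(P2 . ... ReLU(Pn . l))."""
    out = l
    for P in reversed(Ps[1:]):   # apply Pn ... P2, each followed by ReLU
        out = relu(P @ out)
    return Ps[0] @ out           # outermost P1, no activation

def embed_function(X, cfg_nbrs, dfg_nbrs, W1, Pc, Pd, T=5):
    """T rounds of message passing over control-flow neighbors N(v) and
    data-flow neighbors D(v), then summation into a graph embedding."""
    n, p = X.shape[0], W1.shape[0]
    mu = np.zeros((n, p))
    for _ in range(T):
        new_mu = np.zeros_like(mu)
        for v in range(n):
            c = sum((mu[u] for u in cfg_nbrs[v]), np.zeros(p))
            d = sum((mu[u] for u in dfg_nbrs[v]), np.zeros(p))
            new_mu[v] = np.tanh(W1 @ X[v] + sigma(c, Pc) + sigma(d, Pd))
        mu = new_mu
    return mu.sum(axis=0)

# Single-vertex toy CFG with no neighbors: embedding reduces to tanh(W1 @ x).
I = np.eye(2)
emb = embed_function(np.array([[1.0, 0.0]]), {0: []}, {0: []}, I, [I, I], [I, I])
```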
The function-call relation carries strong semantics across different compilation environments. Handling function-call relationships is more difficult because the number of function calls is variable and the relationships between function calls are not well expressed in the Gemini model. Thus, a new mechanism or model is introduced to learn the semantic information of function calls. Ideally, the model should have the following characteristics:
(1) the model may act on the neighborhood of nodes, i.e. function calls;
(2) the model may assign different importance to adjacent nodes of the function;
(3) the model is suitable for generalizing problems and can handle any untrained graph structure.
It should be noted that a function does not have a fixed number of adjacent nodes in the call-relation matrix. Therefore, inspired by work on social networks, the model is built as a modification of the GAT network when constructing the embedding network of the function-call relation. GAT addresses several problems in graph convolution using a hidden self-attention layer: no complex matrix operations or prior knowledge of the graph structure are required. By stacking self-attention layers, different importance is assigned to the different neighboring nodes during the convolution process. In addition, because of its edge mechanism, GAT is independent of the global graph structure and can therefore easily be applied to inductive problems. Inspired by the GAT model, the inter-function learning model proceeds as follows. As shown in FIG. 4, the input is a set of function embeddings, i.e. the high-dimensional representations of the functions generated in the previous stage,

h = {h_1, h_2, …, h_N}, h_i ∈ R^F

where N is the number of functions and F is the dimension of the features of each function. In addition, M denotes the adjacency matrix of the binary file's function calls. Considering the first-order neighbor node set C_i of each function node i, the model generates a final function vector representation

h' = {h_1', h_2', …, h_N'}, h_i' ∈ R^{F'}

according to the attention coefficient α_ij of function node i for each called function node j.
In order to convert the input features into a higher-level function embedding that contains the semantics of the called functions, a way is needed to indicate the importance of each function node that function node i calls. Self-attention over the function nodes, i.e. a shared attention mechanism

a: R^{F'} × R^{F'} → R

is used to calculate the importance; it is expressed as an attention coefficient

e_ij = a(W h_i, W h_j)

where h_i and h_j are the initial features of the two functions and W is a shared parameterized weight matrix applied to every feature vertex for linear transformation. Once the normalized attention coefficients α_ij are obtained, the final function embedding is computed with a non-linear transformation σ(·) as

h_i' = σ(Σ_{j ∈ C_i} α_ij W h_j)

where C_i is the set of functions called by function i.
The weight matrix W ∈ R^{F' × F} is initialized as a linear transformation shared between the function nodes. Each vertex also needs a self-attention mechanism to compute the importance of function vertex j to vertex i, i.e., a way to obtain the importance between two functions from their initial feature vectors.
An overview of the self-attention mechanism is shown in FIG. 5. In this embodiment, the attention mechanism is a single-layer feed-forward neural network parameterized by a weight vector a ∈ R^{2F'}; the importance of a node can be set as

e_ij = LeakyReLU(aᵀ [W h_i ‖ W h_j])

where LeakyReLU is a non-linear activation function, ·ᵀ denotes transposition, and ‖ denotes the concatenation operation. Considering efficiency, only the first-order neighbors are used to calculate e_ij, and Softmax is used to normalize e_ij:

α_ij = softmax_j(e_ij) = exp(e_ij) / Σ_{k ∈ C_i} exp(e_ik)
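The coefficient computation just described (a LeakyReLU score over the concatenated transformed features, followed by a softmax over the called functions) can be sketched as follows; the names and the 0.2 negative slope are assumptions:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def attention_coeffs(h_i, h_called, W, a):
    """alpha_ij = softmax_j(LeakyReLU(a^T [W h_i || W h_j])) over j in C_i."""
    wi = W @ h_i
    e = leaky_relu(np.array(
        [float(a @ np.concatenate([wi, W @ hj])) for hj in h_called]))
    exp_e = np.exp(e - e.max())   # subtract max for numerical stability
    return exp_e / exp_e.sum()

# Two identical called functions receive equal attention.
W, a = np.eye(2), np.ones(4)
alphas = attention_coeffs(np.array([1.0, 0.0]),
                          [np.array([0.0, 1.0]), np.array([0.0, 1.0])], W, a)
# -> [0.5, 0.5]
```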
Now that the attention coefficients have been obtained, the final function representation can be calculated by the following formula:

h_i' = σ(Σ_{j ∈ C_i} α_ij W h_j)

ELU and softmax are chosen as σ(·). In order to ensure the stability of the attention-mechanism learning process, a multi-head mechanism is added: K independent attention mechanisms are executed simultaneously, and the representation of each function is generated by averaging,

h_i' = σ((1/K) Σ_{k=1}^{K} Σ_{j ∈ C_i} α_ij^k W^k h_j)
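Putting the pieces together, here is a minimal sketch of one inter-function layer with K averaged attention heads (a simplified, illustrative variant of a GAT layer; the names, shapes, and the use of ELU are assumptions, and functions with no callees are left as zero vectors):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def gat_layer(H, call_sets, Ws, a_vecs):
    """h_i' = ELU((1/K) * sum_k sum_{j in C_i} alpha_ij^k W^k h_j)."""
    K, N, F_out = len(Ws), H.shape[0], Ws[0].shape[0]
    out = np.zeros((N, F_out))
    for i in range(N):
        C_i = call_sets[i]
        if not C_i:                      # no callees: leave a zero vector
            continue
        acc = np.zeros(F_out)
        for W, a in zip(Ws, a_vecs):     # one pass per attention head
            wi = W @ H[i]
            e = leaky_relu(np.array(
                [a @ np.concatenate([wi, W @ H[j]]) for j in C_i]))
            alphas = np.exp(e - e.max())
            alphas /= alphas.sum()       # softmax over the called functions
            acc += sum(al * (W @ H[j]) for al, j in zip(alphas, C_i))
        out[i] = elu(acc / K)
    return out

# Two functions that call each other, one identity head with zero attention
# weights: every callee gets uniform attention and h_i' = ELU(h_j).
H = np.array([[1.0, 0.0], [0.0, 1.0]])
out = gat_layer(H, {0: [1], 1: [0]}, [np.eye(2)], [np.zeros(4)])
```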
The inter-function feature learning model implementation is presented as Algorithm 3:

[Algorithm 3 is shown as an image in the original publication.]
Unless specifically stated otherwise, the components, the relative arrangement of steps, and the numerical expressions and values set forth in these embodiments do not limit the scope of the present invention.
Based on the foregoing method, an embodiment of the present invention further provides a server, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described above.
Based on the above method, the embodiment of the present invention further provides a computer readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the above method.
The system/apparatus provided by the embodiment of the present invention has the same implementation principle and technical effect as the foregoing method embodiments, and for the sake of brief description, no mention is made in the system/apparatus embodiments, and reference may be made to the corresponding contents in the foregoing method embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the system/apparatus described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
In all examples shown and described herein, any particular value should be construed as merely exemplary, and not as a limitation, and thus other examples of example embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A cross-architecture vulnerability mining method based on hierarchical learning is characterized by comprising the following contents:
acquiring assembly programs corresponding to binary functions through disassembly, extracting function feature information and function call relations of the assembly functions in the assembly programs through graph description, combining the assembly functions in pairs and adding similarity labels to form function pairs as training sample data;
constructing a hierarchical learning model, wherein the hierarchical learning model comprises an intra-function level learning module constructed based on a deep neural network and an inter-function level learning module constructed based on a graph attention network; and cloning the hierarchical learning model to obtain a cloned hierarchical learning model;
respectively inputting the feature information corresponding to the assembly functions of each training-sample function pair into the hierarchical learning model and the cloned hierarchical learning model; training the intra-function level learning modules in the two models with the function feature information of each function in the function pair, and obtaining function feature vectors as intermediate embeddings; feeding the function call relation of each function in the function pair together with the intermediate-embedding function feature vectors into the inter-function feature learning modules in the two models for training, and obtaining high-dimensional vector representations of the function features; and calculating the similarity of the function-feature high-dimensional vector representations obtained by the hierarchical learning model and the cloned hierarchical learning model, and adjusting the parameters and weights of the hierarchical model according to the similarity labels, so as to obtain a trained hierarchical model for target-function vulnerability mining;
and, for a target function, acquiring an assembly program corresponding to the target function through disassembly, creating a control flow graph for the assembly functions in the assembly program, extracting the function feature information and function call relations used as the input of the hierarchical model through graph description, and completing target-function vulnerability mining through the trained hierarchical model.
2. The method for mining the cross-architecture vulnerability based on the hierarchical learning of claim 1, characterized in that in the graph description, a control flow graph, a data flow graph and a function call graph are created for an assembler, function instructions are divided into basic blocks, nodes of the control flow graph and the data flow graph are composed of different basic blocks, data transmission information is attached to the structure of the control flow graph, a symbolic mark for indicating whether data transmission exists between the two basic blocks is added, edges between the nodes in the control flow graph represent the direction of a control flow, and the labels of the edges represent whether data transmission exists between the basic blocks; the function call graph is represented by a directed graph, graph edges represent function call relations, and graph nodes are represented by a function control flow graph with data flow information.
3. The method of claim 2, wherein determining data transmission information further comprises: whether there is a data transfer between two basic blocks is determined by checking whether instructions in the basic blocks access the same address register.
4. The method for mining the cross-architecture vulnerability based on the hierarchical learning of claim 2, wherein in the function call graph, for a dynamically loaded third-party function library, the function name is obtained by extracting an import table of a binary executable file; representing the call-called relation between functions as a directed unweighted edge; and constructing an adjacency matrix of the function calling relation according to the function calling graph.
5. The method for mining the cross-architecture vulnerability based on the hierarchical learning according to any one of claims 1 to 4, characterized in that a model-oriented genetic algorithm is adopted to extract function feature information, wherein in the genetic algorithm, a population is used to represent a function feature subset, and a generation is used to represent the number of iteration rounds.
6. The method of claim 5, wherein the function feature information is extracted by a genetic algorithm, and the method comprises the following steps: initializing a pairing set and descendants, and transmitting function pairs into a model to obtain population fitness; and ranking according to population fitness, selecting a population by using random sampling with substitution, crossing and variation, updating, and determining the finally selected function characteristics after multiple iterations.
7. The method of claim 1, wherein in the intra-function level learning module, Structure2Vec is used, vertex features of the control flow graph are recursively and nonlinearly aggregated according to graph topology, multidimensional embedding including neighborhood information and multidimensional vertex features of vertices having data transmission with the vertices is generated each time of iteration, and control flow graph embedding vectors are aggregated after multidimensional embedding is generated for each vertex.
8. The method of claim 7, wherein in each iteration, updating and embedding in the iteration are performed through nonlinear transformation defined by a fully-connected network; after iteration, each vertex feature will propagate to other nodes associated with it, and each vertex embedding contains context semantics.
9. The method of claim 1, wherein in the inter-function level learning module, a shared attention mechanism is used to calculate importance, and obtain attention coefficients for nonlinear transformation to obtain function embedding, and in combination with each feature vertex shared parameterized weight matrix, function embedding containing called function semantics is obtained for use as module input.
10. The method of claim 1, wherein the cosine distance is used to obtain functional similarity, and model quality is evaluated by comparing similarity with similar label differences.
CN201911142076.7A 2019-11-20 2019-11-20 Cross-architecture vulnerability mining method based on hierarchical learning Active CN110943981B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911142076.7A CN110943981B (en) 2019-11-20 2019-11-20 Cross-architecture vulnerability mining method based on hierarchical learning

Publications (2)

Publication Number Publication Date
CN110943981A CN110943981A (en) 2020-03-31
CN110943981B true CN110943981B (en) 2022-04-08

Family

ID=69907040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911142076.7A Active CN110943981B (en) 2019-11-20 2019-11-20 Cross-architecture vulnerability mining method based on hierarchical learning

Country Status (1)

Country Link
CN (1) CN110943981B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111475622A (en) * 2020-04-08 2020-07-31 广东工业大学 Text classification method, device, terminal and storage medium
CN111562943B (en) * 2020-04-29 2023-07-11 海南大学 Code clone detection method and device based on event embedded tree and GAT network
CN111639344B (en) * 2020-07-31 2020-11-20 中国人民解放军国防科技大学 Vulnerability detection method and device based on neural network
CN112308210B (en) * 2020-10-27 2023-04-07 中国人民解放军战略支援部队信息工程大学 Neural network-based cross-architecture binary function similarity detection method and system
CN113204764B (en) * 2021-04-02 2022-05-17 武汉大学 Unsigned binary indirect control flow identification method based on deep learning
CN113240041B (en) * 2021-05-28 2022-11-08 北京理工大学 Binary function similarity detection method fusing influence factors
CN113688036A (en) * 2021-08-13 2021-11-23 北京灵汐科技有限公司 Data processing method, device, equipment and storage medium
CN113821804B (en) * 2021-11-24 2022-03-15 浙江君同智能科技有限责任公司 Cross-architecture automatic detection method and system for third-party components and security risks thereof
CN116028941B (en) * 2023-03-27 2023-08-04 天聚地合(苏州)科技股份有限公司 Vulnerability detection method and device of interface, storage medium and equipment
CN117574393B (en) * 2024-01-16 2024-03-29 国网浙江省电力有限公司 Method, device, equipment and storage medium for mining loopholes of information terminal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229563A (en) * 2016-03-25 2017-10-03 中国科学院信息工程研究所 A kind of binary program leak function correlating method across framework
US20180082064A1 (en) * 2016-09-20 2018-03-22 Sichuan University Detection method for linux platform malware
CN110287702A (en) * 2019-05-29 2019-09-27 清华大学 A kind of binary vulnerability clone detection method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108622B (en) * 2017-12-13 2021-03-16 上海交通大学 Vulnerability detection system based on deep convolutional network and control flow graph


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"VDNS: A Cross-Platform Firmware Vulnerability Association Algorithm"; Chang Qing, Liu Zhongjin, et al.; Journal of Computer Research and Development; 2016-10-15; full text *

Also Published As

Publication number Publication date
CN110943981A (en) 2020-03-31

Similar Documents

Publication Publication Date Title
CN110943981B (en) Cross-architecture vulnerability mining method based on hierarchical learning
Sun et al. What and how: generalized lifelong spectral clustering via dual memory
Garreta et al. Learning scikit-learn: machine learning in python
Benchaji et al. Using genetic algorithm to improve classification of imbalanced datasets for credit card fraud detection
Li et al. Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation
Zhao et al. Learning to map social network users by unified manifold alignment on hypergraph
CN111506714A (en) Knowledge graph embedding based question answering
Du et al. Graph-based class-imbalance learning with label enhancement
Garreta et al. Scikit-learn: machine learning simplified: implement scikit-learn into every step of the data science pipeline
Zhou et al. Table2Charts: recommending charts by learning shared table representations
Raza et al. Understanding and using rough set based feature selection: concepts, techniques and applications
Zhang et al. Deep unsupervised self-evolutionary hashing for image retrieval
Xu et al. Transductive visual-semantic embedding for zero-shot learning
CN115344863A (en) Malicious software rapid detection method based on graph neural network
KR20230094955A (en) Techniques for retrieving document data
Bao et al. Asymmetry label correlation for multi-label learning
CN112182144B (en) Search term normalization method, computing device, and computer-readable storage medium
CN113535947A (en) Multi-label classification method and device for incomplete data with missing labels
CN111241326B (en) Image visual relationship indication positioning method based on attention pyramid graph network
CN116720519A (en) Seedling medicine named entity identification method
CN116662991A (en) Intelligent contract intention detection method based on artificial intelligence
CN116208399A (en) Network malicious behavior detection method and device based on metagraph
Liang et al. A normalizing flow-based co-embedding model for attributed networks
CN116226404A (en) Knowledge graph construction method and knowledge graph system for intestinal-brain axis
Pandi et al. A novel similarity measure for sequence data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant