CN113378176B - Software vulnerability identification method based on graph neural network detection with weight deviation - Google Patents


Info

Publication number
CN113378176B
CN113378176B (application CN202110652749.4A)
Authority
CN
China
Prior art keywords
node
information
nodes
graph
time step
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110652749.4A
Other languages
Chinese (zh)
Other versions
CN113378176A (en
Inventor
李辉
曲阳
刘慧江
汪海博
赵娇茹
刘勇
郭世凯
陈荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University filed Critical Dalian Maritime University
Priority to CN202110652749.4A priority Critical patent/CN113378176B/en
Publication of CN113378176A publication Critical patent/CN113378176A/en
Application granted granted Critical
Publication of CN113378176B publication Critical patent/CN113378176B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577Assessing vulnerabilities and evaluating computer system security
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a software vulnerability identification method with weight deviation based on graph neural network detection, which comprises the following steps: each word and symbol in the source code is regarded as a node, the source code is represented by a composition graph, and the initial feature value of each node and the initial feature of each graph's connected edges are obtained; the generated composition graph is taken as input, the information output by the reset gate, the information output by the update gate and the node's own information are finally combined, and the node activation value output through the Sigmoid activation function serves as the node state at the final moment; once the word nodes are sufficiently updated, the words are aggregated into a graph-level representation of the function code, and the final vulnerability identification result is generated based on that representation; the Dice coefficient is used to minimize the defective-sample vulnerability loss value, thereby identifying the software vulnerability.

Description

Software vulnerability identification method based on graph neural network detection with weight deviation
Technical Field
The invention relates to the technical field of software monitoring, in particular to a software vulnerability identification method with weight deviation based on graph neural network detection.
Background
Software vulnerability detection is one of the main means of checking for and discovering security vulnerabilities in software systems. Design errors, coding defects and operational faults in software are sought by using various tools to audit the code or to analyze the software's execution. Early vulnerability detection techniques are classified into static analysis methods and dynamic analysis methods depending on whether they rely on running the program. As machine learning has evolved, researchers have begun to combine traditional techniques with machine learning to predict code vulnerabilities. Static analysis is generally applied in the development and coding stage of software: the software need not be run, and vulnerabilities are found by scanning the source code and analyzing lexical, syntactic, control-flow and data-flow information. Dynamic analysis is generally applied in the testing stage: while the program runs, vulnerabilities are discovered by analyzing information such as the program's state and execution paths in a dynamic debugger. In the past, dynamic analysis employed symbolic execution for vulnerability mining and fuzz testing for error checking. Later, researchers treated the source code as a flat sequence (resembling natural language) and extracted its features to improve vulnerability identification. The latest progress with deep learning models in machine learning is to extract structural features of source code using the Abstract Syntax Tree (AST), Control Flow Graph (CFG), Data Flow Graph (DFG) and Natural Code Sequence (NCS), form a composite code representation with the AST as backbone, and perform vulnerability identification through gated graph recurrent layers and convolutional layers.
Effective representations of source code are lacking. Most previous code-prediction methods process source code as a sequence, which makes it difficult for a model to learn the syntactic structure of the code. In contrast to weakly structured natural language, however, programming languages are formal languages, and source code written in them is explicit and structured. To overcome these limitations in code representation, many works explore richer structures such as trees or graphs. It is therefore possible to represent the syntactic structure of code effectively with a suitable graph.
Focused learning on defective samples is lacking. Most previous work uses a cross-entropy loss function, in which the loss weights of predicted defective and non-defective samples have a ratio of 1:1. In practice, however, defective samples usually make up only a small fraction of the code. When the loss weight of defective samples equals that of non-defective samples, the model learns the defective samples insufficiently and its predictions are biased toward non-defective code. We therefore consider how to increase the loss weight of defective samples so that the model can effectively learn their characteristics.
Disclosure of Invention
According to the problems existing in the prior art, the invention discloses a software vulnerability identification method with weight deviation based on graph neural network detection, which comprises the following steps:
S1, regarding each word and symbol in the source code as a node, representing the source code with a composition graph, and obtaining the initial feature value of each node and the initial feature of each graph's connected edges;
S2, taking the generated composition graph as input, where the node state vectors have dimension Z; when Z is larger than the node input feature dimension D, Z−D zeros are appended after the node input features; edges between different nodes in the graph representation transmit and receive information; the information received by a node passes through the update gate, which determines how much information of the previous and current moments needs to be passed to the next moment, and through the reset gate, which determines how much information of the previous and current moments needs to be discarded; finally the information output by the reset gate, the information output by the update gate and the node's own information are combined, and the node activation value output by the Sigmoid activation function serves as the node state at the final moment;
S3, after the word nodes are fully updated, aggregating the words into a graph-level representation of the function code and generating the final vulnerability identification result based on that representation;
S4, using the Dice coefficient to minimize the defective-sample vulnerability loss value, thereby identifying the software vulnerability.
Further, the initial features of each graph's connected edges in S1 are obtained as edge weights with the NL Graph Embedding method, specifically as follows:
the source code is patterned with a sliding-window word co-occurrence mechanism; the edges of the resulting graph carry feature values, and each time two nodes appear together in a sliding window the weight of their edge is increased by one;
all source-code words are de-duplicated, each word is mapped to a node, and all sliding windows are obtained;
each sliding window is processed: every word in the window is regarded as a node V, the number of co-occurrences between one node V_i and another node V_j within the current sliding window W_n is counted, and the result is added to the previous co-occurrence count of the two nodes as the initial edge feature E_ij between them.
Further, S2 specifically adopts the following:
for each node v_i ∈ V in the graph, initialization is performed at the start: the initial state vector of each node is set to Z dimensions, and the input feature x_i of each node has D dimensions; if the node state vector dimension Z is larger than the node input feature dimension D, Z−D zeros are appended after the node input feature;
let T be the total number of time steps; in each time step t ≤ T, every node both receives information from its adjacent nodes and sends information to them, and the adjacency relations between nodes are represented by an adjacency matrix; the interaction between nodes is expressed as

a_v^t = A_v:^T [h_1^(t-1); …; h_|V|^(t-1)] + b

where A_v: consists of the two column blocks (incoming and outgoing) of the corresponding node selected from the matrix A, with A ∈ R^(D|V|×2D|V|) and A_v: ∈ R^(D|V|×2D); [h_1^(t-1); …; h_|V|^(t-1)] is the D|V|-dimensional vector formed by gathering the features of all nodes at time t−1, and b is a bias;
at time step t, the update gate is computed with the formula

z_v^t = σ(W^z a_v^t + U^z h_v^(t-1))

where the input vector a_v^t of the t-th time step undergoes a linear transformation and h_v^(t-1) stores the information of the previous time step t−1; the update gate adds the two parts of information and activates them with the Sigmoid activation function;
at time step t, the reset gate is computed with the formula

r_v^t = σ(W^r a_v^t + U^r h_v^(t-1))

where a_v^t is the input vector of the t-th time step and h_v^(t-1) stores the information of the previous time step t−1;
the new memory content uses the reset gate to store the relevant past information; its computational expression is

h̃_v^t = tanh(W a_v^t + U(r_v^t ⊙ h_v^(t-1)))

where the input vector a_v^t of the t-th time step undergoes a linear transformation (right-multiplication by the matrix W); the Hadamard product r_v^t ⊙ h_v^(t-1), i.e. the element-wise product of the reset gate with the previous state, yields the "reset" past information, the reset gate r_v^t controlling how far the gating is opened; the current information and the information of the previous moment are then added and activated with the hyperbolic tangent activation function tanh;
finally h_v^t is calculated; this vector retains the information that the current unit passes to the next unit, and its computational expression is

h_v^t = (1 − z_v^t) ⊙ h_v^(t-1) + z_v^t ⊙ h̃_v^t

where the Hadamard product (1 − z_v^t) ⊙ h_v^(t-1) represents the information retained from the previous time step, and z_v^t ⊙ h̃_v^t represents the information that the current memory retains to the final memory; the sum of the two parts of information equals the output of the final gated recurrent unit.
Further, S3 specifically adopts the following: the computation of the graph-level representation of the function code is defined as

h'_v = σ(f_1(h_v^(T))) ⊙ tanh(f_2(h_v^(T)))
h_G = (1/|V|) Σ_{v∈V} h'_v

where f_1 and f_2 are two multi-layer perceptrons (MLPs), f_1 acting as a soft attention weight and f_2 as a nonlinear feature transformation; a max-pooling function is additionally applied for the graph representation h_G of each code sample, and the label prediction is

ŷ = softmax(W h_G + b)

where W and b are a weight and a bias.
By adopting the above technical scheme, the software vulnerability identification method with weight deviation based on graph neural network detection can effectively represent the structural characteristics of the source code by converting its text form into a source-code graph representation, so that the model can learn the source code more fully. Because defective samples are few in the source code, the Dice coefficient replaces the common cross-entropy loss function so that the model pays more attention to defective samples, thereby better improving software quality.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the embedding process of the present invention;
fig. 3 is a schematic diagram of the construction process of the adjacency matrix a in the present invention.
Detailed Description
In order to make the technical scheme and advantages of the present invention more clear, the technical scheme in the embodiment of the present invention is clearly and completely described below with reference to the accompanying drawings in the embodiment of the present invention:
the software vulnerability recognition method with weight deviation based on the graph neural network detection shown in fig. 1 specifically comprises the following steps:
step 1: graph embedding model
Each word and symbol in the source code is regarded as a node, and the source-code composition graph representation G = {V, E} serves as the input to the GGNN. V denotes the set of nodes and E the set of edges. First the initial feature value of each node must be obtained, as well as the initial feature of each graph's connected edges. For the initial edge features, the graph embedding model uses the NL Graph Embedding method to obtain edge weights. For the initial node features, a GloVe-trained word embedding dictionary (glove.6b.300) is used to obtain the embedding vector of each word.
NL Graph Embedding: the source code is patterned using a sliding window word co-occurrence mechanism. As shown in fig. 2, a window size equal to 3 is used. For NL "static int has_duration (avformat context_n)", all words are first de-duplicated, each word is mapped to an index, and then all sliding windows are acquired. Thereafter, processing is performed for each sliding window. Each word in each sliding window is considered as a node V, and the current sliding window W is counted n One node V of (a) i With another node V j The number of co-occurrences between the two nodes is simply added to the previous number of co-occurrences of the two nodes as an edge initiation feature E between the two nodes ij
Step 2: gate control graph neural network prediction module
Graph g= { V, E } generated by graph embedding model is taken as input, for each node V in the graph i E V, initializing at the beginning, the formula is
Figure BDA0003112340660000051
Setting the initial state vector of each node to Z dimension, and inputting characteristic x of each node i The dimension is D dimension. If the state vector dimension Z of the node is greater than the node input feature dimension D, Z-D0 s are added behind the node input feature.
T is the total number of time steps, and in each time step T is less than or equal to T, each node can not only accept the information of the adjacent node, but also send the information to the adjacent node. The adjacency between nodes is represented by an adjacency matrix. The interaction between nodes is expressed as the following formula
Figure BDA0003112340660000052
Wherein A is v: Is two columns in and out of the corresponding node selected from the matrix A of FIG. 3 (c), matrix A εR D|v|×2D|v| ,A v: ∈R D|v|×2D 。/>
Figure BDA0003112340660000053
Is a vector of dimension d|v| formed by aggregating the features of all nodes at time t-1. In addition, b is the deviation.
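To make the message-gathering step concrete, here is a small NumPy sketch. It is an assumed toy example: a 4-node graph with one plain 0/1 adjacency matrix per edge direction (names A_in, A_out are illustrative), rather than the patent's full block matrix A.

```python
import numpy as np

# Assumed toy directed graph on 4 nodes with edges 0->1, 1->2, 2->3.
A_out = np.array([[0, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1],
                  [0, 0, 0, 0]], dtype=float)
A_in = A_out.T                        # incoming edges
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))           # node states h^(t-1), Z = 3
b = np.zeros(6)                       # bias

# Each node gathers the states of its in- and out-neighbours; the two
# halves play the role of the "in" and "out" column blocks of A_v:
a = np.concatenate([A_in @ H, A_out @ H], axis=1) + b   # shape (4, 6)
```

A node with no edges in a given direction simply receives a zero message for that half, which matches selecting an all-zero column block from A.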
At time step t, we first compute the update gate using the formula

z_v^t = σ(W^z a_v^t + U^z h_v^(t-1))

where a_v^t, the input vector of the t-th time step, undergoes a linear transformation (multiplication by the weight matrix W^z), and h_v^(t-1), which stores the information of the previous time step t−1, likewise undergoes a linear transformation. The update gate adds the two pieces of information and activates them with the Sigmoid activation function. The update gate helps the model decide how much past information to pass to the future, i.e. how much information from the previous and current time steps needs to continue onward. In this way the model can decide to copy all of the past information, which reduces the risk of vanishing gradients.
At time step t, we compute the reset gate using the formula

r_v^t = σ(W^r a_v^t + U^r h_v^(t-1))

where a_v^t is the input vector of the t-th time step and h_v^(t-1) stores the information of the previous time step t−1. The reset gate is similar to the update gate, but its linear transformation uses different parameters and serves a different purpose: it helps the model decide how much past information to forget.
The new memory content uses the reset gate to store the relevant past information; its computational expression is

h̃_v^t = tanh(W a_v^t + U(r_v^t ⊙ h_v^(t-1)))

where the input vector a_v^t of the t-th time step undergoes a linear transformation, i.e. right-multiplication by the matrix W. Computing the Hadamard product r_v^t ⊙ h_v^(t-1), the element-wise product of the reset gate with the previous state, yields the data after the "reset". The reset gate r_v^t controls how far the gating is opened, so the Hadamard product determines which previous information is retained and which is forgotten. The two computation results are then added and activated with the hyperbolic tangent activation function tanh.
Finally, we compute

h_v^t = (1 − z_v^t) ⊙ h_v^(t-1) + z_v^t ⊙ h̃_v^t

This vector retains the information that the current unit passes to the next unit. The Hadamard product (1 − z_v^t) ⊙ h_v^(t-1) represents the information retained from the previous time step, and z_v^t ⊙ h̃_v^t represents the information that the current memory retains to the final memory. The sum of the two parts is the output of the final gated recurrent unit.
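The four update equations above can be sketched per time step as follows. This is a hedged NumPy sketch: bias terms are omitted, and the parameter names Wz, Uz, Wr, Ur, W, U are illustrative, not the patent's.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_node_update(a_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One gated update per node: update gate z, reset gate r,
    candidate state h_tilde, then a convex combination."""
    z = sigmoid(a_t @ Wz + h_prev @ Uz)              # update gate
    r = sigmoid(a_t @ Wr + h_prev @ Ur)              # reset gate
    h_tilde = np.tanh(a_t @ W + (r * h_prev) @ U)    # new memory content
    return (1 - z) * h_prev + z * h_tilde            # next node state

rng = np.random.default_rng(1)
n, d = 5, 8                                          # |V| nodes, Z = 8
a_t = rng.normal(size=(n, d))                        # aggregated messages
h_prev = rng.normal(size=(n, d))                     # states at t-1
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(6)]
h_next = gru_node_update(a_t, h_prev, *params)
```

Because the output is a convex combination gated by z, setting all parameter matrices to zero gives z = r = 0.5 and a zero candidate, so the new state is exactly half of the previous one; this makes the "copy past information" behaviour of the update gate easy to verify.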
Step 3: readout module
After the word nodes have been sufficiently updated, they are aggregated into a graph-level representation of the function code, and the final prediction is generated based on that representation. The computation of the graph-level representation of the function code is defined as follows:

h'_v = σ(f_1(h_v^(T))) ⊙ tanh(f_2(h_v^(T)))
h_G = (1/|V|) Σ_{v∈V} h'_v

where f_1 and f_2 are two multi-layer perceptrons (MLPs). The former acts as a soft attention weight and the latter as a nonlinear feature transformation. In addition to the average weighted word features, we also apply a max-pooling function for the graph representation h_G of each code sample; the idea is that every word plays a role in the code, but the contribution of the keywords should be more explicit. Finally, the label prediction is obtained by feeding the graph-level vector into the softmax layer:

ŷ = softmax(W h_G + b)

where W and b are a weight and a bias.
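A minimal NumPy sketch of this readout follows. It assumes single-layer maps standing in for the MLPs f_1 and f_2, and combines the attention-weighted average additively with an element-wise max-pool; the exact combination used in the original is an assumption here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def readout(H, W1, W2, Wc, bc):
    """Soft-attention readout: W1 plays f_1 (attention weights), W2 plays
    f_2 (feature transformation); gated features are averaged, combined
    with an element-wise max-pool, then classified with softmax."""
    att = sigmoid(H @ W1)              # f_1: soft attention weights
    feat = np.tanh(H @ W2)             # f_2: nonlinear transformation
    h_G = (att * feat).mean(axis=0) + feat.max(axis=0)
    return softmax(h_G @ Wc + bc)      # label probabilities

rng = np.random.default_rng(2)
n, d, c = 6, 8, 2                      # nodes, state dim, classes
H = rng.normal(size=(n, d))            # final node states h^(T)
p = readout(H,
            rng.normal(size=(d, d)), rng.normal(size=(d, d)),
            rng.normal(size=(d, c)), np.zeros(c))
```

The max-pool term lets a single strongly activated "keyword" node dominate a feature dimension, which is the stated motivation for adding it on top of the attention-weighted average.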
Step 4: weighting model
In practical code defect prediction, defective samples usually constitute a minority of all samples, and some defective code samples are falsely reported or missed. Previous work typically measures model performance with a cross-entropy loss function, in which the loss weights of predicted defective and non-defective samples have a ratio of 1:1; as a result, the features of the under-represented defective samples are learned insufficiently and are hard to recognize as fully as possible. We consider the cross-entropy loss function responsible for this problem. Let y denote the actual label and p̂ the predicted output; the original cross-entropy loss function is

CE = −(y log p̂ + (1 − y) log(1 − p̂))

We therefore replace the cross-entropy loss function with the DSC coefficient. In the DSC coefficient, the factor (1 − p̂) acts as a scaling factor: for easy samples (p̂ tending to 1 or 0) it makes the model pay less attention to them. Indeed, from the derivative point of view, once the model classifies the current sample correctly (p̂ just passing 0.5), DSC makes the model attach less importance to it rather than encouraging p̂ to approach 0 or 1 the way the cross-entropy loss function does, which effectively prevents model training from being dominated by an excess of easy samples.
The DSC coefficient is

DSC = (2(1 − p̂) p̂ · y + γ) / ((1 − p̂) p̂ + y + γ)

where γ is a smoothing constant.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any equivalent substitution or modification made, within the technical scope disclosed by the present invention, by a person skilled in the art according to the technical scheme of the present invention and its inventive concept shall be covered by the scope of protection of the present invention.

Claims (1)

1. A software vulnerability identification method based on graph neural network detection with weight deviation, characterized by comprising the following steps:
S1, regarding each word and symbol in the source code as a node, representing the source code with a composition graph, and obtaining the initial feature value of each node and the initial feature of each graph's connected edges;
S2, taking the generated composition graph as input, where the node state vectors have dimension Z; when Z is larger than the node input feature dimension D, Z−D zeros are appended after the node input features; edges between different nodes in the graph representation transmit and receive information; the information received by a node passes through the update gate, which determines how much information of the previous and current moments needs to be passed to the next moment, and through the reset gate, which determines how much information of the previous and current moments needs to be discarded; finally the information output by the reset gate, the information output by the update gate and the node's own information are combined, and the node activation value output by the Sigmoid activation function serves as the node state at the final moment;
S3, after the word nodes are fully updated, aggregating the words into a graph-level representation of the function code and generating the final vulnerability identification result based on that representation;
S4, using the Dice coefficient to minimize the defective-sample vulnerability loss value, thereby identifying the software vulnerability;
the initial characteristics of the connected edges of each graph in S1 are obtained by adopting an NLGraphEmbedding method, and the specific mode is as follows:
the method comprises the steps that a sliding window word co-occurrence mechanism is adopted to pattern a source code, edges in a formed graph are characteristic values, nodes appear in a sliding window, and the weight of the edges is increased by one;
all source codes are de-duplicated, each word is mapped to a node, and all sliding windows are obtained;
processing each sliding window, regarding each word in each sliding window as a node V, and counting the current sliding window W n One node V of (a) i With another node V j The number of times of co-occurrence between the two nodes, and adding the result to the previous number of times of co-occurrence of the two nodes as an edge initial feature E between the two nodes ij
S2 specifically adopts the following:
for each node v_i ∈ V in the graph, initialization is performed at the start: the initial state vector of each node is set to Z dimensions, and the input feature x_i of each node has D dimensions; if the node state vector dimension Z is larger than the node input feature dimension D, Z−D zeros are appended after the node input feature;
let T be the total number of time steps; in each time step t ≤ T, every node both receives information from its adjacent nodes and sends information to them, and the adjacency relations between nodes are represented by an adjacency matrix; the interaction between nodes is expressed as

a_v^t = A_v:^T [h_1^(t-1); …; h_|V|^(t-1)] + b

wherein A_v: consists of the two column blocks (incoming and outgoing) of the corresponding node selected from the matrix A, with A ∈ R^(D|V|×2D|V|) and A_v: ∈ R^(D|V|×2D); [h_1^(t-1); …; h_|V|^(t-1)] is the D|V|-dimensional vector formed by gathering the features of all nodes at time t−1, and b is a bias;
at time step t, the update gate is computed with the formula

z_v^t = σ(W^z a_v^t + U^z h_v^(t-1))

wherein the input vector a_v^t of the t-th time step undergoes a linear transformation and h_v^(t-1) stores the information of the previous time step t−1; the update gate adds the two parts of information and activates them with the Sigmoid activation function;
at time step t, the reset gate is computed with the formula

r_v^t = σ(W^r a_v^t + U^r h_v^(t-1))

wherein a_v^t is the input vector of the t-th time step and h_v^(t-1) stores the information of the previous time step t−1;
the new memory content uses the reset gate to store the relevant past information; its computational expression is

h̃_v^t = tanh(W a_v^t + U(r_v^t ⊙ h_v^(t-1)))

wherein the input vector a_v^t of the t-th time step undergoes a linear transformation (right-multiplication by the matrix W); the Hadamard product r_v^t ⊙ h_v^(t-1), i.e. the element-wise product of the reset gate with the previous state, yields the "reset" past information, the reset gate r_v^t controlling how far the gating is opened; the current information and the information of the previous moment are added and activated with the hyperbolic tangent activation function tanh;
finally h_v^t is calculated; this vector retains the information that the current unit passes to the next unit, and its computational expression is

h_v^t = (1 − z_v^t) ⊙ h_v^(t-1) + z_v^t ⊙ h̃_v^t

wherein the Hadamard product (1 − z_v^t) ⊙ h_v^(t-1) represents the information retained from the previous time step and z_v^t ⊙ h̃_v^t represents the information that the current memory retains to the final memory; the sum of the two parts of information equals the output of the final gated recurrent unit;
s3, specifically adopting the following modes: the calculation process of the graphical level representation of the function code is defined as follows:
Figure FDA00042384844400000222
wherein f 1 And f 2 Is two multi-layer perceptrons (MLP), f 1 As soft attention weight, f 2 As a nonlinear feature transformation, a graph representation h for each code sample G The application of a maximum pool function tag predicts +.>
Figure FDA0004238484440000031
Where W and b are weights and deviations.
CN202110652749.4A 2021-06-11 2021-06-11 Software vulnerability identification method based on graph neural network detection with weight deviation Active CN113378176B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110652749.4A CN113378176B (en) 2021-06-11 2021-06-11 Software vulnerability identification method based on graph neural network detection with weight deviation

Publications (2)

Publication Number Publication Date
CN113378176A CN113378176A (en) 2021-09-10
CN113378176B true CN113378176B (en) 2023-06-23

Family

ID=77574117

Country Status (1)

Country Link
CN (1) CN113378176B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259394A (en) * 2020-01-15 2020-06-09 中山大学 Fine-grained source code vulnerability detection method based on graph neural network
CN111274134A (en) * 2020-01-17 2020-06-12 扬州大学 Vulnerability identification and prediction method and system based on graph neural network, computer equipment and storage medium
CN112560049A (en) * 2020-12-28 2021-03-26 苏州极光无限信息技术有限公司 Vulnerability detection method and device and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018184102A1 (en) * 2017-04-03 2018-10-11 Royal Bank Of Canada Systems and methods for malicious code detection
US11568055B2 (en) * 2019-08-23 2023-01-31 Praetorian System and method for automatically detecting a security vulnerability in a source code using a machine learning model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant