CN113742733A - Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device - Google Patents

Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device Download PDF

Info

Publication number
CN113742733A
CN113742733A (application CN202110909147.2A)
Authority
CN
China
Prior art keywords
vulnerability
event trigger
description
answer
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110909147.2A
Other languages
Chinese (zh)
Other versions
CN113742733B (en)
Inventor
Li Lili (李莉莉)
Sun Xiaobing (孙小兵)
Bo Lili (薄莉莉)
Wei Ying (魏颖)
Li Bin (李斌)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University filed Critical Yangzhou University
Priority to CN202110909147.2A priority Critical patent/CN113742733B/en
Publication of CN113742733A publication Critical patent/CN113742733A/en
Application granted granted Critical
Publication of CN113742733B publication Critical patent/CN113742733B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577Assessing vulnerabilities and evaluating computer system security
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a reading understanding-based method and device for vulnerability event trigger word extraction and vulnerability type identification. The method comprises the following steps: collecting vulnerability data; learning representations of the vulnerability description sentences; constructing the syntactic dependencies of the vulnerability description text with a graph convolution network (GCN) and extracting vulnerability features; and recognizing and classifying vulnerability event trigger words based on the question-answering task in the BERT fine-tuning model. The invention makes better use of the syntactic and semantic information in vulnerability descriptions, fully mines their context information, and achieves recognition and classification of vulnerability event trigger words, which alleviates the problem of inaccurate vulnerability classification to a certain extent. Compared with currently popular event trigger word extraction methods, it captures the dependencies among different events and outputs the trigger words of vulnerability events to assist developers in analyzing vulnerabilities.

Description

Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
Technical Field
The invention belongs to the field of software security, and particularly relates to a reading understanding-based method and device for vulnerability event trigger word extraction and vulnerability type identification.
Background
Software vulnerabilities weaken the security of computer software and lead to problems such as data loss, data tampering, and privacy disclosure. In terms of their causes, software vulnerabilities mainly include buffer overflows, missing validation of input content, and the like. With the rapid development of computers and the Internet and the increase of 0-day vulnerabilities, software vulnerabilities cause huge damage to individuals, communities, and countries; to perform security assessment, it is necessary to identify and classify vulnerability trigger words (the causes of occurrence). In previous work, features are learned from the original text by typical neural networks (CNNs, RNNs, and the like), and additional fine-grained information, such as entity-level, document-level, and syntax-level features, is used to improve the representation, with the aim of locating each event trigger/argument and identifying its class. Much recent work has explored pre-trained language models for feature learning. Because pre-trained language models can learn universal language representations from large amounts of unlabeled data, methods that learn features with pre-trained language models usually improve considerably over methods based on traditional neural networks; however, the pre-trained models are not combined with other fine-grained information, so the model loses the syntactic dependencies in vulnerability description sentences and the recognition accuracy of trigger words is not high.
At present, some work uses machine learning/deep learning methods to extract event trigger words and identify event types. For example, the document "Jointly Multiple Events Extraction via Attention-based Graph Information Aggregation" proposes a new joint multi-event extraction framework: words are characterized by concatenating word embedding vectors, part-of-speech tagging vectors, position embedding vectors, and entity tagging vectors; syntactic shortcut arcs are introduced to enhance information flow; and an attention-based graph convolution network models the graph information, so that multiple event trigger words and arguments are jointly extracted. However, the generalization capability of the model's word vectors is poor, and character-level, word-level, sentence-level, and inter-sentence relationship features cannot be fully described. Other work has begun to use pre-training methods to identify vulnerability trigger words; for example, the document "Event Extraction as Machine Reading Comprehension" performs sentence representation learning with a BERT pre-trained model, extracts the trigger words of events based on a reading comprehension task, and classifies events with a logistic regression model, but the syntactic relationships in the text are not used and the associations among multiple events cannot be captured, so event type identification is limited.
Disclosure of Invention
Purpose of the invention: aiming at the above problems, the invention provides a reading understanding-based method and device for vulnerability event trigger word extraction and vulnerability type identification, which can determine the cause of a vulnerability and assist developers in vulnerability repair.
Technical scheme: the reading understanding-based vulnerability event trigger word extraction and vulnerability type identification method of the invention specifically comprises the following steps:
(1) acquiring vulnerability data, acquiring a CVE-ID of a vulnerability entry, vulnerability description and vulnerability type corresponding to each ID, and designing a question Q for a vulnerability event;
(2) based on a BERT pre-training model, performing vulnerability description statement representation learning as initial node characteristics input by GCN;
(3) extracting node characteristics of vulnerability information by using a Graph Convolution Network (GCN);
(4) and recognizing and classifying vulnerability event trigger words based on the question-answering task in the BERT fine tuning model.
Further, the step (2) comprises the steps of:
(21) converting the designed question Q and the description Text of the vulnerability entry into an input sequence of the BERT pre-training model; a special token [CLS] is placed at the beginning to fuse the semantic information of each word in the description, and the question and the vulnerability description are separated by [SEP]; each word is converted into a Token embedding, a Segment embedding, and a Position embedding, and these embedded representations are summed to obtain a representation vector;
(22) transmitting the representation vector to the encoder layer of BERT; the Transformer, combined with the masked language model and next sentence prediction tasks, realizes a bidirectional language model, and representation learning yields an embedding vector X that serves as the initial node features input to the GCN.
Further, the step (3) includes the steps of:
(31) based on the text description of the vulnerability entry, acquiring the syntactic dependency relationship of the vulnerability description text by using a Stanford syntactic analysis tool;
(32) constructing a syntactic information graph G = (V, E) of the vulnerability description according to the syntactic dependencies; where V is the set of word nodes {v_1, v_2, ..., v_i, ..., v_n}, v_i represents the i-th word in the vulnerability description, n is the number of words in the vulnerability description, and E is the set of directed edges (v_i, v_j) from node v_i to node v_j; a reverse edge (v_j, v_i) is added for each directed edge, a self-loop edge (v_i, v_i) is added for each node v_i, and a relationship type label K(v_i, v_j) is added for each edge;
(33) obtaining the adjacency matrix A based on the syntactic information graph G, i.e., if node v_i and node v_j are connected, the element a_ij in the i-th row and j-th column of the adjacency matrix A is 1, otherwise a_ij = 0; Â is the normalized matrix of the adjacency matrix A, obtained by the following transformation:
Â = D'^(-1/2) · A' · D'^(-1/2)
where A' = A + I, I is the identity matrix, and D' is the degree matrix of A';
(34) performing gradient descent training on the vulnerability node information and extracting vulnerability node features, with the following transformation:
H^(l+1) = σ(Â · H^(l) · W_K^(l))
where H^(l) is the vulnerability node information input to layer l of the graph convolutional network; the normalized matrix Â and the weight matrix W_K^(l) of layer l for the relationship type label K(v_i, v_j) apply a linear transformation, which is then passed through the nonlinear activation function σ to obtain the vulnerability node information H^(l+1) input to the next layer; after multiple rounds of convolution, the feature vector of each vulnerability node is obtained;
(35) the same operations are performed for the question about the vulnerability event trigger word: its syntactic dependencies are constructed and the feature vector of the question sentence is obtained.
Further, the step (4) includes the steps of:
(41) feeding the question feature vector A and the vulnerability description feature vector B into the fully connected layer and the softmax layer of the BERT question-answering task;
(42) introducing a start vector S and an end vector E for the BERT question-answering task, and calculating the probability P_i that the i-th word in the vulnerability description is the start of the answer span; the word with the highest probability is taken as the start of the answer span, with the following transformation:
P_i = exp(S · T_i) / Σ_j exp(S · T_j)
where T_i is the feature vector of word i; the end of the answer span is calculated in the same way with the formula P'_i = exp(E · T_i) / Σ_j exp(E · T_j); the score of the candidate answer from position i to position j is defined as S_ij = S · T_i + E · T_j, and the maximum-score span with j ≥ i is taken as the prediction result;
no-answer prediction is performed at the same time: a question without an answer is regarded as having an answer span that starts and ends at the [CLS] token, and the no-answer score is calculated as S_null = S · C + E · C, where C is the vector of the special token [CLS];
the no-answer score S_null is compared with the score S_ij of the best non-null span; when S_ij > S_null + τ, where τ is a user-defined threshold, a non-null answer is predicted, and this answer is the vulnerability event trigger word;
(43) based on the vulnerability event trigger words, the feature vector of each word is used as the input of a logistic regression model, and the probability that the trigger words belong to different vulnerability types is calculated to predict the categories of vulnerability events.
Based on the same inventive concept, the invention also provides a device for extracting the read understanding vulnerability event trigger words and identifying the vulnerability type, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the computer program realizes the method for extracting the read understanding vulnerability event trigger words and identifying the vulnerability type when being loaded to the processor.
Advantageous effects: compared with the prior art, the invention has the following beneficial effects: 1. a BERT pre-training model is used to construct word vector representations of the vulnerability description, fully describing character-level, word-level, sentence-level, and inter-sentence relationship features; 2. the syntactic information of the vulnerability description is expressed from the perspective of a graph, the syntactic dependencies of the vulnerability description are constructed, the relationships between words are brought into the model's learning process, and different weights are given to different relationship types to learn the influence of different dependencies on trigger word recognition and classification; 3. unlike traditional vulnerability classification methods, the invention outputs the vulnerability trigger words and the vulnerability types as the final result, which clarifies the cause of the vulnerability and assists developers in vulnerability repair.
Drawings
FIG. 1 is a flowchart of the reading understanding-based vulnerability event trigger word extraction and vulnerability type identification method;
FIG. 2 is a description of vulnerability entries and their types;
FIG. 3 is a syntactic dependency of a vulnerability description;
FIG. 4 is a schematic diagram of the BERT question-answering task.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The invention provides a reading understanding-based vulnerability event trigger word extraction and vulnerability type identification method, which specifically comprises the following steps of:
step 1: and (5) vulnerability data acquisition.
(1.1) acquiring the CVE-IDs of all vulnerability entries from the vulnerability database NVD from 1999 to the present, together with the vulnerability description and vulnerability type corresponding to each ID, as shown in FIG. 2.
(1.2) first, the descriptions of the vulnerability entries collected in step (1.1) are compiled into a word list WordList in which each word corresponds to a unique sequence number, without repetition; then a question Q is designed for the vulnerability event: What is the cause of the vulnerability?
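As a non-limiting illustrative sketch of step (1.2), the WordList and the fixed question Q can be prepared as follows in Python; the example descriptions are hypothetical placeholders, not real NVD entries.

    # Sketch: build a duplicate-free WordList (word -> sequence number) from the
    # collected vulnerability descriptions and fix the question Q for every event.
    descriptions = [
        "Buffer overflow in the foo component allows remote attackers to execute arbitrary code.",
        "Improper input validation in the bar parser leads to SQL injection.",
    ]

    word_list = {}                                   # WordList: word -> unique sequence number
    for desc in descriptions:
        for word in desc.lower().split():
            if word not in word_list:
                word_list[word] = len(word_list)

    question = "What is the cause of the vulnerability?"   # question Q designed for the vulnerability event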
Step 2: vulnerability description sentence representation learning based on the BERT pre-training model, used as the initial node features input to the GCN.
(2.1) converting the question Q designed in step (1.2) and the description Text of the vulnerability entry into an input sequence of the BERT model: a special token [CLS] is placed at the beginning to fuse the semantic information of each word in the description, and the question and the vulnerability description are separated by [SEP]. Each word is converted into a Token embedding (generated from the WordList of step (1.2), mapping a word to a vector of fixed dimension), a Segment embedding (indicating the sentence to which the word belongs), and a Position embedding (indicating the position of the word in the sentence), and these embedded representations are summed to obtain a representation vector.
(2.2) passing the representation vector generated in step (2.1) to the encoder layer of BERT. The Transformer, combined with the masked language model and next sentence prediction tasks, realizes a bidirectional language model; representation learning yields an embedding vector X that serves as the initial node features input to the GCN.
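A minimal sketch of steps (2.1)-(2.2) is given below. It assumes the HuggingFace transformers library and the bert-base-uncased checkpoint, neither of which is prescribed by the invention; the tokenizer inserts [CLS] and [SEP], and the model sums the token, segment, and position embeddings internally.

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    question = "What is the cause of the vulnerability?"
    text = "Buffer overflow in the foo component allows remote attackers to execute arbitrary code."

    # "[CLS] question [SEP] text [SEP]" packing and embedding summation are handled by the library.
    inputs = tokenizer(question, text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    X = outputs.last_hidden_state   # (1, sequence length, 768): embedding vector X, the initial GCN node features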
Step 3: extracting the node features of the vulnerability information with the graph convolution network (GCN).
(3.1) acquiring the syntactic dependencies of the vulnerability description text with the Stanford syntactic analysis tool, based on the vulnerability descriptions obtained in step (1.1), as shown in FIG. 3.
(3.2) constructing a syntactic information graph G = (V, E) of the description text according to the syntactic dependencies obtained in step (3.1), where V is the set of vulnerability nodes {v_1, v_2, ..., v_i, ..., v_n}, v_i represents the i-th word in the vulnerability description, n is the number of words in the vulnerability description, and E is the set of directed edges (v_i, v_j) from node v_i to node v_j. To facilitate information flow, a reverse edge (v_j, v_i) is added for each directed edge; at the same time, a self-loop edge (v_i, v_i) is added for each node v_i, and a relationship type label K(v_i, v_j) is added for each edge.
(3.3) obtaining the adjacency matrix A from the syntactic information graph G of the vulnerability description, i.e., if node v_i and node v_j are connected, the element a_ij in the i-th row and j-th column of the adjacency matrix A is 1, otherwise a_ij = 0. The adjacency matrix A is normalized by the following transformation:
Â = D'^(-1/2) · A' · D'^(-1/2)
where A' = A + I, I is the identity matrix, and D' is the degree matrix of A'.
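A minimal sketch of steps (3.2)-(3.3) follows; the dependency edges are hypothetical stand-ins for the Stanford parser output, given as (head index, dependent index, relation label) triples.

    import numpy as np

    n = 5                                                   # number of words in the description
    dep_edges = [(1, 0, "amod"), (1, 3, "nmod"), (3, 2, "case"), (1, 4, "acl")]   # hypothetical parse

    A = np.zeros((n, n))
    K = {}                                                  # relationship type label K(vi, vj)
    for i, j, rel in dep_edges:
        A[i, j] = 1.0                                       # directed edge (vi, vj)
        A[j, i] = 1.0                                       # added reverse edge (vj, vi)
        K[(i, j)] = rel
        K[(j, i)] = rel + "_rev"

    A_prime = A + np.eye(n)                                 # self-loop edges: A' = A + I
    D_prime = np.diag(A_prime.sum(axis=1))                  # degree matrix of A'
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D_prime)))   # D'^(-1/2)
    A_hat = D_inv_sqrt @ A_prime @ D_inv_sqrt               # normalized adjacency matrix Â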
(3.4) performing gradient descent training on the vulnerability node information and extracting vulnerability node features, with the following transformation:
H^(l+1) = σ(Â · H^(l) · W_K^(l))
where H^(l) is the vulnerability node information input to layer l of the graph convolutional network; when l = 0, H^(0) is the embedding vector X obtained in step (2.2). The normalized matrix Â and the weight matrix W_K^(l) of layer l for the relationship type label K(v_i, v_j) apply a linear transformation, which is then passed through the nonlinear activation function σ to obtain the vulnerability node information H^(l+1) input to the next layer. After multiple rounds of convolution, the feature vector of each vulnerability node is obtained.
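A minimal sketch of one graph-convolution layer of step (3.4) is given below. For brevity it uses a single shared weight matrix per layer, whereas the method keeps a separate weight matrix for each relationship type label K(vi, vj); the input features are random stand-ins for the BERT embedding X.

    import torch
    import torch.nn as nn

    class GCNLayer(nn.Module):
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.linear = nn.Linear(in_dim, out_dim)        # weight matrix W^(l)

        def forward(self, H, A_hat):
            # H: (n, in_dim) node features, A_hat: (n, n) normalized adjacency matrix
            return torch.relu(self.linear(A_hat @ H))       # sigma(A_hat . H^(l) . W^(l))

    n, hidden = 5, 768
    A_hat = torch.rand(n, n)                                # stands in for the normalized adjacency matrix
    H0 = torch.randn(n, hidden)                             # stands in for the BERT embedding X (layer l = 0)
    H1 = GCNLayer(hidden, 256)(H0, A_hat)                   # first convolution
    H2 = GCNLayer(256, 256)(H1, A_hat)                      # repeated convolutions give the node feature vectors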
(3.5) the same operations are performed for the question about the vulnerability event trigger word: its syntactic dependencies are constructed and the feature vector of the question sentence is obtained.
Step 4: training the vulnerability event trigger word recognition and classification model based on the question-answering task in the BERT fine-tuning model.
(4.1) feeding the feature vectors of the question and the vulnerability description obtained in step 3 (the question feature vector is denoted A and the vulnerability description feature vector is denoted B) into the fully connected layer and the softmax layer of the BERT question-answering task, as shown in FIG. 4.
(4.2) during fine-tuning, a start vector S and an end vector E are introduced. The probability P_i that the i-th word in the vulnerability description is the start of the answer span is calculated, and the word with the highest probability is taken as the start of the answer span, with the following transformation:
P_i = exp(S · T_i) / Σ_j exp(S · T_j)
where T_i is the feature vector of word i; the end of the answer span is calculated in the same way with the formula P'_i = exp(E · T_i) / Σ_j exp(E · T_j). The score of the candidate answer from position i to position j is defined as S_ij = S · T_i + E · T_j, and the maximum-score span with j ≥ i is taken as the prediction result.
No-answer prediction is performed at the same time: a question without an answer is regarded as having an answer span that starts and ends at the [CLS] token, and the no-answer score is calculated as S_null = S · C + E · C, where C is the vector of the special token [CLS].
The no-answer score S_null is compared with the score S_ij of the best non-null span. When S_ij > S_null + τ (τ is a user-defined threshold), a non-null answer is predicted, and this answer is the vulnerability event trigger word.
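The span scoring and no-answer threshold of step (4.2) can be sketched as follows; all tensors are random stand-ins for the fine-tuned feature vectors, and the value of tau is an arbitrary example.

    import torch

    n, d = 20, 256
    T = torch.randn(n, d)                            # feature vector T_i of each word i
    C = torch.randn(d)                               # vector of the special token [CLS]
    S, E = torch.randn(d), torch.randn(d)            # start vector S and end vector E
    tau = 0.0                                        # user-defined threshold

    start_scores = T @ S                             # S . T_i for every i
    end_scores = T @ E                               # E . T_j for every j
    P = torch.softmax(start_scores, dim=0)           # P_i = exp(S.T_i) / sum_j exp(S.T_j)

    # S_ij = S.T_i + E.T_j, restricted to spans with j >= i
    span_scores = start_scores.unsqueeze(1) + end_scores.unsqueeze(0)
    valid = torch.triu(torch.ones(n, n)).bool()
    span_scores = span_scores.masked_fill(~valid, float("-inf"))
    best_score = span_scores.max()
    i, j = divmod(int(span_scores.argmax()), n)

    S_null = float(S @ C + E @ C)                    # no-answer score on the [CLS] span
    if best_score > S_null + tau:
        print(f"predicted trigger word span: words {i}..{j}")
    else:
        print("no answer: no trigger word predicted for this description")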
And (4.3) based on the vulnerability event trigger words obtained in the step (4.2), taking the feature vector of each word as the input of a logistic regression model, and calculating the probability that the trigger words belong to different vulnerability types so as to predict the categories of vulnerability events.
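Step (4.3) can be sketched with scikit-learn's logistic regression; the training features and type labels below are random stand-ins for the trigger-word feature vectors and the vulnerability type labels.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 256))            # feature vectors of known trigger words
    y_train = rng.integers(0, 5, size=200)           # vulnerability type labels (e.g., five classes)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    x_new = rng.normal(size=(1, 256))                # feature vector of a newly extracted trigger word
    probs = clf.predict_proba(x_new)                 # probability of each vulnerability type
    print("predicted vulnerability type:", int(clf.predict(x_new)[0]), "probabilities:", probs.round(3))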
Based on the same inventive concept, the invention also provides a device for extracting the read understanding vulnerability event trigger words and identifying the vulnerability type, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the computer program realizes the method for extracting the read understanding vulnerability event trigger words and identifying the vulnerability type when being loaded to the processor.
According to the invention, the answer span and its classification are used as the final output, so that recognition and classification of vulnerability event trigger words can be achieved, developers are assisted in locating the cause of a vulnerability, and the resulting vulnerability classification benefits vulnerability management and helps vulnerability mitigation. The invention makes better use of the syntactic and semantic information in vulnerability descriptions, fully mines their context information, and achieves recognition and classification of vulnerability event trigger words, which alleviates the problem of inaccurate vulnerability classification to a certain extent; compared with currently popular vulnerability classification methods, it outputs the trigger words (causes) of vulnerability events and assists developers in analyzing vulnerabilities.
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (5)

1. A reading understanding vulnerability event trigger word extraction and vulnerability type identification method is characterized by comprising the following steps:
(1) acquiring vulnerability data, acquiring a CVE-ID of a vulnerability entry, vulnerability description and vulnerability type corresponding to each ID, and designing a question Q for a vulnerability event;
(2) based on a BERT pre-training model, performing vulnerability description statement representation learning as initial node characteristics input by GCN;
(3) extracting node characteristics of vulnerability information by using a Graph Convolution Network (GCN);
(4) and recognizing and classifying vulnerability event trigger words based on the question-answering task in the BERT fine tuning model.
2. The reading understanding vulnerability event trigger word extraction and vulnerability type identification method according to claim 1, wherein the step (2) comprises the steps of:
(21) converting the designed question Q and the description Text of the vulnerability entry into an input sequence of the BERT pre-training model; a special token [CLS] is placed at the beginning to fuse the semantic information of each word in the description, and the question and the vulnerability description are separated by [SEP]; each word is converted into a Token embedding, a Segment embedding, and a Position embedding, and these embedded representations are summed to obtain a representation vector;
(22) transmitting the representation vector to the encoder layer of BERT; the Transformer, combined with the masked language model and next sentence prediction tasks, realizes a bidirectional language model, and representation learning yields an embedding vector X that serves as the initial node features input to the GCN.
3. The reading understanding vulnerability event trigger word extraction and vulnerability type identification method according to claim 1, wherein the step (3) comprises the steps of:
(31) based on the text description of the vulnerability entry, acquiring the syntactic dependency relationship of the vulnerability description text by using a Stanford syntactic analysis tool;
(32) constructing a syntactic information graph G = (V, E) of the vulnerability description according to the syntactic dependencies; where V is the set of vulnerability nodes {v_1, v_2, ..., v_i, ..., v_n}, v_i represents the i-th word in the vulnerability description, n is the number of words in the vulnerability description, and E is the set of directed edges (v_i, v_j) from node v_i to node v_j; a reverse edge (v_j, v_i) is added for each directed edge, a self-loop edge (v_i, v_i) is added for each node v_i, and a relationship type label K(v_i, v_j) is added for each edge;
(33) obtaining the adjacency matrix A based on the syntactic information graph G, i.e., if node v_i and node v_j are connected, the element a_ij in the i-th row and j-th column of the adjacency matrix A is 1, otherwise a_ij = 0; Â is the normalized matrix of the adjacency matrix A, obtained by the following transformation:
Â = D'^(-1/2) · A' · D'^(-1/2)
where A' = A + I, I is the identity matrix, and D' is the degree matrix of A';
(34) performing gradient descent training on the vulnerability node information and extracting vulnerability node features, with the following transformation:
H^(l+1) = σ(Â · H^(l) · W_K^(l))
where H^(l) is the vulnerability node information input to layer l of the graph convolutional network; the normalized matrix Â and the weight matrix W_K^(l) of layer l for the relationship type label K(v_i, v_j) apply a linear transformation, which is then passed through the nonlinear activation function σ to obtain the vulnerability node information H^(l+1) input to the next layer; after multiple rounds of convolution, the feature vector of each vulnerability node is obtained;
(35) the same operations are performed for the question about the vulnerability event trigger word: its syntactic dependencies are constructed and the feature vector of the question sentence is obtained.
4. The reading understanding vulnerability event trigger word extraction and vulnerability type identification method according to claim 1, wherein the step (4) comprises the steps of:
(41) feeding the question feature vector A and the vulnerability description feature vector B into the fully connected layer and the softmax layer of the BERT question-answering task;
(42) introducing a start vector S and an end vector E for the BERT question-answering task, and calculating the probability P_i that the i-th word in the vulnerability description is the start of the answer span; the word with the highest probability is taken as the start of the answer span, with the following transformation:
P_i = exp(S · T_i) / Σ_j exp(S · T_j)
where T_i is the feature vector of word i; the end of the answer span is calculated in the same way with the formula P'_i = exp(E · T_i) / Σ_j exp(E · T_j); the score of the candidate answer from position i to position j is defined as S_ij = S · T_i + E · T_j, and the maximum-score span with j ≥ i is taken as the prediction result;
no-answer prediction is performed at the same time: a question without an answer is regarded as having an answer span that starts and ends at the [CLS] token, and the no-answer score is calculated as S_null = S · C + E · C, where C is the vector of the special token [CLS];
the no-answer score S_null is compared with the score S_ij of the best non-null span; when S_ij > S_null + τ, where τ is a user-defined threshold, a non-null answer is predicted, and this answer is the vulnerability event trigger word;
(43) based on the vulnerability event trigger words, the feature vector of each word is used as the input of a logistic regression model, and the probability that the vulnerability event trigger words belong to different vulnerability types is calculated to predict the categories of vulnerability events.
5. A reading understanding vulnerability event trigger word extraction and vulnerability type identification apparatus, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program when loaded into the processor implements the reading understanding vulnerability event trigger word extraction and vulnerability type identification method according to any one of claims 1-4.
CN202110909147.2A 2021-08-09 2021-08-09 Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type Active CN113742733B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110909147.2A CN113742733B (en) 2021-08-09 2021-08-09 Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110909147.2A CN113742733B (en) 2021-08-09 2021-08-09 Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type

Publications (2)

Publication Number Publication Date
CN113742733A true CN113742733A (en) 2021-12-03
CN113742733B CN113742733B (en) 2023-05-26

Family

ID=78730392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110909147.2A Active CN113742733B (en) 2021-08-09 2021-08-09 Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type

Country Status (1)

Country Link
CN (1) CN113742733B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866254A (en) * 2019-09-29 2020-03-06 华为终端有限公司 Vulnerability detection method and electronic equipment
CN111274134A (en) * 2020-01-17 2020-06-12 扬州大学 Vulnerability identification and prediction method and system based on graph neural network, computer equipment and storage medium
CN111460450A (en) * 2020-03-11 2020-07-28 西北大学 Source code vulnerability detection method based on graph convolution network
CN111723182A (en) * 2020-07-10 2020-09-29 云南电网有限责任公司曲靖供电局 Key information extraction method and device for vulnerability text
CN112163416A (en) * 2020-10-09 2021-01-01 北京理工大学 Event joint extraction method for merging syntactic and entity relation graph convolution network
CN112364352A (en) * 2020-10-21 2021-02-12 扬州大学 Interpretable software vulnerability detection and recommendation method and system
CN112668013A (en) * 2020-12-31 2021-04-16 西安电子科技大学 Java source code-oriented vulnerability detection method for statement-level mode exploration

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
LILI BO et al.: "Bug Question Answering with Pretrained Encoders", 《IEEE》 *
ZHIBIN LU et al.: "VGCN-BERT: Augmenting BERT with Graph Embedding for Text Classification", 《WEB OF SCIENCE》 *
刘莉莉 [LIU Lili]: "面向产品需求分析的事件抽取研究" [Research on Event Extraction for Product Requirement Analysis], 《信息科技》 [Information Science and Technology] *
吴凡等 [WU Fan et al.]: "基于字词联合表示的中文事件检测方法" [Chinese Event Detection Method Based on Joint Character-Word Representation], 《计算机科学》 [Computer Science] *
李建鹏 [LI Jianpeng]: "面向软件语言的漏洞分析技术应用研究" [Applied Research on Vulnerability Analysis Technology for Software Languages], 《信息科技》 [Information Science and Technology] *
查云杰等 [ZHA Yunjie et al.]: "基于BERT和GCN的引文推荐模型" [A Citation Recommendation Model Based on BERT and GCN], 《计算机应用与软件》 [Computer Applications and Software] *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114239566A (en) * 2021-12-14 2022-03-25 公安部第三研究所 Method, device and processor for realizing two-step Chinese event accurate detection based on information enhancement and computer readable storage medium thereof
CN114239566B (en) * 2021-12-14 2024-04-23 公安部第三研究所 Method, device, processor and computer readable storage medium for realizing accurate detection of two-step Chinese event based on information enhancement
CN114491209A (en) * 2022-01-24 2022-05-13 南京中新赛克科技有限责任公司 Method and system for mining enterprise business label based on internet information capture
CN115329347A (en) * 2022-10-17 2022-11-11 中国汽车技术研究中心有限公司 Prediction method, device and storage medium based on car networking vulnerability data
CN116777908A (en) * 2023-08-18 2023-09-19 新疆塔林投资(集团)有限责任公司 Auxiliary method and system for plugging casing of oil-gas well
CN116777908B (en) * 2023-08-18 2023-11-03 新疆塔林投资(集团)有限责任公司 Auxiliary method and system for plugging casing of oil-gas well

Also Published As

Publication number Publication date
CN113742733B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
CN110765265B (en) Information classification extraction method and device, computer equipment and storage medium
US20210200961A1 (en) Context-based multi-turn dialogue method and storage medium
CN109992664B (en) Dispute focus label classification method and device, computer equipment and storage medium
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN110633366B (en) Short text classification method, device and storage medium
CN109726745B (en) Target-based emotion classification method integrating description knowledge
CN113806563A (en) Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material
CN113449204B (en) Social event classification method and device based on local aggregation graph attention network
CN113138920B (en) Software defect report allocation method and device based on knowledge graph and semantic role labeling
CN113157859A (en) Event detection method based on upper concept information
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN117076653A (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN115859980A (en) Semi-supervised named entity identification method, system and electronic equipment
CN115221332A (en) Construction method and system of dangerous chemical accident event map
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN113239694B (en) Argument role identification method based on argument phrase
CN113377844A (en) Dialogue type data fuzzy retrieval method and device facing large relational database
CN113705207A (en) Grammar error recognition method and device
CN117574898A (en) Domain knowledge graph updating method and system based on power grid equipment
CN110377753B (en) Relation extraction method and device based on relation trigger word and GRU model
CN114648029A (en) Electric power field named entity identification method based on BiLSTM-CRF model
CN115204140A (en) Legal provision prediction method based on attention mechanism and knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant