CN114254130A - Relation extraction method of network security emergency response knowledge graph - Google Patents


Info

Publication number
CN114254130A
CN114254130A
Authority
CN
China
Prior art keywords
vector
network security
extracting
convolution
emergency response
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210184821.XA
Other languages
Chinese (zh)
Inventor
车洵
孙捷
胡牧
梁小川
刘志顺
金奎�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhongzhiwei Information Technology Co ltd
Original Assignee
Nanjing Zhongzhiwei Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhongzhiwei Information Technology Co ltd filed Critical Nanjing Zhongzhiwei Information Technology Co ltd
Priority to CN202210184821.XA priority Critical patent/CN114254130A/en
Publication of CN114254130A publication Critical patent/CN114254130A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a relation extraction method for a network security emergency response knowledge graph, which comprises the following steps: giving a network security response knowledge text; vectorizing the knowledge text data by extracting the vocabulary in the network security response knowledge text and mapping each word to a K-dimensional vocabulary vector; extracting the position vector corresponding to each vocabulary vector by converting the relative distances between the current word and the entities e1 and e2 into a vector representation; extracting the semantic features of sentences with a residual piecewise convolutional neural network JRpcnn to form feature vectors, taking the vocabulary vectors and their corresponding position vectors as the input of the JRpcnn. The method effectively reduces the influence of noisy data on distant supervision and extracts entity relations from network security emergency response text more accurately.

Description

Relation extraction method of network security emergency response knowledge graph
Technical Field
The invention relates to the field of network security emergency response, in particular to a relation extraction method of a network security emergency response knowledge graph.
Background
Network security emergency response refers to a computer dealing with possible threats, and with what to do after a threat has occurred, based on internally stored security knowledge. Traditional passive network defense methods struggle to respond quickly to increasingly complex threats, so people keep innovating in the network security field, and the standard set for the capability and efficiency of emergency command in unusual situations keeps rising. People have therefore proposed using the knowledge graph to handle network security problems: the knowledge graph is a new idea for analyzing and processing data in network security analysis, and the network security emergency response knowledge graph arose accordingly. It is a data-driven and computationally powerful tool. Personnel working in network security can intuitively understand the relationships between network security entities through such a graph, for example the exploitation relationship between malicious software and vulnerabilities, the affiliation between attackers and organizations, and the relationship between software and vulnerabilities, and can thereby deal with network security problems better. After entities are extracted from a network security emergency response text base, the obtained entities are very dispersed, and the relations between them need to be known to obtain further information. Relation extraction is therefore a very important task in constructing a network security emergency response knowledge graph from unstructured data.
Relation Extraction (RE) is a very important part of Natural Language Processing (NLP). There are many relation extraction methods, such as bootstrapping, unsupervised relation discovery and supervised classification. Most existing RE methods require a large amount of labeled relation-specific training data, which is very time-consuming and labor-intensive. Distant supervision is an efficient and effective strategy for automatically labeling training data. However, the assumption behind distant supervision is too strong and often leads to label errors. Under the distant-supervision framework, some recent work has therefore attempted to use deep neural networks for relation prediction. A relation extraction method for the network security emergency response knowledge graph is thus urgently needed to solve the above problems.
Disclosure of Invention
Therefore, a relation extraction method for the network security emergency response knowledge graph needs to be provided, one that reduces the influence of noisy data on distant supervision and lays a firm foundation for subsequently building the network security emergency response knowledge graph.
In order to achieve the above object, the inventor provides a relationship extraction method of a network security emergency response knowledge-graph, comprising the following steps:
s1: giving a network security response knowledge text;
s2: vectorizing the knowledge text data, namely extracting words in the network security response knowledge text, and mapping the words to a K-dimensional word vector;
s3: extracting the position vector corresponding to the vocabulary vector by adopting a position vector mapping method, namely extracting the relative distance between the current word entity e1 and entity e2, and converting it into a vector representation by embedding;
s4: extracting semantic features of sentences by adopting a residual segmented convolutional neural network JRpcnn to form feature vectors, namely using the vocabulary vectors obtained in the steps S2 and S3 and the position vectors corresponding to the vocabulary vectors as the input of the residual segmented convolutional neural network JRpcnn;
s5: the feature vectors derived in step S4 are further processed using the multiple instance attention mechanism MIT and entity relationships are obtained.
As a preferred embodiment of the present invention, step S2: vectorizing the knowledge text data, namely extracting vocabularies in the network security response knowledge text, and mapping the vocabularies to a K-dimensional vocabulary vector, comprises the following steps:
s201: inputting an original network security emergency response knowledge text, and converting each input word token into a vector by looking up a pre-trained word embedding;
s202: for a given sentence S = {w_1, w_2, ..., w_n} consisting of n words, each word is mapped to a low-dimensional real-value vector space by using a word2vec model, where word2vec takes the One-Hot code of a word as input and obtains the output through a hidden layer of a neural network;
s203: the sentence is then processed into word vectors, finally obtaining the vector representation of each word in the sentence and forming a word vector query matrix Wv; each input training sequence is mapped through the query matrix Wv to obtain the corresponding vocabulary vectors q = {q_1, q_2, ..., q_n}.
As a preferred embodiment of the present invention, step S3: the method for extracting the position vector corresponding to the vocabulary vector by adopting the position vector mapping representation method, namely converting the relative distance between the current word entity and the entity into vector representation by embedding comprises the following steps:
s301: pf is defined as the combination of the relative distances from the current word to entity e1 and entity e2; two position embedding matrices PF1 and PF2 are randomly initialized, and the relative distances are converted into position vectors by looking them up in the position embedding matrices;
s302: in sentence position vectorization, if the dimension of the vocabulary vector is dw and the dimension of the position vector is dp, then the sentence vector dimension is d = dw + 2*dp; the combination of the vocabulary vector and the position vectors forms a sentence vector Q of size s x d, and the sentence vector Q is then fed to the convolution part.
As a preferred embodiment of the present invention, step S4: extracting semantic features of sentences by adopting a residual segmented convolutional neural network JRpcnn to form feature vectors, namely using the vocabulary vectors obtained in the steps S2 and S3 and the position vectors corresponding to the vocabulary vectors as the input of the residual segmented convolutional neural network JRpcnn, and comprising the following steps:
s401: extracting semantic information of a network security emergency response knowledge text by adopting a residual error segmented convolutional neural network JRpcnn, wherein two convolutional layers form a residual error block, and an activation function Relu is adopted for nonlinear mapping after each convolutional layer;
s402: convolution is an operation between a convolution kernel W and the input vector sequence q, where the convolution kernel W is a weight matrix used as a convolution filter; with n convolution kernels, the convolution operation can be expressed as:

c_ij = W_i · q_(j-w+1:j),  1 <= i <= n,  1 <= j <= s + w - 1

wherein i and j are the indices of c_ij, n is the number of convolution kernels, s is the dimensionality of the sentence vector, w is the dimensionality of the convolution kernels, and q_(j-w+1:j) represents the concatenation of q_(j-w+1) to q_j.
The result of one pass of convolution is a matrix C = [c_1, c_2, ..., c_n] of size n x (s + w - 1).
as a preferred embodiment of the present invention, the method further comprises the steps of:
s403: the dimensionality of all convolution kernels in the residual piecewise convolutional neural network is w, and a boundary filling operation is adopted; the convolution kernels of the two convolution layers are W1 and W2; from step S401, the result obtained after the first convolution layer of the residual block is C1 = Relu(W1 · q + b1), and the result obtained after the second convolution layer is C2 = Relu(W2 · C1 + b2), wherein b1 and b2 are offset vectors; the output vector of the residual convolution block is C = F(x) + x, where F(x) is the output result of the second convolution layer and x is the input of the first convolution layer;
s404: after the convolution layer obtains semantic features, the most representative local features are further extracted through a pooling layer, adopting a piecewise max pooling process with the formula:

p_ij = max(c_ij),  1 <= i <= n,  j = 1, 2, 3

For the output of each convolution kernel of the pooling layer we can obtain a 3-dimensional vector p_i = {p_i1, p_i2, p_i3}, and the piecewise pooling outputs of all convolution kernels are then concatenated into p_(1:n); the nonlinear function output is then:

g = tanh(p_(1:n))

wherein p_(1:n) is the concatenation of p_1 to p_n, and the tanh() function is an activation function in a neural network.
As a preferred embodiment of the present invention, step S5: further processing the feature vectors obtained in step S4 using the multiple instance attention mechanism MIT, and obtaining entity relationships comprises the steps of:
s501: a vector is set for an instance set B = {g_1, g_2, ..., g_m}; the instance set vector describes a corresponding network security emergency response entity pair (e1, e2), wherein g_i represents the output of the neural network;
s502: computing the degree of correlation between each instance vector g_i and the relation r, and calculating the instance set vector S, wherein the calculation formula of S is:

S = Σ_i α_i g_i

so the computation of the instance set vector S depends on each instance in the set; wherein α_i is the weight of the input instance vector g_i, used for measuring its correlation with the corresponding relation r, and the calculation formula of α_i is as follows:

α_i = exp(e_i) / Σ_k exp(e_k)

where e_i = q(g_i, r) is a basic query function, which represents the matching degree between the output vector g_i and the predicted relation r;
s503: after calculating the value of the instance set vector S, the likelihood of the predicted relation is calculated; p represents the likelihood of the predicted relation, and the calculation formula of p is as follows:

p(r | B) = exp(o_r) / Σ_k exp(o_k)

wherein S describes the corresponding network security emergency response entity pair, M is a previously defined relation vector matrix, b represents an offset vector, and o, the score vector of the corresponding relations for the entity pair, is calculated as follows:

o = M S + b
Compared with the prior art, the method reduces the influence of noisy data on distant supervision, and the deep residual structure better extracts the deep semantic features of sentences, so that entity relations are extracted more accurately from network security emergency response texts, laying a firm foundation for subsequently building the network security emergency response knowledge graph.
Drawings
Fig. 1 is a frame diagram of a relationship extraction method of a network security emergency response knowledge-graph according to an embodiment.
Fig. 2 is a schematic diagram of a position code according to an embodiment.
FIG. 3 is a schematic diagram of a Word2vec model according to an embodiment.
FIG. 4 is a schematic diagram of a Skip-Gram model according to an embodiment.
Figure 5 is a graph of the trend of AUC results under cnn, pcnn and JRpcnn neural network models, in accordance with embodiments.
FIG. 6 is a graph of the trend of AUC results under the AVE, ONE and MIT mechanisms according to the embodiments.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
Referring to fig. 1 to 6 together, the knowledge graph is a new idea for analyzing and processing data in network security emergency response. To better associate network security emergency response data when constructing a knowledge graph, this embodiment provides a relation extraction method for the network security emergency response knowledge graph. The method can accurately and quickly extract the relations between entities in network security emergency response text and helps establish the network security emergency response knowledge graph more quickly; it can be used by network companies to build their own network security emergency response knowledge bases.
As shown in fig. 1, the main structure of the method is as follows:
the input to the network is the original network security emergency response text. When using neural networks, we typically convert word tokens into low-dimensional vectors. In this embodiment, each input word token is converted to a vector by looking up the pre-trained word embedding. Furthermore, we use location features to designate each entity pair and represent these location features with a location vector.
For a given sentence S = {w_1, w_2, ..., w_n} consisting of n words, each word is mapped to a low-dimensional real-valued vector space using the word2vec (word vector generation) model. The general structure of the word2vec model is shown in fig. 3: word2vec takes the One-Hot coding of a word as input and obtains the output through a hidden layer of a neural network. The word2vec model is trained in the Skip-Gram manner (predicting context words from the central word), the general flow of which is shown in fig. 4: given a word in a text, the neural network is used to predict the words adjacent to it.
The sentence is then processed into word vectors, finally obtaining the vector representation of each word in the sentence and forming a word vector query matrix Wv. Each input training sequence can be mapped through the query matrix Wv to obtain the corresponding real-valued vectors q = {q_1, q_2, ..., q_n}.
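As an illustrative sketch of this lookup step (the vocabulary, vector values and dimension here are made up for demonstration, not the patent's trained word2vec weights):

```python
import random

def build_word_vectors(vocab, dim, seed=0):
    # Hypothetical stand-in for the pre-trained word2vec query matrix Wv:
    # every vocabulary word maps to a dim-dimensional real-valued vector.
    rng = random.Random(seed)
    return {w: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for w in vocab}

def sentence_to_vectors(tokens, word_vectors):
    # Map each token of the sentence through the query matrix,
    # mirroring the lookup that yields q = {q1, ..., qn}.
    return [word_vectors[t] for t in tokens]

tokens = ["jack", "finds", "chrome", "has", "xss", "vulnerability"]
wv = build_word_vectors(tokens, dim=4)
q = sentence_to_vectors(tokens, wv)
```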
In the relation extraction task for the network security emergency response knowledge graph, the emphasis is placed on finding the relations between entity pairs. Generally, the words near the entities say more about the relation between them, so the position of each word in the sentence relative to the two entities is also important in relation extraction. Pf is defined here as the combination of the relative distances from the current word to entity e1 and entity e2. Two position embedding matrices PF1 and PF2 are randomly initialized, and the relative distances are then converted into real-valued vectors by looking up the position embedding matrices. For example, consider the vectorized representation of "Jack finds chrome has xss vulnerability" (Jack finds that the Google browser has an xss hole) shown in fig. 2, where "chrome" and "xss" in the sentence correspond to entity e1 and entity e2 respectively. Then, the distance from "Jack" to "chrome" is 2, and the distance from "vulnerability" to "xss" is -1.
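The relative-distance convention in this example can be sketched as follows (the clipping threshold is an assumed hyperparameter, not stated in the patent):

```python
def relative_positions(n_tokens, e1_idx, e2_idx):
    # Distance convention taken from the worked example: "Jack" (index 0)
    # to "chrome" (index 2) gives 2, and "vulnerability" (index 5) to
    # "xss" (index 4) gives -1, i.e. distance = entity index - word index.
    return [(e1_idx - i, e2_idx - i) for i in range(n_tokens)]

def clip_distance(d, max_dist=30):
    # Relative distances are clipped before the PF1/PF2 lookup so the
    # randomly initialized embedding matrices stay finite in size
    # (max_dist is an assumed value, not from the patent).
    return max(-max_dist, min(max_dist, d))

# "Jack finds chrome has xss vulnerability": e1 = "chrome", e2 = "xss"
pf = relative_positions(6, e1_idx=2, e2_idx=4)
```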
In sentence position vectorization, if the word vector dimension is dw and the position vector dimension is dp, then the sentence vector dimension is:

d = dw + 2*dp

The word vectors and the position vectors are combined to form a sentence vector Q of size s x d, and the sentence vector Q is then fed to the convolution part.
In the relationship extraction in the network security emergency response knowledge graph establishing process, all local features are required to be utilized, and the prediction is carried out in a global scope. When using neural networks, the convolution method is the best way to combine all these features.
The embodiment designs a residual segmented neural network block for extracting semantic information of a network security emergency response knowledge sentence, wherein two convolutional layers form a residual block, and an activation function Relu is used for nonlinear mapping after each convolutional layer.
Convolution is an operation between a convolution kernel W and the input vector sequence q, where W is a weight matrix used as the convolution filter; in the example shown in fig. 1, the size of the convolution kernel is set to w (w = 3). This embodiment defines Q as the sequence {q_1, q_2, ..., q_s}, where each q_i is a real-valued vector; in general, q_(j-w+1:j) refers to the concatenation of q_(j-w+1) to q_j.
Convolution takes the dot product of the convolution kernel W with the sequence q to obtain another sequence c:

c_j = W · q_(j-w+1:j)

where the index j ranges from 1 to s + w - 1. To capture different features, multiple convolution kernels generally need to be used; assuming n convolution kernels W = {W_1, W_2, ..., W_n}, the convolution operation can be expressed as:

c_ij = W_i · q_(j-w+1:j),  1 <= i <= n,  1 <= j <= s + w - 1

wherein i and j are the indices of c_ij, n is the number of weight matrices, s is the dimension of the sentence vector, and w is the dimension of the convolution kernel.
The result of one pass of convolution is a matrix C = [c_1, c_2, ..., c_n] of size n x (s + w - 1).
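A minimal sketch of this convolution for one kernel, assuming zero values for positions outside the sentence (which is what makes the output length s + w - 1):

```python
def conv1d(W, Q, w):
    # One kernel W (a flat weight vector over a window of w token vectors)
    # slid across the sentence matrix Q (s rows of d values). Positions
    # outside the sentence are zero-filled, so the output has length
    # s + w - 1, matching c_j = W . q_(j-w+1:j).
    s, d = len(Q), len(Q[0])
    out = []
    for j in range(1, s + w):                # j = 1 .. s + w - 1
        window = []
        for k in range(j - w + 1, j + 1):    # q_(j-w+1) .. q_j
            window += Q[k - 1] if 1 <= k <= s else [0.0] * d
        out.append(sum(a * b for a, b in zip(W, window)))
    return out
```

Running n kernels and stacking their outputs row by row yields the n x (s + w - 1) matrix C described above.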
The sizes of all convolution kernels in the residual piecewise convolutional network are w, and to ensure that the size of the newly generated feature matrix is the same as that of the original feature matrix, a boundary filling operation is adopted. The convolution kernels of the two convolution layers are W1 and W2. The result obtained after the first convolution layer of the residual block is:

C1 = Relu(W1 · q + b1)

The result obtained after the second convolution layer is:

C2 = Relu(W2 · C1 + b2)

wherein b1 and b2 are the offset vectors; the output vector of the residual convolution block is C = F(x) + x, where F(x) is the output result of the second convolutional layer and x is the input of the first convolutional layer.
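A toy sketch of the residual block on 1-D inputs, assuming zero boundary filling and Relu after each layer as described (the kernel values in the test are illustrative only):

```python
def relu(v):
    return [max(0.0, x) for x in v]

def conv_same(x, kernel):
    # "Same"-size 1-D convolution: boundary (zero) filling keeps the
    # output length equal to the input length, which the residual
    # addition C = F(x) + x requires.
    w = len(kernel)
    pad = w // 2
    p = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[k] * p[i + k] for k in range(w)) for i in range(len(x))]

def residual_block(x, k1, k2):
    # Two convolution layers, each followed by Relu, then the skip
    # connection: C = F(x) + x with F(x) the second layer's output.
    f = relu(conv_same(relu(conv_same(x, k1)), k2))
    return [a + b for a, b in zip(f, x)]
```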
After the convolutional layer obtains semantic features, the most representative local features are further extracted through the pooling layer, and to obtain feature information of different sentence structures, a piecewise max pooling process is adopted. As shown in fig. 1, the output c_i of each convolution kernel is divided into 3 parts by the two entities, so the pooled output for each kernel is a 3-dimensional vector:

p_i = {p_i1, p_i2, p_i3}

Piecewise max pooling takes the maximum value of each part:

p_ij = max(c_ij),  1 <= i <= n,  j = 1, 2, 3

For the output of each convolution kernel of the pooling layer we can thus obtain a 3-dimensional vector p_i, and the piecewise pooling outputs of all convolution kernels are then concatenated into p_(1:n). The output of the nonlinear function is then:

g = tanh(p_(1:n))

wherein p_(1:n) is the concatenation of p_1 to p_n, and the tanh() function is an activation function in the neural network.
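The piecewise pooling and tanh step can be sketched as follows (entity positions and channel values are illustrative):

```python
import math

def piecewise_max_pool(c, e1_pos, e2_pos):
    # Split one kernel's output c into the three segments delimited by
    # the two entity positions and keep the max of each:
    # p_i = {p_i1, p_i2, p_i3}.
    segments = [c[:e1_pos + 1], c[e1_pos + 1:e2_pos + 1], c[e2_pos + 1:]]
    return [max(seg) for seg in segments]

def sentence_feature(channels, e1_pos, e2_pos):
    # Concatenate every kernel's 3-d pooled vector into p_(1:n)
    # (length 3n) and apply the tanh nonlinearity to obtain g.
    p = [v for c in channels for v in piecewise_max_pool(c, e1_pos, e2_pos)]
    return [math.tanh(v) for v in p]
```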
In the relation extraction method of the network security emergency response knowledge graph of this embodiment, sentence-level attention is built over multiple instances; a vector is set for an instance set B = {g_1, g_2, ..., g_m}, where the instance set describes a corresponding network security emergency response entity pair (e1, e2) and g_i represents the output of the neural network.
The degree of correlation between each instance vector g_i and the relation r is then computed. To reduce the influence of meaningless data and make full use of the semantic information contained in each instance in the set, the calculation formula of the instance set vector S is given as:

S = Σ_i α_i g_i

The calculation of the instance set vector S will thus depend on each instance in the set, wherein α_i is the weight of the input instance vector g_i, used to measure its correlation with the corresponding relation r. The calculation formula of α_i is as follows:

α_i = exp(e_i) / Σ_k exp(e_k)

where e_i = q(g_i, r) is a basic query function that represents the degree of match between the output vector g_i and the predicted relation r.
After the value of S is calculated, the likelihood of the predicted relation can be calculated; p represents the likelihood of the predicted relation, and the calculation formula of p is as follows:

p(r | B) = exp(o_r) / Σ_k exp(o_k)

wherein S describes the corresponding network security emergency response entity pair, M is a previously defined relation vector matrix, b represents an offset vector, and o, the score vector of the corresponding relations for the entity pair, is calculated as follows:

o = M S + b
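A sketch of the attention pooling and prediction, assuming the "basic query function" is a dot product between the instance vector and a relation query vector (the text does not spell out its exact form):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_set_vector(instances, r_query):
    # Instance-set vector S = sum_i alpha_i * g_i, with alpha_i a softmax
    # over e_i = g_i . r_query (the dot product stands in for the
    # "basic query function" q(g_i, r)).
    scores = [sum(a * b for a, b in zip(g, r_query)) for g in instances]
    alphas = softmax(scores)
    dim = len(instances[0])
    return [sum(alphas[i] * instances[i][d] for i in range(len(instances)))
            for d in range(dim)]

def relation_probs(S, M, b):
    # o = M . S + b scored per relation, then a softmax gives the
    # predicted likelihood p of each relation for the entity pair.
    o = [sum(m * s for m, s in zip(row, S)) + bias for row, bias in zip(M, b)]
    return softmax(o)
```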
in the embodiment of the present invention, the network framework shown in fig. 1 needs to be trained in advance, and the specific details of the training phase are as follows:
the dataset used in this training is the Comprehensive, Multi-Source Cyber-Security Events dataset. The data set is obtained from various websites and various vulnerability databases on the network, wherein the data set comprises network text data such as network security and vulnerability information.
All network models in this embodiment are trained on Comprehensive, Multi-Source Cyber-Security Events (Comprehensive Multi-Source Cyber-Security) datasets.
To train the network model, the objective function is defined herein with cross-entropy loss.
The set of candidate dimensions for the word vector input to the network is {50, 60, ..., 300}, and the set of candidate dimensions for the position vector input is {1, 2, ..., 10}.
Where the input window of the convolutional network is 3 in size and the hidden layer is 230 in size.
During network model training, the Adam optimizer is used with the default momentum settings β1 = 0.9 and β2 = 0.999. The network model is first trained iteratively 60 times at a learning rate of 0.01, then 60 times at a learning rate of 0.001, and then 60 times at a learning rate of 0.0001. The set of candidate batch sizes processed in one iteration is {40, 160, 640, 1280}. To prevent model overfitting, the dropout (random discard) method is employed, with a dropout rate of 0.5.
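The staged learning-rate schedule described above can be written out as a plain function (counting iterations from 0 is an assumption):

```python
def learning_rate(iteration):
    # Staged schedule from the training description: 60 iterations at
    # 0.01, then 60 at 0.001, then 0.0001 thereafter.
    if iteration < 60:
        return 0.01
    if iteration < 120:
        return 0.001
    return 0.0001
```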
In order to verify the performance of the method, based on the above embodiment, this embodiment tests the model on the Comprehensive, Multi-Source Cyber-Security Events dataset in combination with the online emergency response processing method. The model was evaluated using the commonly used precision-recall (P-R) curve, AUC values and mean precision (P@N).
The experimental comparison of the model in this embodiment is carried out in two main aspects. The first compares cnn algorithms of different capability, including the traditional cnn (convolutional neural network), pcnn (piecewise convolutional neural network) and JRpcnn (residual piecewise convolutional neural network). The second examines how cnn/pcnn/JRpcnn use an attention mechanism to boost the model: three different attention mechanisms are tested, namely AVE (average attention), ONE (single-instance attention) and MIT (multi-instance attention). The three attention mechanisms are tested on the different networks in the test stage and the results are observed.
As can be seen from fig. 5, the value of AUC using JRpcnn is the largest using the same attention mechanism.
The relation extraction accuracy of JRpcnn-MIT on the network security emergency response dataset is the highest, reaching 34.6%, better than the results obtained by other models; its AUC value also reaches the highest, 12.8%, likewise better than the other models.
The experimental results show that, compared with other model methods on the network security dataset, the method of this embodiment achieves better results: the accuracy of relation extraction is higher, and the deep semantic information of sentences is better extracted. Introducing the multi-instance attention mechanism effectively reduces redundant data in distant supervised learning. Applying JRpcnn-MIT to construct a network security emergency response knowledge graph in a specific network security scenario can then further strengthen the capability of network security emergency response.
It should be noted that, although the above embodiments have been described herein, the invention is not limited thereto. Therefore, based on the innovative concepts of the present invention, the technical solutions of the present invention can be directly or indirectly applied to other related technical fields by making changes and modifications to the embodiments described herein, or by using equivalent structures or equivalent processes performed in the content of the present specification and the attached drawings, which are included in the scope of the present invention.

Claims (6)

1. The method for extracting the relation of the network security emergency response knowledge graph is characterized by comprising the following steps of:
s1: giving a network security response knowledge text;
s2: vectorizing the knowledge text data, namely extracting the words in the network security response knowledge text and mapping each word to a K-dimensional word vector;
s3: extracting the position vector corresponding to each vocabulary vector by a position-vector mapping method, namely extracting the relative distances between the current word and the entities e1 and e2, and converting them into a vector representation through embedding;
s4: extracting semantic features of sentences by adopting a residual segmented convolutional neural network JRpcnn to form feature vectors, namely using the vocabulary vectors obtained in the steps S2 and S3 and the position vectors corresponding to the vocabulary vectors as the input of the residual segmented convolutional neural network JRpcnn;
s5: the feature vectors derived in step S4 are further processed using the multiple instance attention mechanism MIT and entity relationships are obtained.
2. The method for extracting the relationship of the network security emergency response knowledge-graph according to claim 1, wherein the step S2: vectorizing the knowledge text data, namely extracting vocabularies in the network security response knowledge text, and mapping the vocabularies to a K-dimensional vocabulary vector, comprises the following steps:
s201: inputting an original network security emergency response knowledge text, and converting each input word mark into a vector by searching a pre-trained word embedding;
s202: for a given sentence consisting of n words, each word is mapped into a low-dimensional real-valued vector space using a word2vec model; word2vec takes the One-Hot encoding of a word as input and produces the output through the hidden layer of a neural network;
s203: then performing word-vector processing on the sentence, finally obtaining the vector representation of each word in the sentence and forming a word-vector query matrix; each input training sequence is mapped through the word-vector query matrix to obtain the corresponding vocabulary vectors.
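A minimal NumPy sketch of the lookup described in steps S201–S203; the toy vocabulary, the dimension K and the random initialization are illustrative stand-ins for a pre-trained word2vec embedding, not part of the patent:

```python
import numpy as np

# Step S2 (claim 2), sketched: map each word of a sentence to a K-dimensional
# vector via a word-vector query (lookup) matrix. Vocabulary, K and the random
# initialization are assumptions standing in for pre-trained word2vec vectors.
K = 8                                         # word-vector dimension (assumed)
vocab = {"attacker": 0, "exploits": 1, "vulnerability": 2}
rng = np.random.default_rng(0)
W_lookup = rng.normal(size=(len(vocab), K))   # word-vector query matrix

def embed(sentence):
    """Return the (n, K) matrix of word vectors for a tokenized sentence."""
    ids = [vocab[w] for w in sentence]
    return W_lookup[ids]                      # integer-array indexing = lookup

vectors = embed(["attacker", "exploits", "vulnerability"])
print(vectors.shape)   # (3, 8): one K-dimensional vector per word
```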
3. The method for extracting the relationship of the network security emergency response knowledge-graph according to claim 2, wherein the step S3: the method for extracting the position vector corresponding to the vocabulary vector by adopting the position vector mapping representation method, namely converting the relative distance between the current word entity and the entity into vector representation by embedding comprises the following steps:
s301: pf is defined as the relative distances from the current word to the entities e1 and e2; two position-embedding matrices are randomly initialized, and the relative distances are converted into position vectors by looking them up in the position-embedding matrices;
s302: in sentence position vectorization, if the dimension of the vocabulary vector is dw and the dimension of the position vector is dp, then the dimension of the sentence vector is d = dw + 2dp; combining the vocabulary vectors with the position vectors forms the sentence vector Q, which is then fed to the convolution part.
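Steps S301–S302 can be sketched as follows; the sizes, the distance-clipping range and the position-matrix names are illustrative assumptions, chosen only to show the dimension d = dw + 2dp:

```python
import numpy as np

# Step S3 (claim 3), sketched: for each word, the relative distances to
# entity e1 and entity e2 are looked up in two randomly initialized
# position-embedding matrices, and each word vector is concatenated with
# both position vectors. All sizes below are assumptions.
dw, dp, max_dist = 8, 3, 10
rng = np.random.default_rng(1)
PF1 = rng.normal(size=(2 * max_dist + 1, dp))  # distances -10..10 w.r.t. e1
PF2 = rng.normal(size=(2 * max_dist + 1, dp))  # distances -10..10 w.r.t. e2

def sentence_vector(word_vecs, e1_pos, e2_pos):
    """Build the sentence vector Q: one row of dimension dw + 2*dp per word."""
    rows = []
    for j in range(word_vecs.shape[0]):
        d1 = np.clip(j - e1_pos, -max_dist, max_dist) + max_dist  # shift to index
        d2 = np.clip(j - e2_pos, -max_dist, max_dist) + max_dist
        rows.append(np.concatenate([word_vecs[j], PF1[d1], PF2[d2]]))
    return np.stack(rows)

Q = sentence_vector(rng.normal(size=(5, dw)), e1_pos=0, e2_pos=3)
print(Q.shape)   # (5, 14): d = dw + 2*dp = 8 + 2*3
```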
4. The method for extracting the relationship of the network security emergency response knowledge-graph according to claim 3, wherein the step S4: extracting semantic features of sentences by adopting a residual segmented convolutional neural network JRpcnn to form feature vectors, namely using the vocabulary vectors obtained in the steps S2 and S3 and the position vectors corresponding to the vocabulary vectors as the input of the residual segmented convolutional neural network JRpcnn, and comprising the following steps:
s401: extracting semantic information of a network security emergency response knowledge text by adopting a residual error segmented convolutional neural network JRpcnn, wherein two convolutional layers form a residual error block, and an activation function Relu is adopted for nonlinear mapping after each convolutional layer;
s402: convolution is an operation between a convolution kernel W and an input vector q sequence, where the convolution kernel W is a weight matrix and the convolution kernel W is used as a convolution filter, and the convolution operation can be expressed as:
c_ij = W_i ⊗ q_(j−w+1:j),
wherein 1 ≤ i ≤ n and 1 ≤ j ≤ s + w − 1; n is the number of convolution kernels, s is the dimensionality of the sentence vector, w is the dimensionality of the convolution kernels, and q_(j−w+1:j) represents the concatenation of q_(j−w+1) to q_j;
the result of one convolution yields a matrix C = [c_1, c_2, …, c_n] of size n × (s + w − 1).
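The convolution of step S402 can be sketched directly from its index ranges; the kernel count, width and dimensions below are illustrative assumptions:

```python
import numpy as np

# Step S402 (claim 4), sketched: c[i, j] = W_i . q_(j-w+1:j) between n kernels
# of width w and the sentence-vector sequence q, with boundary padding so the
# result is an n x (s + w - 1) matrix. All sizes are assumptions.
n, w, d, s = 4, 3, 14, 5                  # kernels, width, vector dim, length
rng = np.random.default_rng(2)
kernels = rng.normal(size=(n, w, d))      # convolution kernels W_i
q = rng.normal(size=(s, d))               # sentence vector sequence

def convolve(q, kernels):
    s, d = q.shape
    n, w, _ = kernels.shape
    # zero-pad w-1 positions on each side (boundary filling)
    padded = np.vstack([np.zeros((w - 1, d)), q, np.zeros((w - 1, d))])
    out = np.empty((n, s + w - 1))
    for i in range(n):
        for j in range(s + w - 1):
            # inner product of kernel i with the window q_(j-w+1:j)
            out[i, j] = np.sum(kernels[i] * padded[j:j + w])
    return out

C = convolve(q, kernels)
print(C.shape)   # (4, 7) = (n, s + w - 1)
```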
5. the method for extracting the relationship of the network security emergency response knowledge-graph according to claim 4, further comprising the steps of:
s403: the dimensionality of all convolution kernels in the residual segmented convolutional neural network is w, and boundary filling (padding) is adopted; the kernels of the two convolution layers are W_1 and W_2; as known from step S401, the result obtained after the first convolution layer of the residual block is C_1 = ReLU(W_1 ⊗ x + b_1), and the result obtained after the second convolution layer is f(x) = W_2 ⊗ C_1 + b_2, wherein b_1 and b_2 are offset vectors; the output vector of the residual convolution block is C = f(x) + x, where f(x) is the output result of the second convolution layer and x is the input of the first convolution layer;
s404: after the convolution layers obtain the semantic features, the most representative local features are further extracted through a pooling layer, adopting a segmented maximum pooling process with the formula:
p_ij = max(c_ij), 1 ≤ i ≤ n, 1 ≤ j ≤ 3,
where the output c_i of each convolution kernel is divided into three segments {c_i1, c_i2, c_i3} by the positions of the two entities; for the output of each convolution kernel the pooling layer thus yields a 3-dimensional vector p_i = {p_i1, p_i2, p_i3}; the segmented pooling outputs of all convolution kernels are then concatenated into p_(1:n), and the nonlinear output is:
g = tanh(p_(1:n)),
wherein p_(1:n) is the concatenation of p_1 to p_n, and the tanh() function is an activation function in the neural network.
6. The relationship extraction method of the network security emergency response knowledge-graph of claim 5, wherein: step S5: further processing the feature vectors obtained in step S4 using the multiple instance attention mechanism MIT, and obtaining entity relationships comprises the steps of:
s501: for an instance set vector G = {g_1, g_2, …, g_m}, the instance set vector describes a corresponding network security emergency response entity pair (e1, e2), wherein each g_i represents the output of the neural network;
s502: computing the degree of correlation between each instance vector g_i and the relation r, and then calculating the instance set vector S, with the formula:
S = Σ_i α_i g_i;
the computation of the instance set vector S therefore depends on each instance in the set;
wherein α_i is the weight of the input instance vector g_i, used for measuring its correlation with the corresponding relation r, and the calculation formula of α_i is as follows:
α_i = exp(ω_i) / Σ_k exp(ω_k),
where ω_i is a query-based function representing the matching degree between the output vector g_i and the predicted relation r;
s503: after the value of the instance set vector S has been calculated, the likelihood of the predicted relation is computed; p represents this likelihood and is calculated as follows:
p(r | S) = exp(o_r) / Σ_k exp(o_k),
wherein S describes a corresponding network security emergency response entity pair, r is the previously defined relation vector, b represents an offset vector, and o_r is the score that the entity pair corresponds to the relation r, with the calculation formula:
o_r = r · S + b.
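Step S5 can be sketched in the spirit of multi-instance selective attention; the bilinear matching form ω_i = g_i · A · r and the per-relation score R·S + b are assumptions based on common selective-attention formulations, since the patent's exact formula images are not recoverable:

```python
import numpy as np

# Step S5 (claim 6), sketched: attention weights alpha_i over the instance
# vectors g_i, instance-set vector S = sum_i alpha_i * g_i, and relation
# probabilities from a softmax over per-relation scores. A, r, R and b are
# illustrative assumptions, randomly initialized for the sketch.
rng = np.random.default_rng(4)
m, dim, n_rel = 3, 12, 5                 # instances, feature dim, relations
G = rng.normal(size=(m, dim))            # instance vectors g_i (NN outputs)
A = np.eye(dim)                          # attention bilinear matrix (assumed)
r = rng.normal(size=dim)                 # query vector of candidate relation
R = rng.normal(size=(n_rel, dim))        # relation score matrix (assumed)
b = np.zeros(n_rel)                      # offset vector

def softmax(x):
    e = np.exp(x - x.max())              # subtract max for numerical stability
    return e / e.sum()

omega = G @ A @ r                        # matching degree of each instance
alpha = softmax(omega)                   # attention weight per instance
S = alpha @ G                            # instance-set vector (weighted sum)
p = softmax(R @ S + b)                   # probability of each relation

print(alpha.shape, p.shape)   # (3,) (5,)
```

Instances that match the queried relation more strongly receive larger weights, which is how the multi-instance mechanism down-weights the noisy sentences produced by distant supervision.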
CN202210184821.XA 2022-02-28 2022-02-28 Relation extraction method of network security emergency response knowledge graph Pending CN114254130A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210184821.XA CN114254130A (en) 2022-02-28 2022-02-28 Relation extraction method of network security emergency response knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210184821.XA CN114254130A (en) 2022-02-28 2022-02-28 Relation extraction method of network security emergency response knowledge graph

Publications (1)

Publication Number Publication Date
CN114254130A true CN114254130A (en) 2022-03-29

Family

ID=80800004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210184821.XA Pending CN114254130A (en) 2022-02-28 2022-02-28 Relation extraction method of network security emergency response knowledge graph

Country Status (1)

Country Link
CN (1) CN114254130A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115622806A (en) * 2022-12-06 2023-01-17 南京众智维信息科技有限公司 Network intrusion detection method based on BERT-CGAN

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619121A (en) * 2019-09-18 2019-12-27 江南大学 Entity relation extraction method based on improved depth residual error network and attention mechanism
CN111241303A (en) * 2020-01-16 2020-06-05 东方红卫星移动通信有限公司 Remote supervision relation extraction method for large-scale unstructured text data
CN112989048A (en) * 2021-03-29 2021-06-18 华南理工大学 Network security domain relation extraction method based on dense connection convolution


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG, XIAOXIA et al.: "Relation extraction model based on attention and graph convolutional network", Journal of Computer Applications, vol. 41, no. 2, 31 December 2021 (2021-12-31), page 352 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220329)