CN116340540A - Method for generating network security emergency response knowledge graph based on text - Google Patents
- Publication number
- CN116340540A (application number CN202310316305.2A)
- Authority
- CN
- China
- Prior art keywords
- node
- edges
- text
- generating
- nodes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
Abstract
The invention discloses a method for generating a network security emergency response knowledge graph from text. The method comprises the following steps: first, text nodes and node features are generated through a pre-trained encoder-decoder language model; then, query-node features are output from the input text and a set of learnable node queries, and node text is generated through an LSTM; next, the text-node features and query-node features are fused to obtain the final node features; edges are then generated and fused in two ways, generation and classification; finally, the network is trained through an aggregated loss function and a sparse adjacency matrix. The network security knowledge graph generated by the method has extremely high applicability and accuracy.
Description
Technical Field
The invention belongs to the field of network security knowledge graphs.
Background
Network security emergency response refers to using internally stored security knowledge to prepare for possible threats and to take corresponding measures after a threat occurs. As emerging threats grow in complexity, traditional passive network security defenses have struggled to keep up. The field is therefore innovating, and the standards demanded of emergency command capability and efficiency in different situations keep rising. To address this problem, the research community has proposed applying knowledge graphs to network security: the knowledge graph is a new way of analyzing and processing network security data, from which a network security emergency response knowledge graph is generated. With such a graph, security emergency personnel can quickly identify and analyze security events and learn the required emergency response workflows, tools and techniques, improving the efficiency of security emergency response. The network security emergency response knowledge graph is a data-driven, highly computable tool. Personnel engaged in network security work can intuitively see the relationships between network security entities through the graph, such as the exploitation relationship between malware and vulnerabilities, the adversarial relationship between attackers and security protection equipment, and the relationship between systems and vulnerabilities, and can thus handle network security problems better. Because the quality of the graph plays a decisive role in every subsequent application built on it, including tracing network attacks through the graph, how to generate an accurate network security knowledge graph has become a popular research topic.
Disclosure of Invention
The invention aims to construct a high-performance knowledge graph from text. Traditional knowledge-graph construction methods are time-consuming and labor-intensive, because even ordinary knowledge-graph nodes must be generated and checked at high labor cost; a knowledge graph contains hundreds of such nodes, and the repeated manual operations make construction efficiency very low. Rapidly constructing the network security knowledge graph has therefore become essential. The invention constructs the network security knowledge graph by a novel method that overcomes the various shortcomings of traditional methods, and the graph so constructed has extremely high applicability and accuracy. The invention introduces a novel method for generating a network security emergency response knowledge graph from text: first, text nodes and node features are generated through a pre-trained encoder-decoder language model; then, query-node features are output from the input text and a set of learnable node queries, and node text is generated through an LSTM; next, the text-node features and query-node features are fused to obtain the final node features, and edges are generated and fused in two ways, generation and classification; finally, the network is trained through an aggregated loss function and a sparse adjacency matrix. The network security knowledge graph generated by the method has extremely high applicability and accuracy.
Technical means
The invention aims to construct a high-performance knowledge graph from text; traditional knowledge-graph construction methods are time-consuming and labor-intensive. Even ordinary knowledge-graph nodes require high labor cost for generation and verification, such nodes usually number in the millions in large-scale knowledge graphs, and in conventional methods a great deal of labor is spent on repeated operations, so the efficiency of building the knowledge graph is very low; rapidly constructing the network security knowledge graph has therefore become important. The network security knowledge graph is constructed by the novel method, overcoming the various shortcomings of traditional methods, and the graph constructed by the method has extremely high applicability and accuracy. The method comprises the following steps:
S1, generating text nodes and node features through a pre-trained encoder-decoder language model.
S2, outputting query-node features from the input text and a set of learnable node queries, and then generating node text through an LSTM.
S3, fusing the text-node features and the query-node features to obtain the final node features.
S4, generating and fusing edges in two ways, generation and classification.
S5, training the network through the aggregated loss function and the sparse adjacency matrix.
As a preferred mode of the present invention, the step S1 includes the steps of:
S101, node generation is formulated as a sequence-to-sequence problem using a pre-trained encoder-decoder language model: the system is fine-tuned to convert the text input into a sequence of nodes separated by special tags, <PAD> NODE1 <NODE_SEP> NODE2 … </s>, where NODEi represents one or more words;
S102, this module generates the nodes and provides node features for the edge-generation task. Each node may be associated with several node features; node boundaries are delimited by the separator tag <NODE_SEP>, the string is produced by greedy decoding, and the hidden states of the decoder's last layer are mean-pooled. We pre-fix the number of generated nodes and fill missing slots with the special <NO_NODE> token.
As a preferred mode of the present invention, the step S2 includes the steps of:
S201, the decoder receives as input a set of learnable node queries and represents them as an embedded feature matrix F_n ∈ R^(N×d); because no causal mask is used, the network processes all queries simultaneously and the decoder output can be read directly. Here N is the number of nodes and d is the node feature dimension. The features are passed to the prediction-head LSTM and decoded into node logits L_n ∈ R^(N×S×V), where S is the length of the generated node sequence and V is the vocabulary size;
S202, in order to prevent the network from memorizing a specific target-node order, the logits and features are permuted as
L′_n(s) = L_n(s)P,  F′_n = F_n P,
where s = 1, …, S and P ∈ {0,1}^(N×N) is a permutation matrix obtained by bipartite matching between the target nodes and the greedily decoded nodes, using cross-entropy loss as the matching cost function. The node features F′_n processed by the permutation matrix are now target-aligned.
As a preferred mode of the present invention, the step S3 includes the steps of:
S301, in order to make full use of the text-node and query-node features, a node fusion module is designed: the features obtained in the previous two steps are concatenated, fused through a residual block, and the important information in the features is then extracted by a self-attention module;
S302, features are enhanced through an atrous spatial pyramid pooling module and a convolutional block attention module, compressed through a 5×3 3D convolution, and the compressed important feature information is attended to through a channel attention module;
S303, in order to remove redundant information from the generated node features, the redundancy is predicted with a simple encoder-decoder structure and then subtracted from the original information;
S304, the similarity between node features is measured by dot product and mapped into the interval (0, 1) by a softmax function; if the similarity is greater than Y, one node of the pair is removed at random to control the number of redundant nodes. Experiments show that the best effect is obtained at Y = 0.7; after the redundant nodes are deleted, the final node features are obtained.
As a preferred mode of the present invention, the step S4 includes the steps of:
S401, the node features of the previous step are then used in this module to generate edges: given a pair of node features, a prediction head decides whether an edge exists between the corresponding nodes; edges are generated in two ways, and the edges produced by the two ways are then fused;
S402, first, edges are generated as a token sequence using an LSTM. The advantage of generation is that any edge sequence can be constructed, including edge sequences unseen during training, but there is a risk of not exactly matching the target edge-token sequence;
S403, a classification head is then used to predict edges. If the set of possible relations is fixed and known, the classification head is more efficient and accurate, but if training coverage of all possible edges is limited, the system may misclassify during inference;
S404, the edges are concatenated pairwise and their features fused through a dense layer. A trained scoring network evaluates the confidence of each fused edge, and a fused edge is retained if its confidence is greater than 0.5.
As a preferred mode of the present invention, the step S5 includes:
S501, because the presence of an edge must be checked between every pair of nodes, up to N^2 edges are generated and predicted, where N is the number of nodes. Some computation is saved by omitting self-loops and omitting nodes attached to the special <NO_NODE> tag. When there is no edge between two nodes, it is represented by the special token <NO_EDGE>;
S502, a novel focal loss, denoted by the symbol F, is proposed; its main idea is to reduce the cross-entropy loss of well-classified <NO_EDGE> samples and to increase the loss for misclassifications, as follows:
F = -(1 - p_t)^γ · log(p_t),
where γ is a weighting factor; when γ = 0 the loss equals the cross entropy. p is the probability of a single edge, t is the target class, and p_t denotes the probability of the target class;
S503, a modification of the training setup is proposed: most <NO_EDGE> edges are removed by sparsifying the adjacency matrix, keeping all actual edges but only a randomly selected subset of <NO_EDGE> edges. This modification improves accuracy by 10–20%, and training with sparse edges shortens training time by 10%;
S504, after the sparse-adjacency-matrix operation of the previous step, some actual edges are removed and some edges are randomly replaced to strengthen the robustness of the model; this modification improves accuracy by 5–10%.
Drawings
FIG. 1 is an overall flow chart of an embodiment of the present invention.
Fig. 2 is a diagram of text-node generation in accordance with an embodiment of the present invention.
Fig. 3 is a diagram of query-node generation in accordance with an embodiment of the present invention.
Detailed Description
In order to describe the technical content, structural features, objects and effects of the technical solution in detail, the following description is given with reference to specific embodiments and the accompanying drawings.
Referring to fig. 1, as shown in the drawing, the present embodiment provides a method for generating a network security emergency response knowledge graph based on text, which includes the following steps:
S1, generating text nodes and node features through a pre-trained encoder-decoder language model. See fig. 2.
S2, outputting query-node features from the input text and a set of learnable node queries, and then generating node text through an LSTM. See fig. 3.
S3, fusing the text-node features and the query-node features to obtain the final node features.
S4, generating and fusing edges in two ways, generation and classification.
S5, training the network through the aggregated loss function and the sparse adjacency matrix.
In the above embodiment, S1 further includes the steps of:
S101, node generation is formulated as a sequence-to-sequence problem using a pre-trained encoder-decoder language model: the system is fine-tuned to convert the text input into a sequence of nodes separated by special tags, <PAD> NODE1 <NODE_SEP> NODE2 … </s>, where NODEi represents one or more words;
S102, this module generates the nodes and provides node features for the edge-generation task. Each node may be associated with several node features; node boundaries are delimited by the separator tag <NODE_SEP>, the string is produced by greedy decoding, and the hidden states of the decoder's last layer are mean-pooled. We pre-fix the number of generated nodes and fill missing slots with the special <NO_NODE> token.
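The node-sequence post-processing of steps S101 and S102 can be sketched as follows. The special tokens (<NODE_SEP>, <NO_NODE>, </s>) are from the patent; the example node texts, token-span boundaries, and toy hidden-state values are illustrative assumptions, since the patent does not publish its model weights or tokenization:

```python
import numpy as np

NODE_SEP, NO_NODE, EOS = "<NODE_SEP>", "<NO_NODE>", "</s>"

def parse_node_sequence(decoded: str, max_nodes: int):
    """Split a greedily decoded string into node texts on <NODE_SEP>,
    padding up to a pre-fixed node count with <NO_NODE> tokens (S102)."""
    body = decoded.replace(EOS, "").strip()
    nodes = [n.strip() for n in body.split(NODE_SEP) if n.strip()]
    nodes = nodes[:max_nodes]
    nodes += [NO_NODE] * (max_nodes - len(nodes))
    return nodes

def mean_pool_node_features(hidden: np.ndarray, spans):
    """Mean-pool the decoder's last-layer hidden states (T, d) over each
    node's token span to obtain one feature vector per node."""
    return np.stack([hidden[s:e].mean(axis=0) for s, e in spans])

decoded = "malware <NODE_SEP> CVE-2021-44228 <NODE_SEP> patch server </s>"
nodes = parse_node_sequence(decoded, max_nodes=4)
# toy hidden states for an 8-token decode with feature dimension 4
hidden = np.arange(32, dtype=float).reshape(8, 4)
feats = mean_pool_node_features(hidden, [(0, 2), (2, 5), (5, 8)])
```

The pre-fixed node count makes every training example the same width, which is what lets the decoder output be read as a fixed-size feature matrix in step S201.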
In the above embodiment, S2 further includes the steps of:
S201, the decoder receives as input a set of learnable node queries and represents them as an embedded feature matrix F_n ∈ R^(N×d); because no causal mask is used, the network processes all queries simultaneously and the decoder output can be read directly. Here N is the number of nodes and d is the node feature dimension. The features are passed to the prediction-head LSTM and decoded into node logits L_n ∈ R^(N×S×V), where S is the length of the generated node sequence and V is the vocabulary size;
S202, in order to prevent the network from memorizing a specific target-node order, the logits and features are permuted as
L′_n(s) = L_n(s)P,  F′_n = F_n P,
where s = 1, …, S and P ∈ {0,1}^(N×N) is a permutation matrix obtained by bipartite matching between the target nodes and the greedily decoded nodes, using cross-entropy loss as the matching cost function. The node features F′_n processed by the permutation matrix are now target-aligned.
In the above embodiment, S3 further includes the steps of:
S301, in order to make full use of the text-node and query-node features, a node fusion module is designed: the features obtained in the previous two steps are concatenated. The features are first fused through a residual block, and the important information in them is then extracted by a self-attention module;
S302, features are enhanced through an atrous spatial pyramid pooling module and a convolutional block attention module, compressed through a 5×3 3D convolution, and the compressed important feature information is attended to through a channel attention module;
S303, in order to remove redundant information from the generated node features, the redundancy is predicted with a simple encoder-decoder structure and then subtracted from the original information;
S304, the similarity between node features is measured by dot product and mapped into the interval (0, 1) by a softmax function. If the similarity is greater than Y, one node of the pair is removed at random to control the number of redundant nodes. Experiments show that the best effect is obtained at Y = 0.7; after the redundant nodes are deleted, the final node features are obtained.
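The redundant-node pruning of S304 can be sketched as below. The threshold Y = 0.7 comes from the patent's experiments; the two-way softmax normalization (scoring each pair against a zero baseline) and the toy feature values are assumptions, since the patent does not specify how pairwise similarities are normalized:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def drop_redundant_nodes(feats: np.ndarray, threshold: float = 0.7, seed: int = 0):
    """Measure pairwise similarity by dot product, map it into (0, 1) with a
    softmax (here against a zero baseline - an assumed formulation), and
    randomly drop one node of any pair whose score exceeds the threshold."""
    rng = np.random.default_rng(seed)
    n = feats.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            if not (keep[i] and keep[j]):
                continue
            score = softmax(np.array([feats[i] @ feats[j], 0.0]))[0]
            if score > threshold:
                keep[rng.choice([i, j])] = False   # remove one node at random
    return feats[keep], keep

# nodes 0 and 1 are near-duplicates; node 2 is distinct
feats_in = np.array([[2.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
kept, mask = drop_redundant_nodes(feats_in, threshold=0.7, seed=0)
```

One of the two duplicate nodes is removed, while the dissimilar node survives, which is exactly the redundancy-control behavior S304 describes.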
In the above embodiment, S4 further includes the steps of:
S401, edges are generated in this module using the node features of the previous step: given a pair of node features, a prediction head determines whether an edge exists between the corresponding nodes; edges are generated in two ways, and the edges so generated are then fused;
S402, first, edges are generated as a token sequence using an LSTM; the advantage of generation is that any edge sequence can be constructed, including edge sequences unseen during training, but there is a risk of not exactly matching the target edge-token sequence;
S403, a classification head is then used to predict edges; if the set of possible relations is fixed and known, the classification head is more efficient and accurate, but if training coverage of all possible edges is limited, the system may misclassify during inference;
S404, the edges are concatenated pairwise, their features are fused through a dense layer, and a trained scoring network evaluates the confidence of each fused edge; a fused edge is retained if its confidence is greater than 0.5.
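The pairwise edge scoring of S401 and S404 can be sketched as follows. The dense-layer weights W and bias b stand in for the trained scoring network, whose architecture the patent does not specify; the 0.5 retention threshold is from step S404, and the self-loop skip is from step S501:

```python
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def fuse_and_score_edges(node_feats: np.ndarray, W: np.ndarray, b: float):
    """Concatenate node features pairwise, fuse them through a dense layer
    (W, b are stand-ins for the trained scoring network), and keep a
    candidate edge only when its confidence exceeds 0.5."""
    n, _ = node_feats.shape
    edges = []
    for i in range(n):
        for j in range(n):
            if i == j:          # self-loops are omitted, as in S501
                continue
            pair = np.concatenate([node_feats[i], node_feats[j]])
            conf = sigmoid(pair @ W + b)
            if conf > 0.5:
                edges.append((i, j, float(conf)))
    return edges

node_feats = np.eye(2)                      # two toy node feature vectors
W = np.array([1.0, 0.0, 0.0, 1.0])          # assumed dense-layer weights
edges = fuse_and_score_edges(node_feats, W, b=0.0)
```

Because the pair (i, j) is concatenated in order, the score is direction-sensitive, which matches a knowledge graph whose relations (e.g. "malware exploits vulnerability") are directed.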
In the above embodiment, S5 further includes the steps of:
S501, because the presence of an edge must be checked between every pair of nodes, up to N^2 edges are generated and predicted, where N is the number of nodes. Some computation is saved by omitting self-loops and omitting nodes attached to the special <NO_NODE> tag. When there is no edge between two nodes, it is represented by the special token <NO_EDGE>;
S502, a novel focal loss, denoted by the symbol F, is proposed; its main idea is to reduce the cross-entropy loss of well-classified <NO_EDGE> samples and to increase the loss for misclassifications, as follows:
F = -(1 - p_t)^γ · log(p_t),
where γ is a weighting factor; when γ = 0 the loss equals the cross entropy. p is the probability of a single edge, t is the target class, and p_t denotes the probability of the target class;
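The loss in S502 matches the standard focal-loss form; a minimal sketch for a single sample, with the patent's γ = 0 special case reducing to plain cross entropy:

```python
import math

def focal_loss(p_t: float, gamma: float = 2.0) -> float:
    """Focal loss F = -(1 - p_t)^gamma * log(p_t): the (1 - p_t)^gamma factor
    down-weights the abundant, well-classified <NO_EDGE> samples while
    keeping pressure on misclassified edges. With gamma = 0 it reduces to
    the ordinary cross-entropy loss. The default gamma is an assumption;
    the patent does not state its chosen value."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

For a well-classified sample (p_t close to 1) the modulating factor is tiny, so the loss is heavily suppressed relative to cross entropy; a misclassified sample with small p_t keeps nearly its full loss.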
S503, a modification of the training setup is proposed: most <NO_EDGE> edges are removed by sparsifying the adjacency matrix, keeping all actual edges but only a randomly selected subset of <NO_EDGE> edges. This modification improves accuracy by 10–20%, and training with sparse edges shortens training time by 10%;
S504, after the sparse-adjacency-matrix operation of the previous step, some actual edges are removed and some edges are randomly replaced to strengthen the robustness of the model; this modification improves accuracy by 5–10%.
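The adjacency-matrix sparsification of S503 can be sketched as follows. The keep_ratio value, the edge-triple representation, and the toy relation names are assumptions; the patent states only that all actual edges are kept while just a random subset of <NO_EDGE> pairs survives:

```python
import random

NO_EDGE = "<NO_EDGE>"

def sparsify_edges(edge_labels, keep_ratio=0.1, seed=0):
    """Keep every real edge but only a small random fraction of the
    <NO_EDGE> node pairs, sparsifying the training adjacency (S503)."""
    rng = random.Random(seed)
    real = [e for e in edge_labels if e[2] != NO_EDGE]
    empty = [e for e in edge_labels if e[2] == NO_EDGE]
    n_keep = max(1, int(len(empty) * keep_ratio)) if empty else 0
    return real + rng.sample(empty, n_keep)

# toy labelled pairs (i, j, relation): 2 real edges among 18 empty pairs
pairs = [(0, 1, "exploits"), (1, 2, "targets")]
pairs += [(i, j, NO_EDGE) for i in range(5) for j in range(5)
          if i != j and (i, j) not in [(0, 1), (1, 2)]]
sparse = sparsify_edges(pairs, keep_ratio=0.1, seed=0)
```

Since <NO_EDGE> pairs dominate the N^2 candidates, this rebalances the class distribution the focal loss of S502 also addresses, and it is what yields the reported 10% training-time reduction.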
To demonstrate the effectiveness of the invention, different datasets were used for verification, specifically the Comprehensive, Multi-Source Cyber-Security Events dataset and the ADFA intrusion detection dataset. The Comprehensive, Multi-Source Cyber-Security Events dataset is collected from various websites and vulnerability databases on the network, and includes network security and vulnerability information together with network text data. The ADFA dataset contains data on various intrusions; WebSEC2020 (a network security knowledge dataset) is a dataset of network security emergency responses, consisting of multiple sets of abnormal events and corresponding labels; MAWILab (a network traffic anomaly dataset) is a network traffic anomaly detection dataset consisting of multiple sets of traffic-anomaly labels. Extensive experiments show that the method outperforms the most advanced methods: performance is 20% higher on the ADFA dataset than the BT5 method, and 25% higher on the Comprehensive, Multi-Source Cyber-Security Events dataset than the ReGen method. The experimental results are as follows:
TABLE 1 feature semantic similarity matching results for different datasets
Experimental results show that the network security knowledge graph generated by the method has extremely high applicability and accuracy.
Claims (6)
1. A method for generating a network security emergency response knowledge graph based on text, characterized by comprising the following steps:
S1, generating text nodes and node features through a pre-trained encoder-decoder language model;
S2, outputting query-node features from the input text and a set of learnable node queries, and then generating node text through an LSTM;
S3, fusing the text-node features and the query-node features to obtain the final node features;
S4, generating and fusing edges in two ways, generation and classification;
S5, training the network through the aggregated loss function and the sparse adjacency matrix.
2. The method for generating a network security emergency response knowledge graph based on text according to claim 1, wherein S1 further comprises the following steps:
S101, node generation is formulated as a sequence-to-sequence problem using a pre-trained encoder-decoder language model: the system is fine-tuned to convert the text input into a sequence of nodes separated by special tags, <PAD> NODE1 <NODE_SEP> NODE2 … </s>, where NODEi represents one or more words;
S102, this module generates the nodes and provides node features for the edge-generation task. Each node may be associated with several node features; node boundaries are delimited by the separator tag <NODE_SEP>, the string is produced by greedy decoding, and the hidden states of the decoder's last layer are mean-pooled. We pre-fix the number of generated nodes and fill missing slots with the special <NO_NODE> token.
3. The method for generating a network security emergency response knowledge graph based on text according to claim 1, wherein S2 further comprises the following steps:
S201, the decoder receives as input a set of learnable node queries and represents them as an embedded feature matrix F_n ∈ R^(N×d); because no causal mask is used, the network processes all queries simultaneously and the decoder output can be read directly. Here N is the number of nodes and d is the node feature dimension. The features are passed to the prediction-head LSTM and decoded into node logits L_n ∈ R^(N×S×V), where S is the length of the generated node sequence and V is the vocabulary size;
S202, in order to prevent the network from memorizing a specific target-node order, the logits and features are permuted as L′_n(s) = L_n(s)P, F′_n = F_n P.
4. The method for generating a network security emergency response knowledge graph based on text according to claim 1, wherein S3 further comprises the following steps:
S301, in order to make full use of the text-node and query-node features, a node fusion module is designed: the features obtained in the previous two steps are concatenated, fused through a residual block, and the important information in the features is then extracted by a self-attention module;
S302, features are enhanced through an atrous spatial pyramid pooling module and a convolutional block attention module, compressed through a 5×3 3D convolution, and the compressed important feature information is attended to through a channel attention module;
S303, in order to remove redundant information from the generated node features, the redundancy is predicted with a simple encoder-decoder structure and then subtracted from the original information;
S304, the similarity between node features is measured by dot product and mapped into the interval (0, 1) by a softmax function; if the similarity is greater than Y, one node of the pair is removed at random to control the number of redundant nodes. Experiments show that the best effect is obtained at Y = 0.7; after the redundant nodes are deleted, the final node features are obtained.
5. The method for generating a network security emergency response knowledge graph based on text, characterized in that S4 further comprises the following steps:
S401, edges are generated in this module from the node features of the previous step: given a pair of node features, a prediction head determines whether an edge exists between the corresponding nodes; the edges are generated in two ways, and the edges produced by the two ways are then fused;
S402, first, the edges are generated as a token sequence using an LSTM; this generation has the advantage of being able to construct arbitrary edge sequences, including edge sequences not seen during training, but runs the risk of not exactly matching the target edge token sequence;
S403, second, the edges are predicted using a classification head, which is more efficient and accurate when the set of possible relations is fixed and known, but which may misclassify during inference if the training data has limited coverage of all possible edges;
S404, the paired edge features are concatenated and fused through a dense layer, and a trained scoring network evaluates the confidence of each fused edge; a fused edge is kept if its confidence is greater than 0.5.
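A toy sketch of the fusion and scoring of S404 (all weights are illustrative placeholders, not the patent's trained parameters): the two heads' features for the same candidate edge are concatenated, passed through a dense layer, scored with a sigmoid, and kept when the confidence exceeds 0.5:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_edge_candidates(gen_feats, cls_feats, w_dense, w_score, thresh=0.5):
    """Fuse per-edge features from the LSTM generation head and the
    classification head, then keep fused edges whose scoring-network
    confidence exceeds `thresh` (sketch of S404).

    gen_feats, cls_feats: (E, d) features for the same E candidate edges
    w_dense: (2d, d) dense-layer weights; w_score: (d,) scoring weights
    Returns (kept fused features, confidence for every candidate).
    """
    fused = np.concatenate([gen_feats, cls_feats], axis=1) @ w_dense
    fused = np.maximum(fused, 0.0)        # ReLU after the dense layer
    conf = sigmoid(fused @ w_score)       # scoring network output in (0, 1)
    return fused[conf > thresh], conf
```

In a real system `w_dense` and `w_score` would be trained jointly with the rest of the model; here they only fix the shapes of the computation.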
6. The method for generating a network security emergency response knowledge graph based on text, characterized in that S5 further comprises the following steps:
S501, up to N² edges are generated and predicted, where N is the number of nodes; self-loops and nodes connected to the special label <no_node> are omitted to reduce computation cost; when there is no edge between two nodes, it is represented by a special token <no_edge>;
S502, a novel focal loss, denoted F, is proposed; its main idea is to reduce the cross-entropy loss of well-classified <no_edge> samples and to increase the cross-entropy loss of misclassified samples, as follows:

F = -(1 - p_t)^γ log(p_t),

where γ is a weighting factor; when γ = 0 the focal loss is equal to the ordinary cross-entropy loss; p is the predicted probability of a single edge, t is the target class, and p_t denotes the probability of the target class;
S503, a modification of the training setup is proposed: most <no_edge> edges are removed by sparsifying the adjacency matrix, keeping all actual edges but only a randomly selected subset of the <no_edge> edges; this modification improves accuracy by 10%-20%, and training with the sparse edges shortens training time by 10%;
S504, after the adjacency-matrix sparsification of the previous step, some actual edges are removed and some edges are randomly replaced to enhance the robustness of the model; this modification improves accuracy by a further 5%-10%.
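The focal loss of S502 follows directly from the formula F = -(1 - p_t)^γ log(p_t); a minimal NumPy version (array shapes and names are illustrative):

```python
import numpy as np

def focal_loss(p, target, gamma=2.0):
    """Focal loss F = -(1 - p_t)^gamma * log(p_t) as in S502.

    The (1 - p_t)^gamma factor shrinks the loss of well-classified
    samples (dominated by <no_edge>) and keeps the weight on hard,
    misclassified edges; gamma = 0 recovers plain cross-entropy.

    p: (N, C) predicted class probabilities; target: (N,) class indices.
    Returns the per-sample loss.
    """
    p_t = p[np.arange(len(target)), target]   # probability of the target class
    return -((1.0 - p_t) ** gamma) * np.log(p_t)
```

For a confidently correct prediction with p_t = 0.9 and γ = 2, the cross-entropy term is scaled by (1 - 0.9)² = 0.01, so easy <no_edge> samples contribute almost nothing to the total loss.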
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310316305.2A CN116340540A (en) | 2023-03-24 | 2023-03-24 | Method for generating network security emergency response knowledge graph based on text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116340540A true CN116340540A (en) | 2023-06-27 |
Family
ID=86883647
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310316305.2A Pending CN116340540A (en) | 2023-03-24 | 2023-03-24 | Method for generating network security emergency response knowledge graph based on text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116340540A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118036732A (en) * | 2024-04-11 | 2024-05-14 | 神思电子技术股份有限公司 | Social event pattern relation completion method and system based on critical countermeasure learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111709241B (en) | Named entity identification method oriented to network security field | |
CN109450845B (en) | Detection method for generating malicious domain name based on deep neural network algorithm | |
CN109308494B (en) | LSTM model and network attack identification method and system based on LSTM model | |
Xiao et al. | Mcapsnet: Capsule network for text with multi-task learning | |
WO2022134071A1 (en) | Text extraction method and apparatus, computer readable storage medium, and electronic device | |
EP3614645B1 (en) | Embedded dga representations for botnet analysis | |
CN110933105B (en) | Web attack detection method, system, medium and equipment | |
CN115380284A (en) | Unstructured text classification | |
Yin et al. | Deep learning-aided OCR techniques for Chinese uppercase characters in the application of Internet of Things | |
CN115080756B (en) | Attack and defense behavior and space-time information extraction method oriented to threat information map | |
CN114781609A (en) | Traffic flow prediction method based on multi-mode dynamic residual image convolution network | |
CN116340540A (en) | Method for generating network security emergency response knowledge graph based on text | |
Nakagawa et al. | Character-level convolutional neural network for predicting severity of software vulnerability from vulnerability description | |
Wang et al. | An unknown protocol syntax analysis method based on convolutional neural network | |
Xu et al. | Adversarial attacks on text classification models using layer‐wise relevance propagation | |
Ren et al. | CLIO: Role-interactive multi-event head attention network for document-level event extraction | |
CN112887323B (en) | Network protocol association and identification method for industrial internet boundary security | |
CN117318980A (en) | Small sample scene-oriented self-supervision learning malicious traffic detection method | |
CN115759081A (en) | Attack mode extraction method based on phrase similarity | |
CN114301671A (en) | Network intrusion detection method, system, device and storage medium | |
CN115631502A (en) | Character recognition method, character recognition device, model training method, electronic device and medium | |
CN113055890B (en) | Multi-device combination optimized real-time detection system for mobile malicious webpage | |
CN115473734A (en) | Remote code execution attack detection method based on single classification and federal learning | |
WO2022141855A1 (en) | Text regularization method and apparatus, and electronic device and storage medium | |
Cao et al. | Adversarial DGA domain examples generation and detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||