CN112989831A - Entity extraction method applied to network security field

Info

Publication number: CN112989831A
Application number: CN202110333374.5A
Authority: CN
Inventors: 陆以勤; 陈帅豪; 覃健诚; 谢树禄; 李智鹏; 洪炜妍; 陈嘉睿
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2021-03-29
Filing date: 2021-03-29
Publication date: 2021-06-18
Anticipated expiration: 2041-03-29
Also published as: CN112989831B

Abstract

The invention discloses an entity extraction method applied to the field of network security, comprising the following steps: inputting segmented network security text data into a trained word2vec model to obtain network security domain word vectors; manually annotating the text data to construct a network security data set; inputting the network security data set into a SecurityBERT model to obtain character-level vectors; fusing the network security domain word vectors with the character-level vectors; and inputting the output of a BiLSTM model into a self-attention layer, where a self-attention mechanism enhances the features of locally important network security words in the character vectors to obtain semantic information. By further modeling with the BiLSTM model and the self-attention mechanism, the invention obtains contextual semantics and captures local key information, thereby improving entity extraction performance in the field of network security and achieving better accuracy, recall, and F1 score.

Description

Entity extraction method applied to network security field

Technical Field

The invention relates to the field of network security, in particular to an entity extraction method applied to the field of network security.

Background

The rapid development and wide application of Internet technology have greatly promoted social prosperity and progress, but at the same time the cyberspace environment has become increasingly complex and severe. Network attacks of many kinds, ransomware, trojans, backdoor programs, security vulnerabilities and the like pose serious threats to cyberspace. Frequent network security incidents cause economic losses to countries, enterprises and individuals, and seriously affect social stability.

Cyberspace contains a great deal of valuable security information, such as network security logs, alarm information and traffic data, as well as important security data that can be acquired from security forums or websites, including system logs, attack events, security blogs, security intelligence and vulnerability databases. This massive body of security data has great value, and how to extract useful security information from massive, fragmented network security data is an important research direction in the field of network security. Entity extraction technology oriented to the network security field has emerged to meet this need.

Network security entity extraction is a domain-specific entity extraction technology. It generally refers to extracting entities with network-security-related semantics from unstructured network security text data, such as attackers, vulnerabilities, viruses and trojans, attack methods, software, and the like. The entity extraction task generally comprises related tasks such as ontology design, data collection, cleaning and construction, text word segmentation, and entity extraction and classification. Compared with traditional fields, data in the network security field has few available data sets, mixes Chinese and English, mixes upper and lower case, and mixes in digits; new entities appear and change frequently, fall into many categories, and have strong domain-specific characteristics, and the same entity may even be semantically diverse and ambiguous. Traditional entity extraction approaches such as word2vec pre-training, RNN, LSTM and CRF models have difficulty identifying such entities accurately and do not adapt well to the field of network security.

Disclosure of Invention

The invention provides an entity extraction method applied to the field of network security, which aims to solve the problem that existing methods achieve low entity extraction accuracy, recall, and F1 score in the field of network security.

In order to achieve the above purpose, the embodiment of the present invention adopts the following technical solutions:

An entity extraction method applied to the field of network security comprises the following steps: acquiring unstructured text data in the field of network security, and constructing a network security dictionary from the text data; preprocessing and segmenting the text data; inputting the segmented network security text data into a trained word2vec or GloVe model to obtain network security domain word vectors; manually annotating the text data to construct a network security data set; inputting the network security data set into a domain-pre-trained SecurityBERT model to obtain character-level vectors; fusing the network security domain word vectors with the character-level vectors output by the SecurityBERT model to obtain character vectors enhanced with network security word-level features; inputting the resulting character vector sequence into a BiLSTM model for further modeling, the BiLSTM model outputting character vectors containing contextual semantic feature information; inputting the output of the BiLSTM model into a self-attention layer, where a self-attention mechanism enhances the features of locally important network security words in the character vectors to obtain semantic information; and fusing the output of the self-attention layer with the output of the BiLSTM model, and feeding the fused output into a softmax layer and then a conditional random field (CRF) model to obtain the final label sequence, namely the entity extraction result.

Preferably, preprocessing and segmenting the text data comprises: parsing HTML web pages with Python and the BeautifulSoup HTML parser, removing useless tag information, and retaining the core network-security-related text; removing special characters from the network-security-related text, and converting between simplified and traditional Chinese and between upper and lower case; and segmenting the text data into words using a word segmentation tool.

Preferably, manually annotating the text data to construct the network security data set comprises: designing an ontology model of the network security field to obtain the categories of network security entities; and, according to the ontology model, performing structured annotation of the text using the brat tool, and converting the structured annotation result into the BIO or BIOES annotation format.

Preferably, training the SecurityBERT model comprises: performing domain pre-training on the BERT-Base-Chinese pre-trained model using collected unlabeled text data from the network security field, so that the trained SecurityBERT model is adapted to the network security field.

Preferably, fusing the network security domain word vectors with the character-level vectors output by the SecurityBERT model comprises: a method based on vector concatenation and/or a method based on vector addition.

Preferably, each character in the character-level vectors output by the SecurityBERT model has a corresponding word segmentation result in the sentence; the corresponding word vector is looked up in a network security word vector table according to the word segmentation result, and the word vector found is fused with the character-level vector output by the SecurityBERT model to enhance word-level features; if no corresponding word vector is found in the network security word vector table for the word segmentation result, a <padding> vector or a random vector is fused with the character-level vector output by the SecurityBERT model; if one character corresponds to a plurality of word segmentation results, either all of the word vectors found are fused with the character-level vector output by the SecurityBERT model, or one or more of the word vectors found are selected and fused with the character-level vector output by the SecurityBERT model.

Preferably, inputting the character vector sequence into the BiLSTM model for further modeling, with the BiLSTM model outputting character vectors containing contextual semantic feature information, comprises: concatenating the output vector of the forward LSTM and the output vector of the backward LSTM in the BiLSTM model to obtain a feature vector carrying context information.

Preferably, enhancing the features of locally important network security words in the character vectors using a self-attention mechanism comprises: the self-attention mechanism uses weighting to assign weights greater than M to network security words in the sentence whose importance value is greater than K, thereby enhancing the features of locally important network security words; K is greater than 0 and M is greater than 0, and the weights are computed with a scaled dot-product function.

Preferably, fusing the output of the self-attention layer with the output of the BiLSTM model and feeding the fused output into the softmax layer and then the conditional random field (CRF) model to obtain the final label sequence comprises: adding or concatenating the output vector of the self-attention layer and the output vector of the BiLSTM layer to obtain a new vector; and inputting the new vector into a softmax layer for multi-class classification and probability normalization, then into a CRF layer that models transitions between sequence labels, and outputting a label sequence, namely the entity extraction result.

Compared with the prior art, the embodiment of the invention has the beneficial effects that:

The entity extraction method applied to the network security field of the invention performs domain pre-training based on the BERT model to obtain a SecurityBERT model oriented to the network security field; this model is adapted to the domain and better suited to downstream security entity extraction tasks. The method also fuses network security word vectors with SecurityBERT character-level vectors, which strengthens word-level representation and makes the boundaries of network security entities easier to distinguish. Further modeling with a BiLSTM model and a self-attention mechanism obtains contextual semantics and captures local key information. Together these improve entity extraction performance in the network security field, yielding better accuracy, recall, and F1 score; they also improve the ability to extract network security information automatically, greatly reduce the analysis workload of security experts, and lay a foundation for the subsequent construction of a network security knowledge graph.

Drawings

Fig. 1 is a flowchart illustrating an entity extraction method applied in the field of network security according to this embodiment.

Fig. 2 is a structural diagram of a model of an entity extraction method applied in the network security field according to the present embodiment.

Detailed Description

The invention is further described below with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 1, the workflow of the entity extraction method applied to the network security field provided in this embodiment includes the following steps:

Step 101: acquiring unstructured text data in the field of network security from the Internet, and constructing a network security dictionary from the text data;

Step 102: performing operations such as cleaning, preprocessing and word segmentation on the text data;

Step 103: manually annotating the text data to construct a network security data set;

Step 104: performing domain pre-training to obtain a SecurityBERT model oriented to the network security field;

Step 105: inputting the segmented network security text data into a trained word2vec or GloVe model to obtain network security domain word vectors;

Step 106: inputting the data set into the SecurityBERT model to obtain character-level vector output;

Step 107: fusing the network security domain word vectors with the character-level vectors output by SecurityBERT;

Step 108: inputting the fused vector sequence into a BiLSTM model to further model contextual semantic features;

Step 109: enhancing the features of locally important network security words in the character vectors using a self-attention mechanism;

Step 110: fusing the outputs of the self-attention layer and the BiLSTM layer and inputting the fused output into a softmax layer and a CRF model;

Step 111: outputting the network security entity extraction result.

The entities include Vulnerability, Software, Malware, and the like. A vulnerability (Vulnerability) is a flaw in the specific implementation of hardware, software, or protocols, or in a system security policy. For example: the EternalBlue vulnerability, UAF vulnerabilities, cve-2018-5002, etc. Software (Software) is an entity that runs on a computer, such as data, a program, or a business system. For example: office, IE browser, softenable, Web server, etc. Malware (Malware) is software or a file that performs unauthorized functions or processes on a computer system. For example: decoy documents, Havex trojans, remote-control trojans, and the like.

In this embodiment, the method in step 102 specifically includes: parsing HTML web pages with Python and the BeautifulSoup HTML parser, removing useless tag information, and retaining the core network-security-related text content; applying preprocessing operations to the network security text such as special character removal, conversion between simplified and traditional Chinese, and conversion between upper and lower case; and segmenting the text data into words using jieba or another word segmentation tool.
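The preprocessing in step 102 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: it uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup, and a forward maximum-matching segmenter over a toy dictionary as a stand-in for jieba. Conversion between simplified and traditional Chinese is omitted because it needs an external mapping table. The names `clean_html`, `normalize`, `segment` and `SECURITY_DICT` are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def clean_html(html):
    """Removes tags and returns the remaining text joined by spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def normalize(text):
    """Lower-cases the text and replaces special characters with spaces."""
    text = text.lower()
    # Keep CJK characters, ASCII letters and digits, and the symbols that
    # appear inside identifiers such as cve-2018-5002.
    text = re.sub(r"[^\u4e00-\u9fff0-9a-z\-_. ]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Toy network security dictionary (assumption, for illustration only).
SECURITY_DICT = {"漏洞", "木马", "永恒之蓝", "cve-2018-5002"}

_ASCII_RUN = re.compile(r"[0-9a-z][0-9a-z\-_.]*")


def segment(text, dictionary, max_len=16):
    """Forward maximum matching: prefer the longest dictionary word at each
    position, otherwise take an ASCII run or a single character."""
    tokens, i = [], 0
    while i < len(text):
        if text[i] == " ":
            i += 1
            continue
        match_end = next(
            (j for j in range(min(len(text), i + max_len), i, -1)
             if text[i:j] in dictionary),
            None,
        )
        if match_end is None:
            run = _ASCII_RUN.match(text, i)
            match_end = run.end() if run else i + 1
        tokens.append(text[i:match_end])
        i = match_end
    return tokens
```

A page would be processed as `segment(normalize(clean_html(page)), SECURITY_DICT)`. Building the dictionary from the corpus, as in step 101, is what lets multi-character security terms survive segmentation as single words.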

In this embodiment, the method in step 103 specifically includes: first, designing an ontology model of the network security field to obtain the categories of network security entities; then, according to the ontology model, performing structured annotation of the text using the brat tool; and finally, converting the structured annotation result into the BIO or BIOES annotation format.
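The conversion from structured annotations to BIO or BIOES tags can be sketched as below. This is an illustrative sketch: it assumes character-level tagging and simple, contiguous text-bound annotations in brat's standoff layout (`ID<TAB>Type start end<TAB>text`, with an exclusive end offset). The helper names `parse_brat_line` and `spans_to_tags` are hypothetical.

```python
def parse_brat_line(line):
    """Parses one contiguous text-bound annotation line of the form
    'T1<TAB>Vulnerability 2 8<TAB>covered text' into (start, end, label)."""
    _ann_id, type_and_offsets, _covered = line.rstrip("\n").split("\t")
    label, start, end = type_and_offsets.split(" ")
    return int(start), int(end), label


def spans_to_tags(text, spans, scheme="BIO"):
    """Converts (start, end, label) character spans into one tag per character.

    scheme is either "BIO" or "BIOES"; end offsets are exclusive.
    """
    if scheme not in ("BIO", "BIOES"):
        raise ValueError("scheme must be 'BIO' or 'BIOES'")
    tags = ["O"] * len(text)
    for start, end, label in spans:
        length = end - start
        if length <= 0:
            continue
        if scheme == "BIOES" and length == 1:
            tags[start] = f"S-{label}"  # single-character entity
            continue
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
        if scheme == "BIOES":
            tags[end - 1] = f"E-{label}"  # mark the entity's last character
    return tags
```

BIOES carries explicit end and single-character markers, which gives the downstream CRF layer more boundary information at the cost of a larger tag set.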

In this embodiment, the method in step 104 specifically includes: starting from the BERT-Base-Chinese pre-trained model, performing domain pre-training on a large amount of collected unlabeled text data from the network security field, so that the trained SecurityBERT model is adapted to the network security field.
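The embodiment does not state which objective is used for domain pre-training. Assuming it follows BERT's usual masked-language-model objective, preparing one training example from unlabeled text might look like the sketch below. It is deliberately simplified: every selected position is replaced by `[MASK]`, whereas the original BERT recipe also keeps or randomizes some of the selected tokens. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Returns (masked_tokens, labels) for one unlabeled sentence.

    labels[i] holds the original token at masked positions and None
    elsewhere, so that a loss is computed only on the masked positions.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, labels
```

Because the labels come from the text itself, this kind of objective needs no manual annotation, which is why the unlabeled corpus collected in step 101 can be used directly.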

In this embodiment, the method in step 107 specifically includes: the SecurityBERT model outputs a vector for each character, and each character has a corresponding word segmentation result in the sentence; the corresponding word vector is looked up in the network security word vector table according to the word segmentation result and fused with the character-level vector output by the SecurityBERT model to enhance word-level features; for words that are not found, the word vector is replaced by a <padding> vector or a random vector; where one character corresponds to a plurality of word segmentation results, either all of the word vectors are fused with the character-level vector output by the SecurityBERT model, or one or more word vectors are selected according to a strategy and fused with the character-level vector output by the SecurityBERT model; the fusion may be based on vector concatenation, on vector addition, or on a combination of methods.
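The character-word fusion described above can be sketched with plain Python lists. This is an illustration only: the zero vector used as the padding vector, and averaging as the way to combine several matching word vectors, are assumptions chosen for the example. The embodiment leaves both choices open. The name `fuse_char_word` is hypothetical.

```python
def fuse_char_word(char_vecs, char_words, word_table, word_dim, method="concat"):
    """Fuses each character vector with the vector(s) of the word(s) covering it.

    char_vecs:  list of character-level vectors (lists of floats)
    char_words: for each character, the list of segmented words covering it
    word_table: dict mapping a word to its word vector
    word_dim:   dimension of the word vectors (used for the padding vector)
    method:     "concat" (vector concatenation) or "add" (vector addition)
    """
    padding = [0.0] * word_dim  # assumption: zero vector for unknown words
    fused = []
    for cvec, words in zip(char_vecs, char_words):
        found = [word_table[w] for w in words if w in word_table]
        if found:
            # Assumption: average all matching word vectors element-wise.
            wvec = [sum(column) / len(found) for column in zip(*found)]
        else:
            wvec = padding
        if method == "concat":
            fused.append(list(cvec) + list(wvec))
        elif method == "add":
            if len(cvec) != len(wvec):
                raise ValueError("addition needs equal dimensions")
            fused.append([c + w for c, w in zip(cvec, wvec)])
        else:
            raise ValueError("method must be 'concat' or 'add'")
    return fused
```

Concatenation keeps the two representations separate and lets the later layers weigh them, while addition keeps the dimension unchanged but requires the word vectors and character vectors to have the same size.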

In this embodiment, the method in step 108 specifically includes: concatenating the output vector of the forward LSTM and the output vector of the backward LSTM to obtain a feature vector carrying context information.
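The concatenation of forward and backward LSTM outputs can be illustrated with a small pure-Python LSTM. This is a toy sketch with randomly initialized, untrained weights, intended only to show how the two directions are aligned and joined per position; a real system would use a deep learning framework. The names `make_lstm_params`, `lstm_run` and `bilstm` are hypothetical.

```python
import math
import random


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm_params(input_dim, hidden_dim, seed):
    """Random weights for the four gates, stacked as (4*hidden) rows."""
    rng = random.Random(seed)
    rows, cols = 4 * hidden_dim, input_dim + hidden_dim
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    biases = [0.0] * rows
    return weights, biases


def lstm_run(inputs, params, hidden_dim):
    """Runs a single-direction LSTM and returns the hidden state per step."""
    weights, biases = params
    n = hidden_dim
    h, c = [0.0] * n, [0.0] * n
    outputs = []
    for x in inputs:
        xh = list(x) + h
        z = [sum(w * v for w, v in zip(row, xh)) + b
             for row, b in zip(weights, biases)]
        i_gate = [_sigmoid(v) for v in z[0:n]]
        f_gate = [_sigmoid(v) for v in z[n:2 * n]]
        o_gate = [_sigmoid(v) for v in z[2 * n:3 * n]]
        cand = [math.tanh(v) for v in z[3 * n:4 * n]]
        c = [f * cp + i * g for f, cp, i, g in zip(f_gate, c, i_gate, cand)]
        h = [o * math.tanh(cv) for o, cv in zip(o_gate, c)]
        outputs.append(h)
    return outputs


def bilstm(inputs, fwd_params, bwd_params, hidden_dim):
    """Concatenates forward and backward hidden states position by position."""
    forward = lstm_run(inputs, fwd_params, hidden_dim)
    # Run on the reversed sequence, then reverse back so that position t of
    # the backward output summarizes inputs t..end.
    backward = lstm_run(inputs[::-1], bwd_params, hidden_dim)[::-1]
    return [f + b for f, b in zip(forward, backward)]
```

Each output vector has twice the hidden size: the first half depends on the characters up to that position, the second half on the characters from that position onward.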

In this embodiment, the method in step 109 specifically includes: the self-attention mechanism uses weighting to assign higher weights to the more important network security words in the sentence, thereby enhancing the features of locally important network security words; the weights are computed with a scaled dot-product function.
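The scaled dot-product weighting can be sketched as follows. The sketch assumes the simplest form of self-attention, in which the BiLSTM outputs serve directly as queries, keys and values without learned projection matrices; whether the embodiment uses projections is not stated. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(hidden):
    """Scaled dot-product self-attention with Q = K = V = hidden.

    hidden: list of per-character vectors (for example BiLSTM outputs).
    Returns (outputs, weights), where weights[i][j] is the attention that
    position i pays to position j.
    """
    dim = len(hidden[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in hidden:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in hidden]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(w * value[t] for w, value in zip(row, hidden))
                        for t in range(dim)])
    return outputs, weights
```

Dividing by the square root of the vector dimension keeps the dot products in a range where the softmax does not saturate, so that the weights remain informative for longer vectors.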

In this embodiment, the method in step 110 specifically includes: adding or concatenating the output vector of the self-attention layer and the output vector of the BiLSTM layer to obtain a new vector; and inputting the new vector into a softmax layer for multi-class classification and probability normalization, then into a CRF layer that models transitions between sequence labels, and outputting a label sequence, namely the entity extraction result.
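The final stage can be sketched as fusion, a softmax over tag scores, and Viterbi decoding with a transition matrix. This is a minimal sketch under assumptions: the projection matrix and transition scores are supplied by the caller rather than learned, and the log of the softmax probabilities is used as the per-position score in the CRF decoder. The function names are hypothetical.

```python
import math


def fuse_outputs(attn_out, lstm_out, method="add"):
    """Adds or concatenates the self-attention and BiLSTM output vectors."""
    if method == "add":
        return [[a + b for a, b in zip(x, y)] for x, y in zip(attn_out, lstm_out)]
    if method == "concat":
        return [list(x) + list(y) for x, y in zip(attn_out, lstm_out)]
    raise ValueError("method must be 'add' or 'concat'")


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tag_probabilities(vectors, projection):
    """projection holds one weight row per tag; returns per-position tag
    probabilities after a linear map and softmax normalization."""
    return [softmax([sum(w * v for w, v in zip(row, vec)) for row in projection])
            for vec in vectors]


def viterbi(probs, transitions):
    """Finds the highest-scoring tag index sequence.

    probs[t][j]:       probability of tag j at position t (must be > 0)
    transitions[i][j]: score for moving from tag i to tag j
    """
    n_tags = len(probs[0])
    score = [math.log(p) for p in probs[0]]
    backpointers = []
    for t in range(1, len(probs)):
        new_score, pointers = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j]
                             + math.log(probs[t][j]))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    return path[::-1]
```

The transition scores are what allow the CRF layer to reject label sequences that are locally plausible but structurally invalid, such as an I tag directly after an O tag in the BIO scheme.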

The above-described embodiments are only preferred embodiments of the present invention, and it should be understood that many variations and modifications can be made by one of ordinary skill in the art in light of the above-described inventive concept without undue experimentation. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims

1. An entity extraction method applied to the field of network security, characterized by comprising the following steps:

acquiring unstructured text data in the field of network security, and constructing a network security dictionary from the text data;

preprocessing and segmenting the text data; inputting the segmented network security text data into a trained word2vec or GloVe model to obtain network security domain word vectors;

manually annotating the text data to construct a network security data set; inputting the network security data set into a domain-pre-trained SecurityBERT model to obtain character-level vectors;

fusing the network security domain word vectors with the character-level vectors output by the SecurityBERT model to obtain character vectors enhanced with network security word-level features;

inputting the character vector sequence into a BiLSTM model for further modeling, wherein the BiLSTM model outputs character vectors containing contextual semantic feature information;

inputting the output of the BiLSTM model into a self-attention layer, and enhancing the features of locally important network security words in the character vectors using a self-attention mechanism to obtain semantic information;

and fusing the output of the self-attention layer with the output of the BiLSTM model, and feeding the fused output into a softmax layer and then a conditional random field (CRF) model to obtain the final label sequence, namely the entity extraction result.

2. The entity extraction method applied to the network security field according to claim 1, wherein preprocessing and segmenting the text data comprises:

parsing HTML web pages with Python and the BeautifulSoup HTML parser, removing useless tag information, and retaining the core network-security-related text;

removing special characters from the network-security-related text, and converting between simplified and traditional Chinese and between upper and lower case;

and segmenting the text data into words using a word segmentation tool.

3. The entity extraction method applied to the network security field according to claim 1, wherein manually annotating the text data to construct the network security data set comprises:

designing an ontology model of the network security field to obtain the categories of network security entities;

and, according to the ontology model, performing structured annotation of the text using the brat tool, and converting the structured annotation result into the BIO or BIOES annotation format.

4. The entity extraction method applied in the network security field according to claim 1, wherein the step of training the SecurityBERT model comprises:

performing domain pre-training on the BERT-Base-Chinese pre-trained model using collected unlabeled text data from the network security field, so that the trained SecurityBERT model is adapted to the network security field.

5. The entity extraction method applied to the network security field according to claim 1, wherein fusing the network security domain word vectors with the character-level vectors output by the SecurityBERT model comprises: a method based on vector concatenation and/or a method based on vector addition.

6. The entity extraction method applied to the network security field according to claim 5, wherein each character in the character-level vectors output by the SecurityBERT model has a corresponding word segmentation result in the sentence, the corresponding word vector is looked up in a network security word vector table according to the word segmentation result, and the word vector found is fused with the character-level vector output by the SecurityBERT model to enhance word-level features;

if no corresponding word vector is found in the network security word vector table for the word segmentation result, a <padding> vector or a random vector is fused with the character-level vector output by the SecurityBERT model;

if one character corresponds to a plurality of word segmentation results, either all of the word vectors found are fused with the character-level vector output by the SecurityBERT model, or one or more of the word vectors found are selected and fused with the character-level vector output by the SecurityBERT model.

7. The entity extraction method applied to the network security field according to claim 1, wherein inputting the character vector sequence into the BiLSTM model for further modeling, with the BiLSTM model outputting character vectors containing contextual semantic feature information, comprises:

concatenating the output vector of the forward LSTM and the output vector of the backward LSTM in the BiLSTM model to obtain a feature vector carrying context information.

8. The entity extraction method applied to the network security field according to claim 1, wherein the local key network security word feature enhancement of the character vector using a self-attention mechanism comprises:

the self-attention mechanism uses weighting to assign weights greater than M to network security words in the sentence whose importance value is greater than K, thereby enhancing the features of locally important network security words; K is greater than 0 and M is greater than 0, and the weights are computed with a scaled dot-product function.

9. The entity extraction method applied to the network security field according to claim 1, wherein fusing the output of the self-attention layer with the output of the BiLSTM model and feeding the fused output into the softmax layer and then the conditional random field (CRF) model to obtain the final label sequence comprises:

adding or concatenating the output vector of the self-attention layer and the output vector of the BiLSTM layer to obtain a new vector;

and inputting the new vector into a softmax layer for multi-class classification and probability normalization, then into a CRF layer that models transitions between sequence labels, and outputting a label sequence, namely the entity extraction result.