CN115759092A - Network threat information named entity identification method based on ALBERT - Google Patents

Network threat information named entity identification method based on ALBERT Download PDF

Info

Publication number
CN115759092A
CN115759092A (application CN202211251727.8A)
Authority
CN
China
Prior art keywords
albert
layer
bilstm
vector
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211251727.8A
Other languages
Chinese (zh)
Inventor
周景贤
王曾琪
王双
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Civil Aviation University of China
Original Assignee
Civil Aviation University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Civil Aviation University of China filed Critical Civil Aviation University of China
Priority to CN202211251727.8A priority Critical patent/CN115759092A/en
Publication of CN115759092A publication Critical patent/CN115759092A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to an ALBERT-based named entity recognition method for cyber threat intelligence. To address the problems that traditional word embeddings cannot adequately represent word ambiguity, that domain feature extraction is insufficient, and that threat entity information is therefore difficult to recognize effectively, the invention fuses the existing ALBERT and BiLSTM-CRF methods into a named entity recognition model for cyber threat intelligence. In addition, in combination with the actual situation, a cyber threat intelligence entity data set (CTI-E) is manually annotated for feature learning and training of the model, alleviating the shortage of training word vectors. Comparison experiments verify that, at the same recognition accuracy, the proposed model's time and resource costs are far lower than those of prior models and methods, making it suitable for large-scale, efficient entity recognition tasks in the cyber threat intelligence field.

Description

Network threat information named entity identification method based on ALBERT
Technical Field
The invention belongs to the technical field of network security, and particularly relates to an ALBERT-based network threat information named entity identification method.
Background
With the explosive growth in the volume of cyber threat intelligence, manual analysis of such intelligence is time-consuming and labor-intensive, and most of it is distributed as unstructured text. If cyber threat intelligence is converted into a structured, machine-readable format, the efficiency of using it to counter cyber attacks can be improved. The most critical step in this process is to apply professional domain knowledge to identify the entities related to a cyber threat and their relationships, such as users, malicious programs, hacker organizations and vulnerabilities. Natural language processing has developed remarkably in recent years; in particular, Named Entity Recognition (NER) identifies words with special meanings in text, and combining cyber threat intelligence analysis with NER can greatly improve the efficiency of threat intelligence analysis work. However, directly applying natural language processing techniques, and NER methods in particular, to the threat intelligence field still faces many challenges:
Firstly, named entity recognition in the general domain typically identifies person, place and organization names, whereas cyber threat intelligence yields a complete attack chain only if many kinds of threat entities (such as hacker organizations, malicious tools and attack purposes) are identified, which places higher demands on both the data set and the recognition method. Secondly, entity recognition in professional domains currently relies on large-scale manual annotation of domain data sets, which is costly and difficult and yields low recognition precision. In network security research, although some studies have constructed data sets related to the network security field, those data sets cannot be applied to cyber threat intelligence entity recognition. Finally, the currently mainstream methods based on rules and dictionary templates depend mainly on rules hand-written by experts, can only be used in the specific fields for which the rules were specified, and suffer from low recognition precision and high labor and time costs.
Therefore, current named entity recognition methods cannot provide efficient and accurate recognition for threat intelligence, and it is difficult for them to meet the requirements of identifying and processing massive threat intelligence data.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing an ALBERT-based named entity recognition method for cyber threat intelligence that improves the efficiency and accuracy of threat intelligence named entity recognition and enriches the model's training word vectors.
The technical problem to be solved by the invention is addressed through the following technical scheme:
an ALBERT-based network threat information named entity identification method comprises the following steps:
step 1, collecting network threat information data and preprocessing the network threat information data to construct a data set;
step 2, constructing an ALBERT-BiLSTM-CRF model;
and step 3, training the ALBERT-BiLSTM-CRF model, extracting contextual syntactic and semantic feature information from a threat intelligence field corpus, and recognizing the cyber threat intelligence named entities.
Further, the step 1 includes the steps of:
step 1.1, data collection: collecting and parsing threat intelligence reports from open-source threat intelligence websites with a crawler tool, and cleaning out and deleting unusable report entities;
step 1.2, data annotation: labeling the threat intelligence in the threat intelligence reports with the Brat Rapid Annotation Tool (Brat), using BIO, the industry-standard labeling scheme in the text annotation field;
step 1.3, data set statistics: according to the characteristics of cyber threat intelligence and the empirical knowledge of domain experts, and in combination with the threat intelligence standard STIX, selecting and labeling 9 categories in the data set for classification: hacker organizations, attacks, network security teams, malicious tools, purposes, industries, attack modes, vulnerabilities and characteristics;
step 1.4, data set evaluation: taking the annotated and classified threat intelligence data set as raw data, selecting 70% of the raw text as the training set, 15% as the validation set and 15% as the test set, and adopting precision (P), recall (R) and the F1 score as the indexes for measuring the recognition performance of the ALBERT-BiLSTM-CRF model.
Moreover, the ALBERT-BiLSTM-CRF model in the step 2 comprises an ALBERT layer, a BiLSTM layer, an Attention layer and a CRF layer, and the construction method comprises the following steps:
step 2.1, constructing the ALBERT layer;
step 2.2, constructing the ALBERT-BiLSTM layer;
step 2.3, combining the constructed ALBERT-BiLSTM layer with an attention mechanism, introducing an attention matrix A to calculate the relation between the current target vector and all vectors in the sequence;
and step 2.4, constructing a CRF network as the sequence labeling layer of the attention-enhanced ALBERT-BiLSTM layer, which considers the relevance of the sentence context and ensures the accuracy of the sequence labels, thereby obtaining the ALBERT-BiLSTM-CRF model.
Moreover, said step 2.1 comprises the steps of:
step 2.1.1, in the pre-trained language model BERT, factorizing the word embedding parameter matrix into two small matrices;
step 2.1.2, replacing the NSP (Next Sentence Prediction) loss with an SOP (Sentence-Order Prediction) loss;
and step 2.1.3, cross-layer parameter sharing: the Transformer's fully connected layers and Attention layers are shared, i.e. all parameters of the hidden layers are shared.
Moreover, the specific implementation method of the step 2.2 is as follows: the construction is performed using a bidirectional LSTM network, which includes two unidirectional networks: the LSTM forward propagation network is used for calculating forward hidden features; the LSTM backpropagation network is used to compute the reverse hidden features.
Further, the step 3 includes the steps of:
step 3.1, taking the words of the sentences in the original unlabeled threat intelligence field corpus as the input of the ALBERT-BiLSTM-CRF model, training the ALBERT layer on the threat intelligence corpus data that has not undergone BIO labeling, and extracting contextual syntactic and semantic feature information to obtain dynamic word vectors;
step 3.2, inputting the word vectors into the BiLSTM layer to learn sequence feature information and obtain learned text vectors;
step 3.3, weighting the word vectors produced by the ALBERT layer and the learned text vectors produced by the BiLSTM layer through the Attention layer to obtain attention-weighted word vectors;
and step 3.4, decoding through the CRF layer to obtain and output the label sequence with the maximum probability.
Moreover, said step 3.1 comprises the steps of:
step 3.1.1, the input of the ALBERT-BiLSTM-CRF model is expressed as three parts: the word vector, generated from a word vector matrix according to the model dimension to represent the input word, so that it changes with the model dimension; the segment vector, used for the next-sentence prediction task, where the two sentences to be distinguished are marked with a [CLS] symbol at the beginning and a [SEP] symbol appended at the end of each sentence; and the position vector, which marks position information and solves the problem that the Transformer model cannot remember word order;
and step 3.1.2, after the character vector representation is obtained, passing it through several Transformer encoder layers to obtain the final output vector Xn of the ALBERT layer.
Moreover, the specific implementation method of the step 3.2 is as follows: the feature information calculated by the BiLSTM layer is fused to form the final hidden state, and the context information is considered at the same time to obtain the learned text.
Moreover, the specific implementation method of the step 3.3 is as follows:
step 3.3.1, after a linear transformation, aligning the currently processed word vector with all words mapped into the corresponding subspace;
step 3.3.2, normalizing the result to obtain the weight of each word, highlighting the effect of threat-intelligence-related keywords in the text;
step 3.3.3, introducing an attention matrix A to calculate the relation between the current target vector and all vectors in the sequence;
and step 3.3.4, comparing the current target vector x_t with the j-th vector x_j in the sequence to obtain the attention-weighted word vector r_{t,j} in the attention matrix.
The invention has the advantages and positive effects that:
the invention provides a named entity recognition model facing network threat information, aiming at the problems that the traditional word embedding can not well express word ambiguity and the domain feature extraction is insufficient and the threat entity information is difficult to be effectively recognized, and integrating the existing methods of ALBERT and BilSTM-CRF; meanwhile, in combination with the actual situation, a network threat information entity data set (CTI-E) is manually marked for feature learning and training of the model, and the problem of insufficient vector of the training words of the model is solved. Compared with the prior art model and method, the method has the advantages of greatly improving the time and resource cost of the model under the condition of the same identification accuracy rate through comparison experiment verification, and is suitable for mass and efficient entity identification tasks in the field of network threat information.
Drawings
FIG. 1 is a diagram of the ALBERT-BiLSTM-CRF (CTI-ALBC) model structure of the present invention;
FIG. 2 is a diagram of the ALBERT pre-trained language model architecture of the present invention;
FIG. 3 is a diagram of a Transformer encoding unit according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
A network threat information named entity identification method based on ALBERT comprises the following steps:
Step 1, collecting and preprocessing cyber threat intelligence data to construct a data set. A number of threat intelligence reports are collected and parsed from open-source threat intelligence websites; after the data are cleaned and mined, they are annotated to construct a data set for the threat intelligence field.
Step 1.1, data collection: threat intelligence reports are collected and parsed from open-source threat intelligence websites with a crawler tool, and unusable report entities are cleaned out and deleted.
Step 1.2, data annotation: the threat intelligence in the threat intelligence reports is labeled with the Brat Rapid Annotation Tool (Brat), a Web-based text annotation tool, using BIO, the industry-standard labeling scheme in the text annotation field.
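As a minimal illustration of the BIO scheme mentioned above, the sketch below tags a toy sentence; the tokens and the category abbreviations (HackGro, MalTool, Industry) are illustrative assumptions, not the patent's exact label inventory.

```python
# Hedged sketch: BIO tagging of a toy threat-intelligence sentence.
# Entity span labels here are hypothetical examples.
def bio_tags(tokens, entities):
    """entities: list of (start, end, label) token spans, end exclusive."""
    tags = ["O"] * len(tokens)          # O = outside any entity
    for start, end, label in entities:
        tags[start] = f"B-{label}"      # B- marks the beginning of an entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # I- marks its continuation
    return tags

tokens = ["APT28", "deployed", "X-Agent", "against", "aviation", "targets"]
entities = [(0, 1, "HackGro"), (2, 3, "MalTool"), (4, 5, "Industry")]
print(list(zip(tokens, bio_tags(tokens, entities))))
```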
Step 1.3, data set statistics: according to the characteristics of cyber threat intelligence and the empirical knowledge of domain experts, and in combination with the threat intelligence standard STIX, 9 categories are selected and labeled in the data set for classification: hacker organizations, attacks, network security teams, malicious tools, purposes, industries, attack modes, vulnerabilities and characteristics.
Step 1.4, data set evaluation: the annotated and classified threat intelligence data set is taken as raw data; 70% of the raw text is selected as the training set, 15% as the validation set and 15% as the test set; and precision (P), recall (R) and the F1 score are adopted as the indexes for measuring the recognition performance of the ALBERT-BiLSTM-CRF model.
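The evaluation indexes just described can be sketched as span-level precision, recall and F1; the gold/predicted spans below are invented for illustration only.

```python
# Hedged sketch: entity-level precision (P), recall (R) and F1
# over (start, end, label) spans; the toy spans are illustrative.
def prf1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # exact span-and-label matches
    p = tp / len(pred) if pred else 0.0         # precision
    r = tp / len(gold) if gold else 0.0         # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean
    return p, r, f1

gold = [(0, 1, "HackGro"), (2, 3, "MalTool"), (4, 5, "Industry")]
pred = [(0, 1, "HackGro"), (2, 3, "MalTool"), (5, 6, "Industry")]
p, r, f1 = prf1(gold, pred)
print(p, r, f1)  # two of the three spans match exactly
```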
Step 2, constructing the ALBERT-BiLSTM-CRF model. As shown in FIG. 1, the ALBERT-BiLSTM-CRF model includes an ALBERT layer, a BiLSTM layer, an Attention layer and a CRF layer.
Step 2.1, constructing the ALBERT layer. As shown in FIG. 2, the present invention introduces the pre-trained language model ALBERT to effectively reduce ambiguity in word expression. The method uses a multi-layer Transformer structure with an attention mechanism to perform unsupervised learning on the input corpora, obtaining feature vectors that carry a large amount of textual information from the threat intelligence field. These vectors better capture the meaning of words and the rich syntactic and semantic information of sentences.
Step 2.1.1, in the pre-trained language model BERT, the word embedding parameter matrix is factorized into two small matrices: the word vectors of vocabulary size V are first mapped to a low-dimensional space of size E and then projected into the high-dimensional hidden space of size H. In BERT, E is tied to H, but the word vectors do not need such a high dimension. Through factorization, the word embedding parameter count is reduced from O(V×H) to O(V×E + E×H); when H is much larger than E, the parameter count drops sharply, improving model efficiency.
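A rough sketch of the embedding-factorization arithmetic, using illustrative sizes (V = 30000, H = 768, E = 128, similar to common ALBERT configurations but assumed here rather than taken from the patent):

```python
# Hedged sketch: parameter counts before and after embedding factorization.
# V = vocabulary size, H = hidden size, E = low-dimensional embedding size.
V, H, E = 30000, 768, 128

bert_params = V * H            # unfactorized: one V x H embedding matrix
albert_params = V * E + E * H  # factorized: V x E followed by E x H

print(bert_params, albert_params)  # the factorized form is far smaller when H >> E
```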
Step 2.1.2, the NSP (Next Sentence Prediction) loss is replaced with an SOP (Sentence-Order Prediction) loss, which trains the model on sentence coherence rather than topic prediction.
Step 2.1.3, cross-layer parameter sharing: the Transformer's fully connected layers and Attention layers are shared, i.e. all parameters of the hidden layers are shared. After the parameters are shared, the number of model parameters is effectively reduced, improving model efficiency without significantly affecting model performance. The parameter calculation is shown in the following formula, where L is the number of layers:
O(12×L×H×H) → O(12×H×H)
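The parameter reduction from cross-layer sharing can be sketched numerically; the sizes H = 768 and L = 12 are illustrative assumptions, not values stated in the patent.

```python
# Hedged sketch: cross-layer parameter sharing keeps one set of encoder
# weights (roughly 12*H*H parameters) instead of one set per layer,
# so O(12*L*H*H) becomes O(12*H*H).
H, L = 768, 12

unshared = 12 * L * H * H  # independent weights in every layer
shared = 12 * H * H        # one shared set reused by all L layers

print(unshared // shared)  # sharing divides the encoder parameter count by L
```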
As shown in fig. 3, the implementation of the Transformer with shared fully connected layers is specifically as follows:
Step 2.1.3.1, the input text information is encoded based on Self-Attention to extract vector features, with the following calculation formula:
Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
wherein Q, K and V are the query, key and value vectors respectively, d_k is the dimension of the key vectors, and the formula computes the attention weights of Q over V and finally produces a weighted sum of all word value vectors.
Step 2.1.3.2, a Multi-Head attention mechanism is added, allowing the model to attend to information from different representation subspaces in parallel.
Step 2.1.3.3, a residual connection and a normalization layer are added in the Transformer, with the following calculation formulas:
LN(x_i) = α × (x_i − μ) / √(σ² + ε) + β
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
where α and β are the parameters to be learned, μ is the mean and σ² is the variance of the input layer, and ε is a small constant for numerical stability.
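The layer normalization and position-wise feed-forward computations can be sketched with scalar toy weights; the values of α, β, W_1, b_1, W_2, b_2 and the input below are illustrative assumptions.

```python
import math

# Hedged sketch: layer normalization LN(x) = alpha*(x-mu)/sqrt(var+eps) + beta
# and the feed-forward network FFN(x) = max(0, x*W1 + b1)*W2 + b2,
# with scalar (toy) weights instead of learned matrices.
def layer_norm(x, alpha=1.0, beta=0.0, eps=1e-12):
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    return [alpha * (xi - mu) / math.sqrt(var + eps) + beta for xi in x]

def ffn(x, W1, b1, W2, b2):
    hidden = [max(0.0, xi * W1 + b1) for xi in x]  # first linear map + ReLU
    return [hi * W2 + b2 for hi in hidden]         # second linear map

x = [1.0, 2.0, 3.0]
normed = layer_norm(x)
print(normed)  # roughly zero mean, unit variance
print(ffn(x, 1.0, -1.5, 2.0, 0.0))
```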
Step 2.1.3.4, the Transformer model input is expressed as three parts: a word vector, a segment vector and a position vector. The word vector is generated from a word vector matrix according to the model dimension to represent the input word, so it changes with the model dimension; the segment vector is used for the next-sentence prediction task, where the two sentences to be distinguished are marked with a [CLS] symbol at the beginning and a [SEP] symbol appended at the end of each sentence; the position vector marks position information, solving the problem that the Transformer model cannot remember word order.
Step 2.1.3.5, after the character vector representation is obtained, it passes through several Transformer encoder layers, finally yielding the output vector Xn of the ALBERT layer.
Step 2.2, constructing the ALBERT-BiLSTM layer.
The layer is constructed with a bidirectional LSTM network, which comprises two unidirectional networks: an LSTM forward propagation network that computes the forward hidden features, and an LSTM backward propagation network that computes the backward hidden features.
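As a toy illustration of the bidirectional idea, the sketch below runs a simple tanh recurrence (standing in for a full LSTM cell, which has gates this sketch omits) left-to-right and right-to-left and pairs the two hidden states per token; the weights are arbitrary.

```python
import math

# Hedged sketch: bidirectional recurrence with a plain tanh cell.
# A real BiLSTM uses gated LSTM cells; this only shows the two-direction scan.
def recurrent_scan(xs, w=0.5, u=0.5):
    h, out = 0.0, []
    for x in xs:
        h = math.tanh(w * x + u * h)  # hidden state carries context from one side
        out.append(h)
    return out

def bidirectional(xs):
    fwd = recurrent_scan(xs)                                   # left-to-right context
    bwd = list(reversed(recurrent_scan(list(reversed(xs)))))   # right-to-left context
    return list(zip(fwd, bwd))                                 # concatenated per token

print(bidirectional([1.0, -1.0, 2.0]))
```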
Step 2.3, combining the constructed ALBERT-BiLSTM layer with an attention mechanism, introducing an attention matrix A to calculate the relation between the current target vector and all vectors in the sequence.
Step 2.4, constructing a CRF network as the sequence labeling layer of the attention-enhanced ALBERT-BiLSTM layer, which considers the relevance of the sentence context and ensures the accuracy of the sequence labels, thereby obtaining the ALBERT-BiLSTM-CRF model.
Step 3, training the ALBERT-BiLSTM-CRF model, extracting contextual syntactic and semantic feature information from a corpus of the threat intelligence field, and recognizing the cyber threat intelligence named entities.
Step 3.1, the words of the sentences in the original unlabeled threat intelligence field corpus are taken as the input of the ALBERT-BiLSTM-CRF model; the ALBERT layer is trained on the threat intelligence corpus data that has not undergone BIO labeling, and contextual syntactic and semantic feature information is extracted to obtain dynamic word vectors.
Step 3.1.1, the input of the ALBERT-BiLSTM-CRF model is expressed as three parts: the word vector, generated according to the model dimension to represent the input word, which changes with the model dimension; the segment vector, used for the next-sentence prediction task, where the two sentences to be distinguished are marked with a [CLS] symbol at the beginning and a [SEP] symbol appended at the end of each sentence; and the position vector, which marks position information and solves the problem that the Transformer model cannot remember word order.
Step 3.1.2, after the character vector representation is obtained, it passes through several Transformer encoder layers to obtain the final output vector Xn of the ALBERT layer.
Step 3.2, the word vectors are input into the BiLSTM layer to learn sequence feature information and obtain the learned text: the feature information calculated by the BiLSTM layer is fused to form the final hidden state while the context information is considered, yielding the learned text.
Step 3.3, the word vectors obtained by the ALBERT layer and the learned text vectors are weighted through the Attention layer to obtain attention-weighted word vectors.
Step 3.3.1, after a linear transformation, the currently processed word vector is aligned with all words mapped into the corresponding subspace;
Step 3.3.2, the result is normalized to obtain the weight of each word, highlighting the role of threat-intelligence-related keywords in the text;
Step 3.3.3, an attention matrix A is introduced to calculate the relation between the current target vector and all vectors in the sequence;
Step 3.3.4, the current target vector x_t is compared with the j-th vector x_j in the sequence to obtain the attention-weighted word vector r_{t,j} in the attention matrix.
Step 3.4, decoding through the CRF layer obtains and outputs the label sequence with the maximum probability.
Wherein "Attacks" and "began" are the input word sequence; X_1, X_2, …, X_n, where X_n is the output vector of the ALBERT layer; h_1, h_2, …, h_n are the context representation vectors of the BiLSTM layer; S_t is the weighted value of vector h_t; a_1, a_2, …, a_n are the outputs of the attention layer; 0.3, 0.9, 0.6 and 0.1 are the model's predicted label probability values; and B-HackGro, B and O are the labels of the model's prediction results.
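Decoding the maximum-probability label sequence in a CRF layer is conventionally done with Viterbi search; the sketch below uses invented emission and transition scores over a three-tag BIO inventory (the scores are illustrative, not trained values from the patent).

```python
# Hedged sketch: Viterbi decoding of the best tag sequence under
# per-token emission scores plus tag-transition scores, as a CRF layer does.
def viterbi(emissions, transitions, tags):
    n = len(tags)
    score = list(emissions[0])          # scores for the first token
    backptrs = []
    for emit in emissions[1:]:
        prev = score
        score, ptrs = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + transitions[i][j])
            score.append(prev[best_i] + transitions[best_i][j] + emit[j])
            ptrs.append(best_i)         # remember the best predecessor tag
        backptrs.append(ptrs)
    path = [max(range(n), key=lambda j: score[j])]
    for ptrs in reversed(backptrs):     # walk the back-pointers to recover the path
        path.append(ptrs[path[-1]])
    return [tags[j] for j in reversed(path)]

tags = ["B-HackGro", "I-HackGro", "O"]
trans = [[-1.0, 0.5, 0.0],   # from B: B->I encouraged
         [-2.0, 0.3, 0.0],   # from I: I->B discouraged
         [0.2, -3.0, 0.1]]   # from O: O->I discouraged, enforcing valid BIO order
emis = [[2.0, 0.1, 0.5], [0.4, 1.5, 0.6], [0.2, 0.3, 1.8]]
print(viterbi(emis, trans, tags))
```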
It should be emphasized that the embodiments described herein are illustrative rather than restrictive, and thus the present invention is not limited to the embodiments described in the detailed description, but also includes other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art.

Claims (9)

1. A network threat information named entity identification method based on ALBERT is characterized in that: the method comprises the following steps:
step 1, collecting network threat information data and preprocessing the network threat information data to construct a data set;
step 2, constructing an ALBERT-BiLSTM-CRF model;
and step 3, training the ALBERT-BiLSTM-CRF model, extracting contextual syntactic and semantic feature information from a corpus of the threat intelligence field, and recognizing the cyber threat intelligence named entities.
2. The ALBERT-based cyber threat intelligence named entity recognition method of claim 1, wherein step 1 comprises the following steps:
step 1.1, data collection: collecting and parsing threat intelligence reports from open-source threat intelligence websites with a crawler tool, cleaning out and deleting unusable report entities, and constructing an original unlabeled threat intelligence field corpus data set;
step 1.2, data annotation: labeling the threat intelligence in the threat intelligence reports with the Brat Rapid Annotation Tool, selecting BIO, the industry-standard labeling scheme in the text annotation field, and constructing a BIO-labeled threat intelligence corpus data set;
step 1.3, data set statistics: according to the characteristics of cyber threat intelligence and the empirical knowledge of domain experts, and in combination with the threat intelligence standard STIX, selecting and labeling 9 categories in the BIO-labeled threat intelligence corpus data set for classification: hacker organizations, attacks, network security teams, malicious tools, purposes, industries, attack modes, vulnerabilities and characteristics;
and step 1.4, data set evaluation: taking the annotated and classified threat intelligence data set as raw data, selecting 70% of the raw text as the training set, 15% as the validation set and 15% as the test set, and adopting precision P, recall R and the F1 score as the indexes for measuring the recognition performance of the ALBERT-BiLSTM-CRF model.
3. The ALBERT-based cyber threat intelligence named entity recognition method of claim 1, wherein the ALBERT-BiLSTM-CRF model in step 2 comprises an ALBERT layer, a BiLSTM layer, an Attention layer and a CRF layer, and the construction method comprises the following steps:
step 2.1, constructing the ALBERT layer;
step 2.2, constructing the ALBERT-BiLSTM layer;
step 2.3, combining the constructed ALBERT-BiLSTM layer with an attention mechanism, introducing an attention matrix A to calculate the relationship between the current target vector and all vectors in the sequence;
and step 2.4, constructing a CRF network as the sequence labeling layer of the attention-enhanced ALBERT-BiLSTM layer, which considers the relevance of the sentence context and ensures the accuracy of the sequence labels, thereby obtaining the ALBERT-BiLSTM-CRF model.
4. The ALBERT-based cyber threat intelligence named entity recognition method of claim 3, wherein step 2.1 comprises the following steps:
step 2.1.1, in the pre-trained language model BERT, factorizing the word embedding parameter matrix into two small matrices;
step 2.1.2, replacing the NSP loss with an SOP loss;
and step 2.1.3, cross-layer parameter sharing: the Transformer's fully connected layers and Attention layers are shared, i.e. all parameters of the hidden layers are shared.
5. The ALBERT-based cyber threat intelligence named entity recognition method of claim 3, wherein the specific implementation of step 2.2 is as follows: the layer is constructed with a bidirectional LSTM network, which comprises two unidirectional networks: an LSTM forward propagation network that computes the forward hidden features, and an LSTM backward propagation network that computes the backward hidden features.
6. The ALBERT-based cyber threat intelligence named entity recognition method of claim 1, wherein step 3 comprises the following steps:
step 3.1, taking the data of the original unlabeled threat intelligence field corpus as the input of the ALBERT-BiLSTM-CRF model, training the ALBERT layer on the threat intelligence field corpus data not labeled with BIO, and extracting contextual syntactic and semantic feature information to obtain dynamic word vectors;
step 3.2, inputting the word vectors into the BiLSTM layer to learn sequence feature information and obtain learned text vectors;
step 3.3, weighting the word vectors obtained by the ALBERT layer and the learned text vectors obtained by the BiLSTM layer through the Attention layer to obtain attention-weighted word vectors;
and step 3.4, decoding through the CRF layer to obtain and output the label sequence with the maximum probability.
7. The ALBERT-based cyber threat intelligence named entity recognition method of claim 6, wherein step 3.1 comprises the following steps:
step 3.1.1, the input of the ALBERT-BiLSTM-CRF model is expressed as three parts: the word vector, generated from a word vector matrix according to the model dimension to represent the input word, which changes with the model dimension; the segment vector, used for the next-sentence prediction task, where the two sentences to be distinguished are marked with a [CLS] symbol at the beginning and a [SEP] symbol appended at the end of each sentence; and the position vector, which marks position information and solves the problem that the Transformer model cannot remember word order;
and step 3.1.2, after the character vector representation is obtained, passing it through several Transformer encoder layers to obtain the final output vector Xn of the ALBERT layer.
8. The ALBERT-based cyber threat intelligence named entity recognition method of claim 6, wherein the specific implementation of step 3.2 is as follows: the feature information calculated by the BiLSTM layer is fused to form the final hidden state, and the context information is considered at the same time to obtain the learned text.
9. The ALBERT-based cyber threat intelligence named entity recognition method of claim 6, wherein the specific implementation of step 3.3 is as follows:
step 3.3.1, after a linear transformation, aligning the currently processed word vector with all words mapped into the corresponding subspace;
step 3.3.2, normalizing the result to obtain the weight of each word, highlighting the role of threat-intelligence-related keywords in the text;
step 3.3.3, introducing an attention matrix A to calculate the relation between the current target vector and all vectors in the sequence;
and step 3.3.4, comparing the current target vector x_t with the j-th vector x_j in the sequence to obtain the attention-weighted word vector r_{t,j} in the attention matrix.
CN202211251727.8A 2022-10-13 2022-10-13 Network threat information named entity identification method based on ALBERT Pending CN115759092A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211251727.8A CN115759092A (en) 2022-10-13 2022-10-13 Network threat information named entity identification method based on ALBERT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211251727.8A CN115759092A (en) 2022-10-13 2022-10-13 Network threat information named entity identification method based on ALBERT

Publications (1)

Publication Number Publication Date
CN115759092A true CN115759092A (en) 2023-03-07

Family

ID=85351328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211251727.8A Pending CN115759092A (en) 2022-10-13 2022-10-13 Network threat information named entity identification method based on ALBERT

Country Status (1)

Country Link
CN (1) CN115759092A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116611436A (en) * 2023-04-18 2023-08-18 Guangzhou University Threat information-based network security named entity identification method
CN116611436B (en) * 2023-04-18 2024-07-09 Guangzhou University Threat information-based network security named entity identification method
CN117332785A (en) * 2023-10-10 2024-01-02 Shandong Computer Science Center (National Supercomputer Center in Jinan) Method for extracting entity and relation from network security threat information combination
CN117332785B (en) * 2023-10-10 2024-03-01 Shandong Computer Science Center (National Supercomputer Center in Jinan) Method for extracting entity and relation from network security threat information combination
CN117236333A (en) * 2023-10-17 2023-12-15 Harbin Institute of Technology (Weihai) Complex named entity identification method based on threat information
CN117236333B (en) * 2023-10-17 2024-08-09 Harbin Institute of Technology (Weihai) Complex named entity identification method based on threat information
CN117195876A (en) * 2023-11-03 2023-12-08 Beihang University Construction method and system of automobile network threat information corpus

Similar Documents

Publication Publication Date Title
CN111241837B (en) Theft case legal document named entity identification method based on anti-migration learning
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN111382565B (en) Emotion-reason pair extraction method and system based on multiple labels
CN110990564B (en) Negative news identification method based on emotion calculation and multi-head attention mechanism
CN112101028B (en) Multi-feature bidirectional gating field expert entity extraction method and system
CN112231447B (en) Method and system for extracting Chinese document events
CN115759092A (en) Network threat information named entity identification method based on ALBERT
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN110134946A (en) A kind of machine reading understanding method for complex data
CN117171333B (en) Electric power file question-answering type intelligent retrieval method and system
CN114818717B (en) Chinese named entity recognition method and system integrating vocabulary and syntax information
CN113360582B (en) Relation classification method and system based on BERT model fusion multi-entity information
CN115587594B (en) Unstructured text data extraction model training method and system for network security
CN111274829A (en) Sequence labeling method using cross-language information
CN115688752A (en) Knowledge extraction method based on multi-semantic features
CN115238697A (en) Judicial named entity recognition method based on natural language processing
CN114648029A (en) Electric power field named entity identification method based on BiLSTM-CRF model
CN115934883A (en) Entity relation joint extraction method based on semantic enhancement and multi-feature fusion
CN114169447B (en) Event detection method based on self-attention convolution bidirectional gating cyclic unit network
CN116663539A (en) Chinese entity and relationship joint extraction method and system based on Roberta and pointer network
CN111325036A (en) Emerging technology prediction-oriented evidence fact extraction method and system
CN112528003B (en) Multi-item selection question-answering method based on semantic sorting and knowledge correction
CN117851567A (en) Zero sample table retrieval method based on field adaptation
CN116822513A (en) Named entity identification method integrating entity types and keyword features
Yu et al. Multi-module Fusion Relevance Attention Network for Multi-label Text Classification.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination