CN116933791A - Network security entity identification method based on BERT-BiLSTM-GAM-CRF - Google Patents


Info

Publication number
CN116933791A
CN116933791A
Authority
CN
China
Prior art keywords
bilstm
bert
crf
output
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310655134.6A
Other languages
Chinese (zh)
Inventor
尚文利
龚致贤
朱鹏程
揭海
曹忠
张曼
浣沙
张梦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN202310655134.6A
Publication of CN116933791A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a network security entity identification method based on BERT-BiLSTM-GAM-CRF. A pre-trained BERT model captures context information and produces a representation vector for each word; these vectors are fed into a BiLSTM, which encodes the word sequence into a more global, context-aware feature representation. A global attention mechanism (GAM) then extracts key context representations from the BiLSTM output, reducing the influence of irrelevant context, and a CRF models the dependency between the current tag state and its context, producing more accurate labeling results.

Description

Network security entity identification method based on BERT-BiLSTM-GAM-CRF
Technical Field
The invention relates to the technical field of network security, in particular to a network security entity identification method based on BERT-BiLSTM-GAM-CRF.
Background
Network security entity identification (Network Security Entity Recognition) is a process in the field of network security that uses natural language processing and machine learning techniques to analyze information such as network packets and logs in order to identify, classify, and label entities related to network security. It is an important task in the field and differs to some extent from traditional named entity recognition (Named Entity Recognition, NER). It is mainly applied to network security situation awareness, intrusion detection, threat intelligence, and related areas, and can help improve the capability and efficiency of network security defense.
NER technology is widely used in information extraction, knowledge graph construction, search engine optimization, and related fields. In the field of network security, it is also widely applied to the analysis and processing of threat intelligence. Traditional threat intelligence processing relies mainly on manual collection and analysis, but as network attacks grow more complex and diverse, manual methods can no longer process threat intelligence quickly and accurately. NER technology can effectively improve the efficiency of collecting and organizing threat intelligence, providing enterprises and organizations with more accurate and timely threat intelligence support.
In recent years, many advanced models and techniques have been proposed and applied to named entity recognition tasks, such as convolutional neural networks (Convolutional Neural Network, CNN), recurrent neural networks (Recurrent Neural Network, RNN), and long short-term memory networks (Long Short-Term Memory, LSTM), which automatically learn feature representations of the input text sequence and perform entity recognition. This approach generalizes and scales well but requires large amounts of labeled data and computing resources. In addition, pre-trained language models such as BERT, RoBERTa, and ALBERT can be fine-tuned on the NER task. This avoids training a model from scratch and allows pre-training on large-scale unlabeled data, improving the model's generalization ability and effectiveness.
Disclosure of Invention
The invention provides a network security entity identification method based on a BERT-BiLSTM-GAM-CRF model. Text data from a threat intelligence dataset is converted into vector representations and fed to a pre-trained BERT model that captures context information; a BiLSTM network encodes the word sequence into a more global, context-aware feature representation; a global attention mechanism (GAM) further extracts key context representations; and a CRF decoder maps the features onto tag sequences. Compared with traditional entity recognition methods, this method has better generalization ability and higher accuracy.
The technical scheme of the invention is realized in the following way: the network security entity identification method based on BERT-BiLSTM-GAM-CRF comprises the following steps:
s1, acquiring text data of a dataset;
s2, extracting text features;
s3, carrying out weighted pooling on the output of the BiLSTM network;
s4, mapping the feature vector of each word onto the tag sequence of each word by using CRF decoding;
and S5, outputting the processing result after labeling is finished.
Preferably, in step S1, the text data in the dataset is converted into vector representations, and the input vector sequence is fed into a pre-trained BERT model to obtain context information, yielding a representation vector for each word, $x_1, x_2, \ldots, x_n$, computed by the following formula:

$H_{bert} = \mathrm{BERT}(x_1, x_2, \ldots, x_n)$ (1)

where $H_{bert}$ denotes the output of BERT, i.e., the encoded representation of the input sequence $x_1, x_2, \ldots, x_n$, and $x_n$ denotes the n-th word or token in the input sequence.
Preferably, in step S2, each word vector output by the BERT model is used as input to a BiLSTM network. The bidirectional long short-term memory network (BiLSTM) computes the hidden states of a forward LSTM and a backward LSTM in the two directions and concatenates their output sequences, encoding the word sequence into a more global, context-aware feature representation and yielding the output of the BiLSTM:

$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(x_i, \overrightarrow{h_{i-1}})$ (2)

$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(x_i, \overleftarrow{h_{i+1}})$ (3)

$H_{bilstm} = [\overrightarrow{h_i}; \overleftarrow{h_i}]$ (4)

where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ denote the computations of the forward and backward LSTM, respectively, $H_{bilstm}$ denotes the hidden state of the BiLSTM at time step $i$, $[\cdot\,;\cdot]$ denotes the concatenation (splice) operation, and $\oplus$ denotes element-wise addition.
Preferably, in S3, a global attention mechanism is applied to the output of the BiLSTM to further extract key context representations and reduce the influence of irrelevant context.
Preferably, the global attention mechanism performs a weighted summation over the hidden state vectors at all time steps of the BiLSTM output to obtain a global context representation vector:

$g = \mathrm{softmax}(W_g h)$ (5)

where $g$ denotes the global attention vector, $W_g$ denotes the global attention weight matrix, and $h$ denotes the output sequence of the BiLSTM. A representation of the entire input sequence is obtained by taking a global-attention-weighted average over the BiLSTM output: $W_g$ is a weight matrix of size $d \times d$, where $d$ is the output dimension of the BiLSTM; $h$ is a matrix of size $n \times d$, where $n$ is the length of the input sequence. Multiplying $W_g$ with $h$ and applying a softmax to the result yields the global attention vector $g$ of size $1 \times d$. Since $g$ reflects the importance of the whole input sequence, the input sequence is weighted and averaged with it to obtain a representation of the entire sequence.
Preferably, in S4, a conditional random field (CRF) is used as the decoder to map the feature vector of each word onto its tag sequence:

$P(y|x) = \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{n} \sum_{j=1}^{k} \theta_j f_j(y_{i-1}, y_i, x, i) \Big)$ (6)

where $y = (y_1, y_2, \ldots, y_n)$ is the tag sequence, $x = (x_1, x_2, \ldots, x_n)$ is the input sequence, $f_j$ is a feature function, $\theta_j$ is the corresponding weight parameter, $k$ is the number of labels, $Z(x)$ is a normalization factor, and $P(y|x)$ denotes the probability of the output sequence $y$ given the input sequence $x$.
Preferably, in step S5, during training the BERT-BiLSTM-GAM-CRF model optimizes its parameters by minimizing the distance between the labeling result and the real tag sequence; during testing, given an input text, the model automatically identifies the named entities in the text and outputs their type labels; the labeling result is then output.
Using the DNRTI threat intelligence entity recognition dataset, the invention proposes an entity recognition method based on the BERT-BiLSTM-GAM-CRF model. The text of the DNRTI dataset is converted into vector representations and fed to a pre-trained BERT model, which captures the rich semantics of the input text from its context. A BiLSTM network encodes the word sequence into a more global, context-aware feature representation, and a global attention mechanism (GAM) is introduced to further extract key context representations; compared with conventional attention mechanisms, GAM can take the context of the entire text into account and thus identify named entities more accurately. Finally, a CRF is used as the decoder; compared with traditional rule-based or classifier-based methods, the CRF makes better use of context information and sequence structure for labeling. By obtaining context information from a pre-trained model, fully mining sequence information, introducing a global attention mechanism, and using a CRF decoder, the invention achieves better performance on NER tasks.
Drawings
The invention is further described below with reference to the accompanying drawings. The embodiments do not limit the invention in any way, and a person of ordinary skill in the art can derive other drawings from the following drawings without inventive effort.
FIG. 1 is a flow chart of an implementation of the network security entity identification method based on BERT-BiLSTM-GAM-CRF of the present invention;
FIG. 2 is a graph of a distribution of the number of tag entities in a dataset of the present invention.
Detailed Description
The network security entity identification method based on BERT-BiLSTM-GAM-CRF is described in further detail below with reference to specific embodiments, which are provided for illustration and explanation only; the invention is not limited to these embodiments.
As shown in fig. 1, the network security entity identification method based on BERT-BiLSTM-GAM-CRF specifically includes the steps of:
s1, acquiring text data of a dataset;
s2, extracting text features;
s3, carrying out weighted pooling on the output of the BiLSTM network;
s4, mapping the feature vector of each word onto the tag sequence of each word by using CRF decoding;
and S5, outputting the processing result after labeling is finished.
Preferably, in step S1, the text data in the dataset is converted into vector representations, and the input vector sequence is fed into a pre-trained BERT model to obtain context information, yielding a representation vector for each word, $x_1, x_2, \ldots, x_n$, computed by the following formula:

$H_{bert} = \mathrm{BERT}(x_1, x_2, \ldots, x_n)$ (1)

where $H_{bert}$ denotes the output of BERT, i.e., the encoded representation of the input sequence $x_1, x_2, \ldots, x_n$, and $x_n$ denotes the n-th word or token in the input sequence.
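As a concrete illustration of step S1, the following Python sketch obtains per-token BERT representations with the HuggingFace transformers library; the checkpoint name, framework, and example sentence are assumptions for illustration, as the patent does not specify an implementation.

```python
# Hypothetical sketch of step S1: contextual word vectors from a pre-trained
# BERT model (checkpoint and library are assumptions, not from the patent).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentence = "APT28 deployed X-Agent malware against government networks"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = bert(**inputs)

# One contextualized vector per word-piece token, playing the role of
# H_bert = BERT(x_1, ..., x_n) in formula (1); shape (1, n, 768).
h_bert = outputs.last_hidden_state
print(h_bert.shape)
```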
In this embodiment, the dataset used by the invention is DNRTI (a large-scale dataset for named entity recognition in threat intelligence), which contains network traffic data from a real environment. The dataset has 175,220 words, and all entities are classified into 13 categories, for a total of 27 tag types. 70% of the original text is randomly selected as the training set, 15% as the validation set, and 15% as the test set. The distribution of the number of tag entities in the dataset is shown in FIG. 2.
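The patent gives no loading code for DNRTI; the sketch below shows one plausible way to perform the random 70/15/15 split described above, assuming the corpus has already been read into a list of (sentence, tags) pairs.

```python
# Hypothetical 70/15/15 split of DNRTI samples (loading format assumed).
import random

def split_dataset(samples, seed=42):
    """Randomly split samples into 70% train, 15% validation, 15% test."""
    samples = list(samples)                  # copy; leave caller's list intact
    random.Random(seed).shuffle(samples)
    n_train = int(0.70 * len(samples))
    n_val = int(0.15 * len(samples))
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test
```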
Preferably, in step S2, each word vector output by the BERT model is used as input to a BiLSTM network. The bidirectional long short-term memory network (BiLSTM) computes the hidden states of a forward LSTM and a backward LSTM in the two directions and concatenates their output sequences, encoding the word sequence into a more global, context-aware feature representation and yielding the output of the BiLSTM:

$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(x_i, \overrightarrow{h_{i-1}})$ (2)

$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(x_i, \overleftarrow{h_{i+1}})$ (3)

$H_{bilstm} = [\overrightarrow{h_i}; \overleftarrow{h_i}]$ (4)

where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ denote the computations of the forward and backward LSTM, respectively, $H_{bilstm}$ denotes the hidden state of the BiLSTM at time step $i$, $[\cdot\,;\cdot]$ denotes the concatenation (splice) operation, and $\oplus$ denotes element-wise addition.
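A minimal PyTorch sketch of step S2 follows; the hidden size and sequence length are illustrative assumptions, since the patent does not state hyperparameters.

```python
# Hypothetical sketch of step S2: BiLSTM over BERT outputs.
import torch
import torch.nn as nn

bert_dim, hidden_dim = 768, 256              # assumed sizes

# bidirectional=True runs a forward and a backward LSTM and concatenates
# their hidden states, matching the splice [.;.] in formula (4).
bilstm = nn.LSTM(input_size=bert_dim, hidden_size=hidden_dim,
                 batch_first=True, bidirectional=True)

h_bert = torch.randn(1, 20, bert_dim)        # stand-in for the BERT output
h_bilstm, _ = bilstm(h_bert)                 # shape (1, 20, 2 * hidden_dim)
```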
Preferably, in S3, a global attention mechanism is applied to the output of the BiLSTM to further extract key context representations and reduce the influence of irrelevant context.
GAM further extracts key context representations based on the output of the BiLSTM, reducing the influence of irrelevant context. GAM can be viewed as a weighted pooling operation: the hidden state vectors at all time steps of the BiLSTM output are weighted and summed to obtain a global context representation vector.
Preferably, the global attention mechanism performs a weighted summation over the hidden state vectors at all time steps of the BiLSTM output to obtain a global context representation vector:

$g = \mathrm{softmax}(W_g h)$ (5)

where $g$ denotes the global attention vector, $W_g$ denotes the global attention weight matrix, and $h$ denotes the output sequence of the BiLSTM. A representation of the entire input sequence is obtained by taking a global-attention-weighted average over the BiLSTM output: $W_g$ is a weight matrix of size $d \times d$, where $d$ is the output dimension of the BiLSTM; $h$ is a matrix of size $n \times d$, where $n$ is the length of the input sequence. Multiplying $W_g$ with $h$ and applying a softmax to the result yields the global attention vector $g$ of size $1 \times d$. Since $g$ reflects the importance of the whole input sequence, the input sequence is weighted and averaged with it to obtain a representation of the entire sequence.
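Formula (5) leaves the reduction that produces the attention weights implicit; the sketch below is one reading of it, assuming a single unbatched sequence and assigning one scalar weight per time step so that the weighted average matches the pooling description above.

```python
# Hypothetical sketch of the global attention pooling in formula (5);
# the per-time-step score reduction is an interpretation, not verbatim.
import torch

n, d = 20, 512                               # sequence length, BiLSTM dim
h = torch.randn(n, d)                        # BiLSTM output sequence
W_g = torch.randn(d, d)                      # global attention weight matrix

scores = (h @ W_g).sum(dim=1)                # one score per time step
g = torch.softmax(scores, dim=0)             # attention weights over steps
context = g @ h                              # global context vector, size d
```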
Preferably, in S4, a conditional random field (CRF) is used as the decoder to map the feature vector of each word onto its tag sequence:

$P(y|x) = \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{n} \sum_{j=1}^{k} \theta_j f_j(y_{i-1}, y_i, x, i) \Big)$ (6)

where $y = (y_1, y_2, \ldots, y_n)$ is the tag sequence, $x = (x_1, x_2, \ldots, x_n)$ is the input sequence, $f_j$ is a feature function, $\theta_j$ is the corresponding weight parameter, $k$ is the number of labels, $Z(x)$ is a normalization factor, and $P(y|x)$ denotes the probability of the output sequence $y$ given the input sequence $x$.
The CRF takes into account the dependency between the current tag state and its context, thereby producing a more accurate labeling result.
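For step S4, the third-party pytorch-crf package provides a linear-chain CRF layer; using it here is an assumption for illustration, as the patent names no implementation.

```python
# Hypothetical sketch of CRF decoding for step S4 (pytorch-crf assumed).
import torch
from torchcrf import CRF

num_tags = 27                                # DNRTI tag types
crf = CRF(num_tags, batch_first=True)

# Per-token tag scores, e.g. a linear projection of the GAM/BiLSTM features.
emissions = torch.randn(1, 20, num_tags)
best_path = crf.decode(emissions)            # most likely tag sequence
```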
Preferably, in step S5, during training the BERT-BiLSTM-GAM-CRF model optimizes its parameters by minimizing the distance between the labeling result and the real tag sequence; during testing, given an input text, the model automatically identifies the named entities in the text and outputs their type labels; the labeling result is then output.
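A sketch of the training objective follows, again assuming the pytorch-crf layer: minimizing the negative log-likelihood of the true tag sequence is one standard way to realize "minimizing the distance between the labeling result and the real tag sequence".

```python
# Hypothetical training step for S5: CRF negative log-likelihood loss.
import torch
from torchcrf import CRF

num_tags = 27
crf = CRF(num_tags, batch_first=True)
optimizer = torch.optim.Adam(crf.parameters(), lr=1e-3)

emissions = torch.randn(2, 20, num_tags)     # toy per-token tag scores
tags = torch.randint(0, num_tags, (2, 20))   # toy gold tag sequences

optimizer.zero_grad()
loss = -crf(emissions, tags)                 # negative log-likelihood
loss.backward()
optimizer.step()
```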
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention and not to limit its scope. Although the invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solution of the invention without departing from its spirit and scope.

Claims (7)

1. The network security entity identification method based on BERT-BiLSTM-GAM-CRF is characterized by comprising the following steps:
s1, acquiring text data of a dataset;
s2, extracting text features;
s3, carrying out weighted pooling on the output of the BiLSTM network;
s4, mapping the feature vector of each word onto the tag sequence of each word by using CRF decoding;
and S5, outputting the processing result after labeling is finished.
2. The network security entity identification method based on BERT-BiLSTM-GAM-CRF according to claim 1, wherein in S1, the text data in the dataset is converted into vector representations, and the input vector sequence is fed into a pre-trained BERT model to obtain context information, yielding a representation vector for each word, $x_1, x_2, \ldots, x_n$, computed by the following formula:

$H_{bert} = \mathrm{BERT}(x_1, x_2, \ldots, x_n)$ (1)

where $H_{bert}$ denotes the output of BERT, i.e., the encoded representation of the input sequence $x_1, x_2, \ldots, x_n$, and $x_n$ denotes the n-th word or token in the input sequence.
3. The network security entity identification method based on BERT-BiLSTM-GAM-CRF according to claim 1, wherein in S2, each word vector output by the BERT model is used as input to a BiLSTM network; the bidirectional long short-term memory network (BiLSTM) computes the hidden states of a forward LSTM and a backward LSTM in the two directions and concatenates their output sequences, encoding the word sequence into a more global, context-aware feature representation and yielding the output of the BiLSTM:

$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(x_i, \overrightarrow{h_{i-1}})$ (2)

$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(x_i, \overleftarrow{h_{i+1}})$ (3)

$H_{bilstm} = [\overrightarrow{h_i}; \overleftarrow{h_i}]$ (4)

where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ denote the computations of the forward and backward LSTM, respectively, $H_{bilstm}$ denotes the hidden state of the BiLSTM at time step $i$, $[\cdot\,;\cdot]$ denotes the concatenation (splice) operation, and $\oplus$ denotes element-wise addition.
4. The BERT-BiLSTM-GAM-CRF based network security entity identification method of claim 1, wherein in S3, based on the output of BiLSTM, a global attention mechanism is used to further extract critical context representations and reduce irrelevant context effects.
5. The network security entity identification method based on BERT-BiLSTM-GAM-CRF according to claim 1, wherein the global attention mechanism performs a weighted summation over the hidden state vectors at all time steps of the BiLSTM output to obtain a global context representation vector:

$g = \mathrm{softmax}(W_g h)$ (5)

where $g$ denotes the global attention vector, $W_g$ denotes the global attention weight matrix, and $h$ denotes the output sequence of the BiLSTM. A representation of the entire input sequence is obtained by taking a global-attention-weighted average over the BiLSTM output: $W_g$ is a weight matrix of size $d \times d$, where $d$ is the output dimension of the BiLSTM; $h$ is a matrix of size $n \times d$, where $n$ is the length of the input sequence. Multiplying $W_g$ with $h$ and applying a softmax to the result yields the global attention vector $g$ of size $1 \times d$. Since $g$ reflects the importance of the whole input sequence, the input sequence is weighted and averaged with it to obtain a representation of the entire sequence.
6. The network security entity identification method based on BERT-BiLSTM-GAM-CRF according to claim 1, wherein in S4, a conditional random field (CRF) is used as the decoder to map the feature vector of each word onto its tag sequence:

$P(y|x) = \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{n} \sum_{j=1}^{k} \theta_j f_j(y_{i-1}, y_i, x, i) \Big)$ (6)

where $y = (y_1, y_2, \ldots, y_n)$ is the tag sequence, $x = (x_1, x_2, \ldots, x_n)$ is the input sequence, $f_j$ is a feature function, $\theta_j$ is the corresponding weight parameter, $k$ is the number of labels, $Z(x)$ is a normalization factor, and $P(y|x)$ denotes the probability of the output sequence $y$ given the input sequence $x$.
7. The network security entity identification method based on BERT-BiLSTM-GAM-CRF according to claim 1, wherein in S5, during training the BERT-BiLSTM-GAM-CRF model optimizes its parameters by minimizing the distance between the labeling result and the real tag sequence; during testing, given an input text, the model automatically identifies the named entities in the text and outputs their type labels; the labeling result is then output.
CN202310655134.6A 2023-06-02 2023-06-02 Network security entity identification method based on BERT-BiLSTM-GAM-CRF Pending CN116933791A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310655134.6A CN116933791A (en) 2023-06-02 2023-06-02 Network security entity identification method based on BERT-BiLSTM-GAM-CRF

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310655134.6A CN116933791A (en) 2023-06-02 2023-06-02 Network security entity identification method based on BERT-BiLSTM-GAM-CRF

Publications (1)

Publication Number Publication Date
CN116933791A (en) 2023-10-24

Family

ID=88374570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310655134.6A Pending CN116933791A (en) 2023-06-02 2023-06-02 Network security entity identification method based on BERT-BiLSTM-GAM-CRF

Country Status (1)

Country Link
CN (1) CN116933791A (en)

Similar Documents

Publication Publication Date Title
CN111859978B (en) Deep learning-based emotion text generation method
CN113051929A (en) Entity relationship extraction method based on fine-grained semantic information enhancement
CN113626589B (en) Multi-label text classification method based on mixed attention mechanism
CN113705218B (en) Event element gridding extraction method based on character embedding, storage medium and electronic device
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
CN113743119B (en) Chinese named entity recognition module, method and device and electronic equipment
CN111506732A (en) Text multi-level label classification method
CN115292568B (en) Civil news event extraction method based on joint model
CN113656700A (en) Hash retrieval method based on multi-similarity consistent matrix decomposition
Wu et al. TDv2: a novel tree-structured decoder for offline mathematical expression recognition
CN115328782A (en) Semi-supervised software defect prediction method based on graph representation learning and knowledge distillation
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
CN116385946B (en) Video-oriented target fragment positioning method, system, storage medium and equipment
CN117516937A (en) Rolling bearing unknown fault detection method based on multi-mode feature fusion enhancement
CN115422945A (en) Rumor detection method and system integrating emotion mining
CN115934883A (en) Entity relation joint extraction method based on semantic enhancement and multi-feature fusion
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
CN116933791A (en) Network security entity identification method based on BERT-BiLSTM-GAM-CRF
CN113377908B (en) Method for extracting aspect-level emotion triple based on learnable multi-word pair scorer
CN114676252A (en) Extreme multi-label learning method based on space-time network clustering reduction integration
CN114691895A (en) Criminal case entity relationship joint extraction method based on pointer network
CN114896969A (en) Method for extracting aspect words based on deep learning
CN115169363A (en) Knowledge-fused incremental coding dialogue emotion recognition method
CN113505937A (en) Multi-view encoder-based legal decision prediction system and method
Sun et al. Task-Oriented Explainable Semantic Communications Based on Structured Scene Graphs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination