CN116933791A - Network security entity identification method based on BERT-BiLSTM-GAM-CRF - Google Patents


Info

Publication number
CN116933791A
CN116933791A
Authority
CN
China
Prior art keywords
bilstm
bert
crf
output
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310655134.6A
Other languages
Chinese (zh)
Inventor
尚文利
龚致贤
朱鹏程
揭海
曹忠
张曼
浣沙
张梦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN202310655134.6A
Publication of CN116933791A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a network security entity identification method based on BERT-BiLSTM-GAM-CRF. A pre-trained BERT model captures context information and produces a representation vector for each word; these vectors are fed into a BiLSTM, which encodes the word sequence into a more global, context-aware feature representation. A global attention mechanism (GAM) then extracts key context representations from the BiLSTM output, reducing the influence of irrelevant context, and a CRF models the dependency between the current tag state and its context, producing more accurate labeling results.

Description

Network security entity identification method based on BERT-BiLSTM-GAM-CRF
Technical Field
The invention relates to the technical field of network security, in particular to a network security entity identification method based on BERT-BiLSTM-GAM-CRF.
Background
Network security entity identification (Network Security Entity Recognition) is a process in the field of network security that uses natural language processing and machine learning techniques to analyze information such as network packets and logs in order to identify, classify, and label entities related to network security. It is an important task in the field and differs to some extent from traditional named entity recognition (Named Entity Recognition, NER). It is mainly applied to network security situation awareness, intrusion detection, threat intelligence, and related areas, and can help improve the capability and efficiency of network security defense.
NER technology is widely used in information extraction, knowledge graph construction, search engine optimization, and related fields. In the field of network security, it is also widely applied to the analysis and processing of threat intelligence. Traditional threat intelligence processing relies mainly on manual collection and analysis, but as network attacks grow more complex and diverse, manual methods can no longer process threat intelligence quickly and accurately. NER technology can effectively improve the efficiency of collecting and organizing threat intelligence, providing enterprises and organizations with more accurate and timely threat intelligence support.
In recent years, many advanced models and techniques have been proposed and applied to named entity recognition tasks, such as convolutional neural networks (Convolutional Neural Network, CNN), recurrent neural networks (Recurrent Neural Network, RNN), and long short-term memory networks (Long Short-Term Memory, LSTM), which automatically learn feature representations of the input text sequence and perform entity recognition. This approach generalizes and scales well but requires large amounts of labeled data and computing resources. In addition, pre-trained language models such as BERT, RoBERTa, and ALBERT can be fine-tuned on the NER task. This avoids training a model from scratch and allows pre-training on large-scale unlabeled data, improving the model's generalization ability and effectiveness.
Disclosure of Invention
The invention provides a network security entity identification method based on a BERT-BiLSTM-GAM-CRF model. Text data from a threat intelligence dataset is converted into vector representations and fed to a pre-trained BERT model that captures context information; a BiLSTM network encodes the word sequence into a more global, context-aware feature representation; a global attention mechanism (GAM) further extracts key context representations; and a CRF decoder maps the features onto tag sequences. Compared with traditional entity recognition methods, this method has better generalization ability and higher accuracy.
The technical scheme of the invention is realized in the following way: the network security entity identification method based on BERT-BiLSTM-GAM-CRF comprises the following steps:
s1, acquiring text data of a dataset;
s2, extracting text features;
s3, carrying out weighted pooling on the output of the BiLSTM network;
s4, mapping the feature vector of each word onto the tag sequence of each word by using CRF decoding;
and S5, outputting the processing result after labeling is finished.
Preferably, in step S1, the text data in the dataset is converted into vector representations, and the input vector sequence is fed into a pre-trained BERT model to obtain context information, yielding a representation vector for each word, $x_1, x_2, \ldots, x_n$, computed by the following formula:

$H_{bert} = \mathrm{BERT}(x_1, x_2, \ldots, x_n)$ (1)

where $H_{bert}$ denotes the output of BERT, i.e., the encoded representation of the input sequence $x_1, x_2, \ldots, x_n$, and $x_n$ denotes the n-th word or token in the input sequence.
Preferably, in step S2, each word vector output by the BERT model is used as input to a BiLSTM network. The bidirectional long short-term memory network (BiLSTM) computes the hidden states of a forward LSTM and a backward LSTM in the two directions and concatenates their output sequences, encoding the word sequence into a more global, context-aware feature representation and yielding the output of the BiLSTM:

$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(x_i, \overrightarrow{h_{i-1}})$ (2)

$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(x_i, \overleftarrow{h_{i+1}})$ (3)

$H_{bilstm} = [\overrightarrow{h_i}; \overleftarrow{h_i}]$ (4)

where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ denote the computations of the forward and backward LSTM, respectively, $H_{bilstm}$ denotes the hidden state of the BiLSTM at time step $i$, $[\cdot\,;\cdot]$ denotes the concatenation (splice) operation, and $\oplus$ denotes element-wise addition.
Preferably, in S3, a global attention mechanism is applied to the output of the BiLSTM to further extract key context representations and reduce the influence of irrelevant context.
Preferably, the global attention mechanism performs a weighted summation over the hidden state vectors at all time steps of the BiLSTM output to obtain a global context representation vector:

$g = \mathrm{softmax}(W_g h)$ (5)

where $g$ denotes the global attention vector, $W_g$ denotes the global attention weight matrix, and $h$ denotes the output sequence of the BiLSTM. A representation of the entire input sequence is obtained by taking a global-attention-weighted average over the BiLSTM output: $W_g$ is a weight matrix of size $d \times d$, where $d$ is the output dimension of the BiLSTM; $h$ is a matrix of size $n \times d$, where $n$ is the length of the input sequence. Multiplying $W_g$ with $h$ and applying a softmax to the result yields the global attention vector $g$ of size $1 \times d$. Since $g$ reflects the importance of the whole input sequence, the input sequence is weighted and averaged with it to obtain a representation of the entire sequence.
Preferably, in S4, a conditional random field (CRF) is used as the decoder to map the feature vector of each word onto its tag sequence:

$P(y|x) = \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{n} \sum_{j=1}^{k} \theta_j f_j(y_{i-1}, y_i, x, i) \Big)$ (6)

where $y = (y_1, y_2, \ldots, y_n)$ is the tag sequence, $x = (x_1, x_2, \ldots, x_n)$ is the input sequence, $f_j$ is a feature function, $\theta_j$ is the corresponding weight parameter, $k$ is the number of labels, $Z(x)$ is a normalization factor, and $P(y|x)$ denotes the probability of the output sequence $y$ given the input sequence $x$.
Preferably, in step S5, during training the BERT-BiLSTM-GAM-CRF model optimizes its parameters by minimizing the distance between the labeling result and the real tag sequence; during testing, given an input text, the model automatically identifies the named entities in the text and outputs their type labels; the labeling result is then output.
Using the DNRTI threat intelligence entity recognition dataset, the invention proposes an entity recognition method based on the BERT-BiLSTM-GAM-CRF model. The text of the DNRTI dataset is converted into vector representations and fed to a pre-trained BERT model, which captures the rich semantics of the input text from its context. A BiLSTM network encodes the word sequence into a more global, context-aware feature representation, and a global attention mechanism (GAM) is introduced to further extract key context representations; compared with conventional attention mechanisms, GAM can take the context of the entire text into account and thus identify named entities more accurately. Finally, a CRF is used as the decoder; compared with traditional rule-based or classifier-based methods, the CRF makes better use of context information and sequence structure for labeling. By obtaining context information from a pre-trained model, fully mining sequence information, introducing a global attention mechanism, and using a CRF decoder, the invention achieves better performance on NER tasks.
Drawings
The invention is further described below with reference to the accompanying drawings. The embodiments do not limit the invention in any way, and a person of ordinary skill in the art can derive other drawings from the following drawings without inventive effort.
FIG. 1 is a flow chart of an implementation of the network security entity identification method based on BERT-BiLSTM-GAM-CRF of the present invention;
FIG. 2 is a graph of a distribution of the number of tag entities in a dataset of the present invention.
Detailed Description
The network security entity identification method based on BERT-BiLSTM-GAM-CRF is described in further detail below with reference to specific embodiments, which are provided for illustration and explanation only; the invention is not limited to these embodiments.
As shown in fig. 1, the network security entity identification method based on BERT-BiLSTM-GAM-CRF specifically includes the steps of:
s1, acquiring text data of a dataset;
s2, extracting text features;
s3, carrying out weighted pooling on the output of the BiLSTM network;
s4, mapping the feature vector of each word onto the tag sequence of each word by using CRF decoding;
and S5, outputting the processing result after labeling is finished.
Preferably, in step S1, the text data in the dataset is converted into vector representations, and the input vector sequence is fed into a pre-trained BERT model to obtain context information, yielding a representation vector for each word, $x_1, x_2, \ldots, x_n$, computed by the following formula:

$H_{bert} = \mathrm{BERT}(x_1, x_2, \ldots, x_n)$ (1)

where $H_{bert}$ denotes the output of BERT, i.e., the encoded representation of the input sequence $x_1, x_2, \ldots, x_n$, and $x_n$ denotes the n-th word or token in the input sequence.
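As a concrete illustration of step S1, the following Python sketch obtains per-token BERT representations with the HuggingFace transformers library; the checkpoint name, framework, and example sentence are assumptions for illustration, as the patent does not specify an implementation.

```python
# Hypothetical sketch of step S1: contextual word vectors from a pre-trained
# BERT model (checkpoint and library are assumptions, not from the patent).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentence = "APT28 deployed X-Agent malware against government networks"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = bert(**inputs)

# One contextualized vector per word-piece token, playing the role of
# H_bert = BERT(x_1, ..., x_n) in formula (1); shape (1, n, 768).
h_bert = outputs.last_hidden_state
print(h_bert.shape)
```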
In this embodiment, the dataset used by the invention is DNRTI (a large-scale dataset for named entity recognition in threat intelligence), which contains network traffic data from a real environment. The dataset has 175,220 words, and all entities are classified into 13 categories, for a total of 27 tag types. 70% of the original text is randomly selected as the training set, 15% as the validation set, and 15% as the test set. The distribution of the number of tag entities in the dataset is shown in FIG. 2.
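The patent gives no loading code for DNRTI; the sketch below shows one plausible way to perform the random 70/15/15 split described above, assuming the corpus has already been read into a list of (sentence, tags) pairs.

```python
# Hypothetical 70/15/15 split of DNRTI samples (loading format assumed).
import random

def split_dataset(samples, seed=42):
    """Randomly split samples into 70% train, 15% validation, 15% test."""
    samples = list(samples)                  # copy; leave caller's list intact
    random.Random(seed).shuffle(samples)
    n_train = int(0.70 * len(samples))
    n_val = int(0.15 * len(samples))
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test
```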
Preferably, in step S2, each word vector output by the BERT model is used as input to a BiLSTM network. The bidirectional long short-term memory network (BiLSTM) computes the hidden states of a forward LSTM and a backward LSTM in the two directions and concatenates their output sequences, encoding the word sequence into a more global, context-aware feature representation and yielding the output of the BiLSTM:

$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(x_i, \overrightarrow{h_{i-1}})$ (2)

$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(x_i, \overleftarrow{h_{i+1}})$ (3)

$H_{bilstm} = [\overrightarrow{h_i}; \overleftarrow{h_i}]$ (4)

where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ denote the computations of the forward and backward LSTM, respectively, $H_{bilstm}$ denotes the hidden state of the BiLSTM at time step $i$, $[\cdot\,;\cdot]$ denotes the concatenation (splice) operation, and $\oplus$ denotes element-wise addition.
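A minimal PyTorch sketch of step S2 follows; the hidden size and sequence length are illustrative assumptions, since the patent does not state hyperparameters.

```python
# Hypothetical sketch of step S2: BiLSTM over BERT outputs.
import torch
import torch.nn as nn

bert_dim, hidden_dim = 768, 256              # assumed sizes

# bidirectional=True runs a forward and a backward LSTM and concatenates
# their hidden states, matching the splice [.;.] in formula (4).
bilstm = nn.LSTM(input_size=bert_dim, hidden_size=hidden_dim,
                 batch_first=True, bidirectional=True)

h_bert = torch.randn(1, 20, bert_dim)        # stand-in for the BERT output
h_bilstm, _ = bilstm(h_bert)                 # shape (1, 20, 2 * hidden_dim)
```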
Preferably, in S3, a global attention mechanism is applied to the output of the BiLSTM to further extract key context representations and reduce the influence of irrelevant context.
GAM further extracts key context representations based on the output of the BiLSTM, reducing the influence of irrelevant context. GAM can be viewed as a weighted pooling operation: the hidden state vectors at all time steps of the BiLSTM output are weighted and summed to obtain a global context representation vector.
Preferably, the global attention mechanism performs a weighted summation over the hidden state vectors at all time steps of the BiLSTM output to obtain a global context representation vector:

$g = \mathrm{softmax}(W_g h)$ (5)

where $g$ denotes the global attention vector, $W_g$ denotes the global attention weight matrix, and $h$ denotes the output sequence of the BiLSTM. A representation of the entire input sequence is obtained by taking a global-attention-weighted average over the BiLSTM output: $W_g$ is a weight matrix of size $d \times d$, where $d$ is the output dimension of the BiLSTM; $h$ is a matrix of size $n \times d$, where $n$ is the length of the input sequence. Multiplying $W_g$ with $h$ and applying a softmax to the result yields the global attention vector $g$ of size $1 \times d$. Since $g$ reflects the importance of the whole input sequence, the input sequence is weighted and averaged with it to obtain a representation of the entire sequence.
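Formula (5) leaves the reduction that produces the attention weights implicit; the sketch below is one reading of it, assuming a single unbatched sequence and assigning one scalar weight per time step so that the weighted average matches the pooling description above.

```python
# Hypothetical sketch of the global attention pooling in formula (5);
# the per-time-step score reduction is an interpretation, not verbatim.
import torch

n, d = 20, 512                               # sequence length, BiLSTM dim
h = torch.randn(n, d)                        # BiLSTM output sequence
W_g = torch.randn(d, d)                      # global attention weight matrix

scores = (h @ W_g).sum(dim=1)                # one score per time step
g = torch.softmax(scores, dim=0)             # attention weights over steps
context = g @ h                              # global context vector, size d
```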
Preferably, in S4, a conditional random field (CRF) is used as the decoder to map the feature vector of each word onto its tag sequence:

$P(y|x) = \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{n} \sum_{j=1}^{k} \theta_j f_j(y_{i-1}, y_i, x, i) \Big)$ (6)

where $y = (y_1, y_2, \ldots, y_n)$ is the tag sequence, $x = (x_1, x_2, \ldots, x_n)$ is the input sequence, $f_j$ is a feature function, $\theta_j$ is the corresponding weight parameter, $k$ is the number of labels, $Z(x)$ is a normalization factor, and $P(y|x)$ denotes the probability of the output sequence $y$ given the input sequence $x$.
The CRF takes into account the dependency between the current tag state and its context, thereby producing a more accurate labeling result.
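For step S4, the third-party pytorch-crf package provides a linear-chain CRF layer; using it here is an assumption for illustration, as the patent names no implementation.

```python
# Hypothetical sketch of CRF decoding for step S4 (pytorch-crf assumed).
import torch
from torchcrf import CRF

num_tags = 27                                # DNRTI tag types
crf = CRF(num_tags, batch_first=True)

# Per-token tag scores, e.g. a linear projection of the GAM/BiLSTM features.
emissions = torch.randn(1, 20, num_tags)
best_path = crf.decode(emissions)            # most likely tag sequence
```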
Preferably, in step S5, during training the BERT-BiLSTM-GAM-CRF model optimizes its parameters by minimizing the distance between the labeling result and the real tag sequence; during testing, given an input text, the model automatically identifies the named entities in the text and outputs their type labels; the labeling result is then output.
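A sketch of the training objective follows, again assuming the pytorch-crf layer: minimizing the negative log-likelihood of the true tag sequence is one standard way to realize "minimizing the distance between the labeling result and the real tag sequence".

```python
# Hypothetical training step for S5: CRF negative log-likelihood loss.
import torch
from torchcrf import CRF

num_tags = 27
crf = CRF(num_tags, batch_first=True)
optimizer = torch.optim.Adam(crf.parameters(), lr=1e-3)

emissions = torch.randn(2, 20, num_tags)     # toy per-token tag scores
tags = torch.randint(0, num_tags, (2, 20))   # toy gold tag sequences

optimizer.zero_grad()
loss = -crf(emissions, tags)                 # negative log-likelihood
loss.backward()
optimizer.step()
```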
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention and not to limit its scope. Although the invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solution of the invention without departing from its spirit and scope.

Claims (7)

1. The network security entity identification method based on BERT-BiLSTM-GAM-CRF is characterized by comprising the following steps:
s1, acquiring text data of a dataset;
s2, extracting text features;
s3, carrying out weighted pooling on the output of the BiLSTM network;
s4, mapping the feature vector of each word onto the tag sequence of each word by using CRF decoding;
and S5, outputting the processing result after labeling is finished.
2. The network security entity identification method based on BERT-BiLSTM-GAM-CRF according to claim 1, wherein in S1, the text data in the dataset is converted into vector representations, and the input vector sequence is fed into a pre-trained BERT model to obtain context information, yielding a representation vector for each word, $x_1, x_2, \ldots, x_n$, computed by the following formula:

$H_{bert} = \mathrm{BERT}(x_1, x_2, \ldots, x_n)$ (1)

where $H_{bert}$ denotes the output of BERT, i.e., the encoded representation of the input sequence $x_1, x_2, \ldots, x_n$, and $x_n$ denotes the n-th word or token in the input sequence.
3. The network security entity identification method based on BERT-BiLSTM-GAM-CRF according to claim 1, wherein in S2, each word vector output by the BERT model is used as input to a BiLSTM network; the bidirectional long short-term memory network (BiLSTM) computes the hidden states of a forward LSTM and a backward LSTM in the two directions and concatenates their output sequences, encoding the word sequence into a more global, context-aware feature representation and yielding the output of the BiLSTM:

$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(x_i, \overrightarrow{h_{i-1}})$ (2)

$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(x_i, \overleftarrow{h_{i+1}})$ (3)

$H_{bilstm} = [\overrightarrow{h_i}; \overleftarrow{h_i}]$ (4)

where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ denote the computations of the forward and backward LSTM, respectively, $H_{bilstm}$ denotes the hidden state of the BiLSTM at time step $i$, $[\cdot\,;\cdot]$ denotes the concatenation (splice) operation, and $\oplus$ denotes element-wise addition.
4. The BERT-BiLSTM-GAM-CRF based network security entity identification method of claim 1, wherein in S3, based on the output of BiLSTM, a global attention mechanism is used to further extract critical context representations and reduce irrelevant context effects.
5. The network security entity identification method based on BERT-BiLSTM-GAM-CRF according to claim 1, wherein the global attention mechanism performs a weighted summation over the hidden state vectors at all time steps of the BiLSTM output to obtain a global context representation vector:

$g = \mathrm{softmax}(W_g h)$ (5)

where $g$ denotes the global attention vector, $W_g$ denotes the global attention weight matrix, and $h$ denotes the output sequence of the BiLSTM. A representation of the entire input sequence is obtained by taking a global-attention-weighted average over the BiLSTM output: $W_g$ is a weight matrix of size $d \times d$, where $d$ is the output dimension of the BiLSTM; $h$ is a matrix of size $n \times d$, where $n$ is the length of the input sequence. Multiplying $W_g$ with $h$ and applying a softmax to the result yields the global attention vector $g$ of size $1 \times d$. Since $g$ reflects the importance of the whole input sequence, the input sequence is weighted and averaged with it to obtain a representation of the entire sequence.
6. The network security entity identification method based on BERT-BiLSTM-GAM-CRF according to claim 1, wherein in S4, a conditional random field (CRF) is used as the decoder to map the feature vector of each word onto its tag sequence:

$P(y|x) = \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{n} \sum_{j=1}^{k} \theta_j f_j(y_{i-1}, y_i, x, i) \Big)$ (6)

where $y = (y_1, y_2, \ldots, y_n)$ is the tag sequence, $x = (x_1, x_2, \ldots, x_n)$ is the input sequence, $f_j$ is a feature function, $\theta_j$ is the corresponding weight parameter, $k$ is the number of labels, $Z(x)$ is a normalization factor, and $P(y|x)$ denotes the probability of the output sequence $y$ given the input sequence $x$.
7. The network security entity identification method based on BERT-BiLSTM-GAM-CRF according to claim 1, wherein in S5, during training the BERT-BiLSTM-GAM-CRF model optimizes its parameters by minimizing the distance between the labeling result and the real tag sequence; during testing, given an input text, the model automatically identifies the named entities in the text and outputs their type labels; the labeling result is then output.
CN202310655134.6A 2023-06-02 2023-06-02 Network security entity identification method based on BERT-BiLSTM-GAM-CRF Pending CN116933791A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310655134.6A CN116933791A (en) 2023-06-02 2023-06-02 Network security entity identification method based on BERT-BiLSTM-GAM-CRF

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310655134.6A CN116933791A (en) 2023-06-02 2023-06-02 Network security entity identification method based on BERT-BiLSTM-GAM-CRF

Publications (1)

Publication Number Publication Date
CN116933791A (en) 2023-10-24

Family

ID=88374570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310655134.6A Pending CN116933791A (en) 2023-06-02 2023-06-02 Network security entity identification method based on BERT-BiLSTM-GAM-CRF

Country Status (1)

Country Link
CN (1) CN116933791A (en)

Similar Documents

Publication Publication Date Title
CN111859978B (en) Deep learning-based emotion text generation method
CN113051929A (en) Entity relationship extraction method based on fine-grained semantic information enhancement
CN113626589B (en) Multi-label text classification method based on mixed attention mechanism
CN113705218B (en) Event element gridding extraction method based on character embedding, storage medium and electronic device
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
CN113743119B (en) Chinese named entity recognition module, method and device and electronic equipment
CN111506732A (en) Text multi-level label classification method
CN115292568B (en) Civil news event extraction method based on joint model
CN113656700A (en) Hash retrieval method based on multi-similarity consistent matrix decomposition
Wu et al. TDv2: a novel tree-structured decoder for offline mathematical expression recognition
CN115328782A (en) Semi-supervised software defect prediction method based on graph representation learning and knowledge distillation
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
CN116385946B (en) Video-oriented target fragment positioning method, system, storage medium and equipment
CN117516937A (en) Rolling bearing unknown fault detection method based on multi-mode feature fusion enhancement
CN115422945A (en) Rumor detection method and system integrating emotion mining
CN115934883A (en) Entity relation joint extraction method based on semantic enhancement and multi-feature fusion
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
CN116933791A (en) Network security entity identification method based on BERT-BiLSTM-GAM-CRF
CN113377908B (en) Method for extracting aspect-level emotion triple based on learnable multi-word pair scorer
CN114676252A (en) Extreme multi-label learning method based on space-time network clustering reduction integration
CN114691895A (en) Criminal case entity relationship joint extraction method based on pointer network
CN114896969A (en) Method for extracting aspect words based on deep learning
CN115169363A (en) Knowledge-fused incremental coding dialogue emotion recognition method
CN113505937A (en) Multi-view encoder-based legal decision prediction system and method
Sun et al. Task-Oriented Explainable Semantic Communications Based on Structured Scene Graphs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination