CN113886524A - Network security threat event extraction method based on short text

Info

Publication number: CN113886524A
Application number: CN202111129374.XA
Authority: CN
Inventors: 黄诚; 高健; 方勇; 欧浩然
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2021-09-26
Filing date: 2021-09-26
Publication date: 2022-01-04

Abstract

The application relates to a short-text-based method for extracting network security threat events, where the extraction targets are short texts published on social media platforms. The technical core comprises a feature-fusion event detection method based on BiLSTM and an attention mechanism, a multi-element method for identifying network security threat event elements, and a multi-task event extraction method based on a joint model. The method first uses multi-dimensional integrated word vectors as the main features for detecting network security threat events, and in parallel identifies event elements using both rule templates and deep learning. The joint model then combines these two subtasks, so that extraction of the network security threat event is completed in a non-pipeline manner.

Description

Network security threat event extraction method based on short text

Technical Field

The invention relates to the field of network security threat events, in particular to a network security threat event extraction method.

Background

Highly aggressive network security incidents occur frequently around the world, their attack scope keeps widening, and the network security situation is severe. As people rely on social media in their daily work and life, its use has become increasingly widespread, and the volume of data on social media platforms is growing rapidly. News and online media organizations also maintain official accounts on social platforms and publish real-time news there, and many network security companies and individuals are the first to publish news about network attacks on these platforms. Effectively obtaining and extracting clear network security threat event content from the mass of media information published online is therefore important: it helps network security practitioners learn about relevant threat events and actively implement network security defenses.

Existing network security threat event detection techniques and general-domain event extraction techniques have two main problems:

(1) because the research domains differ, applying general-domain techniques directly to network security threat event extraction leads to poor entity extraction and inaccurate event detection;

(2) social media has its own peculiarities: user posts are short, wording is heavily colloquial, and texts are poorly formed and loosely connected, all of which greatly increase the difficulty of extracting key information.

To address the problems that social media short texts are too short, contain too much colloquial vocabulary, and are poorly connected, a short-text-based network security threat event extraction method is urgently needed. Such a method should effectively extract network security threat events from short texts published on social media platforms and help security personnel respond to network threat events in time and actively implement network security defenses.

Disclosure of Invention

In view of the above, an object of the present application is to provide a short-text-based method for extracting network security threat events, which aims to solve the problems of event detection and event element identification in network security threat event extraction. The embodiment of the application provides a short-text-based network security threat event extraction method for effectively extracting network security threat event information from short texts on social media platforms. The method comprises the following steps:

collecting an original data set, then cleaning and labeling it to generate a corpus suitable for network security threat event detection and event element identification.

According to the generated corpus, word-level vector embedding of the text is performed using several word vector models, and text keywords are obtained with an LDA (Latent Dirichlet Allocation) topic model for sentence-level vector embedding, which completes the multi-dimensional integrated text feature vector representation used for event detection; an event detection model is then constructed using BiLSTM and an attention mechanism to capture the deep semantic features of the text and detect events efficiently and accurately.

Specifically, word vector models pre-trained with Word2vec, GloVe and FastText are used for integrated coding to obtain word-level feature vectors; the topic keywords of a text are obtained with the LDA topic model as text-level features; and the two kinds of feature vectors are used as input to the BiLSTM and attention mechanism to train the network security threat event detection model;

the multi-element network security threat event element identification model identifies event elements accurately and efficiently according to the characteristics of the different event elements.

First, network security named entities of the corresponding types are identified using a rule-template-based method; the known named entities are then masked, and a network security threat event element identification model based on BiLSTM and a dilated convolutional neural network is trained.

The joint model completes the two subtasks of network security threat event extraction, so that extraction of network security threat events from short texts is finally accomplished.

The network security threat event detection model detects the event, the multi-element network security threat event element identification model identifies the event elements, and finally different event templates are established according to the event type and filled with the event elements to construct the complete event.

Drawings

To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic flowchart of a method for extracting a network security threat event based on short text according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of step S12 according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring to Fig. 1, Fig. 1 is a flowchart of a short-text-based network security threat event extraction method according to an embodiment of the present application. As shown in Fig. 1, the method comprises the following steps:

Step S11: collect the original data set, then clean and label it to generate a corpus suitable for network security threat event detection and event element identification.

In this embodiment, an original data set for training and testing the network security threat event detection model and the event element identification model is collected, and the original data set is cleaned and labeled to generate a corpus applicable to both models.

Illustratively, the original data set comes from two sources: tweets on the Twitter social platform, retrieved by keyword search using crawler technology (the Twint library), and other relevant public data sets that have been collected. The data are stored in CSV format.

Illustratively, data cleaning of the original data set removes duplicate texts and filters out texts that are too short or carry no information. Data annotation consists of automatic pre-annotation with the Stanford named entity recognition tool, dictionary matching of terms specific to the network security threat event domain, and manual annotation with the open-source Brat system. The annotated data are then cleaned again to remove stop words and articles, producing a corpus suitable for network security threat event detection and event element identification.
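The cleaning steps described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the applicants' implementation: the stop-word list, the minimum length threshold `MIN_TOKENS` and the function names are hypothetical, and the annotation stage that sits between the two cleaning passes is omitted.

```python
import re

# Hypothetical stop-word list; the application does not enumerate one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "on", "and"}
MIN_TOKENS = 4  # assumed threshold below which a text counts as "too short"


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Tokens joined by '-', '.' or an apostrophe (for example a vulnerability
    number or a version string) are kept as a single token.
    """
    return re.findall(r"[a-z0-9]+(?:[-.'][a-z0-9]+)*", text.lower())


def clean_corpus(texts):
    """Drop duplicate and overly short texts, then remove stop words."""
    seen = set()
    cleaned = []
    for text in texts:
        tokens = tokenize(text)
        key = " ".join(tokens)  # normalized form used for duplicate detection
        if key in seen or len(tokens) < MIN_TOKENS:
            continue
        seen.add(key)
        cleaned.append([t for t in tokens if t not in STOP_WORDS])
    return cleaned
```

Duplicates are detected on the normalized token sequence, so two posts that differ only in letter case or punctuation count as the same text.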

Step S12: using integrated coding, obtain word-level feature vectors from different pre-trained word vector models, extract the topic keywords of the text with an LDA topic model to obtain text-level feature vectors, and train the model using BiLSTM and an attention mechanism.

In this embodiment, word vector models pre-trained with Word2vec, GloVe and FastText are used for integrated coding to obtain word-level feature vectors; the topic keywords of a text are obtained with the LDA topic model as text-level features; and the two kinds of feature vectors are used as input to the BiLSTM and attention mechanism to train the network security threat event detection model.
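The word-level integrated coding can be sketched as follows. The application does not state how the three embeddings are combined, so concatenation is an assumption here, as are the zero vector for out-of-vocabulary words and the toy lookup tables that stand in for real pre-trained models.

```python
def integrated_embedding(word, models):
    """Concatenate the word's vectors from each model; zero-fill when missing."""
    vector = []
    for table, dim in models:
        vector.extend(table.get(word, [0.0] * dim))
    return vector


# Toy lookup tables standing in for pre-trained Word2vec, GloVe and FastText
# vectors (real models typically use a few hundred dimensions each).
word2vec = {"malware": [0.1, 0.2], "attack": [0.3, 0.4]}
glove = {"malware": [0.5, 0.6, 0.7]}
fasttext = {"malware": [0.8, 0.9], "attack": [1.0, 1.1]}
MODELS = [(word2vec, 2), (glove, 3), (fasttext, 2)]


def encode_sentence(tokens, models=MODELS):
    """Return one integrated vector per token."""
    return [integrated_embedding(t, models) for t in tokens]
```

With concatenation, every token receives a vector whose length is the sum of the individual model dimensions, so a word missing from one model still keeps the information the other models provide.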

For the obtained word-level and text-level feature vectors, the network security threat event detection model is constructed by performing the following steps:

Step S12a: build a word dictionary from the data set statistics, and obtain word-level feature vectors by integrated coding with the pre-trained Word2vec, GloVe and FastText word vector models;

Step S12b: build an LDA topic model using the Gensim library for Python, iterate over different numbers of topics to obtain different topic models, evaluate the quality of each topic model by topic coherence, and extract the topic keywords of the text to obtain text-level feature vectors;

Step S12c: perform word embedding on the data set, and apply a self-attention mechanism and BiLSTM to the embedded result to obtain the context features and key-part features of the text.

Illustratively, the results are combined using an LSTM neural network, a Dropout layer is used to prevent overfitting, and the features are joined through a fully connected Dense layer.
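The role of the attention mechanism in step S12c can be illustrated with a simplified attention pooling. This sketch is not the model of the application: `hidden_states` stands in for per-token BiLSTM outputs, and `query` is a hypothetical vector that would be learned during training in a real network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Weight each time step by its similarity to `query` and sum the states.

    Returns the pooled context vector and the attention weights.
    """
    # Dot-product score between every hidden state and the query vector.
    scores = [sum(h_k * q_k for h_k, q_k in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, hidden_states)) for k in range(dim)
    ]
    return context, weights
```

Tokens whose hidden states align with the query receive larger weights, which is how the key parts of a short text can dominate the representation passed to the classifier.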

Illustratively, to ensure good model performance, the word-level and text-level features are used together for classification training, a validation set is used to tune the model's hyperparameters, and a test set is used to evaluate its detection performance.
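The topic-number selection in step S12b can be sketched as follows. The application builds its LDA models with Gensim and does not say which coherence measure it uses; this sketch does not call Gensim at all and assumes the UMass measure, computed directly from document co-occurrence counts. The `candidates` mapping (topic count to lists of top words) is a hypothetical stand-in for the topics produced by models trained with different topic numbers.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of one topic; a higher score means a more coherent topic."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            denom = doc_freq(w_j)
            if denom:  # skip words that never occur in the corpus
                score += math.log((doc_freq(w_i, w_j) + 1) / denom)
    return score


def best_topic_count(candidates, documents):
    """Pick the topic count whose topics have the highest mean coherence.

    `candidates` maps a topic count to that model's topics (lists of top words).
    """
    def mean_coherence(topics):
        return sum(umass_coherence(t, documents) for t in topics) / len(topics)

    return max(candidates, key=lambda k: mean_coherence(candidates[k]))
```

The top words of the selected model's topics would then serve as the topic keywords from which the text-level feature vector is built.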

Step S13: the network security threat event element identification model extracts event elements from the text using two different methods, chosen according to the structural characteristics of the event elements.

In this embodiment, network security named entities of the corresponding types are first identified using a rule-template-based method; the known named entities are then masked, and a network security threat event element identification model based on BiLSTM and a dilated convolutional neural network is trained.

Illustratively, event elements such as IP addresses, URLs, vulnerability numbers, email addresses and version numbers have very distinct structural characteristics and are extracted using regular expressions. Elements such as names, organizations and vulnerability terms have no obvious structural characteristics, so their features are extracted automatically by a method based on a dilated convolutional neural network and BiLSTM. Together, the two methods yield the network threat event elements.
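The rule-template extraction and the subsequent masking of known entities can be sketched with regular expressions. The patterns below are illustrative assumptions only, since the application does not disclose its templates: vulnerability numbers are assumed to follow the CVE format, the IP pattern does not check that each octet is at most 255, and the `<LABEL>` placeholder format is hypothetical.

```python
import re

# Illustrative patterns; order matters because earlier matches are masked first
# (URLs before e-mail addresses, IP addresses before version numbers).
PATTERNS = {
    "URL": re.compile(r"https?://[^\s]+"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "CVE": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "VERSION": re.compile(r"\bv?\d+(?:\.\d+){1,3}\b"),
}


def extract_and_mask(text):
    """Return (entities, masked_text); each match is replaced by its type tag."""
    entities = []
    for label, pattern in PATTERNS.items():
        def repl(match, label=label):
            # Record the entity and substitute a placeholder in the text.
            entities.append((label, match.group(0)))
            return f"<{label}>"
        text = pattern.sub(repl, text)
    return entities, text
```

The masked text, in which structured entities have been replaced by placeholders, is what would be passed on to the neural model for the remaining element types.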

Illustratively, the texts are collected after the known entities have been masked and are then fed into the neural network. The outputs of the BiLSTM and the IDCNN can be regarded as a state feature matrix of the text, and a CRF layer is attached to constrain the output labels using label transition probabilities.
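How label transition scores constrain the output can be illustrated with Viterbi decoding, the standard inference procedure for a linear-chain CRF. This is a sketch under assumptions: the emission scores stand in for the BiLSTM and IDCNN outputs, the transition matrix would be learned in a real CRF layer, and the BIO label set is hypothetical.

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring label path.

    emissions[t][i]   : score of label i at position t (state feature matrix)
    transitions[i][j] : score of moving from label i to label j (CRF layer)
    """
    n = len(labels)
    scores = list(emissions[0])
    backpointers = []
    for step in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n):
            # Best previous label for reaching label j at this position.
            best_i = max(range(n), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + step[j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    best = max(range(n), key=lambda i: scores[i])
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return [labels[i] for i in reversed(path)]
```

A strongly negative transition score, for example from `O` to `I-ORG`, makes an invalid label sequence lose to a valid one even when the per-token scores alone would favor it.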

Illustratively, during training, val_loss (the loss value on the validation set) is monitored in real time using a checkpoint mechanism, and the best model observed during training is saved.
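The checkpoint logic can be sketched independently of any framework. The class name and the dictionary used as model state are hypothetical; deep learning frameworks usually provide an equivalent mechanism, for example the Keras `ModelCheckpoint` callback with `monitor="val_loss"` and `save_best_only=True`, although the application does not name the framework it uses.

```python
import copy


class BestCheckpoint:
    """Keep a copy of the model state with the lowest validation loss so far."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_state = None
        self.best_epoch = None

    def update(self, epoch, val_loss, model_state):
        """Store `model_state` if `val_loss` improved; return True when saved."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            # Deep copy so later training steps cannot alter the saved state.
            self.best_state = copy.deepcopy(model_state)
            self.best_epoch = epoch
            return True
        return False


# Usage with made-up per-epoch results: (validation loss, model state).
checkpoint = BestCheckpoint()
history = [(0.9, {"w": 1}), (0.6, {"w": 2}), (0.7, {"w": 3})]
for epoch, (loss, state) in enumerate(history):
    checkpoint.update(epoch, loss, state)
```

Saving only on improvement means the retained model is the one that generalized best to the validation set, not necessarily the one from the final epoch.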

Step S14: establish a joint model to complete the extraction of network security threat events.

In this embodiment, the network security threat event detection model detects the event, the multi-element network security threat event element identification model identifies the event elements, and finally different event templates are established according to the event type and filled with the event elements to construct the complete event.
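Template filling can be sketched as follows. The event types and slot names are hypothetical, since the application does not list its templates; `event_type` stands for the output of the detection model and `elements` for the (label, value) pairs produced by element identification.

```python
# Hypothetical templates: event type -> slots to fill from extracted elements.
EVENT_TEMPLATES = {
    "vulnerability": ["CVE", "ORG", "VERSION"],
    "data_breach": ["ORG", "URL", "EMAIL"],
}


def build_event(event_type, elements):
    """Fill the template for `event_type` with extracted (label, value) pairs."""
    slots = EVENT_TEMPLATES.get(event_type)
    if slots is None:
        return None  # no threat event detected, or no template for this type
    event = {"type": event_type}
    for slot in slots:
        # A slot may hold several values, or none if nothing was extracted.
        event[slot] = [value for label, value in elements if label == slot]
    return event
```

Elements whose labels are not part of the selected template are simply ignored, so the same element list can be reused across event types.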

While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.

The network security threat event extraction method provided by the application has been described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the embodiments is intended only to help in understanding the method and its core idea. For a person skilled in the art, the specific embodiments and the scope of application may vary according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A method for extracting network security threat events based on short texts, characterized by comprising the following steps:

A. collecting an original data set, then cleaning and labeling it to generate a corpus suitable for network security threat event detection and event element identification;

B. using integrated coding, obtaining word-level feature vectors from different pre-trained word vector models, extracting the topic keywords of the text with an LDA topic model to obtain text-level feature vectors, and training the model using BiLSTM and an attention mechanism;

C. extracting event elements from the text with a network security threat event element identification model that uses two different methods, chosen according to the structural characteristics of the event elements;

D. establishing a joint model to complete the extraction of network security threat events.

2. The method for extracting network security threat events based on short texts according to claim 1, wherein the data collection process in step A comprises the following steps:

(1) in the sample collection stage, crawler technology is used with the Twitter media platform as the target: the Twint library searches tweets for samples by keyword, a large number of duplicate or near-duplicate texts are removed, and texts that are too short or carry no information are filtered out;

(2) in the automatic data pre-labeling stage, the Stanford named entity recognition tool is used to pre-label named entities in the data, and key terms in the network security threat event domain are collected into a dictionary that is used for matching and labeling domain-specific words;

(3) in the manual data labeling and data cleaning stage, the open-source Brat system is used to correct the entity labels, and after manual labeling the data are cleaned again to remove stop words and articles.

3. The method for extracting network security threat events based on short texts according to claim 1, wherein the network security threat event detection model in step B is constructed by the following steps:

(1) building a word dictionary from the data set statistics, and performing word-level vector embedding using the pre-trained GloVe, Word2vec and FastText word vector models;

(2) obtaining text-level feature vectors by building an LDA model using the Gensim library for Python, iterating over different numbers of topics to obtain different topic models, and evaluating the quality of each topic model by topic coherence;

(3) constructing the detection model by performing word embedding on the data set using GloVe, Word2vec and FastText, applying a self-attention mechanism and BiLSTM to the embedded result to obtain the context features and key-part features of the text, combining the results using an LSTM neural network, using a Dropout layer to prevent overfitting, and joining the features through a fully connected Dense layer.

4. The method for extracting network security threat events based on short texts according to claim 1, wherein in step C the two methods for extracting event elements from the text are as follows:

(1) named entities such as IP addresses, URLs, vulnerability numbers and email addresses are extracted using rule templates;

(2) the event elements matched by the rule templates are masked, and the texts with the known entities masked are collected and fed into the neural network; the outputs of the BiLSTM and the IDCNN are taken as a state feature matrix of the text, and a CRF layer is attached to constrain the output labels using label transition probabilities.

5. The method for extracting network security threat events based on short texts according to claim 1, wherein the joint model in step D is established as follows:

(1) the network security threat event detection model detects the event, and the multi-element network security threat event element identification model identifies the event elements;

(2) different event templates are established according to the event type and filled with the event elements to construct the complete event.