CN113886524A - Network security threat event extraction method based on short text - Google Patents


Publication number
CN113886524A
Authority
CN
China
Prior art keywords
security threat
event
network security
text
word
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111129374.XA
Other languages
Chinese (zh)
Inventor
黄诚
高健
方勇
欧浩然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-09-26
Filing date
2021-09-26
Publication date
2022-01-04
Application filed by Sichuan University
Priority to CN202111129374.XA
Publication of CN113886524A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Abstract

The application relates to a short-text-based method for extracting network security threat events, where the extraction targets are short texts published on social media platforms. The technical core of the application comprises a feature-fusion event detection method based on BiLSTM and an attention mechanism, a multi-factor method for identifying network security threat event elements, and a multi-task event extraction method based on a joint model. The method first uses multi-dimensional integrated word vectors as key features to detect network security threat events, and in parallel identifies network security threat event elements using both rule-template-based and deep learning methods. The joint model then handles these two subtasks as a multi-task process and completes the extraction of network security threat events in a non-pipeline manner.

Description

Network security threat event extraction method based on short text
Technical Field
The invention relates to the field of network security threat events, in particular to a network security threat event extraction method.
Background
Highly aggressive network security incidents occur frequently around the world, their attack range keeps widening, and the network security situation is severe. As people rely on online social media in daily work and life, its use keeps growing and the volume of data on social media platforms is increasing rapidly. News and online media organizations also maintain official accounts on social platforms and publish real-time news, and many network security companies and individuals are the first to publish news about network attack events there. Using the mass of media information published online to effectively obtain and extract intuitive network security threat event content is therefore very important: it helps network security practitioners learn about relevant network security threat events and actively implement network security defense.
Existing network security threat event detection technology and general-domain event extraction technology have two main problems:
(1) because the research fields differ, applying general-domain techniques directly to network security threat event extraction leads to poor entity extraction and inaccurate event detection;
(2) social media has its own characteristics: the information users publish is short, wording is heavily colloquial, and texts are poorly standardized and poorly connected, all of which greatly increase the difficulty of extracting key information.
To address the problems of overly short texts, heavy use of colloquial vocabulary and poor information connectivity in social media short texts in the field of network security threat event extraction, a short-text-based network security threat event extraction method is urgently needed. Such a method can effectively extract network security threat events from short texts published on social media platforms and help security personnel respond to network threat events in time and actively implement network security defense.
Disclosure of Invention
In view of the above, an object of the present application is to provide a method for extracting network security threat events based on short texts, which aims to solve the problems of event detection and event element identification in network security threat event extraction. The embodiment of the application provides a short-text-based network security threat event extraction method for effectively extracting network security threat event information appearing in short texts on social media platforms. The method comprises the following steps:
An original data set is collected, cleaned and labeled to generate a corpus suitable for network security threat event detection and event element identification.
Based on the generated corpus, word-level vector embedding is performed on the text using several word vector models, and text keywords are obtained with an LDA (Latent Dirichlet Allocation) topic model for sentence-level vector embedding, completing a multi-dimensional integrated-coding feature vector representation of the text for event detection. An event detection model is then built with BiLSTM and an attention mechanism to capture the deep semantic features of the text and detect events efficiently and accurately.
Specifically, word vector models pre-trained with Word2vec, GloVe and FastText are used for integrated coding to obtain word-level feature vectors; an LDA topic model is used to obtain the topic keywords of a text as text-level features; and the two feature vectors are used as the input of the BiLSTM and attention mechanism to train a network security threat event detection model.
The multi-factor event element identification model for the network security threat field identifies event elements accurately and efficiently according to the characteristics of the different event elements.
First, network security named entities of the corresponding types are identified with a rule-template-based method; the known named entities are then masked, and a network security threat event element identification model based on BiLSTM and a dilated convolutional neural network is trained.
The joint model completes the two subtasks of network security threat event extraction and thereby completes the extraction of network security threat events from short texts.
The network security threat event detection model detects the event, the multi-factor network security threat event element identification model identifies the event elements, and finally different event templates are established according to the event type and filled with the event elements to construct the whole event.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of a method for extracting a network security threat event based on short text according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of step S12 according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for extracting a short text-based cyber-security threat event according to an embodiment of the present application. As shown in fig. 1, the method comprises the steps of:
Step S11: collect an original data set, then clean and label it to generate a corpus suitable for network security threat event detection and event element identification.
In this embodiment, an original data set for training and testing the network security threat event detection model and the event element identification model is collected, and the original data set is cleaned and labeled to generate a corpus applicable to both models.
Illustratively, the original data set is collected by using crawler technology to gather relevant tweets from the Twitter social platform by keyword; other related public data sets are collected as well and stored in CSV format.
Illustratively, data cleaning of the original data set removes repeated texts and filters out texts that are too short or carry no information. Data annotation consists of automatic pre-annotation with the Stanford named entity recognition tool, matching annotation against specific terms in the field of network security threat events, and manual annotation with the Brat open-source system. The annotated data are then cleaned again to remove stop words and articles, producing a corpus suitable for network security threat event detection and event element identification.
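The cleaning rules described for step S11 can be illustrated with a minimal Python sketch. The stop-word list, the minimum-length threshold and the function names below are assumptions made for illustration only; the application does not specify them, and annotation is left out entirely.

```python
import re

# Hypothetical, deliberately tiny stop-word list (it includes the articles).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "on", "and"}
MIN_TOKENS = 4  # assumed threshold for a text that is "too short"


def tokenize(text):
    """Lowercase a text and split it into tokens, keeping inner '-' and '.'."""
    return re.findall(r"[a-z0-9]+(?:[-.][a-z0-9]+)*", text.lower())


def clean_corpus(texts):
    """Drop repeated and too-short texts, then remove stop words."""
    seen = set()
    cleaned = []
    for text in texts:
        tokens = tokenize(text)
        key = " ".join(tokens)  # normalized form used for de-duplication
        if key in seen or len(tokens) < MIN_TOKENS:
            continue
        seen.add(key)
        cleaned.append([t for t in tokens if t not in STOP_WORDS])
    return cleaned
```

In the described workflow stop words are removed only after annotation; the sketch collapses both cleaning passes into one function for brevity.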
Step S12: using integrated coding, obtain word-level feature vectors from several pre-trained word vector models, extract the topic keywords of a text with an LDA topic model to obtain text-level feature vectors, and train the model with BiLSTM and an attention mechanism.
In this embodiment, word vector models pre-trained with Word2vec, GloVe and FastText are used for integrated coding to obtain word-level feature vectors; an LDA topic model is used to obtain the topic keywords of a text as text-level features; and the two feature vectors are used as the input of the BiLSTM and attention mechanism to train a network security threat event detection model.
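One plausible reading of the integrated coding is that the vectors a word receives from each pre-trained model are concatenated into a single word-level feature vector. The sketch below assumes exactly that; the application does not state how the vectors are combined, and the three lookup tables are toy stand-ins with made-up values rather than real pre-trained models.

```python
# Toy stand-ins for three pre-trained embedding tables (values are invented).
WORD2VEC = {"malware": [0.1, 0.2], "attack": [0.3, 0.4]}
GLOVE = {"malware": [0.5, 0.6, 0.7]}
FASTTEXT = {"malware": [0.8], "attack": [0.9]}

# Each entry pairs a lookup table with its vector dimension.
MODELS = [(WORD2VEC, 2), (GLOVE, 3), (FASTTEXT, 1)]


def integrated_vector(word):
    """Concatenate the word's vector from every model; zeros where it is missing."""
    vector = []
    for table, dim in MODELS:
        vector.extend(table.get(word, [0.0] * dim))
    return vector


def encode(tokens):
    """Map a token list to a list of integrated word-level vectors."""
    return [integrated_vector(token) for token in tokens]
```

Concatenation keeps every model's information at the cost of a wider input; averaging or a learned projection would be alternative design choices.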
Based on the obtained word-level and text-level feature vectors, the network security threat event detection model is constructed through the following steps:
Step S12a: compile a word dictionary from the data set, and obtain word-level feature vectors through integrated coding with the pre-trained Word2vec, GloVe and FastText word vector models;
Step S12b: build LDA topic models with the Gensim library for Python, iterating over different numbers of topics to obtain different topic models, evaluate the quality of each topic model by topic coherence, and extract the topic keywords of a text to obtain text-level feature vectors;
Step S12c: perform word embedding on the data set, and apply a self-attention mechanism and BiLSTM, respectively, to the embedded result to obtain the context and key-part features of the text.
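The model selection in step S12b (try several topic counts, keep the model whose topics score best) can be sketched without any topic-modelling library. The example below assumes the UMass coherence measure, which is an assumption: the application does not say which measure is used. The candidate topics are hand-written lists of top words standing in for the output of trained LDA models.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence: sum over ranked word pairs of log((D(wi, wj) + 1) / D(wj))."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):  # j ranks above i
        denom = doc_freq(top_words[j])
        if denom == 0:
            continue  # skip words that never occur in the corpus
        score += math.log((doc_freq(top_words[i], top_words[j]) + 1) / denom)
    return score


def best_topic_count(candidates, documents):
    """candidates maps a topic count to that model's topics (lists of top words)."""
    def model_score(topics):
        return sum(umass_coherence(t, documents) for t in topics) / len(topics)

    return max(candidates, key=lambda k: model_score(candidates[k]))
```

With Gensim, the topic lists would come from trained LDA models, and the library's own `CoherenceModel` class could replace the hand-written scorer.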
Illustratively, the results are combined using an LSTM neural network, a Dropout layer is used to prevent overfitting, and fully connected concatenation is performed in a Dense layer.
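To make the self-attention part of the detection model more concrete, the following sketch computes scaled dot-product self-attention over a sequence of token vectors. It is a simplification under stated assumptions: the learned query, key and value projections are replaced by the identity, and the application does not specify which attention variant it uses.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections (an assumption)."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Similarity of this token to every token in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is the attention-weighted average of all token vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
        )
    return outputs
```

Each output row mixes information from the whole text, weighted toward the tokens most similar to the query token, which is how attention highlights the key parts of a short text.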
Illustratively, to ensure good model performance, word-level and text-level features are used for classification training, a validation set is used to tune the model's hyperparameters, and a test set is used to evaluate the model's detection performance.
Step S13: the network security threat event element identification model extracts event elements from the text using two different methods, chosen according to the structural characteristics of the network security threat event elements.
In this embodiment, network security named entities of the corresponding types are first identified with a rule-template-based method; the known named entities are then masked, and a network security threat event element identification model based on BiLSTM and a dilated convolutional neural network is trained.
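The convolutional half of the element identification model is an IDCNN, that is, an iterated dilated CNN. The basic operation, a one-dimensional dilated convolution, can be sketched as follows. The sketch uses one scalar feature per token and a fixed kernel purely for illustration; a real model applies learned kernels to vector features.

```python
def dilated_conv1d(sequence, kernel, dilation):
    """1-D convolution whose taps sit `dilation` steps apart; zero-padded to keep length."""
    half = len(kernel) // 2
    output = []
    for t in range(len(sequence)):
        total = 0.0
        for k, weight in enumerate(kernel):
            idx = t + (k - half) * dilation  # position read by this kernel tap
            if 0 <= idx < len(sequence):
                total += weight * sequence[idx]
        output.append(total)
    return output
```

Stacking such layers with growing dilation (for example 1, 2, 4) widens the span of text each position can see without adding parameters, which is the usual motivation for dilated convolutions in sequence labeling.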
Illustratively, several event elements, such as IP addresses, URLs, vulnerability numbers, mailboxes and version numbers, have very obvious structural characteristics and are extracted with regular expressions. Elements such as names, organizations and vulnerability terms have no obvious structural characteristics, so their features are extracted automatically with a method based on a dilated convolutional neural network and BiLSTM. Together these yield the network threat event elements.
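The rule-template half of this step can be sketched with Python's `re` module. The patterns below are illustrative assumptions, not the patterns used in the application: vulnerability numbers are assumed to be CVE identifiers and mailboxes to be e-mail addresses. The sketch also masks the matched spans, since the embodiment masks known entities before neural training.

```python
import re

# Order matters: earlier patterns win when two matches overlap.
PATTERNS = {
    "URL": re.compile(r"https?://[^\s<>\"']+"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IP": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "CVE": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    "VERSION": re.compile(r"\bv?\d+(?:\.\d+){1,2}\b"),
}


def extract_elements(text):
    """Return non-overlapping (label, value, start, end) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            if all(m.end() <= s or m.start() >= e for _, _, s, e in found):
                found.append((label, m.group(), m.start(), m.end()))
    return sorted(found, key=lambda item: item[2])


def mask_elements(text, elements):
    """Replace each extracted span with a placeholder such as <IP>."""
    parts, cursor = [], 0
    for label, _, start, end in elements:
        parts.append(text[cursor:start])
        parts.append(f"<{label}>")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)
```

Giving the more specific patterns priority keeps an IP address such as 192.168.1.10 from being partly consumed by the looser version-number pattern.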
Illustratively, the texts are collected after the known entities are masked and then fed into the neural network. The outputs of the BiLSTM and the IDCNN can be regarded as a state feature matrix of the text, and a CRF layer connected after them constrains the output labels using label transition probabilities.
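How label transition scores constrain the output can be shown with a small Viterbi decoder, the standard inference procedure for a linear-chain CRF. The label set, emission scores and transition scores below are invented for illustration; in the described model the emission scores would come from the BiLSTM and IDCNN outputs and the transition scores would be learned.

```python
def viterbi(emissions, transitions, labels):
    """Return the best label path.

    emissions: one dict per token mapping label -> score.
    transitions: dict mapping (previous_label, label) -> score; missing pairs score 0.
    """
    best = [{lab: emissions[0][lab] for lab in labels}]
    back = []
    for scores in emissions[1:]:
        cur, ptr = {}, {}
        for lab in labels:
            # Best previous label for reaching `lab` at this position.
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get((p, lab), 0.0),
            )
            cur[lab] = best[-1][prev] + transitions.get((prev, lab), 0.0) + scores[lab]
            ptr[lab] = prev
        best.append(cur)
        back.append(ptr)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

A large negative score on the pair ("O", "I-ORG") makes the decoder prefer a well-formed B-ORG, I-ORG sequence even when the per-token scores alone would pick O followed by I-ORG.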
Illustratively, during training, val_loss (the loss function value on the validation set) is monitored in real time using the Checkpoint technique as the reference result, and the best model found during training is saved.
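The checkpointing idea, keeping the model state with the lowest validation loss, can be sketched independently of any deep learning framework. The class and attribute names below are hypothetical, and the model state is represented by a plain dictionary.

```python
import copy


class BestCheckpoint:
    """Keep a copy of the model state with the lowest validation loss seen so far."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_state = None
        self.best_epoch = None

    def update(self, epoch, val_loss, state):
        """Store `state` if `val_loss` improves on the best so far; report whether it did."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_state = copy.deepcopy(state)  # protect against later mutation
            self.best_epoch = epoch
            return True
        return False
```

Frameworks usually provide this directly; Keras, for example, has a `ModelCheckpoint` callback that accepts `monitor="val_loss"` and `save_best_only=True`. Whether the application uses that particular callback is not stated.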
Step S14: establish a joint model to complete the extraction of network security threat events.
In this embodiment, the network security threat event detection model detects the event, the multi-factor network security threat event element identification model identifies the event elements, and finally different event templates are established according to the event type and filled with the event elements to construct the whole event.
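The final template-filling stage of step S14 can be sketched as follows. The event types, slot names and function signature are hypothetical; the application only states that templates depend on the event type and are filled with the identified elements. The detection result and the element list are passed in as plain values, standing in for the outputs of the two models.

```python
# Hypothetical templates: each event type lists the element slots it expects.
EVENT_TEMPLATES = {
    "vulnerability": ["CVE", "PRODUCT", "VERSION"],
    "data_breach": ["ORG", "URL"],
}


def build_event(text, is_event, event_type, elements):
    """Fill the template for `event_type` with (label, value) element pairs.

    Returns None when the detector rejected the text or the type has no template.
    """
    if not is_event or event_type not in EVENT_TEMPLATES:
        return None
    event = {"type": event_type, "text": text}
    for slot in EVENT_TEMPLATES[event_type]:
        values = [value for label, value in elements if label == slot]
        event[slot] = values or None  # None marks a slot left unfilled
    return event
```

Elements whose label is not in the chosen template are simply ignored, so the same element list can be reused across event types.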
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
The network security threat event extraction method provided by the present application has been described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the embodiments is only intended to help in understanding the method and its core idea. Meanwhile, those skilled in the art may, following the idea of the present application, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (5)

1. A method for extracting network security threat events based on short texts, characterized by comprising the following steps:
A. collecting an original data set, and cleaning and labeling the data set to generate a corpus suitable for network security threat event detection and event element identification;
B. using integrated coding, obtaining word-level feature vectors from different pre-trained word vector models, extracting the topic keywords of a text with an LDA topic model to obtain text-level feature vectors, and training the model with BiLSTM and an attention mechanism;
C. extracting, by the network security threat event element identification model, event elements from the text using two different methods according to the structural characteristics of the network security threat event elements;
D. establishing a joint model to complete the extraction of network security threat events.
2. The method for extracting cyber security threat events based on short texts according to claim 1, wherein the data collection process in step A comprises the following steps:
(1) in the sample collection stage, crawler technology is used with the Twitter social media platform as the target, the Twint library is used to search tweets for samples by keyword, a large number of repeated or near-duplicate texts are removed, and texts that are too short or carry no information are filtered out;
(2) in the automatic data pre-labeling stage, the Stanford named entity recognition tool is used to pre-label the named entities in the data, and key terms in the field of network security threat events are collected into a dictionary used for matching annotation of specific words and terms;
(3) in the manual data labeling and data cleaning stage, the Brat open-source system is used to correct the entity labels, and after manual labeling the data are cleaned again to remove stop words and articles.
3. The method for extracting cyber security threat events based on short texts according to claim 1, wherein the cyber security threat event detection model in step B is constructed by the following steps:
(1) compiling a word dictionary from the data set, and embedding word vectors at the word level using the pre-trained GloVe, Word2vec and FastText word vector models;
(2) obtaining text-level feature vectors by building LDA models with the Gensim library for Python, iterating over different numbers of topics to obtain different topic models, and evaluating the quality of the topic models by topic coherence;
(3) constructing the detection model by performing word embedding on the data set with GloVe, Word2vec and FastText, applying a self-attention mechanism and BiLSTM, respectively, to the embedded result to obtain the context and key-part features of the text, combining the results with an LSTM neural network, using a Dropout layer to prevent overfitting, and performing fully connected concatenation in a Dense layer.
4. The method for extracting cyber security threat events based on short texts according to claim 1, wherein in step C the two methods for extracting event elements from texts are as follows:
(1) named entities such as IP addresses, URLs, vulnerability numbers and mailboxes are extracted using rule templates;
(2) the event elements matched by the rule templates are masked, and the texts are collected after the known entities are masked; the texts are then fed into the neural network, the outputs of the BiLSTM and the IDCNN are taken as a state feature matrix of the text, and a CRF layer is connected to constrain the output labels using label transition probabilities.
5. The method for extracting cyber security threat events based on short texts according to claim 1, wherein establishing the joint model in step D comprises the following steps:
(1) the network security threat event detection model detects the event, and the multi-factor network security threat event element identification model identifies the event elements;
(2) different event templates are established according to the event type and filled with the event elements to construct the whole event.
CN202111129374.XA 2021-09-26 2021-09-26 Network security threat event extraction method based on short text Pending CN113886524A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111129374.XA CN113886524A (en) 2021-09-26 2021-09-26 Network security threat event extraction method based on short text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111129374.XA CN113886524A (en) 2021-09-26 2021-09-26 Network security threat event extraction method based on short text

Publications (1)

Publication Number Publication Date
CN113886524A (en) 2022-01-04

Family

ID=79006812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111129374.XA Pending CN113886524A (en) 2021-09-26 2021-09-26 Network security threat event extraction method based on short text

Country Status (1)

Country Link
CN (1) CN113886524A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580738A (en) * 2022-03-03 2022-06-03 厦门大学 Social media crisis event prediction method and system
TWI822388B (en) * 2022-10-12 2023-11-11 財團法人資訊工業策進會 Labeling method for information security protection detection rules and tactic, technique and procedure labeling device for the same

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122416A (en) * 2017-03-31 2017-09-01 北京大学 A kind of Chinese event abstracting method
CN108897989A (en) * 2018-06-06 2018-11-27 大连理工大学 A kind of biological event abstracting method based on candidate events element attention mechanism
US20190394215A1 (en) * 2018-06-21 2019-12-26 Electronics And Telecommunications Research Institute Method and apparatus for detecting cyber threats using deep neural network
CN110717049A (en) * 2019-08-29 2020-01-21 四川大学 Text data-oriented threat information knowledge graph construction method
CN112199496A (en) * 2020-08-05 2021-01-08 广西大学 Power grid equipment defect text classification method based on multi-head attention mechanism and RCNN (Rich coupled neural network)
CN112613305A (en) * 2020-12-27 2021-04-06 北京工业大学 Chinese event extraction method based on cyclic neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
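Semih Yagcioglu et al.: "Detecting Cybersecurity Events from Noisy Short Text", arXiv *
Yong Fang et al.: "Detecting Cyber Threat Event from Twitter Using IDCNN and BiLSTM", Applied Sciences *
崔莹 (Cui Ying): "Event extraction method for the political and diplomatic domain based on similar sememes and dependency syntax" (基于相似义原和依存句法的政外领域事件抽取方法), Computer Engineering & Science (《计算机工程与科学》) *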

Similar Documents

Publication Publication Date Title
CN104572958B (en) A kind of sensitive information monitoring method based on event extraction
CN106503055B (en) A kind of generation method from structured text to image description
CN104598535B (en) A kind of event extraction method based on maximum entropy
CN106815293A (en) System and method for constructing knowledge graph for information analysis
CN111783394B (en) Training method of event extraction model, event extraction method, system and equipment
CN110717324B (en) Judgment document answer information extraction method, device, extractor, medium and equipment
CN106156365A (en) A kind of generation method and device of knowledge mapping
CN104268160A (en) Evaluation object extraction method based on domain dictionary and semantic roles
CN111597803B (en) Element extraction method and device, electronic equipment and storage medium
CN106570180A (en) Artificial intelligence based voice searching method and device
CN112231472B (en) Judicial public opinion sensitive information identification method integrated with domain term dictionary
Kausar et al. ProSOUL: a framework to identify propaganda from online Urdu content
CN113886524A (en) Network security threat event extraction method based on short text
CN107943514A (en) The method for digging and system of core code element in a kind of software document
CN112257441A (en) Named entity identification enhancement method based on counterfactual generation
CN105609116A (en) Speech emotional dimensions region automatic recognition method
CN105389303B (en) A kind of automatic fusion method of heterologous corpus
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
CN108681532B (en) Sentiment analysis method for Chinese microblog
CN114661872A (en) Beginner-oriented API self-adaptive recommendation method and system
Santosh et al. Deconfounding legal judgment prediction for European court of human rights cases towards better alignment with experts
CN106095758B (en) A kind of literary works guess method of word-based vector model
CN109002561A (en) Automatic document classification method, system and medium based on sample keyword learning
CN110362828B (en) Network information risk identification method and system
Augenstein Joint information extraction from the web using linked data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20220104