CN114330322A - Threat information extraction method based on deep learning - Google Patents

Threat information extraction method based on deep learning

Info

Publication number
CN114330322A
CN114330322A
Authority
CN
China
Prior art keywords
entity
word
sequence
deep learning
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210006117.5A
Other languages
Chinese (zh)
Inventor
李小勇
左峻嘉
高雅丽
兰天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202210006117.5A priority Critical patent/CN114330322A/en
Publication of CN114330322A publication Critical patent/CN114330322A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a threat intelligence extraction method based on deep learning, comprising the following steps. S1, information acquisition: collect APT reports, call the Request library to build web crawlers that parse the webpage structures of different sources and acquire unstructured intelligence texts, and design a bloom filter to deduplicate the URLs. S2, preprocessing: screen the input data by article length and keyword density, and label entity relations in the screened APT reports with YEDDA. S3, entity relation extraction: extract valuable entity relation triples from the preprocessed unstructured APT reports. By adjusting the deep neural network model and proposing a new sequence labeling method and entity relation extraction rules, the method addresses the propagation errors present in current threat intelligence entity relation extraction systems and the low extraction accuracy of existing models on overlapping-relation entities, and also details the construction and preprocessing of a large-scale threat intelligence data set.

Description

Threat information extraction method based on deep learning
Technical Field
The invention relates to the technical field of network security information extraction, in particular to a threat intelligence information extraction method based on deep learning.
Background
According to the definition provided by the international IT consultancy Gartner, threat intelligence is evidence-based knowledge about existing or emerging threats to IT or information assets, including context, mechanisms, indicators, implications, and actionable advice, which can inform decisions on threat response. By providing standardized knowledge definitions, tagged-profile techniques, and scenario-based application methods, threat intelligence supplies defenders with tactical, operational, and strategic intelligence, effectively supporting their vulnerability detection and management and mitigating the information asymmetry between attackers and defenders. Threat intelligence can thus act as a security expert that offers actionable advice when a user is threatened or attacked, and it is of great value for intelligent defense.
In recent years, the number and complexity of cyber attacks have grown explosively. To defend against advanced persistent threats, security vendors monitor system and software vulnerabilities in real time and generate large volumes of alert information. However, these unstructured data lack correlation and are difficult to use directly, so obtaining effective security information from massive data has become an important problem in the field of network security. Information extraction, which comprises named entity recognition and entity and relation extraction, can effectively convert unstructured network threat reports into structured information and thereby improve their utilization.
Entity relation extraction has long been a classic and challenging task, and research over the last two decades has produced many staged breakthroughs. Its results are now applied mainly to knowledge graph construction, automatic question answering, machine translation, and large-scale text summarization. From early pattern-matching approaches to later machine learning approaches, entity relation extraction has attracted wide attention. With the rise of artificial intelligence, natural language processing based on deep learning has also made breakthrough progress. Deep learning-based entity relation extraction effectively overcomes the shortcomings of traditional labeling tools, achieves good results, and has become a research hotspot in recent years. Nevertheless, it still faces many challenges, such as the complexity of entity semantic relations, the ambiguity of cross-sentence entity relations, insufficient data scale, and the limits of model learning capacity, all of which constrain its development.
Compared with the successful application of information extraction in the medical, financial, and general domains, triple extraction for cyber threat intelligence is still in its infancy. Because entities in the network security domain include specific categories such as attack organizations, attack methods, vulnerabilities, and malware, relation extraction here must match these domain-specific entities, and existing information extraction models cannot be applied directly to entity and relation extraction in the network security domain.
At present, supervised deep learning entity relation extraction for network threat intelligence can be divided into 1) pipeline methods and 2) joint learning methods. Both are extended and optimized on the basis of the Convolutional Neural Network (CNN) and the Long Short-Term Memory network (LSTM), as shown in FIG. 1.
In the pipeline method, entities related to network security are first obtained by named entity recognition, and the relations between candidate entity pairs are then predicted from the known entity relations. CNN-based extensions add category ranking information, dependency parse trees, and attention mechanisms on top of the CNN; LSTM-based extensions add the Shortest Dependency Path (SDP) on top of the LSTM or combine the LSTM with a CNN. However, the pipeline method suffers from error accumulation and propagation, ignores the dependency between subtasks, and produces redundant entities; because the semantic relation between entity recognition and relation extraction is not fully exploited, extraction accuracy is reduced.
Joint learning methods can be divided, according to the modeling object, into two sub-methods: parameter sharing and sequence labeling. Parameter sharing methods use a Bidirectional Long Short-Term Memory network (BiLSTM) extended with optimizations such as CNNs and attention mechanisms; sequence labeling methods solve the redundant-entity problem of the pipeline model with an end-to-end model under a new tagging strategy. With the development of deep neural networks, end-to-end joint entity relation extraction models have been widely applied, but their extraction performance on overlapping-relation entities remains poor.
Disclosure of Invention
In view of the above, the invention aims to provide a threat intelligence extraction method based on deep learning that solves, first, the propagation-error problem of the pipeline method and, second, the problem of low extraction accuracy on overlapping-relation entity pairs.
In order to achieve the above purpose, the invention provides the following technical scheme:
the invention provides a threat information extraction method based on deep learning, which comprises the following steps:
S1, information acquisition: collecting Advanced Persistent Threat (APT) reports; calling the Request library to build web crawlers that parse the webpage structures of different sources and acquire unstructured intelligence texts; encrypting the Uniform Resource Locators (URLs) with the MD5 (Message-Digest Algorithm) message digest algorithm, generating several hash values with several hash functions, and mapping them into a bloom filter to complete the deduplication of the URLs;
s2, preprocessing: screening input data according to article length and keyword density, and carrying out entity relation labeling on the screened APT report;
s3, entity relationship extraction: and extracting entity relation triples which accord with a preset type from the preprocessed unstructured APT report.
Further, in step S2, the input data are screened by article length and keyword density, and reports whose text length is less than 500 words or whose keyword density is less than 0.05 are removed.
Further, in step S2, YEDDA is used to label entity relations in the screened APT reports, the labels being divided into three parts: entity boundary, relation category, and entity role. For entity boundaries, "BIEOS" represents the position of a word within an entity: "B" indicates that the word is at the beginning of the entity; "I" that it is in the middle of the entity; "E" that it is at the end of the entity; "S" that it is a single-word entity; and "O" that it does not belong to any entity. Based on the CTI corpus, all entity roles are divided into seven classes, namely Attack Organization (ORG), Location (LOC), Software (SW), Malware (MAL), Vulnerability (VUL), Attack Method (MEH), and Malicious File (MF). Entity relations are divided into six classes: comes-from, uses, has-vulnerability, has-product, uses-file, and related-to.
Further, the specific process of extracting the entity relationship in step S3 is as follows:
s301, preprocessing an unstructured event intelligence text;
S302, converting each word in the text into a one-dimensional vector by looking it up in a word vector table and inputting it into the BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model, which outputs, for each input word, a vector representation fused with full-text semantic information;
S303, inputting the word vectors of the text into the Attention_BiLSTM_CRF training model and outputting the globally optimal label sequence.
Further, in the BERT pre-training language model in step S302, the activation function is GeLU, the dimensionality is 768 dimensions, and the number of hidden layers is fine-tuned to 10 layers.
Further, in step S302 the word vectors are reduced from 768 to 200 dimensions by matrix mapping and input into the encoding layer of the Attention_BiLSTM_CRF training model.
Further, the Attention_BiLSTM_CRF training model is divided into a BiLSTM encoding layer, an Attention layer, and a CRF decoding layer.
Further, the flow of the BiLSTM layer is as follows: taking the word vector sequence obtained from the embedding layer as the input of the BiLSTM, and concatenating the hidden-state sequence output by the forward LSTM with the hidden-state sequence output by the backward LSTM to obtain the complete hidden-state sequence; and then mapping the hidden-state sequence to dimension k, where k is the number of labels in the label set, thereby obtaining the automatically extracted features.
Further, the flow of the Attention layer is as follows: calculate the similarity e_ij between sequence elements, i.e., the influence of word j on word i in the input sequence; the attention-layer weight matrix entry α_ij represents the attention weight of word j relative to word i in the text and accurately captures the influence between words; then weight the hidden-state sequence h_i with these coefficients to obtain the vectorized sequence representation S_t; finally, concatenate S_t with the encoded output h_t of the BiLSTM layer and apply an activation function for a nonlinear transformation to obtain the weight matrix W_t that influences subsequent label classification, which serves as the input of the CRF layer, thereby focusing attention during model training.
Further, the flow of the CRF layer is as follows: a score is computed for the tag sequence y = (y_1, y_2, ..., y_n) corresponding to the input sequence X = (x_1, x_2, ..., x_n), the total score of a tag sequence being

s(X, y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where T is the transition matrix representing the transition scores between tags, T_{i,j} represents the probability of transitioning from tag i to tag j, and P_{i, y_i} represents the probability that the input word x_i is classified to tag y_i.
Compared with the prior art, the invention has the beneficial effects that:
according to the threat information extraction method based on deep learning, the problems that a propagation error exists in a current threat information entity relation extraction system and the extraction accuracy of a model to an entity in an overlapping relation is low are solved by adjusting a deep neural network model and providing a new sequence marking method and an entity relation extraction rule, and meanwhile, details of large-scale threat information data set construction and preprocessing are provided.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a conventional threat intelligence information extraction;
FIG. 2 is a flow chart of extracting information entity relationship according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of those embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments herein without creative effort fall within the scope of the present invention.
The invention provides a threat information extraction method based on deep learning, which comprises the steps of information acquisition, information preprocessing and entity relation extraction.
Step 1, information collection
The system collects relevant APT reports from sources such as Symantec, FireEye, and Threatpost; considering that the intelligence is embedded in webpages as unstructured text, it calls the Request library to build web crawlers that parse the webpage structures of the different sources and acquire the unstructured intelligence texts. Considering that texts may be repeated and that the different intelligence sources are updated, we design a bloom filter to deduplicate Uniform Resource Locators (URLs). Specifically, each URL is encrypted with the MD5 (Message-Digest Algorithm) message digest algorithm, and several hash values generated by several hash functions are mapped into the bloom filter to complete the deduplication of the URLs.
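The URL-deduplication step above (MD5 digest, several hash values, bloom filter) can be sketched as follows. The bit-array size, the number of hash functions, and the way the MD5 digest is split into hash values are illustrative assumptions, not parameters specified by the patent.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter for URL deduplication (illustrative sketch).

    Hypothetical parameters: an m-bit array and k hash values derived by
    splitting the MD5 digest of the URL into k 4-byte chunks.
    """

    def __init__(self, m=1 << 20, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _hashes(self, url):
        # MD5 digest of the URL, split into k integer positions modulo m.
        digest = hashlib.md5(url.encode("utf-8")).digest()
        for i in range(self.k):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.m

    def seen_or_add(self, url):
        """Return True if the URL was (probably) seen before, else record it."""
        positions = list(self._hashes(url))
        seen = all(self.bits[p // 8] & (1 << (p % 8)) for p in positions)
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)
        return seen

bf = BloomFilter()
first = bf.seen_or_add("https://example.com/apt-report")   # new URL
second = bf.seen_or_add("https://example.com/apt-report")  # duplicate
```

A bloom filter never misses a true duplicate but may report a small rate of false positives, which is an acceptable trade-off for crawler deduplication.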
Step 2, information preprocessing
This step performs preliminary processing on the input data. Since there are news items, comments, and advertisements unrelated to network threat intelligence, the texts are screened by article length and keyword density:
Article length: based on practical experience and observation, APT reports are generally long, ranging from 600 to 3000 words, because they must describe specific attack events in detail, including the attack intent, attack pattern, and so on. Other texts, such as ordinary advertisements and comments, usually contain far fewer words, in the range of 100 to 300, and no CTI-related keywords. To improve the quality of the threat intelligence data set, we therefore reject reports whose text length is less than 500 words.
Keyword density: APT reports are generally considered to contain many network security keywords, such as vulnerability numbers, hash values, or IPs. We therefore count the keywords in each text and compute their density, i.e., the percentage of keywords among the total number of words in the report. Finally, we reject reports whose keyword density is less than 0.05.
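The two screening rules above (minimum length of 500 words, minimum keyword density of 0.05) can be sketched as a simple filter. The whitespace tokenization and the sample keyword set are assumptions for illustration, not the patent's exact implementation.

```python
def keyword_density(text, keywords):
    """Fraction of tokens in `text` that are CTI keywords (illustrative)."""
    words = text.split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,") in keywords)
    return hits / len(words)

def keep_report(text, keywords, min_len=500, min_density=0.05):
    """Apply the two screening rules from step 2: reject short reports
    and reports whose keyword density is below the threshold."""
    return (len(text.split()) >= min_len
            and keyword_density(text, keywords) >= min_density)
```

In practice the keyword set would contain indicators such as CVE numbers, hash values, and IP addresses, matched with regular expressions rather than exact tokens.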
In addition, YEDDA is used to label entity relations in the screened APT reports, the labels being divided into three parts: entity boundary, relation category, and entity role. For entity boundaries, "BIEOS" represents the position of a word within an entity: "B" indicates that the word is at the beginning of the entity; "I" that it is in the middle of the entity; "E" that it is at the end of the entity; "S" that it is a single-word entity; and "O" that it does not belong to any entity. Based on the CTI corpus, all entities are divided into seven classes: ORG, LOC, SW, MAL, VUL, MEH, and MF; relations are divided into six classes: comes-from, uses, has-vulnerability, has-product, uses-file, and related-to. Finally, we selected 846 APT reports and labeled 13865 entities and 7394 relations.
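A minimal sketch of the BIEOS boundary tagging described above. The tokens, spans, and role names below are hypothetical examples; real annotation is performed with the YEDDA tool.

```python
def bieos_tags(tokens, spans):
    """Assign BIEOS position tags for labeled entity spans (sketch).

    `spans` maps (start, end) token ranges (end exclusive) to a role such
    as "ORG" or "MAL"; all remaining tokens receive the "O" tag.
    """
    tags = ["O"] * len(tokens)
    for (start, end), role in spans.items():
        if end - start == 1:
            tags[start] = f"S-{role}"        # single-word entity
        else:
            tags[start] = f"B-{role}"        # beginning of entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{role}"        # middle of entity
            tags[end - 1] = f"E-{role}"      # end of entity
    return tags

# Hypothetical sentence: a single-word ORG entity and a two-word MAL entity.
tokens = ["APT28", "used", "X-Agent", "malware", "."]
tags = bieos_tags(tokens, {(0, 1): "ORG", (2, 4): "MAL"})
```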
Step 3, entity relation extraction
This step extracts entity relation triples of the preset types from the preprocessed unstructured APT reports; it requires that the entity relation content of the intelligence events in the unstructured intelligence texts be complete and that the core expressions of attack events be recognized accurately. The specific flow is shown in FIG. 2.
BERT pre-trained language model: a fine-tuned BERT pre-trained language model is adopted. Each word in the text is converted into a one-dimensional vector by looking it up in a word vector table and used as the model input; the model outputs, for each input word, a vector representation fused with full-text semantic information. The activation function is GeLU and the dimensionality is 768; the number of hidden layers is fine-tuned from the conventional 12 down to 10, which shortens training time without affecting accuracy. In addition, because the word vectors are high-dimensional, they are reduced from 768 to 200 dimensions by matrix mapping in order to shorten training time and prevent overfitting, and are then input into the BiLSTM layer of the Attention_BiLSTM_CRF training model.
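The 768-to-200-dimension matrix mapping described above can be sketched as a single linear projection. The random initialization and the sequence length here are illustrative assumptions; in practice the projection matrix is learned during training alongside the rest of the network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical BERT output: a sequence of n word vectors, 768-dim each.
n, bert_dim, reduced_dim = 12, 768, 200
bert_output = rng.standard_normal((n, bert_dim))

# Matrix mapping: a (here randomly initialized) 768x200 projection reduces
# each word vector to 200 dimensions before it enters the BiLSTM layer.
W = rng.standard_normal((bert_dim, reduced_dim)) * 0.02
reduced = bert_output @ W
```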
Attention_BiLSTM_CRF training model: the word vectors of the text are the model input, and the globally optimal label sequence is the output. The model is divided into a BiLSTM encoding layer, an Attention layer, and a CRF decoding layer.
BiLSTM layer: the word vector sequence X = (x_1, x_2, ..., x_n) obtained from the embedding layer is taken as the input of the BiLSTM. The hidden-state sequence (→h_1, ..., →h_n) output by the forward LSTM and the hidden-state sequence (←h_1, ..., ←h_n) output by the backward LSTM are concatenated to obtain the complete hidden-state sequence h_t = [→h_t ; ←h_t] ∈ R^{2m}, where m is the number of units in the hidden layer. The hidden-state sequence is then mapped to dimension k, where k is the number of labels in the label set, giving the automatically extracted features, recorded as a matrix P; P_{i,j} represents the score of word x_i for the j-th label. The computation is as follows:

o_t = tanh(W_xo x_t + W_ho h_{t-1} + W_co c_t + b_o)
h_t = o_t ⊙ tanh(c_t)
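The concatenation of forward and backward hidden states and the mapping to the label dimension can be sketched numerically. The sequence length, hidden size m, and label count k below are illustrative values, and the random states stand in for actual LSTM outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 5, 100, 25   # sequence length, hidden units, number of labels

# Hypothetical forward and backward LSTM hidden-state sequences.
h_fwd = rng.standard_normal((n, m))
h_bwd = rng.standard_normal((n, m))

# Concatenate to the complete hidden-state sequence h_t in R^{2m} ...
h = np.concatenate([h_fwd, h_bwd], axis=1)

# ... then map to dimension k (label-set size) to get the feature matrix P,
# where P[i, j] scores word x_i for the j-th label.
W_map = rng.standard_normal((2 * m, k)) * 0.01
P = h @ W_map
```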
where o_t denotes the output state, c_t the cell state, b_o the bias vector, and W_xo, W_ho, and W_co the weight matrices of the input vector, the hidden-layer vector, and the cell, respectively.

Attention layer: we first calculate the similarity e_ij between sequence elements, i.e., the influence of word j on word i in the input sequence:

e_ij = V_a tanh(W_a s_{i-1} + U_a h_j)

where W_a and U_a are matrix parameters learned during training and s_{i-1} denotes the hidden-layer state.
The attention-layer weight matrix entry α_ij represents the attention weight of word j relative to word i in this text and accurately captures the influence between words. It is calculated as follows:

α_ij = exp(e_ij) / Σ_k exp(e_ik)

Then the hidden-state sequence h_i is weighted with these coefficients to obtain the vectorized sequence representation S_t. Finally, S_t is concatenated with the encoded output h_t of the BiLSTM layer, and an activation function is applied for a nonlinear transformation to obtain the weight matrix W_t that influences subsequent label classification, which serves as the input of the CRF layer, thereby focusing attention during model training. The specific calculation process is as follows:

S_t = Σ_i α_ki h_i
W_t = [S_t, h_t]

where α_ki is the probability weight of state node k with respect to position node i.
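The attention computation above, softmax normalization of the similarities into weights, a weighted sum into S_t, and concatenation into W_t, can be sketched on toy values; the scores and hidden states below are hypothetical.

```python
import numpy as np

def attention_weights(e_row):
    """Softmax over similarity scores e_ij -> weights alpha_ij (sketch)."""
    exp = np.exp(e_row - e_row.max())   # shift for numerical stability
    return exp / exp.sum()

# Hypothetical similarity scores of words j = 0..3 with respect to word i.
e_i = np.array([2.0, 1.0, 0.5, 0.5])
alpha_i = attention_weights(e_i)

# Weighted sum over the hidden states h_j gives the context vector S_t,
# which is then concatenated with h_t to form W_t = [S_t, h_t].
h = np.ones((4, 3))                    # toy hidden states
S_t = alpha_i @ h
W_t = np.concatenate([S_t, h[0]])
```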
CRF layer: a score is computed for the tag sequence y = (y_1, y_2, ..., y_n) corresponding to the input sequence X = (x_1, x_2, ..., x_n); the larger the score, the more likely the corresponding tags are true. Adjacent labels within the tag sequence y have transition scores, and the higher a transition score, the more likely that label transition occurs. The total score of a tag sequence is shown below:

s(X, y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where T is the transition matrix representing the transition scores between tags, T_{i,j} represents the probability of transitioning from tag i to tag j, and P_{i, y_i} represents the probability that the input word x_i is classified to tag y_i.
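The CRF scoring formula above (sum of emission scores plus sum of transition scores along the tag path) can be checked on a toy example. The emission matrix P, transition matrix T, and tag sequence here are hypothetical values, omitting the start/end transitions for brevity.

```python
import numpy as np

def sequence_score(P, T, y):
    """Total score of tag sequence y: emission scores P[i, y_i] plus
    transition scores T[y_i, y_{i+1}] along the path (sketch)."""
    emit = sum(P[i, y[i]] for i in range(len(y)))
    trans = sum(T[y[i], y[i + 1]] for i in range(len(y) - 1))
    return emit + trans

# Tiny hypothetical example: 3 words, 2 tags.
P = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.5, 0.5]])
T = np.array([[0.1, 0.9],
              [0.2, 0.3]])
score = sequence_score(P, T, [0, 1, 1])
# emissions 1.0 + 2.0 + 0.5, transitions 0.9 + 0.3 -> total 4.7
```

Viterbi decoding then finds the tag sequence that maximizes this score, yielding the globally optimal label sequence.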
Matching the triple extraction rules: to extract entity relation triples accurately, the following rule matching is performed on the output globally optimal label sequence. The specific implementation is given in Algorithm 1 (reproduced as an image in the original document).
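Algorithm 1 is reproduced only as an image in the original document, so its exact rules are not recoverable here. The following hypothetical sketch illustrates the general idea: decode the BIEOS label sequence into typed entities, then match entity roles into relation triples under a simple example rule.

```python
def decode_entities(tokens, tags):
    """Decode BIEOS tags into (entity_text, role) pairs (sketch only;
    the patent's Algorithm 1 is not reproduced in text form)."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("S-"):
            entities.append((tokens[i], tag[2:]))       # single-word entity
        elif tag.startswith("B-"):
            start = i                                   # entity begins
        elif tag.startswith("E-") and start is not None:
            entities.append((" ".join(tokens[start:i + 1]), tag[2:]))
            start = None                                # entity ends
    return entities

tokens = ["APT28", "used", "X-Agent", "malware"]
tags = ["S-ORG", "O", "B-MAL", "E-MAL"]
entities = decode_entities(tokens, tags)

# A hypothetical rule: an ORG entity paired with a MAL entity in the same
# sentence yields a "uses" triple.
triples = [(head, "uses", tail)
           for head, head_role in entities
           for tail, tail_role in entities
           if head_role == "ORG" and tail_role == "MAL"]
```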
Example 1
Constructing a threat intelligence entity relation extraction model according to the following steps:
1. Construct the threat intelligence data set: build a web crawler to capture APT reports, screen out irrelevant network threat intelligence by article length and keyword density, construct a high-quality threat intelligence data set, and specify a labeling strategy to realize entity relation labeling.
2. Data preprocessing: pre-train the language model using fine-tuned BERT, where the activation function is GeLU, the dimensionality is 768, and the number of hidden layers is 10.
3. Train the threat intelligence entity relation extraction model on the processed data set.
Extract intelligence using the threat intelligence entity relation extraction model:
Step 1: the data preprocessed by the BERT pre-trained language model pass through the embedding layer to obtain word vectors fused with full-text semantic information;
Step 2: reduce the dimensionality of the embedding-layer output and input it into the deep neural network, whose details are as follows:
a. Structure: the BiLSTM layer dimension is 200 and the attention layer dimension is 200.
b. Learning rate: 0.001.
c. Optimizer: Adam.
d. Training algorithm: back propagation.
e. Dropout: 0.5.
Step 3: apply the matching rules of the proposed Algorithm 1 to the obtained globally optimal label sequence to extract entity relation triples.
The method solves the problems that existing threat intelligence entity relation extraction systems suffer from propagation errors and that models have low extraction accuracy on overlapping-relation entities.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, apparatus embodiments, electronic device embodiments, computer-readable storage medium embodiments, and computer program product embodiments are described in greater detail with reference to the method embodiments, where relevant, in that they are substantially similar to the method embodiments.
The above embodiments are merely specific embodiments of the present application, used to illustrate rather than limit its technical solutions, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the art may still, within the technical scope disclosed herein, modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of their technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the present disclosure and are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A threat intelligence information extraction method based on deep learning is characterized by comprising the following steps:
S1, information acquisition: collecting APT reports; calling the Request library to build web crawlers that parse the webpage structures of different sources and acquire unstructured intelligence texts; encrypting the URLs with the MD5 message digest algorithm, generating several hash values with several hash functions, and mapping them into a bloom filter to complete the deduplication of the URLs;
s2, preprocessing: screening input data according to article length and keyword density, and carrying out entity relation labeling on the screened APT report;
s3, entity relationship extraction: and extracting entity relation triples which accord with a preset type from the preprocessed unstructured APT report.
2. The method for extracting threat intelligence information based on deep learning of claim 1, wherein in step S2 the input data are screened by article length and keyword density, and reports whose text length is less than 500 words or whose keyword density is less than 0.05 are rejected.
3. The method for extracting threat intelligence information based on deep learning of claim 1, wherein in step S2, YEDDA is used to label entity relations in the screened APT reports, the labels being divided into three parts: entity boundary, relation category, and entity role; for entity boundaries, "BIEOS" represents the position of a word within an entity: "B" indicates that the word is at the beginning of the entity; "I" that it is in the middle of the entity; "E" that it is at the end of the entity; "S" that it is a single-word entity; and "O" that it does not belong to any entity; based on the CTI corpus, all entity roles are divided into seven classes, namely ORG, LOC, SW, MAL, VUL, MEH, and MF; entity relations are divided into six classes: comes-from, uses, has-vulnerability, has-product, uses-file, and related-to.
4. The method for extracting threat intelligence information based on deep learning of claim 1, wherein the specific process of extracting entity relationship in step S3 is as follows:
s301, preprocessing an unstructured event intelligence text;
S302, converting each word in the text into a one-dimensional vector by looking it up in a word vector table and inputting it into the BERT pre-trained language model, which outputs, for each input word, a vector representation fused with full-text semantic information;
S303, inputting the word vectors of the text into the Attention_BiLSTM_CRF training model and outputting the globally optimal label sequence.
5. The method for extracting threat intelligence information based on deep learning of claim 4, wherein in the BERT pre-training language model of step S302, the activation function is GeLU, the dimensionality is 768 dimensions, and the number of hidden layers is 10.
6. The method for extracting threat intelligence information based on deep learning of claim 4, wherein step S302 is implemented by matrix mapping to reduce the word vector from 768 dimensions to 200 dimensions and input the word vector into the Bilstm layer of the Attention _ BilSTM _ CRF training model.
7. The method of claim 4, wherein the Attention_BiLSTM_CRF training model is divided into a BiLSTM layer, an Attention layer and a CRF layer.
8. The method for extracting threat intelligence information based on deep learning of claim 7, wherein the flow of the BiLSTM layer is as follows: the word vector sequence obtained from the embedding layer is taken as the input of the BiLSTM, and the hidden state sequence output by the forward LSTM is concatenated with the hidden state sequence output by the backward LSTM to obtain the complete hidden state sequence; the hidden state sequence is then mapped to dimension k, where k is the number of labels in the tag set, thereby obtaining the automatically extracted features.
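The concatenation and mapping described in this claim can be sketched as follows. The forward and backward hidden state sequences are random stand-ins (a real BiLSTM would produce them), and k = 33 is an illustrative label count, not a value from the patent:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, hidden, k = 6, 100, 33  # k = number of labels in the tag set (illustrative)

# Stand-ins for the forward and backward LSTM hidden state sequences.
h_forward = rng.normal(size=(seq_len, hidden))
h_backward = rng.normal(size=(seq_len, hidden))

# Concatenate into the complete hidden state sequence: (seq_len, 2*hidden).
h_full = np.concatenate([h_forward, h_backward], axis=-1)

# Map the hidden states to dimension k, giving per-word scores for each label.
W_out = rng.normal(size=(2 * hidden, k)) * 0.02
emissions = h_full @ W_out
print(h_full.shape, emissions.shape)
# → (6, 200) (6, 33)
```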
9. The method for extracting threat intelligence information based on deep learning of claim 7, wherein the process of the Attention layer is as follows: calculate the degree of similarity e_ij between sequence elements, i.e. the influence of word j on word i in the input sequence; the Attention layer weight matrix α_ij represents the attention weight of word j relative to word i in the text and accurately captures the influence between words; then the sequence h_i is weighted by these coefficients to obtain the sequence vectorization representation S_t; finally, S_t is concatenated with the encoded output h_t of the BiLSTM layer, and an activation function is applied for nonlinear transformation to obtain the weight matrix W_t that influences subsequent label classification, which serves as the input to the CRF layer, thereby achieving a focus of attention in model training.
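A sketch of the attention computation above. The patent does not specify the similarity function, so a plain dot product is assumed here for e_ij, with a softmax producing the weight matrix α_ij:

```python
import numpy as np

def attention(h):
    """h: (n, d) hidden states. Returns weights alpha_ij and weighted representation S."""
    e = h @ h.T                                # e_ij: influence of word j on word i
    e = e - e.max(axis=1, keepdims=True)       # shift for numerical stability
    alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)  # softmax rows -> alpha_ij
    s = alpha @ h                              # weighted sum over the sequence h_i
    return alpha, s

rng = np.random.default_rng(2)
h = rng.normal(size=(6, 200))                  # stand-in for BiLSTM hidden states
alpha, s = attention(h)
print(alpha.shape, s.shape)                    # (6, 6) (6, 200)
print(np.allclose(alpha.sum(axis=1), 1.0))     # each row of weights sums to 1
```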
10. The method for extracting threat intelligence information based on deep learning of claim 7, wherein the flow of the CRF layer is as follows: for the input sequence X = (x_1, x_2, ..., x_n) and its corresponding tag sequence y = (y_1, y_2, ..., y_n), a score is computed; the total score of the tag sequence is

score(X, y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where T is the transition matrix representing the transition scores between tags, with T_{i,j} representing the probability of transferring from tag i to tag j, and P_{i,j} representing the probability of the input word x_i being classified to tag j.
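The score formula can be transcribed directly. In this sketch the artificial start tag and the tiny 2-word, 2-tag example are assumptions added for illustration:

```python
import numpy as np

def crf_score(emissions, transitions, tags, start_tag):
    """
    Total score of a tag sequence under a linear-chain CRF:
    score = sum_i T[y_{i-1}, y_i] + sum_i P[i, y_i]
    emissions:   (n, k) matrix P, P[i, j] = score of word x_i taking tag j
    transitions: (k, k) matrix T, T[i, j] = score of moving from tag i to tag j
    start_tag:   index of an artificial start tag (an assumption of this sketch)
    """
    score = transitions[start_tag, tags[0]] + emissions[0, tags[0]]
    for i in range(1, len(tags)):
        score += transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    return score

# Tiny worked example: 2 words, 2 tags, tag 0 doubling as the start tag.
P = np.array([[1.0, 0.5],
              [0.2, 2.0]])
T = np.array([[0.1, 0.3],
              [0.0, 0.4]])
score = crf_score(P, T, tags=[0, 1], start_tag=0)
# score = T[0,0] + P[0,0] + T[0,1] + P[1,1] = 0.1 + 1.0 + 0.3 + 2.0 ≈ 3.4
```

At training time the model maximizes this score for the gold sequence relative to all alternative sequences; at inference time Viterbi decoding recovers the globally optimal labeling sequence mentioned in step S303.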
CN202210006117.5A 2022-01-05 2022-01-05 Threat information extraction method based on deep learning Pending CN114330322A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210006117.5A CN114330322A (en) 2022-01-05 2022-01-05 Threat information extraction method based on deep learning

Publications (1)

Publication Number Publication Date
CN114330322A true CN114330322A (en) 2022-04-12

Family

ID=81025783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210006117.5A Pending CN114330322A (en) 2022-01-05 2022-01-05 Threat information extraction method based on deep learning

Country Status (1)

Country Link
CN (1) CN114330322A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114490626A (en) * 2022-04-18 2022-05-13 成都数融科技有限公司 Financial information analysis method and system based on parallel computing
CN114490626B (en) * 2022-04-18 2022-08-16 成都数融科技有限公司 Financial information analysis method and system based on parallel computing
CN114970553A (en) * 2022-07-29 2022-08-30 北京道达天际科技股份有限公司 Information analysis method and device based on large-scale unmarked corpus and electronic equipment
CN114970553B (en) * 2022-07-29 2022-11-08 北京道达天际科技股份有限公司 Information analysis method and device based on large-scale unmarked corpus and electronic equipment
CN115169351A (en) * 2022-09-05 2022-10-11 白杨时代(北京)科技有限公司 Method and device for layered extraction of security threat information
CN116611436A (en) * 2023-04-18 2023-08-18 广州大学 Threat information-based network security named entity identification method
CN116192537A (en) * 2023-04-27 2023-05-30 四川大学 APT attack report event extraction method, system and storage medium
CN117332785A (en) * 2023-10-10 2024-01-02 山东省计算中心(国家超级计算济南中心) Method for extracting entity and relation from network security threat information combination
CN117332785B (en) * 2023-10-10 2024-03-01 山东省计算中心(国家超级计算济南中心) Method for extracting entity and relation from network security threat information combination
CN117236333A (en) * 2023-10-17 2023-12-15 哈尔滨工业大学(威海) Complex named entity identification method based on threat information
CN117195876A (en) * 2023-11-03 2023-12-08 北京航空航天大学 Construction method and system of automobile network threat information corpus
CN117290510A (en) * 2023-11-27 2023-12-26 浙江太美医疗科技股份有限公司 Document information extraction method, model, electronic device and readable medium
CN117290510B (en) * 2023-11-27 2024-01-30 浙江太美医疗科技股份有限公司 Document information extraction method, model, electronic device and readable medium

Similar Documents

Publication Publication Date Title
CN114330322A (en) Threat information extraction method based on deep learning
Yan et al. Learning URL embedding for malicious website detection
CN111552855B (en) Network threat information automatic extraction method based on deep learning
Mohan et al. Spoof net: syntactic patterns for identification of ominous online factors
Zhou et al. CTI view: APT threat intelligence analysis system
CN113591077B (en) Network attack behavior prediction method and device, electronic equipment and storage medium
Guo et al. CyberRel: Joint entity and relation extraction for cybersecurity concepts
Mitra et al. Combating fake cyber threat intelligence using provenance in cybersecurity knowledge graphs
Zhang et al. EX‐Action: Automatically Extracting Threat Actions from Cyber Threat Intelligence Report Based on Multimodal Learning
CN115080756A (en) Attack and defense behavior and space-time information extraction method oriented to threat information map
Wang et al. Cyber threat intelligence entity extraction based on deep learning and field knowledge engineering
Xun et al. AITI: An automatic identification model of threat intelligence based on convolutional neural network
Liu et al. Threat intelligence ATT&CK extraction based on the attention transformer hierarchical recurrent neural network
Xu et al. Adversarial attacks on text classification models using layer‐wise relevance propagation
Zhou et al. Cdtier: A Chinese dataset of threat intelligence entity relationships
Zhu et al. CCBLA: a lightweight phishing detection model based on CNN, BiLSTM, and attention mechanism
Wang et al. A method for extracting unstructured threat intelligence based on dictionary template and reinforcement learning
Hu et al. Cross-site scripting detection with two-channel feature fusion embedded in self-attention mechanism
Chai et al. LGMal: A joint framework based on local and global features for malware detection
Ejaz et al. Life-long phishing attack detection using continual learning
Lu et al. Domain-oriented topic discovery based on features extraction and topic clustering
Ahmadi et al. Inductive and transductive link prediction for criminal network analysis
CN115242539B (en) Network attack detection method and device for power grid information system based on feature fusion
Li et al. A novel threat intelligence information extraction system combining multiple models
Raj et al. Hybrid approach for phishing website detection using classification algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination