CN116450844A - Threat information entity relation extraction method for unstructured data - Google Patents

Threat information entity relation extraction method for unstructured data

Info

Publication number
CN116450844A
CN116450844A
Authority
CN
China
Prior art keywords
entity
information
sentence
vector
threat
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310323400.5A
Other languages
Chinese (zh)
Inventor
袁陈翔
朱小龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202310323400.5A priority Critical patent/CN116450844A/en
Publication of CN116450844A publication Critical patent/CN116450844A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of threat intelligence named entity recognition, in particular to a threat intelligence entity relation extraction method oriented to unstructured data, which accurately extracts cyber threat intelligence entity relations from unstructured text by combining a threat intelligence named entity recognition method based on data enhancement and BERT with a threat intelligence entity relation extraction method that fuses multiple kinds of entity information. The method increases the number of vulnerability, domain name and IP entities and the sample diversity of attack organization and malware entities: sentences containing entities of the type to be enhanced are retrieved as template sentences, entities of the same type from a knowledge base are filled into the template sentences to generate new sentences containing entities of that type, and the newly generated sentences are added to the training set to achieve data enhancement and improve semantic accuracy. The invention fuses entity semantic information and entity boundary information, and adds entity type information into the BERT sentence vector to help the model classify relations more accurately.

Description

Threat information entity relation extraction method for unstructured data
Technical Field
The invention relates to the field of threat information named entity identification, in particular to a threat information entity relation extraction method for unstructured data.
Background
In recent years, with the rapid development of computer networks and communication technology, the Internet has become ever more closely connected with people's daily lives. It continues to help small and medium-sized enterprises in China transform digitally, promoting the development of the digital economy and extending digital dividends to the public. The Internet is widely applied in fields such as intelligent manufacturing, intelligent transportation and e-government, and its influence on various industries grows year by year.
At the same time, Internet security faces increasingly serious challenges. According to the Overview of China's Internet Network Security Situation in 2020 issued by the National Internet Emergency Center (CNCERT), more than 42 million malicious program samples were captured in 2020, with an average of over 4.82 million propagation events per day. Statistics on attack target IP addresses show that about 55.41 million IP addresses in China were attacked by malicious programs, accounting for 14.2% of the country's total. In the same year, the number of security vulnerabilities recorded by the National Information Security Vulnerability Sharing Platform (CNVD) reached a record high of 20,704, a year-on-year increase of 27.9%; among them, 8,902 were zero-day vulnerabilities, accounting for 43.0% of the total and up 56.0% year-on-year. These figures show that China's network security situation is far from optimistic.
In addition, distributed denial-of-service (DDoS) attacks, a common attack means that is difficult to defend against, have shown a clear upward trend in recent years. In 2020, large-traffic DDoS attacks occurred in China at an average of 10.4 incidents per day, spanning 285 days of the year, a year-on-year increase of 29.5%. With the development of the Industrial Internet of Things, more and more industrial devices are connected to the Internet; in 2020, 142 high-risk vulnerabilities were found in the networked monitoring and management systems of key industries (electric power, oil and gas, and rail transit), and their impact cannot be ignored.
Under the current network environment, traditional security technologies such as firewalls, antivirus software and intrusion detection systems, although widely deployed and effective to a degree, still show shortcomings when facing zero-day vulnerability exploits and advanced persistent threat (APT) attacks. People have therefore begun to turn to threat intelligence in search of new ideas for solving network security problems. Threat intelligence is evidence-based knowledge, including context, mechanisms, indicators, implications and actionable advice, about an existing or emerging menace or hazard to assets, which the asset owner can use to inform decisions on how to respond to that menace or hazard.
In recent years, the threat-intelligence-driven active defense model has attracted attention from both academia and industry as a new direction of development, and many security organizations and vendors have begun to research and apply threat intelligence and to provide related services. By actively collecting, refining and analyzing data, threat intelligence can consolidate information about network security events that have occurred or are occurring, so that network threats can be discovered in advance and measures can be taken to prevent or reduce loss and harm.
However, the development of threat intelligence still faces many challenges. Kris Oosthoek, a senior cyber threat intelligence analyst with the Dutch government, writing in the International Journal of Intelligence and CounterIntelligence, points out the following difficulties in developing and exploiting threat intelligence. First, threat intelligence lacks methodology: most threat intelligence analysis is driven by incoming alarms and log data rather than by predetermined methods or hypotheses, which makes it difficult for an enterprise to relate the large number of indicators of compromise (IOC) generated daily to its particular threat environment. Second, the sharing of threat intelligence largely stops at words: for reasons of trust and commercial interest, only a small portion of structured and semi-structured threat intelligence is shared within businesses and organizations, while most threat intelligence is still shared over the Internet in unstructured form. Finally, the threat intelligence field has no universal naming convention, which greatly complicates entity recognition and attribution: for marketing purposes, security vendors and threat intelligence providers are keen to give threat organizations their own names. For example, the same Russian threat organization has multiple names: APT28, Fancy Bear, Sofacy, Sednit, STRONTIUM, Pawn Storm. This increases both the number of entities to be recognized and the difficulty of recognition. Developing and exploiting unstructured threat intelligence with computer techniques is therefore of practical significance.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a threat intelligence entity relation extraction method for unstructured data, which accurately extracts cyber threat intelligence entity relations from unstructured text on the basis of a threat intelligence named entity recognition method based on data enhancement and BERT, together with a threat intelligence entity relation extraction method that fuses multiple kinds of entity information.
In order to achieve the above object, the present invention adopts the following technical scheme:
a threat intelligence entity relation extraction method for unstructured data comprises the following three parts:
1) Threat intelligence entity extraction, comprising the steps of:
s1: defining a relationship between threat entity types and threat information entities based on the STIX threat information criteria;
s2: constructing an NER original annotation data set and a vocabulary knowledge base in the threat information field;
s3: searching sentences containing entities of the type to be enhanced in the original data set as template sentences, filling the same type entities in the vocabulary knowledge base of the threat information field into the template sentences to generate new sentences containing the entities of the specific type, and adding the newly generated sentences into the NER original data set;
s4: filling a template sentence: converting the template sentences into BIO labeling modes, taking a labeling result and a threat information field vocabulary knowledge base as input, generating and outputting the sentences filled with the templates through a template sentence filling algorithm, and forming an enhanced data set by the output sentences;
s5: performing entity extraction on sentences in the enhanced data set output in the step S4 by utilizing a BERT-BiLSTM-CRF model, wherein the BERT layer is responsible for dynamically generating word vectors for each input word according to the context of the input word, and the generated word vector sequence is used as the input of the BiLSTM layer; the BiLSTM layer is responsible for encoding the time relation of the input sequence and outputting a hidden state sequence; the CRF layer decodes the hidden state sequence to obtain a tag sequence corresponding to the sentence, and the obtained tag sequence is the entity type;
2) After the entity extraction is completed, threat information relation extraction is carried out, and the method comprises the following steps:
p1: extracting sentences containing entity relations from the original annotation data set;
p2: extracting threat information entity relation: extracting sentences by taking the threat information text strings in the original annotation data set, the entity list obtained in the entity extraction step and the relation list defined in the step S1 as inputs, outputting a start index and an end index of each sentence, head entity information and tail entity information, and extracting threat information entity relation by utilizing the output information;
3) After the entity extraction and the relation extraction are all completed, the extracted entity and relation information are input into a graph database to construct a threat intelligence knowledge graph, wherein the graph database uses a Neo4j database.
Further, the entity types in the step S1 include 13 classes, namely Threat Actor, Campaign, Malware, Technique, Tool, Identity, Location, Industry, Vulnerability, Course of Action, URL, Domain and IP; there are 7 relations between threat intelligence entities, namely use, attack, source, similarity, same, possession and response.
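As an illustration of the schema defined in step S1, the following minimal Python sketch encodes the 13 entity classes and 7 relation classes as constants that later steps (BIO tagging, relation classification) can share; the constant names and BIO tag convention are assumptions for illustration, not part of the claimed method.

    # Illustrative encoding of the entity and relation schema defined in step S1.
    # The 13 entity classes and 7 relation classes mirror the lists above; the
    # constant names and BIO tag convention are assumptions for illustration.

    ENTITY_TYPES = [
        "ThreatActor", "Campaign", "Malware", "Technique", "Tool",
        "Identity", "Location", "Industry", "Vulnerability",
        "CourseOfAction", "URL", "Domain", "IP",
    ]

    RELATION_TYPES = ["use", "attack", "source", "similarity", "same", "possession", "response"]

    # BIO tag set derived from the entity types (used when template sentences are
    # converted to BIO annotation in step S4).
    BIO_TAGS = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]

    RELATION2ID = {r: i for i, r in enumerate(RELATION_TYPES)}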
Further, the original data set in the step S2 is derived from an unstructured APT report, and the APT report text is manually marked to obtain an original marked data set; and carrying out data enhancement on the original annotation data set through a template sentence filling algorithm to obtain an enhanced data set.
Further, the template sentence filling algorithm in the step S4 specifically includes the following steps:
s4.1, converting sentences in the training set into BIO labeling modes;
s4.2, taking the BIO labeling result of the template sentence and the vocabulary knowledge base in the threat information field as inputs, and respectively acquiring the word and the label of each line in the BIO labeling result of the template sentence;
S4.3, if the label is O, the word and the label are spliced and then stored in a list; if the label is not O, a domain vocabulary item is obtained from the knowledge base and it is judged whether the item consists of one word or of several words;
s4.4, if the domain vocabulary consists of one word, the domain vocabulary is spliced with the corresponding tag and then stored in the list, and if the domain vocabulary consists of a plurality of words, the first word of the domain vocabulary is spliced with the B-tag and then stored in the list, and the second and subsequent words of the domain vocabulary are spliced with the I-tag and then stored in the list; finally, a sentence list composed of sentences generated by the filling template is returned and output.
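The following Python sketch illustrates the template sentence filling procedure of steps S4.1 to S4.4, assuming the BIO annotation is available as a list of (word, tag) pairs and the domain knowledge base is a dictionary mapping entity types to candidate vocabulary items; the function name, data layout and the example entities at the end are illustrative assumptions.

    import random

    def fill_template(bio_rows, knowledge_base, rng=random):
        """Fill one BIO-annotated template sentence with same-type entities drawn from
        the domain vocabulary knowledge base (steps S4.2-S4.4). bio_rows is a list of
        (word, tag) pairs; knowledge_base maps an entity type, e.g. "Malware", to a
        list of domain words or phrases. Returns the new sentence as (word, tag) pairs."""
        out, i = [], 0
        while i < len(bio_rows):
            word, tag = bio_rows[i]
            if tag == "O":
                out.append((word, tag))          # S4.3: non-entity tokens are copied unchanged
                i += 1
                continue
            ent_type = tag.split("-", 1)[1]      # e.g. "B-Malware" -> "Malware"
            j = i + 1                            # skip the remaining tokens of the original entity span
            while j < len(bio_rows) and bio_rows[j][1] == f"I-{ent_type}":
                j += 1
            candidates = knowledge_base.get(ent_type)
            if not candidates:                   # no vocabulary for this type: keep the original span
                out.extend(bio_rows[i:j])
            else:                                # S4.4: re-annotate a same-type vocabulary item
                replacement = rng.choice(candidates).split()
                out.append((replacement[0], f"B-{ent_type}"))
                out.extend((w, f"I-{ent_type}") for w in replacement[1:])
            i = j
        return out

    # Illustrative usage; the entity names are examples, not data from the patent.
    kb = {"Malware": ["Emotet", "Agent Tesla"], "ThreatActor": ["APT28"]}
    template = [("The", "O"), ("X-Agent", "B-Malware"), ("sample", "O"), ("was", "O"), ("deployed", "O")]
    print(fill_template(template, kb))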
Further, the step S5 specifically includes the following:
s5.1 inputting the sentence generated in the S4 into the BERT, and dynamically generating word vectors for each input word by the BERT according to the context of the word, wherein the word vectors are used for representing semantic information of the word;
s5.2, inputting the word vector generated in the S5.1 into a bidirectional LSTM coder, and encoding the time sequence relation of the input sequence from two directions (forward direction and backward direction) by the LSTM coder and outputting a hidden state sequence;
S5.3 decodes the hidden state sequence output in S5.2 using a CRF (conditional random field) and outputs the tag sequence corresponding to the sentence; for an output sequence R = {r_1, r_2, ..., r_n} of the context encoding module, the matching score between the input sequence and a candidate tag sequence y = {y_1, ..., y_n} is expressed as:
score(R, y) = Σ_{i=0..n} A_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i}
where P represents the feature (emission) matrix output by the BiLSTM layer, A represents the state transition score matrix of the CRF layer, and A_{y_i, y_{i+1}} represents the transition score from tag y_i to tag y_{i+1};
S5.4, applying softmax over all possible tag sequences of the input sequence R to obtain the predicted tag sequence y, the predicted tags being the corresponding entity types; the probability is defined as:
P(y | R) = exp(score(R, y)) / Σ_{y' ∈ Y_R} exp(score(R, y'))
where Y_R denotes the set of all possible tag sequences for R, and the prediction is the sequence with the highest probability;
S5.5 in model training, maximum log-likelihood estimation is used to obtain the loss function, expressed as:
Loss = -log P(y | R) = -score(R, y) + log Σ_{y' ∈ Y_R} exp(score(R, y'))
and finally mapping the prediction label obtained in the step S5.4 to the entity type and outputting.
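A minimal sketch of the BERT-BiLSTM-CRF tagger described in S5.1 to S5.5 is given below, assuming the HuggingFace transformers package and the pytorch-crf package (which supplies the transition matrix A, Viterbi decoding and the negative log-likelihood loss); the pre-trained model name and hidden size are illustrative choices rather than parameters fixed by the method.

    import torch
    import torch.nn as nn
    from transformers import BertModel
    from torchcrf import CRF  # pip install pytorch-crf

    class BertBiLstmCrf(nn.Module):
        """Sketch of the BERT-BiLSTM-CRF tagger of step S5; sizes are placeholders."""
        def __init__(self, num_tags, bert_name="bert-base-cased", lstm_hidden=256):
            super().__init__()
            self.bert = BertModel.from_pretrained(bert_name)        # S5.1: contextual word vectors
            self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                                  batch_first=True, bidirectional=True)  # S5.2: forward + backward encoding
            self.emission = nn.Linear(2 * lstm_hidden, num_tags)    # feature (emission) matrix P
            self.crf = CRF(num_tags, batch_first=True)              # transition matrix A (S5.3-S5.5)

        def forward(self, input_ids, attention_mask, tags=None):
            h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
            h, _ = self.bilstm(h)
            emissions = self.emission(h)
            mask = attention_mask.bool()
            if tags is not None:
                # S5.5: negative log-likelihood loss (maximum likelihood training)
                return -self.crf(emissions, tags, mask=mask, reduction="mean")
            # S5.4: decode the most probable tag sequence, then map tag ids to entity types
            return self.crf.decode(emissions, mask=mask)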
Further, the calculation unit of the LSTM encoder has three gate structures: an input gate, a forget gate, and an output gate;
the specific calculation process is as follows:
f_t = σ(W_f · [c_{t-1}, h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [c_{t-1}, h_{t-1}, x_t] + b_i)
o_t = σ(W_o · [c_{t-1}, h_{t-1}, x_t] + b_o)
where f_t, i_t, o_t and C_t represent the forget gate, input gate, output gate and cell state respectively; (W_f, W_i, W_o, W_c) and (b_f, b_i, b_o, b_c) represent the weight matrices and bias vectors of the forget gate, input gate, output gate and memory cell respectively; x_t and h_t represent the input vector and hidden-layer vector at time t; σ and tanh are activation functions.
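For illustration, a single LSTM time step following the three gate equations above can be sketched in NumPy as follows; the cell-state and hidden-state updates are the standard LSTM formulas, which the text implies but does not write out, and the dictionary-based weight layout is an assumption.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        """One LSTM time step. Each gate reads the concatenation [c_{t-1}, h_{t-1}, x_t]
        as in the equations above; W and b are dicts holding W_f/W_i/W_o/W_c and
        b_f/b_i/b_o/b_c. The cell-state and hidden-state updates are the standard
        LSTM formulas, implied but not written out in the text."""
        z = np.concatenate([c_prev, h_prev, x_t])
        f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate
        i_t = sigmoid(W["i"] @ z + b["i"])      # input gate
        o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
        c_hat = np.tanh(W["c"] @ z + b["c"])    # candidate memory content
        c_t = f_t * c_prev + i_t * c_hat        # new cell state
        h_t = o_t * np.tanh(c_t)                # new hidden-layer vector
        return h_t, c_t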
Further, the step P1 includes the following specific contents:
p1.1, inputting a full text character string, an entity list and a relation list;
p1.2, for each record in the relation list, acquiring two entities related to the relation from the entity list;
p1.3, setting a left pointer as a start index of a front entity in the two entities, and setting a right pointer as an end index of a rear entity in the two entities;
P1.4 enters a loop that terminates with the break keyword: while the left pointer is greater than zero, it checks whether the character it points to is the period that ends an English sentence; if so, the left pointer index is the start index of the target sentence; the end index of the target sentence is obtained in the same way by moving the right pointer forward;
and P1.5, returning and outputting a start index, an end index, head entity information and tail entity information of the target sentence.
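A minimal sketch of the sentence-localization procedure of P1.1 to P1.5, assuming entities carry character offsets into the full text and using a period as the end-of-sentence feature; the function name, entity representation and the omission of the outer loop over the relation list (P1.2) are simplifications for illustration.

    def extract_relation_sentence(text, head_entity, tail_entity):
        """Locate the sentence containing a head/tail entity pair (steps P1.3-P1.5).
        Entities are dicts with 'start'/'end' character offsets into text, an assumed
        representation; the outer loop over the relation list (P1.2) is omitted."""
        left = min(head_entity["start"], tail_entity["start"])
        right = max(head_entity["end"], tail_entity["end"])
        while left > 0:                       # P1.4: scan left for the previous sentence terminator
            if text[left] == ".":
                left += 1
                break
            left -= 1
        while right < len(text) - 1 and text[right] != ".":
            right += 1                        # symmetric scan right for the sentence end
        return left, right, head_entity, tail_entity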
Further, the step P2 includes the following specific contents:
p2.1 fuses the entity semantic information and entity boundary information output by P1.5, so that the boundary information and type information of the entity are embodied in labels on two sides of the entity;
p2.2 given a sentence s containing entities e1 and e2, input BERT, output vector H;
P2.3 averages the word vectors of the words constituting each entity to obtain the entity's characterization vector, sends the characterization vectors of the two entities through the tanh activation function and a fully connected layer, and denotes the outputs H′_1 and H′_2; the calculation process is expressed as:
H′_1 = W_1[tanh((1/(j-i+1)) Σ_{t=i..j} H_t)] + b_1
H′_2 = W_2[tanh((1/(y-x+1)) Σ_{t=x..y} H_t)] + b_2
where the vectors H_i to H_j are the word vectors of entity e1, the vectors H_x to H_y are the word vectors of entity e2, W_1 = W_2 ∈ R^(d×d), b_1 = b_2, and d is the BERT hidden-state vector size;
P2.4 sends the sentence vector corresponding to [CLS] through the tanh activation function and a fully connected layer, expressed as:
H′_0 = W_0[tanh(H_0)] + b_0
where W_0 ∈ R^(d×d), b_0 is a bias vector, and d is the BERT hidden-state vector size;
the [CLS] token denotes classification and is the token used by the BERT model for text classification tasks; it aggregates sentence-level information through the self-attention mechanism, and its corresponding vector is a sentence vector containing the semantic information of the sentence;
P2.5 concatenates H′_0, H′_1 and H′_2 and sends the result through a fully connected layer and a softmax layer; the process is expressed as:
H″ = W_3[concat(H′_0, H′_1, H′_2)] + b_3
p = softmax(H″)
where W_3 ∈ R^(L×3d), L is the number of relation categories, b_3 is a bias vector, and the output vector p ∈ R^L, each component of which corresponds to one relation category;
P2.6 outputs the relation corresponding to the largest component of p.
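A minimal PyTorch sketch of the relation-classification head of steps P2.2 to P2.6 follows, assuming the HuggingFace transformers package and mask tensors marking the token positions of the two entities; the layer names, pre-trained model name and pooling implementation are illustrative assumptions.

    import torch
    import torch.nn as nn
    from transformers import BertModel

    class RelationClassifier(nn.Module):
        """Sketch of the relation-classification head of P2.2-P2.6; e1_mask/e2_mask mark
        the token positions of the two entities with 1s."""
        def __init__(self, num_relations, bert_name="bert-base-cased"):
            super().__init__()
            self.bert = BertModel.from_pretrained(bert_name)
            d = self.bert.config.hidden_size
            self.fc_cls = nn.Linear(d, d)                  # W_0, b_0  (P2.4)
            self.fc_ent = nn.Linear(d, d)                  # W_1 = W_2, b_1 = b_2 (shared, P2.3)
            self.fc_out = nn.Linear(3 * d, num_relations)  # W_3, b_3  (P2.5)

        def forward(self, input_ids, attention_mask, e1_mask, e2_mask):
            H = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state

            def pool(mask):                                # average the entity's word vectors
                m = mask.unsqueeze(-1).float()
                return (H * m).sum(1) / m.sum(1).clamp(min=1.0)

            H0 = self.fc_cls(torch.tanh(H[:, 0]))          # [CLS] sentence vector -> H'_0
            H1 = self.fc_ent(torch.tanh(pool(e1_mask)))    # H'_1
            H2 = self.fc_ent(torch.tanh(pool(e2_mask)))    # H'_2
            logits = self.fc_out(torch.cat([H0, H1, H2], dim=-1))   # H''
            return logits.softmax(dim=-1)                  # p; argmax over p gives the relation (P2.6)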
The invention has the beneficial effects that:
1. Compared with existing data enhancement methods, the method increases the number of entities of three types (vulnerability, domain name and IP) and the sample diversity of two types of entities (attack organization and malware): sentences containing entities of the type to be enhanced are retrieved as template sentences, entities of the same type from the knowledge base are filled into the template sentences to generate new sentences containing entities of that type, and the newly generated sentences are added to the training set to realize data enhancement and improve semantic accuracy.
2. The method adds entity type information into the BERT sentence vector on the basis of fusing entity semantic information and entity boundary information in the threat information entity relation extraction task so as to help the model to better classify the relation.
Drawings
FIG. 1 is a full flow chart of threat intelligence entity relationship extraction in accordance with the present invention;
FIG. 2 is a flow chart of threat intelligence naming entity identification in accordance with the present invention;
FIG. 3 is a diagram of a threat intelligence naming entity identification model of the present invention;
FIG. 4 is a threat intelligence ontology diagram of the present invention;
FIG. 5 is a flow chart of threat intelligence relationship extraction in accordance with the present invention;
FIG. 6 is a diagram of a threat intelligence relationship extraction model of the invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in fig. 1, the present invention is a threat intelligence entity relationship extraction method for unstructured data, which includes three parts:
1) Threat information entity identification, the overall threat information entity identification flow is shown in fig. 2, the threat information identification model is shown in fig. 3, and the method comprises the following steps:
s1: defining a relationship between threat entity types and threat intelligence entities based on the STIX threat intelligence criteria, the defined entities and relationships being shown in fig. 4;
s2: constructing an NER original annotation data set and a vocabulary knowledge base in the threat information field;
s3: searching sentences containing entities of the type to be enhanced in the original annotation data set to serve as template sentences, filling the same type entities in the vocabulary knowledge base of the threat information field into the template sentences to generate new sentences containing the entities of the specific type, and adding the newly generated sentences into the NER original annotation data set;
as a preferred embodiment of the present invention, the template sentences described above, although reusable, still require control of the number of occurrences of each template sentence in the enhancement data to prevent overfitting. The number of occurrences of each template sentence in the enhancement data, N, is expressed as:
s4: filling a template sentence: converting the template sentences into BIO labeling modes, taking a labeling result and a threat information field vocabulary knowledge base as input, generating and outputting the sentences filled with the templates through a template sentence filling algorithm, and forming an enhanced data set by the output sentences;
s5: performing entity extraction on sentences in the enhanced data set output in the step S4 by utilizing a BERT-BiLSTM-CRF model, wherein the BERT layer is responsible for dynamically generating word vectors for each input word according to the context of the input word, and the generated word vector sequence is used as the input of the BiLSTM layer; the BiLSTM layer is responsible for encoding the time relation of the input sequence and outputting a hidden state sequence; the CRF layer decodes the hidden state sequence to obtain a tag sequence corresponding to the sentence, and the obtained tag sequence is the entity type;
2) After the entity identification is completed, threat information relation extraction is carried out, the whole threat information relation extraction flow is shown in fig. 5, a threat information extraction model is shown in fig. 6, and the method comprises the following steps:
p1: extracting sentences containing entity relations from the original threat information text;
p2: extracting threat information entity relation: extracting sentences by taking the threat information text strings in the original annotation data set, the entity list obtained in the entity extraction step and the relation list defined in the step S1 as inputs, outputting a start index and an end index of each sentence, head entity information and tail entity information, and extracting threat information entity relation by utilizing the output information;
3) After the entity extraction and the relation extraction are all completed, the extracted entity and relation information are input into a graph database to construct a threat intelligence knowledge graph, wherein the graph database uses a Neo4j database.
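As an illustration of step 3), the following sketch loads the extracted entities and relations into Neo4j with the official Python driver; the connection parameters, node label and property layout are assumed conventions, not requirements of the method.

    from neo4j import GraphDatabase  # official Neo4j Python driver

    def build_graph(uri, user, password, entities, relations):
        """Load extracted entities and relations into Neo4j (step 3). entities are
        (name, type) pairs and relations are (head, relation, tail) triples; the node
        label and property layout below are assumed conventions."""
        driver = GraphDatabase.driver(uri, auth=(user, password))
        with driver.session() as session:
            for name, etype in entities:
                session.run("MERGE (e:Entity {name: $name}) SET e.type = $etype",
                            name=name, etype=etype)
            for head, rel, tail in relations:
                session.run("MATCH (h:Entity {name: $head}), (t:Entity {name: $tail}) "
                            "MERGE (h)-[:RELATION {type: $rel}]->(t)",
                            head=head, rel=rel, tail=tail)
        driver.close()

    # e.g. build_graph("bolt://localhost:7687", "neo4j", "password",
    #                  [("APT28", "ThreatActor"), ("X-Agent", "Malware")],
    #                  [("APT28", "use", "X-Agent")])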
When the STIX model is applied to a threat intelligence ontology in the knowledge graph field, some of the threat entities in STIX 2.1 may not be applicable. Whether each of the 18 threat entity types in STIX 2.1 is suitable for the threat intelligence ontology proposed by the invention therefore needs to be decided.
Consider first Indicator, Infrastructure and Observed Data. Indicator represents an indicator of compromise (IOC) produced during an attack; Observed Data represents data that can be observed during the attack process, and APT reports are mainly organized around indicators of compromise; Infrastructure refers to a physical or virtual resource involved in an attack activity, and APT reports typically use IP addresses or domain names to describe the resources used by APT organizations, such as C2 servers. The three overlap semantically and all contain content related to indicators of compromise, so they are refined into the URL, Domain and IP entity types and grouped under the infrastructure-related category. Next are Threat Actor, Campaign and Intrusion Set. A Campaign is part of an Intrusion Set, and an APT report typically describes a certain campaign or a phase of one, so the two are merged into Campaign. Threat Actor, as the protagonist of a Campaign, has a participation relationship with it and is therefore retained.
Further, consider Attack Pattern, Malware, Malware Analysis and Tool. There are specialized malware analysis reports in the threat intelligence field, but APT reports usually contain only basic descriptions of malware names and functions rather than analyses of the malware, so Malware and Malware Analysis are merged into Malware. Attack Pattern, Malware and Tool all belong to the category of TTP; in APT reports, Attack Pattern is refined into a description of the techniques used by the APT organization and the corresponding attack procedure, so Technique is used instead of Attack Pattern, and Malware and Tool are retained.
Further, consider Course of Action, Vulnerability, Identity and Location. Course of Action describes methods and measures for dealing with TTPs and certain security vulnerabilities; Location is common information in APT reports and important information for the threat intelligence knowledge graph, covering the origin and target locations of entities such as APT organizations and malware; Identity can appear in APT reports in different roles, such as the APT organization or the attacked organization. These four threat entity types are therefore retained. In addition, Industry is a target of attacks by APT organizations and often appears together with locations, so Industry is added as a threat entity.
Further, consider Grouping, Report, Note and Opinion. The relationships between Grouping, Note, Opinion and other STIX objects are not explicitly defined in STIX 2.1, and these objects are better suited to scenarios such as threat intelligence exchange or collaborative threat analysis; they are not necessary for constructing a threat intelligence knowledge graph. Report is used in the invention in the form of APT report instances as the data source and is therefore not listed separately as a threat entity class. Thus these four threat entity classes are not retained.
As a preferred embodiment of the present invention, the entity types in step S1 include 13 classes, namely Threat Actor, Campaign, Malware, Technique, Tool, Identity, Location, Industry, Vulnerability, Course of Action, URL, Domain and IP; there are 7 relations between threat intelligence entities, namely use, attack, source, similarity, same, possession and response.
As a preferred embodiment of the present invention, the original data set in step S2 is derived from an unstructured APT report, and the APT report text is manually annotated to obtain an original annotated data set; and carrying out data enhancement on the original annotation data set through a template sentence filling algorithm to obtain an enhanced data set.
As a preferred embodiment of the invention, the threat intelligence domain vocabulary knowledge base consists of threat intelligence entity vocabularies; 5 knowledge bases are constructed in total, covering the entity types to be enhanced, namely attack organization, malware, vulnerability, domain name and IP.
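The five knowledge bases might, for example, be stored one file per entity type and loaded into the dictionary form used by the template sentence filling algorithm; the file names, JSON format and loader function below are illustrative assumptions.

    import json
    from pathlib import Path

    # Illustrative layout of the five domain-vocabulary knowledge bases; the file
    # names and JSON format are assumptions, not specified by the patent.
    KB_FILES = {
        "ThreatActor": "kb/attack_organization.json",
        "Malware": "kb/malware.json",
        "Vulnerability": "kb/vulnerability.json",
        "Domain": "kb/domain.json",
        "IP": "kb/ip.json",
    }

    def load_knowledge_base(root="."):
        """Load each entity-type vocabulary into the dict form used by the template
        sentence filling algorithm (each file is assumed to hold a JSON list of strings)."""
        kb = {}
        for etype, rel_path in KB_FILES.items():
            path = Path(root) / rel_path
            kb[etype] = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
        return kb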
As a preferred embodiment of the present invention, the template sentence filling algorithm of step S4 specifically includes the steps of:
s4.1, converting sentences in the training set into BIO labeling modes;
s4.2, taking the BIO labeling result of the template sentence and the vocabulary knowledge base in the threat information field as inputs, and respectively acquiring the word and the label of each line in the BIO labeling result of the template sentence;
S4.3, if the label is O, the word and the label are spliced and then stored in a list; if the label is not O, a domain vocabulary item is obtained from the knowledge base and it is judged whether the item consists of one word or of several words;
s4.4, if the domain vocabulary consists of one word, the domain vocabulary is spliced with the corresponding tag and then stored in the list, and if the domain vocabulary consists of a plurality of words, the first word of the domain vocabulary is spliced with the B-tag and then stored in the list, and the second and subsequent words of the domain vocabulary are spliced with the I-tag and then stored in the list; finally, a sentence list composed of sentences generated by the filling template is returned and output.
As a preferred embodiment of the present invention, step S5 specifically includes the following:
s5.1 inputting the sentence generated in the S4 into the BERT, and dynamically generating word vectors for each input word by the BERT according to the context of the word, wherein the word vectors are used for representing semantic information of the word;
the [ CLS ] in fig. 3 and 6 represents a classification, which is a label used by the BERT model for text classification tasks. [ CLS ] obtains sentence-level information representation by a self-attention mechanism, and its corresponding vector is a sentence vector containing sentence semantic information. The addition of [ CLS ] in the invention is the BERT input format requirement and is irrelevant to the named entity recognition task.
S5.2, inputting the word vector generated in the S5.1 into a bidirectional LSTM coder, and encoding the time sequence relation of the input sequence from two directions (forward direction and backward direction) by the LSTM coder and outputting a hidden state sequence;
S5.3 decodes the hidden state sequence output in S5.2 using a CRF (conditional random field) and outputs the tag sequence corresponding to the sentence; for an output sequence R = {r_1, r_2, ..., r_n} of the context encoding module, the matching score between the input sequence and a candidate tag sequence y = {y_1, ..., y_n} is expressed as:
score(R, y) = Σ_{i=0..n} A_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i}
where P represents the feature (emission) matrix output by the BiLSTM layer, A represents the state transition score matrix of the CRF layer, and A_{y_i, y_{i+1}} represents the transition score from tag y_i to tag y_{i+1};
S5.4, applying softmax over all possible tag sequences of the input sequence R to obtain the predicted tag sequence y, the predicted tags being the corresponding entity types; the probability is defined as:
P(y | R) = exp(score(R, y)) / Σ_{y' ∈ Y_R} exp(score(R, y'))
where Y_R denotes the set of all possible tag sequences for R, and the prediction is the sequence with the highest probability;
S5.5 in model training, maximum log-likelihood estimation is used to obtain the loss function, expressed as:
Loss = -log P(y | R) = -score(R, y) + log Σ_{y' ∈ Y_R} exp(score(R, y'))
and finally mapping the prediction label obtained in the step S5.4 to the entity type and outputting.
As a preferred embodiment of the present invention, the calculation unit of the LSTM encoder has three gate structures: an input gate, a forget gate, and an output gate;
the specific calculation process is as follows:
f_t = σ(W_f · [c_{t-1}, h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [c_{t-1}, h_{t-1}, x_t] + b_i)
o_t = σ(W_o · [c_{t-1}, h_{t-1}, x_t] + b_o)
where f_t, i_t, o_t and C_t represent the forget gate, input gate, output gate and cell state respectively; (W_f, W_i, W_o, W_c) and (b_f, b_i, b_o, b_c) represent the weight matrices and bias vectors of the forget gate, input gate, output gate and memory cell respectively; x_t and h_t represent the input vector and hidden-layer vector at time t; σ and tanh are activation functions.
As a preferred embodiment of the present invention, step P1 includes the following specific matters:
p1.1, inputting a full text character string, an entity list and a relation list;
p1.2, for each record in the relation list, acquiring two entities related to the relation from the entity list;
p1.3, setting a left pointer as a start index of a front entity in the two entities, and setting a right pointer as an end index of a rear entity in the two entities;
P1.4 enters a loop that terminates with the break keyword: while the left pointer is greater than zero, it checks whether the character it points to is the period that ends an English sentence; if so, the left pointer index is the start index of the target sentence; the end index of the target sentence is obtained in the same way by moving the right pointer forward;
and P1.5, returning and outputting a start index, an end index, head entity information and tail entity information of the target sentence.
As a preferred embodiment of the present invention, step P2 includes the following specific matters:
p2.1 fuses the entity semantic information and entity boundary information output by P1.5, so that the boundary information and type information of the entity are embodied in labels on two sides of the entity;
The labels are exemplified as follows: [E11:att] represents the left boundary of entity 1 and indicates that the entity belongs to the attacker type; [E12:att] represents the right boundary of entity 1 with the same type (a marker-insertion sketch is given after step P2.6 below);
p2.2 given a sentence s containing entities e1 and e2, input BERT, output vector H;
P2.3 averages the word vectors of the words constituting each entity to obtain the entity's characterization vector, sends the characterization vectors of the two entities through the tanh activation function and a fully connected layer, and denotes the outputs H′_1 and H′_2; the calculation process is expressed as:
H′_1 = W_1[tanh((1/(j-i+1)) Σ_{t=i..j} H_t)] + b_1
H′_2 = W_2[tanh((1/(y-x+1)) Σ_{t=x..y} H_t)] + b_2
where the vectors H_i to H_j are the word vectors of entity e1, the vectors H_x to H_y are the word vectors of entity e2, W_1 = W_2 ∈ R^(d×d), b_1 = b_2, and d is the BERT hidden-state vector size;
P2.4 sends the sentence vector corresponding to [CLS] through the tanh activation function and a fully connected layer, expressed as:
H′_0 = W_0[tanh(H_0)] + b_0
where W_0 ∈ R^(d×d), b_0 is a bias vector, and d is the BERT hidden-state vector size;
the [CLS] token denotes classification and is the token used by the BERT model for text classification tasks; it aggregates sentence-level information through the self-attention mechanism, and its corresponding vector is a sentence vector containing the semantic information of the sentence;
P2.5 concatenates H′_0, H′_1 and H′_2 and sends the result through a fully connected layer and a softmax layer; the process is expressed as:
H″ = W_3[concat(H′_0, H′_1, H′_2)] + b_3
p = softmax(H″)
where W_3 ∈ R^(L×3d), L is the number of relation categories, b_3 is a bias vector, and the output vector p ∈ R^L, each component of which corresponds to one relation category;
P2.6 outputs the relation corresponding to the largest component of p.
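The marker-insertion step referenced in P2.1 can be sketched as follows, assuming entities are given as token spans and using [E11:…]/[E12:…] for the first entity and, by extension, [E21:…]/[E22:…] for the second (the latter pair is an assumed naming; only the first pair appears in the example above); the short type codes are likewise illustrative.

    def add_entity_markers(tokens, e1_span, e1_type, e2_span, e2_type):
        """Wrap the two entities with typed boundary markers (P2.1), e.g. [E11:att] ... [E12:att].
        Spans are (start, end) token indices with end exclusive."""
        (s1, t1), (s2, t2) = e1_span, e2_span
        marks = [(s1, f"[E11:{e1_type}]"), (t1, f"[E12:{e1_type}]"),
                 (s2, f"[E21:{e2_type}]"), (t2, f"[E22:{e2_type}]")]
        out = list(tokens)
        # insert from the rightmost position so that earlier indices remain valid
        for pos, marker in sorted(marks, key=lambda x: x[0], reverse=True):
            out.insert(pos, marker)
        return out

    # e.g. add_entity_markers("APT28 used X-Agent".split(), (0, 1), "att", (2, 3), "mal")
    # -> ['[E11:att]', 'APT28', '[E12:att]', 'used', '[E21:mal]', 'X-Agent', '[E22:mal]']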
The above embodiments merely illustrate the design concept and features of the present invention and are intended to enable those skilled in the art to understand and implement it; the scope of protection of the present invention is not limited to these embodiments. Therefore, all equivalent changes or modifications made according to the principles and design ideas of the present invention fall within the scope of protection of the present invention.

Claims (8)

1. The threat intelligence entity relation extraction method for unstructured data is characterized by comprising the following three parts:
1) Threat intelligence entity extraction, comprising the steps of:
s1: defining a relationship between threat entity types and threat information entities based on the STIX threat information criteria;
s2: constructing an NER original annotation data set and a vocabulary knowledge base in the threat information field;
s3: searching sentences containing entities of the type to be enhanced in the original annotation data set to serve as template sentences, filling the same type entities in the vocabulary knowledge base of the threat information field into the template sentences to generate new sentences containing the entities of the specific type, and adding the newly generated sentences into the NER original annotation data set;
s4: filling a template sentence: converting the template sentences into BIO labeling modes, taking a labeling result and a threat information field vocabulary knowledge base as input, generating and outputting the sentences filled with the templates through a template sentence filling algorithm, and forming an enhanced data set by the output sentences;
s5: performing entity extraction on sentences in the enhanced data set output in the step S4 by using a BERT+BiLSTM+CRF model, wherein the BERT layer is responsible for dynamically generating word vectors for each input word according to the context of the input word, and the generated word vector sequence is used as input of the BiLSTM layer; the BiLSTM layer is responsible for encoding the time relation of the input sequence and outputting a hidden state sequence; the CRF layer decodes the hidden state sequence to obtain a tag sequence corresponding to the sentence, and the obtained tag sequence is the entity type;
2) After the entity extraction is completed, threat information relation extraction is carried out, and the method comprises the following steps:
p1: extracting sentences containing entity relations from the original annotation data set;
p2: extracting threat information entity relation: extracting sentences by taking the threat information text strings in the original annotation data set, the entity list obtained in the entity extraction step and the relation list defined in the step S1 as inputs, outputting a start index and an end index of each sentence, head entity information and tail entity information, and extracting threat information entity relation by utilizing the output information;
3) After the entity extraction and the relation extraction are all completed, the extracted entity and relation information are input into a graph database to construct a threat intelligence knowledge graph, wherein the graph database uses a Neo4j database.
2. The method for extracting threat intelligence entity relationship for unstructured data according to claim 1, wherein the entity types in step S1 comprise 13 classes, namely Threat Actor, Campaign, Malware, Technique, Tool, Identity, Location, Industry, Vulnerability, Course of Action, URL, Domain and IP; there are 7 relations between threat intelligence entities, namely use, attack, source, similarity, same, possession and response.
3. The threat intelligence entity relationship extraction method for unstructured data according to claim 1, wherein the original data set in step S2 is derived from an unstructured APT report, and the APT report text is manually labeled to obtain an original labeled data set; and carrying out data enhancement on the original annotation data set through a template sentence filling algorithm to obtain an enhanced data set.
4. The method for extracting threat intelligence entity relationship for unstructured data according to claim 1, wherein the template sentence filling algorithm in step S4 specifically comprises the following steps:
s4.1, converting sentences in the training set into BIO labeling modes;
s4.2, taking the BIO labeling result of the template sentence and the vocabulary knowledge base in the threat information field as inputs, and respectively acquiring the word and the label of each line in the BIO labeling result of the template sentence;
S4.3, if the label is O, the word and the label are spliced and then stored in a list; if the label is not O, a domain vocabulary item is obtained from the knowledge base and it is judged whether the item consists of one word or of several words;
s4.4, if the domain vocabulary consists of one word, the domain vocabulary is spliced with the corresponding tag and then stored in the list, and if the domain vocabulary consists of a plurality of words, the first word of the domain vocabulary is spliced with the B-tag and then stored in the list, and the second and subsequent words of the domain vocabulary are spliced with the I-tag and then stored in the list; finally, a sentence list composed of sentences generated by the filling template is returned and output.
5. The method for extracting threat intelligence entity relationship towards unstructured data according to claim 1, wherein the step S5 specifically comprises the following steps:
s5.1 inputting the sentence generated in the S4 into the BERT, and dynamically generating word vectors for each input word by the BERT according to the context of the word, wherein the word vectors are used for representing semantic information of the word;
s5.2, inputting the word vector generated in the S5.1 into a bidirectional LSTM coder, and encoding the time sequence relation of the input sequence from two directions (forward direction and backward direction) by the LSTM coder and outputting a hidden state sequence;
S5.3 decodes the hidden state sequence output in S5.2 using a CRF (conditional random field) and outputs the tag sequence corresponding to the sentence; for an output sequence R = {r_1, r_2, ..., r_n} of the context encoding module, the matching score between the input sequence and a candidate tag sequence y = {y_1, ..., y_n} is expressed as:
score(R, y) = Σ_{i=0..n} A_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i}
where P represents the feature (emission) matrix output by the BiLSTM layer, A represents the state transition score matrix of the CRF layer, and A_{y_i, y_{i+1}} represents the transition score from tag y_i to tag y_{i+1};
S5.4, applying softmax over all possible tag sequences of the input sequence R to obtain the predicted tag sequence y, the predicted tags being the corresponding entity types; the probability is defined as:
P(y | R) = exp(score(R, y)) / Σ_{y' ∈ Y_R} exp(score(R, y'))
where Y_R denotes the set of all possible tag sequences for R, and the prediction is the sequence with the highest probability;
S5.5 in model training, maximum log-likelihood estimation is used to obtain the loss function, expressed as:
Loss = -log P(y | R) = -score(R, y) + log Σ_{y' ∈ Y_R} exp(score(R, y'))
and finally mapping the prediction label obtained in the step S5.4 to the entity type and outputting.
6. The method for extracting threat intelligence entity relationship for unstructured data according to claim 5, wherein the computing unit of the LSTM encoder has three gate structures: an input gate, a forget gate, and an output gate;
the specific calculation process is as follows:
f_t = σ(W_f · [c_{t-1}, h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [c_{t-1}, h_{t-1}, x_t] + b_i)
o_t = σ(W_o · [c_{t-1}, h_{t-1}, x_t] + b_o)
where f_t, i_t, o_t and C_t represent the forget gate, input gate, output gate and cell state respectively; (W_f, W_i, W_o, W_c) and (b_f, b_i, b_o, b_c) represent the weight matrices and bias vectors of the forget gate, input gate, output gate and memory cell respectively; x_t and h_t represent the input vector and hidden-layer vector at time t; σ and tanh are activation functions.
7. The method for extracting threat intelligence entity relationship towards unstructured data according to claim 1, wherein the step P1 comprises the following specific contents:
p1.1, inputting a full text character string, an entity list and a relation list;
p1.2, for each record in the relation list, acquiring two entities related to the relation from the entity list;
p1.3, setting a left pointer as a start index of a front entity in the two entities, and setting a right pointer as an end index of a rear entity in the two entities;
P1.4 enters a loop that terminates with the break keyword: while the left pointer is greater than zero, it checks whether the character it points to is the period that ends an English sentence; if so, the left pointer index is the start index of the target sentence; the end index of the target sentence is obtained in the same way by moving the right pointer forward;
and P1.5, returning and outputting a start index, an end index, head entity information and tail entity information of the target sentence.
8. The method for extracting threat intelligence entity relationship towards unstructured data according to claim 1, wherein said step P2 comprises the following specific contents:
p2.1 fuses the entity semantic information and entity boundary information output by P1.5, so that the boundary information and type information of the entity are embodied in labels on two sides of the entity;
p2.2 given a sentence s containing entities e1 and e2, input BERT, output vector H;
P2.3 averages the word vectors of the words constituting each entity to obtain the entity's characterization vector, sends the characterization vectors of the two entities through the tanh activation function and a fully connected layer, and denotes the outputs H′_1 and H′_2; the calculation process is expressed as:
H′_1 = W_1[tanh((1/(j-i+1)) Σ_{t=i..j} H_t)] + b_1
H′_2 = W_2[tanh((1/(y-x+1)) Σ_{t=x..y} H_t)] + b_2
where the vectors H_i to H_j are the word vectors of entity e1, the vectors H_x to H_y are the word vectors of entity e2, W_1 = W_2 ∈ R^(d×d), b_1 = b_2, and d is the BERT hidden-state vector size;
P2.4 sends the sentence vector corresponding to [CLS] through the tanh activation function and a fully connected layer, expressed as:
H′_0 = W_0[tanh(H_0)] + b_0
where W_0 ∈ R^(d×d), b_0 is a bias vector, and d is the BERT hidden-state vector size;
the [CLS] token denotes classification and is the token used by the BERT model for text classification tasks; it aggregates sentence-level information through the self-attention mechanism, and its corresponding vector is a sentence vector containing the semantic information of the sentence;
P2.5 concatenates H′_0, H′_1 and H′_2 and sends the result through a fully connected layer and a softmax layer; the process is expressed as:
H″ = W_3[concat(H′_0, H′_1, H′_2)] + b_3
p = softmax(H″)
where W_3 ∈ R^(L×3d), L is the number of relation categories, b_3 is a bias vector, and the output vector p ∈ R^L, each component of which corresponds to one relation category;
P2.6 outputs the relation corresponding to the largest component of p.
CN202310323400.5A 2023-03-29 2023-03-29 Threat information entity relation extraction method for unstructured data Pending CN116450844A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310323400.5A CN116450844A (en) 2023-03-29 2023-03-29 Threat information entity relation extraction method for unstructured data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310323400.5A CN116450844A (en) 2023-03-29 2023-03-29 Threat information entity relation extraction method for unstructured data

Publications (1)

Publication Number Publication Date
CN116450844A true CN116450844A (en) 2023-07-18

Family

ID=87126715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310323400.5A Pending CN116450844A (en) 2023-03-29 2023-03-29 Threat information entity relation extraction method for unstructured data

Country Status (1)

Country Link
CN (1) CN116450844A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117195876A (en) * 2023-11-03 2023-12-08 北京航空航天大学 Construction method and system of automobile network threat information corpus
CN117236333A (en) * 2023-10-17 2023-12-15 哈尔滨工业大学(威海) Complex named entity identification method based on threat information
CN117786088A (en) * 2024-01-15 2024-03-29 广州大学 Threat language model analysis method, threat language model analysis device, threat language model analysis medium and electronic equipment
CN118410154A (en) * 2024-07-03 2024-07-30 中汽智联技术有限公司 Knowledge joint extraction method, system, storage medium and equipment for unstructured vehicle type data


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination