CN116756327A - Threat information relation extraction method and device based on knowledge inference and electronic equipment

Info

Publication number: CN116756327A
Application number: CN202311053640.4A
Authority: CN
Inventors: 董龙飞; 杨大路; 李衍; 刘广坤
Original assignee: Tianji Youmeng Zhuhai Technology Co ltd
Current assignee: Tianji Youmeng Zhuhai Technology Co ltd
Priority date: 2023-08-21
Filing date: 2023-08-21
Publication date: 2023-09-15
Anticipated expiration: 2043-08-21
Also published as: CN116756327B

Abstract

The invention discloses a threat intelligence relationship extraction method and device based on knowledge inference, and an electronic device, and belongs to the technical field of information security. The method comprises the following steps: acquiring head-tail entity pairs from the threat intelligence text from which relationships are to be extracted; acquiring the relationship between the head entity and the tail entity in each head-tail entity pair; constructing an initial relationship graph from the head-tail entity pairs and their relationships; performing scene classification on the threat intelligence text in which the head and tail entities are located, and determining the corresponding relationship chain ontology constraint rules according to the scene category; inferring, according to the relationship chain ontology constraint rules, the direct relationships between associated entity nodes on the initial relationship graph, and adding those direct relationships to the initial relationship graph to obtain a final relationship graph; and converting the final relationship graph into an edge table, and outputting the result in the form < head entity, relationship, tail entity >. The method ensures the quality and interpretability of relationship extraction and effectively improves the efficiency of relationship extraction in the threat intelligence field.

Description

Threat information relation extraction method and device based on knowledge inference and electronic equipment

Technical Field

The present invention relates to the field of information security technologies, and in particular, to a method, an apparatus, and an electronic device for extracting threat intelligence relationships based on knowledge inference.

Background

Threat intelligence relationship extraction refers to extracting valuable entities and their relationships from threat intelligence text, such as attackers, victims, attack behaviors, attack tools, and attack targets. These entities and relationships can form representative attack patterns for threat detection and threat hunting. Threat intelligence relationship extraction plays an important role in several ways. First, it converts unstructured threat intelligence text into structured data that machines can understand and process. This increases the availability of threat intelligence, so that more systems and tools can use it, and improves its operability, so that it can serve as a signal for threat detection and threat hunting and help security personnel quickly identify and respond to network attacks. Second, it allows more entities and relationships to be extracted from multiple threat intelligence sources, so that a more comprehensive and detailed attack behavior graph can be constructed. This increases the completeness of threat intelligence, so that it covers the various aspects and dimensions of an attack, and improves its depth, so that it reflects the logic, causality, and timing of an attack and helps security personnel better understand and analyze network attacks. Third, it enables automatic processing of large volumes of threat intelligence text, saving labor and time. This improves the efficiency with which threat intelligence is used, so that it can respond promptly to changes and developments in network attacks, and improves its effectiveness by reducing errors and omissions and increasing accuracy and reliability.

Currently, threat intelligence relationship extraction relies primarily on Natural Language Processing (NLP) techniques, such as Named Entity Recognition (NER), Semantic Role Labeling (SRL), and Relation Extraction (RE), which identify entities, behaviors, attributes, and the semantic relationships between them in unstructured threat intelligence text, thereby producing structured threat intelligence. However, several technical difficulties remain. First, threat intelligence text typically contains a large number of technical terms, acronyms, web addresses, and the like, and comes from different sources, such as technical reports, blogs, and news, so the structure, style, and quality of the text vary. This poses challenges for NLP-based threat intelligence relationship extraction and requires the text to be preprocessed, normalized, and standardized. Second, threat intelligence text often describes only some aspects of an attack and does not provide the complete attack process and details. It may also contain fuzzy, ambiguous, or erroneous information, which makes it difficult for NLP techniques to identify and extract entities and relationships accurately. Third, threat intelligence text is continuously updated as network attacks evolve, so the latest text must be collected, analyzed, and used promptly, and historical text must be updated and corrected to keep the data consistent and accurate. All of this makes NLP-based threat intelligence relationship extraction very difficult.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides the following technical scheme.

The first aspect of the invention provides a threat intelligence relationship extraction method based on knowledge inference, comprising the following steps:

acquiring head-tail entity pairs from the threat intelligence text from which relationships are to be extracted;

acquiring the relationship between the head entity and the tail entity in each head-tail entity pair;

constructing an initial relationship graph according to each head-tail entity pair and its relationship;

performing scene classification on the threat intelligence text in which each head and tail entity is located, and determining the corresponding relationship chain ontology constraint rule according to the scene category;

inferring, according to the relationship chain ontology constraint rule, a direct relationship between entity nodes having an association relationship on the initial relationship graph, and establishing the direct relationship on the initial relationship graph to obtain a final relationship graph;

and converting the final relationship graph into the storage form of an edge table, and obtaining and outputting results in the form < head entity, relationship, tail entity >.

Preferably, acquiring the head-tail entity pairs from the threat intelligence text from which relationships are to be extracted includes:

fine-tuning a RoBERTa model and using the fine-tuned model to extract the head-tail entity pairs.

Preferably, the acquiring the relationship between the head entity and the tail entity in the head-tail entity pair includes:

judging whether only one relation exists between a head entity and a tail entity in the ontology model;

if only one relationship exists, the relationship is used as the relationship between the head entity and the tail entity; otherwise, judging the category of the descriptive text between the head entity and the tail entity, and determining the relation between the head entity and the tail entity according to the category.

Preferably, determining the category of the descriptive text between the head entity and the tail entity includes: determining the category of the descriptive text between the head entity and the tail entity by using a text classification model pre-trained on security text.

Preferably, determining the category of the descriptive text between the head entity and the tail entity by using a text classification model pre-trained on security text includes:

vectorizing descriptive texts between the head entity and the tail entity to obtain vectorized texts;

classifying the categories of the vectorized text to obtain the categories of the descriptive text between the head entity and the tail entity.

Preferably, the threat intelligence relationship extraction method further includes: if the category of the descriptive text between the head entity and the tail entity cannot be judged, the relation with the largest occurrence probability between the head entity and the tail entity in the ontology model is taken as the relation between the head entity and the tail entity.

Preferably, performing scene classification on the threat intelligence text in which each head and tail entity is located includes: using a text classification model to classify the scene category of the threat intelligence text as the malware activity class, the ransomware analysis class, the vulnerability exploitation analysis class, the new malware variant class, or the general intelligence class.

Preferably, before the direct relation between the entity nodes with the association relation on the initial relation graph is deduced according to the relation chain ontology constraint rule, the method further comprises the following steps:

searching a security knowledge graph for each isolated entity node in the initial relationship graph;

if the isolated entity node is not found, deleting the isolated entity node from the initial relationship graph and recording a log, after which a security analyst manually analyzes the node to decide whether to add it to the security knowledge graph;

if the isolated entity node is found, expanding outward from it in the security knowledge graph according to a breadth-first search (BFS) strategy to obtain the entity nodes associated with it, determining whether each associated entity node exists in the initial relationship graph, and if so, establishing the relationship between the isolated entity node and the associated entity node in the initial relationship graph according to the information in the security knowledge graph.

Preferably, the deducing, according to the constraint rule of the relationship chain ontology, a direct relationship between entity nodes having an association relationship on the initial relationship graph includes:

calculating the transitive closure over the entity nodes on the initial relationship graph, and obtaining the entity nodes having association relationships on the initial relationship graph based on the transitive closure;

and calculating to obtain the direct relation between the entity nodes with the association relation on the initial relation graph according to the relation chain ontology constraint rule.

Preferably, after the obtaining the final relationship graph, before converting the final relationship graph into the stored form of the edge table, the method further includes:

enumerating relationships in the final relationship graph;

BFS searching is carried out based on two entity nodes related to the enumeration relation, so that a path containing the enumeration relation is obtained;

calculating the type of the enumeration relation in each path;

for all types of the enumeration relation obtained through calculation, if the type with the largest occurrence number is only one, the type with the largest occurrence number is selected to update the type of the enumeration relation, and if the type with the largest occurrence number is multiple, the type with the largest text probability in the path is selected to update the type of the enumeration relation.

The second aspect of the present invention provides a threat intelligence relationship extraction apparatus based on knowledge inference, comprising:

the head-tail entity pair acquisition module is used for acquiring head-tail entity pairs of threat information texts of the relation to be extracted;

the relationship acquisition module is used for acquiring the relationship between the head entity and the tail entity in the head-tail entity pair;

the initial relation diagram construction module is used for constructing an initial relation diagram according to each head-tail entity pair and the relation thereof;

the relationship chain ontology constraint rule determining module is used for performing scene classification on the threat intelligence text in which the head and tail entities are located, and determining the corresponding relationship chain ontology constraint rule according to the scene category;

the final relationship diagram construction module is used for deducing and obtaining a direct relationship between entity nodes with association relationship on the initial relationship diagram according to the relationship chain ontology constraint rule, and establishing the direct relationship on the initial relationship diagram to obtain a final relationship diagram;

and the relation output module is used for converting the final relation graph into a storage form of an edge table, obtaining and outputting the expression form of the head entity, the relation and the tail entity.

A third aspect of the invention provides a memory storing a plurality of instructions for implementing the method as described in the first aspect.

A fourth aspect of the invention provides an electronic device comprising a processor and a memory coupled to the processor, the memory storing a plurality of instructions loadable and executable by the processor to enable the processor to perform the method of the first aspect.

The beneficial effects of the invention are as follows: the threat intelligence relationship extraction method, device, and electronic device based on knowledge inference convert the open-domain relationship extraction problem into a relationship classification problem. In addition, the whole method performs inference based on knowledge, which ensures the quality and interpretability of relationship extraction and effectively improves the efficiency of relationship extraction in the threat intelligence field. The method effectively addresses the inaccurate relationship extraction caused by the complexity and diversity, incomplete knowledge, high uncertainty, and strongly dynamic and time-sensitive nature of threat intelligence text.

Drawings

FIG. 1 is a schematic flow chart of a threat intelligence relationship extraction method based on knowledge inference in the invention;

FIG. 2 is a final relationship diagram involved in an example of the present invention;

FIG. 3 is a schematic functional structure diagram of a threat intelligence relationship extraction apparatus based on knowledge inference according to the present invention.

Detailed Description

In order to better understand the above technical solutions, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.

The method provided by the invention can be implemented in a terminal environment, and the terminal can comprise one or more of the following components: processor, memory and display screen. Wherein the memory stores at least one instruction that is loaded and executed by the processor to implement the method described in the embodiments below.

The processor may include one or more processing cores. The processor connects the various parts of the terminal using various interfaces and lines, and performs the functions of the terminal and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory and by invoking data stored in the memory.

The memory may include random access memory (RAM) or read-only memory (ROM). The memory may be used to store instructions, programs, code, code sets, or instruction sets.

The display screen is used for displaying a user interface of each application program.

In addition, it will be appreciated by those skilled in the art that the structure of the terminal described above is not limiting and that the terminal may include more or fewer components, or may combine certain components, or a different arrangement of components. For example, the terminal further includes components such as a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and the like, which are not described herein.

Example 1

As shown in FIG. 1, an embodiment of the present invention provides a threat intelligence relationship extraction method based on knowledge inference, including: S101, acquiring head-tail entity pairs from the threat intelligence text from which relationships are to be extracted; S102, acquiring the relationship between the head entity and the tail entity in each head-tail entity pair; S103, constructing an initial relationship graph according to each head-tail entity pair and its relationship; S104, performing scene classification on the threat intelligence text in which the head and tail entities are located, and determining the corresponding relationship chain ontology constraint rule according to the scene category; S105, inferring, according to the relationship chain ontology constraint rule, a direct relationship between entity nodes having an association relationship on the initial relationship graph, and establishing that relationship between those entity nodes on the initial relationship graph to obtain a final relationship graph; S106, converting the final relationship graph into the storage form of an edge table, and obtaining and outputting results in the form < head entity, relationship, tail entity >.

In step S101, a fine-tuned RoBERTa model may be used to extract the head-tail entity pairs. Specifically, the RoBERTa model (A Robustly Optimized BERT Pretraining Approach) can be fine-tuned from the open-source model using internal proprietary data. Model fine-tuning means adjusting the parameters of a pre-trained model with new training data so that it adapts to a new task; it is typically faster and more resource-efficient than training a model from scratch. The fine-tuning adopted in the invention comprises the following steps: selecting a pre-trained model; collecting a new training data set; adding a new output layer to the pre-trained model; training the pre-trained model on the new training data set; and evaluating the performance of the resulting model. Specifically, this may include: collecting a labeled head-tail entity pair data set, which should contain texts and the corresponding head-tail entity pairs in each text, where the pairs can be labeled manually or with an automatic labeling tool; and fine-tuning the official pre-trained model on the collected data set. The fine-tuning may be done using the official code of the RoBERTa model or other open-source code. The threat intelligence text from which relationships are to be extracted is then input into the fine-tuned model, which outputs the head-tail entity pairs.

As one example, consider the following raw text: the L organization of country X impersonates a developer and, through social engineering, launches phishing attacks on a certain platform against employees of a science and technology company. Using the method provided by the invention, the head-tail entity pairs extracted from this threat intelligence text are: < L organization, social engineering > and < L organization, science and technology company >.

Step S102 is executed to obtain the relationship between the head entity and the tail entity in the head-tail entity pair, which may be implemented by the following steps: judging whether only one relation exists between a head entity and a tail entity in the ontology model; if only one relationship exists, the relationship is used as the relationship between the head entity and the tail entity; otherwise, judging the category of the descriptive text between the head entity and the tail entity, and determining the relation between the head entity and the tail entity according to the category.

The ontology model generally defines the attribute nodes that each entity should contain and the relationships that may exist between entities, and there is typically only one relationship, or a limited number of relationships, between a given head entity and tail entity. If there is only one relationship, it is used as the relationship between the head entity and the tail entity. For example, for the head-tail entity pair < threat actor, malware >, the ontology model shows that only the "use" relationship exists between the two entities, so "use" is taken directly as the relationship of the pair, and the resulting triplet is: < threat actor, use, malware >. If there is more than one relationship but only a limited number, the category of the descriptive text between the head entity and the tail entity determines which relationship applies, thereby achieving relationship extraction between the entities.
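The ontology lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `ONTOLOGY` dictionary, its entity-type names, and both function names are hypothetical and introduced here only for the example.

```python
# Hypothetical ontology fragment: (head type, tail type) -> allowed relations.
ONTOLOGY = {
    ("threat actor", "malware"): ["use"],
    ("hacking organization", "malware"): ["use", "create"],
    ("hacking organization", "geographical location"): ["attributed to"],
}


def candidate_relations(head_type, tail_type):
    """Return the relations the ontology allows between two entity types."""
    return ONTOLOGY.get((head_type, tail_type), [])


def relation_if_unique(head_type, tail_type):
    """Return the relation when exactly one is allowed, otherwise None.

    None signals that the descriptive text must be classified instead.
    """
    candidates = candidate_relations(head_type, tail_type)
    return candidates[0] if len(candidates) == 1 else None
```

With this fragment, the pair (threat actor, malware) resolves directly to "use", while (hacking organization, malware) returns `None` and would be passed on to text classification.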

As one example, consider the following raw text: researchers recently found that the Torili Trojan is active in the T organization's network attacks against country Y. The attackers compromise network devices such as domestic public-network routers and cameras to serve as springboards, install traffic forwarding tools, and steal the data and traffic of devices implanted with the Torili Trojan. The head-tail entity pair extracted from the raw text is: < T organization, Torili Trojan >. The partial constraints of the ontology model include: < hacking organization, attributed to, geographical location >, < hacking organization, use, malware >, and < hacking organization, create, malware >. According to these constraints, only the following two relationships can exist between the head-tail entity pair < T organization, Torili Trojan >: < hacking organization, use, malware > and < hacking organization, create, malware >. The descriptive text between the head-tail entity pair < T organization, Torili Trojan > is: the Torili Trojan is active in the T organization's network attacks against country Y.

The descriptive text between the head-tail entity pair is encoded using a BERT model trained on internal security data and then classified using a Softmax model to obtain its category. Given the head entity category "hacking organization" and the tail entity category "malware", the ontology model allows only two relationships between the pair: < hacking organization, use, malware > and < hacking organization, create, malware >. The relationship extraction problem for the pair is therefore converted into deciding whether the descriptive text between the head and tail entities belongs to the "use" category, the "create" category, or the unknown category. Text classification shows that the descriptive text between the head-tail entity pair < T organization, Torili Trojan > most probably belongs to the "use" category, so the relationship between the pair is "use". The resulting triplet is: < T organization, use, Torili Trojan >.

In the invention, the category of the descriptive text between the head entity and the tail entity can be determined as follows: a text classification model pre-trained on security text determines the category of the descriptive text between the head entity and the tail entity. Specifically, the descriptive text between the head entity and the tail entity is vectorized to obtain vectorized text, and the vectorized text is then classified to obtain the category of the descriptive text. More specifically, text vectorization can be performed with a SecBERT model trained on a large amount of security text, which effectively improves the accuracy of subsequent computation; after the text is vectorized by the SecBERT model, the descriptive text can be classified with a softmax model to obtain its category.
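The softmax classification step can be illustrated with a small sketch. The logits below are made-up numbers standing in for the output of a classification layer on top of the text encoder; the label set, the function names, and the values are assumptions for illustration only, and no encoder is actually run.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels):
    """Return (best label, probability per label) for one text."""
    probs = softmax(logits)
    by_label = dict(zip(labels, probs))
    best = max(by_label, key=by_label.get)
    return best, by_label


# Illustrative logits only; a real system would obtain them from the encoder.
label, probs = classify([2.1, 0.3, -0.5], ["use", "create", "unknown"])
```

Here `label` is the category with the highest probability, and `probs` keeps the full distribution so that ties or an "unknown" result can be detected afterwards.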

In a preferred embodiment of the present invention, if the category of the descriptive text between the head entity and the tail entity cannot be determined, the relationship with the highest occurrence probability between the head entity and the tail entity in the ontology model is taken as the relationship between them. The probabilities of the different relationships between a head entity and a tail entity are computed statistically from a large amount of historical data. When the descriptive text is classified as the unknown category, or when ranking by category probability yields two categories with the same probability, the relationship with the highest occurrence probability in the ontology model is selected. For example, if in the ontology model the probability of < hacking organization, use, malware > is 0.6 and the probability of < hacking organization, create, malware > is 0.2, the "use" relationship is selected.
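A minimal sketch of this fallback, assuming the classifier returns one probability per category: the `RELATION_PRIORS` table and the function `resolve_relation` are hypothetical names, and the prior values simply reuse the 0.6 and 0.2 figures from the example above.

```python
import math

# Hypothetical prior probabilities estimated from historical data.
RELATION_PRIORS = {
    ("hacking organization", "malware"): {"use": 0.6, "create": 0.2},
}


def resolve_relation(head_type, tail_type, class_probs):
    """Pick a relation from classifier output, falling back to priors.

    class_probs maps each category (including "unknown") to a probability.
    """
    ranked = sorted(class_probs.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top_prob = ranked[0]
    tied = len(ranked) > 1 and math.isclose(ranked[1][1], top_prob)
    if top_label != "unknown" and not tied:
        return top_label
    # Undetermined: use the relation with the highest prior probability.
    priors = RELATION_PRIORS[(head_type, tail_type)]
    return max(priors, key=priors.get)
```

The classifier result is used whenever it is decisive; the prior is consulted only for the unknown category or a tie between the two top categories.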

Step S103 is executed to construct an initial relationship graph according to each head-tail entity pair and its relationship, which may be implemented as follows: the triplets formed by the head-tail entity pairs and their relationships obtained in steps S101 and S102 are treated as an edge table, and the initial relationship graph is built from that edge table. Further, the relationships in the initial relationship graph can be checked against the ontology model: if the relationship between a head entity and a tail entity in the initial relationship graph is not in the set of relationships that the ontology model allows between that head entity and tail entity, the relationship is removed from the initial relationship graph.
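The graph construction and ontology check can be sketched as below. The adjacency-map representation, the `ALLOWED` table, and the entity-type labels (including "victim") are assumptions made for this illustration; the text does not prescribe a data structure.

```python
from collections import defaultdict

# Hypothetical ontology: allowed relations per (head type, tail type).
ALLOWED = {
    ("hacking organization", "malware"): {"use", "create"},
    ("malware", "victim"): {"attack"},
}


def build_initial_graph(triples, entity_types):
    """Build an adjacency map from (head, relation, tail) triples.

    Triples whose relation is not allowed by the ontology are dropped.
    """
    graph = defaultdict(dict)
    for head, relation, tail in triples:
        key = (entity_types.get(head), entity_types.get(tail))
        if relation in ALLOWED.get(key, set()):
            graph[head][tail] = relation
    return dict(graph)


# Edge table: the second triple violates the ontology and is removed.
triples = [
    ("T organization", "use", "Torili Trojan"),
    ("Torili Trojan", "use", "T organization"),
]
types = {"T organization": "hacking organization", "Torili Trojan": "malware"}
graph = build_initial_graph(triples, types)
```

Only the edge from "T organization" to "Torili Trojan" survives, because the ontology fragment defines no "use" relation from malware to a hacking organization.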

In step S104, performing scene classification on the threat intelligence text in which the head and tail entities are located may include: using a text classification model to classify the scene category of the threat intelligence text as the malware activity class, the ransomware analysis class, the vulnerability exploitation analysis class, the new malware variant class, or the general intelligence class. Text of the malware activity class needs to contain the malware associated with an attacker, the activity time, the infection chain, victim information, and the like. Text of the ransomware analysis class needs to contain basic information about the ransomware, such as its programming language, functions, infection chain, encryption algorithm, ransom note, and the file extension used after encryption. Text of the vulnerability exploitation analysis class needs to contain basic information about the vulnerability, such as the vulnerability number and the exploitation process. Text of the new malware variant class needs to contain the functions added by the new variant or the information that distinguishes it from the original malware. Text of the general intelligence class is the default category.

Specifically, scene classification can be implemented with a text classification model as follows: the input long-text security intelligence is first vectorized by a SecBERT model custom-trained on a security corpus, and the vectorized text is then classified into a scene by a softmax model.

After the scene category of the threat intelligence text in which the head-tail entity pair is located has been obtained, the corresponding relationship chain ontology constraint rule can be determined according to the scene category. A relationship chain ontology constraint rule (RCOR) is a rule for expressing constraints in a relationship chain ontology. A relationship chain ontology is an ontology based on relationship chains, where a relationship chain is a set of relationships between two instances. An RCOR may be used to express various constraints on a relationship chain, for example, that the relationship between two instances must meet a particular condition, or that the relationship between two instances must lie within a given range.

For example, the following RCOR expresses the constraint that the relationship between two instances must satisfy a particular condition: (x, r, y) -> (T(x), F(y)). In a malware analysis scenario, where x represents a hacking organization, r represents "attack", and y represents a victim, the constraints that < hacking organization, attack, victim > must satisfy are: T(x) = < hacking organization, use, malware > and F(y) = < malware, attack, victim >. In the ransomware analysis scenario, where x represents a hacking organization, r represents "extort", and y represents a victim, the constraints that < hacking organization, extort, victim > must satisfy are: T(x) = < hacking organization, create, malware > and F(y) = < malware, attack, victim >. RCORs can therefore express various constraints in the relationship chain ontology, which helps ensure that the data in the ontology is consistent and valid.

Specifically, determining the corresponding relationship chain ontology constraint rule according to the scene category may follow these examples. The entity classes and relationship constraints are defined based on the ontology model, such as: < hacking organization, use, malware >; < hacking organization, create, malware >; < malware, attack, victim >. In a malware analysis scenario, there are triplets of the following types: < hacking organization, use, malware > and < malware, attack, victim >. The relationship chain formed by these triplets is transitive, namely: < hacking organization, r, victim > -> (< hacking organization, use, malware >, < malware, attack, victim >). By the relationship chain ontology constraint rule, r can be inferred to be "attack". In the ransomware analysis scenario, there are triplets of the following types: < hacking organization, create, malware > and < malware, attack, victim >. The relationship chain formed by these triplets is transitive, namely: < hacking organization, r, victim > -> (< hacking organization, create, malware >, < malware, attack, victim >). By the relationship chain ontology constraint rule, r can be inferred to be "extort".
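One way to picture scene-dependent rules is a lookup table keyed by scene and by the pair of relations along a two-hop chain. The sketch below is an assumption about representation only: `CHAIN_RULES`, the scene labels, and the English relation names (for example "extort" for the ransom relation) are illustrative and not taken from the patent.

```python
# Hypothetical rule table: scene -> {(first relation, second relation): inferred}.
CHAIN_RULES = {
    "malware analysis": {("use", "attack"): "attack"},
    "ransomware analysis": {("create", "attack"): "extort"},
}


def infer_direct_relation(scene, first_relation, second_relation):
    """Infer r in (x, r, z) from the chain (x, first, y), (y, second, z).

    Returns None when the scene defines no rule for the chain.
    """
    return CHAIN_RULES.get(scene, {}).get((first_relation, second_relation))
```

The same chain can therefore yield different direct relations in different scenes, and a chain with no matching rule yields no inferred relation.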

In step S105, before the direct relationship between associated entity nodes on the initial relationship graph is inferred according to the relationship chain ontology constraint rule, the method may further include: retrieving each free (isolated) entity node of the initial relationship graph in a security knowledge graph; if the free entity node is not found, deleting it from the initial relationship graph and recording a log, so that a security analyst can manually decide whether to add the free entity node to the security knowledge graph; if the free entity node is found, expanding outward from it in the security knowledge graph using a BFS (Breadth First Search) strategy to obtain the entity nodes associated with the free entity node, judging whether each associated entity node exists in the initial relationship graph, and, if so, establishing the relationship between the free entity node and the associated entity node in the initial relationship graph according to the information in the security knowledge graph.
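A minimal sketch of this free-node completion step follows. It assumes that both graphs are given as lists of (head, relation, tail) triples plus a node set, that the BFS is bounded by a hypothetical `max_hops` parameter, and that only triples directly linking the free node to an associated node are copied from the security knowledge graph. The original text does not say how a multi-hop association is turned into a relationship, so that case is left out. All function and parameter names are illustrative.

```python
from collections import deque


def bfs_associated(kg_adj, start, max_hops):
    """Breadth-first expansion from `start` in the knowledge graph;
    returns the nodes reached within `max_hops` hops."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for nxt in kg_adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    seen.discard(start)
    return seen


def complete_free_nodes(nodes, triples, kg_triples, max_hops=1):
    """Return (nodes, triples, removed) after handling free nodes.
    `removed` stands in for the log that an analyst reviews."""
    kg_adj = {}
    for h, _, t in kg_triples:  # undirected adjacency, used only for expansion
        kg_adj.setdefault(h, set()).add(t)
        kg_adj.setdefault(t, set()).add(h)
    nodes, triples, removed = set(nodes), list(triples), []
    linked = {h for h, _, _ in triples} | {t for _, _, t in triples}
    for free in sorted(nodes - linked):
        if free not in kg_adj:
            nodes.discard(free)   # not in the knowledge graph: delete and log
            removed.append(free)
            continue
        associated = bfs_associated(kg_adj, free, max_hops) & nodes
        for h, r, t in kg_triples:
            if (h == free and t in associated) or (t == free and h in associated):
                triples.append((h, r, t))  # copy the relation from the knowledge graph
    return nodes, triples, removed
```

With a node set containing a hypothetical free node "ToolX" that the knowledge graph links to "APT32", the sketch adds that triple to the initial graph; a free node absent from the knowledge graph ends up in `removed`.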

After the free node information has been completed in this way, the direct relationship between associated entity nodes on the initial relationship graph can be inferred according to the relationship chain ontology constraint rule. This may be implemented as follows: computing the transitive closure among the entity nodes on the initial relationship graph to obtain the entity nodes that have an association relationship on the initial relationship graph; and computing the direct relationship between those associated entity nodes according to the relationship chain ontology constraint rule.

Specifically, the transitive closure between entity nodes on the initial relationship graph is computed with the Floyd algorithm (Floyd-Warshall), and the entity nodes that have an association relationship on the initial relationship graph are obtained from the transitive closure.
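As an illustration, the transitive closure can be computed with the boolean (Warshall) form of the Floyd-Warshall algorithm. The sketch below assumes the graph is given as a node list and a list of directed (head, tail) pairs; the function name and the nested-dictionary representation are illustrative choices, not part of the original text.

```python
def transitive_closure(nodes, edges):
    """Boolean Floyd-Warshall: reach[u][v] is True when v can be reached
    from u through one or more directed edges."""
    nodes = list(nodes)
    reach = {u: {v: False for v in nodes} for u in nodes}
    for u, v in edges:
        reach[u][v] = True
    for k in nodes:              # k is the allowed intermediate node
        for i in nodes:
            if not reach[i][k]:
                continue
            for j in nodes:
                if reach[k][j]:
                    reach[i][j] = True
    return reach


closure = transitive_closure(
    ["APT32", "Denis", "J country"],
    [("APT32", "Denis"), ("Denis", "J country")],
)
# closure["APT32"]["J country"] is True: the two nodes are associated.
```

The triple loop runs in O(n^3) time for n entity nodes, which is usually acceptable for the relationship graph of a single report but may need a different approach for very large graphs.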

As an example, suppose the relationships < APT32, uses, Denis > and < Denis, targets, J country > exist in the initial relationship graph. The transitive closure computation shows that there is an association between the two entity nodes "APT32" and "J country". The following relationship chain ontology constraint rule exists: the relationship < threat actor, targets, location > can be inferred from the relationships < threat actor, uses, malware > and < malware, targets, location >. According to this rule, the direct relationship < APT32, targets, J country > between the two entity nodes "APT32" and "J country" can be obtained.

Finally, the direct relationship is established on the initial relationship graph to obtain the final relationship graph.
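The APT32 example can be sketched end to end as follows. The sketch assumes a lookup from entity to entity type and a rule table keyed by two chained type-level triples; the type labels "threat actor", "malware" and "location" are assumed readings of the rule in the example, and the sketch only handles chains of exactly two hops. Names such as `add_inferred_edges` are illustrative.

```python
ENTITY_TYPE = {"APT32": "threat actor", "Denis": "malware", "J country": "location"}

# (type-level first hop, type-level second hop) -> inferred direct relation
RULES = {
    (("threat actor", "uses", "malware"),
     ("malware", "targets", "location")): "targets",
}


def add_inferred_edges(triples, entity_type, rules):
    """Return the triples of the final graph: the initial triples plus every
    direct relation that a two-hop chain and a matching rule imply."""
    final = list(triples)
    existing = {(h, t) for h, _, t in triples}
    for h, r1, mid in triples:
        for mid2, r2, t in triples:
            if mid != mid2 or h == t or (h, t) in existing:
                continue
            key = (
                (entity_type.get(h), r1, entity_type.get(mid)),
                (entity_type.get(mid), r2, entity_type.get(t)),
            )
            relation = rules.get(key)
            if relation is not None:
                final.append((h, relation, t))
                existing.add((h, t))
    return final


initial = [("APT32", "uses", "Denis"), ("Denis", "targets", "J country")]
final_graph = add_inferred_edges(initial, ENTITY_TYPE, RULES)
```

Here `final_graph` contains the two initial triples followed by the inferred triple ("APT32", "targets", "J country").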

In step S106, after the final relationship graph is obtained and before it is converted into the storage form of an edge table, the method may further include: enumerating the relationships in the final relationship graph; performing a BFS search from the two entity nodes associated with the enumerated relationship to obtain the paths that contain the enumerated relationship; computing the type of the enumerated relationship in each path; and, over all the types computed for the enumerated relationship, if only one type occurs most often, updating the type of the enumerated relationship to that type, and if several types tie for the most occurrences, updating the type of the enumerated relationship to the type from the path with the largest text probability.

As an example, in the final relationship graph shown in fig. 2, edge AB is the relationship being enumerated, and head entity node A and tail entity node B are the two entity nodes associated with the enumerated relationship. Using BFS, associated edges are expanded from the two entity nodes: first from node A, for example up to 5 hops outward from node A, and then from node B, for example up to 5 hops outward from node B. The specific process may be as follows:

Step 1: initialize a queue Q to be expanded, an associated leaf node set M of node A, and an associated leaf node set N of node B; add node A and node B to the queue Q.

Step 2: select a node from the queue Q and expand it. For example, expanding node A yields nodes E and F. Because node E is a leaf node, only node F is added to the queue Q, and node E is put into the associated leaf node set M of node A.

Step 3: repeat step 1 and step 2 until the queue Q is empty.

Step 4: compute the Cartesian product of set M and set N to obtain the head-tail entity pairs <E, D> and <G, D>, where node C is an intermediate node.

Step 5: search all paths between the head-tail entity pairs obtained in step 4 by BFS, and keep the paths that contain edge A->B.

Step 6: compute the relationship type of edge A->B in each path obtained in step 5. Here A may be a head entity such as the APT32 attack organization, and B may be a tail entity such as the Denis malware.

Step 7: based on the accumulated mass of entity and relationship pair data, such as <APT32, use, Denis>, <APT32, create, Denis> and <Denis, attack, J country>, use a trained HMM model (a well-known model) to determine, for each path, whether the relationship type of edge A->B is "use" or "create", such that the text probability of the whole path is maximized.

Step 8: over the relationship types computed for all paths, if only one type occurs most often, use that type to update the relationship type of edge A->B; if several types tie for the most occurrences, use the relationship type from the path with the largest text probability to update the relationship type of edge A->B.
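The final selection rule (majority vote, with ties broken by path text probability) can be sketched as below. The sketch assumes each path has already been scored, for example by the trained HMM mentioned above, and is passed in as a (relationship type, path text probability) pair; the HMM itself is not implemented here, and the function name is illustrative.

```python
from collections import Counter


def resolve_relation_type(path_results):
    """path_results: list of (relation_type, path_text_probability), one
    entry per path that contains the enumerated edge."""
    if not path_results:
        raise ValueError("at least one path is required")
    counts = Counter(rel for rel, _ in path_results)
    top = max(counts.values())
    winners = {rel for rel, count in counts.items() if count == top}
    if len(winners) == 1:
        return next(iter(winners))  # a single most frequent type
    # Tie: take the type from the path with the largest text probability,
    # considering only the tied types.
    tied = [item for item in path_results if item[0] in winners]
    best_rel, _ = max(tied, key=lambda item: item[1])
    return best_rel
```

For instance, three paths scored as ("use", 0.6), ("use", 0.5) and ("create", 0.9) resolve to "use" because it occurs most often, while two paths scored as ("use", 0.6) and ("create", 0.9) tie on count and resolve to "create".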

Finally, the final relationship graph is converted into the storage form of an edge table, and the < head entity, relationship, tail entity > representation is obtained and output.
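A minimal sketch of the edge-table conversion, assuming the final relationship graph is held as a nested dictionary of the form {head: {tail: relationship}}. This representation and the function name are assumptions made for illustration; the original text does not specify how the graph is stored in memory.

```python
def to_edge_table(graph):
    """Flatten {head: {tail: relation}} into a list of
    (head entity, relation, tail entity) rows."""
    return [
        (head, relation, tail)
        for head, neighbours in graph.items()
        for tail, relation in neighbours.items()
    ]


final_relationship_graph = {
    "APT32": {"Denis": "uses", "J country": "targets"},
    "Denis": {"J country": "targets"},
    "J country": {},
}
edge_table = to_edge_table(final_relationship_graph)
```

Each row of `edge_table` is one < head entity, relationship, tail entity > triple, so the table can be written directly to a relational table or a CSV file with three columns.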

Example two

As shown in fig. 3, another aspect of the present invention further includes a functional module architecture that corresponds fully to the foregoing method flow. That is, the embodiment of the present invention further provides a threat intelligence relationship extraction device based on knowledge inference, including: a head-tail entity pair acquisition module 201, configured to acquire the head-tail entity pairs of the threat intelligence text whose relationships are to be extracted; a relationship acquisition module 202, configured to acquire the relationship between the head entity and the tail entity in each head-tail entity pair; an initial relationship graph construction module 203, configured to construct an initial relationship graph according to each head-tail entity pair and its relationship; a relationship chain ontology constraint rule determination module 204, configured to perform scene classification on the threat intelligence text in which the head and tail entities are located, and to determine the corresponding relationship chain ontology constraint rule according to the scene category; a final relationship graph construction module 205, configured to infer, according to the relationship chain ontology constraint rule, the direct relationship between entity nodes that have an association relationship on the initial relationship graph, and to establish the direct relationship on the initial relationship graph to obtain a final relationship graph; and a relationship output module 206, configured to convert the final relationship graph into the storage form of an edge table, and to obtain and output the < head entity, relationship, tail entity > representation.

In the head-tail entity pair acquisition module, acquiring the head-tail entity pairs of the threat intelligence text whose relationships are to be extracted includes: fine-tuning a RoBERTa model for extraction of the head-tail entity pairs.

In the relationship acquisition module, acquiring the relationship between the head entity and the tail entity in the head-tail entity pair includes: judging whether only one relationship exists between the head entity and the tail entity in the ontology model; if only one relationship exists, taking that relationship as the relationship between the head entity and the tail entity; otherwise, judging the category of the descriptive text between the head entity and the tail entity, and determining the relationship between the head entity and the tail entity according to the category.

Judging the category of the descriptive text between the head entity and the tail entity includes: judging the category of the descriptive text between the head entity and the tail entity by using a text classification model trained in advance on security text.

Further, judging the category of the descriptive text between the head entity and the tail entity by using the text classification model trained in advance on security text includes: vectorizing the descriptive text between the head entity and the tail entity to obtain vectorized text; and classifying the vectorized text to obtain the category of the descriptive text between the head entity and the tail entity.

Further, if the category of the descriptive text between the head entity and the tail entity cannot be judged, the relationship with the largest occurrence probability between the head entity and the tail entity in the ontology model is taken as the relationship between the head entity and the tail entity.
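The three-stage decision described in the preceding paragraphs (single relationship in the ontology, then text classification, then most probable relationship as a fallback) can be sketched as follows. The ontology is assumed to map a (head type, tail type) pair to candidate relationships with occurrence probabilities; the probabilities shown are placeholders, and `keyword_classifier` is a toy stand-in for the pre-trained text classification model, not the model itself.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical ontology model: (head type, tail type) -> {relation: probability}.
ONTOLOGY: Dict[Tuple[str, str], Dict[str, float]] = {
    ("hacking organization", "malware"): {"use": 0.7, "create": 0.3},
    ("malware", "attacked"): {"attack": 1.0},
}


def keyword_classifier(text: str) -> Optional[str]:
    """Toy stand-in for the trained classifier; returns None when undecided."""
    lowered = text.lower()
    if "creat" in lowered or "develop" in lowered:
        return "create"
    if "use" in lowered or "deploy" in lowered:
        return "use"
    return None


def determine_relation(
    head_type: str,
    tail_type: str,
    text: str,
    ontology: Dict[Tuple[str, str], Dict[str, float]],
    classify: Callable[[str], Optional[str]],
) -> Optional[str]:
    candidates = ontology.get((head_type, tail_type), {})
    if not candidates:
        return None                      # the ontology allows no relationship
    if len(candidates) == 1:
        return next(iter(candidates))    # only one relationship exists
    label = classify(text)
    if label in candidates:
        return label                     # category decided from the text
    return max(candidates, key=candidates.get)  # fallback: most probable
```

Passing the classifier in as a callable keeps the decision logic independent of whichever model is used to vectorize and classify the descriptive text.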

In the relationship chain ontology constraint rule determination module, performing scene classification on the threat intelligence text in which the head and tail entities are located includes: using a text classification model to classify the scene category of the threat intelligence text as a malware activity category, a ransomware analysis category, a vulnerability exploitation analysis category, a new malware variant category, or a general intelligence category.

In the final relationship graph construction module, before the direct relationship between associated entity nodes on the initial relationship graph is inferred according to the relationship chain ontology constraint rule, the following is further included: retrieving each free (isolated) entity node of the initial relationship graph in a security knowledge graph; if the free entity node is not found, deleting it from the initial relationship graph and recording a log, so that a security analyst can manually decide whether to add the free entity node to the security knowledge graph; if the free entity node is found, expanding outward from it in the security knowledge graph using a BFS search strategy to obtain the entity nodes associated with the free entity node, judging whether each associated entity node exists in the initial relationship graph, and, if so, establishing the relationship between the free entity node and the associated entity node in the initial relationship graph according to the information in the security knowledge graph.

Further, inferring, according to the relationship chain ontology constraint rule, the direct relationship between entity nodes that have an association relationship on the initial relationship graph includes: computing the transitive closure between entity nodes on the initial relationship graph, and obtaining from the transitive closure the entity nodes that have an association relationship on the initial relationship graph; and computing the direct relationship between those associated entity nodes according to the relationship chain ontology constraint rule.

In the relationship output module, after the final relationship graph is obtained and before it is converted into the storage form of the edge table, the following is further included: enumerating the relationships in the final relationship graph; performing a BFS search from the two entity nodes associated with the enumerated relationship to obtain the paths that contain the enumerated relationship; computing the type of the enumerated relationship in each path; and, over all the types computed for the enumerated relationship, if only one type occurs most often, updating the type of the enumerated relationship to that type, and if several types tie for the most occurrences, updating the type of the enumerated relationship to the type from the path with the largest text probability.

The device may be implemented by the threat intelligence relationship extraction method based on knowledge inference provided in the first embodiment, and the specific implementation method may be referred to the description in the first embodiment, which is not repeated herein.

The invention also provides a memory storing a plurality of instructions for implementing the method according to embodiment one.

The invention also provides an electronic device comprising a processor and a memory coupled to the processor, the memory storing a plurality of instructions loadable and executable by the processor to enable the processor to perform the method of embodiment one.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A threat intelligence relationship extraction method based on knowledge inference, comprising:

2. The method for extracting threat intelligence relationship based on knowledge inference as claimed in claim 1, wherein the head-to-tail entity pair for obtaining threat intelligence text of the relationship to be extracted comprises:

fine-tuning a RoBERTa model for extraction of the head-tail entity pairs.

3. The knowledge-based inferred threat intelligence relationship extraction method of claim 1, wherein said obtaining a relationship between a head entity and a tail entity in said head-to-tail entity comprises:

4. The knowledge-based inferred threat intelligence relationship extraction method of claim 3, wherein said determining a category of descriptive text between a head entity and a tail entity comprises: judging the category of the descriptive text between the head entity and the tail entity by using a text classification model trained in advance on security text.

5. The knowledge-based inferred threat intelligence relationship extraction method of claim 4, wherein said determining the category of descriptive text between the head entity and the tail entity using a text classification model trained in advance on security text comprises:

6. The knowledge-based inferred threat intelligence relationship extraction method of claim 3, wherein the threat intelligence relationship extraction method further comprises: if the category of the descriptive text between the head entity and the tail entity cannot be judged, the relation with the largest occurrence probability between the head entity and the tail entity in the ontology model is taken as the relation between the head entity and the tail entity.

7. The knowledge-based inferred threat intelligence relationship extraction method of claim 1, wherein said scene classification of the threat intelligence text in which each of said head and tail entities is located comprises: using a text classification model to classify the scene category of the threat intelligence text as a malware activity category, a ransomware analysis category, a vulnerability exploitation analysis category, a new malware variant category, or a general intelligence category.

8. The method for extracting threat intelligence relationship based on knowledge inference as claimed in claim 1, wherein before the direct relationship between entity nodes having an association relationship on the initial relationship graph is inferred according to the relationship chain ontology constraint rule, further comprising:

9. The method for extracting threat intelligence relationship based on knowledge inference as claimed in claim 8, wherein said inferring a direct relationship between entity nodes having an association relationship on the initial relationship graph according to the relationship chain ontology constraint rule comprises:

10. The knowledge-based inferred threat intelligence relationship extraction method of claim 1, wherein after said obtaining a final relationship graph, before converting said final relationship graph into a stored form of an edge table, further comprises:

enumerating relationships in the final relationship graph;

calculating the type of the enumeration relation in each path;

11. A threat intelligence relationship extraction apparatus based on knowledge inference, comprising:

12. A memory, characterized in that a plurality of instructions for implementing the method according to any of claims 1-10 are stored.

13. An electronic device comprising a processor and a memory coupled to the processor, the memory storing a plurality of instructions that are loadable and executable by the processor to enable the processor to perform the method of any one of claims 1-10.