CN116662557A

CN116662557A - Entity relation extraction method and device in network security field

Info

Publication number: CN116662557A
Application number: CN202210141506.9A
Authority: CN
Inventors: 张静; 张海霞; 连一峰; 黄克振; 刘倩
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2022-02-16
Filing date: 2022-02-16
Publication date: 2023-08-29

Abstract

The invention discloses a method and a device for extracting entity relations in the field of network security. Based on the characteristics of the targets of interest in this field, the method enumerates all segments up to a set length in the sentences of multi-source heterogeneous network security data and generates a semantic matrix for each segment, thereby improving the accuracy of the entity recognition model. On this basis, the entity vectors are re-encoded, and the subject and object boundaries, the entity types, and the attribute features are added to the input of the relation extraction model, which yields more accurate relation extraction results and reduces error propagation. In addition, segments whose entity type cannot be recognized but which occur frequently are screened and judged, and are then added to the entity type set and the entity relation set; this continuous optimization and feedback improves the recognition breadth and accuracy of the model.

Description

Entity relation extraction method and device in network security field

Technical Field

The present invention relates to the field of network security, and in particular, to a method and an apparatus for extracting entity relationships in the field of network security.

Background

With the rapid development of Internet technology, network security events occur frequently, and a large amount of data in many different forms is generated every day, including event clues, threat intelligence, and security notifications. Rapidly and effectively extracting key information from these data and mining the potential relations among them can provide important technical support for threat intelligence analysis and network security defense. At present, key information extraction and potential relation mining generally combine entity recognition with entity relation extraction, in one of two main ways. One is the joint model, in which the entity model and the relation model are trained jointly. The other is the pipeline approach, in which the text is first input into the entity model to obtain the entities, and the entity pairs are then used as the input of the relation model to obtain the relation between the entities of each pair. The pipeline approach is flexible, but it suffers from error propagation: if the entity model makes errors when recognizing entities, the performance of the subsequent relation model is directly affected.

Disclosure of Invention

To accurately extract the entity relations contained in network security text data, the invention provides a method and a device that, on the basis of entity recognition, extract entity relations from entity vectors, so as to improve the accuracy of entity recognition and reduce the propagation of errors from the entity recognition stage.

The invention adopts the following technical scheme:

A method for extracting entity relations in the field of network security comprises the following steps:

acquiring multi-source heterogeneous network security data, and enumerating all substrings of each sentence in the network security data to obtain a segment set for each sentence; obtaining the word vectors of all words contained in each segment to form a semantic matrix for each segment;

inputting the semantic matrix of each segment of each sentence into a trained entity recognition model, wherein the entity recognition model consists of a two-layer feedforward neural network, and normalizing the recognition result with the normalized exponential function softmax to obtain all entities in each sentence and their corresponding entity types;

pairing the entities in the same sentence two by two to obtain a plurality of entity pairs, and re-encoding the vectors of the entity pairs: adding a subject-object boundary identifier and an entity type identifier for each entity, extracting the attribute features of each entity, obtaining semantic vectors of the subject-object boundary identifiers, the entity type identifiers, and the attribute features, adding them to the entity pair vector, and outputting the encoded entity pair vector;

inputting the encoded entity pair vector into a trained neural-network-based relation extraction model, and taking the relation type with the highest output probability after the softmax layer as the relation between the entities of the pair.

Further, word vectors for words are obtained by means of the pre-trained language model BERT.

Further, the semantic vectors of the subject-object boundary identifiers, the entity type identifiers, and the attribute features of the entities are obtained through the pre-trained language model BERT.

Further, the entity types include a general class, a network security personnel class, a network security organization class, a network security asset class, a network security system class, and a network security resource class.

Further, the entity types form an entity type set, which also includes an undetermined entity type item used to expand the set with entity types that are actually recognized but do not belong to the known entity types; the relations between the entities of each pair and the relations among different entity types form an entity relation set, which also includes an undetermined entity relation item used to expand the set with entity relations that are actually recognized but do not belong to the known entity relations.

Further, the segments that the entity recognition model cannot judge are filtered and recognized: the segments with the highest occurrence frequency in actual scenarios are screened out and judged, and, according to the needs of network security entity recognition, the entity type set is expanded with them through the undetermined entity type item.

Further, in the two-layer feedforward neural network of the entity recognition model, the activation function of the first hidden layer is the rectified linear unit (ReLU), and the number of neurons in the second layer equals the number of types in the entity type set.

Further, adding the subject-object boundary identifiers of the entities means marking the start word and the end word of the subject and of the object; specifically, corresponding identification symbols are added to the word vectors of the start word and the end word of the subject, and corresponding identification symbols are added to the word vectors of the start word and the end word of the object.

An entity relationship extraction apparatus in the field of network security comprises a memory and a processor, wherein a computer program is stored in the memory, and the processor realizes the steps of the method when executing the program.

A computer readable storage medium storing a computer program which when executed by a processor performs the steps of the method described above.

Based on the characteristics of the targets of interest in the network security field, the invention enumerates all segments up to a set length in the sentences of the multi-source heterogeneous network security data and generates a semantic matrix for each segment, thereby improving the accuracy of the entity recognition model. On this basis, the entity vectors are re-encoded, and the subject and object boundaries (position features), the entity types, and the attribute features are added to the input of the relation extraction model, which yields more accurate relation extraction results and reduces error propagation. In addition, segments whose entity type cannot be recognized but which occur frequently are screened and judged, and are then added to the entity type set and the entity relation set; this continuous optimization and feedback improves the recognition breadth and accuracy of the model.

Drawings

The invention is further described below with reference to the drawings and examples.

Fig. 1 is a flow chart of a method for entity relationship extraction in the field of network security according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.

Step one: define the set of entity types and the set of relation types that the task is intended to output.

After the multi-source heterogeneous network security data are acquired, the data are integrated and the entity types and relation types that may exist in the data are screened, in preparation for the subsequent entity recognition model and relation extraction model. The entity types to be recognized are initially defined as the following six classes:

1) General class: { "person", "place", "time", "facility", "location", "unit" }

2) Network security personnel class: { "hacker", "expert" }

3) Network security organization class: { "attack organization", "guard organization" }

4) Network security system class: { "system" }

5) Network security asset class: { "asset" }

6) Network security resource class: { "IP address", "web address", "domain name", "network identity account type", "harmful program", "vulnerability" }.

In addition, if the recognition result for a segment in a sentence does not belong to any of the entity types above, the segment is labelled $\varepsilon_e$. The entity type set is therefore defined as: { "person", "place", "time", "facility", "location", "unit", "hacker", "expert", "attack organization", "guard organization", "system", "asset", "IP address", "web address", "domain name", "network identity account type", "harmful program", "vulnerability", $\varepsilon_e$ }.

Considering both the relations within each entity type and the relations among different entity types, the relation set is defined as: { "same unit", "upper and lower level", "responsible", "same organization", "attack", "protection", "remote connection", "residing", "job", "utilization", "protection", "attribution", "implantation", "DNS resolution", "reverse DNS", "associated web site", $\varepsilon_r$ }, where $\varepsilon_r$ indicates that there is no relation between the two entities.

Step two: obtain a semantic matrix for each segment in the sentence.

The input sentence is denoted by $Z$ and consists of the $n$ words $z_1, z_2, z_3, \dots, z_n$. Enumerating all possible substrings of $Z$ gives the segment set of the sentence, defined as $S = \{s_1, s_2, s_3, \dots, s_m\}$, where each $s$ is a segment made up of consecutive words. To keep the set $S$ from growing too large, the number of words in each substring (i.e., segment) is at most a set value $L$. The word vector of each word $z_i$ in the sentence $Z$ is obtained from the pre-trained language model BERT, used as-is without further training; Google's open-source BERT-Base Chinese model or the Chinese-BERT-wwm model from Harbin Institute of Technology, among others, can be chosen. For the subsequent entity recognition task, the semantic matrix $h(s_i)$ of a segment $s_i$ is defined as the matrix formed, in order, by the word vectors of the consecutive words that make up $s_i$, its $t$-th component being the word vector of the $t$-th word in $s_i$.
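The enumeration in step two can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `enumerate_segments` and `semantic_matrix` and the toy two-dimensional word vectors are assumptions made for the example, and in practice the word vectors would come from BERT.

```python
from typing import List, Tuple


def enumerate_segments(words: List[str], max_len: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for every contiguous
    substring of `words` that contains at most `max_len` words."""
    spans = []
    n = len(words)
    for start in range(n):
        # A segment may not run past the end of the sentence.
        for end in range(start + 1, min(start + max_len, n) + 1):
            spans.append((start, end))
    return spans


def semantic_matrix(word_vectors: List[List[float]], span: Tuple[int, int]) -> List[List[float]]:
    """Stack the word vectors of the words inside `span`, in order."""
    start, end = span
    return [list(vec) for vec in word_vectors[start:end]]


# Toy sentence of n = 4 words with hypothetical 2-dimensional word vectors.
sentence = ["z1", "z2", "z3", "z4"]
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

spans = enumerate_segments(sentence, max_len=2)   # L = 2
matrices = [semantic_matrix(vectors, span) for span in spans]
```

With $n$ words and a cap of $L$ words per segment, the number of segments grows roughly as $n \cdot L$ rather than $n(n+1)/2$, which is the reason the cap keeps the set $S$ manageable.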

Step three: perform the entity recognition task on each segment in the sentence.

The entity recognition model consists of a two-layer feedforward neural network whose input is the semantic matrix of a segment. The first hidden layer has 100 neurons and uses the rectified linear unit (ReLU) as its activation function; the number of neurons in the second layer equals the number of types in the entity type set defined in step one. An entity type is written $e \in \varepsilon$, where $\varepsilon$ is the entity type set. The output vector obtained after a segment $s_i$ is input to the neural network is defined as $y_1(h(s_i))$; the probability that segment $s_i$ belongs to entity type $e$ is therefore $P_e(e \mid s_i) = \mathrm{softmax}(y_1(h(s_i)))$. Once the entity recognition model has been trained, the entity type with the highest output probability is taken as the entity type of the input segment.
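The forward pass of such a classifier can be sketched in plain Python. Several details here are assumptions rather than statements of the patent: the text does not say how a variable-length semantic matrix is turned into a fixed-size input, so the sketch averages the word vectors (`mean_pool`); the weights are random rather than trained; and the word-vector size of 8 is a toy value. The 100 hidden neurons and the 19 output types (the set listed in step one, including the undetermined type) follow the description.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def softmax(logits: Vector) -> Vector:
    """Normalized exponential function, shifted by the maximum for stability."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def linear(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Affine layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def mean_pool(matrix: Matrix) -> Vector:
    """Assumption: average the word vectors to get a fixed-size input."""
    cols = len(matrix[0])
    return [sum(row[c] for row in matrix) / len(matrix) for c in range(cols)]


def classify_segment(h: Matrix, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Two-layer feedforward network followed by softmax over entity types."""
    hidden = relu(linear(mean_pool(h), w1, b1))
    return softmax(linear(hidden, w2, b2))


rng = random.Random(0)
DIM, HIDDEN, NUM_TYPES = 8, 100, 19  # DIM is a toy value


def rand_matrix(rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


w1, b1 = rand_matrix(HIDDEN, DIM), [0.0] * HIDDEN
w2, b2 = rand_matrix(NUM_TYPES, HIDDEN), [0.0] * NUM_TYPES

h = rand_matrix(3, DIM)                      # semantic matrix of a 3-word segment
probs = classify_segment(h, w1, b1, w2, b2)  # P_e(e | s_i) for every type e
predicted = max(range(NUM_TYPES), key=probs.__getitem__)
```

The index `predicted` is the position of the most probable entity type in the entity type set.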

When training the entity recognition model, parameters such as the number of layers of the neural network and the number of neurons in each layer are adjusted, the cross-entropy loss function is minimized, and the entity recognition model is continuously optimized; after training, the quality of the model is evaluated on the test set by the F1-score (the harmonic mean of precision and recall). Before training, the input training data must be labelled; specifically, the entities contained in the text and their corresponding entity types are marked. B denotes the start of an entity, I denotes a position inside an entity, and O denotes a word that does not belong to any type in the entity type set. For example, the sentence "San Francisco says it has suffered a cyberattack by a ransomware gang" (旧金山称受到勒索软件团伙的网络攻击) is labelled character by character as: 旧 B_LOC; 金 I_LOC; 山 I_LOC; 称 O; 受 O; 到 O; 勒 B_ATT; 索 I_ATT; 软 I_ATT; 件 I_ATT; 团 I_ATT; 伙 I_ATT; 的 O; 网 O; 络 O; 攻 O; 击 O. The letters after the underscore, "LOC" and "ATT", are the entity types.
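To make the labelling scheme concrete, the following sketch groups B/I/O tags back into entities. It assumes the sample sentence is the character-level Chinese sentence 旧金山称受到勒索软件团伙的网络攻击 with the tags given in the example; the helper name `decode_bio` is hypothetical.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group B_/I_ tagged tokens into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity being built, if any.
        if current:
            entities.append(("".join(current), current_type or ""))

    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("_")
        if prefix == "B":
            flush()
            current, current_type = [token], etype
        elif prefix == "I" and current and etype == current_type:
            current.append(token)
        else:  # "O" or an inconsistent "I" tag ends the current entity
            flush()
            current, current_type = [], None
    flush()
    return entities


tokens = list("旧金山称受到勒索软件团伙的网络攻击")
tags = (["B_LOC", "I_LOC", "I_LOC"] + ["O"] * 3
        + ["B_ATT"] + ["I_ATT"] * 5 + ["O"] * 5)

entities = decode_bio(tokens, tags)
```

Here `entities` holds one LOC entity (the place name) and one ATT entity (the ransomware gang).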

Step four: re-encode the entity vectors.

Step three yields all the entities in the sentence $Z$ and their corresponding entity types, and pairing the entities two by two yields a plurality of entity pairs. The relation extraction model aims to obtain the relation $r_{ij} \in R$ between the entities of an input pair $(s_i, s_j)$, where the set $R$ is the relation set defined in step one; the number of neurons in the output layer of the neural network of the relation extraction model equals the number of relations in the defined relation set. To output the relation between the entities of a pair more accurately, the following three operations are performed:
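The pairing step can be sketched with `itertools.combinations`. The `Entity` record and its field names are assumptions made for the example; sorting by position before pairing means the first element of each pair is the entity that appears earlier in the sentence, which is the one the description goes on to treat as the subject.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    text: str
    start: int   # index of the entity's first word in the sentence
    etype: str   # entity type symbol


def make_entity_pairs(entities: List[Entity]) -> List[Tuple[Entity, Entity]]:
    """Pair the entities of one sentence two by two, earlier entity first."""
    ordered = sorted(entities, key=lambda e: e.start)
    return list(combinations(ordered, 2))


# Hypothetical recognition result for one sentence.
entities = [
    Entity("entity_b", 6, "ATT"),
    Entity("entity_a", 0, "LOC"),
    Entity("entity_c", 12, "ORG"),
]
pairs = make_entity_pairs(entities)
```

A sentence with $k$ recognized entities produces $k(k-1)/2$ candidate pairs, each of which is later classified by the relation extraction model.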

1. Distinguish the subject from the object within the entity pair.

The entity located nearer the beginning of the sentence is taken as the subject, and the entity located nearer the end of the sentence is taken as the object. Because an entity in a sentence consists of several words, the start word and the end word of the subject $s_i$ and of the object $s_j$ are marked: the identifier <S> is added to the word vector of the start word of the subject $s_i$, and the identifier </S> is added to the word vector of its end word; likewise, the identifier <O> is added to the word vector of the start word of the object $s_j$, and the identifier </O> is added to the word vector of its end word.

2. Add entity type identifiers to the input vector of the relation extraction model.

A unique symbol is defined for each entity type, and the entity type symbol is then added to the start word and the end word of the entity. For example, if the symbol of the "attack organization" type is "ORG", the identifier <ORG> is added to the start word of entity $s_i$, and the identifier </ORG> is added to its end word.
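One way to picture the boundary and type identifiers is as marker tokens placed around each entity before encoding. The sketch below takes that reading; it is an assumption, since the description only says the identifiers are added to the start and end words and does not fix how, and the nesting order (boundary marker outside, type marker inside) is likewise chosen for illustration. The function name `mark_entity_pair` and the "LOC" symbol for the object are hypothetical.

```python
from typing import List, Tuple

# (start index, end index exclusive, entity type symbol)
Span = Tuple[int, int, str]


def mark_entity_pair(tokens: List[str], subject: Span, obj: Span) -> List[str]:
    """Insert boundary and type markers around the subject and the object."""
    s_start, s_end, s_type = subject
    o_start, o_end, o_type = obj
    marked: List[str] = []
    for i, token in enumerate(tokens):
        if i == s_start:
            marked += ["<S>", f"<{s_type}>"]
        if i == o_start:
            marked += ["<O>", f"<{o_type}>"]
        marked.append(token)
        if i == s_end - 1:
            marked += [f"</{s_type}>", "</S>"]
        if i == o_end - 1:
            marked += [f"</{o_type}>", "</O>"]
    return marked


tokens = ["ransomware", "gang", "attacked", "the", "city"]
marked = mark_entity_pair(tokens, subject=(0, 2, "ORG"), obj=(4, 5, "LOC"))
```

Each marker would then be mapped to its own semantic vector, so the relation extraction model can see where each entity starts and ends and what type it has.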

3. Extract the attributes of the entities recognized by the entity recognition model, and add the attribute features to the entity pair vector.

Relation extraction depends not only on the entities and their types but, more importantly, on the attributes that characterize an entity of a given type. Therefore, syntactic analysis is performed on the two entities to be input to the relation extraction model to obtain the attribute features associated with each entity, and these attribute features are added to the entity pair vector as one of the input features of the relation extraction model. The attribute features describe the attributes of the corresponding entity; the attributes associated with each entity are annotated when the data set is labelled. For example, the "ransomware gang" entity in the example above may have attributes such as: country: "a certain country"; organization head: "a certain organization"; hold time: "January 2002". Adding the attribute features to the entity pair vector means inputting the attribute values "a certain country", "a certain organization", and "January 2002" into the BERT model to obtain their word vectors, and then concatenating these attribute word vectors after the entity pair vector.
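The concatenation of attribute vectors can be sketched as follows. The function `toy_embed` is a hypothetical stand-in for the BERT encoder and produces meaningless 4-dimensional vectors; only the splicing logic is the point of the example.

```python
from typing import Callable, List

Vector = List[float]


def toy_embed(text: str, dim: int = 4) -> Vector:
    """Hypothetical stand-in for BERT: a deterministic toy vector per string."""
    seed = sum(ord(ch) for ch in text)
    return [float((seed * (k + 1)) % 7) for k in range(dim)]


def append_attributes(pair_vector: Vector,
                      attribute_values: List[str],
                      embed: Callable[[str], Vector] = toy_embed) -> Vector:
    """Concatenate the vector of each attribute value after the pair vector."""
    extended = list(pair_vector)
    for value in attribute_values:
        extended.extend(embed(value))
    return extended


pair_vector = [0.5, -0.2, 0.1]  # toy entity pair vector
attributes = ["a certain country", "a certain organization", "January 2002"]
encoded = append_attributes(pair_vector, attributes)
```

Because entities can have different numbers of attributes, the resulting vectors differ in length; how the lengths are aligned (for example by padding or truncation) is not specified in the description.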

The identifier and the attribute used above may be passed through the pre-trained language model BERT to obtain the corresponding semantic vector.

Step five: use the entity pair vector as the input of the relation extraction model to obtain the relation of the entity pair.

The vector of entity $s_i$ after the re-encoding of step four is denoted $H(s_i)$. For an input entity pair $(s_i, s_j)$, the output vector of the neural network of the relation extraction model is defined as $y_2(H(s_i), H(s_j))$, so the probability that the relation of the entity pair $(s_i, s_j)$ is $r$ is: $P_r(r \mid s_i, s_j) = \mathrm{softmax}(y_2(H(s_i), H(s_j)))$. After the relation extraction model has been trained, the relation type with the highest output probability is taken as the relation between the input entities $s_i$ and $s_j$.

The relation extraction model is similar to the entity recognition model: it is a feedforward neural network whose loss function is the cross entropy, and it is trained by adjusting the parameters to minimize the cross-entropy loss. The training data must be labelled in the form <entity A, entity B, relation between A and B>, for example <ransomware gang, city, attack>.
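For reference, the training objective and the prediction rule described above can be written out as follows. The symbols $D$ (the set of labelled entity pairs) and $r^{*}_{ij}$ (the labelled relation of the pair) are notation introduced here for illustration and do not appear in the description.

```latex
\mathcal{L}_{\mathrm{rel}}
  = -\sum_{(s_i, s_j) \in D} \log P_r\!\left(r^{*}_{ij} \mid s_i, s_j\right),
\qquad
\hat{r}_{ij} = \arg\max_{r \in R} P_r\!\left(r \mid s_i, s_j\right)
```

Minimizing $\mathcal{L}_{\mathrm{rel}}$ raises the probability assigned to the labelled relation of each training pair, and $\hat{r}_{ij}$ is the relation reported at inference time.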

Step six: filter and recognize the types that the entity recognition model cannot judge, and expand the entity type set.

Segments whose type the entity recognition model cannot determine are labelled $\varepsilon_e$. To better support network security threat intelligence analysis and network security defense, these segments need to be filtered and recognized: those that occur most frequently in actual scenarios are screened out and judged, and the predefined entity type set is expanded according to the needs of network security entity recognition.
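The frequency screening in step six can be sketched with `collections.Counter`. The label string `"E_e"`, the function name, and the sample data are all assumptions made for the example; the final judgment of whether a frequent segment deserves a new entity type is described as a separate decision and is not automated here.

```python
from collections import Counter
from typing import List, Tuple


def top_unknown_segments(predictions: List[Tuple[str, str]],
                         unknown_label: str = "E_e",
                         k: int = 3) -> List[Tuple[str, int]]:
    """Return the k most frequent segments labelled with the undetermined type."""
    counts = Counter(text for text, label in predictions if label == unknown_label)
    return counts.most_common(k)


# Hypothetical (segment text, predicted type) pairs collected from many sentences.
predictions = [
    ("segment_a", "E_e"), ("segment_b", "E_e"), ("segment_a", "E_e"),
    ("segment_c", "IP address"), ("segment_a", "E_e"), ("segment_b", "E_e"),
    ("segment_d", "E_e"),
]
candidates = top_unknown_segments(predictions, k=2)
```

The returned candidates are the segments a reviewer would inspect first when deciding how to extend the entity type set.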

The above description describes the present invention in order to enable one skilled in the art to understand the present invention and to implement it according to the present invention, and is not intended to limit the scope of the present invention. All equivalent changes or modifications made according to the spirit of the present invention should be included in the scope of the present invention.

Claims

1. The entity relation extraction method in the network security field is characterized by comprising the following steps:

2. The method of claim 1, wherein word vectors for words are obtained by a pre-trained language model BERT.

3. The method of claim 1, wherein the semantic vectors of the subject-object boundary identifiers, the entity type identifiers, and the attribute features of the entities are obtained by a pre-trained language model BERT.

4. The method of claim 1, wherein the entity types include a general class, a network security personnel class, a network security organization class, a network security asset class, a network security system class, and a network security resource class.

5. The method according to claim 1 or 4, wherein the entity types form an entity type set, and the entity type set further includes an undetermined entity type item used to expand the set with entity types that are actually recognized but do not belong to the known entity types; and the relations between the entities of each pair and the relations among different entity types form an entity relation set, and the entity relation set further includes an undetermined entity relation item used to expand the set with entity relations that are actually recognized but do not belong to the known entity relations.

6. The method of claim 5, wherein the segments that the entity recognition model cannot judge are filtered and recognized, the segments with the highest occurrence frequency in actual scenarios are screened out and judged, and, according to the needs of network security entity recognition, the entity type set is expanded with them through the undetermined entity type item.

7. The method of claim 1, wherein in the two-layer feedforward neural network of the entity recognition model, the activation function of the first hidden layer is the rectified linear unit (ReLU), and the number of neurons in the second layer equals the number of types in the entity type set.

8. The method of claim 1, wherein adding the subject-object boundary identifiers of the entities means marking the start word and the end word of the subject and of the object, specifically adding corresponding identification symbols to the word vectors of the start word and the end word of the subject, and adding corresponding identification symbols to the word vectors of the start word and the end word of the object.

9. An entity-relationship extraction apparatus in the field of network security, comprising a memory and a processor, the memory having stored thereon a computer program, the processor implementing the steps of the method of any of claims 1-8 when the program is executed.

10. A computer readable storage medium storing a computer program which when executed by a processor performs the steps of the method of any one of claims 1-8.