CN116662557A - Entity relation extraction method and device in network security field - Google Patents

Info

Publication number
CN116662557A
Authority
CN
China
Prior art keywords
entity
network security
relation
word
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210141506.9A
Other languages
Chinese (zh)
Inventor
张静
张海霞
连一峰
黄克振
刘倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS
Priority to CN202210141506.9A
Publication of CN116662557A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

The invention discloses a method and a device for extracting entity relations in the field of network security. Based on the characteristics of the targets of interest in this field, the method enumerates all segments up to a set length in the sentences of multi-source heterogeneous network security data and generates a semantic matrix for each segment, thereby improving the accuracy of the entity recognition model. On this basis, the entity vectors are re-encoded: the subject and object boundaries, the entity types, and the attribute features of the entities are added to the input of the relation extraction model, which yields more accurate relation extraction and reduces error propagation. Further, segments whose entity type cannot be recognized but which occur frequently are screened and judged, then added to the entity type set and the entity relation set; this continuous optimization and feedback improves the recognition breadth and accuracy of the model.

Description

Entity relation extraction method and device in network security field
Technical Field
The present invention relates to the field of network security, and in particular, to a method and an apparatus for extracting entity relationships in the field of network security.
Background
With the rapid development of Internet technology, network security incidents occur frequently, and a large amount of data in many different forms is generated every day, including incident clues, threat intelligence, and security notifications. Rapidly and effectively extracting key information from these data and mining the latent relations among them can provide important technical support for threat intelligence analysis and network security defense. At present, key information extraction and latent relation mining generally combine entity recognition with entity relation extraction, in one of two main ways. The first is the joint model, in which the entity model and the relation model are trained jointly. The second is the pipeline approach, in which the text is first input to the entity model to obtain the entities, and the entity pairs are then used as input to the relation model to obtain the relation between the entities of each pair. The pipeline approach is flexible but suffers from error propagation: if the entity model makes errors when recognizing entities, the performance of the subsequent relation model is directly affected.
Disclosure of Invention
To accurately extract the entity relations contained in network security text data, the invention provides a method and a device that, on the basis of entity recognition, extract entity relations from re-encoded entity vectors, so as to improve the accuracy of entity recognition and reduce the propagation of entity recognition errors.
The invention adopts the following technical scheme:
A method for extracting entity relations in the network security field comprises the following steps:
acquiring multi-source heterogeneous network security data, and enumerating all substrings of each sentence in the network security data to obtain a segment set for each sentence; obtaining the word vectors of all words contained in each segment to form a semantic matrix for each segment;
inputting the semantic matrix of each segment of each sentence into a trained entity recognition model for recognition, wherein the entity recognition model consists of a two-layer feedforward neural network, and normalizing the recognition result with the normalized exponential function softmax to obtain all entities in each sentence and their corresponding entity types;
pairing the entities in the same sentence two by two to obtain a plurality of entity pairs, and re-encoding the vectors of the entity pairs: adding a subject-object boundary identifier and an entity type identifier for each entity, extracting the attribute features of each entity, obtaining semantic vectors of the subject-object boundary identifiers, the entity type identifiers, and the attribute features, adding them to the entity pair vectors, and outputting the encoded entity pair vectors;
inputting the encoded entity pair vectors into a trained neural-network-based relation extraction model, and taking the relation type with the largest output probability after the softmax layer as the relation between the entities of each pair.
Further, word vectors for words are obtained by means of the pre-trained language model BERT.
Further, semantic vectors of the subject-object boundary identifiers, the entity type identifiers, and the attribute features of the entities are obtained through the pre-trained language model BERT.
Further, the entity types include a general class, a network security personnel class, a network security organization class, a network security asset class, a network security system class, and a network security resource class.
Further, the entity types together form an entity type set, which also includes an undetermined entity type item used to expand the set with actually recognized entity types that do not belong to the known entity types; the relations within entity pairs and the relations among different entity types form an entity relation set, which also includes an undetermined entity relation item used to expand the set with actually recognized entity relations that do not belong to the known entity relations.
Further, the segments that the entity recognition model cannot classify are filtered and identified: the several segments with the highest occurrence frequency in the actual scenario are screened out and judged and, according to the requirements of network security entity recognition, used as undetermined entity type items to expand the entity type set.
Further, in the two-layer feedforward neural network of the entity recognition model, the activation function of the first hidden layer is the rectified linear unit, and the number of neurons in the second hidden layer equals the number of types in the entity type set.
Further, adding the subject-object boundary identifier of an entity means marking the start word and the end word of the subject and of the object: corresponding identification symbols are added to the word vectors of the start word and the end word of the subject, and corresponding identification symbols are added to the word vectors of the start word and the end word of the object.
An entity relationship extraction apparatus in the field of network security comprises a memory and a processor, wherein a computer program is stored in the memory, and the processor realizes the steps of the method when executing the program.
A computer readable storage medium storing a computer program which when executed by a processor performs the steps of the method described above.
Based on the characteristics of the targets of interest in the network security field, the invention enumerates all segments up to a set length in the sentences of multi-source heterogeneous network security data and generates a semantic matrix for each segment, thereby improving the accuracy of the entity recognition model. On this basis, the entity vectors are re-encoded: the subject and object boundaries (position features), the entity types, and the attribute features of the entities are added to the input of the relation extraction model, which yields more accurate relation extraction and reduces error propagation. Further, segments whose entity type cannot be recognized but which occur frequently are screened and judged, then added to the entity type set and the entity relation set; this continuous optimization and feedback improves the recognition breadth and accuracy of the model.
Drawings
The invention is further described below with reference to the drawings and examples.
Fig. 1 is a flow chart of a method for entity relationship extraction in the field of network security according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
Step one: define the set of entity types and the set of relation types that the task is expected to output.
After the multi-source heterogeneous network security data are acquired, integrating the data, screening entity sets and relation sets which possibly exist in the data, and preparing for a later entity identification model and a relation extraction model. The entity types that need to be identified are initially defined as the following 6-class entity type set:
1) General class: { "person", "place", "time", "facility", "location", "unit" }
2) Network security personnel: { "hacker", "expert" }
3) Network security organization class: { "attack organization", "guard organization" }
4) Network security system class: { "System" }
5) Network security assets: { "asset" }
6) Network security resource class: { "IP address", "web address", "domain name", "network identity account type", "harmful program", "vulnerability" }.
In addition, if the recognition result for a segment in a sentence does not belong to any of the entity types above, it is denoted $E_e$. The entity type set is therefore defined as: { "person", "place", "time", "facility", "location", "unit", "hacker", "expert", "attack organization", "guard organization", "system", "asset", "IP address", "web address", "domain name", "network identity account type", "harmful program", "vulnerability", $E_e$ }.
By comprehensively considering the relations among entities of the same type and the relations among different entity types, the relation set is defined as: { "same unit", "upper and lower level", "responsible", "same organization", "attack", "protection", "remote connection", "residing", "job", "utilization", "protection", "attribution", "implantation", "DNS resolution", "reverse DNS", "associated web site", $E_r$ }, where $E_r$ indicates that there is no relationship between the two entities.
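The entity type set of step one can be written down as a small lookup structure. The following is a minimal sketch in Python; the dictionary keys and the names `ENTITY_CLASSES`, `UNKNOWN_TYPE`, and `TYPE_TO_INDEX` are illustrative assumptions, not names used in the patent.

```python
# Entity type classes as listed in step one; the dictionary keys are illustrative names.
ENTITY_CLASSES = {
    "general": ["person", "place", "time", "facility", "location", "unit"],
    "personnel": ["hacker", "expert"],
    "organization": ["attack organization", "guard organization"],
    "system": ["system"],
    "asset": ["asset"],
    "resource": ["IP address", "web address", "domain name",
                 "network identity account type", "harmful program", "vulnerability"],
}
UNKNOWN_TYPE = "E_e"  # label for segments that belong to none of the known types

# Flatten the classes into one ordered list and append the unknown label.
ENTITY_TYPES = [t for types in ENTITY_CLASSES.values() for t in types] + [UNKNOWN_TYPE]

# Map each type to an output index of the entity recognition model.
TYPE_TO_INDEX = {t: i for i, t in enumerate(ENTITY_TYPES)}

print(len(ENTITY_TYPES))  # 19 types, including E_e
```

The length of `ENTITY_TYPES` is the quantity that step three ties to the size of the second layer of the entity recognition model.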
Step two: a semantic matrix is obtained for each segment in the sentence.
The input sentence is denoted $Z$ and consists of the $n$ words $z_1, z_2, z_3, \dots, z_n$. Enumerating all possible substrings of $Z$ yields its segment set, defined as $S = \{s_1, s_2, s_3, \dots, s_m\}$, where each $s$ is a segment made up of words. To keep the set $S$ from growing too large, each substring (i.e., segment) contains at most a set number $L$ of words. The word vector of each word in sentence $Z$ is obtained from a pre-trained language model BERT (no further training is required; options include Google's open-source BERT-Base Chinese model and the Chinese-BERT-wwm model from Harbin Institute of Technology); the word vector of word $z_i$ is written $\mathbf{x}_{z_i}$. For the subsequent entity recognition task, the semantic matrix of segment $s_i$ is defined as $h(s_i) = [\mathbf{x}^{(i)}_1, \mathbf{x}^{(i)}_2, \dots, \mathbf{x}^{(i)}_{|s_i|}]$, where segment $s_i$ consists of several consecutive words and $\mathbf{x}^{(i)}_t$ denotes the word vector of the $t$-th word in $s_i$.
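The segment enumeration of step two can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names `enumerate_segments` and `semantic_matrix` are hypothetical, and the word vectors are toy lists standing in for BERT outputs, which are not computed here.

```python
from typing import List, Tuple

def enumerate_segments(words: List[str], max_len: int) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for every contiguous
    segment of the sentence that contains at most max_len words."""
    spans = []
    for start in range(len(words)):
        last_end = min(start + max_len, len(words))
        for end in range(start + 1, last_end + 1):
            spans.append((start, end))
    return spans

def semantic_matrix(word_vectors: List[List[float]], span: Tuple[int, int]) -> List[List[float]]:
    """Stack the word vectors of the words inside the span, one row per word."""
    start, end = span
    return word_vectors[start:end]

words = ["z1", "z2", "z3", "z4"]
# Toy 2-dimensional vectors; in the described method these would come from BERT.
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

spans = enumerate_segments(words, max_len=2)
print(len(spans))                         # 7 segments for n = 4, L = 2
print(semantic_matrix(vectors, (1, 3)))   # rows for z2 and z3
```

With $n$ words and a cap of $L \le n$ words per segment, the enumeration produces $nL - L(L-1)/2$ segments, which is why the cap keeps the set $S$ manageable.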
Step three: perform the entity recognition task on each segment in the sentence.
The entity recognition model consists of a two-layer feedforward neural network whose input is the semantic matrix of a segment. The first hidden layer has 100 neurons and uses the rectified linear unit (ReLU) as its activation function. The number of neurons in the second hidden layer equals the number of types in the entity type set defined in step one. An entity type is written $e \in \varepsilon$, where $\varepsilon$ is the entity type set. The output vector obtained when segment $s_i$ is fed to the neural network is defined as $y_1(h(s_i))$, so the probability that segment $s_i$ belongs to entity type $e$ is: $P_e(e \mid s_i) = \mathrm{softmax}(y_1(h(s_i)))$. Once the entity recognition model has been trained, the entity type with the largest output probability is taken as the entity type of the input segment.
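The forward pass of such a two-layer network can be sketched in plain Python. Several details here are assumptions rather than part of the described method: the text does not say how a variable-length semantic matrix is turned into a fixed-size input, so this sketch mean-pools the rows; the weights are random and untrained; and the 4-dimensional word vectors are toy values.

```python
import math
import random
from typing import List

def relu(v: List[float]) -> List[float]:
    return [max(0.0, x) for x in v]

def softmax(v: List[float]) -> List[float]:
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights: List[List[float]], bias: List[float], x: List[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def mean_pool(matrix: List[List[float]]) -> List[float]:
    # Assumption: average the word vectors to get a fixed-size segment vector.
    return [sum(col) / len(matrix) for col in zip(*matrix)]

def classify_segment(matrix, w1, b1, w2, b2) -> List[float]:
    hidden = relu(linear(w1, b1, mean_pool(matrix)))  # first layer, ReLU
    return softmax(linear(w2, b2, hidden))            # second layer, one unit per type

rng = random.Random(0)
dim, hidden_size, num_types = 4, 100, 19  # 100 hidden units, 19 entity types
w1 = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(hidden_size)]
b1 = [0.0] * hidden_size
w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)] for _ in range(num_types)]
b2 = [0.0] * num_types

segment = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]  # toy semantic matrix
probs = classify_segment(segment, w1, b1, w2, b2)
predicted = max(range(num_types), key=probs.__getitem__)  # index of the most likely type
```

The index `predicted` would be mapped back to a type name through the entity type set of step one.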
When training the entity recognition model, parameters such as the number of layers of the neural network and the number of neurons in each layer are adjusted to minimize a cross-entropy loss function, and the model is optimized continuously; after training, the quality of the model is evaluated on the test set by the F1-score (the harmonic mean of precision and recall). Before training, the input training data must be labeled with the entities contained in the text and their corresponding entity types. For example, B marks the start of an entity, I marks a position inside an entity, and O marks a token that belongs to no type in the entity type set. Example, tagged character by character for a Chinese sentence meaning "San Francisco says it suffered a cyberattack by a ransomware gang": 旧 B_LOC; 金 I_LOC; 山 I_LOC; 称 O; 遭 O; 到 O; 勒 B_ATT; 索 I_ATT; 软 I_ATT; 件 I_ATT; 团 I_ATT; 伙 I_ATT; 的 O; 网 O; 络 O; 攻 O; 击 O. The letters after the underscore, "LOC" and "ATT", are the entity types.
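A labeled tag sequence like the one in the example can be turned into entity spans, and predictions can be scored with the F1-score. The sketch below is an illustration under assumptions: the helper names `decode_bio` and `f1_score` are hypothetical, tags are normalized to upper case, and scoring is done on exact span matches, which the text does not specify.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end exclusive, entity type)

def decode_bio(tags: List[str]) -> Set[Span]:
    """Collect entity spans from B_/I_/O tags such as 'B_LOC', 'I_LOC', 'O'."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
        prefix, _, label = tag.partition("_")
        continues = prefix == "I" and start is not None and label == etype
        if not continues:
            if start is not None:
                spans.add((start, i, etype))
            # Only a B tag opens a new entity; a stray I tag is ignored.
            start, etype = (i, label) if prefix == "B" else (None, None)
    return spans

def f1_score(gold: Set[Span], predicted: Set[Span]) -> float:
    """Harmonic mean of precision and recall over exactly matching spans."""
    true_positive = len(gold & predicted)
    if true_positive == 0:
        return 0.0
    precision = true_positive / len(predicted)
    recall = true_positive / len(gold)
    return 2 * precision * recall / (precision + recall)

# Tag sequence of the 17-token example: a 3-token LOC entity and a 6-token ATT entity.
tags = ["B_LOC", "I_LOC", "I_LOC", "O", "O", "O",
        "B_ATT", "I_ATT", "I_ATT", "I_ATT", "I_ATT", "I_ATT",
        "O", "O", "O", "O", "O"]
gold = decode_bio(tags)
print(sorted(gold))  # [(0, 3, 'LOC'), (6, 12, 'ATT')]
```

A prediction that recovers only the LOC span would have precision 1 and recall 0.5, giving an F1-score of 2/3.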
Step four: re-encode the entity vectors.
Step three yields all the entities in sentence $Z$ and their corresponding entity types, and pairing the entities two by two yields a plurality of entity pairs. The relation extraction model aims to obtain the relation $r_{ij} \in R$ between the entities of an input pair $(s_i, s_j)$, where the set $R$ is the relation set defined in step one; the number of neurons in the output layer of the relation extraction model's neural network equals the number of relations in that set. To output the relation between the entities of a pair more accurately, the following three operations are performed:
1. Distinguish the subject from the object within the entity pair.
The entity that appears earlier in the sentence is regarded as the subject, and the entity that appears later is regarded as the object. Because an entity $s_i$ in the sentence consists of several words, the start word and the end word of the subject $s_i$ and of the object $s_j$ are marked: the identification symbol <S> is added to the word vector of the start word of the subject $s_i$, and </S> to the word vector of its end word; the identification symbol <O> is added to the word vector of the start word of the object $s_j$, and </O> to the word vector of its end word.
2. Add entity type identifiers to the input vector of the relation extraction model.
A unique symbol identifier is defined for each entity type, and the entity type symbols are then added to the start word and the end word of the entity. For example, if the symbol of the "attack organization" type is "ORG", the identifier <ORG> is added to the start word of entity $s_i$ and the identifier </ORG> is added to the end word of entity $s_i$.
3. Extract the attributes of the entities recognized by the entity recognition model, and add the attribute features to the entity pair vector.
Relation extraction depends not only on the entities and their types; more importantly, it depends on which attributes characterize an entity of a given type. Therefore, the attribute features associated with each entity are obtained by syntactic analysis of the two entities to be input to the relation extraction model, and these attribute features are added to the entity pair vector as one of the input features of the relation extraction model. The attribute features describe the attributes of the corresponding entity, and the attributes associated with each entity are labeled when the data set is annotated. For example, the "ransomware gang" entity in the example above may have attributes such as: country: a certain country; organization head: a certain organization; hold time: January 2002. Adding the attribute features to the entity pair vector means inputting the attribute values "a certain country", "a certain organization", and "January 2002" into the BERT model to obtain word vectors, and then concatenating these attribute-value word vectors after the entity pair vector.
The identifiers and attributes used above can be passed through the pre-trained language model BERT to obtain their corresponding semantic vectors.
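The three operations of step four can be illustrated at the token level. This is a sketch under assumptions, not the patented encoding: the text adds identifiers to word vectors, whereas this example inserts marker tokens into the token sequence before any encoding; the English tokens, the type symbols "ORG" and "LOC", and the function names are hypothetical; and the two entities are assumed not to overlap.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Entity = Tuple[int, int, str]  # (start, end exclusive, type symbol)

def pair_entities(entities: List[Entity]) -> List[Tuple[Entity, Entity]]:
    """Pair entities two by two; the earlier entity of each pair is the subject."""
    return list(combinations(sorted(entities), 2))

def add_markers(tokens: List[str], subject: Entity, obj: Entity,
                attributes: Sequence[str] = ()) -> List[str]:
    """Wrap subject and object with boundary and type markers, then append
    the attribute values. Assumes the two entities do not overlap."""
    opening = {subject[0]: ["<S>", f"<{subject[2]}>"],
               obj[0]: ["<O>", f"<{obj[2]}>"]}
    closing = {subject[1] - 1: [f"</{subject[2]}>", "</S>"],
               obj[1] - 1: [f"</{obj[2]}>", "</O>"]}
    marked = []
    for i, token in enumerate(tokens):
        marked.extend(opening.get(i, []))  # markers before the start word
        marked.append(token)
        marked.extend(closing.get(i, []))  # markers after the end word
    return marked + list(attributes)

tokens = ["gang", "attacked", "San", "Francisco"]
entities = [(2, 4, "LOC"), (0, 1, "ORG")]

subject, obj = pair_entities(entities)[0]
marked = add_markers(tokens, subject, obj, attributes=["January 2002"])
print(marked)
```

In the described method, the marked sequence and the attribute values would then be passed through BERT to obtain the semantic vectors that make up the encoded entity pair vector.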
Step five: use the entity pair vectors as input to the relation extraction model to obtain the relation of each entity pair.
The vector of entity $s_i$ after the re-encoding of step four is denoted $H(s_i)$. For an input entity pair $(s_i, s_j)$, the output vector of the relation extraction model's neural network is defined as $y_2(H(s_i), H(s_j))$, so the probability that the pair $(s_i, s_j)$ has relation $r$ is: $P_r(r \mid s_i, s_j) = \mathrm{softmax}(y_2(H(s_i), H(s_j)))$. Once the relation extraction model has been trained, the relation type with the largest output probability is taken as the relation between the entities of the input pair $(s_i, s_j)$.
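The final decision of step five, picking the relation with the largest softmax probability, can be sketched as below. The three-element relation list is a shortened stand-in for the full relation set, the output values are made up for illustration, and `predict_relation` is a hypothetical helper name.

```python
import math
from typing import List, Tuple

RELATIONS = ["attack", "utilization", "E_r"]  # shortened stand-in; E_r means "no relation"

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_relation(logits: List[float], relations: List[str]) -> Tuple[str, float]:
    """Return the relation with the largest probability and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return relations[best], probs[best]

# Made-up output vectors y_2 for two entity pairs.
relation, prob = predict_relation([2.0, 0.5, 0.1], RELATIONS)
no_relation, _ = predict_relation([0.0, 0.0, 3.0], RELATIONS)
print(relation, round(prob, 3))
print(no_relation)
```

Because $E_r$ is one of the output classes, a pair with no relation is handled by the same argmax rather than by a separate threshold.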
The relation extraction model is similar to the entity recognition model: it is a feedforward neural network with a cross-entropy loss function, and training consists of adjusting the parameters to minimize that loss. The training data must be labeled in the form <entity A, entity B, relation between A and B>, for example <ransomware gang, city, attack>.
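For reference, the cross-entropy objective for the relation extraction model can be written out as follows. This is a sketch under assumptions: the text names the loss but does not give its formula, and the averaging over $N$ labeled entity pairs, the index $k$, and the symbol $\mathcal{L}_r$ are notational choices made here.

```latex
\mathcal{L}_r = -\frac{1}{N} \sum_{k=1}^{N} \log P_r\left(r^{(k)} \mid s_i^{(k)}, s_j^{(k)}\right)
```

Here $r^{(k)}$ is the labeled relation of the $k$-th training pair $(s_i^{(k)}, s_j^{(k)})$. The loss for the entity recognition model has the same form, with $P_e(e \mid s_i)$ in place of $P_r$.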
Step six: filter and identify the segments whose type the entity recognition model cannot determine, and expand the entity type set.
Segments whose type the entity recognition model cannot determine are labeled $E_e$. To better support network security threat intelligence analysis and network security defense, these segments need to be filtered and identified: the segments that occur most frequently in the actual scenario are screened out and judged, and the predefined entity type set is expanded according to the requirements of network security entity recognition.
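The frequency screening of step six can be sketched with a counter. The segment strings below are invented sample data, `top_unknown_segments` is a hypothetical helper name, and the judgment of whether a frequent segment deserves a new entity type is left to manual review, as the text does not automate it.

```python
from collections import Counter
from typing import List, Tuple

def top_unknown_segments(predictions: List[Tuple[str, str]], top_k: int,
                         unknown_label: str = "E_e") -> List[Tuple[str, int]]:
    """Count segments predicted as the unknown type and return the top_k
    most frequent ones as candidates for new entity types."""
    counts = Counter(segment for segment, label in predictions if label == unknown_label)
    return counts.most_common(top_k)

# Invented (segment, predicted type) pairs collected from recognized text.
predictions = [
    ("botnet", "E_e"), ("botnet", "E_e"), ("C2 server", "E_e"),
    ("San Francisco", "location"), ("botnet", "E_e"),
    ("C2 server", "E_e"), ("phishing kit", "E_e"),
]

candidates = top_unknown_segments(predictions, top_k=2)
print(candidates)  # [('botnet', 3), ('C2 server', 2)]
```

Candidates accepted after review would be added to the entity type set, and the models retrained with the enlarged output layer.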
The above description describes the present invention in order to enable one skilled in the art to understand the present invention and to implement it according to the present invention, and is not intended to limit the scope of the present invention. All equivalent changes or modifications made according to the spirit of the present invention should be included in the scope of the present invention.

Claims (10)

1. An entity relation extraction method in the network security field, characterized by comprising the following steps:
acquiring multi-source heterogeneous network security data, and enumerating all substrings of each sentence in the network security data to obtain a segment set for each sentence; obtaining the word vectors of all words contained in each segment to form a semantic matrix for each segment;
inputting the semantic matrix of each segment of each sentence into a trained entity recognition model for recognition, wherein the entity recognition model consists of a two-layer feedforward neural network, and normalizing the recognition result with the normalized exponential function softmax to obtain all entities in each sentence and their corresponding entity types;
pairing the entities in the same sentence two by two to obtain a plurality of entity pairs, and re-encoding the vectors of the entity pairs: adding a subject-object boundary identifier and an entity type identifier for each entity, extracting the attribute features of each entity, obtaining semantic vectors of the subject-object boundary identifiers, the entity type identifiers, and the attribute features, adding them to the entity pair vectors, and outputting the encoded entity pair vectors;
inputting the encoded entity pair vectors into a trained neural-network-based relation extraction model, and taking the relation type with the largest output probability after the softmax layer as the relation between the entities of each pair.
2. The method of claim 1, wherein word vectors for words are obtained by a pre-trained language model BERT.
3. The method of claim 1, wherein semantic vectors of the subject-object boundary identifiers, the entity type identifiers, and the attribute features of an entity are obtained by a pre-trained language model BERT.
4. The method of claim 1, wherein the entity types include a general class, a network security personnel class, a network security organization class, a network security asset class, a network security system class, and a network security resource class.
5. The method according to claim 1 or 4, wherein the entity types together form an entity type set, which also includes an undetermined entity type item used to expand the set with actually recognized entity types that do not belong to the known entity types; and the relations within entity pairs and the relations among different entity types form an entity relation set, which also includes an undetermined entity relation item used to expand the set with actually recognized entity relations that do not belong to the known entity relations.
6. The method of claim 5, wherein the segments that the entity recognition model cannot classify are filtered and identified: the several segments with the highest occurrence frequency in the actual scenario are screened out and judged and, according to the requirements of network security entity recognition, used as undetermined entity type items to expand the entity type set.
7. The method of claim 1, wherein in the two-layer feedforward neural network of the entity recognition model, the activation function of the first hidden layer is the rectified linear unit, and the number of neurons in the second hidden layer equals the number of types in the entity type set.
8. The method of claim 1, wherein adding the subject-object boundary identifier of an entity means marking the start word and the end word of the subject and of the object: corresponding identification symbols are added to the word vectors of the start word and the end word of the subject, and corresponding identification symbols are added to the word vectors of the start word and the end word of the object.
9. An entity-relationship extraction apparatus in the field of network security, comprising a memory and a processor, the memory having stored thereon a computer program, the processor implementing the steps of the method of any of claims 1-8 when the program is executed.
10. A computer readable storage medium storing a computer program which when executed by a processor performs the steps of the method of any one of claims 1-8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210141506.9A CN116662557A (en) 2022-02-16 2022-02-16 Entity relation extraction method and device in network security field

Publications (1)

Publication Number Publication Date
CN116662557A true CN116662557A (en) 2023-08-29

Family

ID=87712285

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination