Disclosure of Invention
During research, the inventor found that entity relations in a network threat intelligence document often appear within N continuous sentences, so entity relation extraction at the network threat intelligence document level can be realized by extracting the entity relations within every N continuous sentences.
In order to achieve the above object, the present application provides the following technical solutions:
a network threat intelligence document entity relation extraction method comprises the following steps:
acquiring a target document and a target entity set; the target document is a network threat intelligence document of an entity relation to be extracted; the target entity set comprises a plurality of target entities, and the target entities are preset text fields in sentences included in the target documents;
carrying out sentence splitting on the target document to obtain a sentence sequence corresponding to the target document;
respectively constructing every N continuous sentences in the sentence sequence into a sentence set corresponding to those N continuous sentences; N is a positive integer;
determining each target entity corresponding to each sentence set based on the target entity set;
for each sentence set, performing pairwise combination on each target entity corresponding to the sentence set to generate at least one entity combination;
and aiming at each entity combination, processing the entity combination and a sentence set corresponding to the entity combination by utilizing a pre-constructed entity relationship extraction model to obtain an entity relationship result corresponding to the entity combination.
Optionally, in the foregoing method, the determining, based on the target entity set, of each target entity corresponding to each sentence set includes:
determining sentences to which each target entity in the target entity set belongs;
for each sentence set, determining each sentence included in the sentence set, and determining each target entity corresponding to each sentence included in the sentence set based on each sentence included in the sentence set and the sentence to which each target entity in the target entity set belongs;
and for each sentence set, determining each target entity corresponding to each sentence in the sentence set as the target entity corresponding to the sentence set.
Optionally, in the foregoing method, the entity relationship extraction model includes an embedded representation processing network, a bidirectional long and short term memory network, a graph convolution neural network, a multilayer perceptron, a splicing network, and an extraction model, and the processing of the entity combination and the sentence set corresponding to the entity combination by using the pre-constructed entity relationship extraction model includes:
embedding representation processing is respectively carried out on the sentence sets corresponding to the entity combinations and each target entity in the entity combinations by utilizing an embedding representation processing network, so that the embedding representation of the sentence sets and the embedding representation of each target entity in the entity combinations are obtained;
respectively inputting the embedded representation of the sentence set and the embedded representation of each target entity in the entity combination into the bidirectional long-short term memory network to obtain the context information of the sentence set and the context information of each target entity in the entity combination;
respectively inputting the context information of the sentence set and the context information of each target entity in the entity combination into the graph convolution neural network to obtain a set representation of the sentence set and an entity representation of each target entity in the entity combination;
splicing the set representation of the sentence set and the entity representation of each target entity in the entity combination by using the splicing network to obtain a splicing result;
inputting the splicing result into the multilayer perceptron to obtain a representation vector corresponding to the splicing result;
and inputting the representation vector corresponding to the splicing result into the extraction model to obtain an entity relationship result corresponding to the entity combination.
Optionally, in the foregoing method, the performing of embedded representation processing on the sentence set corresponding to the entity combination by using the embedded representation processing network to obtain the embedded representation of the sentence set includes:
acquiring word embedded representation, part-of-speech embedded representation and attribute label embedded representation of each word in a sentence set corresponding to the entity combination;
for each word in the sentence set corresponding to the entity combination, combining the word embedded representation, the part-of-speech embedded representation and the attribute label embedded representation of the word to form the embedded representation of the word;
and forming the embedded representation of the sentence set corresponding to the entity combination by using the embedded representation of each word in the sentence set corresponding to the entity combination.
Optionally, the method described above includes a process of constructing the entity relationship extraction model, including:
collecting a sample data set; the sample data set comprises a plurality of groups of sample data, the sample data comprises a network threat intelligence document sample sentence set, a sample entity pair and an entity relationship result, the sample entity pair comprises two sample entities, and the sample entities are preset text fields in the network threat intelligence document sample sentence set in the sample data to which the sample entities belong;
selecting a plurality of groups of sample data from the sample data set to form a training data set, and selecting a plurality of groups of sample data from the sample data set to form a test data set;
training an initial entity relationship extraction model by using the training data set;
and testing the trained initial entity relationship extraction model by using a test data set, and if the error rate of the trained initial entity relationship extraction model is less than a preset threshold value, taking the trained initial entity relationship extraction model as the entity relationship extraction model.
A network threat intelligence document entity relation extraction apparatus includes:
the acquisition unit is used for acquiring a target document and a target entity set; the target document is a network threat intelligence document of an entity relation to be extracted; the target entity set comprises a plurality of target entities, and the target entities are preset text fields in sentences included in the target documents;
the splitting unit is used for splitting sentences of the target document to obtain a sentence sequence corresponding to the target document;
the construction unit is used for respectively constructing every N continuous sentences in the sentence sequence into a sentence set corresponding to those N continuous sentences; N is a positive integer;
a determining unit, configured to determine, based on the target entity set, each target entity corresponding to each sentence set;
the combination unit is used for combining every two target entities corresponding to the sentence sets aiming at each sentence set to generate at least one entity combination;
and the extraction unit is used for processing the entity combination and the sentence set corresponding to the entity combination by utilizing a pre-constructed entity relationship extraction model aiming at each entity combination to obtain an entity relationship result corresponding to the entity combination.
In the above apparatus, optionally, when determining, based on the target entity set, each target entity corresponding to each sentence set, the determining unit is configured for:
determining sentences to which each target entity in the target entity set belongs;
for each sentence set, determining each sentence included in the sentence set, and determining each target entity corresponding to each sentence included in the sentence set based on each sentence included in the sentence set and the sentence to which each target entity in the target entity set belongs;
and for each sentence set, determining each target entity corresponding to each sentence in the sentence set as the target entity corresponding to the sentence set.
Optionally, in the apparatus described above, the entity relationship extraction model includes an embedded representation processing network, a bidirectional long and short term memory network, a graph convolution neural network, a multilayer perceptron, a splicing network, and an extraction model, and the extraction unit includes:
the processing subunit is configured to perform embedded representation processing on the sentence sets corresponding to the entity combinations and each target entity in the entity combinations by using an embedded representation processing network, so as to obtain embedded representations of the sentence sets and embedded representations of each target entity in the entity combinations;
a first input subunit, configured to input the embedded representation of the sentence set and the embedded representation of each target entity in the entity combination into the bidirectional long and short term memory network, respectively, to obtain context information of the sentence set and context information of each target entity in the entity combination;
a second input subunit, configured to input context information of the sentence set and context information of each target entity in the entity combination into the graph convolution neural network, respectively, to obtain a set representation of the sentence set and an entity representation of each target entity in the entity combination;
a splicing subunit, configured to splice, by using the splicing network, the set representation of the sentence set and the entity representation of each target entity in the entity combination to obtain a splicing result;
the third input subunit is configured to input the splicing result into the multilayer perceptron to obtain a representation vector corresponding to the splicing result;
and the fourth input subunit is used for inputting the representation vector corresponding to the splicing result into the extraction model to obtain an entity relationship result corresponding to the entity combination.
Optionally, in the apparatus described above, when performing embedded representation processing on the sentence set corresponding to the entity combination by using the embedded representation processing network to obtain the embedded representation of the sentence set, the processing subunit is configured for:
acquiring word embedded representation, part-of-speech embedded representation and attribute label embedded representation of each word in a sentence set corresponding to the entity combination;
for each word in the sentence set corresponding to the entity combination, combining the word embedded representation, the part-of-speech embedded representation and the attribute label embedded representation of the word to form the embedded representation of the word;
and forming the embedded representation of the sentence set corresponding to the entity combination by using the embedded representation of each word in the sentence set corresponding to the entity combination.
The above apparatus, optionally, further includes:
the collecting unit is used for collecting a sample data set; the sample data set comprises a plurality of groups of sample data, the sample data comprises a network threat intelligence document sample sentence set, a sample entity pair and an entity relationship result, the sample entity pair comprises two sample entities, and the sample entities are preset text fields in the network threat intelligence document sample sentence set in the sample data to which the sample entities belong;
the selecting unit is used for selecting a plurality of groups of sample data from the sample data set to form a training data set and selecting a plurality of groups of sample data from the sample data set to form a test data set;
the training unit is used for training an initial entity relationship extraction model by using the training data set;
and the testing unit is used for testing the trained initial entity relationship extraction model by using the testing data set, and if the error rate of the trained initial entity relationship extraction model is less than a preset threshold value, the trained initial entity relationship extraction model is used as the entity relationship extraction model.
A storage medium comprises a stored program, wherein when the program runs, a device where the storage medium is located is controlled to execute the network threat intelligence document entity relation extraction method.
An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors to perform the above-mentioned network threat intelligence document entity relation extraction method.
Compared with the prior art, the method has the following advantages:
the application provides a method and a device for extracting entity relations of a network threat intelligence document, wherein the method comprises the following steps: acquiring a target document and a target entity set; splitting the target document into sentences to obtain a sentence sequence corresponding to the target document; respectively constructing every N continuous sentences in the sentence sequence into a sentence set; determining each target entity corresponding to each sentence set based on the target entity set; for each sentence set, combining the target entities corresponding to the sentence set pairwise to generate at least one entity combination; and, for each entity combination, processing the entity combination and the sentence set corresponding to the entity combination by using a pre-constructed entity relationship extraction model to obtain an entity relationship result corresponding to the entity combination. In this technical scheme, the target document is split into sentences, every N continuous sentences are constructed into a sentence set, and the entity relationship between any two entities in each sentence set is extracted by using the entity relationship extraction model, so that entity relationship extraction at the network threat intelligence document level is realized, the problem that key relationships cannot be effectively extracted because the threat intelligence text is too long is solved, and the block chain network threat analysis capability is improved.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The application is operational with numerous general purpose or special purpose computing device environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multi-processor apparatus, distributed computing environments that include any of the above devices or equipment, and the like.
The embodiment of the application provides a method for extracting an entity relationship of a network threat intelligence document, which can be applied to various system platforms, wherein an execution main body of the method can run on a computer terminal or a processor of various mobile devices, and a method flow chart of the method is shown in fig. 1, and specifically comprises the following steps:
S101, acquiring a target document and a target entity set.
Optionally, the target document may be obtained by crawling network threat intelligence documents and determining a crawled network threat intelligence document as the target document.
A target entity set is then acquired based on the acquired target document; the target entity set comprises a plurality of target entities, and the target entities are preset text fields in sentences included in the target document.
S102, sentence splitting is carried out on the target document, and a sentence sequence corresponding to the target document is obtained.
Each symbol marking the end of a sentence in the target document is identified, the target document is split into sentences based on the identified symbols, and the split sentences form a sentence sequence in the order in which they appear in the target document.
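Step S102 can be sketched as follows. This is a minimal illustration, not the implementation of the application: the set of sentence-ending symbols and the function name `split_sentences` are assumptions, since the text does not enumerate which symbols are recognized. ASCII end marks are only treated as sentence ends when followed by whitespace, so that IP addresses and file names common in threat intelligence text are not split.

```python
import re

# Assumed sentence-ending symbols (not enumerated in the source text).
# ASCII marks must be followed by whitespace; full-width marks split directly.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_sentences(document: str) -> list[str]:
    """Split a document into a sentence sequence, preserving document order."""
    parts = SENTENCE_END.split(document)
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    # Hypothetical example text.
    text = "GroupX used ToolY. The malware contacted 10.0.0.1. It dropped evil.exe!"
    for sentence in split_sentences(text):
        print(sentence)
```

A production splitter would also need to handle abbreviations and defanged indicators such as `hxxp://`, which this sketch does not attempt.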
S103, constructing every N continuous sentences in the sentence sequence into a sentence set corresponding to every N continuous sentences respectively.
Specifically, every N continuous sentences in the sentence sequence are constructed into a sentence set corresponding to every N continuous sentences, wherein N is a positive integer, optionally N can be 3, that is, every 3 continuous sentences in the sentence sequence are constructed into a sentence set corresponding to every 3 continuous sentences.
An example of constructing every N continuous sentences in the sentence sequence into a sentence set is as follows:
if the sentence sequence is {s1, s2, s3, s4, s5, s6}, where s1 to s6 are sentences, and N is 3, then s1, s2 and s3 are constructed as sentence set S1; s2, s3 and s4 are constructed as sentence set S2; s3, s4 and s5 are constructed as sentence set S3; and s4, s5 and s6 are constructed as sentence set S4.
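The construction in step S103 amounts to a sliding window of width N that advances one sentence at a time. The sketch below reproduces the six-sentence example; the function name `build_sentence_sets` is hypothetical, and because the text does not say what happens when a document has fewer than N sentences, this sketch simply returns no sentence sets in that case.

```python
def build_sentence_sets(sentences: list[str], n: int = 3) -> list[list[str]]:
    """Build one sentence set for every run of n consecutive sentences."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Window i covers sentences i .. i + n - 1; windows overlap by n - 1 sentences.
    return [sentences[i:i + n] for i in range(len(sentences) - n + 1)]


if __name__ == "__main__":
    sequence = ["s1", "s2", "s3", "s4", "s5", "s6"]
    for index, sentence_set in enumerate(build_sentence_sets(sequence, 3), start=1):
        print(f"S{index} = {sentence_set}")
```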
S104, determining each target entity corresponding to each sentence set based on the target entity set.
In the method provided by the embodiment of the present application, the target entity set includes a plurality of target entities, and each target entity is a preset text field in a sentence included in the target document; that is, each target entity belongs to a certain sentence of the target document.
Each sentence set comprises N sentences, and each target entity corresponding to each sentence set is determined based on the target entity set.
Referring to fig. 2, the process of determining each target entity corresponding to each sentence set based on the target entity set includes:
S201, determining the sentence to which each target entity in the target entity set belongs.
And determining the sentence to which each target entity in the target entity set belongs, namely determining each target entity corresponding to each sentence in the target document.
S202, determining each sentence included in the sentence set aiming at each sentence set, and determining each target entity corresponding to each sentence included in the sentence set based on each sentence included in the sentence set and the sentence to which each target entity in the target entity set belongs.
For each sentence set, determining each sentence included in the sentence set, and determining each target entity corresponding to each sentence included in the sentence set based on the sentences included in the sentence set and the sentences to which each target entity in the target entity set belongs, that is, determining each target entity corresponding to each sentence included in the sentence set based on each target entity corresponding to each sentence in the target document.
S203, aiming at each sentence set, determining each target entity corresponding to each sentence in the sentence set as a target entity corresponding to the sentence set.
For each sentence set, each target entity corresponding to each sentence included in the sentence set is determined as the target entity corresponding to the sentence set.
The process of determining each target entity corresponding to each sentence set based on the target entity sets mentioned in the embodiments of the present application is exemplified as follows:
the target entity set comprises target entities a, b, c, d, e, f, g, h, i, j, k and l, wherein the target entities a and b belong to sentence s1 in the target document, the target entities c and d belong to sentence s2, the target entities e and f belong to sentence s3, the target entities g and h belong to sentence s4, the target entities i and j belong to sentence s5, and the target entities k and l belong to sentence s6; for the sentence set S1 = {s1, s2, s3}, the corresponding target entities are a, b, c, d, e and f; for the sentence set S2 = {s2, s3, s4}, the corresponding target entities are c, d, e, f, g and h; for the sentence set S3 = {s3, s4, s5}, the corresponding target entities are e, f, g, h, i and j; for the sentence set S4 = {s4, s5, s6}, the corresponding target entities are g, h, i, j, k and l.
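Steps S201 to S203 can be sketched as a lookup from sentences to entities followed by a merge over each sentence set. The data layout (a dictionary from entity to sentence identifier) and the function name `entities_for_sentence_sets` are assumptions made for illustration; the example assigns entities k and l to sentence s6.

```python
def entities_for_sentence_sets(
    entity_to_sentence: dict[str, str],
    sentence_sets: list[list[str]],
) -> list[list[str]]:
    """Return, for each sentence set, the target entities of its sentences."""
    # S201: record which target entities belong to each sentence.
    sentence_to_entities: dict[str, list[str]] = {}
    for entity, sentence in entity_to_sentence.items():
        sentence_to_entities.setdefault(sentence, []).append(entity)

    # S202 and S203: gather the entities of every sentence in each set.
    result = []
    for sentence_set in sentence_sets:
        entities: list[str] = []
        for sentence in sentence_set:
            entities.extend(sentence_to_entities.get(sentence, []))
        result.append(entities)
    return result


if __name__ == "__main__":
    entity_to_sentence = {
        "a": "s1", "b": "s1", "c": "s2", "d": "s2", "e": "s3", "f": "s3",
        "g": "s4", "h": "s4", "i": "s5", "j": "s5", "k": "s6", "l": "s6",
    }
    sentence_sets = [
        ["s1", "s2", "s3"], ["s2", "s3", "s4"],
        ["s3", "s4", "s5"], ["s4", "s5", "s6"],
    ]
    for entities in entities_for_sentence_sets(entity_to_sentence, sentence_sets):
        print(entities)
```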
S105, for each sentence set, combining the target entities corresponding to the sentence set pairwise to generate at least one entity combination.
The target entities corresponding to each sentence set are combined pairwise to generate at least one entity combination corresponding to that sentence set. For example, the target entities corresponding to the sentence set S4 are g, h, i, j, k, and l, and the entity combinations obtained by combining them pairwise are (g, h), (g, i), (g, j), (g, k), (g, l), (h, i), (h, j), (h, k), (h, l), (i, j), (i, k), (i, l), (j, k), (j, l), (k, l), 15 combinations in total.
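The pairwise combination of step S105 corresponds to enumerating all unordered pairs of the entities of a sentence set. A minimal sketch using the standard library follows; the function name `entity_combinations` is hypothetical, and treating pairs as unordered is read off the example, which lists (g, h) but not (h, g).

```python
from itertools import combinations


def entity_combinations(entities: list[str]) -> list[tuple[str, str]]:
    """Return every unordered pair of target entities of one sentence set."""
    return list(combinations(entities, 2))


if __name__ == "__main__":
    pairs = entity_combinations(["g", "h", "i", "j", "k", "l"])
    print(len(pairs))
    print(pairs)
```

For a sentence set with m target entities this yields m(m-1)/2 combinations, so the number of model invocations grows quadratically with the number of entities per sentence set.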
S106, for each entity combination, processing the entity combination and the sentence set corresponding to the entity combination by using a pre-constructed entity relationship extraction model to obtain an entity relationship result corresponding to the entity combination.
In the method provided by the embodiment of the present application, an entity relationship extraction model is pre-constructed; referring to fig. 3, the construction process of the entity relationship extraction model specifically includes the following steps:
S301, collecting a sample data set.
A network threat intelligence document sample is obtained, and a sample data set is obtained based on the network threat intelligence document sample. The sample data set comprises a plurality of groups of sample data, and each group of sample data comprises a network threat intelligence document sample sentence set, a sample entity pair, and the entity relationship result corresponding to the sample entity pair. The network threat intelligence document sample sentence set is a set obtained by combining sentences included in the network threat intelligence document sample.
The sentences included in the network threat intelligence document sample are combined as follows: the network threat intelligence document sample is split into sentences to obtain a sample sentence sequence, and every N continuous sentences in the sample sentence sequence are constructed into a network threat intelligence document sample sentence set.
The sample entity pair is obtained by combining pairwise the sample entities corresponding to the network threat intelligence document sample sentence set. The sample entity pair comprises two sample entities, and the sample entities are preset text fields in the network threat intelligence document sample sentence set in the sample data to which the sample entities belong.
S302, selecting a plurality of groups of sample data from the sample data set to form a training data set, and selecting a plurality of groups of sample data from the sample data set as a test data set.
A plurality of groups of sample data are selected from the sample data set to form a training data set, and a plurality of groups of sample data are selected from the sample data set to form a test data set, wherein the sum of the number of sample data in the training data set and the number of sample data in the test data set equals the number of sample data in the sample data set.
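One possible way to carry out step S302 is sketched below, assuming that the training data set and the test data set partition the sample data set. The split ratio, the random shuffling, the seed and the function name `split_samples` are all assumptions; the text does not specify how the groups are selected.

```python
import random


def split_samples(samples: list, train_ratio: float = 0.8, seed: int = 0):
    """Partition the sample data set into a training set and a test set."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    shuffled = list(samples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, test = split_samples(list(range(10)))
    print(len(train), len(test))
```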
S303, training the initial entity relationship extraction model by using the training data set.
An initial entity relationship extraction model is constructed in advance; the initial entity relationship extraction model comprises an initial embedded representation processing network, an initial bidirectional long-short term memory network, an initial graph convolutional neural network, an initial multilayer perceptron, an initial splicing network and an initial extraction model.
And training the initial entity relationship extraction model by using each group of sample data in the training data set.
S304, testing the trained initial entity relationship extraction model by using the test data set, and if the error rate of the trained initial entity relationship extraction model is smaller than a preset threshold value, taking the trained initial entity relationship extraction model as the entity relationship extraction model.
The trained initial entity relationship extraction model is tested with each group of test data in the test data set to obtain a test result, and the error rate of the trained initial entity relationship extraction model is calculated based on the test result. If the error rate is less than the preset threshold value, the trained initial entity relationship extraction model is determined as the entity relationship extraction model; that is, the trained initial embedded representation processing network is determined as the embedded representation processing network, the trained initial bidirectional long-short term memory network is determined as the bidirectional long-short term memory network, the trained initial graph convolutional neural network is determined as the graph convolutional neural network, the trained initial multilayer perceptron is determined as the multilayer perceptron, the trained initial splicing network is determined as the splicing network, and the trained initial extraction model is determined as the extraction model.
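The acceptance check of step S304, in which the trained model is adopted when its error rate is less than the preset threshold value, can be sketched as follows. The text does not define how the error rate is computed; this sketch assumes it is the fraction of test samples whose predicted entity relationship result differs from the labeled one. The function names and the relation labels in the example are hypothetical.

```python
def error_rate(predicted: list[str], expected: list[str]) -> float:
    """Fraction of test samples whose predicted relation is wrong (assumed definition)."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and equally long")
    wrong = sum(1 for p, e in zip(predicted, expected) if p != e)
    return wrong / len(expected)


def accept_model(predicted: list[str], expected: list[str], threshold: float) -> bool:
    """Accept the trained model only if its error rate is below the threshold."""
    return error_rate(predicted, expected) < threshold


if __name__ == "__main__":
    # Hypothetical relation labels for four test entity pairs.
    predicted = ["uses", "none", "targets", "uses"]
    expected = ["uses", "targets", "targets", "uses"]
    print(error_rate(predicted, expected))
    print(accept_model(predicted, expected, threshold=0.3))
```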
In the method provided by the embodiment of the application, for each entity combination, a pre-constructed entity relationship extraction model is used to process the entity combination and a sentence set corresponding to the entity combination, so as to obtain an entity relationship result corresponding to the entity combination. Namely, an entity relationship extraction model is utilized to extract the entity relationship between each entity combination in the sentence set, so as to obtain an entity relationship result.
In the method provided in the embodiment of the present application, the entity relationship extraction model includes an embedded representation processing network, a bidirectional long and short term memory network, a graph convolution neural network, a multilayer perceptron, a splicing network, and an extraction model. Referring to fig. 4, the process of processing the entity combination and the sentence set corresponding to the entity combination by using the pre-constructed entity relationship extraction model to obtain the entity relationship result corresponding to the entity combination specifically includes:
S401, embedding representation processing is respectively carried out on the sentence set corresponding to the entity combination and each target entity in the entity combination by utilizing the embedded representation processing network, and the embedded representation of the sentence set and the embedded representation of each target entity in the entity combination are obtained.
Embedded representation processing is carried out on the sentence set corresponding to the entity combination by using the embedded representation processing network to obtain the embedded representation of the sentence set; that is, the sentence set is input into the embedded representation processing network and processed by it, and the embedded representation of the sentence set output by the network is obtained.
Embedded representation processing is carried out on the entity combination by using the embedded representation processing network to obtain the embedded representation of each target entity in the entity combination; that is, each target entity in the entity combination is respectively input into the embedded representation processing network and processed by it to obtain the embedded representation of the target entity.
In the method provided in the embodiment of the present application, a process of performing embedded representation processing on a sentence set corresponding to the entity combination by using an embedded representation processing network to obtain an embedded representation of the sentence set specifically includes:
acquiring word embedded representation, part-of-speech embedded representation and attribute label embedded representation of each word in a sentence set corresponding to the entity combination;
for each word in the sentence set corresponding to the entity combination, combining the word embedded representation, the part-of-speech embedded representation and the attribute label embedded representation of the word to form the embedded representation of the word;
and forming the embedded representation of the sentence set corresponding to the entity combination by using the embedded representation of each word in the sentence set corresponding to the entity combination.
In the method provided by the embodiment of the application, the word embedded representation, the part-of-speech embedded representation and the attribute label embedded representation of each word included in the sentence set corresponding to the entity combination are obtained; for each word, these three representations are combined to form the embedded representation of the word; and the embedded representations of all words in the sentence set form the embedded representation of the sentence set. For example, the embedded representation of the i-th word in the sentence set is $t_i = G_i + P_i + E_i$, where $G_i$ denotes the word embedded representation of the i-th word, $P_i$ denotes the part-of-speech embedded representation of the i-th word, and $E_i$ denotes the attribute label embedded representation of the i-th word, so the embedded representation of the sentence set is $\{t_1, t_2, t_3, t_4, t_5, \dots, t_n\}$.
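The formula t_i = G_i + P_i + E_i can be illustrated with a small sketch that adds the three vectors of a word element by element. It assumes that the three embedded representations have the same dimension, which the element-wise sum requires; the vector values and function names are hypothetical, and in practice the vectors would come from trained embedding tables.

```python
Vector = list[float]


def word_embedding(g: Vector, p: Vector, e: Vector) -> Vector:
    """Compute t_i = G_i + P_i + E_i as an element-wise sum."""
    if not (len(g) == len(p) == len(e)):
        raise ValueError("all three embeddings must have the same dimension")
    return [gi + pi + ei for gi, pi, ei in zip(g, p, e)]


def sentence_set_embedding(words: list[tuple[Vector, Vector, Vector]]) -> list[Vector]:
    """Form {t_1, ..., t_n} from the (G_i, P_i, E_i) triple of every word."""
    return [word_embedding(g, p, e) for g, p, e in words]


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings for two words.
    words = [
        ([1.0, 2.0], [0.5, 0.5], [0.25, 0.25]),
        ([0.0, 1.0], [1.0, 0.0], [0.5, 0.5]),
    ]
    print(sentence_set_embedding(words))
```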
S402, respectively inputting the embedded representation of the sentence set and the embedded representation of each target entity in the entity combination into the bidirectional long-short term memory network to obtain the context information of the sentence set and the context information of each target entity in the entity combination.
The embedded representation of the sentence set is input into the bidirectional long-short term memory network, which processes it and outputs the context information of the sentence set. It should be noted that the context information of the sentence set is composed of the context information of each word included in the sentence set, and the context information of a word is composed of the hidden layer state representation of the forward long-short term memory network and the hidden layer state representation of the backward long-short term memory network.
For example, the context information of the sentence set is obtained as $S_L = \{l_1, l_2, l_3, \ldots, l_n\}$, where $l_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$ represents the context information of the i-th word in the sentence set, $\overrightarrow{h_i}$ represents the hidden layer state representation of the forward long-short term memory network, and $\overleftarrow{h_i}$ represents the hidden layer state representation of the backward long-short term memory network.
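The bidirectional pass can be sketched as follows. This is an illustrative NumPy implementation of a standard LSTM cell run in both directions, not the network of the application: the gate layout, the hidden size, the random weights and the helper names `init_lstm`, `lstm_pass` and `bilstm_context` are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(input_dim, hidden_dim, rng):
    """Random parameters for one LSTM direction (gates stacked as i, f, o, g)."""
    return {
        "W": rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim)),
        "U": rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim)),
        "b": np.zeros(4 * hidden_dim),
    }

def lstm_pass(T, params, reverse=False):
    """Run one direction over the rows of T; return one hidden state per word."""
    hidden_dim = params["U"].shape[1]
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    steps = range(len(T) - 1, -1, -1) if reverse else range(len(T))
    out = np.zeros((len(T), hidden_dim))
    for pos in steps:
        z = params["W"] @ T[pos] + params["U"] @ h + params["b"]
        i_gate, f_gate, o_gate, g = np.split(z, 4)
        c = sigmoid(f_gate) * c + sigmoid(i_gate) * np.tanh(g)
        h = sigmoid(o_gate) * np.tanh(c)
        out[pos] = h
    return out

def bilstm_context(T, fwd, bwd):
    """Row i is l_i: forward hidden state concatenated with backward hidden state."""
    return np.concatenate(
        [lstm_pass(T, fwd), lstm_pass(T, bwd, reverse=True)], axis=1)

rng = np.random.default_rng(0)
T = rng.normal(size=(5, 8))              # 5 words, 8-dimensional embedded representations
fwd, bwd = init_lstm(8, 6, rng), init_lstm(8, 6, rng)
S_L = bilstm_context(T, fwd, bwd)        # 5 rows, 6 forward + 6 backward columns
```

The forward half of each row only sees the words up to that position and the backward half only sees the words from that position onward, which is why the two halves are concatenated to give each word its full context.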
Likewise, the embedded representation of each target entity in the entity combination is input into the bidirectional long-short term memory network, which processes it and outputs the context information of each target entity.
S403, respectively inputting the context information of the sentence set and the context information of each target entity in the entity combination into the graph convolution neural network to obtain a set representation of the sentence set and an entity representation of each target entity in the entity combination.
The context information of the sentence set is input into the graph convolution neural network to obtain the set representation of the sentence set output by the graph convolution neural network. It should be noted that the set representation of the sentence set includes syntactic dependency information. Optionally, the graph convolution neural network provided in the embodiment of the present application includes two network layers and a pooling layer: the context information of the sentence set is processed by the two network layers, and the processed result is pooled to reduce the output dimension, thereby obtaining the set representation of the sentence set.
The hidden layer state of the graph convolution neural network is defined in terms of the following quantities: $v$ represents the target node; $N(v)$ represents the set of neighbor nodes of node $v$, including node $v$ itself; $h_v^{l}$ represents the representation of node $v$ at layer $l$; and $W$ and $b$ represent learned weights.
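For orientation, one commonly used graph convolution update that is consistent with these definitions is given below. It is offered only as an assumed form: the activation function $\sigma$ (for example ReLU), the layer superscripts on $W$ and $b$, and the absence of a normalisation term are assumptions and are not taken from the application.

```latex
h_v^{l+1} = \sigma\Big( \sum_{u \in N(v)} W^{l}\, h_u^{l} + b^{l} \Big)
```

Here $h_u^{l}$ is the layer-$l$ representation of a neighbor node $u \in N(v)$, so each node aggregates the transformed representations of itself and its neighbors.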
The context information of each target entity in the entity combination is input into the graph convolution neural network to obtain the entity representation of each target entity in the entity combination. The processing of the context information of a target entity by the graph convolution neural network is similar to the processing of the context information of the sentence set described above, and is not repeated here.
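A two-layer graph convolution followed by pooling can be sketched as below. The sketch assumes an unnormalised sum over neighbors with self-loops, a ReLU activation, undirected dependency edges and max pooling; the edge list, the layer sizes and the function names are hypothetical.

```python
import numpy as np

def gcn_layer(H, A, W, b):
    """One layer: every node sums the transformed states of the nodes in N(v)."""
    return np.maximum(0.0, A @ H @ W + b)    # ReLU is an assumed activation

def gcn_encode(H, edges, params):
    """Two graph convolution layers followed by max pooling over the nodes."""
    n = H.shape[0]
    A = np.eye(n)                            # self-loops: v itself belongs to N(v)
    for u, v in edges:                       # dependency arcs treated as undirected (assumed)
        A[u, v] = A[v, u] = 1.0
    for W, b in params:
        H = gcn_layer(H, A, W, b)
    return H.max(axis=0)                     # pooling reduces n rows to one vector

rng = np.random.default_rng(0)
context = rng.normal(size=(4, 12))           # 4 words, 12-dimensional context information
edges = [(1, 0), (1, 3), (3, 2)]             # hypothetical dependency arcs
params = [(rng.normal(scale=0.1, size=(12, 10)), np.zeros(10)),
          (rng.normal(scale=0.1, size=(10, 10)), np.zeros(10))]
G_set = gcn_encode(context, edges, params)   # fixed-size set representation
```

Because the pooling step collapses the node dimension, the output has the same size regardless of how many words the sentence set contains; the same function could be applied to the context information of a target entity to obtain its entity representation.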
S404, splicing the set representation of the sentence set and the entity representation of each target entity in the entity combination by using a splicing network to obtain a splicing result.
The set representation of the sentence set and the entity representation of each target entity in the entity combination are spliced, that is, concatenated, to obtain the splicing result.
S405, inputting the splicing result into the multi-layer perceptron to obtain a representation vector corresponding to the splicing result.
The splicing result is input into the multi-layer perceptron to obtain the representation vector corresponding to the splicing result. For example, $h_{set} = \mathrm{MLP}(G_{set}; E_{head}; E_{tail})$, where $h_{set}$ represents the representation vector corresponding to the splicing result, $G_{set}$ represents the set representation of the sentence set, $E_{head}$ represents the entity representation of one target entity in the entity combination, and $E_{tail}$ represents the entity representation of the other target entity in the entity combination.
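Steps S404 and S405 can be sketched together as a concatenation followed by a small feed-forward network. The number of layers, the layer widths, the ReLU activation and the random weights are assumptions chosen for illustration.

```python
import numpy as np

def mlp(x, layers):
    """Feed-forward network with ReLU after every layer except the last (assumed)."""
    for k, (W, b) in enumerate(layers):
        x = W @ x + b
        if k < len(layers) - 1:
            x = np.maximum(0.0, x)
    return x

rng = np.random.default_rng(0)
G_set, E_head, E_tail = (rng.normal(size=10) for _ in range(3))

# S404: splice the set representation with the two entity representations.
spliced = np.concatenate([G_set, E_head, E_tail])

# S405: map the splicing result to the representation vector h_set.
layers = [(rng.normal(scale=0.1, size=(16, 30)), np.zeros(16)),
          (rng.normal(scale=0.1, size=(8, 16)), np.zeros(8))]
h_set = mlp(spliced, layers)
```

The order of concatenation fixes which entity is treated as the head and which as the tail, so swapping `E_head` and `E_tail` generally yields a different `h_set`.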
S406, inputting the representation vector corresponding to the splicing result into the extraction model to obtain an entity relationship result corresponding to the entity combination.
The representation vector corresponding to the splicing result is input into the extraction model, and after processing by the extraction model, the entity relationship result of the entity combination output by the extraction model is obtained. If no entity relationship exists between the target entities in the entity combination, the extraction model outputs an entity relationship result indicating that no entity relationship exists between them; if an entity relationship exists between the target entities in the entity combination, the extraction model directly outputs the entity relationship between the target entities.
The extraction model processes the representation vector corresponding to the splicing result as follows: the representation vector is processed by a linear layer of the extraction model, and the result is then processed by the normalized exponential function softmax to obtain the probability that the entity combination has each preset entity relationship; the preset entity relationship with the maximum probability is output as the entity relationship result of the entity combination. The linear layer is formulated as $O_{set} = W h_{set} + b$, where $W$ is the weight matrix and $b$ is the bias.
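The linear layer, softmax and maximum-probability selection can be sketched as follows. The relation names in `RELATIONS` and the hand-picked weights are hypothetical; a "no relation" label is included to mirror the case where no entity relationship exists.

```python
import numpy as np

RELATIONS = ["no_relation", "uses", "targets"]   # hypothetical preset entity relationships

def softmax(o):
    """Normalized exponential function, shifted by the maximum for numerical stability."""
    e = np.exp(o - o.max())
    return e / e.sum()

def classify(h_set, W, b):
    """O_set = W h_set + b, then softmax, then the most probable relation."""
    o_set = W @ h_set + b
    probs = softmax(o_set)
    return RELATIONS[int(np.argmax(probs))], probs

W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])                     # one row per preset relation
b = np.zeros(3)
label, probs = classify(np.array([0.2, 1.5]), W, b)
print(label)  # uses
```

With these weights the scores are 0.2, 1.5 and -1.7, so the second relation receives the largest probability and is returned.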
In the method provided by the embodiment of the application, a representation containing syntactic dependency information is extracted by using the graph convolution neural network, so that the features of the sentence set and the features of the target entity are accurately extracted.
In the method for extracting the entity relationship of the network threat intelligence document provided by the embodiment of the application, a target document and a target entity set are acquired; sentence splitting is performed on the target document to obtain a sentence sequence corresponding to the target document; every N consecutive sentences in the sentence sequence are constructed into a sentence set; the target entities corresponding to each sentence set are determined based on the target entity set; for each sentence set, the target entities corresponding to the sentence set are combined in pairs to generate at least one entity combination; and for each entity combination, the entity combination and the sentence set corresponding to the entity combination are processed by using a pre-constructed entity relationship extraction model to obtain an entity relationship result corresponding to the entity combination. By applying this method, the target document is split into sentences, every N consecutive sentences are constructed into a sentence set, and the entity relationship between any two entities in each sentence set is extracted by using the entity relationship extraction model, so that entity relationship extraction at the level of the network threat intelligence document is realized. In addition, a representation containing syntactic dependency information is extracted by using the graph convolution neural network, so that the features of the sentence set and the features of the target entity are accurately extracted.
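The steps that precede the model (sentence splitting, sentence set construction, entity assignment and pairwise combination) can be sketched as follows. Several points are assumptions: the regular-expression splitter is a naive stand-in, the sentence sets are taken as sliding windows with a stride of one sentence, and an entity is assigned to a sentence by simple substring matching. The example document and entity names are invented.

```python
import re
from itertools import combinations

def split_sentences(document):
    """Naive splitter on ., ! or ? followed by whitespace (assumed stand-in)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

def build_sentence_sets(sentences, n):
    """Sentence indices of every run of N consecutive sentences (stride 1, assumed)."""
    if len(sentences) <= n:
        return [list(range(len(sentences)))]
    return [list(range(i, i + n)) for i in range(len(sentences) - n + 1)]

def entities_per_set(sentence_sets, sentences, target_entities):
    """An entity corresponds to a set if it occurs in one of the set's sentences."""
    owner = {e: {i for i, s in enumerate(sentences) if e in s}
             for e in target_entities}
    return [[e for e in target_entities if owner[e] & set(idx)]
            for idx in sentence_sets]

def entity_combinations(entities):
    """All unordered pairs of the entities of one sentence set."""
    return list(combinations(entities, 2))

doc = ("APT28 used Mimikatz. The tool dumped credentials. "
       "It contacted evil.example.com later.")
targets = ["APT28", "Mimikatz", "evil.example.com"]

sentences = split_sentences(doc)
sets_n2 = build_sentence_sets(sentences, 2)
pairs_n2 = [entity_combinations(e)
            for e in entities_per_set(sets_n2, sentences, targets)]
sets_n3 = build_sentence_sets(sentences, 3)
pairs_n3 = [entity_combinations(e)
            for e in entities_per_set(sets_n3, sentences, targets)]
```

With N = 2 only the pair (APT28, Mimikatz) is produced, because the domain name appears two sentences later; with N = 3 all three pairs fall into a single sentence set. A sentence set containing fewer than two target entities yields no combination in this sketch.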
Corresponding to the method described in fig. 1, an embodiment of the present application further provides a network threat intelligence document entity relation extraction apparatus for specifically implementing the method in fig. 1. A schematic structural diagram of the apparatus is shown in fig. 5, and the apparatus specifically includes:
an obtaining unit 501, configured to obtain a target document and a target entity set; the target document is a network threat intelligence document of an entity relation to be extracted; the target entity set comprises a plurality of target entities, and the target entities are preset text fields in sentences included in the target documents;
a splitting unit 502, configured to split a sentence of the target document to obtain a sentence sequence corresponding to the target document;
a constructing unit 503, configured to construct every N consecutive sentences in the sentence sequence into a sentence set corresponding to those N consecutive sentences; N is a positive integer;
a determining unit 504, configured to determine, based on the target entity sets, respective target entities corresponding to each sentence set;
a combining unit 505, configured to combine, for each sentence set, every two target entities corresponding to the sentence set to generate at least one entity combination;
an extracting unit 506, configured to, for each entity combination, process the entity combination and a sentence set corresponding to the entity combination by using a pre-constructed entity relationship extraction model, so as to obtain an entity relationship result corresponding to the entity combination.
The network threat intelligence document entity relation extraction apparatus provided by the embodiment of the application splits the target document into sentences, constructs every N consecutive sentences into a sentence set, and extracts the entity relationship between any two entities in each sentence set by using the entity relationship extraction model, so that entity relationship extraction at the level of the network threat intelligence document is realized. In addition, a representation containing syntactic dependency information is extracted by using the graph convolution neural network, so that the features of the sentence set and the features of the target entity are accurately extracted.
In an embodiment of the application, based on the foregoing scheme, when determining the target entities corresponding to each sentence set based on the target entity set, the determining unit 504 is configured to perform:
determining sentences to which each target entity in the target entity set belongs;
for each sentence set, determining each sentence included in the sentence set, and determining each target entity corresponding to each sentence included in the sentence set based on each sentence included in the sentence set and the sentence to which each target entity in the target entity set belongs;
and for each sentence set, determining each target entity corresponding to each sentence in the sentence set as the target entity corresponding to the sentence set.
In an embodiment of the present application, based on the foregoing solution, the entity relationship extraction model includes an embedded representation processing network, a bidirectional long-short term memory network, a graph convolution neural network, a multi-layer perceptron, a splicing network, and an extraction model, and the extracting unit 506 includes:
a processing subunit, configured to perform embedded representation processing on the sentence set corresponding to the entity combination and on each target entity in the entity combination by using the embedded representation processing network, so as to obtain the embedded representation of the sentence set and the embedded representation of each target entity in the entity combination;
a first input subunit, configured to input the embedded representation of the sentence set and the embedded representation of each target entity in the entity combination into the bidirectional long and short term memory network, respectively, to obtain context information of the sentence set and context information of each target entity in the entity combination;
a second input subunit, configured to input context information of the sentence set and context information of each target entity in the entity combination into the graph convolution neural network, respectively, to obtain a set representation of the sentence set and an entity representation of each target entity in the entity combination;
a splicing subunit, configured to splice, by using the splicing network, the set representation of the sentence set and the entity representation of each target entity in the entity combination to obtain a splicing result;
a third input subunit, configured to input the splicing result into the multi-layer perceptron to obtain a representation vector corresponding to the splicing result;
and a fourth input subunit, configured to input the representation vector corresponding to the splicing result into the extraction model to obtain an entity relationship result corresponding to the entity combination.
In an embodiment of the application, based on the foregoing solution, when performing embedded representation processing on the sentence set corresponding to the entity combination by using the embedded representation processing network to obtain the embedded representation of the sentence set, the processing subunit is configured to perform:
acquiring word embedded representation, part-of-speech embedded representation and attribute label embedded representation of each word in a sentence set corresponding to the entity combination;
for each word in the sentence set corresponding to the entity combination, combining the word embedded representation, the part-of-speech embedded representation and the attribute label embedded representation of the word to form the embedded representation of the word;
and forming the embedded representation of the sentence set corresponding to the entity combination by using the embedded representation of each word in the sentence set corresponding to the entity combination.
In an embodiment of the present application, based on the foregoing scheme, the apparatus may further include:
an acquisition unit, configured to acquire a sample data set; the sample data set includes a plurality of groups of sample data, each group of sample data includes a network threat intelligence document sample sentence set, a sample entity pair and an entity relationship result, the sample entity pair includes two sample entities, and each sample entity is a preset text field in the network threat intelligence document sample sentence set of the sample data to which the sample entity belongs;
a selecting unit, configured to select a plurality of groups of sample data from the sample data set to form a training data set, and to select a plurality of groups of sample data from the sample data set to form a test data set;
a training unit, configured to train an initial entity relationship extraction model by using the training data set;
and a testing unit, configured to test the trained initial entity relationship extraction model by using the test data set, and, if the error rate of the trained initial entity relationship extraction model is less than a preset threshold, to use the trained initial entity relationship extraction model as the entity relationship extraction model.
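The selection and testing logic of these units can be sketched as below. The sketch assumes that the training and test data sets are disjoint and drawn by a random shuffle, which the application does not state; the training step itself is omitted, and `predict` is a hypothetical stand-in for a trained model.

```python
import random

def split_samples(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and split them into disjoint training and test sets (assumed)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = max(1, int(len(shuffled) * test_fraction))
    return shuffled[cut:], shuffled[:cut]

def error_rate(predict, test_set):
    """Fraction of test samples whose predicted relation differs from the label."""
    wrong = sum(1 for sentence_set, pair, relation in test_set
                if predict(sentence_set, pair) != relation)
    return wrong / len(test_set)

def accept_model(predict, test_set, threshold):
    """Accept the trained model only if its error rate is below the preset threshold."""
    return error_rate(predict, test_set) < threshold

# Each sample: (sample sentence set, sample entity pair, entity relationship result).
samples = [((f"sentence {i}",), ("A", "B"), "uses" if i % 2 == 0 else "no_relation")
           for i in range(10)]
train_set, test_set = split_samples(samples)

truth = {s[0]: s[2] for s in samples}
perfect = lambda sentence_set, pair: truth[sentence_set]   # toy model that is always right
broken = lambda sentence_set, pair: "unknown"              # toy model that is always wrong
```

Here `accept_model(perfect, test_set, 0.1)` accepts the model, while `accept_model(broken, test_set, 0.1)` rejects it, in which case training would have to continue or be repeated.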
The embodiment of the application also provides a storage medium, which comprises a stored instruction, wherein when the instruction runs, the equipment where the storage medium is located is controlled to execute the method for extracting the entity relationship of the network threat intelligence document.
The present embodiment further provides an electronic device, whose schematic structural diagram is shown in fig. 6, specifically including a memory 601, and one or more instructions 602, where the one or more instructions 602 are stored in the memory 601 and configured to be executed by one or more processors 603 to perform the following operations according to the one or more instructions 602:
acquiring a target document and a target entity set; the target document is a network threat intelligence document of an entity relation to be extracted; the target entity set comprises a plurality of target entities, and the target entities are preset text fields in sentences included in the target documents;
carrying out sentence splitting on the target document to obtain a sentence sequence corresponding to the target document;
respectively constructing every N continuous sentences in the sentence sequence into a sentence set corresponding to every N continuous sentences; N is a positive integer;
determining each target entity corresponding to each sentence set based on the target entity sets;
for each sentence set, performing pairwise combination on each target entity corresponding to the sentence set to generate at least one entity combination;
and aiming at each entity combination, processing the entity combination and a sentence set corresponding to the entity combination by utilizing a pre-constructed entity relationship extraction model to obtain an entity relationship result corresponding to the entity combination.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to each other. Since the apparatus embodiments are basically similar to the method embodiments, they are described briefly, and for the relevant points, reference may be made to the corresponding description of the method embodiments.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
For convenience of description, the above apparatus is described as being divided into various units by function, and the units are described separately. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The method and the apparatus for extracting the entity relationship of the network threat intelligence document provided by the application have been described in detail above. Specific examples are used herein to explain the principle and the implementation of the application, and the description of the embodiments is only intended to help in understanding the method and the core idea of the application. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of the present specification should not be construed as a limitation of the present application.