CN110597998A

CN110597998A - Military scenario entity relationship extraction method and device combined with syntactic analysis

Info

Publication number: CN110597998A
Application number: CN201910653287.0A
Authority: CN
Inventors: 杨若鹏; 卢稳新; 鲁义威; 刘乾; 蒋序平; 张建军; 温鸿鹏
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2019-07-19
Filing date: 2019-07-19
Publication date: 2019-12-20

Abstract

The invention discloses a military scenario entity relationship extraction method and device combined with syntactic analysis. The method comprises the following steps: 1. predefining the target relationship types of the military scenario entity relationship extraction task; 2. constructing a training data set and a test data set for the entity relationship extraction model; 3. parsing the corpus entries one by one and filtering out sentence components that do not contribute to entity relationship extraction; 4. converting the sentence components retained after syntactic parsing into vectorized word embeddings by using a pre-trained word embedding matrix; 5. training the entity relationship extraction model with the vectorized training data; 6. extracting entity relationships from the military scenario text to be processed. By combining syntactic analysis, the method can effectively improve both the computational efficiency and the accuracy of entity relationship extraction.

Description

Military scenario entity relationship extraction method and device combined with syntactic analysis

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to an entity relationship extraction method and device for military scenarios.

Background

A military scenario is divided into a basic scenario and supplementary scenarios. It is an exercise document that sets out assumptions about the intentions, situation, and course of operations of both opposing sides according to the training topic, and it is the basic document for organizing and guiding military exercises and drills. Military scenario entity relationships are basic information elements of military scenario data and the basis for extracting, processing, and analyzing that data. Military scenario entity relationship extraction aims to find the entity relationships hidden in unstructured military scenario text and to extract them by suitable means.

At present, entity relationship extraction methods in the open domain mainly include rule-based methods, kernel-function-based methods, and deep-learning-based methods. Rule-based methods depend heavily on expert knowledge and on manual induction from the domain knowledge related to the corpus to be processed, so they are costly, port poorly, and are difficult to apply widely. Kernel-function-based methods extract entity relationships by computing the similarity of syntactic structure trees, so their training and testing are too slow for large-scale data. Deep-learning-based methods use a deep neural network to extract high-level sentence features automatically and offer strong portability and high extraction accuracy, but on closed-domain text such as military scenarios their performance is limited by the lack of large-scale manually annotated corpora.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and to provide a military scenario entity relationship extraction method and device combined with syntactic analysis.

In order to achieve the purpose, the invention adopts the following technical scheme:

A military scenario entity relationship extraction method based on syntactic analysis and a deep neural network comprises the following steps:

S1, corpus construction: predefining the target relationship types for entity relationship extraction, labeling the original military scenario text, and constructing the training data set and test data set for the entity relationship extraction model, specifically comprising:

S1.1, entity relationship predefinition: analyzing the military concepts in authoritative domain dictionaries and, with reference to the principles and methods for defining entity relationship types used by the Semantic Evaluation conference, predefining the entity relationship types to be extracted;

the authoritative domain dictionaries include, but are not limited to, the Chinese Military Encyclopedia, the Military Dictionary, and the Concise Military Dictionary;

S1.2, entity relationship corpus construction: manually labeling the original military scenario text according to the predefined entity relationship types to generate an entity relationship extraction corpus, in which each entry is stored in the form (e₁, e₂, r, s), where e₁ and e₂ denote the head entity and the tail entity respectively, r denotes the semantic relationship between the two entities, and s denotes the sentence describing the semantic relationship r between entities e₁ and e₂;

S1.3, data set division: dividing the corpus obtained in step S1.2 into a training data set and a test data set according to a specific ratio;

the ratio of the training data set to the test data set is 2:1.
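As an illustration of step S1, the following minimal sketch stores each corpus entry in the (e₁, e₂, r, s) form and divides the corpus 2:1. The class name `RelationSample`, the function `split_corpus`, the shuffling before the split, and the example entries are all assumptions made for illustration; the source specifies only the storage form and the ratio.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationSample:
    """One corpus entry in the (e1, e2, r, s) storage form of step S1.2."""
    head: str      # e1, the head entity
    tail: str      # e2, the tail entity
    relation: str  # r, one of the predefined relationship types
    sentence: str  # s, the sentence describing relation r between e1 and e2


def split_corpus(samples, train_parts=2, test_parts=1, seed=0):
    """Shuffle (an assumption) and split the corpus by the given ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical entries; the relation label is illustrative only.
corpus = [
    RelationSample(f"unit_{i}", f"area_{i}", "deployed_in",
                   f"unit_{i} is deployed in area_{i}")
    for i in range(9)
]
train_set, test_set = split_corpus(corpus)
print(len(train_set), len(test_set))
```

A fixed seed keeps the split reproducible between runs, which makes later comparisons between model variants easier.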

S2, syntactic parsing: parsing the sentence s of each entry in the corpus and filtering out the sentence components that do not contribute to entity relationship extraction, specifically comprising:

S2.1, syntax tree generation: parsing the sentence s of each entry in the corpus with an open-source syntactic parsing tool to generate a syntax tree;

the open-source syntactic parsing tool includes, but is not limited to, the Stanford Parser;

S2.2, parse tree pruning: pruning the sentence components in the syntax tree that are irrelevant to the entity relationship triple (e₁, e₂, r) to generate a syntactic parse subtree;

S2.3, subtree recombination: recombining the syntactic parse subtree into a text sequence without changing the original order of the words.
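The pruning and recombination of steps S2.2 and S2.3 can be sketched as follows. This is only one possible reading: it assumes the parse is available as a list of head indices (a dependency-style representation, with -1 marking the root) and keeps the words on the paths from each entity up to the root. The function names, the example sentence, and its head indices are hypothetical, and the source does not fix the exact pruning rule.

```python
def path_to_root(index, heads):
    """Return the token indices on the path from a token up to the root."""
    path = []
    while index != -1:
        path.append(index)
        index = heads[index]
    return path


def prune_and_recombine(tokens, heads, head_entity, tail_entity):
    """Keep only tokens on the entity-to-root paths, in original word order."""
    keep = set(path_to_root(head_entity, heads))
    keep |= set(path_to_root(tail_entity, heads))
    # Sorting the kept indices restores the original word order (step S2.3).
    return [tokens[i] for i in sorted(keep)]


# Hypothetical parse of an English stand-in sentence; heads[i] is the parent of token i.
tokens = ["The", "armored", "brigade", "quickly", "occupied",
          "the", "northern", "highland"]
heads = [2, 2, 4, 4, -1, 7, 7, 4]

pruned = prune_and_recombine(tokens, heads, head_entity=2, tail_entity=7)
print(pruned)  # ['brigade', 'occupied', 'highland']
```

Modifiers such as "armored" and "quickly" fall outside both paths and are dropped, which shortens the sequence the model has to process.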

S3, data vectorization: converting the recombined sequence generated in step S2.3 into a set of word embeddings expressed as distributed vectors, specifically comprising:

S3.1, training text vectorization: using the authoritative domain dictionaries, converting the currently input recombined sequence s_i into one-hot vectors word by word, where s_i denotes the sentence of the i-th input corpus entry;

S3.2, word embedding generation: converting the set of one-hot vectors obtained in step S3.1, word by word, into low-dimensional real-valued word embeddings with an open-source word vector conversion tool;

the open-source word vector conversion tool includes, but is not limited to, word2vec.
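The relation between the one-hot vectors of step S3.1 and the word embeddings of step S3.2 can be shown with a small sketch: multiplying a one-hot vector by an embedding matrix selects one row of that matrix. The vocabulary and the matrix values below are hypothetical placeholders; in practice the matrix would come from a pre-trained tool such as word2vec.

```python
def one_hot(word, vocabulary):
    """Encode a word as a one-hot vector over the vocabulary (step S3.1)."""
    vector = [0.0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1.0
    return vector


def embed(one_hot_vector, embedding_matrix):
    """Multiply a one-hot row vector by a |V| x k matrix (step S3.2)."""
    k = len(embedding_matrix[0])
    return [
        sum(one_hot_vector[v] * embedding_matrix[v][d]
            for v in range(len(one_hot_vector)))
        for d in range(k)
    ]


# Hypothetical vocabulary and embedding matrix with k = 2.
vocabulary = ["brigade", "occupied", "highland"]
embedding_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

sequence = ["brigade", "occupied", "highland"]
embeddings = [embed(one_hot(word, vocabulary), embedding_matrix)
              for word in sequence]
print(embeddings)
```

Real implementations skip the multiplication and index the matrix row directly, since the result is the same.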

S4, model training: training the deep-neural-network-based entity relationship extraction model with the vectorized entity relationship extraction training data set, specifically comprising:

S4.1, semantic feature extraction: selecting a specific neural network as the basic relation extractor and extracting the high-level semantic features of the current sentence from the vector set output in step S3.2, the model adopting a bidirectional neural network to simultaneously extract the context semantic information of the entity pair e₁, e₂ so as to improve the recognition accuracy of the entity relationship; the feature representation of the j-th word of the i-th corpus entry is given by the following formula:

$$h_{ij} = \left[\overrightarrow{h_{ij}} \oplus \overleftarrow{h_{ij}}\right]$$

in the formula, $\oplus$ denotes the combination of the forward-channel output and the backward-channel output, $[\cdot]$ denotes the vector formed by the contents of the brackets, $\overrightarrow{h_{ij}}$ denotes the semantic features of the j-th word of the i-th corpus entry output by the forward channel, and $\overleftarrow{h_{ij}}$ denotes the semantic features of the j-th word of the i-th corpus entry output by the backward channel;

the specific neural network includes, but is not limited to, a long short-term memory (LSTM) network;

the bidirectional neural network includes, but is not limited to, a bidirectional long short-term memory (BLSTM) network.
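The bidirectional combination in step S4.1 can be illustrated with a deliberately simplified sketch. A toy tanh recurrence stands in for the LSTM cell (it is not an LSTM), and the forward and backward outputs are combined by concatenation, which is an assumption; an element-wise sum is another possible reading of the combination. All names and input values are hypothetical.

```python
import math


def run_direction(embeddings, reverse=False):
    """Run a toy recurrence over the sequence in one direction."""
    positions = range(len(embeddings))
    order = reversed(positions) if reverse else positions
    state = [0.0] * len(embeddings[0])
    outputs = [None] * len(embeddings)
    for j in order:
        # Toy update standing in for an LSTM cell: mix input and previous state.
        state = [math.tanh(0.5 * x + 0.5 * h)
                 for x, h in zip(embeddings[j], state)]
        outputs[j] = state
    return outputs


def bidirectional_features(embeddings):
    """Combine forward and backward outputs per word (here: concatenation)."""
    forward = run_direction(embeddings)
    backward = run_direction(embeddings, reverse=True)
    return [f + b for f, b in zip(forward, backward)]


# Hypothetical word embeddings for a three-word sequence, k = 2.
word_vectors = [[0.2, -0.4], [0.6, 0.0], [-0.8, 1.0]]
features = bidirectional_features(word_vectors)
print(len(features), len(features[0]))
```

Each word's feature therefore reflects both the words before it (forward channel) and the words after it (backward channel).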

S4.2, entity relationship prediction: processing the feature vector output in step S4.1 with a classifier and calculating, for the current corpus entry (e₁, e₂, r, s), the estimated probability that the relationship r is the relationship y_n (n ∈ [1, 8]) in the predefined entity relationship type set Y = [y₁, y₂, …, y₈]:

$$\hat{p}(y \mid s_i) = \mathrm{softmax}(W h_i + b)$$

in the formula, $\mathrm{softmax}(\cdot)$ denotes the softmax classifier operation, $W$ denotes the weight matrix of the classifier network, $s_i$ denotes the sentence of the i-th corpus entry, $h_i$ denotes the combination of the feature vectors of all the words of the sentence of the i-th corpus entry, and $b$ denotes the bias of the classifier network;

the relationship type corresponding to the maximum estimated probability is the prediction result for the relationship r in the current corpus entry, denoted $\hat{y}$:

$$\hat{y} = \arg\max_{y_n} \hat{p}(y_n \mid s_i)$$

in the formula, $\arg\max$ denotes taking the argument of the maximum value, $\hat{p}(y_n \mid s_i)$ denotes the conditional probability that the entity relationship type described by the sentence $s_i$ of the i-th corpus entry is $y_n$, $y_n$ denotes the n-th predefined entity relationship type, and $s_i$ denotes the sentence of the i-th corpus entry;

the classifier includes, but is not limited to, a softmax classifier;
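A minimal sketch of the prediction in step S4.2: a linear layer followed by softmax yields one estimated probability per predefined relationship type, and the type with the largest probability is taken as the prediction. The weight values, the two-dimensional sentence feature, and the labels `y1` to `y8` are hypothetical placeholders.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_relation(sentence_feature, weights, bias, relation_types):
    """Return the most probable relation type and the full distribution."""
    scores = [
        sum(w * h for w, h in zip(row, sentence_feature)) + b
        for row, b in zip(weights, bias)
    ]
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return relation_types[best], probabilities


# Hypothetical classifier for 8 relation types over a 2-dimensional feature.
relation_types = [f"y{n}" for n in range(1, 9)]
weights = [[0.1 * n, -0.1 * n] for n in range(8)]
bias = [0.0] * 8

label, probabilities = predict_relation([1.0, 0.0], weights, bias, relation_types)
print(label)  # y8
```

Subtracting the maximum score before exponentiating does not change the result but avoids overflow for large scores.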

S4.3, cost function optimization: taking the negative logarithm of the likelihood of the true label y gives the following cost function of the deep neural network:

$$J(\theta) = -\frac{1}{m}\sum_{n=1}^{m} t_n \log \hat{p}(y_n \mid s_i) + \lambda \lVert \theta \rVert^{2}$$

in the formula, $t_n$ denotes the n-th component of the one-hot vector of the true label, $\hat{p}(y_n \mid s_i)$ denotes the estimated probability of each predefined relationship type output by the softmax classifier, $m$ denotes the number of predefined relationship types (8 here), $\lambda$ denotes the L2 regularization hyperparameter, $\theta$ denotes the parameters of the entity relationship extraction model, and $\lVert \cdot \rVert$ denotes taking the norm; model training is completed by continuously adjusting the model hyperparameters so as to minimize the cost function $J(\theta)$.
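The cost function of step S4.3 can be sketched as a cross-entropy term plus an L2 penalty. The exact form is an assumption reconstructed from the symbol definitions (one-hot label t, estimated probabilities, m relation types, regularization weight λ, parameters θ); in particular the 1/m scaling is assumed, and the function and argument names are hypothetical.

```python
import math


def cost(one_hot_label, probabilities, parameters, l2_lambda):
    """Negative log-likelihood of the true label plus an L2 penalty."""
    m = len(probabilities)
    # Only the component with t_n = 1 contributes; skipping zeros avoids log(0).
    nll = -sum(t * math.log(p)
               for t, p in zip(one_hot_label, probabilities) if t > 0) / m
    penalty = l2_lambda * sum(theta ** 2 for theta in parameters)
    return nll + penalty


# Hypothetical two-class example: true label is the second class.
example_cost = cost(one_hot_label=[0.0, 1.0],
                    probabilities=[0.5, 0.5],
                    parameters=[1.0, 2.0],
                    l2_lambda=0.1)
print(example_cost)
```

A confident correct prediction drives the first term toward zero, while the penalty term discourages large parameter values.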

S5, entity relationship extraction: extracting entity relationships from the military scenario text to be processed by using the trained model, specifically comprising:

S5.1, test text vectorization: vectorizing the original military scenario text to be processed sentence by sentence using the processing of step S3;

S5.2, entity relationship prediction: using the model trained in step S4 to predict, sentence by sentence, the semantic relationships of the vectorized military scenario text output in step S5.1, and storing the results.
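Step S5 amounts to applying the same vectorization as in training and then the trained model, one sentence at a time, while collecting the results. The sketch below shows only that control flow; `toy_vectorize` and `toy_predict` are hypothetical stand-ins, not the vectorizer or model described above.

```python
def extract_relations(sentences, vectorize, predict):
    """Apply steps S5.1 and S5.2 sentence by sentence and collect the results."""
    results = []
    for sentence in sentences:
        vectors = vectorize(sentence)  # S5.1: same processing as step S3
        relation = predict(vectors)    # S5.2: trained model from step S4
        results.append({"sentence": sentence, "relation": relation})
    return results


def toy_vectorize(sentence):
    """Stand-in vectorizer: one number (the word length) per word."""
    return [float(len(word)) for word in sentence.split()]


def toy_predict(vectors):
    """Stand-in model: a threshold on the summed values."""
    return "y1" if sum(vectors) > 10 else "y2"


stored = extract_relations(
    ["brigade occupied highland", "unit moved"], toy_vectorize, toy_predict)
print(stored)
```

Passing the vectorizer and model in as arguments keeps the extraction loop independent of how either one is implemented.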

The military scenario entity relationship extraction method combined with syntactic analysis adopted by the invention has the following advantages:

1. Through in-depth analysis of authoritative domain dictionaries such as the Chinese Military Encyclopedia, the target requirements of military scenario entity relationship extraction are clarified. On this basis, with reference to the principles and methods for defining entity relationship types used by the Semantic Evaluation conference, 8 target relationship types for military scenario entity relationship extraction are predefined, and a training/test corpus for military scenario entity relationship extraction containing 11236 entries is constructed;

2. The language of military scenarios is highly standardized and formulaic. Before relationship extraction is performed, a syntactic parser is used to parse and prune each sentence, filtering out the sentence components that do not contribute to entity relationship extraction, which raises the utilization of useful information and reduces the computational overhead of the model.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a schematic flow chart of an embodiment of the military scenario entity relationship extraction method combined with syntactic analysis of the present invention;

FIG. 2 is a block diagram of the component architecture of the present invention;

FIG. 3 is a diagram of an entity relationship extraction model based on a deep neural network applied in the present invention.

Detailed Description

The following further describes embodiments of the present invention with reference to the drawings. It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, a flow chart of the military scenario entity relationship extraction method combined with syntactic analysis is shown, which specifically includes the following steps:

S1.1, entity relationship predefinition: analyzing the military concepts in authoritative domain dictionaries such as the Chinese Military Encyclopedia, the Military Dictionary, and the Concise Military Dictionary and, with reference to the principles and methods for defining entity relationship types used by the Semantic Evaluation conference, predefining the entity relationship types to be extracted;

S1.2, entity relationship corpus construction: manually labeling the original military scenario text according to the predefined entity relationship types to generate an entity relationship extraction corpus, in which each entry is stored in the form (e₁, e₂, r, s), where e₁ and e₂ denote the head entity and the tail entity respectively; r denotes the semantic relationship between the two entities; and s denotes the sentence describing the semantic relationship r between entities e₁ and e₂;

S2.1, syntax tree generation: parsing the sentence s of each entry in the corpus with an open-source tool such as the Stanford Parser to generate a syntax tree;

S3.2, word embedding generation: converting the set of one-hot vectors obtained in step S3.1, word by word, into low-dimensional real-valued word embeddings with an open-source tool such as word2vec, that is, converting the j-th word x_ij of the i-th sentence into a k-dimensional vector;

S4.1, semantic feature extraction: selecting a long short-term memory (LSTM) network or the like as the basic relation extractor and extracting the high-level semantic features of the current sentence from the vector set output in step S3.2, the model adopting a bidirectional long short-term memory (BLSTM) network or the like to simultaneously extract the context semantic information of the entity pair e₁, e₂ so as to improve the recognition accuracy of the entity relationship. The feature representation of the j-th word of the i-th corpus entry is given by the following formula:

$$h_{ij} = \left[\overrightarrow{h_{ij}} \oplus \overleftarrow{h_{ij}}\right]$$

S4.2, entity relationship prediction: processing the feature vector output in step S4.1 with a classifier such as softmax and calculating, for the current corpus entry (e₁, e₂, r, s), the estimated probability that the relationship r is the relationship y_n (n ∈ [1, 8]) in the predefined entity relationship type set Y = [y₁, y₂, …, y₈]:

$$\hat{p}(y \mid s_i) = \mathrm{softmax}(W h_i + b)$$

S4.3, cost function optimization: taking the negative logarithm of the likelihood of the true label y gives the following cost function of the deep neural network:

$$J(\theta) = -\frac{1}{m}\sum_{n=1}^{m} t_n \log \hat{p}(y_n \mid s_i) + \lambda \lVert \theta \rVert^{2}$$

in the formula, $t_n$ denotes the n-th component of the one-hot vector of the true label, $\hat{p}(y_n \mid s_i)$ denotes the estimated probability of each predefined relationship type output by the softmax classifier in step S4.2, $m$ denotes the number of predefined relationship types (8 here), $\lambda$ denotes the L2 regularization hyperparameter, $\theta$ denotes the parameters of the entity relationship extraction model, and $\lVert \cdot \rVert$ denotes taking the norm; model training is completed by continuously adjusting the model hyperparameters so as to minimize the cost function $J(\theta)$.

S5, entity relationship extraction: extracting entity relationships from the military scenario text to be processed by using the trained model.

Referring to fig. 2, there is shown a composition structure diagram of the present invention, which specifically includes:

the corpus construction module 100 is configured to predefine an entity relationship extraction target relationship type, label a military scenario original text, and construct an entity relationship extraction model training data set and a test data set, and specifically includes:

the entity relationship predefining unit 101 is used for analyzing military concepts in authoritative dictionaries in fields of Chinese military encyclopedias, military dictionaries, concise military dictionaries and the like, predefining entity relationship types to be extracted by referring to principles and methods defined by a Semantic Evaluation conference about entity relationship types;

the entity relationship corpus building unit 102 labels military scenario original texts by a manual method according to predefined entity relationship types to generate an entity relationship extraction corpus;

a data set dividing unit 103, configured to divide a training data set and a test data set, and divide the corpus obtained by the entity-relationship corpus establishing unit 102 into the training data set and the test data set according to a specific ratio;

the syntax parsing module 200 is configured to perform syntax parsing on sentences in each corpus in the corpus, and filter out sentence components that do not contribute to entity relationship extraction, and specifically includes:

a syntax tree generating unit 201, which performs syntax parsing on sentences in each corpus in the corpus by using an open source tool to generate a syntax tree;

a syntax tree pruning unit 202, configured to prune branches and leaves in the syntax tree except for the entity and its root node, and generate a syntax parsing sub-tree;

and the subtree recombination unit 203 recombines the syntax analysis subtree into a text sequence, and does not change the original sequence of words in the recombination process.

The data vectorization module 300 converts the recombination sequences generated by the sub-tree recombination unit 203 into word embedding sets expressed in a distributed vector form, and specifically includes:

the training text vectorization unit 301 segments the recombined sequence of the currently input corpus entry into words to obtain a word set consisting of T words, and converts the words in the set into one-hot vectors based on the authoritative domain dictionaries;

the word embedding generation unit 302 converts the set of one-hot vectors obtained by the training text vectorization unit 301, word by word, into low-dimensional real-valued word embeddings by using an open-source tool.

The model training module 400 trains the deep-neural-network-based entity relationship extraction model with the vectorized entity relationship extraction training data set, and specifically includes:

the semantic feature extraction unit 401 selects a specific neural network as the basic relation extractor and extracts the high-level semantic features of the current sentence from the vector set output by the word embedding generation unit 302; the model adopts a bidirectional neural network to simultaneously extract the context semantic information of the entity pair e₁, e₂ so as to improve the recognition accuracy of the entity relationship;

an entity relationship prediction unit 402 for processing the feature vector output from the semantic feature extraction unit 401 by using a classifier;

the cost function optimization unit 403 obtains a cost function of the deep neural network by calculating the logarithm of the negative likelihood function of the real label y, and completes model training by continuously adjusting the hyper-parameters of the model by minimizing the cost function.

The entity relationship extraction module 500 performs entity relationship extraction on the military scenario text to be processed by using the trained model, and specifically includes:

the test text vectorization unit 501 vectorizes the original military scenario text to be processed sentence by sentence using the processing procedure of the data vectorization module 300;

the entity relationship prediction unit 502 uses the model trained by the model training module 400 to predict, sentence by sentence, the semantic relationships of the vectorized military scenario text output by the test text vectorization unit 501, and stores the results.

Claims

1. A military scenario entity relationship extraction method combined with syntactic analysis is characterized by comprising the following steps:

S1, corpus construction: predefining the target relationship types for entity relationship extraction, labeling the original military scenario text, and constructing the training data set and test data set for the entity relationship extraction model, specifically comprising:

S1.1, entity relationship predefinition: predefining the entity relationship types to be extracted by adopting the principles and methods for defining entity relationship types used by the Semantic Evaluation conference;

S1.2, entity relationship corpus construction: manually labeling the original military scenario text according to the predefined entity relationship types to generate an entity relationship extraction corpus, in which each entry is stored in the form (e₁, e₂, r, s), where e₁ and e₂ denote the head entity and the tail entity respectively, r denotes the semantic relationship between the two entities, and s denotes the sentence describing the semantic relationship r between entities e₁ and e₂;

S1.3, data set division: dividing the corpus obtained in step S1.2 into a training data set and a test data set according to a specific ratio;

S2, syntactic parsing: parsing the sentence s of each entry in the corpus and filtering out the sentence components that do not contribute to entity relationship extraction, specifically comprising:

S2.1, syntax tree generation: parsing the sentence s of each entry in the corpus with an open-source syntactic parsing tool to generate a syntax tree;

S2.2, parse tree pruning: pruning the sentence components in the syntax tree that are irrelevant to the entity relationship triple (e₁, e₂, r) to generate a syntactic parse subtree;

S2.3, subtree recombination: recombining the syntactic parse subtree into a text sequence without changing the original order of the words;

S3.1, training text vectorization: using the authoritative domain dictionaries, converting the currently input recombined sequence s_i into one-hot vectors word by word, where s_i denotes the sentence of the i-th input corpus entry;

S3.2, word embedding generation: converting the set of one-hot vectors obtained in step S3.1, word by word, into low-dimensional real-valued word embeddings with an open-source word vector conversion tool;

S4, model training: training the deep-neural-network-based entity relationship extraction model with the vectorized entity relationship extraction training data set, specifically comprising:

S4.1, semantic feature extraction: selecting a specific neural network as the basic relation extractor and extracting the high-level semantic features of the current sentence from the vector set output in step S3.2, the model adopting a bidirectional neural network to simultaneously extract the context semantic information of the entity pair e₁, e₂ so as to improve the recognition accuracy of the entity relationship, the feature representation of the j-th word of the i-th corpus entry being given by the following formula:

$$h_{ij} = \left[\overrightarrow{h_{ij}} \oplus \overleftarrow{h_{ij}}\right]$$

S4.2, entity relationship prediction: processing the feature vector output in step S4.1 with a classifier and calculating, for the current corpus entry (e₁, e₂, r, s), the estimated probability that the relationship r is the relationship y_n (n ∈ [1, 8]) in the predefined entity relationship type set Y = [y₁, y₂, …, y₈]:

$$\hat{p}(y \mid s_i) = \mathrm{softmax}(W h_i + b)$$

S4.3, cost function optimization: taking the negative logarithm of the likelihood of the true label y gives the following cost function of the deep neural network:

$$J(\theta) = -\frac{1}{m}\sum_{n=1}^{m} t_n \log \hat{p}(y_n \mid s_i) + \lambda \lVert \theta \rVert^{2}$$

in the formula, $t_n$ denotes the n-th component of the one-hot vector of the true label, $\hat{p}(y_n \mid s_i)$ denotes the estimated probability of each predefined relationship type output by the softmax classifier in step S4.2, $m$ denotes the number of predefined relationship types (8 here), $\lambda$ denotes the L2 regularization hyperparameter, $\theta$ denotes the parameters of the entity relationship extraction model, and $\lVert \cdot \rVert$ denotes taking the norm; model training is completed by continuously adjusting the model hyperparameters so as to minimize the cost function $J(\theta)$;

S5, entity relationship extraction: extracting entity relationships from the military scenario text to be processed by using the trained model, specifically comprising:

S5.1, test text vectorization: vectorizing the original military scenario text to be processed sentence by sentence using the processing of step S3;

S5.2, entity relationship prediction: using the model trained in step S4 to predict, sentence by sentence, the semantic relationships of the vectorized military scenario text output in step S5.1, and storing the results.

2. The military scenario entity relationship extraction method combined with syntactic analysis according to claim 1, wherein the authoritative domain dictionaries comprise the Chinese Military Encyclopedia, the Military Dictionary, and the Concise Military Dictionary.

3. The military scenario entity relationship extraction method combined with syntactic analysis according to claim 1, wherein the ratio of the training data set to the test data set is 2:1.

4. The military scenario entity relationship extraction method combined with syntactic analysis according to claim 1, wherein the open-source syntactic parsing tool is the Stanford Parser.

5. The military scenario entity relationship extraction method combined with syntactic analysis according to claim 1, wherein the open-source word vector conversion tool is word2vec.

6. The military scenario entity relationship extraction method combined with syntactic analysis according to claim 1, wherein the specific neural network is a long short-term memory network.

7. The military scenario entity relationship extraction method combined with syntactic analysis according to claim 1, wherein the bidirectional neural network is a bidirectional long short-term memory network.

8. The military scenario entity relationship extraction method combined with syntactic analysis according to claim 1, wherein the classifier comprises a softmax classifier.

9. A military scenario entity relationship extraction device combined with syntactic analysis, the device comprising:

corpus construction module 100: predefining the target relationship types for entity relationship extraction, labeling the original military scenario text, and constructing the training data set and test data set for the entity relationship extraction model, specifically comprising:

entity relationship pre-defining unit 101: predefining entity relationship types to be extracted by adopting the principle and method of definition of the entity relationship types of the Semantic Evaluation conference;

the entity relationship corpus building unit 102: marking military scenario original texts by adopting a manual method according to predefined entity relationship types to generate an entity relationship extraction corpus;

the data set dividing unit 103: dividing a training data set and a test data set, and dividing a corpus obtained by the entity relationship corpus construction unit 102 into the training data set and the test data set according to a specific proportion;

syntax parsing module 200: parsing the sentence of each entry in the corpus and filtering out the sentence components that do not contribute to entity relationship extraction, specifically comprising:

syntax tree generation unit 201: performing syntax analysis on sentences in each corpus in the corpus by using an open source tool to generate a syntax tree;

syntax tree pruning unit 202: cutting branches and leaves except entities and root nodes thereof in the syntax tree to generate a syntax analysis sub-tree;

the subtree recombination unit 203: the syntax analysis subtrees are recombined into a text sequence, and the original sequence of words is not changed in the recombination process;

the data vectorization module 300: converting the recombination sequences generated by the sub-tree recombination unit 203 into word embedding sets expressed in a distributed vector form, which specifically comprises:

the training text vectorization unit 301: segmenting the recombined sequence of the currently input corpus entry into words to obtain a word set consisting of T words, and converting the words in the set into one-hot vectors based on the authoritative domain dictionaries;

the word embedding generation unit 302: converting the set of one-hot vectors obtained by the training text vectorization unit 301, word by word, into low-dimensional real-valued word embeddings by using an open-source tool;

model training module 400: training the deep-neural-network-based entity relationship extraction model with the vectorized entity relationship extraction training data set, specifically comprising:

semantic feature extraction unit 401: selecting a specific neural network as the basic relation extractor and extracting the high-level semantic features of the current sentence from the vector set output by the word embedding generation unit 302, the model adopting a bidirectional neural network to simultaneously extract the context semantic information of the entity pair e₁, e₂ so as to improve the recognition accuracy of the entity relationship;

entity relationship prediction unit 402: processing the feature vectors output by the semantic feature extraction unit 401 by using a classifier;

cost function optimization unit 403: obtaining a cost function of the deep neural network by calculating the logarithm of the negative likelihood function of the real label y, and continuously adjusting the hyper-parameters of the model by minimizing the cost function to finish model training;

the entity relationship extraction module 500: extracting entity relationships from the military scenario text to be processed by using the trained model, specifically comprising:

the test text vectorization unit 501: vectorizing the original military scenario text to be processed sentence by sentence using the processing procedure of the data vectorization module 300;

entity relationship prediction unit 502: using the model trained by the model training module 400 to predict, sentence by sentence, the semantic relationships of the vectorized military scenario text output by the test text vectorization unit 501, and storing the results.