CN110598203A

CN110598203A - Method and device for extracting entity information from military scenario documents in combination with a dictionary

Info

Publication number: CN110598203A
Application number: CN201910653281.3A
Authority: CN
Inventors: 蒋序平; 鲁义威; 杨若鹏; 张建军; 卢稳新; 朱巍; 刘乾
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2019-07-19
Filing date: 2019-07-19
Publication date: 2019-12-20
Anticipated expiration: 2039-07-19
Also published as: CN110598203B

Abstract

The invention discloses a method and a device for extracting entity information from military scenario documents in combination with a dictionary. The method comprises the following steps: 1. preprocessing the data and establishing a military scenario corpus and a military scenario entity dictionary; 2. constructing a military scenario dictionary and a word vector matrix; 3. determining 14 military scenario entity types and their semantic description rules, selecting corpus data for labeling, and establishing a training corpus and a test corpus in preparation for model training; 4. establishing an entity information extraction model and training its parameters; 5. extracting military scenario entity information from the military scenario text data to be predicted. The method effectively addresses problems in military scenario entity information extraction such as inadequate manually constructed features and strong dependence on word segmentation, thereby improving extraction efficiency.

Description

Method and device for extracting entity information from military scenario documents in combination with a dictionary

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a method and a device for extracting entity information from military scenario documents.

Background

A military scenario text is a descriptive text that sets out assumptions about the intentions and situations of the two opposing sides and the course of the battle. Entity information is the basic information element of military scenario data and the basis for extracting, processing and analyzing military scenario text data. Military scenario entity information extraction aims to find the entities hidden in unstructured and semi-structured military scenario text and to extract them by suitable means.

At present, named entity recognition methods in the general domain mainly comprise rule-based methods, methods based on statistics and machine learning, and deep learning-based methods. Rule-based methods achieve high accuracy but suffer from poor coverage, poor portability and high development cost; methods based on statistics and machine learning have low development cost but depend strongly on feature engineering and Chinese word segmentation; deep learning-based methods offer high precision and good portability, but still require word segmentation to construct word vectors and place high demands on computing power and corpus size.

In military scenario entity information extraction, methods based on rules and dictionaries are commonly used to extract semantic entities from military scenario text data; a conditional random field (CRF) model can be used to learn text features and identify entity information in a scenario; and combinations of multiple models (CRF with rules, or CRF with dictionaries and rules) can also be used to identify entity information. These traditional methods are well targeted, but they fall short in recognition performance and extensibility, adapt poorly to future changes in military scenario information, and cannot meet the requirements of automated, intelligent processing of massive data.

At present, military scenario entity information extraction mainly faces the following problems:

1) across different scenarios, a large number of entities appear in combined, nested, abbreviated and other forms;

2) owing to differences in the language style and conventions of scenarios, some entity classes are very numerous and their names are complex and variable, with no strict and uniform rules, so that comprehensive and reasonable entity features are difficult to construct;

3) existing word segmentation tools are designed mainly for the general domain and segment military scenario text with low accuracy; in particular, scenario terminology is rare in the general domain, and even a supplementary scenario dictionary can hardly cover all scenario entities, so that methods strongly dependent on word segmentation struggle to break through the current performance bottleneck.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art. Addressing practical problems such as the complexity of military scenario data and the difficulty of collecting it manually, a military scenario entity dictionary is established on the basis of authoritative dictionaries in the military field, a training corpus and a test corpus are established by determining 14 military scenario entity types and their semantic description rules, and an entity information extraction model is trained, thereby providing a method and a device for extracting entity information from military scenario documents in combination with a dictionary.

To achieve this purpose, the invention adopts the following technical solution:

A method for extracting entity information from military scenario documents in combination with a dictionary, the method comprising the following steps:

S1, data preprocessing: preprocessing the military scenario document data and establishing a military scenario entity dictionary, specifically comprising:

S1.1, establishing a corpus: preprocessing the military scenario data, removing meaningless symbols, dividing the text into sentences at Chinese sentence-ending punctuation marks, and establishing the corpus;

the Chinese sentence-ending punctuation marks include "。", "！" and the like;

S1.2, establishing a military scenario entity dictionary: selecting authoritative dictionaries in the fields to which the military scenario relates, collecting proper nouns from the authoritative dictionaries, establishing the military scenario entity dictionary organized into a military service and arms word bank, a weapon equipment word bank, a facility word bank and an action word bank, and analyzing and labeling the semantic structure of the entities;

the authoritative dictionaries in the field are openly published, widely recognized dictionaries in the field, including but not limited to the Chinese Military Encyclopedia, the Military Dictionary and the Concise Military Dictionary.
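As a non-limiting illustration, the sentence division of step S1.1 might be sketched in Python as follows. The set of symbols treated as meaningless, the inclusion of "？" among the sentence-ending marks, the function name `split_sentences` and the sample text are all assumptions made for this sketch and are not specified by the method.

```python
import re

# Assumption: "。" and "！" come from the description; "？" is added here
# as one plausible reading of "and the like".
SENTENCE_END = "。！？"

# Assumption: whitespace, "#" and "*" stand in for "meaningless symbols".
NOISE = re.compile(r"[\s#*]+")


def split_sentences(text):
    """Remove noise symbols, then split at sentence-ending punctuation.

    Each terminator stays attached to the sentence it closes.
    """
    cleaned = NOISE.sub("", text)
    return re.findall(rf"[^{SENTENCE_END}]+[{SENTENCE_END}]?", cleaned)


# Made-up sample text, for illustration only.
corpus = split_sentences("红军 向北机动。 蓝军占领 高地！")
for sentence in corpus:
    print(sentence)
```

Keeping the terminator attached to each sentence preserves the original text exactly, which simplifies aligning character labels with the corpus later on.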

S2, word vector generation: constructing a military scenario dictionary and a word vector matrix from the military scenario corpus and the military scenario entity dictionary, specifically comprising:

S2.1, character statistics: counting all characters appearing in the military scenario entity dictionary and the authoritative dictionaries in the field, assigning a numeric index to each character to obtain the military scenario dictionary, and recording the total number of characters Z in the corpus dictionary;

S2.2, generating the military scenario word vector matrix: training on the corpus dictionary with an open-source word vector matrix generation tool to obtain a multidimensional military scenario word vector matrix;

the open-source word vector matrix generation tools include, but are not limited to, word2vec, GloVe and the like.
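By way of example only, the character statistics of step S2.1 might be sketched in Python as below. The function name `build_char_index` and the sample dictionary entries are hypothetical; training the vectors themselves (step S2.2) is left to a tool such as word2vec or GloVe and is not shown.

```python
def build_char_index(texts):
    """Assign a numeric index to every distinct character, in order of first appearance."""
    index = {}
    for text in texts:
        for ch in text:
            if ch not in index:
                index[ch] = len(index)
    return index


# Made-up entries standing in for the entity dictionary and the
# authoritative dictionaries; real data would be far larger.
entity_dictionary = ["装甲旅", "导弹营"]
authoritative_terms = ["装甲车", "导弹"]

char_index = build_char_index(entity_dictionary + authoritative_terms)
Z = len(char_index)  # total number of distinct characters

print(Z, char_index)
```

Indexing by first appearance is one simple choice; any stable one-to-one mapping from characters to 0..Z-1 would serve equally well as the column index of the word vector matrix.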

S3, corpus labeling: determining complete definition rules for military scenario entity types by combining the authoritative dictionaries in the field with the corpus, selecting corpus data for labeling, and establishing a training corpus and a test corpus in preparation for model training, specifically comprising:

S3.1, determining military scenario entity types and semantic description rules: analyzing the corpus content with reference to the authoritative dictionaries, consulting several experts in the field, and determining 14 military scenario entity types, in the three major categories of entity names, time expressions and numeric expressions, together with their semantic description rules;

S3.2, generating word labels: using a method that combines manual labeling of word attributes with automatic generation of character labels, assigning a uniform label to each word in the preprocessed data from step S1, sentence by sentence;

S3.3, generating character labels: using an open-source toolkit and a specific labeling scheme, generating character labels for the labeled text;

the open-source toolkits include, but are not limited to, YEDDA, brat and the like;

the specific labeling schemes include, but are not limited to, BIO, BIEOS and the like. In the BIO scheme, the B label marks the first character of a word, the I label marks a character inside a word, and the O label marks a non-entity character; in the BMEWO scheme, the B label marks a beginning character, the I label a middle character, the E label an end character, the W label a single character, and the O label a non-entity character.
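For illustration, the automatic generation of character labels under the BIO scheme (step S3.3) might look like the following Python sketch. It assumes that the manual labeling step yields entity spans as (start, end, type) tuples with an exclusive end; the function name `spans_to_bio` and the type name "UNIT" are hypothetical and are not among the entity types defined by the method.

```python
def spans_to_bio(sentence, spans):
    """Turn word-level entity spans into one BIO label per character.

    spans: iterable of (start, end, entity_type) with end exclusive.
    """
    tags = ["O"] * len(sentence)
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"      # first character of the entity
        for k in range(start + 1, end):
            tags[k] = f"I-{entity_type}"      # remaining characters of the entity
    return list(zip(sentence, tags))


# Made-up sentence and annotation, for illustration only.
sentence = "装甲旅向北机动"
spans = [(0, 3, "UNIT")]
tagged = spans_to_bio(sentence, spans)

for ch, tag in tagged:
    print(ch, tag)
```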

S4, model training: establishing an entity information extraction model on the basis of the military scenario dictionary and the word vector matrix, and training the parameters of the entity information extraction model, specifically comprising:

S4.1, text sequence segmentation: the input text sequence is segmented into sentences, and a sentence containing n characters is written $X = (x_1, x_2, \ldots, x_n)$. Based on the military scenario dictionary and the word vector matrix established in step S2, each character $x_i$ of $X$ is converted into a word vector $e_i$ taken from the $w$-dimensional word vector matrix $V^w \in D^{w \times Z}$:

$$e_i = V^w \times z^i \tag{1}$$

In the formula, $z^i$ is a vector of dimension $Z$ whose $i$-th row is 1 and whose other rows are 0. The input sentence $X$ thus becomes the character embedding word vector sequence $E = (e_1, e_2, \ldots, e_n)$;
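As a small worked example of formula (1), the following Python sketch shows that multiplying the word vector matrix by a one-hot vector simply selects one column of the matrix. The matrix values, the sizes w = 2 and Z = 3, and the helper names `one_hot` and `matvec` are assumptions chosen only to keep the example readable.

```python
def one_hot(index, size):
    """Column vector with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if r == index else 0.0 for r in range(size)]


def matvec(matrix, vector):
    """Plain matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Toy word vector matrix: w = 2 rows (embedding size), Z = 3 columns (characters).
V = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]

z = one_hot(1, 3)              # one-hot vector for the character with index 1
e = matvec(V, z)               # formula (1): e = V x z
column = [row[1] for row in V] # direct lookup of column 1

print(e, column)
```

In practice the product is therefore usually implemented as a table lookup rather than an explicit multiplication.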

S4.2, hidden state sequence generation: the word vector sequence $E = (e_1, e_2, \ldots, e_n)$ generated in step S4.1 is used as the input at each time step of a bidirectional long short-term memory (BiLSTM) neural network, and the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are concatenated position by position to obtain the complete hidden state sequence $S^{BiLSTM}$;
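The position-wise concatenation of step S4.2 might be sketched as follows. This is a minimal pure-Python LSTM with randomly initialised, untrained weights, intended only to show how forward and backward hidden states are aligned and joined; the class name `LSTMCell`, the helper names, the dimensions and the gate ordering are assumptions of this sketch, and a practical system would use a deep learning framework instead.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class LSTMCell:
    """Tiny LSTM cell; weights are random because this sketch does no training."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        cols = input_dim + hidden_dim
        # Rows are grouped by gate: input, forget, output, candidate.
        self.W = [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                  for _ in range(4 * hidden_dim)]
        self.b = [0.0] * (4 * hidden_dim)

    def step(self, x, h, c):
        xh = x + h  # list concatenation of input and previous hidden state
        z = [sum(w * v for w, v in zip(row, xh)) + b
             for row, b in zip(self.W, self.b)]
        H = self.hidden_dim
        i = [sigmoid(v) for v in z[0:H]]
        f = [sigmoid(v) for v in z[H:2 * H]]
        o = [sigmoid(v) for v in z[2 * H:3 * H]]
        g = [math.tanh(v) for v in z[3 * H:4 * H]]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def run(cell, sequence):
    """Run the cell over a sequence and return the hidden state at every step."""
    h = [0.0] * cell.hidden_dim
    c = [0.0] * cell.hidden_dim
    states = []
    for x in sequence:
        h, c = cell.step(x, h, c)
        states.append(h)
    return states


def bilstm_hidden_states(E, fwd, bwd):
    forward = run(fwd, E)
    backward = run(bwd, E[::-1])[::-1]  # reverse again to realign with positions
    return [hf + hb for hf, hb in zip(forward, backward)]


rng = random.Random(0)
E = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]  # 5 characters, w = 4
fwd = LSTMCell(4, 3, rng)
bwd = LSTMCell(4, 3, rng)
S = bilstm_hidden_states(E, fwd, bwd)
print(len(S), len(S[0]))
```

Each element of `S` holds the forward state followed by the backward state for the same character, so every position sees context from both directions.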

S4.3, optimal output tag sequence generation: the sequence $S^{BiLSTM}$ generated in step S4.2 is input into a conditional random field (CRF) model to obtain a transition matrix $A$, and the tag sequence of sentence $X$ is written $Y = (y_1, y_2, \ldots, y_n)$. Because many entity types are extracted during entity recognition, an exponential form is adopted to improve feature discrimination when constructing the evaluation function of tag sequence $Y$ for sentence $X$:

In the formula, $S^{BiLSTM}$ is the hidden state sequence, $y_i$ is the $i$-th tag, and $A$ is the transition matrix. The evaluation function is computed during model training, and the tag sequence at which it attains its maximum is the optimal output tag sequence.
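As a hedged sketch of how a maximizing tag sequence can be found, the following Python code assumes the path score commonly used with BiLSTM-CRF taggers, namely the sum of per-position tag scores derived from the hidden states and the transition scores between adjacent tags, and maximizes it with Viterbi dynamic programming. This score is an assumption and is not the evaluation function of the method; if the evaluation function is a monotonically increasing (for example exponential) transform of such a score, the maximizing sequence is the same. The function name `viterbi`, the three-tag set and all numeric values are illustrative.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag path and its score.

    emissions[t][k]   : score of tag k at position t
    transitions[j][k] : score of moving from tag j to tag k
    """
    n, K = len(emissions), len(emissions[0])
    score = list(emissions[0])
    back = []
    for t in range(1, n):
        new_score, pointers = [], []
        for k in range(K):
            best_j = max(range(K), key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best_j] + transitions[best_j][k] + emissions[t][k])
            pointers.append(best_j)
        score = new_score
        back.append(pointers)
    last = max(range(K), key=lambda k: score[k])
    path = [last]
    for pointers in reversed(back):   # follow back-pointers from the end
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


tags = ["O", "B", "I"]
# Toy scores; the O -> I transition is heavily penalised because it is invalid in BIO.
emissions = [[0.1, 2.0, 0.0],
             [0.5, 0.0, 1.0],
             [2.0, 0.0, 0.3]]
transitions = [[0.5, 0.0, -10.0],
               [0.0, -1.0, 1.0],
               [0.5, 0.0, 0.5]]

path, best = viterbi(emissions, transitions)
print([tags[k] for k in path], best)
```

The dynamic program costs on the order of n·K² operations for n characters and K tags, which avoids enumerating all Kⁿ candidate sequences.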

S5, entity information extraction: applying the trained entity information extraction model to extract military scenario entity information from the text data to be predicted, specifically comprising:

S5.1, text preprocessing: preprocessing the input military scenario text data;

S5.2, vectorized representation: based on the military scenario dictionary and the word vector matrix established in step S2, the sentence S1 to be extracted is converted into a vector representation and input into the trained model;

S5.3, acquiring entity information: the entity information extraction model is applied to the input sentence vectors to generate the sequence $S1^{BiLSTM}$, which is input into the conditional random field (CRF) model to obtain a transition matrix $A1$; the tag sequence of sentence S1 is written $Y1 = (y_1, y_2, \ldots, y_n)$, and the evaluation function of tag sequence $Y1$ for the sentence S1 to be extracted is:

where $S1^{BiLSTM}$ is the hidden state sequence, $y_i$ is the $i$-th tag, and $A1$ is the transition matrix. The evaluation function is computed, the tag sequence at which it attains its maximum is the optimal output tag sequence, and the entity information of S1 is extracted from it.
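Once an optimal tag sequence is available, the final extraction in step S5.3 amounts to reading entities off the tags. A minimal Python sketch for BIO tags is given below; the function name `extract_entities`, the type names "UNIT" and "TIME", and the sample sentence are hypothetical.

```python
def extract_entities(sentence, tags):
    """Collect (text, entity_type) pairs from a BIO-tagged character sequence."""
    entities, start, label = [], None, None
    # A trailing "O" sentinel flushes an entity that ends at the last character.
    for pos, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            entities.append((sentence[start:pos], label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = pos, tag[2:]
    return entities


# Made-up sentence and predicted tags, for illustration only.
sentence = "装甲旅于拂晓机动"
tags = ["B-UNIT", "I-UNIT", "I-UNIT", "O", "B-TIME", "I-TIME", "O", "O"]
entities = extract_entities(sentence, tags)
print(entities)
```

An I tag that does not follow a matching B tag is ignored in this sketch; other recovery policies are possible.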

By adopting the above method for extracting entity information from military scenario documents in combination with a dictionary, the invention has the following advantages:

1. it effectively addresses problems in military scenario document entity information extraction such as inadequate manually constructed features and strong dependence on word segmentation;

2. it greatly reduces the workload of military scenario data collection;

3. it improves the efficiency of military scenario document entity information extraction.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart of an embodiment of the method for extracting entity information from military scenario documents in combination with a dictionary;

FIG. 2 is a block diagram of the structure of the device of the present invention.

Detailed Description

The following further describes embodiments of the present invention with reference to the drawings. It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, a flow chart of the method for extracting entity information from military scenario documents in combination with a dictionary is shown, which comprises the following steps:

S1.1, establishing a corpus: preprocessing the military scenario data, removing meaningless symbols, dividing the text into sentences at Chinese sentence-ending punctuation marks such as "。" and "！", and establishing the corpus;

S1.2, establishing a military scenario entity dictionary: for the fields to which the military scenario relates, selecting openly published, widely recognized dictionaries in those fields, such as the Chinese Military Encyclopedia, the Military Dictionary and the Concise Military Dictionary, collecting proper nouns from the authoritative dictionaries, establishing the military scenario entity dictionary organized into a military service and arms word bank, a weapon equipment word bank, a facility word bank and an action word bank, and analyzing and labeling the semantic structure of the entities.

S2.2, generating the military scenario word vector matrix: training on the corpus dictionary with an open-source word vector matrix generation tool such as word2vec or GloVe to obtain a multidimensional military scenario word vector matrix.

S3.1, determining military scenario entity types and semantic description rules: analyzing the corpus content with reference to the authoritative dictionaries, consulting several experts in the field, and determining 14 military scenario entity types, in the three major categories of entity names, time expressions and numeric expressions, together with their semantic description rules, as shown in the following table:

S3.3, generating character labels: using open-source toolkits such as YEDDA and brat and labeling schemes such as BIO and BIEOS, generating character labels for the labeled text. In the BIO scheme, the B label marks an initial character, the I label marks a character inside a word, and the O label marks a non-entity character; in the BMEWO scheme, the B label marks a beginning character, the I label a middle character, the E label an end character, the W label a single character, and the O label a non-entity character.

S4, model training: establishing an entity information extraction model on the basis of the military scenario dictionary and the word vector matrix, and training the parameters of the entity information extraction model, specifically comprising:

S4.1, text sequence segmentation: the input text sequence is segmented into sentences, and a sentence containing n characters is written $X = (x_1, x_2, \ldots, x_n)$. Based on the military scenario dictionary and the word vector matrix established in step S2, each character $x_i$ of $X$ is converted into a word vector $e_i$ taken from the $w$-dimensional word vector matrix $V^w \in D^{w \times Z}$:

$$e_i = V^w \times z^i \tag{1}$$

S4.2, hidden state sequence generation: the word vector sequence $E = (e_1, e_2, \ldots, e_n)$ generated in step S4.1 is used as the input at each time step of a bidirectional long short-term memory (BiLSTM) neural network, and the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are concatenated position by position to obtain the complete hidden state sequence $S^{BiLSTM}$;

S4.3, optimal output tag sequence generation: the sequence $S^{BiLSTM}$ generated in step S4.2 is input into a conditional random field (CRF) model to obtain a transition matrix $A$, and the tag sequence of sentence $X$ is written $Y = (y_1, y_2, \ldots, y_n)$. Because many entity types are extracted during entity recognition, an exponential form is adopted to improve feature discrimination when constructing the evaluation function of tag sequence $Y$ for sentence $X$:

S5, entity information extraction: applying the trained entity information extraction model to extract military scenario entity information from the text data to be predicted, specifically comprising:

S5.2, vectorized representation: based on the military scenario dictionary and the word vector matrix established in step S2, the sentence S1 to be extracted is converted into a vector representation and input into the trained model;

Referring to FIG. 2, a block diagram of an embodiment of the device for extracting entity information from military scenario documents in combination with a dictionary is shown, which specifically comprises the following components:

the data preprocessing module 100 is configured to preprocess military scenario document data and establish a military scenario entity dictionary, and specifically includes:

a corpus establishing unit 101, which preprocesses the military scenario data, removes meaningless symbols, divides the text into sentences at Chinese sentence-ending punctuation marks, and establishes the corpus;

a military scenario entity dictionary establishing unit 102, which selects authoritative dictionaries in the fields to which the military scenario relates, collects proper nouns from the authoritative dictionaries, establishes the military scenario entity dictionary organized into a military service and arms word bank, a weapon equipment word bank, a facility word bank and an action word bank, and analyzes and labels the semantic structure of the entities.

The word vector generation module 200 constructs a military scenario dictionary and a word vector matrix from the military scenario corpus and the military scenario entity dictionary, and specifically includes:

a character counting unit 201, which counts all characters appearing in the military scenario entity dictionary and the authoritative dictionaries in the field, assigns a numeric index to each character to obtain the military scenario dictionary, and records the total number of characters in the corpus dictionary;

a military scenario word vector matrix generation unit 202, which trains on the corpus dictionary with an open-source word vector matrix generation tool to obtain a military scenario word vector matrix of a given dimension.

The corpus labeling module 300 determines complete definition rules for military scenario entity types by combining the authoritative dictionaries in the field with the corpus, selects corpus data for labeling, and establishes a training corpus and a test corpus in preparation for model training, and specifically includes:

a military scenario entity type and semantic description rule determination unit 301, which analyzes the corpus content with reference to the authoritative dictionaries, consults several experts in the field, and determines 14 military scenario entity types, in the three major categories of entity names, time expressions and numeric expressions, together with their semantic description rules;

a word label generating unit 302, which uses a method combining manual labeling of word attributes with automatic generation of character labels to assign a uniform label to each word in the preprocessed data from the data preprocessing module 100, sentence by sentence;

a character label generating unit 303, which uses an open-source toolkit and a specific labeling scheme to generate character labels for the labeled text.

The model training module 400 establishes an entity information extraction model on the basis of the military scenario dictionary and the word vector matrix and trains the parameters of the entity information extraction model, and specifically includes:

a text sequence segmentation unit 401, which segments the input text sequence with the sentence as the basic unit;

a hidden state sequence generating unit 402, which uses the word vector sequence generated in the text sequence segmentation unit 401 as the input at each time step of the bidirectional long short-term memory neural network, and concatenates the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM position by position to obtain the complete hidden state sequence;

an optimal output tag sequence generating unit 403, which inputs the sequence generated in the hidden state sequence generating unit 402 into a conditional random field (CRF) model to obtain the optimal output tag sequence.

The entity information extraction module 500 applies the trained entity information extraction model to extract military scenario entity information from the text data to be predicted, and specifically includes:

a text preprocessing unit 501, which preprocesses the input military scenario text data;

a vectorized representation unit 502, which converts the sentence to be extracted into a vector representation based on the military scenario dictionary and the word vector matrix established in the word vector generation module 200, and inputs it into the trained model;

an entity information obtaining unit 503, which applies the entity information extraction model to the input sentence vectors to generate a sequence, inputs the sequence into the conditional random field (CRF) model, and extracts the entity information.

Claims

1. A method for extracting entity information from military scenario documents in combination with a dictionary, characterized in that the method comprises the following steps:

S1, data preprocessing: preprocessing the military scenario document data and establishing a military scenario entity dictionary, specifically comprising:

S1.1, establishing a corpus: preprocessing the military scenario data, removing meaningless symbols, dividing the text into sentences at Chinese sentence-ending punctuation marks, and establishing the corpus;

S1.2, establishing a military scenario entity dictionary: selecting authoritative dictionaries in the fields to which the military scenario relates, collecting proper nouns from the authoritative dictionaries in the field, establishing the military scenario entity dictionary organized into a military service and arms word bank, a weapon equipment word bank, a facility word bank and an action word bank, and analyzing and labeling the semantic structure of the entities;

S2, word vector generation: constructing a military scenario dictionary and a word vector matrix from the military scenario corpus and the military scenario entity dictionary established in step S1.2, specifically comprising:

S2.1, character statistics: counting all characters appearing in the military scenario entity dictionary and the authoritative dictionaries in the field, assigning a numeric index to each character to obtain the military scenario dictionary, and recording the total number of characters in the corpus dictionary;

S2.2, generating the military scenario word vector matrix: training on the corpus dictionary with an open-source word vector matrix generation tool to obtain a multidimensional military scenario word vector matrix;

S3, corpus labeling: determining definition rules for military scenario entity types by combining the authoritative dictionaries in the field with the corpus, selecting corpus data for labeling, and establishing a training corpus and a test corpus in preparation for model training, specifically comprising:

S3.1, determining military scenario entity types and semantic description rules: analyzing the corpus content with reference to the authoritative dictionaries in the field, and determining 14 military scenario entity types, in the three major categories of entity names, time expressions and numeric expressions, together with their semantic description rules;

S3.2, generating word labels: using a method combining manual labeling of word attributes with automatic generation of character labels, assigning a uniform label to each word in the preprocessed data from step S1, sentence by sentence;

S3.3, generating character labels: using a specific open-source toolkit and a specific labeling scheme, generating character labels for the labeled text;

S4, model training: establishing an entity information extraction model on the basis of the military scenario dictionary and the word vector matrix, and training the parameters of the entity information extraction model, specifically comprising:

S4.1, text sequence segmentation: the input text sequence is segmented with the sentence as the basic unit, and a sentence containing n characters is written $X = (x_1, x_2, \ldots, x_n)$; based on the military scenario dictionary and the word vector matrix established in step S2, each character $x_i$ of $X$ is converted into a word vector $e_i$ taken from the $w$-dimensional word vector matrix $V^w \in D^{w \times Z}$:

$$e_i = V^w \times z^i \tag{1}$$

In the formula, $z^i$ is a vector of dimension $Z$ whose $i$-th row is 1 and whose other rows are 0, and the input sentence $X$ becomes the character embedding word vector sequence $E = (e_1, e_2, \ldots, e_n)$;

S4.2, hidden state sequence generation: the word vector sequence $E = (e_1, e_2, \ldots, e_n)$ generated in step S4.1 is used as the input at each time step of a bidirectional long short-term memory neural network, and the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are concatenated position by position to obtain the complete hidden state sequence $S^{BiLSTM}$;

S4.3, optimal output tag sequence generation: the sequence $S^{BiLSTM}$ generated in step S4.2 is input into a conditional random field model to obtain a transition matrix $A$, and the tag sequence of sentence $X$ is written $Y = (y_1, y_2, \ldots, y_n)$; because many entity types are extracted during entity recognition, an exponential form is adopted to improve feature discrimination when constructing the evaluation function of tag sequence $Y$ for sentence $X$:

in the formula, $S^{BiLSTM}$ is the hidden state sequence, $y_i$ is the $i$-th tag, and $A$ is the transition matrix; the evaluation function is computed, and the tag sequence at which it attains its maximum is the optimal output tag sequence;

S5, entity information extraction: applying the trained entity information extraction model to extract military scenario entity information from the text data to be predicted, specifically comprising:

S5.1, text preprocessing: preprocessing the input military scenario text data;

S5.2, vectorized representation: based on the military scenario dictionary and the word vector matrix established in step S2, converting the sentence S1 to be extracted into a vector representation and inputting it into the trained model;

S5.3, acquiring entity information: applying the entity information extraction model to the input sentence vectors to generate the sequence $S1^{BiLSTM}$, which is input into the conditional random field model to obtain a transition matrix $A1$, wherein the tag sequence of sentence S1 is written $Y1 = (y_1, y_2, \ldots, y_n)$ and the evaluation function of tag sequence $Y1$ for the sentence S1 to be extracted is:

2. The method for extracting entity information from military scenario documents in combination with a dictionary according to claim 1, wherein the Chinese sentence-ending punctuation marks comprise "。" and "！".

3. The method for extracting entity information from military scenario documents in combination with a dictionary according to claim 1, wherein the authoritative dictionaries in the field comprise the Chinese Military Encyclopedia, the Military Dictionary and the Concise Military Dictionary.

4. The method for extracting entity information from military scenario documents in combination with a dictionary according to claim 1, wherein the open-source word vector matrix generation tool comprises word2vec and GloVe.

5. The method for extracting entity information from military scenario documents in combination with a dictionary according to claim 1, wherein the open-source toolkit comprises YEDDA and brat.

6. The method for extracting entity information from military scenario documents in combination with a dictionary according to claim 1, wherein the specific labeling scheme comprises BIO and BIEOS.

7. A device for extracting entity information from military scenario documents in combination with a dictionary, the device comprising:

the data preprocessing module 100: preprocessing military scenario document data and establishing a military scenario entity dictionary, specifically comprising:

corpus establishing unit 101: preprocessing the military scenario data, removing meaningless symbols, dividing the text into sentences at Chinese sentence-ending punctuation marks, and establishing the corpus;

the military scenario entity dictionary establishing unit 102: selecting authoritative dictionaries in the fields to which the military scenario relates, collecting proper nouns from the authoritative dictionaries, establishing the military scenario entity dictionary organized into a military service and arms word bank, a weapon equipment word bank, a facility word bank and an action word bank, and analyzing and labeling the semantic structure of the entities;

word vector generation module 200: constructing a military scenario dictionary and a word vector matrix from the military scenario corpus and the military scenario entity dictionary, specifically comprising:

character counting unit 201: counting all characters appearing in the military scenario entity dictionary and the authoritative dictionaries in the field, assigning a numeric index to each character to obtain the military scenario dictionary, and recording the total number of characters in the corpus dictionary;

the military scenario word vector matrix generation unit 202: training on the corpus dictionary with an open-source word vector matrix generation tool to obtain a multidimensional military scenario word vector matrix;

corpus labeling module 300: determining definition rules for military scenario entity types by combining the authoritative dictionaries with the corpus, selecting corpus data for labeling, and establishing a training corpus and a test corpus in preparation for model training, specifically comprising:

military scenario entity type and semantic description rule determination unit 301: analyzing the corpus content with reference to the authoritative dictionaries, consulting several experts in the field, and determining 14 military scenario entity types, in the three major categories of entity names, time expressions and numeric expressions, together with their semantic description rules;

the word label generation unit 302: using a method combining manual labeling of word attributes with automatic generation of character labels, assigning a uniform label to each word in the preprocessed data from the data preprocessing module 100, sentence by sentence;

the character label generation unit 303: using an open-source toolkit and a specific labeling scheme, generating character labels for the labeled text;

model training module 400: establishing an entity information extraction model on the basis of the military scenario dictionary and the word vector matrix, and training the parameters of the entity information extraction model, specifically comprising:

text sequence segmentation unit 401: segmenting the input text sequence with the sentence as the basic unit;

hidden state sequence generation unit 402: using the word vector sequence generated in the text sequence segmentation unit 401 as the input at each time step of the bidirectional long short-term memory neural network, and concatenating the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM position by position to obtain the complete hidden state sequence;

optimal output tag sequence generation unit 403: inputting the sequence generated in the hidden state sequence generation unit 402 into a conditional random field model to obtain the optimal output tag sequence;

the entity information extraction module 500: applying the trained entity information extraction model to extract military scenario entity information from the text data to be predicted, specifically comprising:

the text preprocessing unit 501: preprocessing the input military scenario text data;

vectorized representation unit 502: based on the military scenario dictionary and the word vector matrix established in the word vector generation module 200, converting the sentence to be extracted into a vector representation and inputting it into the trained model;

the entity information acquisition unit 503: applying the entity information extraction model to the input sentence vectors to generate a sequence, inputting the sequence into the conditional random field model, and extracting the entity information.