CN110598203A - Method and device for extracting entity information from military scenario documents in combination with a dictionary - Google Patents

Info

Publication number
CN110598203A
Authority
CN
China
Prior art keywords
military
dictionary
entity
entity information
sequence
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910653281.3A
Other languages
Chinese (zh)
Other versions
CN110598203B (en
Inventor
蒋序平
鲁义威
杨若鹏
张建军
卢稳新
朱巍
刘乾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a device for extracting entity information from military scenario documents in combination with a dictionary. The method comprises the following steps: 1. preprocessing the data and establishing a military scenario corpus set and a military scenario entity dictionary; 2. constructing a military scenario dictionary and a word vector matrix; 3. determining 14 military scenario entity types and their semantic description rules, selecting corpora for labeling, and establishing a training corpus set and a test corpus set in preparation for model training; 4. establishing an entity information extraction model and training its parameters; 5. extracting military scenario entity information from the military scenario text data to be predicted. The method can effectively address problems in military scenario entity information extraction such as the inadequacy of manually constructed features and the strong dependency on word segmentation, thereby improving the efficiency of military scenario entity information extraction.

Description

Method and device for extracting entity information from military scenario documents in combination with a dictionary
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a method and a device for extracting entity information from military scenario texts.
Background
A military scenario text is a descriptive text that hypothesizes and envisions the intentions, situation and course of combat of the two opposing sides. Entity information is the basic information element of military scenario data and the basis for extracting, processing and analyzing military scenario text data. Military scenario text entity information extraction aims to find the entities hidden in unstructured and semi-structured military scenario text and to extract them by suitable means.
At present, methods for named entity recognition in the general domain mainly include rule-based methods, methods based on statistics and machine learning, and methods based on deep learning. Rule-based methods have high accuracy, but their coverage and portability are poor and their development cost is high. Methods based on statistics and machine learning have low development cost, but depend strongly on feature engineering and Chinese word segmentation. Methods based on deep learning have high precision and strong portability, but still require word segmentation to construct word vectors and place high demands on computing power and corpus size.
In military scenario entity information extraction, methods based on rules and dictionaries are commonly used to extract semantic entities from military scenario text data; a conditional random field (CRF) model can be used to learn text features and recognize entity information in a scenario; and combinations of multiple models (CRF with rules, or CRF with dictionaries and rules) can also be used to recognize entity information. These traditional methods are well targeted, but fall somewhat short in recognition performance and extensibility; they adapt poorly to future changes in military scenario information and cannot meet the needs of automated, intelligent processing of massive data.
At present, military scenario entity information extraction mainly has the following problems:
1) under different scenarios, a large number of entities appear in various forms such as combination, nesting and abbreviation;
2) because language style and habits differ between scenarios, some entity classes are very large, their names are complex and variable, and no strict, uniform rule applies, so comprehensive and reasonable entity features are difficult to construct;
3) existing word segmentation tools are mainly designed for the general domain, and their segmentation accuracy on military scenario text is low; in particular, specialized scenario terms are rare in the general domain, and even with a scenario dictionary added it is difficult to cover all scenario entities, so methods that depend strongly on word segmentation can hardly break through the current recognition bottleneck.
Disclosure of Invention
The invention aims to overcome the defects of the prior art. To address practical problems such as the complexity of military scenario data and the difficulty of manual acquisition, a military scenario entity dictionary is established from authoritative dictionaries in the military field, a training corpus set and a test corpus set are established by determining 14 military scenario entity types and their semantic description rules, and an entity information extraction model is trained, thereby providing a method and a device for extracting entity information from military scenario documents in combination with a dictionary.
In order to achieve this purpose, the invention adopts the following technical scheme:
A method for extracting entity information from military scenario documents in combination with a dictionary, the method comprising the following steps:
S1, data preprocessing: preprocessing military scenario document data and establishing a military scenario entity dictionary, specifically comprising:
S1.1, establishing a corpus set: preprocessing the military scenario data, removing meaningless symbols, splitting the text into sentences at Chinese sentence-ending symbols, and establishing the corpus set;
the Chinese sentence-ending symbols include "。", "！" and the like;
S1.2, establishing a military scenario entity dictionary: selecting authoritative dictionaries in the field according to the fields the military scenario involves, collecting proper nouns from the authoritative dictionaries, establishing the military scenario entity dictionary organized into a military branch word bank, a weapon equipment word bank, a facility word bank and an action word bank, and analyzing and labeling the semantic structure of the entities;
the authoritative dictionaries in the field are publicly published and widely recognized dictionaries in the field, including but not limited to the Chinese Military Encyclopedia, the Military Dictionary and the Concise Military Dictionary.
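The sentence splitting of step S1.1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_sentences`, the set of symbols treated as "meaningless", and the sample text are all assumptions made for the example.

```python
import re

# Sentence-ending symbols named in step S1.1; the set can be extended.
SENTENCE_END = "。！"


def split_sentences(text):
    """Remove meaningless symbols, then split the text into sentences."""
    # Assumption: whitespace and a few decorative marks count as "meaningless".
    cleaned = re.sub(r"[\s*#■◆●]+", "", text)
    # Split right after each sentence-ending symbol so the symbol stays
    # attached to its sentence (zero-width split, Python 3.7+).
    parts = re.split("(?<=[" + SENTENCE_END + "])", cleaned)
    return [p for p in parts if p]


# Made-up sample text, used only to show the behaviour.
corpus = split_sentences("红军 于3日拂晓发起进攻。蓝军 坚守阵地！")
```

Keeping the sentence-ending symbol attached to each sentence preserves the original text exactly, which makes it easier to align character labels with the source later on.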
S2, word vector generation: constructing a military scenario dictionary and a word vector matrix according to the military scenario corpus set and the military scenario entity dictionary, specifically comprising:
S2.1, character statistics: counting all characters appearing in the military scenario entity dictionary and the authoritative dictionaries in the field, assigning a numeric index to each character to obtain the military scenario dictionary, and recording the total number of characters Z of the corpus dictionary;
S2.2, generating the military scenario word vector matrix: training on the corpus dictionary with an open-source word vector matrix generation tool to obtain a multidimensional military scenario word vector matrix;
the open-source word vector matrix generation tools include, but are not limited to, word2vec, GloVe and the like.
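The character statistics of step S2.1 amount to building a character-to-index mapping. The sketch below assumes indices are assigned in first-seen order, which the text does not specify; `build_char_index` and the two dictionary entries are hypothetical. Training the vectors themselves (step S2.2, with a tool such as word2vec or GloVe) is not shown.

```python
def build_char_index(texts):
    """Assign a numeric index to every distinct character, in first-seen order."""
    char_to_index = {}
    for text in texts:
        for ch in text:
            if ch not in char_to_index:
                char_to_index[ch] = len(char_to_index)
    return char_to_index


# Hypothetical entity dictionary entries, for illustration only.
entity_dictionary = ["坦克营", "炮兵营"]
char_to_index = build_char_index(entity_dictionary)
Z = len(char_to_index)  # total number of distinct characters
```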
S3, corpus labeling: determining complete military scenario entity type definition rules by combining the authoritative dictionaries and the corpus, selecting corpora for labeling, and establishing a training corpus set and a test corpus set in preparation for model training, specifically comprising:
S3.1, determining military scenario entity types and semantic description rules: analyzing the corpus content with reference to the authoritative dictionaries, consulting several experts in the field, and determining 14 military scenario entity types, with their semantic description rules, in three major categories: entity names, time expressions and numeric expressions;
S3.2, generating word labels: using a method of manually labeling word attributes and automatically generating character labels, assigning a uniform label to each word in the preprocessing result data of step S1, sentence by sentence;
S3.3, generating character labels: generating character labels for the labeled text with an open-source toolkit and a specific labeling scheme;
the open-source toolkits include, but are not limited to, YEDDA, brat and the like;
the specific labeling schemes include, but are not limited to, BIO, BIEOS and the like. In the BIO scheme, a B label denotes the first character of an entity, an I label denotes a character inside an entity, and an O label denotes a non-entity character; in the BMEWO scheme, a B label denotes a beginning character, an I label denotes a middle character, an E label denotes an end character, a W label denotes a single-character entity, and an O label denotes a non-entity character.
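The automatic expansion from word labels (S3.2) to character labels (S3.3) can be illustrated for the BIO scheme. This is a sketch under assumptions: the function `words_to_bio`, the convention that `None` marks a non-entity word, and the label name `UNIT` are invented for the example and do not come from the patent's 14 entity types.

```python
def words_to_bio(labeled_words):
    """Expand (word, label) pairs into per-character BIO labels.

    A label of None marks a non-entity word.
    """
    chars, tags = [], []
    for word, label in labeled_words:
        for pos, ch in enumerate(word):
            chars.append(ch)
            if label is None:
                tags.append("O")            # non-entity character
            elif pos == 0:
                tags.append("B-" + label)   # first character of the entity
            else:
                tags.append("I-" + label)   # character inside the entity
    return chars, tags


# "UNIT" is a hypothetical entity type name.
chars, tags = words_to_bio([("坦克营", "UNIT"), ("进攻", None)])
```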
S4, model training: establishing an entity information extraction model according to the military scenario dictionary and the word vector matrix, and training the parameters of the entity information extraction model, specifically comprising:
S4.1, text sequence segmentation: segmenting the input text sequence into sentences, where a sentence containing n characters is expressed as $X = (x_1, x_2, \ldots, x_n)$; based on the military scenario dictionary and the word vector matrix established in step S2, each character $x_i$ of $X$ is converted into a word vector $e_i$ in the word vector matrix $V_w \in D^{w \times Z}$ of dimension $w$:
$$e_i = V_w \times z_i \tag{1}$$
where $z_i$ is a vector of dimension $Z$ whose $i$-th row is 1 and whose other rows are 0, so that the input sentence $X$ becomes the character embedding word vector sequence $E = (e_1, e_2, \ldots, e_n)$;
S4.2, hidden state sequence generation: the word vector sequence $E = (e_1, e_2, \ldots, e_n)$ generated in step S4.1 is used as the input at each time step of a bidirectional long short-term memory (BiLSTM) neural network, and the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are concatenated position by position to obtain the complete hidden state sequence $S_{BiLSTM}$;
S4.3, optimal output label sequence generation: the sequence $S_{BiLSTM}$ generated in step S4.2 is input into a conditional random field (CRF) model to obtain a transition matrix $A$, and the label sequence of sentence $X$ is denoted $Y = (y_1, y_2, \ldots, y_n)$. Because many entity types are extracted during entity recognition, an exponentiation method is adopted to improve feature discrimination when constructing the evaluation function of the label sequence $Y$ of sentence $X$:
where $S_{BiLSTM}$ is the hidden state sequence, $y_i$ is the $i$-th label, and $A$ is the transition matrix. The evaluation function is computed during model training, and the label sequence at which it takes its maximum value is the optimal output label sequence.
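Finding the label sequence that maximizes a score built from per-position scores and a transition matrix is commonly done with Viterbi decoding. The sketch below assumes the standard additive linear-chain CRF score (emission scores plus transition scores); it is not the patent's evaluation function, whose formula is not reproduced in this text. If the evaluation function is a monotone, for example exponential, function of such a score, the maximizing sequence would be the same, but that too is an assumption. The names `viterbi`, `emissions` and `transitions` and all numbers are illustrative.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label index sequence and its score.

    emissions[i][k]   : score of label k at position i (e.g. from the BiLSTM)
    transitions[j][k] : score of moving from label j to label k (matrix A)
    """
    n, num_tags = len(emissions), len(transitions)
    score = list(emissions[0])  # best score of any path ending in each label
    back = []                   # back-pointers for positions 1..n-1
    for i in range(1, n):
        new_score, pointers = [], []
        for k in range(num_tags):
            # Best previous label j for arriving at label k.
            best_j = max(range(num_tags),
                         key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_j)
            new_score.append(score[best_j] + transitions[best_j][k]
                             + emissions[i][k])
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    last = max(range(num_tags), key=lambda k: score[k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Toy scores for 3 positions and 2 labels.
emissions = [[2.0, 0.0], [0.0, 1.0], [0.0, 1.5]]
transitions = [[0.0, 0.5], [-1.0, 0.0]]
best_path, best_score = viterbi(emissions, transitions)
```

Dynamic programming keeps the cost at O(n·k²) for n positions and k labels, instead of enumerating all kⁿ label sequences.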
S5, entity information extraction: applying the trained entity information extraction model to extract military scenario entity information from the text data to be predicted, specifically comprising:
S5.1, text preprocessing: preprocessing the input military scenario text data;
S5.2, vectorized representation: based on the military scenario dictionary and the word vector matrix established in step S2, producing a vectorized representation of the sentence to be extracted, $S1$, and inputting it into the trained model;
S5.3, entity information acquisition: applying the entity information extraction model to the input sentence vectors to generate the sequence $S1_{BiLSTM}$, which is input into the conditional random field (CRF) model to obtain a transition matrix $A1$; the label sequence of sentence $S1$ is denoted $Y1 = (y_1, y_2, \ldots, y_n)$, and the evaluation function of the label sequence $Y1$ of the sentence $S1$ to be extracted is:
where $S1_{BiLSTM}$ is the hidden state sequence, $y_i$ is the $i$-th label, and $A1$ is the transition matrix. The evaluation function is computed, the label sequence at which it takes its maximum value is the optimal output label sequence, and the entity information of $S1$ is extracted from it.
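Once the optimal label sequence is available, the entities are read off by grouping consecutive characters. The sketch below assumes BIO labels of the form `B-TYPE` / `I-TYPE` / `O`; the function name `extract_entities` and the type name `UNIT` are hypothetical.

```python
def extract_entities(chars, tags):
    """Collect (entity_text, entity_type) pairs from per-character BIO labels."""
    entities = []
    current, current_type = [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity still open.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(ch)  # continue the open entity
        else:
            # "O", or an I- label with no matching open entity: close and reset.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities


found = extract_entities(list("坦克营进攻"),
                         ["B-UNIT", "I-UNIT", "I-UNIT", "O", "O"])
```

In this sketch an I- label that does not continue an open entity of the same type is discarded rather than repaired, which is one of several reasonable conventions.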
By adopting the above method for extracting entity information from military scenario documents in combination with a dictionary, the invention has the following advantages:
1. problems in military scenario document entity information extraction such as the inadequacy of manually constructed features and the strong dependency on word segmentation are effectively addressed;
2. the workload of military scenario data acquisition is greatly reduced;
3. the efficiency of military scenario document entity information extraction is improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of an embodiment of the method for extracting entity information from military scenario documents in combination with a dictionary;
FIG. 2 is a block diagram of the composition structure of the present invention.
Detailed Description
Embodiments of the present invention are further described below with reference to the drawings. The description of the embodiments is provided to aid understanding of the present invention, and the present invention is not limited to it. All other embodiments that a person skilled in the art can derive from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to fig. 1, a flow chart of the method for extracting entity information from military scenario documents in combination with a dictionary is shown. The method comprises the following steps:
S1, data preprocessing: preprocessing military scenario document data and establishing a military scenario entity dictionary, specifically comprising:
S1.1, establishing a corpus set: preprocessing the military scenario data, removing meaningless symbols, and splitting the text into sentences at Chinese sentence-ending symbols such as "。" and "！" to establish the corpus set;
S1.2, establishing a military scenario entity dictionary: according to the fields the military scenario involves, selecting publicly published and widely recognized dictionaries in those fields, such as the Chinese Military Encyclopedia, the Military Dictionary and the Concise Military Dictionary, collecting proper nouns from these authoritative dictionaries, establishing the military scenario entity dictionary organized into a military branch word bank, a weapon equipment word bank, a facility word bank and an action word bank, and analyzing and labeling the semantic structure of the entities.
S2, word vector generation: constructing a military scenario dictionary and a word vector matrix according to the military scenario corpus set and the military scenario entity dictionary, specifically comprising:
S2.1, character statistics: counting all characters appearing in the military scenario entity dictionary and the authoritative dictionaries in the field, assigning a numeric index to each character to obtain the military scenario dictionary, and recording the total number of characters Z of the corpus dictionary;
S2.2, generating the military scenario word vector matrix: training on the corpus dictionary with an open-source word vector matrix generation tool such as word2vec or GloVe to obtain a multidimensional military scenario word vector matrix.
S3, corpus labeling: determining complete military scenario entity type definition rules by combining the authoritative dictionaries and the corpus, selecting corpora for labeling, and establishing a training corpus set and a test corpus set in preparation for model training, specifically comprising:
S3.1, determining military scenario entity types and semantic description rules: analyzing the corpus content with reference to the authoritative dictionaries, consulting several experts in the field, and determining 14 military scenario entity types, with their semantic description rules, in three major categories (entity names, time expressions and numeric expressions), as shown in the following table:
S3.2, generating word labels: using a method of manually labeling word attributes and automatically generating character labels, assigning a uniform label to each word in the preprocessing result data of step S1, sentence by sentence;
S3.3, generating character labels: generating character labels for the labeled text with open-source toolkits such as YEDDA and brat and labeling schemes such as BIO and BIEOS. In the BIO scheme, a B label denotes the first character of an entity, an I label denotes a character inside an entity, and an O label denotes a non-entity character; in the BMEWO scheme, a B label denotes a beginning character, an I label denotes a middle character, an E label denotes an end character, a W label denotes a single-character entity, and an O label denotes a non-entity character.
S4, model training: establishing an entity information extraction model based on the military scenario dictionary and the word vector matrix, and training the parameters of the entity information extraction model, specifically comprising:
S4.1, text sequence segmentation: segmenting the input text sequence into sentences, where a sentence containing n characters is expressed as $X = (x_1, x_2, \ldots, x_n)$; based on the military scenario dictionary and the word vector matrix established in step S2, each character $x_i$ of $X$ is converted into a word vector $e_i$ in the word vector matrix $V_w \in D^{w \times Z}$ of dimension $w$:
$$e_i = V_w \times z_i \tag{1}$$
where $z_i$ is a vector of dimension $Z$ whose $i$-th row is 1 and whose other rows are 0, so that the input sentence $X$ becomes the character embedding word vector sequence $E = (e_1, e_2, \ldots, e_n)$;
S4.2, hidden state sequence generation: the word vector sequence $E = (e_1, e_2, \ldots, e_n)$ generated in step S4.1 is used as the input at each time step of a bidirectional long short-term memory (BiLSTM) neural network, and the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are concatenated position by position to obtain the complete hidden state sequence $S_{BiLSTM}$;
S4.3, optimal output label sequence generation: the sequence $S_{BiLSTM}$ generated in step S4.2 is input into a conditional random field (CRF) model to obtain a transition matrix $A$, and the label sequence of sentence $X$ is denoted $Y = (y_1, y_2, \ldots, y_n)$. Because many entity types are extracted during entity recognition, an exponentiation method is adopted to improve feature discrimination when constructing the evaluation function of the label sequence $Y$ of sentence $X$:
where $S_{BiLSTM}$ is the hidden state sequence, $y_i$ is the $i$-th label, and $A$ is the transition matrix. The evaluation function is computed during model training, and the label sequence at which it takes its maximum value is the optimal output label sequence.
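Two mechanical pieces of step S4 can be shown without a neural network library: the one-hot lookup of equation (1) and the position-by-position concatenation of step S4.2. Everything below is a sketch under assumptions: the helper names, the 2 x 3 matrix and the toy hidden states are invented for illustration, the LSTM computation itself is not shown, and the backward states are assumed to be already aligned to the original character positions.

```python
def one_hot(index, size):
    """Vector of length `size` with 1 at `index` and 0 elsewhere."""
    return [1.0 if row == index else 0.0 for row in range(size)]


def matvec(matrix, vector):
    """Multiply a w x Z matrix (given as a list of rows) by a length-Z vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Hypothetical word vector matrix: w = 2 dimensions, Z = 3 characters.
V = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]

# Equation (1): multiplying by a one-hot vector selects one column of V.
e = matvec(V, one_hot(1, 3))


def concat_by_position(forward_states, backward_states):
    """Join forward and backward hidden states position by position."""
    return [f + b for f, b in zip(forward_states, backward_states)]


# Toy hidden states for a 2-character sentence, one value per direction.
s_bilstm = concat_by_position([[1.0], [2.0]], [[9.0], [8.0]])
```

In practice the product with a one-hot vector is implemented as a direct column (or row) lookup, which gives the same result without the multiplication.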
S5, entity information extraction: applying the trained entity information extraction model to extract military scenario entity information from the text data to be predicted, specifically comprising:
S5.1, text preprocessing: preprocessing the input military scenario text data;
S5.2, vectorized representation: based on the military scenario dictionary and the word vector matrix established in step S2, producing a vectorized representation of the sentence to be extracted, $S1$, and inputting it into the trained model;
S5.3, entity information acquisition: applying the entity information extraction model to the input sentence vectors to generate the sequence $S1_{BiLSTM}$, which is input into the conditional random field (CRF) model to obtain a transition matrix $A1$; the label sequence of sentence $S1$ is denoted $Y1 = (y_1, y_2, \ldots, y_n)$, and the evaluation function of the label sequence $Y1$ of the sentence $S1$ to be extracted is:
where $S1_{BiLSTM}$ is the hidden state sequence, $y_i$ is the $i$-th label, and $A1$ is the transition matrix. The evaluation function is computed, the label sequence at which it takes its maximum value is the optimal output label sequence, and the entity information of $S1$ is extracted from it.
Referring to fig. 2, a block diagram of an embodiment of the device for extracting entity information from military scenario documents in combination with a dictionary according to the present invention is shown. The device comprises the following components:
the data preprocessing module 100, configured to preprocess military scenario document data and establish a military scenario entity dictionary, specifically comprising:
a corpus set establishing unit 101, which preprocesses the military scenario data, removes meaningless symbols, splits the text into sentences at Chinese sentence-ending symbols, and establishes the corpus set;
a military scenario entity dictionary establishing unit 102, which selects authoritative dictionaries in the field according to the fields the military scenario involves, collects proper nouns from the authoritative dictionaries, establishes the military scenario entity dictionary organized into a military branch word bank, a weapon equipment word bank, a facility word bank and an action word bank, and analyzes and labels the semantic structure of the entities.
The word vector generation module 200, which constructs a military scenario dictionary and a word vector matrix according to the military scenario corpus set and the military scenario entity dictionary, specifically comprising:
a character statistics unit 201, which counts all characters appearing in the military scenario entity dictionary and the authoritative dictionaries in the field, assigns a numeric index to each character to obtain the military scenario dictionary, and records the total number of characters of the corpus dictionary;
a military scenario word vector matrix generation unit 202, which trains on the corpus dictionary with an open-source word vector matrix generation tool to obtain a military scenario word vector matrix of a given dimension.
The corpus labeling module 300, which determines complete military scenario entity type definition rules by combining the authoritative dictionaries and the corpus in the field, selects corpora for labeling, and establishes a training corpus set and a test corpus set in preparation for model training, specifically comprising:
a military scenario entity type and semantic description rule determining unit 301, which analyzes the corpus content with reference to the authoritative dictionaries, consults several experts in the field, and determines 14 military scenario entity types, with their semantic description rules, in three major categories: entity names, time expressions and numeric expressions;
a word label generating unit 302, which uses a method of manually labeling word attributes and automatically generating character labels to assign a uniform label to each word in the preprocessing result data of the data preprocessing module 100, sentence by sentence;
a character label generating unit 303, which generates character labels for the labeled text with an open-source toolkit and a specific labeling scheme.
The model training module 400, which establishes an entity information extraction model based on the military scenario dictionary and the word vector matrix and trains the parameters of the entity information extraction model, specifically comprising:
a text sequence segmentation unit 401, which segments the input text sequence with the sentence as the basic unit;
a hidden state sequence generating unit 402, which uses the word vector sequence generated in the text sequence segmentation unit 401 as the input at each time step of the bidirectional long short-term memory neural network, and concatenates the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM position by position to obtain the complete hidden state sequence;
an optimal output label sequence generating unit 403, which inputs the sequence generated in the hidden state sequence generating unit 402 into a conditional random field (CRF) model to obtain the optimal output label sequence.
The entity information extraction module 500, which applies the trained entity information extraction model to extract military scenario entity information from the text data to be predicted, specifically comprising:
a text preprocessing unit 501, which preprocesses the input military scenario text data;
a vectorized representation unit 502, which produces a vectorized representation of the sentence to be extracted based on the military scenario dictionary and the word vector matrix established in the word vector generation module 200, and inputs it into the trained model;
an entity information acquisition unit 503, which applies the entity information extraction model to the input sentence vectors, generates the sequence, inputs it into the conditional random field (CRF) model, and extracts the entity information.

Claims (7)

1. A military scenario paperwork entity information extraction method combined with a dictionary, the method is characterized by comprising the following steps:
s1, preprocessing data: preprocessing military tape-out document data, establishing a military tape-out entity dictionary, and specifically comprising the following steps:
s1.1, establishing a corpus set: preprocessing military thought data, removing meaningless symbols, carrying out sentence segmentation according to Chinese sentence-breaking symbols, and establishing a corpus;
s1.2, establishing a military thought entity dictionary: according to the military idea related field, selecting an authority dictionary in the field, collecting proper nouns from the authority dictionary in the field, establishing a military idea entity dictionary according to the types of a military weapon species word bank, a weapon equipment word bank, a facility word bank and an action word bank, and analyzing and labeling the semantic structure of the entity;
s2, generating a word vector: according to the military scenario corpus set and the military scenario entity dictionary established in the step S1.2, a military scenario dictionary and a word vector matrix are established, and the method specifically comprises the following steps:
s2.1, character statistics: counting all characters appearing in a military thought entity dictionary and a domain authority dictionary, establishing a digital index for each character to obtain a military thought dictionary, and recording the total word number of a corpus dictionary;
s2.2, generating a vector matrix of the military imagination word: generating an open source tool training corpus dictionary by using the word vector matrix to obtain a multidimensional military thought word vector matrix;
S3, corpus labeling: determining the military scenario entity type definition rules by combining the domain authoritative dictionaries with the corpus, selecting corpus text for labeling, and establishing a training corpus and a test corpus in preparation for model training, specifically comprising the following steps:
S3.1, determining military scenario entity types and semantic description rules: analyzing the corpus content with reference to the domain authoritative dictionaries, and determining 14 military scenario entity types, grouped into three major categories (entity names, time expressions, and numeric expressions), together with their semantic description rules;
S3.2, generating word labels: manually labeling vocabulary attributes and automatically generating character labels, so that, sentence by sentence, each word in the preprocessed data from step S1 is given a uniform label;
S3.3, generating character labels: generating character labels for the labeled text by using a specific open source toolkit and a specific labeling system;
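As a hedged sketch of how word labels from step S3.2 could be expanded into character labels in step S3.3, the following plain-Python stand-in uses the BIO labeling system. The entity type names `UNIT` and `LOC` and the function name are hypothetical, and an actual embodiment would use an open source toolkit rather than this code.

```python
def words_to_char_labels(labeled_words):
    """Expand (word, entity_type) pairs into per-character BIO labels.

    entity_type is None for words that are not part of any entity.
    """
    chars, labels = [], []
    for word, entity_type in labeled_words:
        for position, ch in enumerate(word):
            chars.append(ch)
            if entity_type is None:
                labels.append("O")                  # outside any entity
            elif position == 0:
                labels.append("B-" + entity_type)   # first character of the entity
            else:
                labels.append("I-" + entity_type)   # later characters of the entity
    return chars, labels


# Hypothetical labeled sentence: a unit moving toward a location.
chars, labels = words_to_char_labels([("坦克营", "UNIT"), ("向", None), ("高地", "LOC")])
```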
S4, model training: establishing an entity information extraction model based on the military scenario dictionary and the word vector matrix, and training the parameters of the model, specifically comprising the following steps:
S4.1, text sequence segmentation: the input text sequence is segmented with the sentence as the basic unit, and a sentence containing n characters is expressed as $X = (x_1, x_2, \ldots, x_n)$; based on the military scenario dictionary and the word vector matrix established in step S2, each character $x_i$ of $X$ is converted into a word vector $e_i$ of dimension $w$ taken from the word vector matrix $V^w \in D^{w \times z}$:
$e_i = V^w \times z_i$    (1)
where $z_i$ is a vector of dimension $z$ whose row corresponding to character $x_i$ is 1 and whose other rows are 0, so that the input sentence $X$ becomes the character embedding word vector sequence $E = (e_1, e_2, \ldots, e_n)$;
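Equation (1) can be illustrated with a small worked example in which multiplying the word vector matrix by a one-hot vector selects one column of the matrix. The matrix values, the three-character dictionary, and the helper names are hypothetical and chosen only to keep the arithmetic visible.

```python
def one_hot(index, size):
    """Vector of length `size` with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if row == index else 0.0 for row in range(size)]


def mat_vec(matrix, vector):
    """Plain matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Hypothetical word vector matrix with w = 2 rows and z = 3 columns.
V_w = [
    [0.1, 0.3, 0.5],
    [0.2, 0.4, 0.6],
]
char_to_index = {"坦": 0, "克": 1, "营": 2}


def embed(sentence):
    """Apply e_i = V z_i to every character of the sentence."""
    z = len(char_to_index)
    return [mat_vec(V_w, one_hot(char_to_index[ch], z)) for ch in sentence]


E = embed("克营")
```

In practice the product is usually replaced by a direct column lookup, which gives the same vector without the multiplication.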
S4.2, generating the hidden state sequence: the word vector sequence $E = (e_1, e_2, \ldots, e_n)$ generated in step S4.1 is used as the input at each time step of a bidirectional long short-term memory (BiLSTM) neural network, and the hidden state sequence output by the forward LSTM network and the hidden state sequence output by the backward LSTM network are concatenated position by position to obtain the complete hidden state sequence $S_{BiLSTM}$;
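The position-wise concatenation in step S4.2 can be written out as follows. The symbols $\overrightarrow{h_i}$, $\overleftarrow{h_i}$, and $h_i$ are notation assumed for this illustration and do not appear in the claim.

```latex
\begin{aligned}
\overrightarrow{h_i} &= \mathrm{LSTM}_{\mathrm{fwd}}\left(e_i,\ \overrightarrow{h_{i-1}}\right), \\
\overleftarrow{h_i}  &= \mathrm{LSTM}_{\mathrm{bwd}}\left(e_i,\ \overleftarrow{h_{i+1}}\right), \\
h_i &= \left[\overrightarrow{h_i}\,;\ \overleftarrow{h_i}\right], \qquad i = 1, \ldots, n, \\
S_{BiLSTM} &= (h_1, h_2, \ldots, h_n).
\end{aligned}
```

Each $h_i$ therefore carries context from both the characters before position $i$ and the characters after it.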
S4.3, generating the optimal output label sequence: the sequence $S_{BiLSTM}$ generated in step S4.2 is input into a conditional random field model to obtain a transition matrix $A$, and the label sequence of sentence $X$ is denoted $Y = (y_1, y_2, \ldots, y_n)$; because many entity types are extracted during entity recognition, an exponential method is used to construct the evaluation function of the label sequence $Y$ of sentence $X$, in order to improve feature discrimination:
where $S_{BiLSTM}$ is the hidden state sequence, $y_i$ is the i-th label, and $A$ is the transition matrix; the evaluation function is computed, and the optimal output label sequence is obtained when it takes its maximum value;
S5, entity information extraction: applying the trained entity information extraction model to extract military scenario entity information from the text data to be predicted, specifically comprising the following steps:
S5.1, text preprocessing: preprocessing the input military scenario text data;
S5.2, vectorized representation: vectorizing the sentence to be extracted, S1, based on the military scenario dictionary and the word vector matrix established in step S2, and inputting it into the trained model;
S5.3, acquiring entity information: the entity information extraction model computes over the input sentence vectors to generate the sequence $S1_{BiLSTM}$, which is input into the conditional random field model to obtain a transition matrix $A1$; the label sequence of sentence S1 is denoted $Y1 = (y_1, y_2, \ldots, y_n)$, and the evaluation function of the label sequence $Y1$ of the sentence S1 to be extracted is:
where $S1_{BiLSTM}$ is the hidden state sequence, $y_i$ is the i-th label, and $A1$ is the transition matrix; the evaluation function is computed, the optimal output label sequence is obtained when it takes its maximum value, and the entity information of S1 is extracted.
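As an illustrative sketch of the final part of step S5.3, entity information can be read off an output label sequence by grouping consecutive characters that share an entity label. The sketch assumes BIO-style labels; the type names `UNIT` and `LOC` and the function name `extract_entities` are hypothetical.

```python
def extract_entities(chars, labels):
    """Collect (text, entity_type) spans from per-character BIO labels."""
    entities, current, current_type = [], [], None
    for ch, label in zip(chars, labels):
        if label.startswith("B-"):
            # A new entity begins; close any entity still open.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [ch], label[2:]
        elif label.startswith("I-") and current and label[2:] == current_type:
            current.append(ch)  # continue the open entity
        else:
            # "O" or an inconsistent "I-" label closes the open entity.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities


entities = extract_entities(
    list("坦克营向高地"),
    ["B-UNIT", "I-UNIT", "I-UNIT", "O", "B-LOC", "I-LOC"],
)
```

An "I-" label that does not continue an open entity of the same type is discarded in this sketch; other policies, such as treating it as the start of a new entity, are equally possible.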
2. The dictionary-combined military scenario document entity information extraction method of claim 1, wherein the Chinese sentence-breaking symbols comprise "。" and "!".
3. The dictionary-combined military scenario document entity information extraction method of claim 1, wherein the domain authoritative dictionaries comprise the Military Encyclopedia of China, the Military Dictionary, and the Concise Military Dictionary.
4. The dictionary-combined military scenario document entity information extraction method of claim 1, wherein the word vector matrix generation open source tool comprises word2vec and GloVe.
5. The dictionary-combined military scenario document entity information extraction method of claim 1, wherein the open source toolkit comprises YEDDA and brat.
6. The dictionary-combined military scenario document entity information extraction method of claim 1, wherein the specific labeling system comprises BIO and BIEOS.
7. A military scenario document entity information extraction device combined with a dictionary, the device comprising:
the data preprocessing module 100: preprocessing military scenario document data and establishing a military scenario entity dictionary, specifically comprising:
the corpus establishing unit 101: preprocessing the military scenario data, removing meaningless symbols, segmenting the text into sentences at the Chinese sentence-breaking symbols, and establishing the corpus;
the military scenario entity dictionary establishing unit 102: selecting domain authoritative dictionaries for the fields related to military scenarios, collecting proper nouns from the authoritative dictionaries, establishing the military scenario entity dictionary organized into a service and arms lexicon, a weaponry and equipment lexicon, a facility lexicon, and an action lexicon, and analyzing and annotating the semantic structure of each entity;
the word vector generation module 200: constructing a military scenario dictionary and a word vector matrix from the military scenario corpus and the military scenario entity dictionary, specifically comprising:
the character counting unit 201: counting all characters that appear in the military scenario entity dictionary and the domain authoritative dictionaries, assigning a numeric index to each character to obtain the military scenario dictionary, and recording the total number of characters in the corpus dictionary;
the military scenario word vector matrix generation unit 202: training on the corpus dictionary with a word vector matrix generation open source tool to obtain a multidimensional military scenario word vector matrix;
the corpus labeling module 300: determining the military scenario entity type definition rules by combining the authoritative dictionaries with the corpus, selecting corpus text for labeling, and establishing a training corpus and a test corpus in preparation for model training, specifically comprising:
the military scenario entity type and semantic description rule determination unit 301: analyzing the corpus content with reference to the authoritative dictionaries, consulting the opinions of a plurality of experts in the field, and determining 14 military scenario entity types, grouped into three major categories (entity names, time expressions, and numeric expressions), together with their semantic description rules;
the word label generation unit 302: manually labeling vocabulary attributes and automatically generating character labels, so that, sentence by sentence, each word in the preprocessed data from the data preprocessing module 100 is given a uniform label;
the character label generation unit 303: generating character labels for the labeled text by using an open source toolkit and a specific labeling system;
the model training module 400: establishing an entity information extraction model based on the military scenario dictionary and the word vector matrix, and training the parameters of the model, specifically comprising:
the text sequence segmentation unit 401: segmenting the input text sequence with the sentence as the basic unit;
the hidden state sequence generation unit 402: taking the word vector sequence generated in the text sequence segmentation unit 401 as the input at each time step of a bidirectional long short-term memory (BiLSTM) neural network, and concatenating, position by position, the hidden state sequence output by the forward LSTM network and the hidden state sequence output by the backward LSTM network to obtain the complete hidden state sequence;
the optimal output label sequence generation unit 403: inputting the sequence generated in the hidden state sequence generation unit 402 into a conditional random field model to obtain the optimal output label sequence;
the entity information extraction module 500: applying the trained entity information extraction model to extract military scenario entity information from the text data to be predicted, specifically comprising:
the text preprocessing unit 501: preprocessing the input military scenario text data;
the vectorized representation unit 502: vectorizing the sentence to be extracted based on the military scenario dictionary and the word vector matrix established in the word vector generation module 200, and inputting it into the trained model;
the entity information acquisition unit 503: computing over the input sentence vectors with the entity information extraction model to generate a sequence, inputting the sequence into the conditional random field model, and extracting the entity information.
CN201910653281.3A 2019-07-19 2019-07-19 Method and device for extracting entity information of military design document combined with dictionary Active CN110598203B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910653281.3A CN110598203B (en) 2019-07-19 2019-07-19 Method and device for extracting entity information of military design document combined with dictionary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910653281.3A CN110598203B (en) 2019-07-19 2019-07-19 Method and device for extracting entity information of military design document combined with dictionary

Publications (2)

Publication Number Publication Date
CN110598203A true CN110598203A (en) 2019-12-20
CN110598203B CN110598203B (en) 2023-08-01

Family

ID=68853045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910653281.3A Active CN110598203B (en) 2019-07-19 2019-07-19 Method and device for extracting entity information of military design document combined with dictionary

Country Status (1)

Country Link
CN (1) CN110598203B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110874531A (en) * 2020-01-20 2020-03-10 湖南蚁坊软件股份有限公司 Topic analysis method and device and storage medium
CN111125378A (en) * 2019-12-25 2020-05-08 同方知网(北京)技术有限公司 Closed-loop entity extraction method based on automatic sample labeling
CN111309925A (en) * 2020-02-10 2020-06-19 同方知网(北京)技术有限公司 Knowledge graph construction method of military equipment
CN111324745A (en) * 2020-02-18 2020-06-23 深圳市一面网络技术有限公司 Word stock generation method and device
CN111324742A (en) * 2020-02-10 2020-06-23 同方知网(北京)技术有限公司 Construction method of digital human knowledge map
CN111444723A (en) * 2020-03-06 2020-07-24 深圳追一科技有限公司 Information extraction model training method and device, computer equipment and storage medium
CN111476034A (en) * 2020-04-07 2020-07-31 同方赛威讯信息技术有限公司 Legal document information extraction method and system based on combination of rules and models
CN111611799A (en) * 2020-05-07 2020-09-01 北京智通云联科技有限公司 Dictionary and sequence labeling model based entity attribute extraction method, system and equipment
CN112036184A (en) * 2020-08-31 2020-12-04 湖南星汉数智科技有限公司 Entity identification method, device, computer device and storage medium based on BilSTM network model and CRF model
CN112036183A (en) * 2020-08-31 2020-12-04 湖南星汉数智科技有限公司 Word segmentation method and device based on BilSTM network model and CRF model, computer device and computer storage medium
CN113051887A (en) * 2019-12-26 2021-06-29 深圳市北科瑞声科技股份有限公司 Method, system and device for extracting announcement information elements
CN113254594A (en) * 2021-06-21 2021-08-13 国能信控互联技术有限公司 Smart power plant-oriented safety knowledge graph construction method and system
CN113326700A (en) * 2021-02-26 2021-08-31 西安理工大学 ALBert-based complex heavy equipment entity extraction method
CN113806481A (en) * 2021-09-17 2021-12-17 中国人民解放军国防科技大学 Operation event extraction method oriented to encyclopedic data
CN115906844A (en) * 2022-11-02 2023-04-04 中国兵器工业计算机应用技术研究所 Information extraction method and system based on rule template

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102789449A (en) * 2011-05-20 2012-11-21 日电(中国)有限公司 Method and device for evaluating comment text
US20130179169A1 (en) * 2012-01-11 2013-07-11 National Taiwan Normal University Chinese text readability assessing system and method
CN105138724A (en) * 2015-07-17 2015-12-09 中国电子科技集团公司电子科学研究院 Universal extendable open simulation scenario editing method and apparatus
CN107122416A (en) * 2017-03-31 2017-09-01 北京大学 A kind of Chinese event abstracting method
CN107239446A (en) * 2017-05-27 2017-10-10 中国矿业大学 A kind of intelligence relationship extracting method based on neutral net Yu notice mechanism
US20180121799A1 (en) * 2016-11-03 2018-05-03 Salesforce.Com, Inc. Training a Joint Many-Task Neural Network Model using Successive Regularization
CN108628824A (en) * 2018-04-08 2018-10-09 上海熙业信息科技有限公司 A kind of entity recognition method based on Chinese electronic health record
US20180300400A1 (en) * 2017-04-14 2018-10-18 Salesforce.Com, Inc. Deep Reinforced Model for Abstractive Summarization
CN108875051A (en) * 2018-06-28 2018-11-23 中译语通科技股份有限公司 Knowledge mapping method for auto constructing and system towards magnanimity non-structured text
CN109446523A (en) * 2018-10-23 2019-03-08 重庆誉存大数据科技有限公司 Entity attribute extraction model based on BiLSTM and condition random field
US20190138599A1 (en) * 2017-11-09 2019-05-09 Conduent Business Services, Llc Performing semantic analyses of user-generated text content using a lexicon
CN109858041A (en) * 2019-03-07 2019-06-07 北京百分点信息科技有限公司 A kind of name entity recognition method of semi-supervised learning combination Custom Dictionaries
CN109871538A (en) * 2019-02-18 2019-06-11 华南理工大学 A kind of Chinese electronic health record name entity recognition method
CN109948152A (en) * 2019-03-06 2019-06-28 北京工商大学 A kind of Chinese text grammer error correcting model method based on LSTM

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
宋瑞亮: "面向军事领域的命名实体识别及相关信息提取关键技术研究", 中国优秀硕士学位论文全文数据库 (信息科技辑), pages 138 - 4703 *
李健龙等: "基于双向LSTM的军事命名实体识别", 《计算机工程与科学》 *
李健龙等: "基于双向LSTM的军事命名实体识别", 《计算机工程与科学》, no. 04, 15 April 2019 (2019-04-15), pages 713 - 718 *
王学锋等: "基于深度学习的军事命名实体识别方法", 《装甲兵工程学院学报》 *
王学锋等: "基于深度学习的军事命名实体识别方法", 《装甲兵工程学院学报》, no. 04, 31 August 2018 (2018-08-31), pages 94 - 98 *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125378A (en) * 2019-12-25 2020-05-08 同方知网(北京)技术有限公司 Closed-loop entity extraction method based on automatic sample labeling
CN113051887A (en) * 2019-12-26 2021-06-29 深圳市北科瑞声科技股份有限公司 Method, system and device for extracting announcement information elements
CN110874531A (en) * 2020-01-20 2020-03-10 湖南蚁坊软件股份有限公司 Topic analysis method and device and storage medium
CN111309925A (en) * 2020-02-10 2020-06-19 同方知网(北京)技术有限公司 Knowledge graph construction method of military equipment
CN111324742A (en) * 2020-02-10 2020-06-23 同方知网(北京)技术有限公司 Construction method of digital human knowledge map
CN111324742B (en) * 2020-02-10 2024-01-23 同方知网数字出版技术股份有限公司 Method for constructing digital human knowledge graph
CN111309925B (en) * 2020-02-10 2023-06-30 同方知网数字出版技术股份有限公司 Knowledge graph construction method for military equipment
CN111324745A (en) * 2020-02-18 2020-06-23 深圳市一面网络技术有限公司 Word stock generation method and device
CN111444723A (en) * 2020-03-06 2020-07-24 深圳追一科技有限公司 Information extraction model training method and device, computer equipment and storage medium
CN111476034A (en) * 2020-04-07 2020-07-31 同方赛威讯信息技术有限公司 Legal document information extraction method and system based on combination of rules and models
CN111611799B (en) * 2020-05-07 2023-06-02 北京智通云联科技有限公司 Entity attribute extraction method, system and equipment based on dictionary and sequence labeling model
CN111611799A (en) * 2020-05-07 2020-09-01 北京智通云联科技有限公司 Dictionary and sequence labeling model based entity attribute extraction method, system and equipment
CN112036183B (en) * 2020-08-31 2024-02-02 湖南星汉数智科技有限公司 Word segmentation method, device, computer device and computer storage medium based on BiLSTM network model and CRF model
CN112036184A (en) * 2020-08-31 2020-12-04 湖南星汉数智科技有限公司 Entity identification method, device, computer device and storage medium based on BilSTM network model and CRF model
CN112036183A (en) * 2020-08-31 2020-12-04 湖南星汉数智科技有限公司 Word segmentation method and device based on BilSTM network model and CRF model, computer device and computer storage medium
CN113326700A (en) * 2021-02-26 2021-08-31 西安理工大学 ALBert-based complex heavy equipment entity extraction method
CN113326700B (en) * 2021-02-26 2024-05-14 西安理工大学 ALBert-based complex heavy equipment entity extraction method
CN113254594A (en) * 2021-06-21 2021-08-13 国能信控互联技术有限公司 Smart power plant-oriented safety knowledge graph construction method and system
CN113254594B (en) * 2021-06-21 2022-01-14 国能信控互联技术有限公司 Smart power plant-oriented safety knowledge graph construction method and system
CN113806481A (en) * 2021-09-17 2021-12-17 中国人民解放军国防科技大学 Operation event extraction method oriented to encyclopedic data
CN115906844A (en) * 2022-11-02 2023-04-04 中国兵器工业计算机应用技术研究所 Information extraction method and system based on rule template
CN115906844B (en) * 2022-11-02 2023-08-29 中国兵器工业计算机应用技术研究所 Rule template-based information extraction method and system

Also Published As

Publication number Publication date
CN110598203B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
CN110598203A (en) Military imagination document entity information extraction method and device combined with dictionary
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN111222305B (en) Information structuring method and device
CN110287480B (en) Named entity identification method, device, storage medium and terminal equipment
CN108959242B (en) Target entity identification method and device based on part-of-speech characteristics of Chinese characters
CN109543181B (en) Named entity model and system based on combination of active learning and deep learning
CN108595708A (en) A kind of exception information file classification method of knowledge based collection of illustrative plates
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN104881458B (en) A kind of mask method and device of Web page subject
CN112883193A (en) Training method, device and equipment of text classification model and readable medium
CN115858758A (en) Intelligent customer service knowledge graph system with multiple unstructured data identification
CN111143571B (en) Entity labeling model training method, entity labeling method and device
CN109062904B (en) Logic predicate extraction method and device
CN110348012B (en) Method, device, storage medium and electronic device for determining target character
CN110321549B (en) New concept mining method based on sequential learning, relation mining and time sequence analysis
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
CN113360582B (en) Relation classification method and system based on BERT model fusion multi-entity information
CN114881043B (en) Deep learning model-based legal document semantic similarity evaluation method and system
CN109190099B (en) Sentence pattern extraction method and device
Stewart et al. Seq2kg: an end-to-end neural model for domain agnostic knowledge graph (not text graph) construction from text
CN106610937A (en) Information theory-based Chinese automatic word segmentation method
CN113268576A (en) Deep learning-based department semantic information extraction method and device
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
CN110110326B (en) Text cutting method based on subject information
CN111444720A (en) Named entity recognition method for English text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant