CN113221569A - Method for extracting text information of damage test - Google Patents

Method for extracting text information of damage test Download PDF

Info

Publication number
CN113221569A
CN113221569A (application CN202110584191.0A)
Authority
CN
China
Prior art keywords
text
layer
damage test
sequence
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110584191.0A
Other languages
Chinese (zh)
Inventor
张磊
汪剑辉
王继民
孔德锋
王幸
王晓峰
杨江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Engineering Protection National Defense Engineering Research Institute Academy of Military Sciences of PLA
Original Assignee
Institute of Engineering Protection National Defense Engineering Research Institute Academy of Military Sciences of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Engineering Protection National Defense Engineering Research Institute Academy of Military Sciences of PLA filed Critical Institute of Engineering Protection National Defense Engineering Research Institute Academy of Military Sciences of PLA
Priority to CN202110584191.0A
Publication of CN113221569A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for extracting text information from damage tests, comprising the following steps: constructing a damage test text annotation data set in which every consecutive numeric string is replaced by a specific character mark, and dividing the data set into a training set and a test set for training an information extraction model; constructing an entity extraction model that adopts a neural network based on a numerical mask mechanism, and training the entity extraction model with the training set; and extracting text information from the test set with the trained entity extraction model. In this method, a neural network algorithm extracts information from damage test texts, and several traditional models are improved by exploiting the characteristic that damage test texts contain a large amount of numerical information, thereby improving the models' extraction performance.

Description

Method for extracting text information of damage test
Technical Field
The invention relates to the technical field of information, in particular to a method for extracting text information of a damage test.
Background
Tests are the most direct and reliable means of studying damage effects and damage assessment, but most test research results are published as unstructured literature. How to collect and extract the test data they contain and process it into structured form is an important and urgent problem for building large damage data sets.
Information extraction is the task of taking natural language text as input and generating structured information that meets some specification. Its subtasks, such as named entity recognition, relation extraction, coreference resolution, named entity linking and knowledge base reasoning, underpin many downstream natural language processing (NLP) tasks, such as machine translation, question answering, natural language understanding, text summarization, and intelligent assistants such as Siri and Cortana. Entity extraction delimits the boundaries of an entity and determines its type; since entities are among the smallest units that natural language processing must understand, entity extraction is the most basic NLP task. Research methods developed to date fall mainly into rule-based, statistical-learning-based, and deep-learning-based approaches.
Rule-based approaches do not require many training samples, but their extraction recall is low. Research has therefore focused on statistical learning and deep learning, and with the rapid development of deep learning models most existing entity extraction models are deep learning models, although they are computationally expensive and their accuracy can be unsatisfactory. With sufficient data, the feature-extraction capability of simple statistical learning methods falls far short of deep learning methods; yet even a deep learning model that identifies entities by straightforward classification struggles with a peculiarity of the damage field: numeric characters make up a high percentage of the text, which degrades the overall extraction performance.
Disclosure of Invention
The invention aims to provide a method for extracting damage test text information that solves the problems in the prior art: it uses a neural network algorithm to extract damage test text information and exploits the characteristic that damage test texts contain a large amount of numerical information to improve several traditional models, thereby improving their extraction performance.
In order to achieve the purpose, the invention provides the following scheme: the invention provides a damage test text information extraction method, which comprises the following steps:
constructing a damage test text labeling data set, and dividing the damage test text labeling data set into a training set and a testing set, wherein the damage test text is used for recording damage effects and damage evaluation research data, and the damage test text contains a large amount of numerical information;
constructing an entity extraction model by adopting a neural network based on a numerical mask mechanism, and training the entity extraction model by utilizing the training set, wherein the entity extraction model is used for extracting entities in the damage test text, and the entities are natural language processing units;
and extracting the text information of the test set by using the trained entity extraction model.
Preferably, the entity extraction model comprises an input layer, a masking layer, an embedding layer, an encoding layer, a CRF output layer and an entity extraction layer;
the input layer is used for segmenting the texts in the training set and outputting a text sequence; the mask layer is used for replacing numerical value information in the text sequence to obtain a mask layer output sequence;
the embedded layer is used for vectorizing the output sequence of the mask layer to obtain a word vector matrix;
the coding layer is used for coding the word vector matrix and outputting a label sequence;
the CRF output layer is used for establishing a dependency relationship among the labels in the label sequence;
the entity extraction layer is used for extracting the entities in the label sequence.
Preferably, the constructing of the damage test text annotation data set comprises the following steps:
collecting a plurality of damage test files, and converting the damage test files into texts;
filtering rare characters out of the text with a regular expression, retaining the Chinese characters, numerical values and English characters in the text, segmenting the filtered text into sentence units, and replacing all numeric strings in the filtered text with a specific character mark to obtain the segmented text;
and determining the entity type of the damage experiment sample, and labeling the segmented text according to the entity type to obtain the damage experiment text labeling data set.
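The filtering and number-replacement steps above can be sketched with regular expressions. The patent does not give the exact patterns, and the mask token and keep-set below are illustrative assumptions (the "<Num>" token follows the special character mark used later in the description):

```python
import re

NUM_MASK = "<Num>"  # assumed spelling of the specific character mark

def filter_text(text: str) -> str:
    # Keep CJK ideographs, ASCII letters/digits and basic punctuation; drop rare characters.
    return re.sub(r"[^\u4e00-\u9fffA-Za-z0-9.,;:()%\s]", "", text)

def mask_numbers(text: str) -> str:
    # Replace every consecutive numeric string (optionally with a decimal point).
    return re.sub(r"\d+(?:\.\d+)?", NUM_MASK, text)

sample = "装药量41.82kg, 侵彻深度1.2m"
print(mask_numbers(filter_text(sample)))  # 装药量<Num>kg, 侵彻深度<Num>m
```

Each run of digits becomes a single token, so downstream sequence models see one symbol per number rather than one symbol per digit.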
Preferably, the segmenting, by the input layer, the text in the training set to output the text sequence includes:
inputting the training set, segmenting the training set by taking characters as units to obtain a text sequence, and uniformly segmenting numerical information by adopting a regular expression, wherein the numerical information consists of continuous numerical characters in the text sequence to obtain a segmented sequence.
Preferably, the replacing, by the mask layer, the numerical information in the text sequence to obtain a mask layer output sequence includes:
and inputting the cut sequence, and replacing the numerical information by adopting a specific character mark to obtain the output sequence of the mask layer.
Preferably, the embedding layer is constructed by adopting static word vector embedding or dynamic word vector embedding.
Preferably, the coding layer is constructed using BILSTM or BIGRU.
Preferably, the entity extraction model is trained with an Adam optimizer, which is configured to minimize the loss function.
Preferably, the entity types include, but are not limited to, projectile shape, projectile type, and projectile material type.
The invention discloses the following technical effects:
according to the method for extracting the damage test sample information, the neural network algorithm is used for extracting the information of the damage test text, and the characteristic that a large amount of numerical information exists in the damage test text is utilized to improve various traditional models, so that the extraction effect of the models is improved. The technology utilizes the characteristic that the damage test text has more numerical information, and adds a numerical mask mechanism to a common entity extraction model based on deep learning, so that the entity extraction model can greatly improve the training efficiency on the current damage test text compared with a model without the numerical mask mechanism, and the entity extraction model can be trained by using less epoch numbers to obtain an entity extraction model with higher F1 on a test set, and the F1 value of the stored optimal model on the test set is higher than that of the model without the numerical mask mechanism.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a method for extracting information based on a damage test sample according to the present invention;
FIG. 2 is a schematic diagram of an entity extraction model structure based on a numerical mask mechanism according to the present invention;
FIG. 3 is a schematic diagram of a flow of manual labeling of a data set according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a numerical value masking step in an embodiment of the present invention, where fig. (a) is a schematic diagram of an original text for DET labeling, fig. (b) is a schematic diagram of a numerical value as a character, and fig. (c) is a schematic diagram of a numerical value information masking operation;
FIG. 5 is a graph of the test results of BILSTM-CRF and BILSTM-MASK-CRF according to the present invention;
FIG. 6 is a graph of the test results for the prediction performance of BIGRU-CRF and BIGRU-MASK-CRF according to the present invention;
FIG. 7 is a graph of the test results of BERT-BILSTM-CRF and BERT-BILSTM-MASK-CRF according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The invention provides a method for extracting damage test text information, which comprises the following steps with reference to fig. 1:
Step 1: collect a number of damage test documents and convert them into text; filter rare characters out of the text with a regular expression, retaining only Chinese characters, numerical values and English characters; segment the text; construct a text data set according to the entity information to be retained; annotate the entity information; and divide the annotated data set into a training set and a test set.
S11: in this embodiment, 198 damage test papers were collected; damage test texts were extracted with a text tool, the extracted texts were filtered with a regular expression, and the filtered texts were then segmented into sentence units. The entity types to be extracted were determined according to expert opinion and the entity information retained, finally yielding a data set of 865 long texts. The data set forms a corpus, and the entity types retained in the corpus are summarized in Table 1:
table 1 damage test text test entity type table
(Table 1 appears only as an image in the original document.)
S12: the data set obtained in S11 is manually annotated according to the method shown in fig. 3, using double annotation with single review: two researchers A and B who have mastered the relevant domain knowledge annotate the data independently, and a third researcher C reviews the annotations. If a review fails, the current data is re-annotated by the other researcher; if the review still fails, the current data is set aside and the three researchers discuss and annotate it together. In total, 2224 annotated entities were obtained. The annotated corpus data set is referred to herein as the "damage experiment text", abbreviated DET, and is stored in IOB label format, shown in fig. 4(a). The annotated entities were analyzed, and each entity type's IOB label and its count in the corpus are given in Table 1.
S13: the annotated data set was randomly divided into a training set and a test set at a 7:3 ratio.
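The 7:3 division can be sketched as follows; the patent specifies only a random 7:3 split, so the seed and shuffling routine are assumptions:

```python
import random

def split_dataset(samples, train_ratio=0.7, seed=42):
    """Randomly divide annotated samples into a training set and a test set."""
    shuffled = list(samples)              # copy so the caller's collection is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# With the 865 long texts of this embodiment the split is 605 training / 260 test.
train, test = split_dataset(range(865))
print(len(train), len(test))  # 605 260
```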
Step 2: construct an entity extraction model with Keras (a high-level neural network API), train the entity extraction model, and use it to extract entity information from damage test texts.
As shown in fig. 2, the entity extraction model in the present invention includes an input layer, a mask layer, an embedding layer, an encoding layer, a CRF output layer, and an entity extraction layer.
S201, processing a text to be input by using an input layer;
the input layer is a "chinese text", in this embodiment, the input layer is a certain paragraph D in the training set, and the text is segmented by using characters as segmentation units to obtain a text sequence D ═ X1,X2,…,XSIn which X isi(i ═ 1,2, … … S) denotes characters constituting a text sequence. When the text sequence D is divided into sequences by taking characters as units, coherent numerical characters are uniformly divided into numerical information by using a regular expression, the numerical characters represent characters expressing numerical meanings and not only include Arabic numerals but also Chinese number case, and the character sequence after being divided is F ═ F1,f2,…,fTIn which fj(j ═ 1,2, … … T) represents sequence constituent elements, and T ≦ S.
S202, establishing a mask layer, processing an output text sequence of an input layer by using the mask layer through a mask mechanism, and obtaining an output result;
the input of the mask layer is the output F of the input layer, and all the numerical information in F is identified by special characters<Num>Instead, this step is called a "masked number" and the output is M ═ M1,m2,…,mx},mk(k ═ 1,2, … … x) represents a constituent element of M. Due to the special characteristics of the damaged text, a large number of numerical characters exist in sentences and paragraphs, and the continuous numerical characters occupy a relatively large length of the sequence, so that the difficulty of learning sequence labeling of the conventional neural networks BILSTM and BIGRU is increased. The operation of the 'mask numerical value' greatly shortens the whole sequence of the text and divides the boundary of the 'numerical information' in advance on the premise of not influencing the semantic meaning of the text, so that the network can not only correspondingly reduce the error rate of judging the numerical boundary, but also determine which entity type the current 'numerical information' entity belongs to with less training steps and training time. This "value masking" operation may therefore result in the neural network encoder learning better the text semantic rules.
Fig. 4 illustrates the numerical masking operation: the five characters "4", "1", ".", "8" and "2" are first merged into the single token "41.82", which is then masked with <Num>. This shortens the sentence sequence from length 21 to length 17 and delimits the boundary between the number and the surrounding characters more clearly; in effect, the encoder is told which character positions are numerical information before the sequence enters the neural encoder.
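A minimal sketch of this segmentation-plus-masking step (the sentence below is a hypothetical stand-in, since the original example text is shown only as an image):

```python
import re

NUM_RE = r"\d+(?:\.\d+)?"  # assumed pattern for a consecutive numeric string

def segment(text: str):
    """Character-level segmentation that keeps each numeric string as one token."""
    tokens = []
    for piece in re.split(f"({NUM_RE})", text):
        if re.fullmatch(NUM_RE, piece):
            tokens.append(piece)      # the whole number, e.g. "41.82", is one token
        else:
            tokens.extend(piece)      # everything else, character by character
    return tokens

def mask(tokens):
    return ["<Num>" if re.fullmatch(NUM_RE, t) else t for t in tokens]

raw = "弹体质量为41.82千克"
print(segment(raw))        # ['弹', '体', '质', '量', '为', '41.82', '千', '克']
print(mask(segment(raw)))  # ['弹', '体', '质', '量', '为', '<Num>', '千', '克']
```

The masked sequence is shorter than the raw character sequence by one token per digit beyond the first in each number, which is exactly the shortening effect the figure describes.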
S203, establishing an embedding layer, and mapping output data of the masking layer into a matrix formed by vectors by using the embedding layer;
the Embedding layer is also called an Embedding layer and mainly used for mapping the output M of the masking layer into a matrix Q formed by dense vectors.
Static word vector embedding: let Vec be a word-vector file trained with a Word2Vec model, in which each character has a dense vector of length L. For the mask-layer output M = {m_1, m_2, …, m_x}, each element m_k is looked up against the dictionary in the word-vector file Vec, converting M into a word-vector matrix Q = {q_1, q_2, …, q_N}, where each q has length L.
Dynamic word vector embedding: BERT-WWM (a Chinese pre-training model based on whole-word masking) is adopted as the pre-trained model of this embodiment because it is trained for Chinese characteristics. BERT-WWM cannot contain the <Num> special character defined here, so when the masked sequence is input to the BERT-WWM model to obtain word vectors, <Num> would be mapped to the BERT vocabulary token [UNK], which means an unknown word, turning <Num> into a meaningless unknown token. To let the model know that a number occupies this position, an arbitrary digit present in the BERT vocabulary is used to replace <Num>; here the digit "9" from the vocabulary is used as the replacement character. The input is still the mask-layer output M, with every <Num> replaced by "9"; it is then segmented according to BERT-WWM's segmentation rules and fed into the pre-trained BERT-WWM, which outputs a matrix of character vectors Q = {q_1, q_2, …, q_N}, where each q_v (v = 1, 2, …, N) has the length of the pre-trained model's word vectors.
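The substitution itself is a one-line transformation; a sketch (the token list is hypothetical, "9" is the stand-in digit named in the embodiment):

```python
def prepare_for_bert(tokens, stand_in="9"):
    """Replace the <Num> mask with a digit that exists in the BERT vocabulary,
    so the pretrained model sees a number instead of [UNK]."""
    return [stand_in if t == "<Num>" else t for t in tokens]

masked = ["弹", "径", "<Num>", "毫", "米"]
print(prepare_for_bert(masked))  # ['弹', '径', '9', '毫', '米']
```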
S204, establishing a coding layer, and using the coding layer to encode the embedding layer's output matrix into text sequence labels:
the LSTM and the GRU belong to improved RNN, so that the long dependence problem is solved, and the LSTM and the GRU have no obvious advantages or disadvantages in the actual use process, and the specific situation needs to be determined according to the use effect. Therefore, the two coding modes considered in the coding layer of this embodiment are BILSTM and BIGRU, and the comparison of the specific test effects is performed before the selection.
The BILSTM coding layer is shown in equations (1)-(3):

  h_t^→ = LSTM(q_t, h_{t-1}^→)    (1)
  h_t^← = LSTM(q_t, h_{t+1}^←)    (2)
  k_t = [h_t^→ ; h_t^←],  k_t ∈ R^{2*l}    (3)

where h_t^→ denotes the LSTM neural network reading the sequence from left to right and h_t^← the LSTM neural network reading it from right to left; the hidden-layer outputs of the two directions of the bidirectional LSTM are spliced into the output matrix. T is the length of the character sequence output by the mask layer (equal to the length of the character sequence output by the input layer), l is the dimension of the LSTM hidden layer, t indexes the t-th character, and m is the number of hidden layers.
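The splice in equation (3), concatenating the forward and backward hidden states so each position carries a 2*l-dimensional vector, can be illustrated with shapes alone; random arrays stand in here for real LSTM states, and the sizes are the example values from the masking step:

```python
import numpy as np

T, l = 17, 128                   # sequence length after masking, hidden dimension
h_fwd = np.random.randn(T, l)    # left-to-right LSTM hidden states (stand-in values)
h_bwd = np.random.randn(T, l)    # right-to-left LSTM hidden states (stand-in values)

# k_t = [h_t_fwd ; h_t_bwd]: concatenate along the feature axis, so k_t lies in R^(2*l).
K = np.concatenate([h_fwd, h_bwd], axis=-1)
print(K.shape)  # (17, 256)
```

In Keras this splice is what `Bidirectional(LSTM(l, return_sequences=True), merge_mode="concat")` produces.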
The BIGRU coding layer is shown in equations (4)-(6):

  h_t^→ = GRU(q_t, h_{t-1}^→)    (4)
  h_t^← = GRU(q_t, h_{t+1}^←)    (5)
  k_t = [h_t^→ ; h_t^←],  k_t ∈ R^{2*l}    (6)

where h_t^→ denotes the GRU neural network reading the sequence from left to right and h_t^← the GRU neural network reading it from right to left; the hidden-layer outputs of the two directions of the bidirectional GRU are spliced into the output matrix. t indexes the t-th character, l is the dimension of the GRU hidden layer, and m is the number of hidden layers.
S205, a CRF output layer is established, and a dependency relationship between text sequence labels is established by utilizing the CRF output layer;
since the sequence annotation is classified by nature, if the softmax full-link layer is directly connected to the last layer for classification, but this situation considers that the type of each label is predicted to be independent, some cases of "illegal labels" may occur, for example, "I-Dweight" may occur in the middle of the predicted current character position of "I-Dlength" and "I-Dlength", which actually belongs to the case of illegal labels, because "I-Dlength" is preceded by either "I-Dlength" or "B-Dlength", and the result of the occurrence of other labels is obviously wrong, i.e., "illegal labels", so that not only the characteristics of the current position but also the types of labels predicted before the current position need to be considered when the actual sequence annotation is predicted, so that certain dependency is formed between the labels. The objective function of the CRF solves this problem because it inputs not only the current state features but also the transition features of the tag sequence, so that the dependency between tags is formed during the training of the actual neural network.
Suppose the coding-layer output is X. Let A be the transition matrix, where A_{i,j} denotes the transition probability from the i-th label to the j-th label, and let P be the emission probability matrix, where P_{i,j} is the unnormalized probability that the i-th character is mapped to the j-th label. The score of input sequence X with output label sequence y is computed by equation (7):

  score(X, y) = Σ_{i=0..n} A_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i}    (7)

Then the conditional probability P(y|X) of y is computed with the softmax function:

  P(y|X) = exp(score(X, y)) / Σ_{y'} exp(score(X, y'))    (8)

Taking the negative logarithm of the conditional probability gives the final objective loss function:

  Loss = −log P(y|X)    (9)
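A brute-force numeric sketch of the score and loss computation for a tiny example (start/end transitions are omitted and all tag paths are enumerated, which is only feasible for toy sizes; a real CRF layer computes the partition function with the forward algorithm):

```python
import numpy as np
from itertools import product

def path_score(P, A, y):
    """score(X, y): emission scores P[i, y_i] plus transition scores A[y_i, y_{i+1}]."""
    s = sum(P[i, y[i]] for i in range(len(y)))
    s += sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
    return s

def crf_neg_log_likelihood(P, A, y):
    """-log P(y|X): log of the sum over all tag paths minus the gold path's score."""
    n, k = P.shape
    scores = np.array([path_score(P, A, yp) for yp in product(range(k), repeat=n)])
    log_z = np.log(np.exp(scores).sum())   # log partition function
    return log_z - path_score(P, A, y)

P = np.array([[2.0, 0.5], [0.5, 2.0]])     # emissions: 2 characters, 2 tags
A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # transitions between tags
print(crf_neg_log_likelihood(P, A, (0, 1)))
```

Minimizing this quantity with Adam, as the embodiment does, pushes probability mass toward the gold tag path while respecting tag-to-tag transition constraints.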
In model training, an Adam optimizer is used to minimize this loss function.
S206, establishing an entity extraction layer, and performing entity extraction by using the entity extraction layer;
when a predicted label sequence is obtained, only one predicted label at the corresponding position of the character output by each mask layer is completed, and further operation is needed to complete actual entity extraction1,f2,…,fTAnd F, extracting entities from the corresponding sequence label TL according to the predicted sequence label list.
After obtaining the sequence tag TL, extracting the entity from the comparison F in the form of an algorithm 1, and the main idea is that the tag 'B-type' beginning with 'B' in the TL tag is matched, then the current position F is considered as the starting character of a 'type' entity, and through multiple rounds of judgment, the 'type' entity beginning with 'B-type' is extracted from F and stored in res, and then other types of entities are extracted according to the method.
(Algorithm 1 appears only as an image in the original document.)
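Since Algorithm 1 is reproduced only as an image, here is a hedged sketch of the IOB-decoding idea it describes: start an entity at each "B-type" label, extend it over the following "I-type" labels, and collect the spans from F (the entity-type name below is illustrative):

```python
def extract_entities(chars, tags):
    """Decode IOB tags over a character sequence into (entity_text, type) pairs."""
    entities, i = [], 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            etype, j = tags[i][2:], i + 1
            while j < len(tags) and tags[j] == f"I-{etype}":
                j += 1                     # extend over continuation tags of the same type
            entities.append(("".join(chars[i:j]), etype))
            i = j
        else:
            i += 1                         # "O" and stray "I-" tags start no entity
    return entities

chars = ["弹", "径", "<Num>", "毫", "米"]
tags = ["O", "O", "B-Dlength", "I-Dlength", "I-Dlength"]
print(extract_entities(chars, tags))  # [('<Num>毫米', 'Dlength')]
```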
The original models are named BILSTM-CRF, BIGRU-CRF and BERT-BILSTM-CRF respectively, and the models based on the numerical MASK mechanism are named BILSTM-MASK-CRF, BIGRU-MASK-CRF and BERT-BILSTM-MASK-CRF. Testing on the data sets yields the results shown in figs. 5, 6 and 7. As figs. 5 and 6 show, both baseline models gain in F1 value after the numerical masking mechanism is introduced, mainly because the masked models reach a higher F1 with fewer epochs; the gain in F1 comes chiefly from a gain in recall. Fig. 5 shows that the precision of BILSTM-MASK-CRF does not rise significantly faster than that of BILSTM-CRF during training, the two precision curves being almost identical; fig. 6 shows that the precision of BIGRU-MASK-CRF rises faster at the same epoch but eventually stabilizes, the two curves almost coinciding. Since the numerical mask layer belongs only to the data-input part of the model and does not participate in model training proper, the training time per epoch after introducing it matches the baseline's. From this analysis, BILSTM-MASK-CRF and BIGRU-MASK-CRF with the numerical masking mechanism achieve better extraction results with fewer training steps, that is, with less training time, than the baseline models.
Fig. 7 shows the per-epoch prediction performance on the test set of BERT-BILSTM-CRF and BERT-BILSTM-MASK-CRF, which introduce the pre-trained model BERT-WWM. Analysis shows that the masked model is essentially consistent with the baseline in precision but still clearly improves recall, mainly by reaching a higher recall in fewer training steps; hence, on the comprehensive evaluation metric F1, the improved model obtains a higher F1 value in fewer training steps than the baseline, that is, a better prediction result with less training time.
Table 2 shows the best F1 value obtained by each model with sufficient training. As the table shows, every model with the numerical masking mechanism improves its extraction performance over the baseline: BILSTM-MASK-CRF improves the F1 value by 4.1 points over its baseline, BIGRU-MASK-CRF by 2.1, and BERT-BILSTM-MASK-CRF by 1.9. Table 3 shows the time each model takes, from loading the trained model to producing the prediction, when testing one test corpus. The pre-trained models take much longer at prediction time, both about 49 seconds, while the models without a pre-trained component take only about 3 seconds, greatly reducing the time consumed compared with the pre-trained models.
TABLE 2 Comparison of the best prediction results of each model
Method Precision (%) Recall (%) F1 (%)
BILSTM-CRF 56.4 41.0 47.5
BILSTM-MASK-CRF 55.4 48.2 51.6
BIGRU-CRF 58.6 45.0 50.9
BIGRU-MASK-CRF 58.1 48.7 53.0
BERT-BILSTM-CRF 77.2 79.5 78.3
BERT-BILSTM-MASK-CRF 82.3 78.1 80.2
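The F1 values in Table 2 are the harmonic mean of the precision and recall columns; small last-digit differences can occur because the tabulated precision and recall are themselves rounded. A quick sketch reproducing several rows:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in %)."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the F1 column of Table 2 from its precision/recall columns.
for name, p, r in [("BILSTM-CRF", 56.4, 41.0),
                   ("BIGRU-CRF", 58.6, 45.0),
                   ("BERT-BILSTM-CRF", 77.2, 79.5)]:
    print(f"{name}: F1 = {f1_score(p, r):.1f}")  # matches Table 2
```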
TABLE 3 Comparison of prediction time of each method
Method Prediction time (s)
BILSTM-CRF 3.41
BILSTM-MASK-CRF 3.37
BIGRU-CRF 3.03
BIGRU-MASK-CRF 3.35
BERT-BILSTM-CRF 49.67
BERT-BILSTM-MASK-CRF 49.41
Step 3: extracting the text to be extracted by using the trained entity extraction model to obtain the damage test text information.
After the trained models are obtained, a model is selected according to the actual use case: if there is no constraint on prediction time, the improved BERT-BILSTM-MASK-CRF model can be selected; if the prediction time must be kept short, the improved BIGRU-MASK-CRF model is used. The selected model can then extract entities from input sentences: the input text only needs to be segmented by paragraph and fed into the model to complete entity extraction, and the extracted entities can be mapped into a table according to manually defined rules and stored.
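A minimal sketch of this final prediction-and-mapping step. The paragraph segmentation and the table-mapping rule shown here are illustrative assumptions; the patent only specifies that segmentation is by paragraph and that the mapping rule is manually defined:

```python
def segment_paragraphs(text: str) -> list:
    """Split the input text into paragraphs before feeding it to the model."""
    return [p.strip() for p in text.split("\n") if p.strip()]

def map_entities_to_table(entities):
    """Map extracted (entity_type, value) pairs into table columns via a
    hand-written rule: one column per entity type (an assumed rule)."""
    table = {}
    for etype, value in entities:
        table.setdefault(etype, []).append(value)
    return table

# Hypothetical extraction result for one paragraph.
entities = [("projectile-type", "armor-piercing"), ("projectile-material", "steel")]
print(map_entities_to_table(entities))
```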
The above-described embodiments merely illustrate preferred embodiments of the present invention and do not limit its scope. Various modifications and improvements made by those skilled in the art to the technical solutions of the present invention without departing from its spirit fall within the protection scope of the present invention as defined by the claims.

Claims (9)

1. A damage test text information extraction method, characterized in that the method comprises the following steps:
constructing a damage test text labeling data set, and dividing the damage test text labeling data set into a training set and a testing set, wherein the damage test text is used for recording damage effects and damage evaluation research data, and the damage test text contains a large amount of numerical information;
constructing an entity extraction model by adopting a neural network based on a numerical mask mechanism, and training the entity extraction model by utilizing the training set, wherein the entity extraction model is used for extracting entities in the damage test text, and the entities are natural language processing units;
and extracting the text information of the test set by using the trained entity extraction model.
2. The damage test text information extraction method of claim 1, characterized in that: the entity extraction model comprises an input layer, a mask layer, an embedding layer, an encoding layer, a CRF output layer and an entity extraction layer;
the input layer is used for segmenting the texts in the training set and outputting a text sequence; the mask layer is used for replacing numerical value information in the text sequence to obtain a mask layer output sequence;
the embedded layer is used for vectorizing the output sequence of the mask layer to obtain a word vector matrix;
the coding layer is used for coding the word vector matrix and outputting a label sequence;
the CRF output layer is used for establishing a dependency relationship among the labels in the label sequence;
the entity extraction layer is used for extracting the entities in the label sequence.
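The CRF output layer of claim 2 models dependencies between adjacent labels; its decoding step can be illustrated with a minimal Viterbi decoder over emission and transition scores. This is an illustrative plain-Python sketch of the standard algorithm, not the patent's implementation:

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.
    emissions[t][i]: score of label i at position t;
    transitions[i][j]: score of moving from label i to label j."""
    n = len(labels)
    score = list(emissions[0])   # best score of any path ending in each label
    back = []                    # back-pointers, one row per transition step
    for t in range(1, len(emissions)):
        ptr, new = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            ptr.append(best_i)
            new.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score, back = new, back + [ptr]
    best = max(range(n), key=lambda i: score[i])
    path = [best]
    for ptr in reversed(back):   # follow back-pointers from the best final label
        best = ptr[best]
        path.append(best)
    return [labels[i] for i in reversed(path)]
```

For example, with a transition matrix that penalizes the illegal O-to-I move, the decoder prefers a B-I sequence even when the first-position emissions are ambiguous, which is exactly the inter-label dependency the CRF layer contributes.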
3. The damage test text information extraction method of claim 1, characterized in that: constructing the damage test text labeling data set comprises the following steps:
collecting a plurality of damage test files, and converting the damage test files into texts;
filtering out rare characters in the text by using a regular expression, retaining the Chinese characters, numerical values and English characters in the text, segmenting the filtered text into sentence units, and replacing every numeric string in the filtered text with a specific character mark to obtain a segmented text;
and determining the entity type of the damage experiment sample, and labeling the segmented text according to the entity type to obtain the damage experiment text labeling data set.
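The filtering and numeric-replacement steps of claim 3 can be sketched with standard regular expressions. The mask token `NUM` and the exact character classes kept are assumptions for illustration; the patent only specifies "a specific character mark" and the retention of Chinese, numeric and English characters:

```python
import re

def preprocess(text: str, num_token: str = "NUM") -> list:
    """Keep Chinese characters, digits, English letters and sentence-ending
    punctuation, replace every numeric string with a mask token, then split
    the result into sentences."""
    # 1. drop everything except Chinese, ASCII letters/digits and sentence ends
    kept = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9.。！!？?]", "", text)
    # 2. replace each contiguous digit string (incl. decimals) with the token
    masked = re.sub(r"\d+(\.\d+)?", num_token, kept)
    # 3. split on Chinese/ASCII sentence-ending punctuation
    return [s for s in re.split(r"[。！？.!?]", masked) if s]
```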
4. The damage test text information extraction method of claim 2, characterized in that: the step of the input layer segmenting the texts in the training set and outputting a text sequence comprises:
inputting the training set, segmenting it character by character to obtain a text sequence, and uniformly segmenting the numerical information, which consists of contiguous numerical characters in the text sequence, by means of a regular expression, to obtain the segmented sequence.
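The character-level segmentation of claim 4, with contiguous numeric strings kept as single units via a regular expression, might look as follows (a sketch; the decimal-number pattern is an assumption):

```python
import re

def tokenize(sentence: str) -> list:
    """Split a sentence into single characters, but keep each contiguous
    numeric string (e.g. '120.5') as one token."""
    tokens = []
    for piece in re.split(r"(\d+(?:\.\d+)?)", sentence):
        if re.fullmatch(r"\d+(?:\.\d+)?", piece):
            tokens.append(piece)          # numeric value: one token
        else:
            tokens.extend(list(piece))    # everything else: per character
    return tokens
```

A downstream mask layer can then replace the numeric tokens in this sequence with the chosen character mark without touching the surrounding characters.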
5. The damage test text information extraction method of claim 4, characterized in that: the step of the mask layer replacing the numerical information in the text sequence to obtain a mask layer output sequence comprises:
inputting the segmented sequence and replacing the numerical information with a specific character mark to obtain the mask layer output sequence.
6. The damage test text information extraction method of claim 2, characterized in that: the embedding layer is constructed using static word vector embedding or dynamic word vector embedding.
7. The damage test text information extraction method of claim 2, characterized in that: the encoding layer is constructed using BILSTM or BIGRU.
8. The damage test text information extraction method of claim 1, characterized in that: the entity extraction model is trained using an Adam optimizer, which is used to minimize the loss function.
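The Adam optimizer of claim 8 maintains bias-corrected first and second moment estimates of each gradient. A one-parameter sketch of the standard update rule (default hyperparameters shown are the usual Adam values, not taken from the patent):

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta given gradient grad.
    m, v are the running first/second moments; t is the step count (>= 1)."""
    m = b1 * m + (1 - b1) * grad          # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)             # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

Iterating this step on the gradient of a loss (e.g. the CRF negative log-likelihood) drives the loss toward its minimum, as the claim describes.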
9. The damage test text information extraction method of claim 3, characterized in that: the entity types include, but are not limited to, projectile shape, projectile type and projectile material type.
CN202110584191.0A 2021-05-27 2021-05-27 Method for extracting text information of damage test Pending CN113221569A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110584191.0A CN113221569A (en) 2021-05-27 2021-05-27 Method for extracting text information of damage test

Publications (1)

Publication Number Publication Date
CN113221569A true CN113221569A (en) 2021-08-06

Family

ID=77098748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110584191.0A Pending CN113221569A (en) 2021-05-27 2021-05-27 Method for extracting text information of damage test

Country Status (1)

Country Link
CN (1) CN113221569A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118056A (en) * 2021-10-13 2022-03-01 中国人民解放军军事科学院国防工程研究院工程防护研究所 Information extraction method for war research report
CN114818669A (en) * 2022-04-26 2022-07-29 北京中科智加科技有限公司 Method for constructing name error correction model and computer equipment
CN115033702A (en) * 2022-03-04 2022-09-09 贵州电网有限责任公司 Transformer substation site selection knowledge extraction method based on ensemble learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010101540A1 (en) * 2009-03-02 2010-09-10 Panchenko Borys Evgenijovich Method for the fully modifiable framework distribution of data in a data warehouse taking account of the preliminary etymological separation of said data
CN111428490A (en) * 2020-01-17 2020-07-17 北京理工大学 Reference resolution weak supervised learning method using language model
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN111611790A (en) * 2020-04-13 2020-09-01 华为技术有限公司 Data processing method and device
CN112101009A (en) * 2020-09-23 2020-12-18 中国农业大学 Knowledge graph-based method for judging similarity of people relationship frame of dream of Red mansions
CN112633012A (en) * 2020-12-31 2021-04-09 浙大城市学院 Entity type matching-based unknown word replacing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210806