CN113221569A - Method for extracting text information of damage test - Google Patents

Method for extracting text information of damage test Download PDF

Info

Publication number
CN113221569A
CN113221569A (application CN202110584191.0A)
Authority
CN
China
Prior art keywords
text
layer
damage test
sequence
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110584191.0A
Other languages
Chinese (zh)
Inventor
张磊
汪剑辉
王继民
孔德锋
王幸
王晓峰
杨江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Engineering Protection National Defense Engineering Research Institute Academy of Military Sciences of PLA
Original Assignee
Institute of Engineering Protection National Defense Engineering Research Institute Academy of Military Sciences of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Engineering Protection National Defense Engineering Research Institute Academy of Military Sciences of PLA filed Critical Institute of Engineering Protection National Defense Engineering Research Institute Academy of Military Sciences of PLA
Priority to CN202110584191.0A
Publication of CN113221569A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for extracting text information from damage tests, comprising the following steps: constructing a damage test text annotation data set in which every consecutive numeric string is replaced by a specific character mark, and dividing the data set into a training set and a test set for training an information extraction model; constructing an entity extraction model that adopts a neural network based on a numerical mask mechanism, and training the entity extraction model with the training set; and extracting text information from the test set with the trained entity extraction model. In this method, a neural network algorithm extracts information from damage test texts, and several traditional models are improved by exploiting the characteristic that damage test texts contain a large amount of numerical information, thereby improving the models' extraction performance.

Description

Method for extracting text information of damage test
Technical Field
The invention relates to the technical field of information, in particular to a method for extracting text information of a damage test.
Background
Tests are the most direct and reliable means of studying damage effects and damage assessment, but most test research results are published as unstructured literature. How to collect and extract the test data they contain and process it into structured form is an important and urgent problem for building large damage data sets.
Information extraction is the task of taking natural language text as input and generating structured information that meets some specification. Its subtasks, such as named entity recognition, relation extraction, coreference resolution, named entity linking and knowledge base reasoning, underpin many downstream natural language processing (NLP) tasks, such as machine translation, question answering, natural language understanding, text summarization, and intelligent assistants such as Siri and Cortana. Entity extraction delimits the boundaries of an entity and determines its type; since entities are among the smallest units that natural language processing must understand, entity extraction is the most basic NLP task. Research methods developed to date fall mainly into rule-based, statistical-learning-based, and deep-learning-based approaches.
Rule-based approaches do not require many training samples, but their extraction recall is low. Research has therefore focused on statistical learning and deep learning, and with the rapid development of deep learning models most existing entity extraction models are deep learning models, although they are computationally expensive and their accuracy can be unsatisfactory. With sufficient data, the feature-extraction capability of simple statistical learning methods falls far short of deep learning methods; yet even a deep learning model that identifies entities by straightforward classification struggles with a peculiarity of the damage field: numeric characters make up a high percentage of the text, which degrades the overall extraction performance.
Disclosure of Invention
The invention aims to provide a method for extracting damage test text information that solves the problems in the prior art: it uses a neural network algorithm to extract damage test text information and exploits the characteristic that damage test texts contain a large amount of numerical information to improve several traditional models, thereby improving their extraction performance.
In order to achieve the purpose, the invention provides the following scheme: the invention provides a damage test text information extraction method, which comprises the following steps:
constructing a damage test text labeling data set, and dividing the damage test text labeling data set into a training set and a testing set, wherein the damage test text is used for recording damage effects and damage evaluation research data, and the damage test text contains a large amount of numerical information;
constructing an entity extraction model by adopting a neural network based on a numerical mask mechanism, and training the entity extraction model by utilizing the training set, wherein the entity extraction model is used for extracting entities in the damage test text, and the entities are natural language processing units;
and extracting the text information of the test set by using the trained entity extraction model.
Preferably, the entity extraction model comprises an input layer, a masking layer, an embedding layer, an encoding layer, a CRF output layer and an entity extraction layer;
the input layer is used for segmenting the texts in the training set and outputting a text sequence; the mask layer is used for replacing numerical value information in the text sequence to obtain a mask layer output sequence;
the embedded layer is used for vectorizing the output sequence of the mask layer to obtain a word vector matrix;
the coding layer is used for coding the word vector matrix and outputting a label sequence;
the CRF output layer is used for establishing a dependency relationship among the labels in the label sequence;
the entity extraction layer is used for extracting the entities in the label sequence.
Preferably, the constructing of the damage test text annotation data set comprises the following steps:
collecting a plurality of damage test files, and converting the damage test files into texts;
filtering rare characters out of the text with a regular expression, retaining the Chinese characters, numerical values and English characters in the text, segmenting the filtered text into sentence units, and replacing all numeric strings in the filtered text with a specific character mark to obtain the segmented text;
and determining the entity type of the damage experiment sample, and labeling the segmented text according to the entity type to obtain the damage experiment text labeling data set.
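The filtering and number-replacement steps above can be sketched with regular expressions. The patent does not give the exact patterns, and the mask token and keep-set below are illustrative assumptions (the "<Num>" token follows the special character mark used later in the description):

```python
import re

NUM_MASK = "<Num>"  # assumed spelling of the specific character mark

def filter_text(text: str) -> str:
    # Keep CJK ideographs, ASCII letters/digits and basic punctuation; drop rare characters.
    return re.sub(r"[^\u4e00-\u9fffA-Za-z0-9.,;:()%\s]", "", text)

def mask_numbers(text: str) -> str:
    # Replace every consecutive numeric string (optionally with a decimal point).
    return re.sub(r"\d+(?:\.\d+)?", NUM_MASK, text)

sample = "装药量41.82kg, 侵彻深度1.2m"
print(mask_numbers(filter_text(sample)))  # 装药量<Num>kg, 侵彻深度<Num>m
```

Each run of digits becomes a single token, so downstream sequence models see one symbol per number rather than one symbol per digit.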
Preferably, the segmenting, by the input layer, the text in the training set to output the text sequence includes:
inputting the training set, segmenting the training set by taking characters as units to obtain a text sequence, and uniformly segmenting numerical information by adopting a regular expression, wherein the numerical information consists of continuous numerical characters in the text sequence to obtain a segmented sequence.
Preferably, the replacing, by the mask layer, the numerical information in the text sequence to obtain a mask layer output sequence includes:
and inputting the cut sequence, and replacing the numerical information by adopting a specific character mark to obtain the output sequence of the mask layer.
Preferably, the embedding layer is constructed by adopting static word vector embedding or dynamic word vector embedding.
Preferably, the coding layer is constructed using BILSTM or BIGRU.
Preferably, the entity extraction model is trained with an Adam optimizer, which is configured to minimize the loss function.
Preferably, the entity types include, but are not limited to, projectile shape, projectile type, and projectile material type.
The invention discloses the following technical effects:
according to the method for extracting the damage test sample information, the neural network algorithm is used for extracting the information of the damage test text, and the characteristic that a large amount of numerical information exists in the damage test text is utilized to improve various traditional models, so that the extraction effect of the models is improved. The technology utilizes the characteristic that the damage test text has more numerical information, and adds a numerical mask mechanism to a common entity extraction model based on deep learning, so that the entity extraction model can greatly improve the training efficiency on the current damage test text compared with a model without the numerical mask mechanism, and the entity extraction model can be trained by using less epoch numbers to obtain an entity extraction model with higher F1 on a test set, and the F1 value of the stored optimal model on the test set is higher than that of the model without the numerical mask mechanism.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a method for extracting information based on a damage test sample according to the present invention;
FIG. 2 is a schematic diagram of an entity extraction model structure based on a numerical mask mechanism according to the present invention;
FIG. 3 is a schematic diagram of a flow of manual labeling of a data set according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a numerical value masking step in an embodiment of the present invention, where fig. (a) is a schematic diagram of an original text for DET labeling, fig. (b) is a schematic diagram of a numerical value as a character, and fig. (c) is a schematic diagram of a numerical value information masking operation;
FIG. 5 is a graph of the test results of BILSTM-CRF and BILSTM-MASK-CRF according to the present invention;
FIG. 6 is a graph of the test results for the prediction performance of BIGRU-CRF and BIGRU-MASK-CRF according to the present invention;
FIG. 7 is a graph of the test results of BERT-BILSTM-CRF and BERT-BILSTM-MASK-CRF according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The invention provides a method for extracting damage test text information, which comprises the following steps with reference to fig. 1:
Step 1: collect a number of damage test documents and convert them into text; filter rare characters out of the text with a regular expression, retaining only Chinese characters, numerical values and English characters; segment the text; construct a text data set according to the entity information to be retained; annotate the entity information; and divide the annotated data set into a training set and a test set.
S11: in this embodiment, 198 damage test papers were collected; damage test texts were extracted with a text tool, the extracted texts were filtered with a regular expression, and the filtered texts were then segmented into sentence units. The entity types to be extracted were determined according to expert opinion and the entity information retained, finally yielding a data set of 865 long texts. The data set forms a corpus, and the entity types retained in the corpus are summarized in Table 1:
table 1 damage test text test entity type table
(Table 1 appears only as an image in the original document.)
S12: the data set obtained in S11 is manually annotated according to the method shown in fig. 3, using double annotation with single review: two researchers A and B who have mastered the relevant domain knowledge annotate the data independently, and a third researcher C reviews the annotations. If a review fails, the current data is re-annotated by the other researcher; if the review still fails, the current data is set aside and the three researchers discuss and annotate it together. In total, 2224 annotated entities were obtained. The annotated corpus data set is referred to herein as the "damage experiment text", abbreviated DET, and is stored in IOB label format, shown in fig. 4(a). The annotated entities were analyzed, and each entity type's IOB label and its count in the corpus are given in Table 1.
S13: the annotated data set was randomly divided into a training set and a test set at a 7:3 ratio.
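The 7:3 division can be sketched as follows; the patent specifies only a random 7:3 split, so the seed and shuffling routine are assumptions:

```python
import random

def split_dataset(samples, train_ratio=0.7, seed=42):
    """Randomly divide annotated samples into a training set and a test set."""
    shuffled = list(samples)              # copy so the caller's collection is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# With the 865 long texts of this embodiment the split is 605 training / 260 test.
train, test = split_dataset(range(865))
print(len(train), len(test))  # 605 260
```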
Step 2: construct an entity extraction model with Keras (a high-level neural network API), train the entity extraction model, and use it to extract entity information from damage test texts.
As shown in fig. 2, the entity extraction model in the present invention includes an input layer, a mask layer, an embedding layer, an encoding layer, a CRF output layer, and an entity extraction layer.
S201, processing a text to be input by using an input layer;
the input layer is a "chinese text", in this embodiment, the input layer is a certain paragraph D in the training set, and the text is segmented by using characters as segmentation units to obtain a text sequence D ═ X1,X2,…,XSIn which X isi(i ═ 1,2, … … S) denotes characters constituting a text sequence. When the text sequence D is divided into sequences by taking characters as units, coherent numerical characters are uniformly divided into numerical information by using a regular expression, the numerical characters represent characters expressing numerical meanings and not only include Arabic numerals but also Chinese number case, and the character sequence after being divided is F ═ F1,f2,…,fTIn which fj(j ═ 1,2, … … T) represents sequence constituent elements, and T ≦ S.
S202, establishing a mask layer, processing an output text sequence of an input layer by using the mask layer through a mask mechanism, and obtaining an output result;
the input of the mask layer is the output F of the input layer, and all the numerical information in F is identified by special characters<Num>Instead, this step is called a "masked number" and the output is M ═ M1,m2,…,mx},mk(k ═ 1,2, … … x) represents a constituent element of M. Due to the special characteristics of the damaged text, a large number of numerical characters exist in sentences and paragraphs, and the continuous numerical characters occupy a relatively large length of the sequence, so that the difficulty of learning sequence labeling of the conventional neural networks BILSTM and BIGRU is increased. The operation of the 'mask numerical value' greatly shortens the whole sequence of the text and divides the boundary of the 'numerical information' in advance on the premise of not influencing the semantic meaning of the text, so that the network can not only correspondingly reduce the error rate of judging the numerical boundary, but also determine which entity type the current 'numerical information' entity belongs to with less training steps and training time. This "value masking" operation may therefore result in the neural network encoder learning better the text semantic rules.
Fig. 4 illustrates the numerical masking operation: the five characters "4", "1", ".", "8" and "2" are first merged into the single token "41.82", which is then masked with <Num>. This shortens the sentence sequence from length 21 to length 17 and delimits the boundary between the number and the surrounding characters more clearly; in effect, the encoder is told which character positions are numerical information before the sequence enters the neural encoder.
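A minimal sketch of this segmentation-plus-masking step (the sentence below is a hypothetical stand-in, since the original example text is shown only as an image):

```python
import re

NUM_RE = r"\d+(?:\.\d+)?"  # assumed pattern for a consecutive numeric string

def segment(text: str):
    """Character-level segmentation that keeps each numeric string as one token."""
    tokens = []
    for piece in re.split(f"({NUM_RE})", text):
        if re.fullmatch(NUM_RE, piece):
            tokens.append(piece)      # the whole number, e.g. "41.82", is one token
        else:
            tokens.extend(piece)      # everything else, character by character
    return tokens

def mask(tokens):
    return ["<Num>" if re.fullmatch(NUM_RE, t) else t for t in tokens]

raw = "弹体质量为41.82千克"
print(segment(raw))        # ['弹', '体', '质', '量', '为', '41.82', '千', '克']
print(mask(segment(raw)))  # ['弹', '体', '质', '量', '为', '<Num>', '千', '克']
```

The masked sequence is shorter than the raw character sequence by one token per digit beyond the first in each number, which is exactly the shortening effect the figure describes.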
S203, establishing an embedding layer, and mapping output data of the masking layer into a matrix formed by vectors by using the embedding layer;
the Embedding layer is also called an Embedding layer and mainly used for mapping the output M of the masking layer into a matrix Q formed by dense vectors.
Static word vector embedding: let Vec be a word-vector file trained with a Word2Vec model, in which each character has a dense vector of length L. For the mask-layer output M = {m_1, m_2, …, m_x}, each element m_k is looked up against the dictionary in the word-vector file Vec, converting M into a word-vector matrix Q = {q_1, q_2, …, q_N}, where each q has length L.
Dynamic word vector embedding: BERT-WWM (a Chinese pre-training model based on whole-word masking) is adopted as the pre-trained model of this embodiment because it is trained for Chinese characteristics. BERT-WWM cannot contain the <Num> special character defined here, so when the masked sequence is input to the BERT-WWM model to obtain word vectors, <Num> would be mapped to the BERT vocabulary token [UNK], which means an unknown word, turning <Num> into a meaningless unknown token. To let the model know that a number occupies this position, an arbitrary digit present in the BERT vocabulary is used to replace <Num>; here the digit "9" from the vocabulary is used as the replacement character. The input is still the mask-layer output M, with every <Num> replaced by "9"; it is then segmented according to BERT-WWM's segmentation rules and fed into the pre-trained BERT-WWM, which outputs a matrix of character vectors Q = {q_1, q_2, …, q_N}, where each q_v (v = 1, 2, …, N) has the length of the pre-trained model's word vectors.
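The substitution itself is a one-line transformation; a sketch (the token list is hypothetical, "9" is the stand-in digit named in the embodiment):

```python
def prepare_for_bert(tokens, stand_in="9"):
    """Replace the <Num> mask with a digit that exists in the BERT vocabulary,
    so the pretrained model sees a number instead of [UNK]."""
    return [stand_in if t == "<Num>" else t for t in tokens]

masked = ["弹", "径", "<Num>", "毫", "米"]
print(prepare_for_bert(masked))  # ['弹', '径', '9', '毫', '米']
```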
S204, establishing a coding layer, and using the coding layer to encode the embedding layer's output matrix into text sequence labels:
the LSTM and the GRU belong to improved RNN, so that the long dependence problem is solved, and the LSTM and the GRU have no obvious advantages or disadvantages in the actual use process, and the specific situation needs to be determined according to the use effect. Therefore, the two coding modes considered in the coding layer of this embodiment are BILSTM and BIGRU, and the comparison of the specific test effects is performed before the selection.
The BILSTM coding layer is shown in equations (1)-(3):

  h_t^→ = LSTM(q_t, h_{t-1}^→)    (1)
  h_t^← = LSTM(q_t, h_{t+1}^←)    (2)
  k_t = [h_t^→ ; h_t^←],  k_t ∈ R^{2*l}    (3)

where h_t^→ denotes the LSTM neural network reading the sequence from left to right and h_t^← the LSTM neural network reading it from right to left; the hidden-layer outputs of the two directions of the bidirectional LSTM are spliced into the output matrix. T is the length of the character sequence output by the mask layer (equal to the length of the character sequence output by the input layer), l is the dimension of the LSTM hidden layer, t indexes the t-th character, and m is the number of hidden layers.
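The splice in equation (3), concatenating the forward and backward hidden states so each position carries a 2*l-dimensional vector, can be illustrated with shapes alone; random arrays stand in here for real LSTM states, and the sizes are the example values from the masking step:

```python
import numpy as np

T, l = 17, 128                   # sequence length after masking, hidden dimension
h_fwd = np.random.randn(T, l)    # left-to-right LSTM hidden states (stand-in values)
h_bwd = np.random.randn(T, l)    # right-to-left LSTM hidden states (stand-in values)

# k_t = [h_t_fwd ; h_t_bwd]: concatenate along the feature axis, so k_t lies in R^(2*l).
K = np.concatenate([h_fwd, h_bwd], axis=-1)
print(K.shape)  # (17, 256)
```

In Keras this splice is what `Bidirectional(LSTM(l, return_sequences=True), merge_mode="concat")` produces.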
The BIGRU coding layer is shown in equations (4)-(6):

  h_t^→ = GRU(q_t, h_{t-1}^→)    (4)
  h_t^← = GRU(q_t, h_{t+1}^←)    (5)
  k_t = [h_t^→ ; h_t^←],  k_t ∈ R^{2*l}    (6)

where h_t^→ denotes the GRU neural network reading the sequence from left to right and h_t^← the GRU neural network reading it from right to left; the hidden-layer outputs of the two directions of the bidirectional GRU are spliced into the output matrix. t indexes the t-th character, l is the dimension of the GRU hidden layer, and m is the number of hidden layers.
S205, a CRF output layer is established, and a dependency relationship between text sequence labels is established by utilizing the CRF output layer;
since the sequence annotation is classified by nature, if the softmax full-link layer is directly connected to the last layer for classification, but this situation considers that the type of each label is predicted to be independent, some cases of "illegal labels" may occur, for example, "I-Dweight" may occur in the middle of the predicted current character position of "I-Dlength" and "I-Dlength", which actually belongs to the case of illegal labels, because "I-Dlength" is preceded by either "I-Dlength" or "B-Dlength", and the result of the occurrence of other labels is obviously wrong, i.e., "illegal labels", so that not only the characteristics of the current position but also the types of labels predicted before the current position need to be considered when the actual sequence annotation is predicted, so that certain dependency is formed between the labels. The objective function of the CRF solves this problem because it inputs not only the current state features but also the transition features of the tag sequence, so that the dependency between tags is formed during the training of the actual neural network.
Suppose the coding-layer output is X. Let A be the transition matrix, where A_{i,j} denotes the transition probability from the i-th label to the j-th label, and let P be the emission probability matrix, where P_{i,j} is the unnormalized probability that the i-th character is mapped to the j-th label. The score of input sequence X with output label sequence y is computed by equation (7):

  score(X, y) = Σ_{i=0..n} A_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i}    (7)

Then the conditional probability P(y|X) of y is computed with the softmax function:

  P(y|X) = exp(score(X, y)) / Σ_{y'} exp(score(X, y'))    (8)

Taking the negative logarithm of the conditional probability gives the final objective loss function:

  Loss = −log P(y|X)    (9)
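A brute-force numeric sketch of the score and loss computation for a tiny example (start/end transitions are omitted and all tag paths are enumerated, which is only feasible for toy sizes; a real CRF layer computes the partition function with the forward algorithm):

```python
import numpy as np
from itertools import product

def path_score(P, A, y):
    """score(X, y): emission scores P[i, y_i] plus transition scores A[y_i, y_{i+1}]."""
    s = sum(P[i, y[i]] for i in range(len(y)))
    s += sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
    return s

def crf_neg_log_likelihood(P, A, y):
    """-log P(y|X): log of the sum over all tag paths minus the gold path's score."""
    n, k = P.shape
    scores = np.array([path_score(P, A, yp) for yp in product(range(k), repeat=n)])
    log_z = np.log(np.exp(scores).sum())   # log partition function
    return log_z - path_score(P, A, y)

P = np.array([[2.0, 0.5], [0.5, 2.0]])     # emissions: 2 characters, 2 tags
A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # transitions between tags
print(crf_neg_log_likelihood(P, A, (0, 1)))
```

Minimizing this quantity with Adam, as the embodiment does, pushes probability mass toward the gold tag path while respecting tag-to-tag transition constraints.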
In model training, an Adam optimizer is used to minimize this loss function.
S206, establishing an entity extraction layer, and performing entity extraction by using the entity extraction layer;
when a predicted label sequence is obtained, only one predicted label at the corresponding position of the character output by each mask layer is completed, and further operation is needed to complete actual entity extraction1,f2,…,fTAnd F, extracting entities from the corresponding sequence label TL according to the predicted sequence label list.
After obtaining the sequence tag TL, extracting the entity from the comparison F in the form of an algorithm 1, and the main idea is that the tag 'B-type' beginning with 'B' in the TL tag is matched, then the current position F is considered as the starting character of a 'type' entity, and through multiple rounds of judgment, the 'type' entity beginning with 'B-type' is extracted from F and stored in res, and then other types of entities are extracted according to the method.
(Algorithm 1 appears only as an image in the original document.)
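Since Algorithm 1 is reproduced only as an image, here is a hedged sketch of the IOB-decoding idea it describes: start an entity at each "B-type" label, extend it over the following "I-type" labels, and collect the spans from F (the entity-type name below is illustrative):

```python
def extract_entities(chars, tags):
    """Decode IOB tags over a character sequence into (entity_text, type) pairs."""
    entities, i = [], 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            etype, j = tags[i][2:], i + 1
            while j < len(tags) and tags[j] == f"I-{etype}":
                j += 1                     # extend over continuation tags of the same type
            entities.append(("".join(chars[i:j]), etype))
            i = j
        else:
            i += 1                         # "O" and stray "I-" tags start no entity
    return entities

chars = ["弹", "径", "<Num>", "毫", "米"]
tags = ["O", "O", "B-Dlength", "I-Dlength", "I-Dlength"]
print(extract_entities(chars, tags))  # [('<Num>毫米', 'Dlength')]
```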
The original models are named BILSTM-CRF, BIGRU-CRF and BERT-BILSTM-CRF respectively, and the models based on the numerical MASK mechanism are named BILSTM-MASK-CRF, BIGRU-MASK-CRF and BERT-BILSTM-MASK-CRF. Testing on the data sets yields the results shown in figs. 5, 6 and 7. As figs. 5 and 6 show, both baseline models gain in F1 value after the numerical masking mechanism is introduced, mainly because the masked models reach a higher F1 with fewer epochs; the gain in F1 comes chiefly from a gain in recall. Fig. 5 shows that the precision of BILSTM-MASK-CRF does not rise significantly faster than that of BILSTM-CRF during training, the two precision curves being almost identical; fig. 6 shows that the precision of BIGRU-MASK-CRF rises faster at the same epoch but eventually stabilizes, the two curves almost coinciding. Since the numerical mask layer belongs only to the data-input part of the model and does not participate in model training proper, the training time per epoch after introducing it matches the baseline's. From this analysis, BILSTM-MASK-CRF and BIGRU-MASK-CRF with the numerical masking mechanism achieve better extraction results with fewer training steps, that is, with less training time, than the baseline models.
Fig. 7 shows the per-epoch prediction performance on the test set of BERT-BILSTM-CRF and BERT-BILSTM-MASK-CRF, which introduce the pre-trained model BERT-WWM. Analysis shows that the masked model is essentially consistent with the baseline in precision but still clearly improves recall, mainly by reaching a higher recall in fewer training steps; hence, on the comprehensive evaluation metric F1, the improved model obtains a higher F1 value in fewer training steps than the baseline, that is, a better prediction result with less training time.
Table 2 shows the best F1 value obtained by each model with sufficient training. As the table shows, every model with the numerical masking mechanism improves its extraction performance over the baseline: BILSTM-MASK-CRF improves the F1 value by 4.1 points over its baseline, BIGRU-MASK-CRF by 2.1, and BERT-BILSTM-MASK-CRF by 1.9. Table 3 shows the time each model takes, from loading the trained model to producing the prediction, when testing one test corpus. The pre-trained models take much longer at prediction time, both about 49 seconds, while the models without a pre-trained component take only about 3 seconds, greatly reducing the time consumed compared with the pre-trained models.
TABLE 2 Comparison of the best prediction results of each model
Method Precision (%) Recall (%) F1 (%)
BILSTM-CRF 56.4 41.0 47.5
BILSTM-MASK-CRF 55.4 48.2 51.6
BIGRU-CRF 58.6 45.0 50.9
BIGRU-MASK-CRF 58.1 48.7 53.0
BERT-BILSTM-CRF 77.2 79.5 78.3
BERT-BILSTM-MASK-CRF 82.3 78.1 80.2
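The F1 values in Table 2 are the harmonic mean of the precision and recall columns; small last-digit differences can occur because the tabulated precision and recall are themselves rounded. A quick sketch reproducing several rows:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in %)."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the F1 column of Table 2 from its precision/recall columns.
for name, p, r in [("BILSTM-CRF", 56.4, 41.0),
                   ("BIGRU-CRF", 58.6, 45.0),
                   ("BERT-BILSTM-CRF", 77.2, 79.5)]:
    print(f"{name}: F1 = {f1_score(p, r):.1f}")  # matches Table 2
```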
TABLE 3 Comparison of prediction time of each method
Method Prediction time (s)
BILSTM-CRF 3.41
BILSTM-MASK-CRF 3.37
BIGRU-CRF 3.03
BIGRU-MASK-CRF 3.35
BERT-BILSTM-CRF 49.67
BERT-BILSTM-MASK-CRF 49.41
Step 3: extracting the text to be extracted by using the trained entity extraction model to obtain the damage test text information.
After the trained models are obtained, a model is selected according to the actual use case: if there is no constraint on prediction time, the improved BERT-BILSTM-MASK-CRF model can be selected; if the prediction time must be kept short, the improved BIGRU-MASK-CRF model is used. The selected model can then extract entities from input sentences: the input text only needs to be segmented by paragraph and fed into the model to complete entity extraction, and the extracted entities can be mapped into a table according to manually defined rules and stored.
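A minimal sketch of this final prediction-and-mapping step. The paragraph segmentation and the table-mapping rule shown here are illustrative assumptions; the patent only specifies that segmentation is by paragraph and that the mapping rule is manually defined:

```python
def segment_paragraphs(text: str) -> list:
    """Split the input text into paragraphs before feeding it to the model."""
    return [p.strip() for p in text.split("\n") if p.strip()]

def map_entities_to_table(entities):
    """Map extracted (entity_type, value) pairs into table columns via a
    hand-written rule: one column per entity type (an assumed rule)."""
    table = {}
    for etype, value in entities:
        table.setdefault(etype, []).append(value)
    return table

# Hypothetical extraction result for one paragraph.
entities = [("projectile-type", "armor-piercing"), ("projectile-material", "steel")]
print(map_entities_to_table(entities))
```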
The above-described embodiments merely illustrate preferred embodiments of the present invention and do not limit its scope. Various modifications and improvements made by those skilled in the art to the technical solutions of the present invention without departing from its spirit fall within the protection scope of the present invention as defined by the claims.

Claims (9)

1. A damage test text information extraction method, characterized in that the method comprises the following steps:
constructing a damage test text labeling data set, and dividing the damage test text labeling data set into a training set and a testing set, wherein the damage test text is used for recording damage effects and damage evaluation research data, and the damage test text contains a large amount of numerical information;
constructing an entity extraction model by adopting a neural network based on a numerical mask mechanism, and training the entity extraction model by utilizing the training set, wherein the entity extraction model is used for extracting entities in the damage test text, and the entities are natural language processing units;
and extracting the text information of the test set by using the trained entity extraction model.
2. The damage test text information extraction method of claim 1, characterized in that: the entity extraction model comprises an input layer, a mask layer, an embedding layer, an encoding layer, a CRF output layer and an entity extraction layer;
the input layer is used for segmenting the texts in the training set and outputting a text sequence; the mask layer is used for replacing numerical value information in the text sequence to obtain a mask layer output sequence;
the embedded layer is used for vectorizing the output sequence of the mask layer to obtain a word vector matrix;
the coding layer is used for coding the word vector matrix and outputting a label sequence;
the CRF output layer is used for establishing a dependency relationship among the labels in the label sequence;
the entity extraction layer is used for extracting the entities in the label sequence.
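The CRF output layer of claim 2 models dependencies between adjacent labels; its decoding step can be illustrated with a minimal Viterbi decoder over emission and transition scores. This is an illustrative plain-Python sketch of the standard algorithm, not the patent's implementation:

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.
    emissions[t][i]: score of label i at position t;
    transitions[i][j]: score of moving from label i to label j."""
    n = len(labels)
    score = list(emissions[0])   # best score of any path ending in each label
    back = []                    # back-pointers, one row per transition step
    for t in range(1, len(emissions)):
        ptr, new = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            ptr.append(best_i)
            new.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score, back = new, back + [ptr]
    best = max(range(n), key=lambda i: score[i])
    path = [best]
    for ptr in reversed(back):   # follow back-pointers from the best final label
        best = ptr[best]
        path.append(best)
    return [labels[i] for i in reversed(path)]
```

For example, with a transition matrix that penalizes the illegal O-to-I move, the decoder prefers a B-I sequence even when the first-position emissions are ambiguous, which is exactly the inter-label dependency the CRF layer contributes.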
3. The damage test text information extraction method of claim 1, characterized in that: constructing the damage test text labeling data set comprises the following steps:
collecting a plurality of damage test files, and converting the damage test files into texts;
filtering out rare characters in the text by using a regular expression, retaining the Chinese characters, numerical values and English characters in the text, segmenting the filtered text into sentence units, and replacing every numeric string in the filtered text with a specific character mark to obtain a segmented text;
and determining the entity type of the damage experiment sample, and labeling the segmented text according to the entity type to obtain the damage experiment text labeling data set.
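The filtering and numeric-replacement steps of claim 3 can be sketched with standard regular expressions. The mask token `NUM` and the exact character classes kept are assumptions for illustration; the patent only specifies "a specific character mark" and the retention of Chinese, numeric and English characters:

```python
import re

def preprocess(text: str, num_token: str = "NUM") -> list:
    """Keep Chinese characters, digits, English letters and sentence-ending
    punctuation, replace every numeric string with a mask token, then split
    the result into sentences."""
    # 1. drop everything except Chinese, ASCII letters/digits and sentence ends
    kept = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9.。！!？?]", "", text)
    # 2. replace each contiguous digit string (incl. decimals) with the token
    masked = re.sub(r"\d+(\.\d+)?", num_token, kept)
    # 3. split on Chinese/ASCII sentence-ending punctuation
    return [s for s in re.split(r"[。！？.!?]", masked) if s]
```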
4. The damage test text information extraction method of claim 2, characterized in that: the step of the input layer segmenting the texts in the training set and outputting a text sequence comprises:
inputting the training set, segmenting it character by character to obtain a text sequence, and uniformly segmenting the numerical information, which consists of contiguous numerical characters in the text sequence, by means of a regular expression, to obtain the segmented sequence.
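The character-level segmentation of claim 4, with contiguous numeric strings kept as single units via a regular expression, might look as follows (a sketch; the decimal-number pattern is an assumption):

```python
import re

def tokenize(sentence: str) -> list:
    """Split a sentence into single characters, but keep each contiguous
    numeric string (e.g. '120.5') as one token."""
    tokens = []
    for piece in re.split(r"(\d+(?:\.\d+)?)", sentence):
        if re.fullmatch(r"\d+(?:\.\d+)?", piece):
            tokens.append(piece)          # numeric value: one token
        else:
            tokens.extend(list(piece))    # everything else: per character
    return tokens
```

A downstream mask layer can then replace the numeric tokens in this sequence with the chosen character mark without touching the surrounding characters.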
5. The damage test text information extraction method of claim 4, characterized in that: the step of the mask layer replacing the numerical information in the text sequence to obtain a mask layer output sequence comprises:
inputting the segmented sequence and replacing the numerical information with a specific character mark to obtain the mask layer output sequence.
6. The damage test text information extraction method of claim 2, characterized in that: the embedding layer is constructed using static word vector embedding or dynamic word vector embedding.
7. The damage test text information extraction method of claim 2, characterized in that: the encoding layer is constructed using BILSTM or BIGRU.
8. The damage test text information extraction method of claim 1, characterized in that: the entity extraction model is trained using an Adam optimizer, which is used to minimize the loss function.
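The Adam optimizer of claim 8 maintains bias-corrected first and second moment estimates of each gradient. A one-parameter sketch of the standard update rule (default hyperparameters shown are the usual Adam values, not taken from the patent):

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta given gradient grad.
    m, v are the running first/second moments; t is the step count (>= 1)."""
    m = b1 * m + (1 - b1) * grad          # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)             # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

Iterating this step on the gradient of a loss (e.g. the CRF negative log-likelihood) drives the loss toward its minimum, as the claim describes.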
9. The damage test text information extraction method of claim 3, characterized in that: the entity types include, but are not limited to, projectile shape, projectile type and projectile material type.
CN202110584191.0A 2021-05-27 2021-05-27 Method for extracting text information of damage test Pending CN113221569A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110584191.0A CN113221569A (en) 2021-05-27 2021-05-27 Method for extracting text information of damage test

Publications (1)

Publication Number Publication Date
CN113221569A true CN113221569A (en) 2021-08-06

Family

ID=77098748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110584191.0A Pending CN113221569A (en) 2021-05-27 2021-05-27 Method for extracting text information of damage test

Country Status (1)

Country Link
CN (1) CN113221569A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118056A (en) * 2021-10-13 2022-03-01 中国人民解放军军事科学院国防工程研究院工程防护研究所 Information extraction method for war research report
CN114818669A (en) * 2022-04-26 2022-07-29 北京中科智加科技有限公司 Method for constructing name error correction model and computer equipment
CN115033702A (en) * 2022-03-04 2022-09-09 贵州电网有限责任公司 Transformer substation site selection knowledge extraction method based on ensemble learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010101540A1 (en) * 2009-03-02 2010-09-10 Panchenko Borys Evgenijovich Method for the fully modifiable framework distribution of data in a data warehouse taking account of the preliminary etymological separation of said data
CN111428490A (en) * 2020-01-17 2020-07-17 北京理工大学 Reference resolution weak supervised learning method using language model
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN111611790A (en) * 2020-04-13 2020-09-01 华为技术有限公司 Data processing method and device
CN112101009A (en) * 2020-09-23 2020-12-18 中国农业大学 Knowledge graph-based method for judging similarity of people relationship frame of dream of Red mansions
CN112633012A (en) * 2020-12-31 2021-04-09 浙大城市学院 Entity type matching-based unknown word replacing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210806