CN113836919A - Building industry text error correction method based on transfer learning

Building industry text error correction method based on transfer learning

Info

Publication number
CN113836919A
Authority
CN
China
Prior art keywords
task
error correction
training
bert model
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111160118.7A
Other languages
Chinese (zh)
Inventor
侯振国
何海英
张中善
杨伟涛
李佳男
张传浩
张培聪
孙维东
阴栋阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Seventh Engineering Division Corp Ltd
General Contracting Co Ltd of China Construction Seventh Engineering Division Corp Ltd
Original Assignee
China Construction Seventh Engineering Division Corp Ltd
General Contracting Co Ltd of China Construction Seventh Engineering Division Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Seventh Engineering Division Corp Ltd, General Contracting Co Ltd of China Construction Seventh Engineering Division Corp Ltd filed Critical China Construction Seventh Engineering Division Corp Ltd
Priority to CN202111160118.7A
Publication of CN113836919A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning

Abstract

The invention provides a building industry text error correction method based on transfer learning, which solves the problem that text error correction is difficult for multi-field composite documents, represented by building construction scheme documents, because labeled data are scarce. The method first establishes a building document corpus data set and an unlabeled related-field data set; it then performs text error correction with a BERT model, applying transfer learning to train the pre-trained BERT model on the unlabeled related-field data set so that it acquires vocabulary information from different subdivided fields; finally, training samples extracted from the building document corpus data set are used to retrain the transfer-learned BERT model so that it suits the building industry text error correction task. To dynamically adjust the pre-training tasks at different training stages, a pre-training coefficient is introduced, which improves the performance of the word order correction task. The invention achieves higher accuracy and recall and a lower false alarm rate on the error correction task.

Description

Building industry text error correction method based on transfer learning
Technical Field
The invention relates to the technical field of text error correction in the building industry, in particular to a text error correction method based on transfer learning in the building industry.
Background
The building industry is a pillar industry of the national economy and holds an irreplaceable position in social production and material life. In recent years, with the rapid development of China's construction industry, its scale has continuously expanded and construction market demand has grown steadily, covering industrial, civil, and public buildings. The building construction scheme plays a key role in building construction: on the basis of extensive theoretical analysis and field investigation, it comprehensively weighs various factors to ensure the feasibility and rationality of the scheme. With the informatization of the building industry, construction schemes have transitioned from paper to electronic form, which inevitably brings the problem of text errors. Such errors make the text information inaccurate and unauthoritative, may even introduce grammatical and semantic errors into sentences, reduce the readability of building industry texts, and in serious cases cause ambiguity and losses.
Text error correction is one of the most common Natural Language Processing (NLP) tasks and has important value in the auditing of building industry texts. For composite texts rich in proper nouns from multiple fields, such as building industry texts, the lack of labeled data makes deep-learning-based error correction algorithms perform poorly. Manual review alone is unrealistic for the massive volume of building industry text, so automatic detection and correction of text content is of great significance for text review in the building industry: it improves the correctness and standardization of language use and reduces manual workload.
Disclosure of Invention
Aiming at the problem in the prior art that the text error correction task is difficult due to the lack of multi-field composite document data, represented by building construction scheme documents, the invention provides a building industry text error correction method based on transfer learning.
In order to solve the above technical problems, the invention adopts the following technical scheme: a building industry text error correction method based on transfer learning, comprising the following steps:
Step S1: establish a text data set comprising a building document corpus data set and an unlabeled related-field data set.
Step S2: perform text error correction on the text data set with a BERT model, and pre-train the BERT model.
Step S3: enter the first stage: using a transfer learning method, train the pre-trained BERT model on the unlabeled related-field data set so that the BERT model acquires vocabulary information from different subdivided fields.
Step S4: after the first stage finishes, enter the second stage: extract part of the data in the building document corpus data set as training samples and use them to retrain the transfer-learned BERT model so that it suits the building industry text error correction task.
The pre-training tasks of the BERT model comprise an MLM task and an NSP task. During pre-training, to minimize the combined loss function of the two tasks, the MLM task and the NSP task are trained jointly, giving the loss function L(θ, θ₁, θ₂):

L(θ, θ₁, θ₂) = L₁(θ, θ₁) + L₂(θ, θ₂)  (1)

where θ denotes the parameters of the encoder portion of the BERT model; θ₁ denotes the parameters of the output layer connected to the encoder in the MLM task; θ₂ denotes the classifier parameters on top of the encoder in the NSP task; L₁(θ, θ₁) is the training loss of the MLM task; and L₂(θ, θ₂) is the training loss of the NSP task.
The MLM task randomly masks partial words of an input sentence and predicts the masked words with an unsupervised learning method. The training loss L₁(θ, θ₁) of the MLM task is:

L₁(θ, θ₁) = −Σ_{i=1}^{M} log p(m = mᵢ | θ, θ₁), mᵢ ∈ {1, 2, …, |V|}  (2)

where i is the traversal index; m is the predicted word; mᵢ is the correct word at position i; p is the probability of the prediction result; M is the number of randomly masked words of the input sentence in the MLM task (the size of the masked word set); and |V| is the dictionary size.
The training loss L₂(θ, θ₂) of the NSP task is:

L₂(θ, θ₂) = −Σ_{j=1}^{N} log p(n = nⱼ | θ, θ₂), nⱼ ∈ {IsNext, NotNext}  (3)

where j is the traversal index; n is the order prediction result; nⱼ is the correct label for pair j; N is the number of sentence pairs; IsNext denotes a correct order; and NotNext denotes an incorrect order.
A pre-training coefficient λ is introduced to dynamically adjust the pre-training tasks of the BERT model at different stages, giving the loss function L′(θ, θ₁, θ₂):

L′(θ, θ₁, θ₂) = L₁(θ, θ₁) + λ·L₂(θ, θ₂), λ ∈ {0, 1}  (4)
In the first stage, λ is set to 0, and only the MLM task of the BERT model is trained.
In the second stage, λ is set to 1, and the MLM task and the NSP task of the BERT model are trained jointly.
The building industry text error correction task comprises missing character completion, redundant character removal, wrong character correction, and word order correction.
The invention has the following beneficial effects: based on the BERT model and a transfer learning method, the pre-trained BERT model first learns from unlabeled data sets of related fields and is then retrained with training samples from the building document corpus data set, so that it achieves a better error correction effect on building industry texts. Meanwhile, to dynamically adjust the pre-training tasks at different training stages, a pre-training coefficient is introduced during pre-training, which improves the performance of the word order correction task. Compared with other existing models, the error correction model provided by the invention achieves higher precision and recall and a lower false alarm rate on the error correction task.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 shows the BERT model architecture;
FIG. 2 shows the pre-training process of the BERT model;
FIG. 3 illustrates the input of the BERT model;
FIG. 4 is a schematic architecture of the present invention;
FIG. 5 is a flow chart of transfer learning with a dynamically adjusted pre-training task;
FIG. 6 shows the results of ablation experiments comparing the BERT model of the present invention with model M1, model M2, and model M3, where (a) compares false alarm rates; (b) compares precision; (c) compares recall; and (d) compares F-Measure (F metric).
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art without creative effort on the basis of the embodiments of the present invention fall within the scope of protection of the present invention.
The invention provides a building industry text error correction method based on transfer learning, which comprises the following steps:
step S1: establishing a text data set, wherein the text data set comprises a building document corpus data set and a non-tag related field data set. The building document corpus data set is used as a target data set, and the target data set is utilized to train and verify the capability of the BERT model in the aspect of building field text error correction.
This embodiment uses a large number of construction scheme documents from the building industry, containing nearly ten thousand sentences that cover proper nouns from different fields, including geography, geology, chemical materials, construction equipment, and construction regulations. From these, 5,400 representative sentences are selected and divided into five groups of about 1,000 sentences each. One group is left unmodified and used to train and verify the BERT model; the other four groups are modified manually according to different text error types. These five groups of corpora constitute the building document corpus data set and are regarded as the target data set, which is subdivided into a training sample set and a test sample set: the training sample set trains the model's capability in building-field text error correction, and the test sample set verifies it. Because the target task has little data and is rich in proper nouns from different fields, this embodiment applies transfer learning: it exploits unlabeled data from related fields and, by training the BERT model and fine-tuning the relevant parameters during training, gives the model vocabulary recognition capability closer to the target task. In this embodiment, 1,538 professional documents in the five fields of geography, geology, chemical materials, construction equipment, and building regulations, totaling about 420,000 characters, are selected from Chinese document databases such as CNKI and Wanfang to form the unlabeled related-field data set, providing a training and learning corpus for the BERT model.
The building industry text error correction task mainly concerns four text error types, namely missing word completion, redundant word removal, wrong word correction, and word order correction, as shown in Table 1:
TABLE 1 error examples and types
[Table 1 appears as an image in the original publication; its contents (error examples and types) are not reproduced here.]
That is, when the other four groups are modified manually, they are modified according to the four text error types in Table 1, respectively. The unmodified group of sentences is corrected with the BERT model, and the four groups of modified sentences can then be compared to check the model's error correction capability. An input sentence may contain no error or one or more errors; if the BERT model finds no error, the program automatically returns "correct"; otherwise, the program automatically returns "error" and displays the position of the error together with the corrected sentence.
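For illustration only, the following minimal Python sketch (not the patent's implementation) mimics this checking behavior: it reports "correct" when the corrector leaves the input unchanged, and otherwise reports the differing spans together with the corrected sentence. The `correct_fn` argument and the toy example are hypothetical stand-ins for the BERT-based corrector:

```python
import difflib

def check_sentence(correct_fn, sentence):
    """Return 'correct' when the corrector leaves the sentence unchanged;
    otherwise report the differing spans and the corrected sentence."""
    corrected = correct_fn(sentence)
    if corrected == sentence:
        return {"status": "correct"}
    matcher = difflib.SequenceMatcher(None, sentence, corrected)
    # Collect all non-matching spans (substitutions, insertions, deletions).
    edits = [(tag, i1, i2, sentence[i1:i2], corrected[j1:j2])
             for tag, i1, i2, j1, j2 in matcher.get_opcodes()
             if tag != "equal"]
    return {"status": "error", "edits": edits, "corrected": corrected}

# Example with a toy corrector that fixes one wrong character (illustrative).
print(check_sentence(lambda s: s.replace("园", "圆"), "椭园形基坑"))
```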
Step S2: perform text error correction on the text data set with the BERT model, and pre-train the BERT model. BERT is a pre-trained language model for various Natural Language Processing (NLP) tasks, with bidirectional encoding and feature extraction capabilities. As shown in FIG. 1, the BERT model architecture is divided into an Embedding Layer, a Transformer Layer, and a Prediction Layer. As shown in FIG. 2, the pre-training of the BERT model comprises two unsupervised prediction tasks: MLM (Masked Language Model) and NSP (Next Sentence Prediction). The MLM task randomly masks 15% of the words in an input sentence and predicts the masked words with an unsupervised learning method. Because the fine-tuning stage has no [MASK] tokens, pre-training and fine-tuning are mismatched; to reduce this influence as much as possible, the BERT model adopts the following strategy: each of the 15% of words selected for masking is replaced by [MASK] with 80% probability, replaced by a random word with 10% probability, and left unchanged with 10% probability. The NSP task is in essence a binary classification task: during pre-training, the BERT model receives sentence pairs (e.g., A and B) as input and predicts whether B is the next sentence of A. With 50% probability B is the actual next sentence of A, and with 50% probability B is a random sentence from the corpus.
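For illustration, the following minimal Python sketch (not part of the original disclosure) implements the 80/10/10 masking strategy described above; the token list, vocabulary, and per-position selection procedure are simplified assumptions:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Corrupt a token sequence with BERT's MLM strategy and return the
    corrupted sequence plus a {position: original word} target map."""
    rng = random.Random(seed)
    out = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:        # ~15% of words are selected
            targets[i] = tok                # the model must predict this word
            r = rng.random()
            if r < 0.8:
                out[i] = MASK               # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = rng.choice(vocab)  # 10%: replace with a random word
            # else: 10% keep the current word unchanged
    return out, targets
```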
During pre-training of the BERT model, to minimize the combined loss function of the two tasks, the MLM task and the NSP task are trained jointly, giving the loss function L(θ, θ₁, θ₂):

L(θ, θ₁, θ₂) = L₁(θ, θ₁) + L₂(θ, θ₂)  (1)

where θ denotes the parameters of the encoder portion of the BERT model; θ₁ denotes the parameters of the output layer connected to the encoder in the MLM task; θ₂ denotes the classifier parameters on top of the encoder in the NSP task; L₁(θ, θ₁) is the training loss of the MLM task; and L₂(θ, θ₂) is the training loss of the NSP task.
For the loss function L₁(θ, θ₁) of the first part, the MLM task, let the number of masked words be M. Since prediction is a multi-class classification problem over a dictionary of size |V|, a negative log-likelihood loss is used; minimizing it is equivalent to maximizing the log-likelihood. The training loss L₁(θ, θ₁) of the MLM task is:

L₁(θ, θ₁) = −Σ_{i=1}^{M} log p(m = mᵢ | θ, θ₁), mᵢ ∈ {1, 2, …, |V|}  (2)

where i is the traversal index; m is the predicted word; mᵢ is the correct word at position i; p is the probability of the prediction result; M is the number of randomly masked words of the input sentence in the MLM task (the size of the masked word set); and |V| is the dictionary size.
For the loss function L₂(θ, θ₂) of the second part, the NSP task, since the NSP task is a binary classification problem, the training loss is:

L₂(θ, θ₂) = −Σ_{j=1}^{N} log p(n = nⱼ | θ, θ₂), nⱼ ∈ {IsNext, NotNext}  (3)

where j is the traversal index; n is the order prediction result; nⱼ is the correct label for pair j; N is the number of sentence pairs; IsNext denotes a correct order; and NotNext denotes an incorrect order.
Further, in order to dynamically adjust the pre-training tasks of the BERT model at different stages, a pre-training coefficient λ is introduced in this embodiment, giving the loss function L′(θ, θ₁, θ₂):

L′(θ, θ₁, θ₂) = L₁(θ, θ₁) + λ·L₂(θ, θ₂), λ ∈ {0, 1}  (4)
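For illustration, a minimal PyTorch sketch of the staged loss of formula (4) follows; the tensor shapes, the ignore label -100, and the 0/1 label encoding for NSP are assumptions, not taken from the patent:

```python
import torch
import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels, lam):
    """Formula (4): L' = L1 + lambda * L2.
    mlm_logits: (batch, seq_len, |V|); mlm_labels carries -100 at positions
    that were not masked, so only masked words contribute to L1.
    nsp_logits: (batch, 2); nsp_labels: 0 = IsNext, 1 = NotNext."""
    l1 = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                         mlm_labels.view(-1), ignore_index=-100)
    l2 = F.cross_entropy(nsp_logits, nsp_labels)
    return l1 + lam * l2  # lam = 0: MLM only; lam = 1: joint MLM + NSP
```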
For different Natural Language Processing (NLP) tasks, fine-tuning is performed on the basis of the pre-trained BERT model: the input and output of a specific task are plugged into the BERT model, and the Transformer's powerful attention mechanism models the downstream task. As shown in FIG. 3, the input of the BERT model is the sum of three vectors: Token Embeddings, Segment Embeddings, and Position Embeddings. The Token Embedding converts each word into a fixed-dimension vector; the Segment Embedding is used in the NSP task to distinguish the first sentence from the second; and the Position Embedding encodes positional information to represent word order. In addition, the BERT model adds two special tokens: [CLS] for classification tasks (it can be ignored for non-classification tasks) and [SEP] for separating two sentences.
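For illustration, a minimal PyTorch sketch of this input construction (FIG. 3) follows; the vocabulary size, hidden dimension, and maximum sequence length are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

class BertInput(nn.Module):
    """The model input is the element-wise sum of Token, Segment,
    and Position embeddings."""
    def __init__(self, vocab_size=21128, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(2, hidden)   # sentence A vs. sentence B
        self.position = nn.Embedding(max_len, hidden)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids) + self.segment(segment_ids)
                + self.position(positions))
```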
Step S3: enter the first stage: using a transfer learning method, train the pre-trained BERT model on the unlabeled related-field data set so that the BERT model acquires vocabulary information from different subdivided fields. That is, the pre-trained BERT model is applied to the unlabeled related-field data set with transfer learning: unsupervised training on a large amount of unlabeled corpus data lets the BERT model learn prior linguistic knowledge such as syntax and word senses, and the learned features are then used in the downstream task.
Step S4: after the first stage finishes, enter the second stage: extract part of the data in the building document corpus data set as training samples, namely the training samples in the training sample set of the target data set, and use them to retrain the transfer-learned BERT model, so that it better suits the building industry text error correction task involving multi-field texts and improves the detection and correction of text errors. The specific process is shown in FIG. 4.
Of the two pre-training tasks of the BERT model, the MLM task predicts masked words, while the NSP task learns the relationship between sentences to predict the order of two sentences. After unsupervised training on a large unlabeled related-field data set, the capability of the MLM task improves markedly, and proper nouns from different fields can be predicted accurately on the target data set. However, because the unlabeled related-field data set consists of whole articles from each field, upper and lower sentences within the same field are strongly correlated, whereas the building construction scheme documents of the target data set are composite documents covering multiple fields, and adjacent sentences may come from different fields. In this case, the NSP task in the first stage has a negative effect and ends up recognizing correct sentence order as wrong. To solve this word order misreporting problem, this embodiment enables or removes the NSP task at different stages. In the first stage, during unsupervised training on the unlabeled related-field data set, λ in formula (4) is set to 0, the NSP task is removed, and only the MLM task is trained; this strengthens the model's learning of proper nouns from different fields and ensures it is not affected by the strongly correlated sentences within a single field. In the second stage, λ in formula (4) is set to 1, NSP training is unlocked, and the MLM task and the NSP task are trained jointly. After fine-tuning on the target data set, the model's word order false alarm rate is effectively reduced and the performance of the word order correction task improves. The flow of transfer learning with a dynamically adjusted pre-training task is shown in FIG. 5; a minimal sketch of this two-stage procedure is given below.
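The sketch below outlines the two-stage schedule of FIG. 5, reusing the `pretraining_loss` sketch above; the model interface (returning MLM and NSP logits), the data loaders, the batch keys, and the epoch counts are assumptions for illustration only:

```python
def two_stage_transfer_training(model, optimizer,
                                unlabeled_loader, target_loader,
                                epochs_stage1=3, epochs_stage2=3):
    """Stage 1: lambda = 0, MLM only, on the unlabeled related-field corpus.
    Stage 2: lambda = 1, joint MLM + NSP, on the building document samples."""
    stages = [(0.0, unlabeled_loader, epochs_stage1),  # first stage
              (1.0, target_loader, epochs_stage2)]     # second stage
    for lam, loader, n_epochs in stages:
        for _ in range(n_epochs):
            for batch in loader:
                mlm_logits, nsp_logits = model(batch["input_ids"],
                                               batch["segment_ids"])
                loss = pretraining_loss(mlm_logits, batch["mlm_labels"],
                                        nsp_logits, batch["nsp_labels"], lam)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```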
In this embodiment, the accuracy and robustness of the proposed building industry text error correction method based on transfer learning are verified through experiments. First, the experimental environment is set up, as shown in Table 2:
TABLE 2 Experimental Environment
[Table 2 appears as an image in the original publication; its contents (the experimental environment) are not reproduced here.]
Next, the experimental evaluation indices are determined. Table 3 shows the confusion matrix used to evaluate the performance indices:
TABLE 3 confusion matrix
                         Marked as having errors    Marked as error-free
Actually has errors                TP                        FN
Actually error-free                FP                        TN
Here TP (True Positive) is the number of sentences that actually contain errors and are marked as containing errors; FN (False Negative) is the number of sentences that actually contain errors but are marked as error-free; FP (False Positive) is the number of sentences that are actually error-free but are marked as containing errors; and TN (True Negative) is the number of sentences that are actually error-free and are marked as error-free.
In this embodiment, the false alarm rate (FPR), precision (P), recall (R), and F-Measure (F₁) are used as experimental evaluation indices. Their calculation formulas are as follows:

FPR = FP / (FP + TN)  (5)
P = TP / (TP + FP)  (6)
R = TP / (TP + FN)  (7)
F₁ = 2 × P × R / (P + R)  (8)
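For illustration, the four indices can be computed directly from the confusion matrix counts; the following minimal sketch restates formulas (5)-(8):

```python
def evaluation_metrics(tp, fn, fp, tn):
    """Compute the four indices of formulas (5)-(8) from the
    confusion matrix counts of Table 3."""
    fpr = fp / (fp + tn)                                # formula (5), false alarm rate
    precision = tp / (tp + fp)                          # formula (6)
    recall = tp / (tp + fn)                             # formula (7)
    f1 = 2 * precision * recall / (precision + recall)  # formula (8)
    return fpr, precision, recall, f1
```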
for the division of the target data set, 70% of the corpus data set of the construction document may be set as a training sample set, and the remaining 30% may be set as a testing sample set. In order to verify the accuracy and robustness of the error correction method provided by the invention, firstly, the BERT model in the invention is compared with a commonly used deep learning text error correction model in an experiment, such as an LSTM model, a transducer _ self-attack model, a Seq2 Seq-attack model and the like. The four models were tested through test samples in the corpus dataset of construction documents, and the results are shown in table 4:
table 4 results of different model experiments
[Table 4 appears as an image in the original publication; its contents (results of the different models) are not reproduced here.]
As can be seen from Table 4, the BERT model of the invention achieves a clear improvement over the conventional deep learning text error correction models, with precision on the target data set reaching 81.4%.
The BERT model of the invention is then compared in ablation experiments with three other models: (1) the pre-trained BERT model with frozen parameters, applied directly to the target text error correction task, denoted M1; (2) the pre-trained BERT model fine-tuned on the training samples of the building document corpus data set and then applied to the target text error correction task, denoted M2; (3) the pre-trained BERT model with parameters frozen after transfer learning on the related-field data set, applied directly to the target text error correction task, denoted M3. FIG. 6 compares the ablation results, where (a) compares false alarm rates; (b) compares precision; (c) compares recall; and (d) compares F-Measure.
As can be seen from FIG. 6, the method provided by the invention significantly improves text error correction on the target data set. As shown in FIG. 6(a), because the target data set contains many proper nouns from different fields, directly using the pre-trained BERT model for text error correction without any further step yields an FPR as high as 19.8%. Fine-tuning the BERT model on the target data set reduces the FPR only slightly, because the target data set has few training samples and the training effect is not ideal. After transfer learning on the unlabeled related-field data set, the BERT model learns a large number of proper nouns from different fields; after subsequent fine-tuning, its FPR drops markedly to only 9.7%, improving the model's robustness. As can be seen from FIG. 6(b) and FIG. 6(c), the transfer-learned and fine-tuned BERT model improves error localization and correction: compared with the pre-trained BERT model, the model of the invention improves precision by 13% and recall by 12.8%. Furthermore, from the comparison of M2 with M1 and of the proposed model with M3, it can be concluded that fine-tuning on the target data set is important; and the comparison of M3 with M1 shows that transfer learning on the unlabeled related-field data set clearly improves the text error correction task, indicating that transfer learning helps greatly when target data are scarce.
Finally, to verify that the proposed method makes a clear advance on the word order correction task of the target data set, the transfer-learned and fine-tuned BERT model with the NSP task removed in the first stage (the model proposed by the invention) is compared experimentally with a transfer-learned and fine-tuned BERT model that retains the NSP task in the first stage (denoted M4), using the false alarm rate as the evaluation index of the word order correction task. The experimental results are classified according to the different text error types to obtain the false alarm rate of the word order correction task. The final experimental results are shown in Table 5:
TABLE 5 comparison of different model word order correction tasks
[Table 5 appears as an image in the original publication; its contents (word order correction comparison of the different models) are not reproduced here.]
As can be seen from Table 5, compared with M4, the transfer-learned and fine-tuned BERT model with the NSP task removed in the first stage reduces the false alarm rate of the word order correction task by 1.8%, which indicates that removing the NSP task in the first stage effectively reduces word order false alarms.
Because errors in building industry texts are difficult to locate with ordinary methods, the proposed building industry text error correction method based on transfer learning starts from the pre-trained BERT model, performs transfer learning through unsupervised training on the unlabeled related-field data set, and then retrains the transfer-learned BERT model with training samples from the building document corpus data set to complete the text error correction task. Meanwhile, to improve the performance of the word order correction task, the NSP task is enabled or removed at different stages. Finally, the experimental results are analyzed from multiple angles, verifying the accuracy and robustness of the proposed method.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (8)

1. A building industry text error correction method based on transfer learning, characterized by comprising the following steps:
Step S1: establishing a text data set, wherein the text data set comprises a building document corpus data set and an unlabeled related-field data set;
Step S2: performing text error correction on the text data set with a BERT model, and pre-training the BERT model;
Step S3: entering a first stage: training the pre-trained BERT model on the unlabeled related-field data set with a transfer learning method, so that the BERT model acquires vocabulary information from different subdivided fields;
Step S4: after the first stage finishes, entering a second stage: extracting part of the data in the building document corpus data set as training samples and retraining the transfer-learned BERT model with the training samples, so that the BERT model suits the building industry text error correction task.
2. The building industry text error correction method based on transfer learning of claim 1, wherein: the pre-training tasks of the BERT model comprise an MLM task and an NSP task; during pre-training of the BERT model, to minimize the combined loss function of the two tasks, the MLM task and the NSP task are trained jointly, giving the loss function L(θ, θ₁, θ₂):

L(θ, θ₁, θ₂) = L₁(θ, θ₁) + L₂(θ, θ₂)  (1)

where θ denotes the parameters of the encoder portion of the BERT model; θ₁ denotes the parameters of the output layer connected to the encoder in the MLM task; θ₂ denotes the classifier parameters on top of the encoder in the NSP task; L₁(θ, θ₁) is the training loss of the MLM task; and L₂(θ, θ₂) is the training loss of the NSP task.
3. The building industry text error correction method based on transfer learning of claim 2, wherein: the MLM task randomly masks partial words of the input sentence and predicts the masked words with an unsupervised learning method; the training loss L₁(θ, θ₁) of the MLM task is:

L₁(θ, θ₁) = −Σ_{i=1}^{M} log p(m = mᵢ | θ, θ₁), mᵢ ∈ {1, 2, …, |V|}  (2)

where i is the traversal index; m is the predicted word; mᵢ is the correct word at position i; p is the probability of the prediction result; M is the number of randomly masked words of the input sentence in the MLM task (the size of the masked word set); and |V| is the dictionary size.
4. The building industry text error correction method based on transfer learning of claim 3, wherein: the training loss L₂(θ, θ₂) of the NSP task is:

L₂(θ, θ₂) = −Σ_{j=1}^{N} log p(n = nⱼ | θ, θ₂), nⱼ ∈ {IsNext, NotNext}  (3)

where j is the traversal index; n is the order prediction result; nⱼ is the correct label for pair j; N is the number of sentence pairs; IsNext denotes a correct order; and NotNext denotes an incorrect order.
5. The building industry text error correction method based on transfer learning of claim 2 or 4, wherein: a pre-training coefficient λ is introduced to dynamically adjust the pre-training tasks of the BERT model at different stages, giving the loss function L′(θ, θ₁, θ₂):

L′(θ, θ₁, θ₂) = L₁(θ, θ₁) + λ·L₂(θ, θ₂), λ ∈ {0, 1}  (4)
6. The building industry text error correction method based on transfer learning of claim 5, wherein: in the first stage, λ is set to 0 and only the MLM task of the BERT model is trained.
7. The building industry text error correction method based on transfer learning of claim 5, wherein: in the second stage, λ is set to 1 and the MLM task and the NSP task of the BERT model are trained jointly.
8. The building industry text error correction method based on transfer learning of claim 1 or 7, wherein: the building industry text error correction task comprises missing character completion, redundant character removal, wrong character correction, and word order correction.
CN202111160118.7A 2021-09-30 2021-09-30 Building industry text error correction method based on transfer learning Pending CN113836919A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111160118.7A CN113836919A (en) 2021-09-30 2021-09-30 Building industry text error correction method based on transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111160118.7A CN113836919A (en) 2021-09-30 2021-09-30 Building industry text error correction method based on transfer learning

Publications (1)

Publication Number Publication Date
CN113836919A true CN113836919A (en) 2021-12-24

Family

ID=78967991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111160118.7A Pending CN113836919A (en) 2021-09-30 2021-09-30 Building industry text error correction method based on transfer learning

Country Status (1)

Country Link
CN (1) CN113836919A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596312A (en) * 2022-05-07 2022-06-07 中国科学院深圳先进技术研究院 Video processing method and device
CN114757169A (en) * 2022-03-22 2022-07-15 中国电子科技集团公司第十研究所 Self-adaptive small sample learning intelligent error correction method based on ALBERT model
CN115344668A (en) * 2022-07-05 2022-11-15 北京邮电大学 Multi-field and multi-disciplinary science and technology policy resource retrieval method and device
CN115983242A (en) * 2023-02-16 2023-04-18 北京有竹居网络技术有限公司 Text error correction method, system, electronic device and medium

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018120889A1 (en) * 2016-12-28 2018-07-05 平安科技(深圳)有限公司 Input sentence error correction method and device, electronic device, and medium
CN110852087A (en) * 2019-09-23 2020-02-28 腾讯科技(深圳)有限公司 Chinese error correction method and device, storage medium and electronic device
CN111709241A (en) * 2020-05-27 2020-09-25 西安交通大学 Named entity identification method oriented to network security field
CN111950540A (en) * 2020-07-24 2020-11-17 浙江师范大学 Knowledge point extraction method, system, device and medium based on deep learning
CN111985240A (en) * 2020-08-19 2020-11-24 腾讯云计算(长沙)有限责任公司 Training method of named entity recognition model, named entity recognition method and device
CN112270181A (en) * 2020-11-03 2021-01-26 北京明略软件系统有限公司 Sequence labeling method, system, computer readable storage medium and computer device
CN112347773A (en) * 2020-10-26 2021-02-09 北京诺道认知医学科技有限公司 Medical application model training method and device based on BERT model
CN112417884A (en) * 2020-11-05 2021-02-26 广州平云信息科技有限公司 Sentence semantic relevance judging method based on knowledge enhancement and knowledge migration
CN112417877A (en) * 2020-11-24 2021-02-26 广州平云信息科技有限公司 Text inclusion relation recognition method based on improved BERT
CN112559702A (en) * 2020-11-10 2021-03-26 西安理工大学 Transformer-based natural language problem generation method in civil construction information field
CN112667788A (en) * 2020-12-02 2021-04-16 中山大学 Novel BERTEXT-based multi-round dialogue natural language understanding model
CN113076739A (en) * 2021-04-09 2021-07-06 厦门快商通科技股份有限公司 Method and system for realizing cross-domain Chinese text error correction
CN113312454A (en) * 2021-06-17 2021-08-27 辽宁大学 Three-stage story reading understanding training method based on self-supervision
CN113326695A (en) * 2021-04-26 2021-08-31 东南大学 Emotion polarity analysis method based on transfer learning
CN113392209A (en) * 2020-10-26 2021-09-14 腾讯科技(深圳)有限公司 Text clustering method based on artificial intelligence, related equipment and storage medium

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018120889A1 (en) * 2016-12-28 2018-07-05 平安科技(深圳)有限公司 Input sentence error correction method and device, electronic device, and medium
CN110852087A (en) * 2019-09-23 2020-02-28 腾讯科技(深圳)有限公司 Chinese error correction method and device, storage medium and electronic device
CN111709241A (en) * 2020-05-27 2020-09-25 西安交通大学 Named entity identification method oriented to network security field
CN111950540A (en) * 2020-07-24 2020-11-17 浙江师范大学 Knowledge point extraction method, system, device and medium based on deep learning
CN111985240A (en) * 2020-08-19 2020-11-24 腾讯云计算(长沙)有限责任公司 Training method of named entity recognition model, named entity recognition method and device
CN112347773A (en) * 2020-10-26 2021-02-09 北京诺道认知医学科技有限公司 Medical application model training method and device based on BERT model
CN113392209A (en) * 2020-10-26 2021-09-14 腾讯科技(深圳)有限公司 Text clustering method based on artificial intelligence, related equipment and storage medium
CN112270181A (en) * 2020-11-03 2021-01-26 北京明略软件系统有限公司 Sequence labeling method, system, computer readable storage medium and computer device
CN112417884A (en) * 2020-11-05 2021-02-26 广州平云信息科技有限公司 Sentence semantic relevance judging method based on knowledge enhancement and knowledge migration
CN112559702A (en) * 2020-11-10 2021-03-26 西安理工大学 Transformer-based natural language problem generation method in civil construction information field
CN112417877A (en) * 2020-11-24 2021-02-26 广州平云信息科技有限公司 Text inclusion relation recognition method based on improved BERT
CN112667788A (en) * 2020-12-02 2021-04-16 中山大学 Novel BERTEXT-based multi-round dialogue natural language understanding model
CN113076739A (en) * 2021-04-09 2021-07-06 厦门快商通科技股份有限公司 Method and system for realizing cross-domain Chinese text error correction
CN113326695A (en) * 2021-04-26 2021-08-31 东南大学 Emotion polarity analysis method based on transfer learning
CN113312454A (en) * 2021-06-17 2021-08-27 辽宁大学 Three-stage story reading understanding training method based on self-supervision

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孔祥鹏; 吾守尔・斯拉木; 杨启萌; 李哲: "Uyghur Named Entity Recognition Based on Transfer Learning" (基于迁移学习的维吾尔语命名实体识别), Journal of Northeast Normal University (Natural Science Edition), no. 02, 20 June 2020 (2020-06-20) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114757169A (en) * 2022-03-22 2022-07-15 中国电子科技集团公司第十研究所 Self-adaptive small sample learning intelligent error correction method based on ALBERT model
CN114596312A (en) * 2022-05-07 2022-06-07 中国科学院深圳先进技术研究院 Video processing method and device
CN115344668A (en) * 2022-07-05 2022-11-15 北京邮电大学 Multi-field and multi-disciplinary science and technology policy resource retrieval method and device
CN115983242A (en) * 2023-02-16 2023-04-18 北京有竹居网络技术有限公司 Text error correction method, system, electronic device and medium

Similar Documents

Publication Publication Date Title
CN113836919A (en) Building industry text error correction method based on transfer learning
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN110335653B (en) Non-standard medical record analysis method based on openEHR medical record format
CN111738007B (en) Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network
CN110688489B (en) Knowledge graph deduction method and device based on interactive attention and storage medium
CN110276069B (en) Method, system and storage medium for automatically detecting Chinese braille error
CN116127952A (en) Multi-granularity Chinese text error correction method and device
CN116306600B (en) MacBert-based Chinese text error correction method
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115357719A (en) Power audit text classification method and device based on improved BERT model
CN115510863A (en) Question matching task oriented data enhancement method
CN112686040B (en) Event reality detection method based on graph recurrent neural network
CN116562295A (en) Method for identifying enhanced semantic named entity for text in bridge field
Arslan et al. Detecting and correcting automatic speech recognition errors with a new model
CN114153942B (en) Event time sequence relation extraction method based on dynamic attention mechanism
CN115757815A (en) Knowledge graph construction method and device and storage medium
CN112528003B (en) Multi-item selection question-answering method based on semantic sorting and knowledge correction
CN114385803A (en) Extraction type reading understanding method based on external knowledge and segment selection
CN114692615A (en) Small sample semantic graph recognition method for small languages
CN115358219A (en) Chinese spelling error correction method integrating unsupervised learning and self-supervised learning
CN115809655A (en) Chinese character symbol correction method and system based on attribution network and BERT
CN114648029A (en) Electric power field named entity identification method based on BiLSTM-CRF model
CN114357166A (en) Text classification method based on deep learning
CN112784587A (en) Text similarity measurement method and device based on multi-model fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination