CN113836919A - Building industry text error correction method based on transfer learning

Building industry text error correction method based on transfer learning

Info

Publication number
CN113836919A
Authority
CN
China
Prior art keywords
task
error correction
training
bert model
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111160118.7A
Other languages
Chinese (zh)
Inventor
侯振国
何海英
张中善
杨伟涛
李佳男
张传浩
张培聪
孙维东
阴栋阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Seventh Engineering Division Corp Ltd
General Contracting Co Ltd of China Construction Seventh Engineering Division Corp Ltd
Original Assignee
China Construction Seventh Engineering Division Corp Ltd
General Contracting Co Ltd of China Construction Seventh Engineering Division Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Seventh Engineering Division Corp Ltd, General Contracting Co Ltd of China Construction Seventh Engineering Division Corp Ltd filed Critical China Construction Seventh Engineering Division Corp Ltd
Priority to CN202111160118.7A
Publication of CN113836919A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning

Abstract

The invention provides a building industry text error correction method based on transfer learning, which solves the problem that text error correction is difficult for multi-field composite documents, represented by building construction scheme documents, because labeled data are scarce. The method first establishes a building document corpus data set and an unlabeled related-field data set; it then performs text error correction with a BERT model, applying transfer learning to train the pre-trained BERT model on the unlabeled related-field data set so that it acquires vocabulary information from different subdivided fields; finally, training samples extracted from the building document corpus data set are used to retrain the transfer-learned BERT model so that it suits the building industry text error correction task. To dynamically adjust the pre-training tasks at different training stages, a pre-training coefficient is introduced, which improves the performance of the word order correction task. The invention achieves higher accuracy and recall and a lower false alarm rate on the error correction task.

Description

Building industry text error correction method based on transfer learning
Technical Field
The invention relates to the technical field of text error correction in the building industry, in particular to a text error correction method based on transfer learning in the building industry.
Background
The building industry is a pillar industry of the national economy and holds an irreplaceable position in social production and material life. In recent years, with the rapid development of China's construction industry, its scale has continuously expanded and construction market demand has grown steadily, covering industrial, civil, and public buildings. The building construction scheme plays a key role in building construction: on the basis of extensive theoretical analysis and field investigation, it comprehensively weighs various factors to ensure the feasibility and rationality of the scheme. With the informatization of the building industry, construction schemes have transitioned from paper to electronic form, which inevitably brings the problem of text errors. Such errors make the text information inaccurate and unauthoritative, may even introduce grammatical and semantic errors into sentences, reduce the readability of building industry texts, and in serious cases cause ambiguity and losses.
Text error correction is one of the most common Natural Language Processing (NLP) tasks and has important value in the auditing of building industry texts. For composite texts rich in proper nouns from multiple fields, such as building industry texts, the lack of labeled data makes deep-learning-based error correction algorithms perform poorly. Manual review alone is unrealistic for the massive volume of building industry text, so automatic detection and correction of text content is of great significance for text review in the building industry: it improves the correctness and standardization of language use and reduces manual workload.
Disclosure of Invention
Aiming at the problem in the prior art that the text error correction task is difficult due to the lack of multi-field composite document data, represented by building construction scheme documents, the invention provides a building industry text error correction method based on transfer learning.
In order to solve the above technical problems, the invention adopts the following technical scheme: a building industry text error correction method based on transfer learning, comprising the following steps:
Step S1: establish a text data set comprising a building document corpus data set and an unlabeled related-field data set.
Step S2: perform text error correction on the text data set with a BERT model, and pre-train the BERT model.
Step S3: enter the first stage: using a transfer learning method, train the pre-trained BERT model on the unlabeled related-field data set so that the BERT model acquires vocabulary information from different subdivided fields.
Step S4: after the first stage finishes, enter the second stage: extract part of the data in the building document corpus data set as training samples and use them to retrain the transfer-learned BERT model so that it suits the building industry text error correction task.
The pre-training tasks of the BERT model comprise an MLM task and an NSP task. During pre-training, to minimize the combined loss function of the two tasks, the MLM task and the NSP task are trained jointly, giving the loss function L(θ, θ₁, θ₂):

L(θ, θ₁, θ₂) = L₁(θ, θ₁) + L₂(θ, θ₂)  (1)

where θ denotes the parameters of the encoder portion of the BERT model; θ₁ denotes the parameters of the output layer connected to the encoder in the MLM task; θ₂ denotes the classifier parameters on top of the encoder in the NSP task; L₁(θ, θ₁) is the training loss of the MLM task; and L₂(θ, θ₂) is the training loss of the NSP task.
The MLM task randomly masks partial words of an input sentence and predicts the masked words with an unsupervised learning method. The training loss L₁(θ, θ₁) of the MLM task is:

L₁(θ, θ₁) = −Σ_{i=1}^{M} log p(m = mᵢ | θ, θ₁), mᵢ ∈ {1, 2, …, |V|}  (2)

where i is the traversal index; m is the predicted word; mᵢ is the correct word at position i; p is the probability of the prediction result; M is the number of randomly masked words of the input sentence in the MLM task (the size of the masked word set); and |V| is the dictionary size.
The training loss L₂(θ, θ₂) of the NSP task is:

L₂(θ, θ₂) = −Σ_{j=1}^{N} log p(n = nⱼ | θ, θ₂), nⱼ ∈ {IsNext, NotNext}  (3)

where j is the traversal index; n is the order prediction result; nⱼ is the correct label for pair j; N is the number of sentence pairs; IsNext denotes a correct order; and NotNext denotes an incorrect order.
A pre-training coefficient λ is introduced to dynamically adjust the pre-training tasks of the BERT model at different stages, giving the loss function L′(θ, θ₁, θ₂):

L′(θ, θ₁, θ₂) = L₁(θ, θ₁) + λ·L₂(θ, θ₂), λ ∈ {0, 1}  (4)
In the first stage, λ is set to 0, and only the MLM task of the BERT model is trained.
In the second stage, λ is set to 1, and the MLM task and the NSP task of the BERT model are trained jointly.
The building industry text error correction task comprises missing character completion, redundant character removal, wrong character correction, and word order correction.
The invention has the following beneficial effects: based on the BERT model and a transfer learning method, the pre-trained BERT model first learns from unlabeled data sets of related fields and is then retrained with training samples from the building document corpus data set, so that it achieves a better error correction effect on building industry texts. Meanwhile, to dynamically adjust the pre-training tasks at different training stages, a pre-training coefficient is introduced during pre-training, which improves the performance of the word order correction task. Compared with other existing models, the error correction model provided by the invention achieves higher precision and recall and a lower false alarm rate on the error correction task.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 shows the BERT model architecture;
FIG. 2 shows the pre-training process of the BERT model;
FIG. 3 illustrates the input of the BERT model;
FIG. 4 is a schematic architecture of the present invention;
FIG. 5 is a flow chart of transfer learning with a dynamically adjusted pre-training task;
FIG. 6 shows the results of ablation experiments comparing the BERT model of the present invention with model M1, model M2, and model M3, where (a) compares false alarm rates; (b) compares precision; (c) compares recall; and (d) compares F-Measure (F metric).
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art without creative effort on the basis of the embodiments of the present invention fall within the scope of protection of the present invention.
The invention provides a building industry text error correction method based on transfer learning, which comprises the following steps:
step S1: establishing a text data set, wherein the text data set comprises a building document corpus data set and a non-tag related field data set. The building document corpus data set is used as a target data set, and the target data set is utilized to train and verify the capability of the BERT model in the aspect of building field text error correction.
This embodiment uses a large number of construction scheme documents from the building industry, containing nearly ten thousand sentences that cover proper nouns from different fields, including geography, geology, chemical materials, construction equipment, and construction regulations. From these, 5,400 representative sentences are selected and divided into five groups of about 1,000 sentences each. One group is left unmodified and used to train and verify the BERT model; the other four groups are modified manually according to different text error types. These five groups of corpora constitute the building document corpus data set and are regarded as the target data set, which is subdivided into a training sample set and a test sample set: the training sample set trains the model's capability in building-field text error correction, and the test sample set verifies it. Because the target task has little data and is rich in proper nouns from different fields, this embodiment applies transfer learning: it exploits unlabeled data from related fields and, by training the BERT model and fine-tuning the relevant parameters during training, gives the model vocabulary recognition capability closer to the target task. In this embodiment, 1,538 professional documents in the five fields of geography, geology, chemical materials, construction equipment, and building regulations, totaling about 420,000 characters, are selected from Chinese document databases such as CNKI and Wanfang to form the unlabeled related-field data set, providing a training and learning corpus for the BERT model.
The building industry text error correction task mainly concerns four text error types, namely missing word completion, redundant word removal, wrong word correction, and word order correction, as shown in Table 1:
TABLE 1 error examples and types
[Table 1 appears as an image in the original publication; its contents (error examples and types) are not reproduced here.]
That is, when the other four groups are modified manually, they are modified according to the four text error types in Table 1, respectively. The unmodified group of sentences is corrected with the BERT model, and the four groups of modified sentences can then be compared to check the model's error correction capability. An input sentence may contain no error or one or more errors; if the BERT model finds no error, the program automatically returns "correct"; otherwise, the program automatically returns "error" and displays the position of the error together with the corrected sentence.
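For illustration only, the following minimal Python sketch (not the patent's implementation) mimics this checking behavior: it reports "correct" when the corrector leaves the input unchanged, and otherwise reports the differing spans together with the corrected sentence. The `correct_fn` argument and the toy example are hypothetical stand-ins for the BERT-based corrector:

```python
import difflib

def check_sentence(correct_fn, sentence):
    """Return 'correct' when the corrector leaves the sentence unchanged;
    otherwise report the differing spans and the corrected sentence."""
    corrected = correct_fn(sentence)
    if corrected == sentence:
        return {"status": "correct"}
    matcher = difflib.SequenceMatcher(None, sentence, corrected)
    # Collect all non-matching spans (substitutions, insertions, deletions).
    edits = [(tag, i1, i2, sentence[i1:i2], corrected[j1:j2])
             for tag, i1, i2, j1, j2 in matcher.get_opcodes()
             if tag != "equal"]
    return {"status": "error", "edits": edits, "corrected": corrected}

# Example with a toy corrector that fixes one wrong character (illustrative).
print(check_sentence(lambda s: s.replace("园", "圆"), "椭园形基坑"))
```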
Step S2: perform text error correction on the text data set with the BERT model, and pre-train the BERT model. BERT is a pre-trained language model for various Natural Language Processing (NLP) tasks, with bidirectional encoding and feature extraction capabilities. As shown in FIG. 1, the BERT model architecture is divided into an Embedding Layer, a Transformer Layer, and a Prediction Layer. As shown in FIG. 2, the pre-training of the BERT model comprises two unsupervised prediction tasks: MLM (Masked Language Model) and NSP (Next Sentence Prediction). The MLM task randomly masks 15% of the words in an input sentence and predicts the masked words with an unsupervised learning method. Because the fine-tuning stage has no [MASK] tokens, pre-training and fine-tuning are mismatched; to reduce this influence as much as possible, the BERT model adopts the following strategy: each of the 15% of words selected for masking is replaced by [MASK] with 80% probability, replaced by a random word with 10% probability, and left unchanged with 10% probability. The NSP task is in essence a binary classification task: during pre-training, the BERT model receives sentence pairs (e.g., A and B) as input and predicts whether B is the next sentence of A. With 50% probability B is the actual next sentence of A, and with 50% probability B is a random sentence from the corpus.
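For illustration, the following minimal Python sketch (not part of the original disclosure) implements the 80/10/10 masking strategy described above; the token list, vocabulary, and per-position selection procedure are simplified assumptions:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Corrupt a token sequence with BERT's MLM strategy and return the
    corrupted sequence plus a {position: original word} target map."""
    rng = random.Random(seed)
    out = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:        # ~15% of words are selected
            targets[i] = tok                # the model must predict this word
            r = rng.random()
            if r < 0.8:
                out[i] = MASK               # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = rng.choice(vocab)  # 10%: replace with a random word
            # else: 10% keep the current word unchanged
    return out, targets
```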
During pre-training of the BERT model, to minimize the combined loss function of the two tasks, the MLM task and the NSP task are trained jointly, giving the loss function L(θ, θ₁, θ₂):

L(θ, θ₁, θ₂) = L₁(θ, θ₁) + L₂(θ, θ₂)  (1)

where θ denotes the parameters of the encoder portion of the BERT model; θ₁ denotes the parameters of the output layer connected to the encoder in the MLM task; θ₂ denotes the classifier parameters on top of the encoder in the NSP task; L₁(θ, θ₁) is the training loss of the MLM task; and L₂(θ, θ₂) is the training loss of the NSP task.
For the loss function L₁(θ, θ₁) of the first part, the MLM task, let the number of masked words be M. Since prediction is a multi-class classification problem over a dictionary of size |V|, a negative log-likelihood loss is used; minimizing it is equivalent to maximizing the log-likelihood. The training loss L₁(θ, θ₁) of the MLM task is:

L₁(θ, θ₁) = −Σ_{i=1}^{M} log p(m = mᵢ | θ, θ₁), mᵢ ∈ {1, 2, …, |V|}  (2)

where i is the traversal index; m is the predicted word; mᵢ is the correct word at position i; p is the probability of the prediction result; M is the number of randomly masked words of the input sentence in the MLM task (the size of the masked word set); and |V| is the dictionary size.
For the loss function L₂(θ, θ₂) of the second part, the NSP task, since the NSP task is a binary classification problem, the training loss is:

L₂(θ, θ₂) = −Σ_{j=1}^{N} log p(n = nⱼ | θ, θ₂), nⱼ ∈ {IsNext, NotNext}  (3)

where j is the traversal index; n is the order prediction result; nⱼ is the correct label for pair j; N is the number of sentence pairs; IsNext denotes a correct order; and NotNext denotes an incorrect order.
Further, in order to dynamically adjust the pre-training tasks of the BERT model at different stages, a pre-training coefficient λ is introduced in this embodiment, giving the loss function L′(θ, θ₁, θ₂):

L′(θ, θ₁, θ₂) = L₁(θ, θ₁) + λ·L₂(θ, θ₂), λ ∈ {0, 1}  (4)
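For illustration, a minimal PyTorch sketch of the staged loss of formula (4) follows; the tensor shapes, the ignore label -100, and the 0/1 label encoding for NSP are assumptions, not taken from the patent:

```python
import torch
import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels, lam):
    """Formula (4): L' = L1 + lambda * L2.
    mlm_logits: (batch, seq_len, |V|); mlm_labels carries -100 at positions
    that were not masked, so only masked words contribute to L1.
    nsp_logits: (batch, 2); nsp_labels: 0 = IsNext, 1 = NotNext."""
    l1 = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                         mlm_labels.view(-1), ignore_index=-100)
    l2 = F.cross_entropy(nsp_logits, nsp_labels)
    return l1 + lam * l2  # lam = 0: MLM only; lam = 1: joint MLM + NSP
```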
For different Natural Language Processing (NLP) tasks, fine-tuning is performed on the basis of the pre-trained BERT model: the input and output of a specific task are plugged into the BERT model, and the Transformer's powerful attention mechanism models the downstream task. As shown in FIG. 3, the input of the BERT model is the sum of three vectors: Token Embeddings, Segment Embeddings, and Position Embeddings. The Token Embedding converts each word into a fixed-dimension vector; the Segment Embedding is used in the NSP task to distinguish the first sentence from the second; and the Position Embedding encodes positional information to represent word order. In addition, the BERT model adds two special tokens: [CLS] for classification tasks (it can be ignored for non-classification tasks) and [SEP] for separating two sentences.
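For illustration, a minimal PyTorch sketch of this input construction (FIG. 3) follows; the vocabulary size, hidden dimension, and maximum sequence length are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

class BertInput(nn.Module):
    """The model input is the element-wise sum of Token, Segment,
    and Position embeddings."""
    def __init__(self, vocab_size=21128, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(2, hidden)   # sentence A vs. sentence B
        self.position = nn.Embedding(max_len, hidden)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids) + self.segment(segment_ids)
                + self.position(positions))
```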
Step S3: enter the first stage: using a transfer learning method, train the pre-trained BERT model on the unlabeled related-field data set so that the BERT model acquires vocabulary information from different subdivided fields. That is, the pre-trained BERT model is applied to the unlabeled related-field data set with transfer learning: unsupervised training on a large amount of unlabeled corpus data lets the BERT model learn prior linguistic knowledge such as syntax and word senses, and the learned features are then used in the downstream task.
Step S4: after the first stage finishes, enter the second stage: extract part of the data in the building document corpus data set as training samples, namely the training samples in the training sample set of the target data set, and use them to retrain the transfer-learned BERT model, so that it better suits the building industry text error correction task involving multi-field texts and improves the detection and correction of text errors. The specific process is shown in FIG. 4.
Of the two pre-training tasks of the BERT model, the MLM task predicts masked words, while the NSP task learns the relationship between sentences to predict the order of two sentences. After unsupervised training on a large unlabeled related-field data set, the capability of the MLM task improves markedly, and proper nouns from different fields can be predicted accurately on the target data set. However, because the unlabeled related-field data set consists of whole articles from each field, upper and lower sentences within the same field are strongly correlated, whereas the building construction scheme documents of the target data set are composite documents covering multiple fields, and adjacent sentences may come from different fields. In this case, the NSP task in the first stage has a negative effect and ends up recognizing correct sentence order as wrong. To solve this word order misreporting problem, this embodiment enables or removes the NSP task at different stages. In the first stage, during unsupervised training on the unlabeled related-field data set, λ in formula (4) is set to 0, the NSP task is removed, and only the MLM task is trained; this strengthens the model's learning of proper nouns from different fields and ensures it is not affected by the strongly correlated sentences within a single field. In the second stage, λ in formula (4) is set to 1, NSP training is unlocked, and the MLM task and the NSP task are trained jointly. After fine-tuning on the target data set, the model's word order false alarm rate is effectively reduced and the performance of the word order correction task improves. The flow of transfer learning with a dynamically adjusted pre-training task is shown in FIG. 5; a minimal sketch of this two-stage procedure is given below.
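The sketch below outlines the two-stage schedule of FIG. 5, reusing the `pretraining_loss` sketch above; the model interface (returning MLM and NSP logits), the data loaders, the batch keys, and the epoch counts are assumptions for illustration only:

```python
def two_stage_transfer_training(model, optimizer,
                                unlabeled_loader, target_loader,
                                epochs_stage1=3, epochs_stage2=3):
    """Stage 1: lambda = 0, MLM only, on the unlabeled related-field corpus.
    Stage 2: lambda = 1, joint MLM + NSP, on the building document samples."""
    stages = [(0.0, unlabeled_loader, epochs_stage1),  # first stage
              (1.0, target_loader, epochs_stage2)]     # second stage
    for lam, loader, n_epochs in stages:
        for _ in range(n_epochs):
            for batch in loader:
                mlm_logits, nsp_logits = model(batch["input_ids"],
                                               batch["segment_ids"])
                loss = pretraining_loss(mlm_logits, batch["mlm_labels"],
                                        nsp_logits, batch["nsp_labels"], lam)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```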
In this embodiment, the accuracy and robustness of the proposed building industry text error correction method based on transfer learning are verified through experiments. First, the experimental environment is set up, as shown in Table 2:
TABLE 2 Experimental Environment
[Table 2 appears as an image in the original publication; its contents (the experimental environment) are not reproduced here.]
Next, the experimental evaluation indices are determined. Table 3 shows the confusion matrix used to evaluate the performance indices:
TABLE 3 confusion matrix
                         Marked as having errors    Marked as error-free
Actually has errors                TP                        FN
Actually error-free                FP                        TN
Here TP (True Positive) is the number of sentences that actually contain errors and are marked as containing errors; FN (False Negative) is the number of sentences that actually contain errors but are marked as error-free; FP (False Positive) is the number of sentences that are actually error-free but are marked as containing errors; and TN (True Negative) is the number of sentences that are actually error-free and are marked as error-free.
In this embodiment, the false alarm rate (FPR), precision (P), recall (R), and F-Measure (F₁) are used as experimental evaluation indices. Their calculation formulas are as follows:

FPR = FP / (FP + TN)  (5)
P = TP / (TP + FP)  (6)
R = TP / (TP + FN)  (7)
F₁ = 2 × P × R / (P + R)  (8)
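For illustration, the four indices can be computed directly from the confusion matrix counts; the following minimal sketch restates formulas (5)-(8):

```python
def evaluation_metrics(tp, fn, fp, tn):
    """Compute the four indices of formulas (5)-(8) from the
    confusion matrix counts of Table 3."""
    fpr = fp / (fp + tn)                                # formula (5), false alarm rate
    precision = tp / (tp + fp)                          # formula (6)
    recall = tp / (tp + fn)                             # formula (7)
    f1 = 2 * precision * recall / (precision + recall)  # formula (8)
    return fpr, precision, recall, f1
```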
for the division of the target data set, 70% of the corpus data set of the construction document may be set as a training sample set, and the remaining 30% may be set as a testing sample set. In order to verify the accuracy and robustness of the error correction method provided by the invention, firstly, the BERT model in the invention is compared with a commonly used deep learning text error correction model in an experiment, such as an LSTM model, a transducer _ self-attack model, a Seq2 Seq-attack model and the like. The four models were tested through test samples in the corpus dataset of construction documents, and the results are shown in table 4:
table 4 results of different model experiments
[Table 4 appears as an image in the original publication; its contents (results of the different models) are not reproduced here.]
As can be seen from Table 4, the BERT model of the invention achieves a clear improvement over the conventional deep learning text error correction models, with precision on the target data set reaching 81.4%.
The BERT model of the invention is then compared in ablation experiments with three other models: (1) the pre-trained BERT model with frozen parameters, applied directly to the target text error correction task, denoted M1; (2) the pre-trained BERT model fine-tuned on the training samples of the building document corpus data set and then applied to the target text error correction task, denoted M2; (3) the pre-trained BERT model with parameters frozen after transfer learning on the related-field data set, applied directly to the target text error correction task, denoted M3. FIG. 6 compares the ablation results, where (a) compares false alarm rates; (b) compares precision; (c) compares recall; and (d) compares F-Measure.
As can be seen from FIG. 6, the method provided by the invention significantly improves text error correction on the target data set. As shown in FIG. 6(a), because the target data set contains many proper nouns from different fields, directly using the pre-trained BERT model for text error correction without any further step yields an FPR as high as 19.8%. Fine-tuning the BERT model on the target data set reduces the FPR only slightly, because the target data set has few training samples and the training effect is not ideal. After transfer learning on the unlabeled related-field data set, the BERT model learns a large number of proper nouns from different fields; after subsequent fine-tuning, its FPR drops markedly to only 9.7%, improving the model's robustness. As can be seen from FIG. 6(b) and FIG. 6(c), the transfer-learned and fine-tuned BERT model improves error localization and correction: compared with the pre-trained BERT model, the model of the invention improves precision by 13% and recall by 12.8%. Furthermore, from the comparison of M2 with M1 and of the proposed model with M3, it can be concluded that fine-tuning on the target data set is important; and the comparison of M3 with M1 shows that transfer learning on the unlabeled related-field data set clearly improves the text error correction task, indicating that transfer learning helps greatly when target data are scarce.
Finally, to verify that the proposed method makes a clear advance on the word order correction task of the target data set, the transfer-learned and fine-tuned BERT model with the NSP task removed in the first stage (the model proposed by the invention) is compared experimentally with a transfer-learned and fine-tuned BERT model that retains the NSP task in the first stage (denoted M4), using the false alarm rate as the evaluation index of the word order correction task. The experimental results are classified according to the different text error types to obtain the false alarm rate of the word order correction task. The final experimental results are shown in Table 5:
TABLE 5 comparison of different model word order correction tasks
[Table 5 appears as an image in the original publication; its contents (word order correction comparison of the different models) are not reproduced here.]
As can be seen from Table 5, compared with M4, the transfer-learned and fine-tuned BERT model with the NSP task removed in the first stage reduces the false alarm rate of the word order correction task by 1.8%, which indicates that removing the NSP task in the first stage effectively reduces word order false alarms.
Because errors in building industry texts are difficult to locate with ordinary methods, the proposed building industry text error correction method based on transfer learning starts from the pre-trained BERT model, performs transfer learning through unsupervised training on the unlabeled related-field data set, and then retrains the transfer-learned BERT model with training samples from the building document corpus data set to complete the text error correction task. Meanwhile, to improve the performance of the word order correction task, the NSP task is enabled or removed at different stages. Finally, the experimental results are analyzed from multiple angles, verifying the accuracy and robustness of the proposed method.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (8)

1. A building industry text error correction method based on transfer learning, characterized by comprising the following steps:
Step S1: establishing a text data set, wherein the text data set comprises a building document corpus data set and an unlabeled related-field data set;
Step S2: performing text error correction on the text data set with a BERT model, and pre-training the BERT model;
Step S3: entering a first stage: training the pre-trained BERT model on the unlabeled related-field data set with a transfer learning method, so that the BERT model acquires vocabulary information from different subdivided fields;
Step S4: after the first stage finishes, entering a second stage: extracting part of the data in the building document corpus data set as training samples and retraining the transfer-learned BERT model with the training samples, so that the BERT model suits the building industry text error correction task.
2. The building industry text error correction method based on transfer learning of claim 1, wherein: the pre-training tasks of the BERT model comprise an MLM task and an NSP task; during pre-training of the BERT model, to minimize the combined loss function of the two tasks, the MLM task and the NSP task are trained jointly, giving the loss function L(θ, θ₁, θ₂):

L(θ, θ₁, θ₂) = L₁(θ, θ₁) + L₂(θ, θ₂)  (1)

where θ denotes the parameters of the encoder portion of the BERT model; θ₁ denotes the parameters of the output layer connected to the encoder in the MLM task; θ₂ denotes the classifier parameters on top of the encoder in the NSP task; L₁(θ, θ₁) is the training loss of the MLM task; and L₂(θ, θ₂) is the training loss of the NSP task.
3. The building industry text error correction method based on transfer learning of claim 2, wherein: the MLM task randomly masks partial words of the input sentence and predicts the masked words with an unsupervised learning method; the training loss L₁(θ, θ₁) of the MLM task is:

L₁(θ, θ₁) = −Σ_{i=1}^{M} log p(m = mᵢ | θ, θ₁), mᵢ ∈ {1, 2, …, |V|}  (2)

where i is the traversal index; m is the predicted word; mᵢ is the correct word at position i; p is the probability of the prediction result; M is the number of randomly masked words of the input sentence in the MLM task (the size of the masked word set); and |V| is the dictionary size.
4. The building industry text error correction method based on transfer learning of claim 3, wherein: the training loss L₂(θ, θ₂) of the NSP task is:

L₂(θ, θ₂) = −Σ_{j=1}^{N} log p(n = nⱼ | θ, θ₂), nⱼ ∈ {IsNext, NotNext}  (3)

where j is the traversal index; n is the order prediction result; nⱼ is the correct label for pair j; N is the number of sentence pairs; IsNext denotes a correct order; and NotNext denotes an incorrect order.
5. The building industry text error correction method based on transfer learning of claim 2 or 4, wherein: a pre-training coefficient λ is introduced to dynamically adjust the pre-training tasks of the BERT model at different stages, giving the loss function L′(θ, θ₁, θ₂):

L′(θ, θ₁, θ₂) = L₁(θ, θ₁) + λ·L₂(θ, θ₂), λ ∈ {0, 1}  (4)
6. The building industry text error correction method based on transfer learning of claim 5, wherein: in the first stage, λ is set to 0 and only the MLM task of the BERT model is trained.
7. The building industry text error correction method based on transfer learning of claim 5, wherein: in the second stage, λ is set to 1 and the MLM task and the NSP task of the BERT model are trained jointly.
8. The building industry text error correction method based on transfer learning of claim 1 or 7, wherein: the building industry text error correction task comprises missing character completion, redundant character removal, wrong character correction, and word order correction.
CN202111160118.7A 2021-09-30 2021-09-30 Building industry text error correction method based on transfer learning Pending CN113836919A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111160118.7A CN113836919A (en) 2021-09-30 2021-09-30 Building industry text error correction method based on transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111160118.7A CN113836919A (en) 2021-09-30 2021-09-30 Building industry text error correction method based on transfer learning

Publications (1)

Publication Number Publication Date
CN113836919A true CN113836919A (en) 2021-12-24

Family

ID=78967991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111160118.7A Pending CN113836919A (en) 2021-09-30 2021-09-30 Building industry text error correction method based on transfer learning

Country Status (1)

Country Link
CN (1) CN113836919A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596312A (en) * 2022-05-07 2022-06-07 中国科学院深圳先进技术研究院 Video processing method and device
CN114757169A (en) * 2022-03-22 2022-07-15 中国电子科技集团公司第十研究所 Self-adaptive small sample learning intelligent error correction method based on ALBERT model
CN115344668A (en) * 2022-07-05 2022-11-15 北京邮电大学 Multi-field and multi-disciplinary science and technology policy resource retrieval method and device
CN115983242A (en) * 2023-02-16 2023-04-18 北京有竹居网络技术有限公司 Text error correction method, system, electronic device and medium

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018120889A1 (en) * 2016-12-28 2018-07-05 平安科技(深圳)有限公司 Input sentence error correction method and device, electronic device, and medium
CN110852087A (en) * 2019-09-23 2020-02-28 腾讯科技(深圳)有限公司 Chinese error correction method and device, storage medium and electronic device
CN111709241A (en) * 2020-05-27 2020-09-25 西安交通大学 Named entity identification method oriented to network security field
CN111950540A (en) * 2020-07-24 2020-11-17 浙江师范大学 Knowledge point extraction method, system, device and medium based on deep learning
CN111985240A (en) * 2020-08-19 2020-11-24 腾讯云计算(长沙)有限责任公司 Training method of named entity recognition model, named entity recognition method and device
CN112270181A (en) * 2020-11-03 2021-01-26 北京明略软件系统有限公司 Sequence labeling method, system, computer readable storage medium and computer device
CN112347773A (en) * 2020-10-26 2021-02-09 北京诺道认知医学科技有限公司 Medical application model training method and device based on BERT model
CN112417884A (en) * 2020-11-05 2021-02-26 广州平云信息科技有限公司 Sentence semantic relevance judging method based on knowledge enhancement and knowledge migration
CN112417877A (en) * 2020-11-24 2021-02-26 广州平云信息科技有限公司 Text inclusion relation recognition method based on improved BERT
CN112559702A (en) * 2020-11-10 2021-03-26 西安理工大学 Transformer-based natural language problem generation method in civil construction information field
CN112667788A (en) * 2020-12-02 2021-04-16 中山大学 Novel BERTEXT-based multi-round dialogue natural language understanding model
CN113076739A (en) * 2021-04-09 2021-07-06 厦门快商通科技股份有限公司 Method and system for realizing cross-domain Chinese text error correction
CN113312454A (en) * 2021-06-17 2021-08-27 辽宁大学 Three-stage story reading understanding training method based on self-supervision
CN113326695A (en) * 2021-04-26 2021-08-31 东南大学 Emotion polarity analysis method based on transfer learning
CN113392209A (en) * 2020-10-26 2021-09-14 腾讯科技(深圳)有限公司 Text clustering method based on artificial intelligence, related equipment and storage medium

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018120889A1 (en) * 2016-12-28 2018-07-05 平安科技(深圳)有限公司 Input sentence error correction method and device, electronic device, and medium
CN110852087A (en) * 2019-09-23 2020-02-28 腾讯科技(深圳)有限公司 Chinese error correction method and device, storage medium and electronic device
CN111709241A (en) * 2020-05-27 2020-09-25 西安交通大学 Named entity identification method oriented to network security field
CN111950540A (en) * 2020-07-24 2020-11-17 浙江师范大学 Knowledge point extraction method, system, device and medium based on deep learning
CN111985240A (en) * 2020-08-19 2020-11-24 腾讯云计算(长沙)有限责任公司 Training method of named entity recognition model, named entity recognition method and device
CN112347773A (en) * 2020-10-26 2021-02-09 北京诺道认知医学科技有限公司 Medical application model training method and device based on BERT model
CN113392209A (en) * 2020-10-26 2021-09-14 腾讯科技(深圳)有限公司 Text clustering method based on artificial intelligence, related equipment and storage medium
CN112270181A (en) * 2020-11-03 2021-01-26 北京明略软件系统有限公司 Sequence labeling method, system, computer readable storage medium and computer device
CN112417884A (en) * 2020-11-05 2021-02-26 广州平云信息科技有限公司 Sentence semantic relevance judging method based on knowledge enhancement and knowledge migration
CN112559702A (en) * 2020-11-10 2021-03-26 西安理工大学 Transformer-based natural language problem generation method in civil construction information field
CN112417877A (en) * 2020-11-24 2021-02-26 广州平云信息科技有限公司 Text inclusion relation recognition method based on improved BERT
CN112667788A (en) * 2020-12-02 2021-04-16 中山大学 Novel BERTEXT-based multi-round dialogue natural language understanding model
CN113076739A (en) * 2021-04-09 2021-07-06 厦门快商通科技股份有限公司 Method and system for realizing cross-domain Chinese text error correction
CN113326695A (en) * 2021-04-26 2021-08-31 东南大学 Emotion polarity analysis method based on transfer learning
CN113312454A (en) * 2021-06-17 2021-08-27 辽宁大学 Three-stage story reading understanding training method based on self-supervision

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孔祥鹏; 吾守尔・斯拉木; 杨启萌; 李哲: "Uyghur Named Entity Recognition Based on Transfer Learning" (基于迁移学习的维吾尔语命名实体识别), Journal of Northeast Normal University (Natural Science Edition), no. 02, 20 June 2020 (2020-06-20) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114757169A (en) * 2022-03-22 2022-07-15 中国电子科技集团公司第十研究所 Self-adaptive small sample learning intelligent error correction method based on ALBERT model
CN114596312A (en) * 2022-05-07 2022-06-07 中国科学院深圳先进技术研究院 Video processing method and device
CN115344668A (en) * 2022-07-05 2022-11-15 北京邮电大学 Multi-field and multi-disciplinary science and technology policy resource retrieval method and device
CN115983242A (en) * 2023-02-16 2023-04-18 北京有竹居网络技术有限公司 Text error correction method, system, electronic device and medium

Similar Documents

Publication Publication Date Title
CN113836919A (en) Building industry text error correction method based on transfer learning
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN110335653B (en) Non-standard medical record analysis method based on openEHR medical record format
CN111738007B (en) Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network
CN110688489B (en) Knowledge graph deduction method and device based on interactive attention and storage medium
CN110276069B (en) Method, system and storage medium for automatically detecting Chinese braille error
CN116127952A (en) Multi-granularity Chinese text error correction method and device
CN116306600B (en) MacBert-based Chinese text error correction method
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115357719A (en) Power audit text classification method and device based on improved BERT model
CN115510863A (en) Question matching task oriented data enhancement method
CN112686040B (en) Event reality detection method based on graph recurrent neural network
CN116562295A (en) Method for identifying enhanced semantic named entity for text in bridge field
Arslan et al. Detecting and correcting automatic speech recognition errors with a new model
CN114153942B (en) Event time sequence relation extraction method based on dynamic attention mechanism
CN115757815A (en) Knowledge graph construction method and device and storage medium
CN112528003B (en) Multi-item selection question-answering method based on semantic sorting and knowledge correction
CN114385803A (en) Extraction type reading understanding method based on external knowledge and segment selection
CN114692615A (en) Small sample semantic graph recognition method for small languages
CN115358219A (en) Chinese spelling error correction method integrating unsupervised learning and self-supervised learning
CN115809655A (en) Chinese character symbol correction method and system based on attribution network and BERT
CN114648029A (en) Electric power field named entity identification method based on BiLSTM-CRF model
CN114357166A (en) Text classification method based on deep learning
CN112784587A (en) Text similarity measurement method and device based on multi-model fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination