CN112836496A - Text error correction method based on BERT and feedforward neural network - Google Patents
- Publication number
- CN112836496A (application CN202110098015.6A)
- Authority
- CN
- China
- Prior art keywords: text, error, word, bert, correct
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30—Semantic analysis
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a text error correction method based on BERT and a feedforward neural network that can quickly and accurately identify and correct errors in large-scale corpora. The method preprocesses the text, semantically encodes it with BERT, uses the overall semantic representation of the text to judge whether it is correct, locates the specific error positions in texts judged to be wrong by sequence labeling, and finally generates the corresponding correct text with a feedforward neural network using the context of each error. The text error correction method constructed by the invention has fast inference and good interpretability.
Description
Technical Field
The invention belongs to the fields of artificial intelligence and natural language processing, and particularly relates to a text error correction method based on BERT and a feedforward neural network.
Background Art
Text error correction is a natural language processing technique for correcting erroneous content in text; it covers spelling correction, grammar correction, and, in specific scenarios, semantic and pragmatic correction. Spelling correction does not change the length of the text: it only corrects the wrongly written characters one by one. Grammar correction and semantic correction must handle redundant-word, missing-word, wrong-word and word-order errors, and may change the length of the text.
In recent years, large-scale pre-trained deep language models such as BERT have driven rapid progress in natural language processing: they provide a better initial semantic representation of text for specific processing tasks and reduce the time and cost required for model convergence.
Traditional text error correction mainly uses rule-based methods or translation models. Rule-based methods depend on manually defined replacement-word dictionaries and can only correct specific errors. Error correction with translation models is currently the mainstream approach, and neural translation models have replaced statistical ones; it treats error correction as translation from a wrong sentence to a correct one. Although effective and fluent, it requires a large amount of training data and is slow in use. In addition, if only spelling errors are corrected, the current mainstream approach is sequence labeling, which corrects wrongly written characters quickly but is unsuitable for other error types.
Disclosure of Invention
The invention aims to provide a text error correction method based on BERT and a feedforward neural network that addresses the defects of the prior art. The invention adopts a simple model to identify and correct various kinds of errors in text.
The purpose of the invention is achieved by the following technical scheme. A text error correction method based on BERT and a feedforward neural network comprises the following steps:
1) Preprocess the text error correction corpus data.
2) Encode the input text preprocessed in step 1) with BERT to obtain feature representations and a semantic representation.
3) Judge whether the text is correct based on the semantic representation of the input text obtained in step 2).
4) Detect the positions of errors in the text based on the feature representations obtained in step 2) and the judgment result of step 3).
5) Generate the correct text corresponding to the erroneous text based on the error positions found in step 4).
Further, in step 1), the data are preprocessed as follows:
1.1) Apply preprocessing operations to the collected text data.
1.2) Segment the text: Chinese text is segmented character by character; English text is segmented into word pieces.
1.3) Add the special character "[CLS]" at the beginning of the text and the special character "[SEP]" at the end.
1.4) If the text is training data, compute the right-or-wrong label of the text, the error-type label of each character, and the correct text corresponding to each error, by comparing the segmented source string with the segmented target string.
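The segmentation and boundary-token handling of steps 1.2) and 1.3) can be sketched in a few lines; this is a minimal illustration for character-level (Chinese-style) input, and the helper name `preprocess` is illustrative, not from the patent:

```python
def preprocess(text: str) -> list[str]:
    """Character-level segmentation plus BERT's boundary tokens (steps 1.2-1.3)."""
    return ["[CLS]"] + list(text) + ["[SEP]"]

tokens = preprocess("abcd")
# a 4-character text becomes a 6-token sequence: [CLS], 4 characters, [SEP]
```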
Further, in step 2), the BERT-encoded text representation is obtained as follows:
2.1) Embed the words and positions of the input text using the word vectors and position vectors pre-trained by BERT to obtain a preliminary vector representation of the text:

H_0 = E_word[X] + E_pos

where E_word is the word-embedding matrix and E_pos is the position-embedding matrix; the word-embedding matrix has size [V, E], V is the vocabulary size defined by BERT, E is the embedding dimension, and the position-embedding matrix has size [512, E].
2.2) Obtain a semantic feature representation of each character using the L Transformer layers in BERT:

H_l = Transformer(H_{l-1}), l = 1 to L

2.3) Use the feature h_1^L corresponding to the "[CLS]" character to obtain the overall semantic representation of the text:

c = tanh(W_c h_1^L + b_c)
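A toy sketch of steps 2.1)-2.3) with made-up dimensions (V = 10, E = 8) and random stand-in weights; the L Transformer layers are not modeled here, only the embedding lookup and the pooled "[CLS]" state:

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, n = 10, 8, 5                  # toy vocab size, embedding dim, text length
E_word = rng.normal(size=(V, E))    # word-embedding matrix, size [V, E]
E_pos = rng.normal(size=(512, E))   # position-embedding matrix, size [512, E]

ids = np.array([1, 4, 2, 7, 3])     # token ids of the preprocessed text
H0 = E_word[ids] + E_pos[:n]        # preliminary representation H_0, shape [n, E]

# pooled overall representation from the first ("[CLS]") state, step 2.3
W_c = rng.normal(size=(E, E))
b_c = np.zeros(E)
c = np.tanh(W_c @ H0[0] + b_c)
```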
Further, in step 3), whether the text is correct is judged as follows:
3.1) Select the overall semantic representation c output by BERT in step 2.3) as the feature for judging whether the text is wrong.
3.2) Map c to a scalar with a feedforward neural network, then compute the probability that the text is wrong with a sigmoid function:

P_rw = sigmoid(W_rw c + b_rw)

where W_rw and b_rw are weight parameters learned by the deep learning model.
3.3) Compare P_rw with a manually set threshold to judge whether the text is wrong; if P_rw is below the threshold, the text is considered correct.
3.4) For input text judged correct, output it directly as the error correction result without performing the subsequent correction operations.
3.5) During model training, compute the loss of the error judgment with a binary cross-entropy loss function:

Loss_rw = BCELoss(P_rw, y_rw)

where y_rw is the ground truth of whether the text is wrong, obtained by comparing whether the source string and the target string are equal.
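The judgment of steps 3.2)-3.5) amounts to a logistic classifier over c; a minimal sketch with random stand-in weights (all names and the threshold value are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_loss(p, y):
    # binary cross-entropy for a single prediction
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
c = rng.normal(size=8)              # overall semantic representation from step 2.3
W_rw = rng.normal(size=8)
b_rw = 0.0

P_rw = sigmoid(W_rw @ c + b_rw)     # probability that the text is wrong
threshold = 0.5                     # manually set threshold of step 3.3
is_wrong = P_rw > threshold         # below the threshold -> text treated as correct

y_rw = 1.0                          # ground-truth label during training
loss = bce_loss(P_rw, y_rw)         # step 3.5
```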
Further, in step 4), the positions of errors in the text are detected as follows:
4.1) Select the feature representation H^L of each word output by BERT in step 2.2) as the feature for error-type detection.
4.2) Define the type of each character as one of: correct, redundant, correct but followed by missing content, or wrongly worded; each character corresponds to one of these four types, with the operation labels keep, delete, append after, and replace respectively.
4.3) Label the input text as a sequence with a feedforward neural network combined with a softmax function, detecting the operation required for each character:

P_sl^{i1} = softmax(W_sl h_{i1}^L + b_sl)

where the four components of P_sl^{i1} are the probabilities that the i1-th character should be kept, deleted, appended after, or replaced; the operation with the highest probability is taken as the detection result.
4.4) After obtaining the predicted tag sequence of the input text, derive each error position pos = (s', e') by rule: for a continuous run of delete tags or of replace tags covering the interval [s, e], the derived error start s' and end e' are the positions just before and just after the run, i.e. s' = s - 1, e' = e + 1; for each append tag in the sequence, the derived error start s' is the position s of the tag itself, and the error end e' is the next position, i.e. e' = s' + 1.
4.5) For input text whose predicted tags are all keep, perform no further correction and output the input text directly as the result.
4.6) During model training, compute the sequence labeling loss with a cross-entropy loss function:

Loss_sl = Σ_{i1=1}^{n} CrossEntropyLoss(P_sl^{i1}, y_sl^{i1})

where y_sl^{i1} is the true operation tag of the i1-th character.
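Steps 4.1)-4.3) reduce to a 4-way classifier per character; a sketch with stand-in features and weights (toy dimensions, illustrative names):

```python
import numpy as np

OPS = ["keep", "delete", "append", "replace"]   # the four operation labels of step 4.2

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, E = 6, 8
H_L = rng.normal(size=(n, E))       # per-character features from BERT (step 2.2)
W_sl = rng.normal(size=(E, 4))
b_sl = np.zeros(4)

P_sl = softmax(H_L @ W_sl + b_sl)   # [n, 4] operation probabilities (step 4.3)
tags = [OPS[i] for i in P_sl.argmax(axis=1)]    # highest-probability operation per char
```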
Further, in step 5), the correct text corresponding to each error is generated as follows:
5.1) According to the error position (s', e') obtained in step 4.4), truncate the input feature vector for correct-text generation from the character feature vectors H^L obtained in step 2.2):

h_info = [h_{s'}^L ; h_mid ; h_{e'}^L]

where s' and e' are the start and end positions of an error, h_{s'}^L carries the context before the error, h_{e'}^L the context after the error, and h_mid the information of the error itself. When the error start and end positions are adjacent, h_mid is a special model-learned vector h_emp; otherwise h_mid is the average of the vectors between the error start and end positions. The number of h_info vectors equals the number of errors detected in step 4.4).
5.2) Define a position-embedding matrix E'_pos learned by the deep learning model, used to control which character is generated at each position of the output text. It consists of MAX_LEN vectors of dimension POS_DIM, where MAX_LEN is the maximum length of the generated text. When correcting an error, the j-th row of the matrix, a vector E'_pos(j) of dimension POS_DIM, provides the position information for generating the j-th word.
5.3) Extract correct-text features with a multilayer feedforward neural network, combining the error information and the position-embedding vector:

h_{i3,j} = MLP([h_info, E'_pos(j)])

where h_{i3,j} is the feature for the j-th correct word generated for the i3-th error in the text.
5.4) With a softmax function, map the error feature into the dimension of the dictionary defined by BERT, and take the word with the highest probability in the dictionary as the generated j-th word:

P_{i3,j} = softmax(E_word h_{i3,j})

The weight parameter used by the last layer of the multilayer feedforward network is the transpose of the word-embedding matrix E_word in BERT.
5.5) During model training and use, MAX_LEN words of corrected text are generated for each error, but only the text before the first special character "[EOP]" in the generated output is kept as the result; if no "[EOP]" is generated, all generated text is kept.
5.6) Using the error positions detected in step 4.4), replace the erroneous content of the input text with the generated correct text, and delete the added special characters "[CLS]" and "[SEP]" to obtain the final error-corrected output.
5.7) During model training, compute the text generation loss with a cross-entropy loss function:

Loss_gen = Σ_{i3=1}^{m} Σ_{j=1}^{k_{i3}} CrossEntropyLoss(P_{i3,j}, t_{i3,j})

where m is the number of errors in a text; k_{i3} is the length of the correct text for the i3-th error, including the end character "[EOP]"; and t_{i3,j} is the j-th word of the correct text corresponding to the i3-th error.
The beneficial effects of the invention are: by applying sequence labeling to the objects of text error correction, various types of errors can be corrected quickly and accurately, not only spelling errors; performing error correction based on BERT allows wrong texts in large-scale corpora to be corrected and correct texts to be generated; and the long runtime of traditional translation-based correction is improved, since the serial word-by-word generation of a correct sentence is optimized, by means of the feedforward neural network, into a parallel correction of only the erroneous content.
Drawings
FIG. 1 is a flow chart of a proposed method of the present invention;
FIG. 2 is a diagram of a text error correction model architecture designed by the present invention;
FIG. 3 is a diagram of the internal structure of the BERT model employed in the present invention.
Detailed Description
The method of the present invention will be described in further detail with reference to the accompanying drawings and specific examples.
The text error correction method based on BERT and a feedforward neural network uses deep learning combined with the pre-trained language model BERT, which can effectively extract semantic information such as part of speech and syntactic structure from the text, yielding a contextual feature representation of every word. In addition, three different feedforward neural networks designed by the invention use the extracted features to perform text error judgment, error position detection, and correct text generation respectively; organically combining these modules achieves the goal of text error correction. As shown in fig. 1, the method comprises the following steps:
1. data pre-processing
For the text error correction corpus data, first segment each text: Chinese text is segmented character by character; English text is split on whitespace and each word is further split into word pieces using statistics from a large-scale English corpus. After segmentation, a stop-word dictionary may be set according to actual requirements to filter stop words from the text. In addition, the special character "[CLS]" marking the start of the text must be added at the beginning of each text, and the special character "[SEP]" marking the end must be added at the end.
For the training data set, three target values for model training are obtained by comparing the input text X = (x_1, x_2, …, x_n), i1 = 1 to n, with the target text T = (t_1, t_2, …, t_n'), i2 = 1 to n': the right-or-wrong label y_rw, the tag sequence Y_sl = (y_sl^1, …, y_sl^n) marking the error types, and the corrected target texts T_i3, i3 ∈ [1, m]. Specifically:
The right-or-wrong label y_rw ∈ {0, 1} is computed by judging whether X and T are equal: if they are equal, the text is correct and the value is 0; otherwise the sentence is wrong and the value is 1.
Each element y_sl^{i1} of the tag sequence indicates that the i1-th character of the input text X should be kept (0), deleted (1), have characters appended after it (2), or be replaced (3). The tags are obtained by comparing X and T with the SequenceMatcher class in Python's built-in difflib module, which yields the sequence of operations converting the text X into T. Each operation consists of 5 parts: an operation type, a start position s_X and end position e_X in X, and a start position s_T and end position e_T in T, meaning that X[s_X : e_X] is changed by the operation into T[s_T : e_T]. Four operation types are involved, "equal", "delete", "insert" and "replace", corresponding to the four tags. If the type is "equal", the words X[s_X : e_X] and T[s_T : e_T] are the same, and the corresponding tags are all set to 0, meaning those words of X are kept. If the type is "delete", the words X[s_X : e_X] must be removed, and the tags are all set to 1. If the type is "insert", the words T[s_T : e_T] must be inserted before the s_X-th word of X; in this case s_X and e_X are equal and the operation points between the (s_X - 1)-th and the s_X-th word, so the tag of the (s_X - 1)-th word is set to 2, meaning a word must be appended after it. If the type is "replace", the words X[s_X : e_X] must be replaced by T[s_T : e_T], and the tags are all set to 3.
The corrected target texts T_i3 are also obtained from the operation sequence. If the sequence contains m operations whose type is not "equal", there are m corrected target texts: for an "insert" or "replace" operation, T_i3 is T[s_T : e_T] plus the special end character "[EOP]"; for a "delete" operation, T_i3 is the special characters "[NONE]" and "[EOP]". Y_sl and the T_i3 correspond to each other: apart from "equal" operations, every delete, append and replace tag has its corrected content. The last character of every target text T_i3 is the special end character "[EOP]".
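The opcode-based labeling above can be sketched directly with `difflib.SequenceMatcher.get_opcodes()`, whose `(tag, s_X, e_X, s_T, e_T)` tuples match the five-part operations described; `build_labels` is an illustrative helper, not code from the patent, and assumes the texts already carry "[CLS]"/"[SEP]" so an "insert" never occurs at position 0:

```python
from difflib import SequenceMatcher

def build_labels(X, T):
    """Compute y_rw, Y_sl (0=keep, 1=delete, 2=append-after, 3=replace)
    and the corrected target texts from token lists X and T."""
    y_sl = [0] * len(X)
    targets = []                      # corrected text per non-"equal" operation
    for tag, sx, ex, st, et in SequenceMatcher(None, X, T).get_opcodes():
        if tag == "delete":
            for i in range(sx, ex):
                y_sl[i] = 1
            targets.append(["[NONE]", "[EOP]"])
        elif tag == "insert":
            y_sl[sx - 1] = 2          # append after the word preceding position sx
            targets.append(list(T[st:et]) + ["[EOP]"])
        elif tag == "replace":
            for i in range(sx, ex):
                y_sl[i] = 3
            targets.append(list(T[st:et]) + ["[EOP]"])
    y_rw = int(X != T)
    return y_rw, y_sl, targets
```

For example, comparing X = ["[CLS]", "a", "b", "d", "[SEP]"] with T = ["[CLS]", "a", "b", "c", "d", "[SEP]"] yields y_rw = 1, an append-after tag on the word before the missing "c", and the single target ["c", "[EOP]"].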
2. BERT coding
BERT encoding of the input text X has two main steps.
The first step is word embedding, which converts every word of X into a vector using the two embedding matrices defined by BERT: the word-embedding matrix E_word and the position-embedding matrix E_pos. The word-embedding matrix has size [V, E], where V is the vocabulary size defined by BERT and E is the embedding dimension; the position-embedding matrix has size [512, E]. Word embedding in BERT is computed as:

H_0 = E_word[X] + E_pos

The second step is the Transformer-based self-attention encoding module, composed of L Transformer layers; every layer performs the same computation, taking the output of the previous layer as input:

H_l = Transformer(H_{l-1})

where l = 1 to L, and the input of the first layer is the embedded text H_0 = (h_1^0, …, h_n^0), i1 = 1 to n. Finally, the output H^L = (h_1^L, …, h_n^L) of the L-th Transformer layer of BERT is taken as the feature representation of the input text X.
In addition, BERT pre-trains the output h_1^L corresponding to the first input character "[CLS]" as the feature vector for next-sentence prediction, so it can serve as the semantic representation c of the whole input text X, computed as:

c = tanh(W_c h_1^L + b_c)

where tanh is an activation function, W_c is a parameter matrix learned by the model, and b_c is a bias vector learned by the model.
3. Text error judgment
Judge whether the input text X is correct; if it is, perform no subsequent error-position detection or correction. The judgment uses the semantic representation c of the input text obtained in step 2 together with a feedforward neural network as a binary classification task:

P_rw = sigmoid(W_rw c + b_rw)

where the output of the classifier is the error probability of the input text, P_rw ∈ [0, 1]; if it is greater than the set threshold, the input text is considered wrong, otherwise it is judged correct. W_rw and b_rw are weight parameters learned by the deep learning model; sigmoid is the activation function.
During model training, the binary cross-entropy loss function BCELoss is used to compute the loss of the error judgment:

Loss_rw = BCELoss(P_rw, y_rw)
4. error location detection
Detect which positions in the text are wrong, and label those positions with the corresponding error types. A sequence labeling method processes the feature representation H^L of the input text obtained in step 2 with a feedforward neural network:

P_sl^{i1} = softmax(W_sl h_{i1}^L + b_sl)

where i1 = 1 to n; W_sl is a parameter matrix learned by the model and b_sl a learned bias vector; P_sl^{i1} is a vector of 4 elements giving the probabilities of keeping, deleting, appending after, and replacing the i1-th word of the input text; the operation with the highest probability is taken as the sequence labeling result; softmax is the activation function.
During model training, the cross-entropy loss function CrossEntropyLoss is used to compute the sequence labeling loss:

Loss_sl = Σ_{i1=1}^{n} CrossEntropyLoss(P_sl^{i1}, y_sl^{i1})

After sequence labeling with the feedforward neural network has detected the operation required for each word, the start and end positions of the errors in the text can be computed. Suppose that among the n operations obtained for an input text X of n characters, the operations for {x_s, …, x_e} (e ≥ s) are identical and are not keep operations. If the operation is delete, the error position is defined as pos = (s - 1, e + 1), where s - 1 and e + 1 are the start and end positions of the error. If the operation is replace, the error position is likewise defined as pos = (s - 1, e + 1). If the operation is append-after, e - s + 1 error positions are defined, pos_i4 = (s + i4 - 1, s + i4) for i4 ∈ [1, e - s + 1]; similarly, s + i4 - 1 and s + i4 are the start and end positions. After the positions are computed as above, each is denoted pos = (s', e') for convenience.
In the invention, the defined error start position is one unit before the actual position and the defined end position one unit after it; this makes it convenient to gather the input data for correct-text generation and unifies the processing of all error types. Redundant-word and wrong-word errors require delete and replace operations, so the error position could be defined directly on the wrong words of the text; but missing-word errors require inserting the missing words between two words that are themselves correct, so the error span must be widened by one word on each side: the start position is the word before the error and the end position the word after it.
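The position rule above (delete/replace runs widened by one word on each side; one span per append tag) can be sketched as follows; `error_positions` is an illustrative helper operating on string tags rather than the numeric labels:

```python
def error_positions(tags):
    """Derive (s', e') error spans from per-character operation tags,
    e.g. tags = ["keep", "delete", "delete", "keep"]."""
    spans, i = [], 0
    while i < len(tags):
        if tags[i] in ("delete", "replace"):
            s = i
            while i < len(tags) and tags[i] == tags[s]:
                i += 1                  # consume the continuous run [s, i-1]
            spans.append((s - 1, i))    # (s', e') = (s - 1, e + 1) with e = i - 1
        elif tags[i] == "append":
            spans.append((i, i + 1))    # one span (s, s + 1) per append tag
            i += 1
        else:
            i += 1
    return spans
```

For instance, a run of two delete tags at positions 1-2 yields the single widened span (0, 3), while an append tag at position 1 yields (1, 2).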
5. Correct text generation
Knowing the error positions, generate the correct texts for all errors in parallel with a feedforward neural network, based on the contextual semantics of each error. This takes three steps.
First, truncate the text representation H^L obtained in step 2 according to the error positions obtained in step 4. For an error position pos = (s', e') in the text, take out the states h_{s'}^L and h_{e'}^L at the error start and end, then take out the intermediate state vectors h_{s'+1}^L, …, h_{e'-1}^L and average them to obtain the feature vector of the error:

h_mid = mean(h_{s'+1}^L, …, h_{e'-1}^L)

where mean denotes averaging. In this step, if a missing-word error is processed then s' + 1 = e', i.e. there is no intermediate state vector; for this case a randomly initialized variable h_emp, optimized during model training, is used in place of h_mid. Finally, the three vectors are concatenated as the context information of the erroneous content:

h_info = [h_{s'}^L ; h_mid ; h_{e'}^L]

Second, extract correct-text features with a multilayer feedforward neural network, combining the error information and the position-embedding vectors. Because a correct text of several words or special characters must be generated quickly from the single context vector h_info, a new position-embedding matrix E'_pos of embedding dimension POS_DIM is set up to distinguish which word is being generated. The feature h_{i3,j} is extracted with a two-layer feedforward network, with an activation function and a normalization operation in each layer:

h_{i3,j} = MLP([h_info, E'_pos(j)])

where i3 indexes the i3-th error in the text and j the j-th word generated for that error. When the model actually corrects errors, it does not know how many words the correct text contains, so generation is uniformly limited to MAX_LEN words during training and use, and the words before the special character "[EOP]" in the generated text are kept as the correction result. If no "[EOP]" is generated, all generated text is used.
Third, a feedforward neural network extracts the target correct text from the feature h_{i3,j}:

P_{i3,j} = softmax(E_word h_{i3,j})

where j = 1 to k_{i3}. The formula maps h_{i3,j} to a vector of size V and normalizes it, giving a probability distribution over the dictionary defined by BERT for the j-th word generated when correcting the i3-th error; the word with the highest probability in the dictionary is taken as the j-th generated word.
Through the above operations, the correct text for each error detected in step 4 is obtained; combining it with the error positions in the input text, the original text is modified to give the corrected text.
During model training, the cross-entropy loss function CrossEntropyLoss is used to compute the text generation loss:

Loss_gen = Σ_{i3=1}^{m} Σ_{j=1}^{k_{i3}} CrossEntropyLoss(P_{i3,j}, t_{i3,j})

where m is the number of errors in a sample and k_{i3} is the length of the correct text for the i3-th error (including the special character "[EOP]"). MAX_LEN words are output for each error during training, but only the loss of the first k_{i3} outputs is computed.
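The three generation steps can be sketched end to end with toy dimensions and random stand-in weights; the two-layer MLP is collapsed into a single matrix and the vocabulary is a four-word stand-in, so this only illustrates the data flow, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(1)
E, POS_DIM, MAX_LEN = 8, 4, 4
VOCAB = ["w0", "w1", "w2", "[EOP]"]     # stand-in for the BERT dictionary

H_L = rng.normal(size=(10, E))          # encoder states of the input text
s, e = 2, 6                             # one error span (s', e') from step 4

# step one: build h_info from the states around the span
h_above, h_below = H_L[s], H_L[e]
h_mid = H_L[s + 1:e].mean(axis=0) if e > s + 1 else rng.normal(size=E)  # h_emp stand-in
h_info = np.concatenate([h_above, h_mid, h_below])      # [3E]

# steps two and three: position-conditioned features, then a vocab distribution
E_pos2 = rng.normal(size=(MAX_LEN, POS_DIM))            # E'_pos, learned in practice
W = rng.normal(size=(3 * E + POS_DIM, len(VOCAB)))      # MLP collapsed to one layer

words = []
for j in range(MAX_LEN):                # parallel in the model; a loop here
    logits = np.concatenate([h_info, E_pos2[j]]) @ W
    words.append(VOCAB[int(logits.argmax())])

# keep only the text before the first "[EOP]", if one was generated
corrected = words[:words.index("[EOP]")] if "[EOP]" in words else words
```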
To demonstrate that the invention corrects a variety of text error types, the method is explained with four example texts (Chinese in the original): a correct input text X1 = "protection intellectual property"; an input text X2 = "protected knowledge" with a missing-word error; an input text X3 = "protection of intellectual property right" with a wrong-word error; and an input text X4 = "protected intellectual property" with a redundant-word error. The correction of these four texts is described below.
As shown in fig. 2 to 3, the method of the present embodiment comprises five steps:
Step 1: preprocess the input text, segmenting the Chinese text character by character and adding the special character "[CLS]" at the beginning of the text and the special character "[SEP]" at the end.
The resulting processed sequences are X1 = ["[CLS]", "保", "护", "知", "识", "产", "权", "[SEP]"], X2 = ["[CLS]", "保", "护", "知", "识", "[SEP]"], X3 = ["[CLS]", "保", "护", "知", "只", "产", "权", "[SEP]"], and X4 = ["[CLS]", "保", "护", "护", "知", "识", "产", "权", "[SEP]"].
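Step 1 reduces to a short preprocessing routine. The sketch below is illustrative (the function name `preprocess` is assumed), using the example phrase 保护知识产权 ("protect intellectual property") and character-level segmentation as described for Chinese:

```python
def preprocess(text: str) -> list[str]:
    # Segment Chinese text character by character, then add the special
    # characters "[CLS]" at the beginning and "[SEP]" at the end.
    return ["[CLS]"] + list(text) + ["[SEP]"]

x1 = preprocess("保护知识产权")
# the preprocessed length n of X1 is 8, as used later in step 2
```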
It is assumed here that the proposed error correction model has already been trained and is being used for inference, so the target values required for model training need not be computed.
Step 2: encode the input text with BERT. A base-version Chinese pre-trained BERT model is assumed, with a word embedding dimension of 768, a vocabulary size of 21128, and a stack of 12 Transformer modules.
The processing flow is: apply word and position embedding to the text, feed the embedded text through the Transformer modules, and take the output of the last Transformer layer to obtain the overall feature representation c of each text and the semantic representation h_{i1} of each word in the text, i1 ∈ [1, n], where n is the text length after the preprocessing of step 1; for example, for text X1 the length is 8.
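The embedding stage of step 2 can be sketched numerically as follows. This is a toy illustration under assumed small sizes (the base model described here uses V = 21128, E = 768 and 512 positions); the random matrices and token ids are placeholders, not real BERT weights:

```python
import numpy as np

V, E, MAX_POS = 100, 8, 16                 # toy sizes for illustration only
rng = np.random.default_rng(0)
E_word = rng.standard_normal((V, E))       # word-embedding matrix, [V, E]
E_pos = rng.standard_normal((MAX_POS, E))  # position-embedding matrix ([512, E] in BERT)

token_ids = np.array([3, 14, 15, 9])       # hypothetical ids of a 4-token text
# Preliminary representation: word embedding plus position embedding,
# which is then fed through the stack of Transformer modules.
H0 = E_word[token_ids] + E_pos[np.arange(len(token_ids))]
```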
Step 3: judge whether the text is erroneous. Input the overall text representation c obtained in step 2 into a feedforward neural network followed by a sigmoid function to obtain the probability that the text is erroneous, P_rw ∈ (0, 1). This error probability is compared with a manually set threshold; if it is larger than the threshold, the text is considered erroneous, otherwise it is considered correct.
For example, if the prediction results for the four texts above are 0.1, 0.8, 0.9 and 0.8 respectively and the threshold is set to 0.5, the first text X1 is considered correct and the other three texts erroneous.
For a text predicted to be correct, the original input text is directly taken as the corrected output of the model, and no subsequent error correction is performed.
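The step-3 decision reduces to a threshold comparison on the predicted probability; a minimal sketch (the function name `is_erroneous` is assumed):

```python
def is_erroneous(p_rw: float, threshold: float = 0.5) -> bool:
    # A text is judged erroneous when its predicted error probability
    # P_rw exceeds the manually set threshold.
    return p_rw > threshold

# The embodiment's example probabilities for X1..X4 with threshold 0.5:
flags = [is_erroneous(p) for p in (0.1, 0.8, 0.9, 0.8)]
# only X1 is judged correct; X2-X4 proceed to error localization
```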
Step 4: sequence-label the locations of the errors. For the texts detected as erroneous in step 3, this step detects the specific error locations: a feedforward neural network classifies the semantic representation h_{i1} of each word obtained in step 2 and, combined with the softmax function, yields the probability of each operation label for each word; the operation corresponding to the maximum probability is finally taken.
The operation types are four: retain, delete, add after, and replace, represented by the values 0, 1, 2 and 3 respectively. For text X2, suppose the predicted operation type for the character "识" is 2, i.e., content must be added after "识". The operation sequence of text X2 is then [0,0,0,0,2,0]; similarly, the operation sequence of text X3 is [0,0,0,0,3,0,0], and that of text X4 is [0,0,1,0,0,0,0].
After the operation sequence is obtained, the start and end positions of each error in the input text can be calculated: the error position of text X2 is (5,6), that of X3 is (4,6), and that of X4 is (2,4).
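The conversion from an operation sequence to error positions follows the rules stated in claim step 4.4): a run of consecutive delete or replace labels over [s, e] gives (s-1, e+1), and each add-after label at position s gives (s, s+1). A minimal sketch (function and constant names assumed):

```python
KEEP, DELETE, ADD, REPLACE = 0, 1, 2, 3

def error_spans(ops):
    # Convert a per-character operation sequence into error positions
    # (s', e'), using 1-based positions as in the embodiment.
    spans, i = [], 0
    while i < len(ops):
        pos = i + 1                      # 1-based position of ops[i]
        if ops[i] == ADD:
            spans.append((pos, pos + 1))
            i += 1
        elif ops[i] in (DELETE, REPLACE):
            j = i                        # extend over the consecutive run
            while j + 1 < len(ops) and ops[j + 1] == ops[i]:
                j += 1
            spans.append((pos - 1, j + 2))   # (s - 1, e + 1)
            i = j + 1
        else:
            i += 1
    return spans

x2_spans = error_spans([0, 0, 0, 0, 2, 0])     # [(5, 6)]
x3_spans = error_spans([0, 0, 0, 0, 3, 0, 0])  # [(4, 6)]
x4_spans = error_spans([0, 0, 1, 0, 0, 0, 0])  # [(2, 4)]
```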
If no error position is detected for a text in this step, i.e., the obtained operation sequence consists entirely of 0s, the text is considered correct and is directly output as the model's error correction result.
Step 5: generate the correct text. This step uses the semantic representations of the words obtained in step 2 and the error positions obtained in step 4. First, the semantic representation vectors between the corresponding positions are taken out according to each error position; the vectors at the two ends are kept as context information for the start and end of the error, and the average of the intermediate vectors is taken as the feature information of the error. If there is no intermediate vector, a vector h_emp representing the empty state is used instead. Finally, the three vectors are concatenated to obtain the error representation information h_info. For example, since the error span of text X2 is (5,6) with no intermediate positions, its error representation is h_info = [h_5, h_emp, h_6].
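The construction of h_info can be sketched with NumPy as follows. This is an illustrative reconstruction: the function name is assumed, and h_emp is shown as a zero vector here, whereas in the model it is a learned parameter:

```python
import numpy as np

def build_h_info(H, s, e, h_emp):
    # h_info = [start context; error feature; end context], following step 5.
    # H holds the per-character semantic vectors from step 2 (row s-1
    # corresponds to 1-based position s); s, e are the error span ends.
    h_start, h_end = H[s - 1], H[e - 1]
    middle = H[s:e - 1]                  # vectors strictly between s and e
    h_mid = h_emp if len(middle) == 0 else middle.mean(axis=0)
    return np.concatenate([h_start, h_mid, h_end])

# For X2 the error span is (5, 6): the positions are adjacent, so the
# empty-state vector h_emp stands in for the middle feature.
H = np.arange(32, dtype=float).reshape(8, 4)   # toy: 8 tokens, dimension 4
h_emp = np.zeros(4)
h_info = build_h_info(H, 5, 6, h_emp)          # shape (12,)
```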
Then h_info is concatenated with a position embedding vector E'_pos(j) of embedding dimension POS_DIM = 200 and passed through a two-layer feedforward neural network to obtain the intermediate representation vector of the j-th word to be generated. With the maximum generation length MAX_LEN set to 3, three words of corrected text are generated for each error; for the error in text X2, the inputs at step 5 are [h_info, E'_pos(1)], [h_info, E'_pos(2)] and [h_info, E'_pos(3)].
Finally, BERT's own word embedding matrix E_word and a softmax function map each intermediate vector to a word in the dictionary, completing the generation of the correct text. The three words finally generated are "产", "权" and "[EOP]"; appending the two characters "产权" ("property right") after the original input text "保护知识" yields the corrected text "保护知识产权". For text X3, three words such as "识", "[EOP]" and "是" would similarly be generated, but only the words before "[EOP]" are kept as the result and the following words are discarded.
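The truncation at "[EOP]" and the splice back into the input text can be sketched as follows; a minimal illustration (the function name is assumed), using the X2 example with its 1-based error span (5, 6):

```python
def apply_correction(tokens, span, generated, eop="[EOP]"):
    # Truncate the generated words at the first "[EOP]", replace the
    # content strictly inside the 1-based error span (s', e') with them,
    # and strip the added special characters before returning the text.
    if eop in generated:
        generated = generated[:generated.index(eop)]
    s, e = span
    out = tokens[:s] + generated + tokens[e - 1:]
    return "".join(t for t in out if t not in ("[CLS]", "[SEP]"))

x2 = ["[CLS]", "保", "护", "知", "识", "[SEP]"]
corrected = apply_correction(x2, (5, 6), ["产", "权", "[EOP]"])
# corrected == "保护知识产权"
```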
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the invention is not limited to the specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from its spirit, and such changes and combinations fall within the scope of the invention.
Claims (6)
1. A text error correction method based on BERT and a feedforward neural network, characterized by comprising the following steps:
1) preprocessing the text error correction corpus data;
2) performing BERT encoding on the input text preprocessed in step 1) to obtain feature and semantic representations;
3) judging whether the text is a correct text based on the semantic representation of the input text obtained in step 2);
4) detecting the positions of errors in the text based on the feature representation of the input text obtained in step 2) and the judgment result of step 3);
5) generating the correct text corresponding to the erroneous text based on the error positions found in step 4).
2. The text error correction method based on BERT and feedforward neural network as claimed in claim 1, wherein in step 1), the data preprocessing method comprises:
1.1) performing preprocessing operations on the acquired text data;
1.2) segmenting the text: if the text is Chinese, segmenting it character by character; if it is English, segmenting it into word pieces;
1.3) adding the special character "[CLS]" at the beginning of the text and the special character "[SEP]" at the end;
1.4) if the text is training data, computing the text-level error label, the character-level error-type labels, and the labels of the correct text corresponding to each error, by comparing the segmented source string and the segmented target string.
3. The text error correction method based on BERT and feedforward neural network as claimed in claim 2, wherein in step 2), the text representation is encoded based on BERT:
2.1) using BERT's pre-trained word vectors and position vectors, applying word and position embedding to the input text to obtain the preliminary vector representation of the text:

H_0 = E_word(X) + E_pos

where E_word is the word-embedding matrix of size [V, E], V is the vocabulary size defined by BERT, E is the embedding dimension, and E_pos is the position-embedding matrix of size [512, E];
2.2) obtaining the semantic feature representation H_L = (h_1^L, ..., h_n^L) of each character using the L-layer stack of Transformer modules in BERT, calculated as:

H_l = Transformer(H_{l-1}), l = 1, ..., L

2.3) using the feature h_[CLS]^L corresponding to the "[CLS]" character to obtain the overall semantic representation of the text:

c = h_[CLS]^L
4. The text error correction method based on BERT and feedforward neural network as claimed in claim 3, wherein in step 3), whether the text is correct is judged as follows:
3.1) selecting the overall semantic representation c of the text output by BERT in step 2.3) as the feature for judging text errors;
3.2) using a feedforward neural network to map c to a scalar value, then using the sigmoid function to calculate the probability that the text is erroneous:

P_rw = sigmoid(W_rw c + b_rw)

where W_rw and b_rw are weight parameters learned by the deep learning model;
3.3) comparing P_rw with a manually set threshold to judge whether the text is erroneous; if P_rw is smaller than the threshold, the text is considered correct;
3.4) for an input text judged to be correct, directly outputting the input text as the error correction result without performing subsequent error correction operations;
3.5) during model training, calculating the loss value of the text error judgment using the binary cross entropy loss function:

Loss_rw = BCELoss(P_rw, y_rw)

where y_rw is the true text-error value, obtained by comparing whether the source string and the target string are equal.
5. The text error correction method based on BERT and feedforward neural network as claimed in claim 4, wherein in step 4), the positions of errors in the text are detected as follows:
4.1) selecting the feature representation H_L of each word output by BERT in step 2.2) as the features for error type detection;
4.2) defining the type of each character as one of: correct, redundant, correct but followed by missing content, or wrongly used; each character corresponds to one of these four types, with the corresponding operation labels retain, delete, add after, and replace, respectively;
4.3) performing sequence labeling on the input text using a feedforward neural network combined with the softmax function to detect the operation required for each character:

P_i = softmax(W_tag h_i^L + b_tag)

where W_tag and b_tag are weight parameters learned by the deep learning model, and the four components of P_i respectively represent the probability that the character should be retained, deleted, added after, or replaced; the operation corresponding to the maximum probability is taken as the detection result;
4.4) after obtaining the predicted tag sequence of the input text, the error position pos = (s', e') in the text is obtained based on the following rules: for a run of consecutive delete labels or consecutive replace labels occupying the position interval [s, e], the derived error start position s' and error end position e' are the positions immediately before and after that interval, i.e., s' = s - 1 and e' = e + 1; for each add-after label in the tag sequence, the derived error start position s' is the position s of the label itself, and the error end position e' is the position immediately after it, i.e., e' = s' + 1;
4.5) for an input text whose predicted sequence labels all consist of retain labels, no subsequent error correction is performed, and the input text is directly output as the error correction result;
4.6) during model training, calculating the loss value of the sequence labeling using the cross entropy loss function:

Loss_tag = Σ_i CrossEntropyLoss(P_i, y_i)

where y_i is the true operation label of the i-th character.
6. The text error correction method based on BERT and feedforward neural network as claimed in claim 5, wherein in step 5), the correct text corresponding to the erroneous text is generated as follows:
5.1) according to the error positions obtained in step 4.4), truncating from the character feature vectors H_L obtained in step 2.2) the input feature vectors for correct text generation:

h_info = [h_s^L, h_mid, h_e^L]

where s and e are the start and end positions of an error, h_s^L is the context information preceding the error, h_e^L is the context information following the error, and h_mid is the error information; when the error start and end positions are adjacent, h_mid is a special vector h_emp learned by the model itself, otherwise h_mid is obtained by averaging the vectors between the error start and end positions. The number of h_info vectors obtained equals the number of errors detected in step 4.4).
5.2) defining a position embedding matrix E'_pos learned by the deep learning model, which controls the characters generated at different positions of the output text; it consists of MAX_LEN vectors of dimension POS_DIM, where MAX_LEN is the maximum length of the generated text. When correcting each error, the j-th vector of the matrix, E'_pos(j) of dimension POS_DIM, is used as the position information for generating the j-th word.
5.3) extracting the features of the correct text using a multi-layer feedforward neural network that combines the error information and the position embedding vector:

h_{i3,j} = MLP([h_info, E'_pos(j)])

where h_{i3,j} is the feature corresponding to the j-th correct word generated for the i3-th error in the text.
5.4) combining the softmax function, mapping the error feature representation into the dictionary-size dimension defined by BERT and taking the word with the highest probability in the dictionary as the generated j-th word:

P_{i3,j} = softmax(E_word^T h_{i3,j})

The weight parameter used by the last layer of the multi-layer feedforward network is the transpose of the word-embedding matrix E_word in BERT.
5.5) during model training and use, MAX_LEN words of corrected text are generated for each error, but only the text before the special character "[EOP]" in the generated output is intercepted as the result; if no "[EOP]" is generated, all of the generated text is taken as the result;
5.6) using the error positions detected in step 4.4) together with the generated correct text to replace the erroneous content in the input text, and deleting the added special characters "[CLS]" and "[SEP]", thereby obtaining the final error correction output;
5.7) during model training, calculating the text generation loss value using the cross entropy loss function:

Loss_gen = Σ_{i3=1}^{m} Σ_{j=1}^{k_{i3}} CrossEntropyLoss(P_{i3,j}, y_{i3,j})

where y_{i3,j} is the j-th word of the correct text corresponding to the i3-th error.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110098015.6A CN112836496B (en) | 2021-01-25 | 2021-01-25 | Text error correction method based on BERT and feedforward neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112836496A true CN112836496A (en) | 2021-05-25 |
CN112836496B CN112836496B (en) | 2024-02-13 |
Family
ID=75931336
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110098015.6A Active CN112836496B (en) | 2021-01-25 | 2021-01-25 | Text error correction method based on BERT and feedforward neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112836496B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111767718A (en) * | 2020-07-03 | 2020-10-13 | 北京邮电大学 | Chinese grammar error correction method based on weakened grammar error feature representation |
CN112131352A (en) * | 2020-10-10 | 2020-12-25 | 南京工业大学 | Method and system for detecting bad information of webpage text type |
WO2021000362A1 (en) * | 2019-07-04 | 2021-01-07 | 浙江大学 | Deep neural network model-based address information feature extraction method |
Non-Patent Citations (1)
Title |
---|
王辰成; 杨麟儿; 王莹莹; 杜永萍; 杨尔弘: "Chinese Grammatical Error Correction Method Based on a Transformer-Enhanced Architecture" (基于Transformer增强架构的中文语法纠错方法), Journal of Chinese Information Processing (中文信息学报), no. 06 *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113743101A (en) * | 2021-08-17 | 2021-12-03 | 北京百度网讯科技有限公司 | Text error correction method and device, electronic equipment and computer storage medium |
CN114065738A (en) * | 2022-01-11 | 2022-02-18 | 湖南达德曼宁信息技术有限公司 | Chinese spelling error correction method based on multitask learning |
CN115169330A (en) * | 2022-07-13 | 2022-10-11 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for correcting and verifying Chinese text |
CN115169330B (en) * | 2022-07-13 | 2023-05-02 | 平安科技(深圳)有限公司 | Chinese text error correction and verification method, device, equipment and storage medium |
CN115547313A (en) * | 2022-09-20 | 2022-12-30 | 海南大学 | Method for controlling sudden stop of running vehicle based on voice of driver |
CN116127953A (en) * | 2023-04-18 | 2023-05-16 | 之江实验室 | Chinese spelling error correction method, device and medium based on contrast learning |
CN116136957A (en) * | 2023-04-18 | 2023-05-19 | 之江实验室 | Text error correction method, device and medium based on intention consistency |
CN116136957B (en) * | 2023-04-18 | 2023-07-07 | 之江实验室 | Text error correction method, device and medium based on intention consistency |
CN116306589A (en) * | 2023-05-10 | 2023-06-23 | 之江实验室 | Method and device for medical text error correction and intelligent extraction of emergency scene |
CN116306589B (en) * | 2023-05-10 | 2024-02-09 | 之江实验室 | Method and device for medical text error correction and intelligent extraction of emergency scene |
Also Published As
Publication number | Publication date |
---|---|
CN112836496B (en) | 2024-02-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112836496B (en) | Text error correction method based on BERT and feedforward neural network | |
CN108959252B (en) | Semi-supervised Chinese named entity recognition method based on deep learning | |
CN110489760B (en) | Text automatic correction method and device based on deep neural network | |
CN111626063B (en) | Text intention identification method and system based on projection gradient descent and label smoothing | |
CN110334354B (en) | Chinese relation extraction method | |
CN110569508A (en) | Method and system for classifying emotional tendencies by fusing part-of-speech and self-attention mechanism | |
CN108717574B (en) | Natural language reasoning method based on word connection marking and reinforcement learning | |
CN110008472B (en) | Entity extraction method, device, equipment and computer readable storage medium | |
CN110263325B (en) | Chinese word segmentation system | |
CN111767718B (en) | Chinese grammar error correction method based on weakened grammar error feature representation | |
CN110008469A (en) | A kind of multi-level name entity recognition method | |
CN114282527A (en) | Multi-language text detection and correction method, system, electronic device and storage medium | |
CN113221571B (en) | Entity relation joint extraction method based on entity correlation attention mechanism | |
CN112463924B (en) | Text intention matching method for intelligent question answering based on internal correlation coding | |
CN114863429A (en) | Text error correction method and training method based on RPA and AI and related equipment thereof | |
CN112417823B (en) | Chinese text word order adjustment and word completion method and system | |
CN112183083A (en) | Abstract automatic generation method and device, electronic equipment and storage medium | |
CN113221542A (en) | Chinese text automatic proofreading method based on multi-granularity fusion and Bert screening | |
CN114692568A (en) | Sequence labeling method based on deep learning and application | |
CN114548099A (en) | Method for jointly extracting and detecting aspect words and aspect categories based on multitask framework | |
CN111898337B (en) | Automatic generation method of single sentence abstract defect report title based on deep learning | |
CN111858894A (en) | Semantic missing recognition method and device, electronic equipment and storage medium | |
CN116681061A (en) | English grammar correction technology based on multitask learning and attention mechanism | |
CN115510230A (en) | Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism | |
CN112131879A (en) | Relationship extraction system, method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |