CN112836496A - Text error correction method based on BERT and feedforward neural network

Text error correction method based on BERT and feedforward neural network

Info

Publication number
CN112836496A
Authority
CN
China
Prior art keywords
text
error
word
bert
correct
Prior art date
Legal status
Granted
Application number
CN202110098015.6A
Other languages
Chinese (zh)
Other versions
CN112836496B (en)
Inventor
潘法昱
曹斌
於其之
Current Assignee
Zhejiang University of Technology ZJUT
Zhejiang Lab
Original Assignee
Zhejiang University of Technology ZJUT
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT, Zhejiang Lab filed Critical Zhejiang University of Technology ZJUT
Priority to CN202110098015.6A
Publication of CN112836496A
Application granted
Publication of CN112836496B
Active legal status
Anticipated expiration

Classifications

    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a text error correction method based on BERT and a feedforward neural network that can quickly and accurately identify and correct errors in large-scale corpora. The method preprocesses the text, semantically encodes it with BERT, uses the overall semantic information of the text to judge whether the text is correct, locates the specific error positions in texts judged incorrect with a sequence labeling method, and finally generates the corresponding correct text with a feedforward neural network using the context information of each error. The text error correction method constructed by the invention is characterized by fast inference and good interpretability.

Description

Text error correction method based on BERT and feedforward neural network
Technical Field
The invention belongs to the field of artificial intelligence and natural language processing, and particularly relates to a text error correction method based on BERT and a feedforward neural network.
Background Art
Text error correction is a natural language processing technique for correcting erroneous content in text; depending on the scenario, its objects include spelling correction, grammar correction, and semantic/pragmatic correction. Spelling correction does not change the length of the text and only corrects wrongly written characters one by one; grammar correction and semantic/pragmatic correction must handle errors such as extra words, missing words, wrong words, and wrong word order, and may change the length of the text.
In recent years, large-scale deep pre-trained language models such as BERT have driven rapid progress in natural language processing: they provide a better initial semantic representation of text for a specific downstream task and reduce the time and cost required for model convergence.
Traditional text error correction mainly uses rule-based methods or translation models. Rule-based methods depend on manually defined substitution dictionaries and can only correct specific errors. Error correction with translation models is currently the mainstream approach, with neural translation models having replaced statistical ones; it treats error correction as a translation problem from a wrong sentence to a correct sentence. Although the results are effective and fluent, this approach requires a large amount of training data and is slow at inference time. In addition, when only spelling errors need correcting, current methods mainly use sequence labeling, which corrects wrongly written characters quickly but is unsuitable for other error types.
Disclosure of Invention
The invention aims to provide a text error correction method based on BERT and a feedforward neural network that addresses the shortcomings of the prior art, using a simple model to identify and correct various errors in text.
The purpose of the invention is realized by the following technical scheme. A text error correction method based on BERT and a feedforward neural network comprises the following steps:
1) Preprocess the text error correction corpus data.
2) BERT-encode the input text preprocessed in step 1) to obtain the feature representation and the semantic representation.
3) Judge whether the text is correct based on the semantic representation of the input text obtained in step 2).
4) Detect the positions of errors in the text based on the feature representation from step 2) and the judgment result from step 3).
5) Generate the correct text corresponding to the erroneous text based on the error positions found in step 4).
Further, in step 1), the data preprocessing method is:
1.1) Perform preprocessing operations on the acquired text data.
1.2) Segment the text: Chinese is segmented character by character; English is segmented into word pieces.
1.3) Add the special character "[CLS]" at the beginning of the text and the special character "[SEP]" at the end.
1.4) If the text is training data, compute the text-level error label, the per-character error-type labels, and the correct text corresponding to each error by comparing the segmented source and target strings.
Further, in step 2), the text representation is encoded with BERT:
2.1) Use the word vectors and position vectors pre-trained by BERT to perform word and position embedding on the input text, obtaining the preliminary vector representation of the text:

H^0 = E_word(X) + E_pos

where E_word is the word embedding matrix and E_pos is the position embedding matrix; the word embedding matrix has size [V, E], V being the vocabulary size defined by BERT and E the embedding dimension, and the position embedding matrix has size [512, E].
2.2) Use the L Transformer layers in BERT to obtain the semantic feature representation of each character, H^L = (h^L_1, ..., h^L_n), computed layer by layer as:

H^l = Transformer(H^{l-1})
2.3) Using "[ CLS]Character corresponding feature
Figure BDA0002915106000000024
Obtaining the text overall semantic representation:
Figure BDA0002915106000000025
Further, in step 3), it is judged whether the text is correct:
3.1) Select the overall semantic representation c output by BERT in step 2.3) as the feature for judging text errors.
3.2) Use a feedforward neural network to map c to a single value, then use the sigmoid function to compute the probability that the text is incorrect:

P_rw = sigmoid(W_rw c + b_rw)

where W_rw and b_rw are weight parameters learned by the deep learning model.
3.3) Compare P_rw with a manually set threshold P̂_rw to judge whether the text is wrong; if P_rw is below the threshold, the text is considered correct.
3.4) For input text judged correct, output it directly as the error correction result without performing the subsequent correction operations.
3.5) During model training, compute the loss of the text error judgment with the binary cross-entropy loss function:

Loss_rw = BCELoss(P_rw, y_rw)

where y_rw, the true text-error value, is obtained by comparing whether the source and target strings are equal.
Further, in step 4), the positions of errors in the text are detected:
4.1) Select the per-character feature representation H^L output by BERT in step 2.2) as the feature for error-type detection.
4.2) Define the type of each character as correct, redundant, correct but missing content after it, or wrongly used; each character corresponds to one of these four types, whose operation labels are keep, delete, add-after, and replace, respectively.
4.3) Use a feedforward neural network combined with the softmax function to sequence-label the input text and detect the operation required for each character:

P^{i1}_sl = softmax(W_sl h^L_{i1} + b_sl)

where the components of P^{i1}_sl are the probabilities that the character should be kept, deleted, added-after, and replaced; the operation with the largest probability is taken as the detection result.
4.4) After obtaining the predicted tag sequence of the input text, derive each error position pos = (s', e') by rule: for a run of consecutive delete tags or consecutive replace tags occupying the interval [s, e], the derived error start s' and end e' are the positions immediately before and after the interval, i.e. s' = s - 1, e' = e + 1; for each add-after tag, the derived start s' is the position s of the tag itself and the end e' is the following position, i.e. e' = s' + 1.
4.5) For input text whose predicted tag sequence consists entirely of keep tags, no subsequent correction is performed; the input text is directly output as the error correction result.
4.6) During model training, compute the sequence labeling loss with the cross-entropy loss function:

Loss_sl = -Σ_{i1=1}^{n} log P^{i1}_sl(y^{i1}_sl)

where y^{i1}_sl is the true operation tag of the i1-th character.
Further, in step 5), the correct text corresponding to the erroneous text is generated:
5.1) According to the error positions obtained in step 4.4), slice the character feature vectors H^L obtained in step 2.2) into the input feature vector for correct-text generation:

h_info = [h^L_s, h_mid, h^L_e]

where s and e are the start and end positions of an error, h^L_s is the context information before the error, h^L_e is the context information after the error, and h_mid is the information of the error itself: when the error start and end positions are adjacent, h_mid is a special model-learned vector h_emp; otherwise h_mid is the mean of the vectors between the error start and end positions, h_mid = mean(h^L_{s+1}, ..., h^L_{e-1}). The number of h_info vectors obtained equals the number of errors detected in step 4.4).
5.2) Define a position embedding matrix E'_pos learned by the deep learning model to control the characters generated at different positions of the output text; it consists of MAX_LEN vectors of dimension POS_DIM, MAX_LEN being the maximum length of the generated text. When correcting each error, the j-th POS_DIM-dimensional vector E'_pos(j) of the matrix is used as the position information for generating the j-th word.
5.3) Use a multilayer feedforward neural network to extract the correct-text features from the error information combined with the position embedding vector:

h_{i3,j} = MLP([h_info, E'_pos(j)])

where h_{i3,j} is the feature for the j-th correct word generated for the i3-th error in the text.
5.4) Combined with the softmax function, map the error feature representation into the dictionary-size dimension defined by BERT, and take the word with the highest probability in the dictionary as the generated j-th word:

P^{i3,j}_cor = softmax(E^T_word h_{i3,j})

The weight parameter of the last layer of the multilayer feedforward network is the transpose of the BERT word embedding matrix E_word.
5.5) During model training and use, MAX_LEN words of corrected text are generated for each error, but only the text before the first special character "[EOP]" is kept as the result; if no "[EOP]" is generated, all generated text is taken as the result.
5.6) Using the error positions detected in step 4.4) and the generated correct text, replace the erroneous content in the input text, then delete the added special characters "[CLS]" and "[SEP]" to obtain the final error correction output.
5.7) During model training, compute the text generation loss with the cross-entropy loss function:

Loss_cor = -Σ_{i3=1}^{m} Σ_{j=1}^{k_{i3}} log P^{i3,j}_cor(t̂_{i3,j})

where m is the number of errors in a text, k_{i3} is the length of the correct text for the i3-th error including the end character "[EOP]", and t̂_{i3,j} is the j-th word of the correct text corresponding to the i3-th error.
The invention has the following beneficial effects: by treating text error correction as sequence labeling, various types of errors, not only spelling errors, can be corrected quickly and accurately; correction is based on BERT, so erroneous texts in large-scale corpora can be corrected and the correct texts generated; and the long inference time of traditional translation-model-based correction is improved, since the serial word-by-word generation of the corrected sentence is replaced by a feedforward neural network that corrects only the erroneous content in parallel.
Drawings
FIG. 1 is a flow chart of the method proposed by the present invention;
FIG. 2 is a diagram of a text error correction model architecture designed by the present invention;
FIG. 3 is a diagram of the internal structure of the BERT model employed in the present invention.
Detailed Description
The method of the present invention will be described in further detail with reference to the accompanying drawings and specific examples.
The text error correction method based on BERT and a feedforward neural network uses deep learning combined with the pre-trained language model BERT, which can effectively extract semantic information such as part of speech and syntactic structure from the text and thus obtain a contextual feature representation of each word. In addition, three differently purposed feedforward neural networks designed by the invention use the extracted feature information to perform text error judgment, error position detection, and correct text generation, respectively; organically combining all the modules achieves the purpose of text error correction. As shown in fig. 1, the method comprises the following steps:
1. Data preprocessing
For the text error correction corpus data, first segment each text: Chinese text is segmented character by character; English text, besides being split on spaces, has each word further segmented into word pieces using statistics from a large-scale English corpus. After segmentation, a stop-word dictionary can be configured according to actual requirements to filter stop words from the text. In addition, the special character "[CLS]" marking the start of the text must be added at the beginning of each text, and the special character "[SEP]" marking the end must be added at the end.
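For illustration, this preprocessing can be reproduced with a standard BERT tokenizer (an assumption: the patent does not name a tokenizer implementation; the HuggingFace BertTokenizer already segments Chinese character by character, splits English into word pieces, and defines the special characters used here):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

def preprocess(text):
    # Segmentation (step 1.2) plus the two special characters (step 1.3).
    return ["[CLS]"] + tokenizer.tokenize(text) + ["[SEP]"]

print(preprocess("保护知识产权"))
# ['[CLS]', '保', '护', '知', '识', '产', '权', '[SEP]']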
For the training data set, three target values for model training must be obtained by comparing the input text X = (x_1, x_2, ..., x_n), i1 = 1~n, with the target text T = (t_1, t_2, ..., t_{n'}), i2 = 1~n': the label y_rw indicating whether the text is correct, the tag sequence Y_sl = (y^1_sl, ..., y^n_sl) marking the error types, and the corrected target texts T̂_{i3}, i3 ∈ [1, m]. Wherein:
The label y_rw ∈ {0, 1} indicating whether the text is correct is computed by judging whether X and T are equal; if they are equal the text is correct and the value is 0; otherwise the sentence is wrong and the value is 1.
Each element y^{i1}_sl ∈ {0, 1, 2, 3} of the tag sequence marking the error types indicates that the i1-th character of the input text X should be kept (0), deleted (1), have characters added after it (2), or be replaced (3). The tags are obtained by comparing X and T: the SequenceMatcher in Python's built-in difflib is used to analyze the operation sequence that converts text X into T. Each operation consists of 5 parts: the operation type, the start position s_X and end position e_X in X, and the start position s_T and end position e_T in T, indicating that x_{s_X}~x_{e_X} is transformed into t_{s_T}~t_{e_T} by the operation. The function involves four operation types, "equal", "delete", "insert", and "replace", corresponding to the four tags. If the operation type is "equal", the words x_{s_X}~x_{e_X} in X and t_{s_T}~t_{e_T} in T are the same, so y^{s_X}_sl ~ y^{e_X}_sl are all set to 0, indicating that these words of X are kept. If the operation type is "delete", the words x_{s_X}~x_{e_X} in X must be deleted, so y^{s_X}_sl ~ y^{e_X}_sl are all set to 1. If the operation type is "insert", the s_T-th ~ e_T-th words of T must be inserted before the s_X-th word of X; in this case s_X and e_X are equal and the operation points between the (s_X - 1)-th and s_X-th words, so y^{s_X - 1}_sl is set to 2, indicating that words must be added after the (s_X - 1)-th word. If the operation type is "replace", the s_X-th ~ e_X-th words of X must be replaced by the s_T-th ~ e_T-th words of T, so y^{s_X}_sl ~ y^{e_X}_sl are all set to 3.
The corrected target texts T̂_{i3} are also obtained from this operation sequence: if the sequence contains m operations whose type is not "equal", there are m target correction texts T̂_1 ~ T̂_m. If the operation type is "insert" or "replace", T̂_{i3} consists of the s_T-th ~ (e_T - 1)-th words of T plus the special end character "[EOP]"; if the operation type is "delete", T̂_{i3} consists of the special characters "[NONE]" and "[EOP]". Y_sl and the T̂_{i3} are in correspondence, reflected between the delete, add, and replace tags and their correction contents for every operation other than "equal". For each target text T̂_{i3}, the last character t̂_{i3,k_{i3}} is always the special end character "[EOP]".
2. BERT coding
The operation of BERT-encoding the input text X is mainly divided into two steps.
The first step is word embedding, which converts each word of X into a vector representation using the two embedding matrices defined by BERT: the word embedding matrix E_word and the position embedding matrix E_pos. The word embedding matrix has size [V, E], where V is the vocabulary size defined by BERT and E is the embedding dimension; the position embedding matrix has size [512, E]. Word embedding in BERT is computed as:

H^0 = E_word(X) + E_pos
The second step is the Transformer-based self-attention encoding module, composed of L Transformer layers; every layer performs the same computation, taking the previous layer's output as its input:

H^l = Transformer(H^{l-1})

where l = 1~L, and the input of the first layer is the word-embedded text H^0 = (h^0_1, ..., h^0_n), i1 = 1~n. Finally, the output H^L = (h^L_1, ..., h^L_n) of BERT's L-th Transformer layer is taken as the feature representation of the input text X.
In addition, BERT pre-trains the output h^L_0 corresponding to the first input character "[CLS]" as the feature vector for next-sentence prediction, so it can serve as the semantic representation c of the whole input text X, computed as:

c = tanh(W_c h^L_0 + b_c)

where tanh is the activation function, W_c is a parameter matrix learned by the model, and b_c is a bias vector learned by the model.
3. Text error judgment
Judge whether the input text X is correct; if it is judged correct, the subsequent error position detection and correction operations are not performed. The judgment uses the semantic representation c of the input text obtained in step 2 together with a feedforward neural network as a binary classification task:

P_rw = sigmoid(W_rw c + b_rw)

where the output of the binary classification task is the error probability P_rw ∈ [0, 1] of the input text; if it is greater than the set threshold P̂_rw the input text is considered wrong, otherwise it is judged correct. W_rw and b_rw are weight parameters learned by the deep learning model; sigmoid is the activation function.
During model training, the binary cross-entropy loss function BCELoss is used to compute the loss of the text error judgment:
Lossrw=BCELoss(Prw,yrw)
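A sketch of this judgment head under the definitions above (PyTorch is an assumption; W_rw and b_rw live in a single linear layer, and c is the pooled representation from the encoding step):

import torch
import torch.nn as nn

class ErrorJudge(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        self.ff = nn.Linear(hidden, 1)  # holds W_rw and b_rw

    def forward(self, c):
        # Map c to one value, then sigmoid: the error probability P_rw.
        return torch.sigmoid(self.ff(c)).squeeze(-1)

judge = ErrorJudge()
P_rw = judge(c)                                    # c from the encoding step
loss_rw = nn.BCELoss()(P_rw, torch.tensor([1.0]))  # training: y_rw = 1 (wrong)
is_wrong = P_rw.item() > 0.5                       # manually set threshold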
4. Error location detection
Detect which positions in the text are wrong and label those positions according to their error types. A sequence labeling approach is used: a feedforward neural network processes the feature representation H^L of the input text obtained in step 2:

P^{i1}_sl = softmax(W_sl h^L_{i1} + b_sl)

where i1 = 1~n; W_sl is a parameter matrix learned by the model and b_sl a bias vector learned by the model. P^{i1}_sl is a vector of 4 elements giving the probabilities of keeping, deleting, adding after, and replacing the i1-th word of the input text; the operation with the largest probability is taken as the sequence labeling result; softmax is the activation function.
During model training, the cross-entropy loss function CrossEntropyLoss is used to compute the sequence labeling loss:

Loss_sl = -Σ_{i1=1}^{n} log P^{i1}_sl(y^{i1}_sl)

After sequence labeling with the feedforward neural network has detected the operation required for each word, the start and end positions of the errors in the text can be computed. Suppose that among the n operations obtained for an input text X of n characters, the operations corresponding to x_s, ..., x_e (e ≥ s) are identical and are not the keep operation. If the operation is delete, the error position is defined as pos = (s - 1, e + 1), where s - 1 and e + 1 are the start and end positions of the error; if the operation is replace, the error position is likewise defined as pos = (s - 1, e + 1); if the operation is add-after, e - s + 1 error positions are defined, pos_{i4} = (s + i4 - 1, s + i4), i4 ∈ [1, e - s + 1], where similarly s + i4 - 1 and s + i4 are the start and end positions. After the positions are computed as above, each is written pos = (s', e') for convenience.
In the invention, the defined error start position is moved one unit before the actual position and the defined end position one unit after it; this makes it convenient to obtain the input data for correct text generation and unifies the processing flow across error types. For extra-word errors and wrong-word errors, which require delete and replace operations, the error position could be defined directly on the erroneous words in the text; but for missing-word errors, the missing words must be inserted between two words that are themselves not wrong, so the error position range must be widened by one word: the start position is the word before the error and the end position the word after it.
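The labeling head and the position-derivation rule can be sketched as follows (illustrative names; H_L is the BERT output from step 2, and the spans follow the widened convention just described):

import torch
import torch.nn as nn

label_head = nn.Linear(768, 4)                 # holds W_sl and b_sl
P_sl = torch.softmax(label_head(H_L), dim=-1)  # [1, n, 4] operation probabilities
ops = P_sl.argmax(dim=-1)[0].tolist()          # predicted operation per character

def error_positions(ops):
    # Convert the tag sequence to (s', e') spans per the rules above:
    # a run of deletes (1) or replaces (3) on [s, e] yields (s - 1, e + 1);
    # each add-after tag (2) at position s yields (s, s + 1).
    spans, i = [], 0
    while i < len(ops):
        if ops[i] in (1, 3):
            s = i
            while i + 1 < len(ops) and ops[i + 1] == ops[s]:
                i += 1
            spans.append((s - 1, i + 1))
        elif ops[i] == 2:
            spans.append((i, i + 1))
        i += 1
    return spans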
5. Correct text generation
After the error positions are known, the correct texts corresponding to the errors are generated in parallel by a feedforward neural network based on the contextual semantic information of each error. This proceeds in three steps.
First, slice the text representation H^L obtained in step 2 according to the error positions obtained in step 4. For an error position pos = (s', e') in the text, first take out the states h^L_{s'} and h^L_{e'} at the error start position s' and end position e', then take out the intermediate state vectors and average them to obtain the feature vector of the error:

h_mid = mean(h^L_{s'+1}, ..., h^L_{e'-1})

where mean denotes averaging. If a missing-word error is being processed in this step, then s' + 1 = e', i.e. there is no intermediate state vector; for this case a variable h_emp, initialized randomly and optimized during model training, is used in place of h_mid. Finally, the three vectors are concatenated to serve as the context information of the erroneous content:

h_info = [h^L_{s'}, h_mid, h^L_{e'}]

Second, extract the correct-text features with a multilayer feedforward neural network from the error information combined with a position embedding vector. Because a correct text composed of several words or special characters must be generated quickly from a single piece of error context information h_info, a new position embedding matrix E'_pos with embedding dimension POS_DIM is set up to distinguish the content of each generated word. The feature h_{i3,j} is extracted with a two-layer feedforward network, each layer also applying an activation function and a normalization operation:

h_{i3,j} = MLP([h_info, E'_pos(j)])

where i3 indexes the i3-th error in the text and j the j-th word generated for that error. When the model is actually used for correction, the number of words composing the correct text for an error is unknown in advance, so generation is uniformly limited to MAX_LEN words during both training and use, and the words before the first special character "[EOP]" in the generated text are kept as the correction result. If no "[EOP]" is generated, all generated text is used as the result.
Third, use a feedforward neural network to map the feature h_{i3,j} to the target correct word:

P^{i3,j}_cor = softmax(E^T_word h_{i3,j})

where j = 1~k_{i3}. The formula maps h_{i3,j} to a vector of size V and normalizes it, i.e. it yields a probability distribution over the dictionary defined by BERT for the j-th word generated when correcting the i3-th error; the word corresponding to the highest probability in the dictionary is taken as the j-th generated word.
Through the above operations, the correct text corresponding to each error detected in step 4 is obtained; combined with the position of the error in the input text, the original text is modified accordingly to obtain the corrected text.
During model training, the cross-entropy loss function CrossEntropyLoss is used to compute the text generation loss:

Loss_cor = -Σ_{i3=1}^{m} Σ_{j=1}^{k_{i3}} log P^{i3,j}_cor(t̂_{i3,j})

where m is the number of errors in a sample and k_{i3} is the length of the correct text for the i3-th error (including the special end character "[EOP]"). MAX_LEN words are output for each error during training, but only the first k_{i3} outputs contribute to the loss.
To demonstrate the universality of the invention across text error types, the proposed method is illustrated with four example texts. Suppose there is a correct input text X1 = "protection intellectual property"; an input text X2 = "protected knowledge" with a missing-word error; an input text X3 = "protection of intellectual property right" with a wrong-word error; and an input text X4 = "protected intellectual property" with an extra-word error. How these four texts are corrected is described below.
As shown in fig. 2 to 3, the method of the present embodiment comprises five steps:
step 1: preprocessing input text, performing word segmentation on Chinese text by taking a word as a unit, and adding a special character ([ CLS ] ") at the beginning of the text and a special character ([ SEP ]") at the end of the text.
The processing results are X1 = ["[CLS]", "protection", "know", "identify", "produce", "weight", "[SEP]"], X2 = ["[CLS]", "protection", "know", "identify", "[SEP]"], X3 = ["[CLS]", "protection", "know", "only", "produce", "weight", "[SEP]"], and X4 = ["[CLS]", "protection", "protection", "know", "identify", "produce", "weight", "[SEP]"].
It is assumed that the error correction model proposed by the present invention is trained and is in the stage of performing error correction by using the model, so that it is not necessary to calculate the target value required by the model training.
Step 2: encode the input text with BERT. Assume the base version of the Chinese pre-trained BERT model is used, whose word embedding dimension is 768, whose vocabulary size is 21128, and which stacks 12 Transformer layers.
The processing flow embeds the words and positions of the text, feeds the embedded text into the Transformer modules, and takes the output of the last Transformer layer, obtaining the overall feature representation c of each text and the semantic representation h^L_{i1}, i1 ∈ [1, n], of each word in the text, where n is the text length after the preprocessing of step 1; for the text X1, for example, the length is 8.
Step 3: judge text errors. Input the overall text representation c obtained in step 2 into a feedforward neural network followed by the sigmoid function to obtain the error probability P_rw ∈ (0, 1) of the text, and compare it with the manually set threshold P̂_rw; if it is greater than the threshold the text is considered wrong, otherwise correct.
For example, if the prediction results of the four texts above are 0.1, 0.8, 0.9, and 0.8 and the threshold is set to 0.5, the first text X1 is considered correct and the other three texts wrong.
For a text predicted correct, the original input text is directly taken as the corrected output of the model, and no subsequent correction is performed.
Step 4: sequence-label the error positions. For the texts detected in step 3 as containing errors, this step detects the specific positions of the errors: a feedforward neural network classifies the semantic representation h^L_{i1} of each word obtained in step 2 and, combined with the softmax function, yields the probabilities P^{i1}_sl of the operation labels for each word; the operation with the largest probability is taken.
The operation types are four: keep, delete, add-after, and replace, numerically represented as 0, 1, 2, and 3. For the text X2, suppose the probability vector corresponding to the word "identify" has its maximum on type 2; the operation to perform on it is then 2, i.e. add words after "identify". Similarly, the operation sequence of text X2 is [0,0,0,0,2,0], that of text X3 is [0,0,0,0,3,0,0], and that of text X4 is [0,0,1,0,0,0,0].
After the operation sequences are obtained, the start and end positions of the errors in the input texts can be computed: the error position of text X2 is (5,6), that of X3 is (4,6), and that of X4 is (2,4).
If no error position is detected for a text in this step, i.e. its operation sequence consists entirely of 0s, the text is considered correct and is directly output as the model's error correction result.
Step 5: generate the correct text. This step requires the semantic representations h^L_{i1} of the words from step 2 and the error positions from step 4. First, the semantic representation vectors between the corresponding positions are taken out according to each error position; the vectors at the two ends are kept as the context information of the error start and end, and the mean of the intermediate vectors is taken as the feature information of the error itself; if there is no intermediate vector, a vector h_emp representing the empty state is used instead. Finally, the three vectors are concatenated into the error representation h_info. For example, the error representation corresponding to text X2 is h_info = [h^L_5, h_emp, h^L_6].
Then h_info is concatenated with the position vector from E'_pos (embedding dimension POS_DIM = 200) and passed through the two-layer feedforward network to obtain the intermediate representation of the j-th word to be generated. With the maximum generation length MAX_LEN set to 3, a corrected text of 3 words is generated for each error; for the error in text X2, the inputs of step 5 are [h_info, E'_pos(1)], [h_info, E'_pos(2)], and [h_info, E'_pos(3)]. Finally, BERT's own word embedding matrix E_word and the softmax function map each intermediate vector to a word in the dictionary, completing the generation of the correct text. The three words finally generated are "produce", "weight", and "[EOP]"; appending the two characters "property right" to the end of the original input text X2 yields the corrected text "protection intellectual property right". For text X3, three words such as "identify", "[EOP]", and "is" would similarly be generated, but only the words before "[EOP]" are taken as the result and the following words are discarded.
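Putting the five modules together at inference time might look like the following sketch; it reuses the tokenizer, bert, judge, label_head, error_positions, and corrector objects from the earlier snippets and assumes the vocabulary has been extended with the special characters "[EOP]" and "[NONE]":

import torch

def decode_until_eop(probs):
    # Argmax word per position; keep only the words before "[EOP]".
    words = tokenizer.convert_ids_to_tokens(probs.argmax(dim=-1).tolist())
    return words[:words.index("[EOP]")] if "[EOP]" in words else words

def correct(text):
    inputs = tokenizer(text, return_tensors="pt")            # step 1
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    out = bert(**inputs)                                     # step 2
    H_L, c = out.last_hidden_state, out.pooler_output
    if judge(c).item() < 0.5:                                # step 3: correct
        return text
    ops = torch.softmax(label_head(H_L), dim=-1).argmax(-1)[0].tolist()
    # Step 4 -> step 5: splice right-to-left so earlier indices stay valid.
    for s, e in reversed(error_positions(ops)):
        words = decode_until_eop(corrector(H_L, s, e))
        tokens[s + 1:e] = [] if words == ["[NONE]"] else words
    return "".join(t for t in tokens if t not in ("[CLS]", "[SEP]"))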
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the scope of the invention is not limited to the specific embodiments and examples recited. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from its spirit, and these changes and combinations remain within the scope of the invention.

Claims (6)

1. A text error correction method based on BERT and a feedforward neural network, characterized by comprising the following steps:
1) Preprocess the text error correction corpus data.
2) BERT-encode the input text preprocessed in step 1) to obtain the feature representation and the semantic representation.
3) Judge whether the text is correct based on the semantic representation of the input text obtained in step 2).
4) Detect the positions of errors in the text based on the feature representation from step 2) and the judgment result from step 3).
5) Generate the correct text corresponding to the erroneous text based on the error positions found in step 4).
2. The text error correction method based on BERT and a feedforward neural network as claimed in claim 1, characterized in that in step 1) the data preprocessing method is:
1.1) Perform preprocessing operations on the acquired text data.
1.2) Segment the text: Chinese is segmented character by character; English is segmented into word pieces.
1.3) Add the special character "[CLS]" at the beginning of the text and the special character "[SEP]" at the end.
1.4) If the text is training data, compute the text-level error label, the per-character error-type labels, and the correct text corresponding to each error by comparing the segmented source and target strings.
3. The text error correction method based on BERT and a feedforward neural network as claimed in claim 2, characterized in that in step 2) the text representation is encoded with BERT:
2.1) Use the word vectors and position vectors pre-trained by BERT to perform word and position embedding on the input text, obtaining the preliminary vector representation of the text:

H^0 = E_word(X) + E_pos

where E_word is the word embedding matrix and E_pos is the position embedding matrix; the word embedding matrix has size [V, E], V being the vocabulary size defined by BERT and E the embedding dimension, and the position embedding matrix has size [512, E].
2.2) Use the L Transformer layers in BERT to obtain the semantic feature representation of each character, H^L = (h^L_1, ..., h^L_n), computed as:

H^l = Transformer(H^{l-1})

2.3) Use the feature h^L_0 corresponding to the "[CLS]" character to obtain the overall semantic representation of the text:

c = tanh(W_c h^L_0 + b_c)
4. The text error correction method based on BERT and a feedforward neural network as claimed in claim 3, characterized in that in step 3) it is judged whether the text is correct:
3.1) Select the overall semantic representation c output by BERT in step 2.3) as the feature for judging text errors.
3.2) Use a feedforward neural network to map c to a single value, then use the sigmoid function to compute the probability that the text is incorrect:

P_rw = sigmoid(W_rw c + b_rw)

where W_rw and b_rw are weight parameters learned by the deep learning model.
3.3) Compare P_rw with a manually set threshold P̂_rw to judge whether the text is wrong; if P_rw is below the threshold, the text is considered correct.
3.4) For input text judged correct, output it directly as the error correction result without performing the subsequent correction operations.
3.5) During model training, compute the loss of the text error judgment with the binary cross-entropy loss function:

Loss_rw = BCELoss(P_rw, y_rw)

where y_rw, the true text-error value, is obtained by comparing whether the source and target strings are equal.
5. The text error correction method based on BERT and a feedforward neural network as claimed in claim 4, characterized in that in step 4) the positions of errors in the text are detected:
4.1) Select the per-character feature representation H^L output by BERT in step 2.2) as the feature for error-type detection.
4.2) Define the type of each character as correct, redundant, correct but missing content after it, or wrongly used; each character corresponds to one of these four types, whose operation labels are keep, delete, add-after, and replace, respectively.
4.3) Use a feedforward neural network combined with the softmax function to sequence-label the input text and detect the operation required for each character:

P^{i1}_sl = softmax(W_sl h^L_{i1} + b_sl)

where the components of P^{i1}_sl are the probabilities that the character should be kept, deleted, added-after, and replaced; the operation with the largest probability is taken as the detection result.
4.4) After obtaining the predicted tag sequence of the input text, derive each error position pos = (s', e') by rule: for a run of consecutive delete tags or consecutive replace tags occupying the interval [s, e], the derived error start s' and end e' are the positions immediately before and after the interval, i.e. s' = s - 1, e' = e + 1; for each add-after tag, the derived start s' is the position s of the tag itself and the end e' is the following position, i.e. e' = s' + 1.
4.5) For input text whose predicted tag sequence consists entirely of keep tags, no subsequent correction is performed; the input text is directly output as the error correction result.
4.6) During model training, compute the sequence labeling loss with the cross-entropy loss function:

Loss_sl = -Σ_{i1=1}^{n} log P^{i1}_sl(y^{i1}_sl)

where y^{i1}_sl is the true operation tag of the i1-th character.
6. The text error correction method based on BERT and a feedforward neural network as claimed in claim 5, characterized in that in step 5) the correct text corresponding to the erroneous text is generated:
5.1) According to the error positions obtained in step 4.4), slice the character feature vectors H^L obtained in step 2.2) into the input feature vector for correct-text generation:

h_info = [h^L_s, h_mid, h^L_e]

where s and e are the start and end positions of an error, h^L_s is the context information before the error, h^L_e is the context information after the error, and h_mid is the information of the error itself: when the error start and end positions are adjacent, h_mid is a special model-learned vector h_emp; otherwise h_mid is the mean of the vectors between the error start and end positions, h_mid = mean(h^L_{s+1}, ..., h^L_{e-1}). The number of h_info vectors obtained equals the number of errors detected in step 4.4).
5.2) Define a position embedding matrix E'_pos learned by the deep learning model to control the characters generated at different positions of the output text; it consists of MAX_LEN vectors of dimension POS_DIM, MAX_LEN being the maximum length of the generated text. When correcting each error, the j-th POS_DIM-dimensional vector E'_pos(j) of the matrix is used as the position information for generating the j-th word.
5.3) Use a multilayer feedforward neural network to extract the correct-text features from the error information combined with the position embedding vector:

h_{i3,j} = MLP([h_info, E'_pos(j)])

where h_{i3,j} is the feature for the j-th correct word generated for the i3-th error in the text.
5.4) Combined with the softmax function, map the error feature representation into the dictionary-size dimension defined by BERT, and take the word with the highest probability in the dictionary as the generated j-th word:

P^{i3,j}_cor = softmax(E^T_word h_{i3,j})

The weight parameter of the last layer of the multilayer feedforward network is the transpose of the BERT word embedding matrix E_word.
5.5) During model training and use, MAX_LEN words of corrected text are generated for each error, but only the text before the first special character "[EOP]" is kept as the result; if no "[EOP]" is generated, all generated text is taken as the result.
5.6) Using the error positions detected in step 4.4) and the generated correct text, replace the erroneous content in the input text, then delete the added special characters "[CLS]" and "[SEP]" to obtain the final error correction output.
5.7) During model training, compute the text generation loss with the cross-entropy loss function:

Loss_cor = -Σ_{i3=1}^{m} Σ_{j=1}^{k_{i3}} log P^{i3,j}_cor(t̂_{i3,j})

where m is the number of errors in a text, k_{i3} is the length of the correct text for the i3-th error including the end character "[EOP]", and t̂_{i3,j} is the j-th word of the correct text corresponding to the i3-th error.
CN202110098015.6A 2021-01-25 2021-01-25 Text error correction method based on BERT and feedforward neural network Active CN112836496B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110098015.6A CN112836496B (en) 2021-01-25 2021-01-25 Text error correction method based on BERT and feedforward neural network

Publications (2)

Publication Number Publication Date
CN112836496A 2021-05-25
CN112836496B 2024-02-13

Family

ID=75931336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110098015.6A Active CN112836496B (en) 2021-01-25 2021-01-25 Text error correction method based on BERT and feedforward neural network

Country Status (1)

Country Link
CN (1) CN112836496B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743101A (en) * 2021-08-17 2021-12-03 北京百度网讯科技有限公司 Text error correction method and device, electronic equipment and computer storage medium
CN114065738A (en) * 2022-01-11 2022-02-18 湖南达德曼宁信息技术有限公司 Chinese spelling error correction method based on multitask learning
CN115169330A (en) * 2022-07-13 2022-10-11 平安科技(深圳)有限公司 Method, device, equipment and storage medium for correcting and verifying Chinese text
CN115547313A (en) * 2022-09-20 2022-12-30 海南大学 Method for controlling sudden stop of running vehicle based on voice of driver
CN116127953A (en) * 2023-04-18 2023-05-16 之江实验室 Chinese spelling error correction method, device and medium based on contrast learning
CN116136957A (en) * 2023-04-18 2023-05-19 之江实验室 Text error correction method, device and medium based on intention consistency
CN116306589A (en) * 2023-05-10 2023-06-23 之江实验室 Method and device for medical text error correction and intelligent extraction of emergency scene

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767718A (en) * 2020-07-03 2020-10-13 北京邮电大学 Chinese grammar error correction method based on weakened grammar error feature representation
CN112131352A (en) * 2020-10-10 2020-12-25 南京工业大学 Method and system for detecting bad information of webpage text type
WO2021000362A1 (en) * 2019-07-04 2021-01-07 浙江大学 Deep neural network model-based address information feature extraction method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021000362A1 (en) * 2019-07-04 2021-01-07 浙江大学 Deep neural network model-based address information feature extraction method
CN111767718A (en) * 2020-07-03 2020-10-13 北京邮电大学 Chinese grammar error correction method based on weakened grammar error feature representation
CN112131352A (en) * 2020-10-10 2020-12-25 南京工业大学 Method and system for detecting bad information of webpage text type

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王辰成; 杨麟儿; 王莹莹; 杜永萍; 杨尔弘: "Chinese Grammatical Error Correction Method Based on Transformer-Enhanced Architecture" (基于Transformer增强架构的中文语法纠错方法), Journal of Chinese Information Processing (中文信息学报), no. 06

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743101A (en) * 2021-08-17 2021-12-03 北京百度网讯科技有限公司 Text error correction method and device, electronic equipment and computer storage medium
CN114065738A (en) * 2022-01-11 2022-02-18 湖南达德曼宁信息技术有限公司 Chinese spelling error correction method based on multitask learning
CN115169330A (en) * 2022-07-13 2022-10-11 平安科技(深圳)有限公司 Method, device, equipment and storage medium for correcting and verifying Chinese text
CN115169330B (en) * 2022-07-13 2023-05-02 平安科技(深圳)有限公司 Chinese text error correction and verification method, device, equipment and storage medium
CN115547313A (en) * 2022-09-20 2022-12-30 海南大学 Method for controlling sudden stop of running vehicle based on voice of driver
CN116127953A (en) * 2023-04-18 2023-05-16 之江实验室 Chinese spelling error correction method, device and medium based on contrast learning
CN116136957A (en) * 2023-04-18 2023-05-19 之江实验室 Text error correction method, device and medium based on intention consistency
CN116136957B (en) * 2023-04-18 2023-07-07 之江实验室 Text error correction method, device and medium based on intention consistency
CN116306589A (en) * 2023-05-10 2023-06-23 之江实验室 Method and device for medical text error correction and intelligent extraction of emergency scene
CN116306589B (en) * 2023-05-10 2024-02-09 之江实验室 Method and device for medical text error correction and intelligent extraction of emergency scene

Also Published As

Publication number Publication date
CN112836496B (en) 2024-02-13

Similar Documents

Publication Publication Date Title
CN112836496B (en) Text error correction method based on BERT and feedforward neural network
CN108959252B (en) Semi-supervised Chinese named entity recognition method based on deep learning
CN110489760B (en) Text automatic correction method and device based on deep neural network
CN111626063B (en) Text intention identification method and system based on projection gradient descent and label smoothing
CN110334354B (en) Chinese relation extraction method
CN110569508A (en) Method and system for classifying emotional tendencies by fusing part-of-speech and self-attention mechanism
CN108717574B (en) Natural language reasoning method based on word connection marking and reinforcement learning
CN110008472B (en) Entity extraction method, device, equipment and computer readable storage medium
CN110263325B (en) Chinese word segmentation system
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN110008469A (en) A kind of multi-level name entity recognition method
CN114282527A (en) Multi-language text detection and correction method, system, electronic device and storage medium
CN113221571B (en) Entity relation joint extraction method based on entity correlation attention mechanism
CN112463924B (en) Text intention matching method for intelligent question answering based on internal correlation coding
CN114863429A (en) Text error correction method and training method based on RPA and AI and related equipment thereof
CN112417823B (en) Chinese text word order adjustment and word completion method and system
CN112183083A (en) Abstract automatic generation method and device, electronic equipment and storage medium
CN113221542A (en) Chinese text automatic proofreading method based on multi-granularity fusion and Bert screening
CN114692568A (en) Sequence labeling method based on deep learning and application
CN114548099A (en) Method for jointly extracting and detecting aspect words and aspect categories based on multitask framework
CN111898337B (en) Automatic generation method of single sentence abstract defect report title based on deep learning
CN111858894A (en) Semantic missing recognition method and device, electronic equipment and storage medium
CN116681061A (en) English grammar correction technology based on multitask learning and attention mechanism
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
CN112131879A (en) Relationship extraction system, method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant