CN115169330A - Method, device, equipment and storage medium for correcting and verifying Chinese text - Google Patents
- Publication number: CN115169330A
- Application number: CN202210824618.4A
- Authority
- CN
- China
- Prior art keywords
- error correction
- text
- model
- error
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/232 — Handling natural language data; natural language analysis; orthographic correction, e.g. spell checking or vowelisation
- G06F40/211 — Natural language analysis; parsing; syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/289 — Natural language analysis; recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
- Y02D10/00 — Climate change mitigation technologies in ICT; energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to artificial intelligence technology and discloses a method for correcting and verifying Chinese text, comprising the following steps: labeling template text onto the texts in an original error correction training text set to obtain a standard error correction training text set; jointly training a two-stage error correction model, comprising a text error recognition model and a text error correction model, with the standard error correction training text set to obtain a standard error correction model; correcting a text to be corrected with the standard error correction model to obtain a corrected text; constructing error correction pairs and identifying the error correction type of each pair; and, based on the error correction type, performing error correction verification on each pair with an edit distance cost method to obtain an error correction verification result. The invention further relates to blockchain technology, and the error correction verification result can be stored in a node of the blockchain. The invention also provides a device for correcting and verifying Chinese text, an electronic device, and a readable storage medium. The invention can solve the problem of low Chinese error correction efficiency.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for correcting and verifying Chinese text, an electronic device and a readable storage medium.
Background
Chinese error correction is an important application of artificial intelligence. Most Chinese error correction methods commonly used in industry traverse every sentence, and mainly fall into two categories. 1. Edit-distance methods compute editing costs (addition, deletion, replacement, and so on) based on an edit distance algorithm and compare each sentence against a library of correct sentences by traversal to complete the correction. This approach is mechanical: it requires a huge, pre-built library of correct sentences; it works at the sentence level, feeding whole sentences into the edit distance calculation, which also computes costs for the parts of a sentence that are already correct, so the calculation is very expensive and the correction process very slow; and it cannot correct sentences or words that are not in the library. Its efficiency is therefore low. 2. Language-model methods perform Chinese character error correction, for example with an encoder-decoder model, but the corrected sentence is decoded token by token in sequence, which is slow; alternatively, a single BERT language model can be used for text error correction, but it must still traverse every single character, or every run of consecutive characters, in a sentence to form a mask so that the model can guess the masked position, which is also very inefficient.
Disclosure of Invention
The invention provides a method and a device for correcting and verifying Chinese text, electronic equipment and a readable storage medium, and mainly aims to solve the problem of low efficiency of correcting Chinese text.
In order to achieve the above object, the present invention provides a method for correcting and verifying a Chinese text, comprising:
acquiring an original error correction training text set, and labeling a template text according to the correctness of the text in the original error correction training text set to obtain a standard error correction training text set;
constructing a two-stage error correction model comprising a text error recognition model and a text error correction model;
performing joint training on the text error recognition model and the text error correction model by using the standard error correction training text set to obtain a standard error correction model;
acquiring a text to be corrected, and correcting the text to be corrected by using the standard error correction model to obtain an error-corrected text;
constructing an error correction pair based on the corrected text, and identifying the error correction type of the error correction pair by using a preset classification model to obtain the error correction type;
and performing error correction verification on the error correction pair by using an edit distance cost method based on the error correction type to obtain an error correction verification result.
Optionally, the constructing a two-stage error correction model including a text error recognition model and a text error correction model includes:
acquiring a first BERT model, and splicing a full connection layer and an output layer behind the first BERT model to obtain the text error recognition model;
and acquiring a second BERT model as the text error correction model, and connecting the text error correction model in series after the text error recognition model to obtain the two-stage error correction model.
Optionally, the performing, by using the standard error correction training text set, joint training on the text error recognition model and the text error correction model to obtain a standard error correction model includes:
performing iterative training on the text error recognition model by using the standard error correction training text set;
outputting standard word vectors corresponding to sentences in the standard error correction training text set by using the trained text error recognition model;
copying and combining the standard word vectors, and performing attention training on the text error correction model based on the copied combined word vectors and a preset loss function;
and summarizing the trained text error recognition model and the trained text error correction model to obtain the two-stage error correction model.
Optionally, the iteratively training the text error recognition model by using the standard error correction training text set includes:
converting sentences in the standard error correction training text set into word vectors by using the first BERT model, and masking preset positions in the word vectors to obtain masked word vectors;
extracting a standard word vector from the masked word vectors by using the fully connected layer, and outputting a predicted value of the standard word vector by using the output layer;
and calculating a loss value based on the predicted value; if the loss value is greater than or equal to a preset loss threshold, updating the parameters of the first BERT model and returning to the step of converting sentences in the standard error correction training text set into word vectors with the first BERT model; stopping training when the loss value is less than the preset loss threshold, to obtain the trained text error recognition model.
Optionally, the performing error correction on the text to be corrected by using the standard error correction model to obtain an error-corrected text includes:
identifying the error probability of the text to be corrected by using a text error identification model in the standard error correction model;
if the error probability is smaller than a preset error threshold value, the text to be corrected is not processed;
and if the error probability is larger than or equal to the error threshold, performing text error correction on the text to be corrected by using a text error correction model in the standard error correction model to obtain an error-corrected text.
Optionally, the constructing an error correction pair based on the corrected text, and performing error correction type identification on the error correction pair by using a preset classification model to obtain an error correction type includes:
performing word segmentation processing on the corrected text and the text to be corrected corresponding to the corrected text;
and extracting words related to error correction after word segmentation to form an error correction pair, and outputting the error correction type of the error correction pair by using the classification model.
Optionally, the performing, based on the error correction type, error correction verification on the error correction pair by using an edit distance cost method to obtain an error correction verification result includes:
if the error correction type is the first error correction type, calculating the editing cost of the error correction pair by using an editing distance cost method of adjusting characters;
if the error correction type is a second error correction type, calculating the editing cost of the error correction pair by using a keyboard-level editing distance cost method;
and determining that the error correction verification result of the error correction pair with the editing cost less than or equal to the preset cost threshold is error correction success, and determining that the error correction verification result of the error correction pair with the editing cost more than the preset cost threshold is error correction failure.
In order to solve the above problems, the present invention further provides a device for correcting and verifying a Chinese text, the device comprising:
the error correction model training module is used for acquiring an original error correction training text set, labeling a template text according to the correctness of the text in the original error correction training text set to obtain a standard error correction training text set, constructing a two-stage error correction model comprising a text error recognition model and a text error correction model, and performing combined training on the text error recognition model and the text error correction model by using the standard error correction training text set to obtain a standard error correction model;
the text error correction module is used for acquiring a text to be corrected, and correcting the text to be corrected by using the standard error correction model to obtain an error-corrected text;
the error correction type identification module is used for constructing an error correction pair based on the error corrected text and identifying the error correction type of the error correction pair by utilizing a preset classification model to obtain the error correction type;
and the error correction verification module is used for performing error correction verification on the error correction pair by utilizing an edit distance cost method based on the error correction type to obtain an error correction verification result.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one computer program; and
a processor that executes the computer program stored in the memory to implement the Chinese text error correction and verification method.
In order to solve the above problem, the present invention further provides a computer-readable storage medium in which at least one computer program is stored, the at least one computer program being executed by a processor in an electronic device to implement the method for correcting and verifying Chinese text as described above.
According to the method, labeling template text according to the correctness of the texts in the original error correction training text set yields a standard error correction training text set carrying richer information. This set is used to jointly train a two-stage error correction model comprising a text error recognition model and a text error correction model, and the trained standard error correction model then recognizes and corrects the text to be corrected, improving both the accuracy and the efficiency of text error correction. Meanwhile, error correction pairs are constructed from the corrected text and verified with an edit distance cost method. Therefore, the method, device, electronic equipment and computer-readable storage medium for correcting and verifying Chinese text can solve the problem of low Chinese text error correction efficiency.
Drawings
FIG. 1 is a flow chart of a method for correcting and verifying Chinese text according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart showing a detailed implementation of one of the steps in FIG. 1;
FIG. 3 is a schematic flow chart showing another step of FIG. 1;
FIG. 4 is a schematic flow chart showing another step of FIG. 1;
FIG. 5 is a schematic flow chart showing another step in FIG. 1;
FIG. 6 is a functional block diagram of an apparatus for correcting and verifying Chinese text according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device for implementing the method for correcting and verifying Chinese text according to an embodiment of the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the invention provides a method for correcting and verifying Chinese text. The execution subject of the method includes, but is not limited to, at least one of the electronic devices, such as a server or a terminal, that can be configured to execute the method provided by the embodiment of the present invention. In other words, the method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes, but is not limited to, a single server, a server cluster, a cloud server, a cloud server cluster, and the like. The server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Fig. 1 is a schematic flow chart of a method for correcting and verifying Chinese text according to an embodiment of the present invention. In this embodiment, the method comprises the following steps S1 to S6:
s1, obtaining an original error correction training text set, and labeling a template text according to the correctness of the text in the original error correction training text set to obtain a standard error correction training text set.
In the embodiment of the present invention, the original error correction training text set is manually labeled text data whose format pairs an incorrect sentence with its correct counterpart, for example "my living body is good → my body is good" (a literal rendering of a Chinese near-homophone error). The labeling template text is constructed according to the correctness of the sentence; the template text takes the form "this sentence is < >" and is used to mark whether the sentence is correct.
In detail, the labeling template text according to the correctness of the text in the original error correction training text set to obtain a standard error correction training text set includes:
correctly labeling correct sentences in the original error correction training text set, and incorrectly labeling wrong sentences in the original error correction training text set;
summarizing all the labeled sentence texts to obtain the standard error correction training text set.
In an embodiment of the present invention, the template text is used to label different sentences, for example, the wrong sentence is labeled as "this sentence is wrong", and the correct sentence is labeled as "this sentence is right".
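This labeling step can be sketched as follows. The data format and the English template strings are assumptions based on the examples above, standing in for the patent's Chinese templates:

```python
def build_standard_set(pairs):
    """Attach a correctness template to each sentence.

    `pairs` is a list of (wrong_sentence, correct_sentence) tuples, as in
    the original error correction training set. The template wording is a
    hypothetical English rendering of the Chinese labels.
    """
    labeled = []
    for wrong, correct in pairs:
        labeled.append((wrong, "this sentence is wrong"))
        labeled.append((correct, "this sentence is right"))
    return labeled
```

The result is the standard error correction training text set: every sentence, correct or not, now carries a template whose answer slot the recognition model will later predict.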
And S2, constructing a two-stage error correction model comprising a text error recognition model and a text error correction model.
In the embodiment of the present invention, the text error recognition model and the text error correction model are masked language models (MLM) based on the BERT model, wherein the text error recognition model is used to recognize whether a sentence contains wrongly written characters, and the text error correction model is used to correct the misspelled parts.
In detail, referring to fig. 2, the constructing a two-stage error correction model including a text error recognition model and a text error correction model includes the following steps S20 to S21:
s20, obtaining a first BERT model, and splicing a full connection layer and an output layer behind the first BERT model to obtain the text error recognition model;
and S21, acquiring a second BERT model as the text error correction model, and connecting the text error correction model in series after the text error recognition model to obtain the two-stage error correction model.
In the embodiment of the invention, the first BERT model is a conventional BERT model: its input is a text whose characters are cut apart in order, and it outputs the word vectors together with the [CLS] feature vector. The fully connected layer consists of two MLP networks (a single MLP, as shown in the figure, consists of two linear layers and a ReLU activation function) used to further extract features. The output layer includes a sigmoid activation function for calculating the probability of a spelling error. The input of the second BERT model is the word vectors of a text, and its output is the correct sentence.
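The flow from the [CLS] feature through the two MLPs to the sigmoid output can be sketched with toy dimensions. All weights below are illustrative placeholders, not trained values, and each MLP block follows the two-linear-layers-plus-ReLU shape described above:

```python
import math

def linear(vec, weights, bias):
    # One linear layer: `weights` is a list of rows, one row per output unit.
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def relu(vec):
    return [max(0.0, x) for x in vec]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp(vec, layer1, layer2):
    # One MLP block: two linear layers with a ReLU in between.
    return linear(relu(linear(vec, *layer1)), *layer2)

def error_probability(cls_vec, mlp1, mlp2):
    # Two MLP blocks extract features, then sigmoid turns the final
    # scalar into a spelling-error probability in (0, 1).
    h = mlp(cls_vec, *mlp1)
    h = mlp(h, *mlp2)
    return sigmoid(h[0])
```

In the patent the feature vector is 512-dimensional; the 2-dimensional toy input here only illustrates the wiring.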
And S3, performing combined training on the text error recognition model and the text error correction model by using the standard error correction training text set to obtain a standard error correction model.
In the embodiment of the invention, because the texts of the standard error correction training text set include template text labeled according to text correctness, the masking training of the BERT model can improve the accuracy of error recognition.
In detail, referring to fig. 3, the jointly training the text error recognition model and the text error correction model by using the standard error correction training text set to obtain the standard error correction model includes the following steps S30 to S33:
s30, performing iterative training on the text error recognition model by using the standard error correction training text set;
s31, outputting a standard word vector corresponding to a sentence in the standard error correction training text set by using the trained text error recognition model;
s32, copying and combining the standard word vectors, and performing attention training on the text error correction model based on the copied combined word vectors and a preset loss function;
and S33, summarizing the trained text error recognition model and the trained text error correction model to obtain the two-stage error correction model.
In the embodiment of the present invention, the preset loss function may be a cross-entropy loss function. After the trained text error recognition model outputs the standard word vectors corresponding to a sentence in the standard error correction training text set, the standard word vectors are copied into two parts used as Q and K in the second BERT model, and the result of g × w is used as V, where g denotes the mean value of each word vector and w is a parameter matrix of size L × D, with D being 512 dimensions and L the number of characters in the input text. Self-attention is computed in the second BERT model with <Q, K, V>, and the second BERT model is iteratively trained based on the cross-entropy loss function to obtain the trained text error correction model.
Specifically, the iteratively training the text error recognition model by using the standard error correction training text set includes:
converting sentences in the standard error correction training text set into word vectors by using the first BERT model, and masking preset positions in the word vectors to obtain masked word vectors;
extracting a standard word vector from the masked word vectors by using the fully connected layer, and outputting a predicted value of the standard word vector by using the output layer;
and calculating a loss value based on the predicted value; if the loss value is greater than or equal to a preset loss threshold, updating the parameters of the first BERT model and returning to the step of converting sentences in the standard error correction training text set into word vectors with the first BERT model; stopping training when the loss value is less than the preset loss threshold, to obtain the trained text error recognition model.
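The loop described in the steps above — compute a loss, update parameters and return to the conversion step while the loss stays at or above the threshold, stop once it drops below — has this generic shape. The loss and update functions are stand-ins for the BERT forward pass and parameter update:

```python
def train_until_converged(loss_fn, update_fn, params, loss_threshold,
                          max_iters=1000):
    """Iterate: compute the loss; if it is still >= the threshold, update
    the parameters and return to the forward step; otherwise stop."""
    for _ in range(max_iters):
        loss = loss_fn(params)
        if loss < loss_threshold:
            break
        params = update_fn(params)
    return params
```

The `max_iters` guard is an added safety net not mentioned in the patent, which stops only on the loss threshold.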
In an alternative embodiment of the present invention, the text error recognition model is in effect a binary classifier. For example, "my living body is good" is input into the first BERT model, and the output masked word vector consists of three parts. The first part is the [CLS] vector, used to judge whether the sentence contains a spelling error. The second part is the word vector of each character: "my living body is good" corresponds to six Chinese characters, forming a 6 × D matrix in which each row is the word vector of one character, with D being 512 dimensions. An average is computed over each word vector to obtain g, and g is multiplied by another matrix w. The matrix w is learnable, i.e. its values are updated at each training iteration, and after learning it reflects how salient each character is. For a given input sentence, the probability of error differs from character to character: in the sentence rendered as "my living body is good", the characters for "my" and "good" are unlikely to be wrong, while the mistaken character (the "living" of "living body", written in place of "body") is the likely error, and w can learn this distribution from the training data. The third part is the masked template. That is, the masked word vector output by the first BERT model corresponds to the sequence: [CLS], my, living, body, very, good, [SEP], this, sentence, is, <Masked>, [SEP]. Compared with the conventional BERT model, this is equivalent to splicing the [CLS] and [Masked] vectors together as a new [CLS] vector.
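The input assembly just described — sentence characters followed by the masked template — can be sketched as follows; the token strings are English stand-ins for the Chinese characters and template:

```python
def build_masked_input(chars):
    """Assemble the recognition-model input: [CLS], the sentence
    characters, [SEP], then the correctness template with its answer
    slot masked (format assumed from the example in the description)."""
    template = ["this", "sentence", "is", "<Masked>"]
    return ["[CLS]"] + list(chars) + ["[SEP]"] + template + ["[SEP]"]
```

For the example sentence this yields the [CLS] … <Masked> … [SEP] sequence shown above, with the model left to predict the template's answer slot.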
In an optional embodiment of the invention, after the [CLS] feature vector (i.e. the masked word vector) is obtained, it is fed into the two MLP fully connected networks, and the probability of a spelling error is calculated by the sigmoid activation function: the [CLS] feature vector passes through the sigmoid function, whose output is a decimal between 0 and 1 representing the error probability. A threshold of 0.72 may be used; if the output is greater than 0.72 a spelling error is deemed present, and if it is less, not. At the same time, a cross-entropy loss function is used to calculate the loss value.
In an alternative embodiment of the present invention, the calculation of the cross-entropy loss function is a conventional technique and is not described here again.
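For completeness, the binary form of that conventional cross-entropy loss, matching the sigmoid output of the recognition model, looks like this:

```python
import math

def binary_cross_entropy(p, y, eps=1e-12):
    """y is the true label (1 = spelling error present), p the sigmoid
    output; eps clamps p away from 0 and 1 to keep log() finite."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```

The loss is near zero when the prediction agrees with the label and grows without bound as the prediction approaches the wrong extreme.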
S4, obtaining a text to be corrected, and correcting the text to be corrected by using the standard error correction model to obtain the corrected text.
In the embodiment of the invention, the standard error correction model can be used for carrying out text error identification and text error correction on the text to be corrected.
Further, referring to fig. 4, the performing error correction on the text to be corrected by using the standard error correction model to obtain an error-corrected text includes the following steps S40 to S42:
s40, identifying the error probability of the text to be corrected by using a text error identification model in the standard error correction model;
s41, if the error probability is smaller than a preset error threshold value, the text to be corrected is not processed;
and S42, if the error probability is larger than or equal to the error threshold, performing text error correction on the text to be corrected by using a text error correction model in the standard error correction model to obtain an error-corrected text.
In an alternative embodiment of the present invention, the error threshold may be 0.72. Because only sentences recognized as erroneous are corrected, correct sentences are spared needless correction and the efficiency of text error correction is improved.
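The gating logic of steps S40 to S42 is a simple threshold check in front of the second stage. The recognize/correct callables below are stubs standing in for the two BERT stages:

```python
ERROR_THRESHOLD = 0.72  # the threshold value named in the description

def correct_if_needed(text, recognize, correct, threshold=ERROR_THRESHOLD):
    """Stage 1 scores the text; stage 2 runs only when an error is likely.

    `recognize` returns an error probability in [0, 1];
    `correct` returns the corrected sentence.
    """
    if recognize(text) < threshold:
        return text          # leave the text to be corrected unprocessed
    return correct(text)
```

This is what keeps the two-stage design efficient: the expensive correction model never sees sentences the recognizer already judged correct.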
S5, constructing an error correction pair based on the corrected text, and identifying the error correction type of the error correction pair by using a preset classification model to obtain the error correction type.
In the embodiment of the present invention, the preset classification model may also be a BERT model.
In detail, the constructing an error correction pair based on the corrected text, and performing error correction type identification on the error correction pair by using a preset classification model to obtain an error correction type includes:
performing word segmentation processing on the corrected text and the text to be corrected corresponding to the corrected text;
and extracting words related to error correction after word segmentation to form an error correction pair, and outputting the error correction type of the error correction pair by using the classification model.
In an optional embodiment of the invention, the corrected text needs a second verification. The original sentence and the sentence corrected by the two-stage BERT are both segmented into words, the differing positions of the two segmentations are compared, and the words involved in the correction are extracted, which reveals exactly which part of the sentence was corrected. The extracted words are then input into the BERT model (i.e. the classification model), and a vector is output via the BERT model's [CLS]. For example, given a modification "original word → corrected word", the BERT input is: [CLS] original word [SEP] corrected word [SEP]. The [CLS] vector output by the BERT model represents the relation feature vector of the original word and the corrected word, and feeding this [CLS] vector into softmax for class classification yields the error correction type of the error correction pair.
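The extraction of the differing positions can be sketched as a position-by-position comparison of the two segmentations. The sketch assumes the two segmentations align one-to-one, which the patent does not guarantee for corrections that change word boundaries:

```python
def extract_correction_pairs(orig_words, corrected_words):
    """Compare two segmented sentences position by position and pull out
    the (original word, corrected word) pairs that differ."""
    return [(o, c) for o, c in zip(orig_words, corrected_words) if o != c]
```

Each resulting pair becomes one "[CLS] original word [SEP] corrected word [SEP]" input to the classification model.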
And S6, carrying out error correction verification on the error correction pair by utilizing an edit distance cost method based on the error correction type to obtain an error correction verification result.
In the embodiment of the present invention, the error correction types include a first error correction type and a second error correction type. The first error correction type includes homophone word correction (e.g., a word meaning 'eye' corrected to the identically pronounced 'glasses') and confusable-sound word correction (e.g., 'wandering weaver girl' corrected to 'cowherd and weaver girl'). The second error correction type includes shape-alike character correction (e.g., 'sorghum' written with a visually similar but wrong character).
In detail, referring to fig. 5, the performing error correction verification on the error correction pair by using an edit distance cost method based on the error correction type to obtain an error correction verification result includes the following steps S60 to S62:
S60, if the error correction type is the first error correction type, calculating the editing cost of the error correction pair by using the character-adjustment edit distance cost method;
S61, if the error correction type is the second error correction type, calculating the editing cost of the error correction pair by using the keyboard-level edit distance cost method;
S62, determining that the error correction verification result of an error correction pair whose editing cost is less than or equal to a preset cost threshold is error correction success, and determining that the error correction verification result of an error correction pair whose editing cost is greater than the preset cost threshold is error correction failure.
In the embodiment of the invention, for the first error correction type, the original-sentence words and the error correction words are converted into pinyin, and the edit distance is calculated on the pinyin; character adjustment includes character addition, character deletion, character replacement, and the like. The character-adjustment edit distance cost method and the keyboard-level edit distance cost method are well known in the art and are not described herein again.
In an optional embodiment of the present invention, the preset cost threshold may be 1.
In the embodiment of the invention, two edit distance verification methods are used to reconfirm the corrected part, which effectively improves the error correction capability and reduces the probability of correcting one error into another. Sound errors and shape errors are also treated differently, which improves error correction accuracy. Because edit distance computation is expensive, if only the traditional edit distance method were used, the input would be whole sentences rather than words, which would greatly increase the computational cost.
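The two verification branches can be sketched as follows. This is a minimal sketch under stated assumptions: the pinyin conversion is omitted (the inputs are assumed to already be pinyin strings), and the keyboard neighbourhood table is a tiny illustrative stand-in for a full layout; only the edit distance logic and the cost threshold of 1 follow the text.

```python
# Classic Levenshtein distance with unit insert/delete costs and a
# pluggable substitution cost (keyboard-level when sub_cost is given).
def edit_distance(a: str, b: str, sub_cost=None) -> float:
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                rep = d[i - 1][j - 1]
            else:
                c = sub_cost(a[i - 1], b[j - 1]) if sub_cost else 1.0
                rep = d[i - 1][j - 1] + c
            d[i][j] = min(d[i - 1][j] + 1.0, d[i][j - 1] + 1.0, rep)
    return d[m][n]

# Tiny illustrative QWERTY neighbourhood (assumption, not a full layout).
NEIGHBOURS = {("q", "w"), ("w", "e"), ("a", "s"), ("s", "d")}

def keyboard_cost(x: str, y: str) -> float:
    # Adjacent keys are cheaper to substitute than distant ones.
    return 0.5 if (x, y) in NEIGHBOURS or (y, x) in NEIGHBOURS else 1.0

COST_THRESHOLD = 1.0  # preset cost threshold from the text

def verify(orig_pinyin: str, corrected_pinyin: str, first_type: bool) -> bool:
    # First type: plain edit distance over pinyin; second type: keyboard-level.
    cost = (edit_distance(orig_pinyin, corrected_pinyin) if first_type
            else edit_distance(orig_pinyin, corrected_pinyin, keyboard_cost))
    return cost <= COST_THRESHOLD      # True => error correction success

print(verify("jing", "jing", True))    # identical pinyin, cost 0 -> success
print(verify("qat", "wat", False))     # adjacent-key substitution, cost 0.5
```

Running the edit distance on short word-level strings like these, rather than whole sentences, is exactly the cost saving the text describes.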
According to the method, labeling the template text according to the correctness of the text in the original error correction training text set yields a standard error correction training text set with richer information. The standard error correction training text set is used for joint training of a two-stage error correction model comprising a text error recognition model and a text error correction model, and the standard error correction model obtained by training is used to recognize and correct the text to be corrected, which improves the accuracy and efficiency of text error correction. Meanwhile, an error correction pair is constructed based on the corrected text, and error correction verification is performed on the error correction pair by using an edit distance cost method. Therefore, the method for Chinese text error correction and verification provided by the invention can solve the problem of low efficiency of Chinese text error correction.
Fig. 6 is a functional block diagram of a Chinese text error correction and verification apparatus according to an embodiment of the present invention.
The apparatus 100 for Chinese text error correction and verification according to the present invention can be installed in an electronic device. According to the implemented functions, the Chinese text error correction and verification apparatus 100 may include an error correction model training module 101, a text error correction module 102, an error correction type identification module 103, and an error correction verification module 104. A module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device, can perform a fixed function, and are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the error correction model training module 101 is configured to obtain an original error correction training text set, label a template text according to correctness of a text in the original error correction training text set to obtain a standard error correction training text set, construct a two-stage error correction model including a text error recognition model and a text error correction model, and perform joint training on the text error recognition model and the text error correction model by using the standard error correction training text set to obtain a standard error correction model;
the text error correction module 102 is configured to obtain a text to be error corrected, and correct the text to be error corrected by using the standard error correction model to obtain an error corrected text;
the error correction type identification module 103 is configured to construct an error correction pair based on the error-corrected text, and perform error correction type identification on the error correction pair by using a preset classification model to obtain an error correction type;
the error correction verification module 104 is configured to perform error correction verification on the error correction pair by using an edit distance cost method based on the error correction type, so as to obtain an error correction verification result.
In detail, the embodiments of the modules of the Chinese text error correction and verification apparatus 100 are as follows:
the method comprises the steps of firstly, obtaining an original error correction training text set, and labeling a template text according to the correctness of texts in the original error correction training text set to obtain a standard error correction training text set.
In the embodiment of the present invention, the original error correction training text set is manually labeled text data whose format pairs an incorrect sentence with its correct counterpart, for example, the incorrect 'my living body is good' paired with the correct 'my body is good'. A labeling template text is constructed according to sentence correctness; the template text is 'the sentence is < >', and it is used to distinguish whether a sentence is correct.
In detail, the labeling template text according to the correctness of the text in the original error correction training text set to obtain a standard error correction training text set includes:
labeling the correct sentences in the original error correction training text set as correct, and labeling the wrong sentences in the original error correction training text set as wrong;
summarizing all the labeled sentence texts to obtain the standard error correction training text set.
In an embodiment of the present invention, the template text is used to label different sentences, for example, the wrong sentence is labeled as "the sentence is wrong" and the correct sentence is labeled as "the sentence is right".
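Assuming the original set is stored as (wrong sentence, correct sentence) pairs, the labeling step just described can be sketched as follows; the English label strings render the patent's 'the sentence is < >' template and its two fillings.

```python
# Sketch of the template-labeling step. The pair format follows the example
# in the text; the label wording renders "the sentence is < >" in English.
def label_with_template(pairs):
    labeled = []
    for wrong, correct in pairs:
        labeled.append((wrong, "the sentence is wrong"))
        labeled.append((correct, "the sentence is right"))
    return labeled  # summarized standard error correction training text set

pairs = [("my living body is good", "my body is good")]
for sentence, label in label_with_template(pairs):
    print(sentence, "->", label)
```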
And step two, constructing a two-stage error correction model comprising a text error recognition model and a text error correction model.
In the embodiment of the present invention, the text error recognition model and the text error correction model are masked language models (MLM) based on the BERT model, where the text error recognition model is used to recognize whether a sentence contains wrongly written characters, and the text error correction model is used to correct the misspelled part.
In detail, the constructing a two-stage error correction model including a text error recognition model and a text error correction model includes:
acquiring a first BERT model, and appending a fully connected layer and an output layer after the first BERT model to obtain the text error recognition model;
and acquiring a second BERT model as the text error correction model, and connecting the text error recognition model and the text error correction model in series to obtain the two-stage error correction model.
In the embodiment of the invention, the first BERT model is a conventional BERT model: the input is a text, the characters in the text are cut out separately in character order, and the first BERT model outputs the word vectors and the [CLS] feature vector. The fully connected layer consists of two MLP fully connected networks (the structure of a single MLP network, shown in the figure above, consists of two linear layers and a ReLU activation function) and is used for further feature extraction. The output layer comprises a sigmoid activation function used to calculate the probability of a spelling error. The input of the second BERT model is the word vectors of the text, and the output is the correct sentence.
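The recognition head described above (two MLP fully connected networks over the [CLS] vector, followed by a sigmoid output) can be sketched with NumPy. This is a structural sketch only: the hidden size and the random weights are assumptions, and a real implementation would train them.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 512, 128          # D from the text; hidden size H is an assumption

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_block(x, w1, b1, w2, b2):
    # One MLP network: linear -> ReLU -> linear (as described in the text).
    return relu(x @ w1 + b1) @ w2 + b2

params = [
    (rng.normal(0, 0.02, (D, H)), np.zeros(H),
     rng.normal(0, 0.02, (H, D)), np.zeros(D)),   # first MLP: D -> D
    (rng.normal(0, 0.02, (D, H)), np.zeros(H),
     rng.normal(0, 0.02, (H, 1)), np.zeros(1)),   # second MLP: D -> 1
]

def error_probability(cls_vector):
    h = mlp_block(cls_vector, *params[0])
    logit = mlp_block(h, *params[1])
    return float(sigmoid(logit)[0])    # spelling-error probability in (0, 1)

p = error_probability(rng.normal(size=D))
print(0.0 < p < 1.0)
```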
And step three, performing joint training on the text error recognition model and the text error correction model by using the standard error correction training text set to obtain a standard error correction model.
In the embodiment of the invention, because the text of the standard error correction training text set comprises the template text based on the text correctness label, the accuracy of error correction identification can be improved through the masking training of the BERT model.
In detail, the jointly training the text error recognition model and the text error correction model by using the standard error correction training text set to obtain a standard error correction model, includes:
performing iterative training on the text error recognition model by using the standard error correction training text set;
outputting standard word vectors corresponding to sentences in the standard error correction training text set by using the trained text error recognition model;
copying and combining the standard word vectors, and performing attention training on the text error correction model based on the copied combined word vectors and a preset loss function;
and summarizing the trained text error recognition model and the trained text error correction model to obtain the two-stage error correction model.
In the embodiment of the present invention, the preset loss function may be a cross-entropy loss function. After the trained text error recognition model outputs the standard word vectors corresponding to a sentence in the standard error correction training text set, the standard word vectors are copied into two copies and used as Q and K in the second BERT model, and the result of g × w is used as V, where g represents the mean of the word vectors, w is a parameter matrix of size L × D, D is 512 dimensions, and L is the number of words in the input text. Self-attention is computed in the second BERT model using <Q, K, V>, and the second BERT model is iteratively trained based on the cross-entropy loss function to obtain the trained text error correction model.
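The Q/K/V construction can be illustrated numerically. The text is ambiguous about whether g is a mean vector or a per-word mean; this sketch takes the per-word mean (matching the later phrase 'an average value is calculated for each word vector'), and all sizes are toy stand-ins for real BERT outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 6, 512                          # words in the sentence, hidden size

word_vectors = rng.normal(size=(L, D)) # stand-in for the recognition model output
Q = word_vectors.copy()                # first copy
K = word_vectors.copy()                # second copy
g = word_vectors.mean(axis=1, keepdims=True)  # per-word mean, shape (L, 1)
w = rng.normal(size=(L, D))            # learnable parameter matrix, L x D
V = g * w                              # elementwise g x w, shape (L, D)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Scaled dot-product self-attention over <Q, K, V>.
attn = softmax(Q @ K.T / np.sqrt(D)) @ V
print(attn.shape)
```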
Specifically, the iteratively training the text error recognition model by using the standard error correction training text set includes:
converting sentences in the standard error correction training text set into word vectors by using the first BERT model, and masking preset positions in the word vectors to obtain masked word vectors;
extracting a standard word vector from the masked word vectors by using the fully connected layer, and outputting a predicted value of the standard word vector by using the output layer;
calculating a loss value based on the predicted value; if the loss value is greater than or equal to a preset loss threshold, updating the parameters in the first BERT model and returning to the step of converting sentences in the standard error correction training text set into word vectors by using the first BERT model; stopping the training when the loss value is less than the preset loss threshold, to obtain the trained text error recognition model.
In an alternative embodiment of the present invention, the text error recognition model is in fact a binary classifier. For example, 'my living body is good' (a six-character sentence containing one wrong character) is input into the first BERT model, and the output masked word vector consists of three parts. The first part is the [CLS] vector, which is used to judge whether the sentence contains a spelling error. The second part is the word vector of each character: the six characters form a 6 × D matrix in which each row is the word vector of one character, where D is 512 dimensions; an average is computed for each word vector to obtain g, and g is multiplied by another matrix w. The matrix w is learnable, i.e., its values can be updated at each training iteration, and after learning it reflects a salience value for each character. Because different characters in an input sentence have different probabilities of being wrong (in this example, characters such as 'my' and 'good' are rarely mistyped, while the erroneous character is easily confused with the correct 'body' character), w can learn this distribution from the training data. The third part is the masked template, so that the masked word vector output by the first BERT model is: [CLS], my, living, body, very, good, [SEP], this, sentence, is, <Masked>, [SEP]. Compared with the traditional BERT model, this is equivalent to concatenating the [CLS] and [Masked] vectors and using the result as a new [CLS] vector.
In an optional embodiment of the invention, after the [CLS] feature vector (i.e., the masked word vector) is obtained, it is fed into the two MLP fully connected networks, and the probability of a spelling error is calculated by a sigmoid activation function: the sigmoid output is a decimal between 0 and 1 that represents the error probability. A threshold of 0.72 may be used; if the probability is greater than 0.72, the sentence is deemed to contain a spelling error, otherwise it is not. Meanwhile, a cross-entropy loss function is used to calculate the loss value.
In an alternative embodiment of the present invention, the calculation of the cross entropy loss function is a conventional technique, and is not described herein again.
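The decision rule and loss just described can be sketched as follows; the 0.72 threshold comes from the text, while the example probabilities are illustrative only.

```python
import math

THRESHOLD = 0.72  # error threshold from the text

def has_spelling_error(p: float) -> bool:
    # Sigmoid output above the threshold => the sentence contains an error.
    return p > THRESHOLD

def binary_cross_entropy(p: float, y: int, eps: float = 1e-12) -> float:
    # y = 1 if the sentence truly contains an error, else 0; eps clamps
    # p away from 0 and 1 to keep the logarithms finite.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

print(has_spelling_error(0.9))                 # True: 0.9 > 0.72
print(round(binary_cross_entropy(0.9, 1), 4))  # 0.1054: confident and correct
print(round(binary_cross_entropy(0.9, 0), 4))  # 2.3026: confident but wrong
```

The loss is small when a confident prediction matches the label and large when it does not, which is what drives the parameter updates during the iterative training above.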
And step four, acquiring a text to be corrected, and correcting the text to be corrected by using the standard error correction model to obtain the corrected text.
In the embodiment of the invention, the standard error correction model can be used for carrying out text error identification and text error correction on the text to be corrected.
Further, the performing error correction on the text to be corrected by using the standard error correction model to obtain an error-corrected text includes:
identifying the error probability of the text to be corrected by using a text error identification model in the standard error correction model;
if the error probability is smaller than a preset error threshold value, the text to be corrected is not processed;
and if the error probability is larger than or equal to the error threshold, performing text error correction on the text to be corrected by using a text error correction model in the standard error correction model to obtain an error-corrected text.
In an alternative embodiment of the present invention, the error threshold may be 0.72. Meanwhile, only sentences identified as containing errors are corrected, so correct sentences are never altered, which improves the efficiency of text error correction.
And step five, constructing an error correction pair based on the corrected text, and identifying the error correction type of the error correction pair by using a preset classification model to obtain the error correction type.
In the embodiment of the present invention, the preset classification model may also be a BERT model.
In detail, the constructing an error correction pair based on the corrected text, and performing error correction type identification on the error correction pair by using a preset classification model to obtain an error correction type includes:
performing word segmentation processing on the corrected text and the text to be corrected corresponding to the corrected text;
and extracting words related to error correction after word segmentation to form an error correction pair, and outputting the error correction type of the error correction pair by using the classification model.
In an optional embodiment of the invention, the corrected text needs a second verification. The original sentence and the sentence corrected by the two-stage BERT are both segmented into words, the two segmented sentences are compared at the corrected positions (the positions where they differ), and the words covering those positions are extracted, so that the specific part of the sentence that was corrected is known. The extracted words are then input into the BERT model (i.e., the classification model), which outputs a vector via its [CLS] token. For example, given a modification (original-sentence word, error-correction word), the BERT input is: [CLS] original-sentence word [SEP] error-correction word [SEP]. The [CLS] vector output by the BERT model represents the relation feature vector of the original-sentence word and the error-correction word; inputting this [CLS] vector into a softmax for classification yields the error correction type of the error correction pair.
And step six, carrying out error correction verification on the error correction pair by utilizing an edit distance cost method based on the error correction type to obtain an error correction verification result.
In the embodiment of the present invention, the error correction types include a first error correction type and a second error correction type. The first error correction type includes homophone word correction (e.g., a word meaning 'eye' corrected to the identically pronounced 'glasses') and confusable-sound word correction (e.g., 'wandering weaver girl' corrected to 'cowherd and weaver girl'). The second error correction type includes shape-alike character correction (e.g., 'sorghum' written with a visually similar but wrong character).
In detail, the performing error correction verification on the error correction pair by using an edit distance cost method based on the error correction type to obtain an error correction verification result includes:
if the error correction type is the first error correction type, calculating the editing cost of the error correction pair by using the character-adjustment edit distance cost method;
if the error correction type is the second error correction type, calculating the editing cost of the error correction pair by using the keyboard-level edit distance cost method;
and determining that the error correction verification result of an error correction pair whose editing cost is less than or equal to a preset cost threshold is error correction success, and determining that the error correction verification result of an error correction pair whose editing cost is greater than the preset cost threshold is error correction failure.
In the embodiment of the invention, for the first error correction type, the original-sentence words and the error correction words are converted into pinyin, and the edit distance is calculated on the pinyin; character adjustment includes character addition, character deletion, character replacement, and the like. The character-adjustment edit distance cost method and the keyboard-level edit distance cost method are well known in the art and are not described herein again.
In an optional embodiment of the present invention, the preset cost threshold may be 1.
In the embodiment of the invention, two edit distance verification methods are used to reconfirm the corrected part, which effectively improves the error correction capability and reduces the probability of correcting one error into another. Sound errors and shape errors are also treated differently, which improves error correction accuracy. Because edit distance computation is expensive, if only the traditional edit distance method were used, the input would be whole sentences rather than words, which would greatly increase the computational cost.
According to the apparatus, labeling the template text according to the correctness of the text in the original error correction training text set yields a standard error correction training text set with richer information. The standard error correction training text set is used for joint training of a two-stage error correction model comprising a text error recognition model and a text error correction model, and the standard error correction model obtained by training is used to recognize and correct the text to be corrected, which improves the accuracy and efficiency of text error correction. Meanwhile, an error correction pair is constructed based on the corrected text, and error correction verification is performed on the error correction pair by using an edit distance cost method. Therefore, the Chinese text error correction and verification apparatus provided by the invention can solve the problem of low efficiency of Chinese text error correction.
Fig. 7 is a schematic structural diagram of an electronic device for implementing the method for Chinese text error correction and verification according to an embodiment of the present invention.
The electronic device may include a processor 10, a memory 11, a communication interface 12, and a bus 13, and may further include a computer program, such as a Chinese text error correction and verification program, stored in the memory 11 and operable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a removable hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various types of data, such as the code of the Chinese text error correction and verification program, but also to temporarily store data that has been output or will be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same function or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 10 is the control unit of the electronic device; it connects the various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device by running or executing programs or modules (e.g., the Chinese text error correction and verification program) stored in the memory 11 and calling data stored in the memory 11.
The communication interface 12 is used for communication between the electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), which are typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
The bus 13 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 13 may be divided into an address bus, a data bus, a control bus, etc. The bus 13 is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 7 shows only an electronic device having components, and those skilled in the art will appreciate that the structure shown in fig. 7 does not constitute a limitation of the electronic device, and may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management and the like are realized through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used to establish a communication connection between the electronic device and other electronic devices.
Optionally, the electronic device may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The Chinese text error correction and verification program stored in the memory 11 of the electronic device is a combination of instructions that, when executed in the processor 10, can implement:
acquiring an original error correction training text set, and labeling a template text according to the correctness of the text in the original error correction training text set to obtain a standard error correction training text set;
constructing a two-stage error correction model comprising a text error recognition model and a text error correction model;
performing joint training on the text error recognition model and the text error correction model by using the standard error correction training text set to obtain a standard error correction model;
acquiring a text to be corrected, and correcting the text to be corrected by using the standard error correction model to obtain an error-corrected text;
constructing an error correction pair based on the corrected text, and identifying the error correction type of the error correction pair by using a preset classification model to obtain the error correction type;
and performing error correction verification on the error correction pair by using an edit distance cost method based on the error correction type to obtain an error correction verification result.
Specifically, the specific implementation method of the processor 10 for the instruction may refer to the description of the relevant steps in the embodiment corresponding to the drawing, and is not repeated here.
Further, the integrated module/unit of the electronic device, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:
acquiring an original error correction training text set, and labeling a template text according to the correctness of the text in the original error correction training text set to obtain a standard error correction training text set;
constructing a two-stage error correction model comprising a text error recognition model and a text error correction model;
performing joint training on the text error recognition model and the text error correction model by using the standard error correction training text set to obtain a standard error correction model;
acquiring a text to be corrected, and correcting the text to be corrected by using the standard error correction model to obtain an error-corrected text;
constructing an error correction pair based on the corrected text, and identifying the error correction type of the error correction pair by using a preset classification model to obtain the error correction type;
and performing error correction verification on the error correction pair by using an edit distance cost method based on the error correction type to obtain an error correction verification result.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The embodiments of the present invention may acquire and process related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains the information of a batch of network transactions, used to verify the validity (tamper resistance) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means through software or hardware. The terms first, second, and the like are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
Claims (10)
1. A method for correcting and verifying Chinese text is characterized by comprising the following steps:
acquiring an original error correction training text set, and labeling a template text according to the correctness of the text in the original error correction training text set to obtain a standard error correction training text set;
constructing a two-stage error correction model comprising a text error recognition model and a text error correction model;
performing joint training on the text error recognition model and the text error correction model by using the standard error correction training text set to obtain a standard error correction model;
acquiring a text to be corrected, and correcting the text to be corrected by using the standard error correction model to obtain an error-corrected text;
constructing an error correction pair based on the corrected text, and identifying the error correction type of the error correction pair by using a preset classification model to obtain the error correction type;
and performing error correction verification on the error correction pair by using an edit distance cost method based on the error correction type to obtain an error correction verification result.
2. The method for Chinese text error correction and verification of claim 1, wherein said constructing a two-stage error correction model comprising a text error recognition model and a text error correction model comprises:
acquiring a first BERT model, and appending a fully connected layer and an output layer after the first BERT model to obtain the text error recognition model;
and acquiring a second BERT model as the text error correction model, and connecting the text error correction model in series after the text error recognition model to obtain the two-stage error correction model.
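The cascade of claim 2 can be illustrated structurally. The encoder, dense layer, and output layer below are toy stand-ins for the two BERT models (a real build would use pretrained transformer weights); the point is only how the corrector is gated behind the recognizer:

```python
# Structural sketch of claim 2; every component is a stand-in, not a real BERT model.
def stub_bert_encoder(text):
    return [float(ord(ch) % 7) for ch in text]   # stand-in "word vectors"

def fully_connected(vec):
    return sum(vec)                              # stand-in dense layer

def output_layer(score):
    # Sigmoid-style squashing to an error probability in [0, 1].
    return 1.0 / (1.0 + 2.718281828 ** -score)

def recognizer(text):
    # First stage: BERT encoder + fully connected layer + output layer.
    return output_layer(fully_connected(stub_bert_encoder(text)))

def corrector(text):
    return text                                  # stand-in second-stage BERT corrector

def two_stage(text, threshold=0.5):
    # Series connection: the corrector runs only on sentences the recognizer flags.
    return corrector(text) if recognizer(text) >= threshold else text
```

With PyTorch the first stage would resemble a BERT encoder followed by `nn.Linear` and a sigmoid output, but that mapping is an assumption; the patent specifies only the layer order.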
3. The method for correcting and verifying Chinese text as recited in claim 2, wherein the jointly training the text error recognition model and the text error correction model using the standard error correction training text set to obtain a standard error correction model comprises:
performing iterative training on the text error recognition model by using the standard error correction training text set;
outputting a standard word vector corresponding to a sentence in the standard error correction training text set by using the trained text error recognition model;
copying and combining the standard word vectors, and performing attention training on the text error correction model based on the copied combined word vectors and a preset loss function;
and summarizing the trained text error recognition model and the trained text error correction model to obtain the two-stage error correction model.
4. The method of claim 3, wherein the iteratively training the text error recognition model using the standard error correction training text set comprises:
converting sentences in the standard error correction training text set into word vectors by using the first BERT model, and masking preset positions in the word vectors to obtain masked word vectors;
extracting a standard word vector from the masked word vectors by using the fully connected layer, and outputting a predicted value of the standard word vector by using the output layer;
calculating a loss value based on the predicted value; if the loss value is greater than or equal to a preset loss threshold, updating the parameters in the first BERT model and returning to the step of converting sentences in the standard error correction training text set into word vectors by using the first BERT model; and stopping training when the loss value is less than the preset loss threshold, to obtain the trained text error recognition model.
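The threshold-stopped training loop of claim 4 has the following control flow. The single-parameter model, quadratic loss, and learning rate are stand-ins chosen so the loop terminates, not the disclosed masked-language-model training:

```python
# Control-flow sketch of claim 4's iterative training: update parameters while the
# loss stays at or above the threshold, stop as soon as it falls below it.
def train_until_threshold(loss_threshold=0.01, max_steps=1000):
    param = 5.0                        # stand-in for the first BERT model's parameters
    loss = param * param
    for step in range(max_steps):
        loss = param * param           # stand-in loss computed from the "prediction"
        if loss < loss_threshold:      # stopping condition from the claim
            return param, loss, step
        param -= 0.2 * (2 * param)     # gradient step: d(loss)/d(param) = 2 * param
    return param, loss, max_steps
```

In the disclosed method the loss would instead be a masked-token prediction loss over the first BERT model's outputs, but the loop structure is the same.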
5. The method for correcting and verifying Chinese text according to claim 1, wherein the correcting the text to be corrected by using the standard error correction model to obtain the corrected text comprises:
identifying the error probability of the text to be corrected by using a text error identification model in the standard error correction model;
if the error probability is smaller than a preset error threshold value, the text to be corrected is not processed;
and if the error probability is larger than or equal to the error threshold, performing text error correction on the text to be corrected by using a text error correction model in the standard error correction model to obtain an error-corrected text.
6. The method for Chinese text error correction and verification according to claim 1, wherein the constructing an error correction pair based on the corrected text, and performing error correction type recognition on the error correction pair by using a preset classification model to obtain an error correction type comprises:
performing word segmentation processing on the corrected text and the text to be corrected corresponding to the corrected text;
and extracting, from the word segmentation results, the words involved in the error correction to form an error correction pair, and outputting the error correction type of the error correction pair by using the classification model.
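The pair construction of claim 6 can be sketched by aligning the pre- and post-correction texts and keeping only the spans that differ. Word segmentation is simplified to character level here (a real system would use a Chinese tokenizer), and the classification model is omitted:

```python
import difflib

# Sketch of claim 6: align the text to be corrected with the corrected text and
# extract the differing spans as error correction pairs.
def build_correction_pairs(original, corrected):
    matcher = difflib.SequenceMatcher(a=original, b=corrected)
    pairs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":                        # only the changed spans matter
            pairs.append((original[i1:i2], corrected[j1:j2]))
    return pairs
```

Each resulting pair, e.g. `("陆", "录")`, would then be fed to the preset classification model to obtain its error correction type.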
7. The method for correcting and verifying Chinese text according to any one of claims 1-6, wherein the performing error correction verification on the error correction pair by using an edit distance cost method based on the error correction type to obtain an error correction verification result comprises:
if the error correction type is a first error correction type, calculating the edit cost of the error correction pair by using a character-adjustment edit distance cost method;
if the error correction type is a second error correction type, calculating the edit cost of the error correction pair by using a keyboard-level edit distance cost method;
and determining that the error correction verification result of an error correction pair whose edit cost is less than or equal to a preset cost threshold is error correction success, and that the error correction verification result of an error correction pair whose edit cost is greater than the preset cost threshold is error correction failure.
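The cost-weighted verification of claim 7 can be sketched with a standard dynamic-programming edit distance whose substitution costs vary with the error correction type. The cost table below is a hypothetical example of keyboard-level costs (adjacent keys are cheaper to confuse), not the patent's actual table:

```python
# Sketch of claim 7: edit distance with per-substitution costs, plus the
# threshold-based verification decision.
def weighted_edit_distance(a, b, sub_cost=None):
    sub_cost = sub_cost or {}
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub_cost.get((a[i - 1], b[j - 1]), 1.0)
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution at a type-specific cost
    return d[m][n]

# Hypothetical keyboard-level cost: 'q' and 'w' are adjacent keys, so cheaper.
KEYBOARD_COST = {("q", "w"): 0.5}

def verify_correction(pair, cost_threshold, costs=None):
    # Verification succeeds when the edit cost is at or below the preset threshold.
    return weighted_edit_distance(pair[0], pair[1], costs) <= cost_threshold
```

Passing `KEYBOARD_COST` for the second error correction type and the default unit costs for the first reproduces the two branches of claim 7 under these assumptions.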
8. A Chinese text correction and verification device, the device comprising:
the error correction model training module is used for acquiring an original error correction training text set, labeling a template text according to the correctness of the text in the original error correction training text set to obtain a standard error correction training text set, constructing a two-stage error correction model comprising a text error recognition model and a text error correction model, and performing joint training on the text error recognition model and the text error correction model by using the standard error correction training text set to obtain a standard error correction model;
the text error correction module is used for acquiring a text to be corrected, and correcting the text to be corrected by using the standard error correction model to obtain an error-corrected text;
the error correction type identification module is used for constructing an error correction pair based on the error corrected text and identifying the error correction type of the error correction pair by utilizing a preset classification model to obtain the error correction type;
and the error correction verification module is used for performing error correction verification on the error correction pair by utilizing an edit distance cost method based on the error correction type to obtain an error correction verification result.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the Chinese text error correction and verification method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the Chinese text error correction and verification method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210824618.4A CN115169330B (en) | 2022-07-13 | 2022-07-13 | Chinese text error correction and verification method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115169330A true CN115169330A (en) | 2022-10-11 |
CN115169330B CN115169330B (en) | 2023-05-02 |
Family
ID=83493177
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210824618.4A Active CN115169330B (en) | 2022-07-13 | 2022-07-13 | Chinese text error correction and verification method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115169330B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012039686A1 (en) * | 2010-09-24 | 2012-03-29 | National University Of Singapore | Methods and systems for automated text correction |
CN111626047A (en) * | 2020-04-23 | 2020-09-04 | 平安科技(深圳)有限公司 | Intelligent text error correction method and device, electronic equipment and readable storage medium |
CN112016310A (en) * | 2020-09-03 | 2020-12-01 | 平安科技(深圳)有限公司 | Text error correction method, system, device and readable storage medium |
CN112329476A (en) * | 2020-11-11 | 2021-02-05 | 北京京东尚科信息技术有限公司 | Text error correction method and device, equipment and storage medium |
CN112836496A (en) * | 2021-01-25 | 2021-05-25 | 之江实验室 | Text error correction method based on BERT and feedforward neural network |
CN113177405A (en) * | 2021-05-28 | 2021-07-27 | 中国平安人寿保险股份有限公司 | Method, device and equipment for correcting data errors based on BERT and storage medium |
US20210271810A1 (en) * | 2020-03-02 | 2021-09-02 | Grammarly Inc. | Proficiency and native language-adapted grammatical error correction |
CN113807973A (en) * | 2021-09-16 | 2021-12-17 | 平安科技(深圳)有限公司 | Text error correction method and device, electronic equipment and computer readable storage medium |
CN113887200A (en) * | 2021-09-29 | 2022-01-04 | 平安银行股份有限公司 | Text variable-length error correction method and device, electronic equipment and storage medium |
CN114154486A (en) * | 2021-11-09 | 2022-03-08 | 浙江大学 | Intelligent error correction system for Chinese corpus spelling errors |
CN114417834A (en) * | 2021-12-24 | 2022-04-29 | 深圳云天励飞技术股份有限公司 | Text processing method and device, electronic equipment and readable storage medium |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118520870A (en) * | 2024-07-24 | 2024-08-20 | 北京匠数科技有限公司 | Text error correction method, device, computer equipment and storage medium |
CN118520870B (en) * | 2024-07-24 | 2024-09-27 | 北京匠数科技有限公司 | Text error correction method, device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN115169330B (en) | 2023-05-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110795938B (en) | Text sequence word segmentation method, device and storage medium | |
CN113704429A (en) | Semi-supervised learning-based intention identification method, device, equipment and medium | |
CN114822812A (en) | Character dialogue simulation method, device, equipment and storage medium | |
CN112380343A (en) | Problem analysis method, problem analysis device, electronic device and storage medium | |
CN113360654B (en) | Text classification method, apparatus, electronic device and readable storage medium | |
CN114511038A (en) | False news detection method and device, electronic equipment and readable storage medium | |
CN113378970A (en) | Sentence similarity detection method and device, electronic equipment and storage medium | |
CN112528013A (en) | Text abstract extraction method and device, electronic equipment and storage medium | |
CN113807973A (en) | Text error correction method and device, electronic equipment and computer readable storage medium | |
CN113704393A (en) | Keyword extraction method, device, equipment and medium | |
CN112528633A (en) | Text error correction method and device, electronic equipment and computer readable storage medium | |
CN115238115A (en) | Image retrieval method, device and equipment based on Chinese data and storage medium | |
CN114610855A (en) | Dialog reply generation method and device, electronic equipment and storage medium | |
CN115169330B (en) | Chinese text error correction and verification method, device, equipment and storage medium | |
CN113658002A (en) | Decision tree-based transaction result generation method and device, electronic equipment and medium | |
CN114138243A (en) | Function calling method, device, equipment and storage medium based on development platform | |
CN116468025A (en) | Electronic medical record structuring method and device, electronic equipment and storage medium | |
CN116702761A (en) | Text error correction method, device, equipment and storage medium | |
CN115346095A (en) | Visual question answering method, device, equipment and storage medium | |
CN114757154A (en) | Job generation method, device and equipment based on deep learning and storage medium | |
CN112346737B (en) | Method, device and equipment for training programming language translation model and storage medium | |
CN115221274A (en) | Text emotion classification method and device, electronic equipment and storage medium | |
CN115116069A (en) | Text processing method and device, electronic equipment and storage medium | |
CN114372467A (en) | Named entity extraction method and device, electronic equipment and storage medium | |
CN113434650A (en) | Question and answer pair expansion method and device, electronic equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||