CN112287670A - Text error correction method, system, computer device and readable storage medium - Google Patents
- Publication number: CN112287670A; Application number: CN202011293207.4A
- Authority
- CN
- China
- Prior art keywords
- training
- text
- error correction
- masked
- soft
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The application relates to a text error correction method, system, computer device, and computer-readable storage medium. The method comprises the following steps: a data acquisition step for acquiring text data to be corrected; a negative sample construction step for creating a confusion word table and performing corpus replacement on the text to be corrected according to the confusion word table to generate negative samples; a text error correction step, in which the text data to be corrected and the negative sample data are used as training data, the Chinese character features and the pinyin features of the training data are trained separately through a Soft-Masked BERT pre-training model, the two outputs are concatenated into a training result, and the cross-entropy loss of the training result is calculated through a Softmax layer to obtain the error correction result; and a model optimization step for optimizing the Soft-Masked BERT pre-training model through recursive prediction and vocabulary filtering. By the method and the device, text error correction accuracy is effectively improved, as are the effect and performance of the model.
Description
Technical Field
The present application relates to the field of natural language processing, and more particularly, to text error correction methods, systems, computer devices, and computer readable storage media.
Background
Chinese error correction is an important technology for automatically checking and correcting Chinese sentences; its goal is to improve language correctness and reduce the cost of manual proofreading. Chinese character error correction mainly corrects errors according to the similarity of character shapes. It is a natural language processing technique that detects whether a passage contains wrongly written characters and corrects them. It is generally used in the text preprocessing stage and can significantly alleviate inaccurate information acquisition in scenarios such as intelligent customer service; for example, wrongly written characters in intelligent question-answering scenarios degrade query understanding and dialogue quality. In the general domain, Chinese text correction is a problem that people have sought to solve since the early days of the Internet. In a search engine, a good error correction system can offer correction prompts for a user's query terms or directly display the correct answer.
In the prior art, text correction is realized with the following tools: (1) a dictionary of wrongly written characters is constructed; (2) edit distance, which adopts a method similar to fuzzy string matching and, by comparison against correct samples, can correct some common wrongly written characters and grammatical errors; (3) language models, which may operate at character or word error correction granularity. In recent years, pre-trained language models have become popular, and researchers have migrated BERT (Bidirectional Encoder Representations from Transformers) models into text correction, using BERT to select, at each position of a sentence, a character from a candidate list for correction.
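The edit-distance approach in item (2) measures how many single-character edits separate an input from a correct sample. A minimal sketch of the standard Levenshtein dynamic program, purely illustrative and not taken from the patent:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming with a rolling row."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))          # distances from a[:0] to each prefix of b
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                        # deletion
                        dp[j - 1] + 1,                    # insertion
                        prev + (a[i - 1] != b[j - 1]))    # substitution
            prev = cur
    return dp[n]
```

A correction candidate from the correct-sample list would then be the one minimizing this distance to the input.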
Among these tools, constructing a dictionary of wrongly written characters is labor-intensive and only suits a few vertical domains with a limited set of misspellings; the edit-distance approach lacks generality; in language models, because semantic information at character granularity is relatively weak, the misjudgment rate is higher than for correction at word granularity, while word granularity depends on the accuracy of the word segmentation model. To reduce the misjudgment rate, a CRF layer is often added to the output layer of the model for proofreading, learning transition probabilities and a globally optimal path to avoid outputting unreasonable corrections. The BERT approach is too coarse and easily produces a high false-positive rate: BERT's mask positions are chosen randomly, so it is not good at detecting the erroneous positions in a sentence, and BERT-based correction imposes no constraint conditions, so its accuracy is low.
Disclosure of Invention
The embodiments of the present application provide a text error correction method, system, computer device, and computer-readable storage medium that combine the Chinese character features and pinyin features of a text, train with a Soft-Masked BERT pre-training model, and optimize the model through recursive prediction combined with vocabulary filtering, thereby improving error correction accuracy and the effect and performance of the model.
In a first aspect, an embodiment of the present application provides a text error correction method, including:
a data acquisition step, which is used for acquiring text data to be corrected;
a negative sample construction step, configured to create a confusion word table, perform corpus replacement on the text to be corrected according to the confusion word table, and generate negative samples. Specifically, 15% of the characters in the text to be corrected are randomly replaced; of those replaced characters, 80% are replaced with homophonic characters from the confusion word table and the remaining 20% with random characters. Sample data constructed in this way is then used to train the model, so that the trained Soft-Masked BERT pre-training model acquires a stronger ability to correct homophone confusion errors.
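The replacement scheme above (15% of characters replaced; of those, 80% with a homophone from the confusion word table, 20% with a random character) can be sketched as follows. The function name, the confusion-table format, and the character pool are illustrative assumptions, not details from the patent:

```python
import random

def build_negative_sample(text, confusion_table, charset, seed=None):
    """Corrupt ~15% of characters: 80% homophone swaps, 20% random characters."""
    rng = random.Random(seed)
    chars = list(text)
    for i, ch in enumerate(chars):
        if rng.random() >= 0.15:                 # leave ~85% of positions untouched
            continue
        if ch in confusion_table and rng.random() < 0.80:
            chars[i] = rng.choice(confusion_table[ch])   # homophone from confusion table
        else:
            chars[i] = rng.choice(charset)               # random replacement
    return "".join(chars)
```

Seeding the generator makes corpus construction reproducible across training runs.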
a text error correction step, in which the text data to be corrected and the negative sample data are used as training data, the Chinese character features and the pinyin features of the training data are each trained through a Soft-Masked BERT pre-training model, the two outputs are concatenated into a training result, and the cross-entropy loss of the training result is calculated through a Softmax layer to obtain the error correction result. Here, the Soft-Masked BERT pre-training model is a recently proposed neural network structure comprising a detection network and a BERT-based correction network, and the Softmax layer is a machine-learning function that maps inputs to real numbers between 0 and 1 whose normalized sum is 1.
In some embodiments, the text error correction method further includes a model optimization step, which is used for optimizing the Soft-Masked BERT pre-training model through recursive prediction and vocabulary filtering, so as to add constraints to model training and further improve error correction accuracy and model performance.
Through the above steps, a Soft-Masked BERT pre-training model is utilized; specifically, the detection network in the Soft-Masked BERT pre-training model calculates the probability of each position being erroneous and uses it as a character feature, improving error correction accuracy. The pinyin features of the data are used: an erroneous character usually has pinyin similar to that of the correct character, so pinyin features effectively narrow the search range for the correct word. Recursive prediction reduces the misjudgment rate, and vocabulary filtering improves the effect and performance of the model.
In some of these embodiments, the text correction step further comprises:
a word vector obtaining step, which is used for respectively training the Chinese character characteristics and the pinyin characteristics through a Soft-Masked BERT pre-training model to obtain word vectors of Chinese characters and word vectors of pinyin;
and a cross entropy loss obtaining step, which is used for splicing the word vector of the Chinese character and the word vector of the pinyin, calculating the sum of cross entropy losses of all positions in the training data through a Softmax layer and outputting an error correction result.
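One way to read the splice-then-Softmax computation above: per position, the Chinese-character vector and pinyin vector are concatenated, projected to vocabulary logits, and the cross-entropy losses of all positions are summed. A minimal numpy sketch under assumed dimensions (the projection matrix and sizes are illustrative, not from the patent):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def total_cross_entropy(hanzi_vecs, pinyin_vecs, proj, targets):
    """hanzi_vecs, pinyin_vecs: (seq_len, d); proj: (2d, vocab); targets: (seq_len,) ids."""
    features = np.concatenate([hanzi_vecs, pinyin_vecs], axis=-1)  # splice per position
    probs = softmax(features @ proj)                               # (seq_len, vocab)
    # sum of cross-entropy losses over all positions in the sequence
    return -np.log(probs[np.arange(len(targets)), targets] + 1e-12).sum()
```

In a real training loop this scalar would be backpropagated; here it only illustrates the loss structure.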
In some of these embodiments, the model optimization step further comprises:
a recursive prediction step, configured to feed the corrected sentence obtained in the text error correction step back into the Soft-Masked BERT pre-training model for recursive error correction. Specifically, the recursive prediction step comprises at least one recursion, and the recursion stops when two consecutive predictions produce the same result;
a vocabulary filtering step, configured to filter the vocabulary in the Soft-Masked BERT pre-training model so that the number of candidate words searched during error correction is at most 1000. The degree of vocabulary filtering is adjustable; preferably, function words in the vocabulary such as articles, prepositions, adverbs, and conjunctions are filtered out.
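The stopping rule (re-feed the corrected sentence until two consecutive passes agree) and the ≤1000-word candidate cap can be sketched as below; `correct_once` stands in for one forward pass of the model, and both helper names are hypothetical:

```python
def recursive_correct(sentence, correct_once, max_rounds=10):
    """Re-run correction until two consecutive passes produce the same sentence."""
    prev = sentence
    for _ in range(max_rounds):
        cur = correct_once(prev)
        if cur == prev:        # two consecutive predictions agree -> stop recursion
            return cur
        prev = cur
    return prev                # safety cap in case corrections oscillate

def filter_vocab(vocab, stopwords, limit=1000):
    """Drop function words (articles, prepositions, adverbs, conjunctions),
    then cap the candidate search list at `limit` entries."""
    kept = [w for w in vocab if w not in stopwords]
    return kept[:limit]
```

The `max_rounds` guard is an added safeguard not described in the patent, which only specifies stopping when two predictions match.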
In this embodiment, the Soft-Masked BERT pre-training model corrects one position at a time; that is, if a sentence has several erroneous positions, a single forward pass may correct only one of them, so recursive prediction improves the error correction accuracy of the whole sentence. The vocabulary filtering step significantly improves the effect and performance of the Soft-Masked BERT pre-training model.
In a second aspect, an embodiment of the present application provides a text correction system, including:
the data acquisition module is used for acquiring text data to be corrected;
the negative sample construction module, configured to create a confusion word table, perform corpus replacement on the text to be corrected according to the confusion word table, and generate negative samples. Specifically, 15% of the characters in the text to be corrected are randomly replaced; of those replaced characters, 80% are replaced with homophonic characters from the confusion word table and the remaining 20% with random characters. A data set constructed in this way is then used to train the model, so that the trained Soft-Masked BERT pre-training model acquires a stronger ability to correct homophone confusion errors.
the text error correction module, configured to use the text data to be corrected and the negative sample data as training data, train the Chinese character features and the pinyin features of the training data separately through a Soft-Masked BERT pre-training model, concatenate the two outputs into a training result, and calculate the cross-entropy loss of the training result through a Softmax layer to obtain the error correction result.
In some embodiments, the text error correction system further includes a model optimization module, which is used for optimizing the Soft-Masked BERT pre-training model through recursive prediction and vocabulary filtering, so as to further improve the error correction accuracy and the model performance.
Through the above modules, a Soft-Masked BERT pre-training model is utilized; specifically, the detection network in the Soft-Masked BERT pre-training model calculates the probability of each position being erroneous and uses it as a character feature, improving error correction accuracy. The pinyin features of the data are used: an erroneous character usually has pinyin similar to that of the correct character, so pinyin features effectively narrow the search range for the correct word. Recursive prediction reduces the misjudgment rate, and vocabulary filtering improves the effect and performance of the model.
In some embodiments, the text correction module further comprises:
the word vector acquisition module is used for respectively carrying out Soft-Masked BERT pre-training model training on the Chinese character characteristics and the pinyin characteristics to obtain word vectors of Chinese characters and word vectors of pinyin;
and the cross entropy loss acquisition module is used for splicing the word vector of the Chinese character and the word vector of the pinyin, calculating the sum of cross entropy losses of all positions in the training data through a Softmax layer and outputting an error correction result.
In some of these embodiments, the model optimization module further comprises:
the recursive prediction module, configured to feed the corrected sentence obtained by the text error correction module back into the Soft-Masked BERT pre-training model for recursive error correction; specifically, the recursion stops when two consecutive predictions produce the same result;
the vocabulary filtering module, configured to filter the vocabulary in the Soft-Masked BERT pre-training model so that the number of candidate words searched during error correction is at most 1000. The degree of vocabulary filtering is adjustable; preferably, function words in the vocabulary such as articles, prepositions, adverbs, and conjunctions are filtered out.
In this embodiment, the Soft-Masked BERT pre-training model corrects one position at a time; that is, if a sentence has several erroneous positions, a single forward pass may correct only one of them, so recursive prediction improves the error correction accuracy of the whole sentence. The vocabulary filtering module significantly improves the effect and performance of the Soft-Masked BERT pre-training model.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the text error correction method according to the first aspect is implemented.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the text error correction method according to the first aspect.
Compared with the prior art, the text error correction method, the text error correction system, the computer equipment and the computer readable storage medium provided by the embodiment of the application improve the error correction effect by using the Soft-Masked BERT pre-training model compared with the original BERT model; in addition, the effect and the performance of the model are obviously improved and the correction accuracy is improved through recursive error correction and vocabulary filtering.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart diagram of a text error correction method according to an embodiment of the present application;
FIG. 2 is a block diagram illustrating the structure of a text correction system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a text correction step according to a preferred embodiment of the present application;
FIG. 4 is a schematic diagram of the principle of the Soft-Masked BERT pre-training model according to the preferred embodiment of the present application.
Description of the drawings:
10. a data acquisition module; 20. a negative sample construction module; 30. a text error correction module;
40. a model optimization module;
301. a word vector acquisition module; 302. a cross entropy loss acquisition module;
401. a recursive prediction module; 402. and a word list filtering module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The embodiment provides a text error correction method. Fig. 1 is a schematic flow chart of a text error correction method according to an embodiment of the present application, and referring to fig. 1, the flow chart includes the following steps:
a data obtaining step S10, configured to obtain text data to be corrected;
a negative sample construction step S20, configured to create a confusion word table and perform corpus replacement on the text to be corrected according to the confusion word table to generate negative samples. Specifically, 15% of the characters in the text to be corrected are randomly replaced; of those replaced characters, 80% are replaced with homophonic characters from the confusion word table and the remaining 20% with random characters. Sample data constructed in this way is then used to train the model, so that the trained Soft-Masked BERT pre-training model acquires a stronger ability to correct homophone confusion errors.
a text error correction step S30, in which the text data to be corrected and the negative sample data are used as training data, the Chinese character features and the pinyin features of the training data are each trained through a Soft-Masked BERT pre-training model and then concatenated into a training result, and the cross-entropy loss of the training result is calculated through a Softmax layer to obtain the error correction result.
And a model optimization step S40, which is used for optimizing the Soft-Masked BERT pre-training model through recursive prediction and vocabulary filtering, adding constraints for model training, and further improving the error correction accuracy and the performance of the model.
Through the above steps, a Soft-Masked BERT pre-training model is utilized; specifically, the detection network in the Soft-Masked BERT pre-training model calculates the probability of each position being erroneous and uses it as a character feature, improving error correction accuracy. The pinyin features of the data are used: an erroneous character usually has pinyin similar to that of the correct character, so pinyin features effectively narrow the search range for the correct word. Recursive prediction reduces the misjudgment rate, and vocabulary filtering improves the effect and performance of the model.
In some of these embodiments, the text correction step S30 further includes:
a word vector obtaining step S301, for respectively performing Soft-Masked BERT pre-training model training on the Chinese character features and the pinyin features to obtain word vectors of Chinese characters and word vectors of pinyin;
and a cross entropy loss obtaining step S302, which is used for splicing the word vector of the Chinese character and the word vector of the pinyin, calculating the sum of cross entropy losses of all positions in the training data through a Softmax layer, and outputting an error correction result.
In some of these embodiments, the model optimization step S40 further includes:
a recursive prediction step S401, configured to feed the sentence corrected in step S30 back into the Soft-Masked BERT pre-training model for recursive error correction; specifically, the recursion stops when two consecutive predictions produce the same result;
a vocabulary filtering step S402, configured to filter the vocabulary in the Soft-Masked BERT pre-training model so that the number of candidate words searched during error correction is at most 1000. The degree of vocabulary filtering is adjustable; preferably, function words in the vocabulary such as articles, prepositions, adverbs, and conjunctions are filtered out.
This embodiment takes into account that the Soft-Masked BERT pre-training model corrects one position at a time; that is, if a sentence has several erroneous positions, a single forward pass may correct only one of them, so recursive prediction improves the error correction accuracy of the whole sentence. The vocabulary filtering step significantly improves the effect and performance of the Soft-Masked BERT pre-training model.
The embodiments of the present application are described and illustrated below by means of preferred embodiments.
Fig. 3 is a schematic diagram of the principle of the text error correction step according to a preferred embodiment of the present application, and fig. 4 is a schematic diagram of the Soft-Masked BERT pre-training model according to the preferred embodiment. With reference to fig. 1, fig. 3, and fig. 4: in this embodiment, the text data to be corrected and the negative sample data obtained through steps S10 and S20 are used as training data. The Chinese character features x1, x2, x3, x4 of the training data are input to the Soft-Masked BERT pre-training model (Chinese character Soft-Masked BERT) to obtain word vectors of the Chinese characters, and the pinyin features x1', x2', x3', x4' are likewise trained through the Soft-Masked BERT pre-training model (pinyin Soft-Masked BERT) to obtain word vectors of the pinyin; the two sets of word vectors are then concatenated and the cross-entropy loss is calculated through a Softmax layer.
The Soft-Masked BERT pre-training model shown in fig. 4 specifically comprises a Detection Network, which predicts the probability that each character is wrongly written, and a Correction Network, which predicts the probability of the correction. Specifically, the detection network consists of a bidirectional GRU network (Bi-GRU for short); it fully learns the input context information and outputs, for each position i, the probability pi that the character at that position is wrongly written — the larger pi in the figure, the more likely that position is erroneous, where i is a natural number. Between the detection network and the correction network sits the Soft Masking part: the feature of each position is formed by weighting the feature of the masking character by pi, weighting the original input word vector feature by (1 - pi), and adding the two parts. This process can be expressed as:
ei' = pi · emask + (1 - pi) · ei,
where ei' is the output feature of the i-th character, emask is the feature of the masking character, ei is the input word vector feature, and pi is the probability that the i-th character is wrongly written, with i a natural number.
The features ei' are input into the correction network, a BERT-based sequence multi-class labeling model. The final feature representation of each character produced by the correction network is the output of the last layer plus the input word vector feature ei via a residual connection, and the loss function is a weighted sum of the detection network loss and the correction network loss.
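The soft-masking combination described above is a per-position weighted sum of the mask embedding and the original embedding. A numpy sketch with illustrative dimensions (the function name and shapes are assumptions for demonstration):

```python
import numpy as np

def soft_mask(e, e_mask, p):
    """e: (seq_len, d) input word embeddings; e_mask: (d,) masking-character embedding;
    p: (seq_len,) per-position error probabilities from the detection network.
    Returns ei' = pi * e_mask + (1 - pi) * ei for every position i."""
    p = p[:, None]                        # broadcast probability over embedding dim
    return p * e_mask + (1.0 - p) * e
```

When pi is near 1 the position is effectively replaced by the mask embedding; near 0 it keeps its original embedding, which is exactly the "soft" masking the detection network controls.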
For example and without limitation, suppose the text data to be corrected is "machine learning is a known part of artificial intelligence", where "known" is the wrongly written character. The sentence and its negative sample are input into the Soft-Masked BERT pre-training model, and the detection network outputs, for each position, the probability that it may be a wrong character, so the position of "known" receives a high probability. The Soft-Masked part of the model then combines each position's error probability with the mask-character feature and passes the result to the correction network, which outputs, from the candidate words in the vocabulary, the character with the highest probability as the correct character. The erroneous "known" is thus replaced, and text error correction is achieved.
It should be noted that the steps shown in the above-described flow diagrams, or in the flow diagrams of the figures, may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is shown in the flow diagrams, in some cases the steps shown or described may be performed in an order different from that shown here.
The embodiment also provides a text error correction system. Fig. 2 is a block diagram of the structure of the text error correction system according to an embodiment of the present application. As shown in fig. 2, the text error correction system comprises a data acquisition module 10, a negative sample construction module 20, a text error correction module 30, and a model optimization module 40. Those skilled in the art will appreciate that the configuration shown in fig. 2 does not limit the text error correction system, which may include more or fewer modules than shown, combine some modules, or arrange the modules differently.
The following describes each constituent module of the text correction system in detail with reference to fig. 2:
the data acquisition module 10 is used for acquiring text data to be corrected;
the negative sample construction module 20 is configured to create a confusion word table and perform corpus replacement on the text to be corrected according to the confusion word table to generate a negative sample, specifically, 15% of the characters in the text to be corrected are randomly replaced, further, 80% of the characters in the 15% randomly replaced characters are replaced with homophonic characters in the confusion word table, and the rest 20% of the characters are replaced with random characters. The data set is constructed in the mode and then used for training the model, and the trained Soft-Masked BERT pre-training model can obtain stronger correcting capability of homophone confusion errors.
The text error correction module 30 is used for taking the text data to be corrected and the negative sample data as training data, respectively training the Chinese character features and the pinyin features of the training data through the Soft-Masked BERT pre-training model, splicing the resulting word vectors, and calculating the cross entropy loss through a Softmax layer to obtain the error correction result;
and the model optimization module 40 is used for optimizing the Soft-Masked BERT pre-training model through recursive prediction and vocabulary filtering, and further improving the error correction accuracy and the model performance.
The text error correction module 30 further comprises: a word vector acquisition module 301, configured to train the Chinese character features and the pinyin features respectively through the Soft-Masked BERT pre-training model to obtain the word vectors of the Chinese characters and the word vectors of the pinyin; and a cross entropy loss acquisition module 302, configured to splice the word vectors of the Chinese characters and of the pinyin, calculate the sum of the cross entropy losses at each position in the training data through a Softmax layer, and output the error correction result. The model optimization module 40 further comprises: a recursive prediction module 401, configured to input the corrected sentence into the Soft-Masked BERT pre-training model again for recursive error correction; specifically, the recursion stops when two successive predictions give the same result; and a vocabulary filtering module 402, configured to filter the vocabulary in the Soft-Masked BERT pre-training model so that the number of candidate words searched during correction is less than or equal to 1000. The degree of vocabulary filtering is adjustable; preferably, words such as articles, prepositions, adverbs and conjunctions are filtered out of the vocabulary.
This embodiment takes into account that the Soft-Masked BERT pre-training model performs one-to-one error correction: if a sentence contains several erroneous positions, only one of them may be corrected in a single forward pass. Recursive prediction therefore improves the error correction accuracy over the whole sentence, and the vocabulary filtering step significantly improves both the effect and the performance of the Soft-Masked BERT pre-training model.
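The recursive prediction loop — re-feeding the output until two successive predictions agree — can be sketched independently of the model. The `correct_once` callable is a hypothetical stand-in for one forward pass of the error correction model:

```python
def correct_recursively(sentence, correct_once, max_rounds=5):
    """Feed the corrected sentence back into the model until two
    successive predictions agree (a fixpoint), then stop.
    `correct_once` stands in for one forward pass, which, as the
    patent notes, may fix only one error per pass."""
    prev = sentence
    for _ in range(max_rounds):
        cur = correct_once(prev)
        if cur == prev:          # two identical predictions in a row: done
            return cur
        prev = cur
    return prev

# Toy one-error-per-pass corrector: fixes the first '?' it finds.
fix_one = lambda s: s.replace("?", "!", 1)
```

A sentence with two errors needs two passes plus one confirming pass, which the loop handles automatically.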
In addition, the text error correction method described in conjunction with fig. 1 in the embodiment of the present application may be implemented by a computer device. The computer device may include a processor and a memory storing computer program instructions.
In particular, the processor may comprise a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory may include a Hard Disk Drive (Hard Disk Drive, abbreviated to HDD), a floppy Disk Drive, a Solid State Drive (SSD), flash memory, an optical Disk, a magneto-optical Disk, tape, or a Universal Serial Bus (USB) Drive or a combination of two or more of these. The memory may include removable or non-removable (or fixed) media, where appropriate. The memory may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory is a Non-Volatile (Non-Volatile) memory. In particular embodiments, the Memory includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically rewritable ROM (EAROM), or FLASH Memory (FLASH), or a combination of two or more of these, where appropriate. The RAM may be a Static Random-Access Memory (SRAM) or a Dynamic Random-Access Memory (DRAM), where the DRAM may be a Fast Page Mode Dynamic Random-Access Memory (FPMDRAM), an Extended data output Dynamic Random-Access Memory (EDODRAM), a Synchronous Dynamic Random-Access Memory (SDRAM), and the like.
The memory may be used to store or cache various data files for processing and/or communication use, as well as possibly computer program instructions for execution by the processor.
The processor may read and execute the computer program instructions stored in the memory to implement any of the text error correction methods in the above embodiments.
In addition, in combination with the text error correction method in the foregoing embodiment, the embodiment of the present application may provide a computer-readable storage medium to implement. The computer readable storage medium having stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the text correction methods in the above embodiments.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application, and their description is specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (10)
1. A text error correction method, comprising:
a data acquisition step, which is used for acquiring text data to be corrected;
a negative sample construction step, which is used for creating a confusion word table and carrying out corpus replacement on the text to be corrected according to the confusion word table to generate a negative sample;
and a text error correction step, which takes the text data to be corrected and the negative sample data as training data, respectively trains the Chinese character features and the pinyin features of the training data through a Soft-Masked BERT pre-training model, splices the training results, and calculates the cross entropy loss through a Softmax layer to obtain the error correction result.
2. The text error correction method according to claim 1, further comprising a model optimization step for optimizing the Soft-Masked BERT pre-training model by recursive prediction and vocabulary filtering.
3. The text correction method of claim 1, wherein the text correction step further comprises:
a word vector obtaining step, which is used for respectively training the Chinese character characteristics and the pinyin characteristics through a Soft-Masked BERT pre-training model to obtain word vectors of Chinese characters and word vectors of pinyin;
and a cross entropy loss obtaining step, which is used for splicing the word vector of the Chinese character and the word vector of the pinyin, calculating the sum of cross entropy losses of all positions in the training data through a Softmax layer and outputting an error correction result.
4. The text correction method of claim 2 wherein the model optimization step further comprises:
a recursive prediction step, which is used for inputting the error-corrected sentences obtained in the text error correction step into the Soft-Masked BERT pre-training model again for recursive error correction;
and a word list filtering step, which is used for filtering the word list in the Soft-Masked BERT pre-training model, so that the number of search words in the Soft-Masked BERT pre-training model during error correction is less than or equal to 1000.
5. A text correction system, comprising:
the data acquisition module is used for acquiring text data to be corrected;
the negative sample construction module is used for creating a confusion word table and performing corpus replacement on the text to be corrected according to the confusion word table to generate a negative sample;
and the text error correction module is used for taking the text data to be corrected and the negative sample data as training data, respectively training the Chinese character features and the pinyin features of the training data through a Soft-Masked BERT pre-training model, splicing the training results, and calculating the cross entropy loss through a Softmax layer to obtain the error correction result.
6. The text correction system of claim 5, further comprising a model optimization module for optimizing the Soft-Masked BERT pre-training model by recursive prediction and vocabulary filtering.
7. The text correction system of claim 5, wherein the text correction module further comprises:
the word vector acquisition module is used for respectively carrying out Soft-Masked BERT pre-training model training on the Chinese character characteristics and the pinyin characteristics to obtain word vectors of Chinese characters and word vectors of pinyin;
and the cross entropy loss acquisition module is used for splicing the word vector of the Chinese character and the word vector of the pinyin, calculating the sum of cross entropy losses of all positions in the training data through a Softmax layer and outputting an error correction result.
8. The text correction system of claim 6, wherein the model optimization module further comprises:
the recursive prediction module is used for inputting the error-corrected sentences obtained by the text error correction module into the Soft-Masked BERT pre-training model again for recursive error correction;
and the vocabulary filtering module is used for filtering the vocabulary in the Soft-Masked BERT pre-training model, so that the number of search words during error correction is less than or equal to 1000.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the text correction method according to any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a text correction method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011293207.4A CN112287670A (en) | 2020-11-18 | 2020-11-18 | Text error correction method, system, computer device and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112287670A true CN112287670A (en) | 2021-01-29 |
Family
ID=74398422
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011293207.4A Pending CN112287670A (en) | 2020-11-18 | 2020-11-18 | Text error correction method, system, computer device and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112287670A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112560451A (en) * | 2021-02-20 | 2021-03-26 | 京华信息科技股份有限公司 | Wrongly written character proofreading method and device for automatically generating training data |
CN113011126A (en) * | 2021-03-11 | 2021-06-22 | 腾讯科技(深圳)有限公司 | Text processing method and device, electronic equipment and computer readable storage medium |
CN113051894A (en) * | 2021-03-16 | 2021-06-29 | 京东数字科技控股股份有限公司 | Text error correction method and device |
CN113065339A (en) * | 2021-04-12 | 2021-07-02 | 平安国际智慧城市科技股份有限公司 | Automatic error correction method, device and equipment for Chinese text and storage medium |
CN113449514A (en) * | 2021-06-21 | 2021-09-28 | 浙江康旭科技有限公司 | Text error correction method and device suitable for specific vertical field |
CN114023306A (en) * | 2022-01-04 | 2022-02-08 | 阿里云计算有限公司 | Processing method for pre-training language model and spoken language understanding system |
CN114239559A (en) * | 2021-11-15 | 2022-03-25 | 北京百度网讯科技有限公司 | Method, apparatus, device and medium for generating text error correction and text error correction model |
CN114676684A (en) * | 2022-03-17 | 2022-06-28 | 平安科技(深圳)有限公司 | Text error correction method and device, computer equipment and storage medium |
CN114970502A (en) * | 2021-12-29 | 2022-08-30 | 中科大数据研究院 | Text error correction method applied to digital government |
CN115455948A (en) * | 2022-11-11 | 2022-12-09 | 北京澜舟科技有限公司 | Spelling error correction model training method, spelling error correction method and storage medium |
CN115630634A (en) * | 2022-12-08 | 2023-01-20 | 深圳依时货拉拉科技有限公司 | Text error correction method and device, electronic equipment and storage medium |
CN115659958A (en) * | 2022-12-27 | 2023-01-31 | 中南大学 | Chinese spelling error checking method |
WO2023093525A1 (en) * | 2021-11-23 | 2023-06-01 | 中兴通讯股份有限公司 | Model training method, chinese text error correction method, electronic device, and storage medium |
WO2023184633A1 (en) * | 2022-03-31 | 2023-10-05 | 上海蜜度信息技术有限公司 | Chinese spelling error correction method and system, storage medium, and terminal |
WO2024045527A1 (en) * | 2022-09-02 | 2024-03-07 | 美的集团(上海)有限公司 | Word/sentence error correction method and device, readable storage medium, and computer program product |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2015075706A (en) * | 2013-10-10 | 2015-04-20 | 日本放送協会 | Error correction model learning device and program |
CN108874174A (en) * | 2018-05-29 | 2018-11-23 | 腾讯科技(深圳)有限公司 | A kind of text error correction method, device and relevant device |
CN110046350A (en) * | 2019-04-12 | 2019-07-23 | 百度在线网络技术(北京)有限公司 | Grammatical bloopers recognition methods, device, computer equipment and storage medium |
CN110457688A (en) * | 2019-07-23 | 2019-11-15 | 广州视源电子科技股份有限公司 | Correction processing method and device, storage medium and processor |
CN110489760A (en) * | 2019-09-17 | 2019-11-22 | 达而观信息科技(上海)有限公司 | Based on deep neural network text auto-collation and device |
CN111523306A (en) * | 2019-01-17 | 2020-08-11 | 阿里巴巴集团控股有限公司 | Text error correction method, device and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112287670A (en) | Text error correction method, system, computer device and readable storage medium | |
CN109885660B (en) | Knowledge graph energizing question-answering system and method based on information retrieval | |
CN109840287B (en) | Cross-modal information retrieval method and device based on neural network | |
CN113011533B (en) | Text classification method, apparatus, computer device and storage medium | |
CN108363790B (en) | Method, device, equipment and storage medium for evaluating comments | |
CN111159412B (en) | Classification method, classification device, electronic equipment and readable storage medium | |
CN112528637B (en) | Text processing model training method, device, computer equipment and storage medium | |
CN109271524B (en) | Entity linking method in knowledge base question-answering system | |
CN111599340A (en) | Polyphone pronunciation prediction method and device and computer readable storage medium | |
CN110825857A (en) | Multi-turn question and answer identification method and device, computer equipment and storage medium | |
US11934781B2 (en) | Systems and methods for controllable text summarization | |
CN115328756A (en) | Test case generation method, device and equipment | |
CN112199473A (en) | Multi-turn dialogue method and device in knowledge question-answering system | |
CN110866095A (en) | Text similarity determination method and related equipment | |
CN113158687B (en) | Semantic disambiguation method and device, storage medium and electronic device | |
CN110929532B (en) | Data processing method, device, equipment and storage medium | |
CN112307048A (en) | Semantic matching model training method, matching device, equipment and storage medium | |
CN113779190B (en) | Event causal relationship identification method, device, electronic equipment and storage medium | |
CN111611791B (en) | Text processing method and related device | |
CN113705207A (en) | Grammar error recognition method and device | |
CN112632956A (en) | Text matching method, device, terminal and storage medium | |
CN115858776B (en) | Variant text classification recognition method, system, storage medium and electronic equipment | |
CN110162615A (en) | A kind of intelligent answer method, apparatus, electronic equipment and storage medium | |
CN114239555A (en) | Training method of keyword extraction model and related device | |
CN113128224B (en) | Chinese error correction method, device, equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||