CN112287670A - Text error correction method, system, computer device and readable storage medium - Google Patents
- Publication number: CN112287670A; Application number: CN202011293207.4A
- Authority
- CN
- China
- Prior art keywords
- training
- text
- error correction
- masked
- soft
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The application relates to a text error correction method, system, computer device, and computer-readable storage medium. The method comprises the following steps: a data acquisition step for acquiring text data to be corrected; a negative sample construction step for creating a confusion word table and performing corpus replacement on the text to be corrected according to the confusion word table to generate negative samples; a text error correction step, in which the text data to be corrected and the negative sample data are used as training data, the Chinese character features and the pinyin features of the training data are trained separately through a Soft-Masked BERT pre-training model, the two outputs are concatenated into a training result, and the cross-entropy loss of the training result is calculated through a Softmax layer to obtain the error correction result; and a model optimization step for optimizing the Soft-Masked BERT pre-training model through recursive prediction and vocabulary filtering. By the method and the device, text error correction accuracy is effectively improved, as are the effect and performance of the model.
Description
Technical Field
The present application relates to the field of natural language processing, and more particularly, to text error correction methods, systems, computer devices, and computer readable storage media.
Background
Chinese error correction is an important technology for automatically checking and correcting Chinese sentences; its goal is to improve language correctness and reduce the cost of manual proofreading. Chinese character error correction mainly corrects errors according to the similarity of character shapes. It is a natural language processing technique that detects whether a passage contains wrongly written characters and corrects them. It is generally used in the text preprocessing stage and can significantly alleviate inaccurate information acquisition in scenarios such as intelligent customer service; for example, wrongly written characters in intelligent question-answering scenarios degrade query understanding and dialogue quality. In the general domain, Chinese text correction is a problem that people have sought to solve since the early days of the Internet. In a search engine, a good error correction system can offer correction prompts for a user's query terms or directly display the correct answer.
In the prior art, text correction is realized with the following tools: (1) a dictionary of wrongly written characters is constructed; (2) edit distance, which adopts a method similar to fuzzy string matching and, by comparison against correct samples, can correct some common wrongly written characters and grammatical errors; (3) language models, which may operate at character or word error correction granularity. In recent years, pre-trained language models have become popular, and researchers have migrated BERT (Bidirectional Encoder Representations from Transformers) models into text correction, using BERT to select, at each position of a sentence, a character from a candidate list for correction.
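The edit-distance approach in item (2) measures how many single-character edits separate an input from a correct sample. A minimal sketch of the standard Levenshtein dynamic program, purely illustrative and not taken from the patent:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming with a rolling row."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))          # distances from a[:0] to each prefix of b
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                        # deletion
                        dp[j - 1] + 1,                    # insertion
                        prev + (a[i - 1] != b[j - 1]))    # substitution
            prev = cur
    return dp[n]
```

A correction candidate from the correct-sample list would then be the one minimizing this distance to the input.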
Among these tools, constructing a dictionary of wrongly written characters is labor-intensive and only suits a few vertical domains with a limited set of misspellings; the edit-distance approach lacks generality; in language models, because semantic information at character granularity is relatively weak, the misjudgment rate is higher than for correction at word granularity, while word granularity depends on the accuracy of the word segmentation model. To reduce the misjudgment rate, a CRF layer is often added to the output layer of the model for proofreading, learning transition probabilities and a globally optimal path to avoid outputting unreasonable corrections. The BERT approach is too coarse and easily produces a high false-positive rate: BERT's mask positions are chosen randomly, so it is not good at detecting the erroneous positions in a sentence, and BERT-based correction imposes no constraint conditions, so its accuracy is low.
Disclosure of Invention
The embodiments of the present application provide a text error correction method, system, computer device, and computer-readable storage medium that combine the Chinese character features and pinyin features of a text, train with a Soft-Masked BERT pre-training model, and optimize the model through recursive prediction combined with vocabulary filtering, thereby improving error correction accuracy and the effect and performance of the model.
In a first aspect, an embodiment of the present application provides a text error correction method, including:
a data acquisition step, which is used for acquiring text data to be corrected;
a negative sample construction step, configured to create a confusion word table, perform corpus replacement on the text to be corrected according to the confusion word table, and generate negative samples. Specifically, 15% of the characters in the text to be corrected are randomly replaced; of those replaced characters, 80% are replaced with homophonic characters from the confusion word table and the remaining 20% with random characters. Sample data constructed in this way is then used to train the model, so that the trained Soft-Masked BERT pre-training model acquires a stronger ability to correct homophone confusion errors.
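The replacement scheme above (15% of characters replaced; of those, 80% with a homophone from the confusion word table, 20% with a random character) can be sketched as follows. The function name, the confusion-table format, and the character pool are illustrative assumptions, not details from the patent:

```python
import random

def build_negative_sample(text, confusion_table, charset, seed=None):
    """Corrupt ~15% of characters: 80% homophone swaps, 20% random characters."""
    rng = random.Random(seed)
    chars = list(text)
    for i, ch in enumerate(chars):
        if rng.random() >= 0.15:                 # leave ~85% of positions untouched
            continue
        if ch in confusion_table and rng.random() < 0.80:
            chars[i] = rng.choice(confusion_table[ch])   # homophone from confusion table
        else:
            chars[i] = rng.choice(charset)               # random replacement
    return "".join(chars)
```

Seeding the generator makes corpus construction reproducible across training runs.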
a text error correction step, in which the text data to be corrected and the negative sample data are used as training data, the Chinese character features and the pinyin features of the training data are each trained through a Soft-Masked BERT pre-training model, the two outputs are concatenated into a training result, and the cross-entropy loss of the training result is calculated through a Softmax layer to obtain the error correction result. Here, the Soft-Masked BERT pre-training model is a recently proposed neural network structure comprising a detection network and a BERT-based correction network, and the Softmax layer is a machine-learning function that maps inputs to real numbers between 0 and 1 whose normalized sum is 1.
In some embodiments, the text error correction method further includes a model optimization step, which is used for optimizing the Soft-Masked BERT pre-training model through recursive prediction and vocabulary filtering, so as to add constraints to model training and further improve error correction accuracy and model performance.
Through the above steps, a Soft-Masked BERT pre-training model is utilized; specifically, the detection network in the Soft-Masked BERT pre-training model calculates the probability of each position being erroneous and uses it as a character feature, improving error correction accuracy. The pinyin features of the data are used: an erroneous character usually has pinyin similar to that of the correct character, so pinyin features effectively narrow the search range for the correct word. Recursive prediction reduces the misjudgment rate, and vocabulary filtering improves the effect and performance of the model.
In some of these embodiments, the text correction step further comprises:
a word vector obtaining step, which is used for respectively training the Chinese character characteristics and the pinyin characteristics through a Soft-Masked BERT pre-training model to obtain word vectors of Chinese characters and word vectors of pinyin;
and a cross entropy loss obtaining step, which is used for splicing the word vector of the Chinese character and the word vector of the pinyin, calculating the sum of cross entropy losses of all positions in the training data through a Softmax layer and outputting an error correction result.
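One way to read the splice-then-Softmax computation above: per position, the Chinese-character vector and pinyin vector are concatenated, projected to vocabulary logits, and the cross-entropy losses of all positions are summed. A minimal numpy sketch under assumed dimensions (the projection matrix and sizes are illustrative, not from the patent):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def total_cross_entropy(hanzi_vecs, pinyin_vecs, proj, targets):
    """hanzi_vecs, pinyin_vecs: (seq_len, d); proj: (2d, vocab); targets: (seq_len,) ids."""
    features = np.concatenate([hanzi_vecs, pinyin_vecs], axis=-1)  # splice per position
    probs = softmax(features @ proj)                               # (seq_len, vocab)
    # sum of cross-entropy losses over all positions in the sequence
    return -np.log(probs[np.arange(len(targets)), targets] + 1e-12).sum()
```

In a real training loop this scalar would be backpropagated; here it only illustrates the loss structure.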
In some of these embodiments, the model optimization step further comprises:
a recursive prediction step, configured to feed the corrected sentence obtained in the text error correction step back into the Soft-Masked BERT pre-training model for recursive error correction. Specifically, the recursive prediction step comprises at least one recursion, and the recursion stops when two consecutive predictions produce the same result;
a vocabulary filtering step, configured to filter the vocabulary in the Soft-Masked BERT pre-training model so that the number of candidate words searched during error correction is at most 1000. The degree of vocabulary filtering is adjustable; preferably, function words in the vocabulary such as articles, prepositions, adverbs, and conjunctions are filtered out.
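The stopping rule (re-feed the corrected sentence until two consecutive passes agree) and the ≤1000-word candidate cap can be sketched as below; `correct_once` stands in for one forward pass of the model, and both helper names are hypothetical:

```python
def recursive_correct(sentence, correct_once, max_rounds=10):
    """Re-run correction until two consecutive passes produce the same sentence."""
    prev = sentence
    for _ in range(max_rounds):
        cur = correct_once(prev)
        if cur == prev:        # two consecutive predictions agree -> stop recursion
            return cur
        prev = cur
    return prev                # safety cap in case corrections oscillate

def filter_vocab(vocab, stopwords, limit=1000):
    """Drop function words (articles, prepositions, adverbs, conjunctions),
    then cap the candidate search list at `limit` entries."""
    kept = [w for w in vocab if w not in stopwords]
    return kept[:limit]
```

The `max_rounds` guard is an added safeguard not described in the patent, which only specifies stopping when two predictions match.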
In this embodiment, the Soft-Masked BERT pre-training model corrects one position at a time; that is, if a sentence has several erroneous positions, a single forward pass may correct only one of them, so recursive prediction improves the error correction accuracy of the whole sentence. The vocabulary filtering step significantly improves the effect and performance of the Soft-Masked BERT pre-training model.
In a second aspect, an embodiment of the present application provides a text correction system, including:
the data acquisition module is used for acquiring text data to be corrected;
the negative sample construction module, configured to create a confusion word table, perform corpus replacement on the text to be corrected according to the confusion word table, and generate negative samples. Specifically, 15% of the characters in the text to be corrected are randomly replaced; of those replaced characters, 80% are replaced with homophonic characters from the confusion word table and the remaining 20% with random characters. A data set constructed in this way is then used to train the model, so that the trained Soft-Masked BERT pre-training model acquires a stronger ability to correct homophone confusion errors.
the text error correction module, configured to use the text data to be corrected and the negative sample data as training data, train the Chinese character features and the pinyin features of the training data separately through a Soft-Masked BERT pre-training model, concatenate the two outputs into a training result, and calculate the cross-entropy loss of the training result through a Softmax layer to obtain the error correction result.
In some embodiments, the text error correction system further includes a model optimization module, which is used for optimizing the Soft-Masked BERT pre-training model through recursive prediction and vocabulary filtering, so as to further improve the error correction accuracy and the model performance.
Through the above modules, a Soft-Masked BERT pre-training model is utilized; specifically, the detection network in the Soft-Masked BERT pre-training model calculates the probability of each position being erroneous and uses it as a character feature, improving error correction accuracy. The pinyin features of the data are used: an erroneous character usually has pinyin similar to that of the correct character, so pinyin features effectively narrow the search range for the correct word. Recursive prediction reduces the misjudgment rate, and vocabulary filtering improves the effect and performance of the model.
In some embodiments, the text correction module further comprises:
the word vector acquisition module is used for respectively carrying out Soft-Masked BERT pre-training model training on the Chinese character characteristics and the pinyin characteristics to obtain word vectors of Chinese characters and word vectors of pinyin;
and the cross entropy loss acquisition module is used for splicing the word vector of the Chinese character and the word vector of the pinyin, calculating the sum of cross entropy losses of all positions in the training data through a Softmax layer and outputting an error correction result.
In some of these embodiments, the model optimization module further comprises:
the recursive prediction module, configured to feed the corrected sentence obtained by the text error correction module back into the Soft-Masked BERT pre-training model for recursive error correction; specifically, the recursion stops when two consecutive predictions produce the same result;
the vocabulary filtering module, configured to filter the vocabulary in the Soft-Masked BERT pre-training model so that the number of candidate words searched during error correction is at most 1000. The degree of vocabulary filtering is adjustable; preferably, function words in the vocabulary such as articles, prepositions, adverbs, and conjunctions are filtered out.
In this embodiment, the Soft-Masked BERT pre-training model corrects one position at a time; that is, if a sentence has several erroneous positions, a single forward pass may correct only one of them, so recursive prediction improves the error correction accuracy of the whole sentence. The vocabulary filtering module significantly improves the effect and performance of the Soft-Masked BERT pre-training model.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the text error correction method according to the first aspect is implemented.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the text error correction method according to the first aspect.
Compared with the prior art, the text error correction method, the text error correction system, the computer equipment and the computer readable storage medium provided by the embodiment of the application improve the error correction effect by using the Soft-Masked BERT pre-training model compared with the original BERT model; in addition, the effect and the performance of the model are obviously improved and the correction accuracy is improved through recursive error correction and vocabulary filtering.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart diagram of a text error correction method according to an embodiment of the present application;
FIG. 2 is a block diagram illustrating the structure of a text correction system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a text correction step according to a preferred embodiment of the present application;
FIG. 4 is a schematic diagram of the principle of the Soft-Masked BERT pre-training model according to the preferred embodiment of the present application.
Description of the drawings:
10. a data acquisition module; 20. a negative sample construction module; 30. a text error correction module;
40. a model optimization module;
301. a word vector acquisition module; 302. a cross entropy loss acquisition module;
401. a recursive prediction module; 402. and a word list filtering module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The embodiment provides a text error correction method. Fig. 1 is a schematic flow chart of a text error correction method according to an embodiment of the present application, and referring to fig. 1, the flow chart includes the following steps:
a data obtaining step S10, configured to obtain text data to be corrected;
a negative sample construction step S20, configured to create a confusion word table and perform corpus replacement on the text to be corrected according to the confusion word table to generate negative samples. Specifically, 15% of the characters in the text to be corrected are randomly replaced; of those replaced characters, 80% are replaced with homophonic characters from the confusion word table and the remaining 20% with random characters. Sample data constructed in this way is then used to train the model, so that the trained Soft-Masked BERT pre-training model acquires a stronger ability to correct homophone confusion errors.
a text error correction step S30, in which the text data to be corrected and the negative sample data are used as training data, the Chinese character features and the pinyin features of the training data are each trained through a Soft-Masked BERT pre-training model and then concatenated into a training result, and the cross-entropy loss of the training result is calculated through a Softmax layer to obtain the error correction result.
And a model optimization step S40, which is used for optimizing the Soft-Masked BERT pre-training model through recursive prediction and vocabulary filtering, adding constraints for model training, and further improving the error correction accuracy and the performance of the model.
Through the above steps, a Soft-Masked BERT pre-training model is utilized; specifically, the detection network in the Soft-Masked BERT pre-training model calculates the probability of each position being erroneous and uses it as a character feature, improving error correction accuracy. The pinyin features of the data are used: an erroneous character usually has pinyin similar to that of the correct character, so pinyin features effectively narrow the search range for the correct word. Recursive prediction reduces the misjudgment rate, and vocabulary filtering improves the effect and performance of the model.
In some of these embodiments, the text correction step S30 further includes:
a word vector obtaining step S301, for respectively performing Soft-Masked BERT pre-training model training on the Chinese character features and the pinyin features to obtain word vectors of Chinese characters and word vectors of pinyin;
and a cross entropy loss obtaining step S302, which is used for splicing the word vector of the Chinese character and the word vector of the pinyin, calculating the sum of cross entropy losses of all positions in the training data through a Softmax layer, and outputting an error correction result.
In some of these embodiments, the model optimization step S40 further includes:
a recursive prediction step S401, configured to feed the sentence corrected in step S30 back into the Soft-Masked BERT pre-training model for recursive error correction; specifically, the recursion stops when two consecutive predictions produce the same result;
a vocabulary filtering step S402, configured to filter the vocabulary in the Soft-Masked BERT pre-training model so that the number of candidate words searched during error correction is at most 1000. The degree of vocabulary filtering is adjustable; preferably, function words in the vocabulary such as articles, prepositions, adverbs, and conjunctions are filtered out.
This embodiment takes into account that the Soft-Masked BERT pre-training model corrects one position at a time; that is, if a sentence has several erroneous positions, a single forward pass may correct only one of them, so recursive prediction improves the error correction accuracy of the whole sentence. The vocabulary filtering step significantly improves the effect and performance of the Soft-Masked BERT pre-training model.
The embodiments of the present application are described and illustrated below by means of preferred embodiments.
Fig. 3 is a schematic diagram of the principle of the text error correction step according to a preferred embodiment of the present application, and fig. 4 is a schematic diagram of the Soft-Masked BERT pre-training model according to the preferred embodiment. With reference to fig. 1, fig. 3, and fig. 4: in this embodiment, the text data to be corrected and the negative sample data obtained through steps S10 and S20 are used as training data. The Chinese character features x1, x2, x3, x4 of the training data are input to the Soft-Masked BERT pre-training model (Chinese character Soft-Masked BERT) to obtain word vectors of the Chinese characters, and the pinyin features x1', x2', x3', x4' are likewise trained through the Soft-Masked BERT pre-training model (pinyin Soft-Masked BERT) to obtain word vectors of the pinyin; the two sets of word vectors are then concatenated and the cross-entropy loss is calculated through a Softmax layer.
The Soft-Masked BERT pre-training model shown in fig. 4 specifically comprises a Detection Network, which predicts the probability that each character is wrongly written, and a Correction Network, which predicts the probability of the correction. Specifically, the detection network consists of a bidirectional GRU network (Bi-GRU for short); it fully learns the input context information and outputs, for each position i, the probability pi that the character at that position is wrongly written — the larger pi in the figure, the more likely that position is erroneous, where i is a natural number. Between the detection network and the correction network sits the Soft Masking part: the feature of each position is formed by weighting the feature of the masking character by pi, weighting the original input word vector feature by (1 - pi), and adding the two parts. This process can be expressed as:
ei' = pi · emask + (1 - pi) · ei,
where ei' is the output feature of the i-th character, emask is the feature of the masking character, ei is the input word vector feature, and pi is the probability that the i-th character is wrongly written, with i a natural number.
The features ei' are input into the correction network, a BERT-based sequence multi-class labeling model. The final feature representation of each character produced by the correction network is the output of the last layer plus the input word vector feature ei via a residual connection, and the loss function is a weighted sum of the detection network loss and the correction network loss.
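The soft-masking combination described above is a per-position weighted sum of the mask embedding and the original embedding. A numpy sketch with illustrative dimensions (the function name and shapes are assumptions for demonstration):

```python
import numpy as np

def soft_mask(e, e_mask, p):
    """e: (seq_len, d) input word embeddings; e_mask: (d,) masking-character embedding;
    p: (seq_len,) per-position error probabilities from the detection network.
    Returns ei' = pi * e_mask + (1 - pi) * ei for every position i."""
    p = p[:, None]                        # broadcast probability over embedding dim
    return p * e_mask + (1.0 - p) * e
```

When pi is near 1 the position is effectively replaced by the mask embedding; near 0 it keeps its original embedding, which is exactly the "soft" masking the detection network controls.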
For example and without limitation, suppose the text data to be corrected is "machine learning is a known part of artificial intelligence", where "known" is the wrongly written character. The sentence and its negative sample are input into the Soft-Masked BERT pre-training model, and the detection network outputs, for each position, the probability that it may be a wrong character, so the position of "known" receives a high probability. The Soft-Masked part of the model then combines each position's error probability with the mask-character feature and passes the result to the correction network, which outputs, from the candidate words in the vocabulary, the character with the highest probability as the correct character. The erroneous "known" is thus replaced, and text error correction is achieved.
It should be noted that the steps shown in the above-described flow diagrams, or in the flow diagrams of the figures, may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is shown in the flow diagrams, in some cases the steps shown or described may be performed in an order different from that shown here.
The embodiment also provides a text error correction system. Fig. 2 is a block diagram of the structure of the text error correction system according to an embodiment of the present application. As shown in fig. 2, the text error correction system comprises a data acquisition module 10, a negative sample construction module 20, a text error correction module 30, and a model optimization module 40. Those skilled in the art will appreciate that the configuration shown in fig. 2 does not limit the text error correction system, which may include more or fewer modules than shown, combine some modules, or arrange the modules differently.
The following describes each constituent module of the text correction system in detail with reference to fig. 2:
the data acquisition module 10 is used for acquiring text data to be corrected;
the negative sample construction module 20 is configured to create a confusion word table and perform corpus replacement on the text to be corrected according to the confusion word table to generate a negative sample, specifically, 15% of the characters in the text to be corrected are randomly replaced, further, 80% of the characters in the 15% randomly replaced characters are replaced with homophonic characters in the confusion word table, and the rest 20% of the characters are replaced with random characters. The data set is constructed in the mode and then used for training the model, and the trained Soft-Masked BERT pre-training model can obtain stronger correcting capability of homophone confusion errors.
The text error correction module 30 is used for taking the text data to be corrected and the negative sample data as training data, respectively training the Chinese character features and the pinyin features of the training data through the Soft-Masked BERT pre-training model, splicing the resulting word vectors, and calculating the cross entropy loss through a Softmax layer to obtain the error correction result;
and the model optimization module 40 is used for optimizing the Soft-Masked BERT pre-training model through recursive prediction and vocabulary filtering, and further improving the error correction accuracy and the model performance.
The text error correction module 30 further comprises: a word vector acquisition module 301, configured to train the Chinese character features and the pinyin features respectively through the Soft-Masked BERT pre-training model to obtain the word vectors of the Chinese characters and the word vectors of the pinyin; and a cross entropy loss acquisition module 302, configured to splice the word vectors of the Chinese characters and of the pinyin, calculate the sum of the cross entropy losses at each position in the training data through a Softmax layer, and output the error correction result. The model optimization module 40 further comprises: a recursive prediction module 401, configured to input the corrected sentence into the Soft-Masked BERT pre-training model again for recursive error correction; specifically, the recursion stops when two successive predictions give the same result; and a vocabulary filtering module 402, configured to filter the vocabulary in the Soft-Masked BERT pre-training model so that the number of candidate words searched during correction is less than or equal to 1000. The degree of vocabulary filtering is adjustable; preferably, words such as articles, prepositions, adverbs and conjunctions are filtered out of the vocabulary.
This embodiment takes into account that the Soft-Masked BERT pre-training model performs one-to-one error correction: if a sentence contains several erroneous positions, only one of them may be corrected in a single forward pass. Recursive prediction therefore improves the error correction accuracy over the whole sentence, and the vocabulary filtering step significantly improves both the effect and the performance of the Soft-Masked BERT pre-training model.
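The recursive prediction loop — re-feeding the output until two successive predictions agree — can be sketched independently of the model. The `correct_once` callable is a hypothetical stand-in for one forward pass of the error correction model:

```python
def correct_recursively(sentence, correct_once, max_rounds=5):
    """Feed the corrected sentence back into the model until two
    successive predictions agree (a fixpoint), then stop.
    `correct_once` stands in for one forward pass, which, as the
    patent notes, may fix only one error per pass."""
    prev = sentence
    for _ in range(max_rounds):
        cur = correct_once(prev)
        if cur == prev:          # two identical predictions in a row: done
            return cur
        prev = cur
    return prev

# Toy one-error-per-pass corrector: fixes the first '?' it finds.
fix_one = lambda s: s.replace("?", "!", 1)
```

A sentence with two errors needs two passes plus one confirming pass, which the loop handles automatically.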
In addition, the text error correction method described in conjunction with fig. 1 in the embodiment of the present application may be implemented by a computer device. The computer device may include a processor and a memory storing computer program instructions.
In particular, the processor may comprise a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory may include a Hard Disk Drive (Hard Disk Drive, abbreviated to HDD), a floppy Disk Drive, a Solid State Drive (SSD), flash memory, an optical Disk, a magneto-optical Disk, tape, or a Universal Serial Bus (USB) Drive or a combination of two or more of these. The memory may include removable or non-removable (or fixed) media, where appropriate. The memory may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory is a Non-Volatile (Non-Volatile) memory. In particular embodiments, the Memory includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically rewritable ROM (EAROM), or FLASH Memory (FLASH), or a combination of two or more of these, where appropriate. The RAM may be a Static Random-Access Memory (SRAM) or a Dynamic Random-Access Memory (DRAM), where the DRAM may be a Fast Page Mode Dynamic Random-Access Memory (FPMDRAM), an Extended data output Dynamic Random-Access Memory (EDODRAM), a Synchronous Dynamic Random-Access Memory (SDRAM), and the like.
The memory may be used to store or cache various data files for processing and/or communication use, as well as possibly computer program instructions for execution by the processor.
The processor may read and execute the computer program instructions stored in the memory to implement any of the text error correction methods in the above embodiments.
In addition, in combination with the text error correction method in the foregoing embodiment, the embodiment of the present application may provide a computer-readable storage medium to implement. The computer readable storage medium having stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the text correction methods in the above embodiments.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application, and their description is specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (10)
1. A text error correction method, comprising:
a data acquisition step, which is used for acquiring text data to be corrected;
a negative sample construction step, which is used for creating a confusion word table and carrying out corpus replacement on the text to be corrected according to the confusion word table to generate a negative sample;
and a text error correction step, which takes the text data to be corrected and the negative sample data as training data, respectively trains the Chinese character features and the pinyin features of the training data through a Soft-Masked BERT pre-training model, splices the training results, and calculates the cross entropy loss through a Softmax layer to obtain the error correction result.
2. The text error correction method according to claim 1, further comprising a model optimization step for optimizing the Soft-Masked BERT pre-training model by recursive prediction and vocabulary filtering.
3. The text correction method of claim 1, wherein the text correction step further comprises:
a word vector obtaining step, which is used for respectively training the Chinese character characteristics and the pinyin characteristics through a Soft-Masked BERT pre-training model to obtain word vectors of Chinese characters and word vectors of pinyin;
and a cross entropy loss obtaining step, which is used for splicing the word vector of the Chinese character and the word vector of the pinyin, calculating the sum of cross entropy losses of all positions in the training data through a Softmax layer and outputting an error correction result.
4. The text correction method of claim 2 wherein the model optimization step further comprises:
a recursive prediction step, which is used for inputting the error-corrected sentences obtained in the text error correction step into the Soft-Masked BERT pre-training model again for recursive error correction;
and a word list filtering step, which is used for filtering the word list in the Soft-Masked BERT pre-training model, so that the number of search words in the Soft-Masked BERT pre-training model during error correction is less than or equal to 1000.
5. A text correction system, comprising:
the data acquisition module is used for acquiring text data to be corrected;
the negative sample construction module is used for creating a confusion word table and performing corpus replacement on the text to be corrected according to the confusion word table to generate a negative sample;
and the text error correction module is used for taking the text data to be corrected and the negative sample data as training data, respectively training the Chinese character features and the pinyin features of the training data through a Soft-Masked BERT pre-training model, splicing the training results, and calculating the cross entropy loss through a Softmax layer to obtain the error correction result.
6. The text correction system of claim 5, further comprising a model optimization module for optimizing the Soft-Masked BERT pre-training model by recursive prediction and vocabulary filtering.
7. The text correction system of claim 5, wherein the text correction module further comprises:
the word vector acquisition module is used for respectively carrying out Soft-Masked BERT pre-training model training on the Chinese character characteristics and the pinyin characteristics to obtain word vectors of Chinese characters and word vectors of pinyin;
and the cross entropy loss acquisition module is used for splicing the word vector of the Chinese character and the word vector of the pinyin, calculating the sum of cross entropy losses of all positions in the training data through a Softmax layer and outputting an error correction result.
8. The text correction system of claim 6, wherein the model optimization module further comprises:
the recursive prediction module is used for inputting the error-corrected sentences obtained by the text error correction module into the Soft-Masked BERT pre-training model again for recursive error correction;
and the vocabulary filtering module is used for filtering the vocabulary in the Soft-Masked BERT pre-training model, so that the number of search words during error correction is less than or equal to 1000.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the text correction method according to any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a text correction method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011293207.4A CN112287670A (en) | 2020-11-18 | 2020-11-18 | Text error correction method, system, computer device and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112287670A true CN112287670A (en) | 2021-01-29 |
Family
ID=74398422
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011293207.4A Pending CN112287670A (en) | 2020-11-18 | 2020-11-18 | Text error correction method, system, computer device and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112287670A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112560451A (en) * | 2021-02-20 | 2021-03-26 | 京华信息科技股份有限公司 | Wrongly written character proofreading method and device for automatically generating training data |
CN113011126A (en) * | 2021-03-11 | 2021-06-22 | 腾讯科技(深圳)有限公司 | Text processing method and device, electronic equipment and computer readable storage medium |
CN113051894A (en) * | 2021-03-16 | 2021-06-29 | 京东数字科技控股股份有限公司 | Text error correction method and device |
CN113065339A (en) * | 2021-04-12 | 2021-07-02 | 平安国际智慧城市科技股份有限公司 | Automatic error correction method, device and equipment for Chinese text and storage medium |
CN113449514A (en) * | 2021-06-21 | 2021-09-28 | 浙江康旭科技有限公司 | Text error correction method and device suitable for specific vertical field |
CN114023306A (en) * | 2022-01-04 | 2022-02-08 | 阿里云计算有限公司 | Processing method for pre-training language model and spoken language understanding system |
CN114239559A (en) * | 2021-11-15 | 2022-03-25 | 北京百度网讯科技有限公司 | Method, apparatus, device and medium for generating text error correction and text error correction model |
CN114676684A (en) * | 2022-03-17 | 2022-06-28 | 平安科技(深圳)有限公司 | Text error correction method and device, computer equipment and storage medium |
CN114970502A (en) * | 2021-12-29 | 2022-08-30 | 中科大数据研究院 | Text error correction method applied to digital government |
CN115455948A (en) * | 2022-11-11 | 2022-12-09 | 北京澜舟科技有限公司 | Spelling error correction model training method, spelling error correction method and storage medium |
CN115630634A (en) * | 2022-12-08 | 2023-01-20 | 深圳依时货拉拉科技有限公司 | Text error correction method and device, electronic equipment and storage medium |
CN115659958A (en) * | 2022-12-27 | 2023-01-31 | 中南大学 | Chinese spelling error checking method |
WO2023093525A1 (en) * | 2021-11-23 | 2023-06-01 | 中兴通讯股份有限公司 | Model training method, chinese text error correction method, electronic device, and storage medium |
WO2023184633A1 (en) * | 2022-03-31 | 2023-10-05 | 上海蜜度信息技术有限公司 | Chinese spelling error correction method and system, storage medium, and terminal |
WO2024045527A1 (en) * | 2022-09-02 | 2024-03-07 | 美的集团(上海)有限公司 | Word/sentence error correction method and device, readable storage medium, and computer program product |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2015075706A (en) * | 2013-10-10 | 2015-04-20 | 日本放送協会 | Error correction model learning device and program |
CN108874174A (en) * | 2018-05-29 | 2018-11-23 | 腾讯科技(深圳)有限公司 | A kind of text error correction method, device and relevant device |
CN110046350A (en) * | 2019-04-12 | 2019-07-23 | 百度在线网络技术(北京)有限公司 | Grammatical bloopers recognition methods, device, computer equipment and storage medium |
CN110457688A (en) * | 2019-07-23 | 2019-11-15 | 广州视源电子科技股份有限公司 | Correction processing method and device, storage medium and processor |
CN110489760A (en) * | 2019-09-17 | 2019-11-22 | 达而观信息科技(上海)有限公司 | Based on deep neural network text auto-collation and device |
CN111523306A (en) * | 2019-01-17 | 2020-08-11 | 阿里巴巴集团控股有限公司 | Text error correction method, device and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112287670A (en) | Text error correction method, system, computer device and readable storage medium | |
CN109885660B (en) | Knowledge graph energizing question-answering system and method based on information retrieval | |
CN109840287B (en) | Cross-modal information retrieval method and device based on neural network | |
CN113011533B (en) | Text classification method, apparatus, computer device and storage medium | |
CN108363790B (en) | Method, device, equipment and storage medium for evaluating comments | |
CN111159412B (en) | Classification method, classification device, electronic equipment and readable storage medium | |
CN112528637B (en) | Text processing model training method, device, computer equipment and storage medium | |
CN109271524B (en) | Entity linking method in knowledge base question-answering system | |
CN111599340A (en) | Polyphone pronunciation prediction method and device and computer readable storage medium | |
CN110825857A (en) | Multi-turn question and answer identification method and device, computer equipment and storage medium | |
US11934781B2 (en) | Systems and methods for controllable text summarization | |
CN115328756A (en) | Test case generation method, device and equipment | |
CN112199473A (en) | Multi-turn dialogue method and device in knowledge question-answering system | |
CN110866095A (en) | Text similarity determination method and related equipment | |
CN113158687B (en) | Semantic disambiguation method and device, storage medium and electronic device | |
CN110929532B (en) | Data processing method, device, equipment and storage medium | |
CN112307048A (en) | Semantic matching model training method, matching device, equipment and storage medium | |
CN113779190B (en) | Event causal relationship identification method, device, electronic equipment and storage medium | |
CN111611791B (en) | Text processing method and related device | |
CN113705207A (en) | Grammar error recognition method and device | |
CN112632956A (en) | Text matching method, device, terminal and storage medium | |
CN115858776B (en) | Variant text classification recognition method, system, storage medium and electronic equipment | |
CN110162615A (en) | A kind of intelligent answer method, apparatus, electronic equipment and storage medium | |
CN114239555A (en) | Training method of keyword extraction model and related device | |
CN113128224B (en) | Chinese error correction method, device, equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||