CN112329476A - Text error correction method and device, equipment and storage medium - Google Patents


Info

Publication number
CN112329476A
Authority
CN
China
Prior art keywords
text
error correction
output
loss
correction model
Prior art date
Legal status
Pending
Application number
CN202011252798.0A
Other languages
Chinese (zh)
Inventor
袁鹏 (Yuan Peng)
李浩然 (Li Haoran)
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN202011252798.0A
Publication of CN112329476A

Classifications

    • G06F 40/30 - Handling natural language data; Semantic analysis
    • G06F 40/216 - Natural language analysis; Parsing using statistical methods
    • G06F 40/242 - Lexical tools; Dictionaries
    • G06F 40/279 - Natural language analysis; Recognition of textual entities
    • G06N 3/088 - Neural networks; Learning methods; Non-supervised learning, e.g. competitive learning
    • G06V 10/40 - Image or video recognition or understanding; Extraction of image or video features


Abstract

The embodiment of the application discloses a text error correction method, which comprises the following steps: acquiring a text to be corrected; and inputting the text to be corrected into a text error correction model to obtain a first output and a second output of the text error correction model, where the first output represents the error recognition result of the text to be corrected and the second output represents the error correction result of the text to be corrected. The parameters of the text error correction model are obtained based on a first prediction result and a second prediction result corresponding to training data, where the first prediction result represents the error recognition result of the training data and the second prediction result represents the error correction result of the training data. In addition, the embodiment of the application also discloses a text error correction apparatus, a device and a storage medium.

Description

Text error correction method and device, equipment and storage medium
Technical Field
The embodiment of the application relates to the field of natural language processing, and relates to, but is not limited to, a text error correction method and apparatus, a device and a storage medium.
Background
Currently, text error correction methods fall into three categories: text correction using dictionaries, text correction based on edit distance, and text correction based on deep learning models. A text error correction method using a deep learning model avoids manual feature extraction, reduces manual participation, and achieves higher accuracy than a text error correction method based on a dictionary or on edit distance.
The deep learning models used in such methods include combinations of a Recurrent Neural Network (RNN) with a Conditional Random Field (CRF) model, or of Bidirectional Encoder Representations from Transformers (BERT) with a CRF. These models use the RNN or BERT to extract features of the input text, use the CRF to output wrong-character labels for error recognition, and then use a classifier to correct the wrongly written characters. In the modeling process, i.e. the training process, the models likewise extract features with the RNN or BERT, recognize errors through the CRF labels, correct the wrongly written characters with a classifier, and supervise the model only according to the correction result, so the overall error correction accuracy is low.
Disclosure of Invention
In view of this, embodiments of the present application provide a text error correction method and apparatus, a device and a storage medium, to solve at least one problem in the related art and improve error correction accuracy.
The technical scheme of the embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a text error correction method, where the method includes:
acquiring a text to be corrected;
inputting the text to be corrected into a text correction model to obtain a first output and a second output of the text correction model; the first output represents an error recognition result of the text to be corrected; the second output represents the error correction result of the text to be corrected; the parameters of the text error correction model are obtained based on a first prediction result and a second prediction result corresponding to training data, and the first prediction result represents an error recognition result of the training data; the second prediction result characterizes an error correction result of the training data.
In a second aspect, an embodiment of the present application provides a text error correction apparatus, where the apparatus includes:
the receiving unit is used for acquiring a text to be corrected;
the error correction unit is used for inputting the text to be corrected into a text error correction model to obtain a first output and a second output of the text error correction model; the first output represents an error recognition result of the text to be corrected; the second output represents the error correction result of the text to be corrected; the parameters of the text error correction model are obtained based on a first prediction result and a second prediction result corresponding to training data, and the first prediction result represents an error recognition result of the training data; the second prediction result characterizes an error correction result of the training data.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps in the text error correction method when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the text error correction method.
In the embodiment of the application, a text error correction method is provided: a text to be corrected is acquired and input into a text error correction model to obtain a first output and a second output of the model, where the first output represents the error recognition result of the text to be corrected and the second output represents the error correction result of the text to be corrected, and the parameters of the model are obtained based on a first prediction result (representing the error recognition result of the training data) and a second prediction result (representing the error correction result of the training data). The text error correction task is thus divided into the two tasks of wrong-character recognition and wrong-character correction, the text error correction model is trained through the execution results of both tasks, the training of the model is supervised by the results of both tasks, the two tasks influence each other, and the accuracy of the error correction result is improved.
Drawings
FIG. 1 is a schematic diagram of an alternative configuration of an information processing system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an alternative configuration of an information processing system according to an embodiment of the present application;
FIG. 3 is an alternative flow diagram of a text error correction method provided in an embodiment of the present application;
FIG. 4 is an alternative structural schematic diagram of a text error correction model provided in an embodiment of the present application;
FIG. 5 is an alternative flow diagram of a text error correction method provided in an embodiment of the present application;
FIG. 6 is an alternative structural schematic diagram of a feature extraction module provided in an embodiment of the present application;
FIG. 7 is an alternative flow diagram of a text error correction method provided in an embodiment of the present application;
FIG. 8 is an alternative flow diagram of a text error correction method provided in an embodiment of the present application;
FIG. 9 is an alternative flow diagram of a text error correction method provided in an embodiment of the present application;
FIG. 10 is an alternative structural schematic diagram of a text error correction model provided in an embodiment of the present application;
FIG. 11 is an alternative structural diagram of a text error correction apparatus according to an embodiment of the present application;
FIG. 12 is an alternative structural schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the following will describe the specific technical solutions of the present application in further detail with reference to the accompanying drawings in the embodiments of the present application. The following examples are intended to illustrate the present application but are not intended to limit the scope of the present application.
The embodiments of the application provide a text error correction method and apparatus, a device and a storage medium. In practical application, the text error correction method may be implemented by a text error correction apparatus, and each functional entity in the text error correction apparatus may be cooperatively implemented by hardware resources of a computer device (such as a terminal device, a server, or a server cluster), including computing resources such as processors and communication resources such as optical cables and cellular links.
The text error correction method of the embodiment of the application can be applied to the information processing system shown in fig. 1. The text error correction device in the information processing system can process a text to be corrected to obtain a first output and a second output, where the first output represents the error recognition result of the text to be corrected and the second output represents the error correction result of the text to be corrected.
in one example, as shown in FIG. 1, the information handling system includes a client 10 and a server 20; the client 10 is installed with an input application APP capable of receiving user input operation or a browser providing an input page, and a user can receive a text to be corrected input by the user through the input application APP or the input page. The server 20 is a service of the client 10, the server 20 includes a text error correction model, and the server 20 can receive the text to be corrected and correct the text to be corrected based on the text error correction model 201. The client 10 and the server 20 interact with each other via the network 30.
The server 20 may be implemented as a text correction apparatus implementing a text correction method. The server 20 obtains a text to be corrected; inputting a text to be corrected into a text correction model to obtain a first output and a second output of the text correction model; the first output represents an error recognition result of the text to be corrected; the second output represents the error correction result of the text to be corrected; the parameters of the text error correction model are obtained based on a first prediction result and a second prediction result corresponding to training data, and the first prediction result represents an error recognition result of the training data; the second prediction result represents an error correction result of the training data, and the first output and the second output are pushed to the client 10. The text to be corrected, the first output and the second output may be presented to the user in the client 10 via a display.
In one example, as shown in FIG. 2, only the client 40 is included in the information processing system, and the client 40 is implemented as a text correction apparatus that implements a text correction method. A text correction model is included in the client 40.
The client receives the text to be corrected input by the user through the input application APP or the input page, inputs the text to be corrected into the text error correction model to obtain the first output and the second output of the text error correction model, and presents the text to be corrected, the first output and the second output to the user.
In combination with the information processing system, the embodiment provides a text error correction method, which can improve error correction accuracy.
Embodiments of a text error correction method, a text error correction device, a text error correction apparatus, and a storage medium according to the embodiments of the present application are described below with reference to schematic diagrams of information processing systems shown in fig. 1 or fig. 2.
The embodiment provides a text error correction method, which is applied to a text error correction device; the text error correction device can be a computer device or a distributed network formed by computer devices. The functions implemented by the method may be implemented by a processor in the computer device calling program code, and the program code may of course be stored in a computer storage medium; the computer device thus comprises at least a processor and a storage medium.
Fig. 3 is a schematic flow chart of an implementation of a text error correction method according to an embodiment of the present application, and as shown in fig. 3, the method may include the following steps:
s301, obtaining a text to be corrected.
Here, the error correction device obtains the text to be corrected, where the text to be corrected includes at least one character, and a character may be any content having semantic meaning, such as a word, a number, an English letter, or a symbol.
In one example, the error correction device is a server. The client receives the text to be corrected based on an input operation of the user and sends the received text to be corrected to the server.
In an example, the error correction device is a client, and the client receives the text to be corrected based on an input operation of a user, so as to obtain the text to be corrected.
The method for receiving the text to be corrected by the client may include at least one of the following:
receiving a selection operation of a user on a text, and taking the text corresponding to the selection operation as the text to be corrected;
receiving a text file input by a user, and taking the content of the file as the text to be corrected;
receiving a non-text file input by a user, recognizing the non-text file, and taking the recognized content as the text to be corrected, where the non-text file may include pictures, voice, video, and the like.
It should be noted that, in the embodiment of the present application, no limitation is imposed on the manner in which the client receives the text to be corrected.
S302, inputting the text to be corrected into a text correction model to obtain a first output and a second output of the text correction model.
Wherein the first output represents the error recognition result of the text to be corrected; the second output represents the error correction result of the text to be corrected; the parameters of the text error correction model are obtained based on a first prediction result and a second prediction result corresponding to training data, and the first prediction result represents an error recognition result of the training data; the second prediction result characterizes an error correction result of the training data.
After the error correction device obtains the text to be corrected, the text to be corrected is input into the text error correction model and processed by the model to obtain the first output and the second output. The first output is a set of probabilities, each probability representing whether the corresponding character is a wrong character, and the second output is the text formed by the corrected characters corresponding to the characters, namely the corrected correct text.
The text error correction model in the error correction device can be trained by the error correction device itself, or can be trained by a device other than the error correction device, which then sends the trained text error correction model to the error correction device.
In the embodiment of the application, the text error correction model outputs two results at the same time, the first output and the second output, so that the task of correcting the text to be corrected is divided into two subtasks: the error recognition task, whose output is the first output, and the error correction task, whose output is the second output.
The training of the text error correction model can be completed based on the training data and the label set corresponding to the training data; the label set is composed of the labels corresponding to all characters in the training data, and each label represents whether the corresponding character is a wrongly written character.
In the training process of the text error correction model, the error correction device inputs the training data into the text error correction model to obtain the first prediction result and the second prediction result, calculates the current loss of the text error correction model, namely the first loss, based on the first prediction result and the second prediction result, and adjusts the parameters of the text error correction model based on the calculated first loss. In this way, the parameters of the text error correction model are constrained by both the first prediction result and the second prediction result, the first output and the second output of the trained model influence each other, and the error correction accuracy of the text error correction model is improved.
The text error correction method provided by the embodiment of the application obtains a text to be corrected and inputs it into a text error correction model to obtain the first output and the second output of the model, where the first output represents the error recognition result and the second output represents the error correction result of the text to be corrected, and the parameters of the model are obtained based on the first prediction result (the error recognition result of the training data) and the second prediction result (the error correction result of the training data). The text error correction task is thus divided into the two tasks of wrong-character recognition and wrong-character correction, the text error correction model is trained through the execution results of both tasks, the training of the model is supervised by the results of both tasks, the two tasks influence each other, and the accuracy of the error correction result is improved.
In some embodiments, as shown in FIG. 4, the text correction model 400 includes: a feature extraction module 401, an identification module 402 and an error correction module 403.
In the case of the input being a text to be corrected, the feature extraction module 401 is configured to extract a feature sequence of the input text to be corrected, the recognition module 402 is configured to recognize whether an erroneous word is included in the text to be corrected based on the feature sequence, thereby outputting a first output, and the error correction module 403 is configured to recognize a corrected word corresponding to the erroneous word of the text to be corrected based on the first output, thereby outputting a second output.
In the case that the input is training data, the feature extraction module 401 is configured to extract a feature sequence of the input training data, the recognition module 402 is configured to recognize whether a wrongly-written word is included in the training data based on the feature sequence, thereby outputting a first prediction result, and the error correction module 403 is configured to recognize a corrected word corresponding to the wrongly-written word of the training data based on the feature sequence, thereby outputting a second prediction result.
In one example, the structure of the feature extraction module 401 is BERT, CNN, or the like, which can extract features of an input text. The text input during actual prediction is referred to as the text to be corrected, and the text input during training is referred to as training data.
In one example, the structure of the identifying module 402 is a structure capable of performing classification judgment, such as CRF. The input to the recognition module 402 is the output of the feature extraction module 401.
In one example, the structure of the error correction module 403 is a structure capable of performing multi-classification determination, such as full connection. The input to the error correction module 403 is the output of the recognition module 402 at the time of the actual prediction. During training, the input to the error correction module 403 is the output of the feature extraction module 401.
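In one example, the three-module structure can be summarized in code. The following is a minimal sketch, assuming a Hugging Face BertModel as the feature extraction module 401, a per-token binary classifier standing in for the CRF-based recognition module 402, and a fully connected layer over the generated dictionary as the error correction module 403; the class name, the dictionary size, and the wiring that feeds the features to both heads (the training-time configuration described above) are illustrative assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel  # assumed encoder; any feature extractor in the same role works

class TextErrorCorrectionModel(nn.Module):
    """Sketch of the model in fig. 4: feature extraction + recognition + error correction."""

    def __init__(self, bert_name: str = "bert-base-chinese", dict_size: int = 30000):
        super().__init__()
        # Feature extraction module 401 (BERT-like encoder).
        self.encoder = BertModel.from_pretrained(bert_name)
        hidden = self.encoder.config.hidden_size
        # Recognition module 402: per-character wrong/right decision.
        # (The patent uses a CRF; a per-token sigmoid classifier stands in here.)
        self.recognizer = nn.Linear(hidden, 1)
        # Error correction module 403: multi-class choice over the generated dictionary.
        self.corrector = nn.Linear(hidden, dict_size)

    def forward(self, input_ids, attention_mask):
        features = self.encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state
        first_output = torch.sigmoid(self.recognizer(features)).squeeze(-1)  # P(char is wrong)
        second_output = self.corrector(features)  # logits over the generated dictionary
        return first_output, second_output
```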
Based on the text correction model shown in fig. 4, as shown in fig. 5, the implementation of S302 includes:
s3021, inputting the text to be corrected into the feature extraction module to obtain a feature sequence output by the feature extraction module;
s3022, inputting the characteristic sequence into the recognition module to obtain the first output by the recognition module;
s3023, inputting the first output into the error correction module to obtain the second output from the error correction module.
Here, the text to be corrected is input into the feature extraction module 401 to obtain a feature sequence output by the feature extraction module 401, the feature sequence is input into the recognition module 402 to obtain a first output by the recognition module 402, and the first output is input into the error correction module 403 to obtain a second output by the error correction module 403.
In some embodiments, as shown in fig. 6, the feature extraction module 401 includes: a feature vector layer 4011 and at least one translation layer 4012; here, the feature vector layer 4011 can convert each character in the inputted text into a vector, and the conversion layer 4012 can encode the vector outputted from the feature vector layer 4011.
In an example, the vectors output by the feature vector layer 4011 may include: word vectors characterizing the characters, position vectors characterizing the positions of the characters, and segment vectors used to distinguish sentences.
Based on the feature extraction module shown in fig. 6, as shown in fig. 7, the implementation of S3021 includes:
s701, inputting the text to be corrected into the feature vector layer to obtain a text vector sequence output by the feature vector layer;
s702, inputting the text vector sequence into the at least one conversion layer to obtain a coding sequence output by each conversion layer in the at least one conversion layer;
and S703, carrying out weighted summation on the corresponding coding sequence according to the weight corresponding to each conversion layer to obtain the characteristic sequence.
In one example, the text to be corrected is x = ([CLS], x_1, ..., x_N). A representation sequence E = (E_{[CLS]}, E_1, ..., E_N), formed by the word vectors of the input words in the input text sequence, is extracted through the feature vector (Embedding) layer; the representation sequence is then encoded by the conversion (Transformer) layers, and the state sequence output by each conversion layer is obtained, the state sequence output by the i-th conversion layer being T_i = (C_i, T_i^1, ..., T_i^N). Finally, the T_i output by the conversion layers are weighted to obtain the final feature H_j corresponding to each input word, forming the feature sequence H = (H_{[CLS]}, H_1, ..., H_N) of the input text sequence. For example, for the word x_1 in the input sequence, its feature H_1 is:

$$H_1 = \sum_{i=1}^{L} \lambda_i T_i^1 \qquad \text{formula (1)}$$

where $\lambda_i$ is the weight corresponding to the i-th conversion layer and $L$ is the number of conversion layers.
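The weighted combination of formula (1) can be sketched as a small module that learns one weight per conversion layer, assuming an encoder that exposes the state sequence of every layer (as a Hugging Face BERT does when created with output_hidden_states=True); normalizing the weights $\lambda_i$ with a softmax is an assumption of the sketch.

```python
import torch
import torch.nn as nn

class LayerWeightedFeatures(nn.Module):
    """Sketch of formula (1): H_j = sum_i lambda_i * T_i^j over the conversion layers."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable weight lambda_i per conversion (Transformer) layer.
        self.lam = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: tuple of per-layer state sequences T_i,
        # each of shape (batch, sequence_length, hidden_size).
        stacked = torch.stack(hidden_states, dim=0)   # (layers, batch, seq, hidden)
        weights = torch.softmax(self.lam, dim=0)      # normalized lambda_i (an assumption)
        return (weights[:, None, None, None] * stacked).sum(dim=0)  # feature sequence H
```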
In some embodiments, a method of training a text correction model used in a text correction method includes:
inputting the training data into the text error correction model to obtain the first prediction result and the second prediction result output by the text error correction model;
determining a first loss of the text error correction model based on the first prediction result, the second prediction result and a label set corresponding to the training data; the labels in the label set represent whether the corresponding characters in the training data are wrongly written characters;
and when the first loss does not meet the training stopping condition, adjusting the parameters of the text error correction model according to the first loss, and continuously inputting the training data into the text error correction model to obtain a new first loss until the first loss of the text error correction model meets the training stopping condition.
As shown in fig. 8, the training process includes:
s801, inputting training data into a text error correction model, and calculating a first loss according to the output of the text error correction model;
inputting the training data into the text error correction model to obtain a first prediction result and a second prediction result, and obtaining a first loss based on the first prediction result and the second prediction result.
S802, judging whether the first loss meets the training stopping condition;
if yes, S803 is executed; if not, S804 is executed.
s803, finishing the training;
and S804, adjusting parameters of the text error correction model according to the current first loss.
After the parameters of the text error correction model are adjusted, S801 is executed again, so that the training data is input into the parameter-adjusted text error correction model.
Here, the text error correction model is iteratively updated multiple times through the training data and the corresponding label set. In each iterative update, the parameters of the text error correction model are adjusted so that the first loss corresponding to the output obtained by inputting the training data into the parameter-adjusted model approaches the stop-training condition; when the first loss meets the stop-training condition, the text error correction model has converged, and the adjustment of its parameters stops.
In the embodiment of the present application, the condition for stopping training is a condition for determining whether the text error correction model converges. In one example, the stop training condition is: the first loss is less than a set loss threshold. In one example, the stop training condition is: the difference between the current first loss and the first loss of the last iteration is less than the set loss difference threshold. The training stopping condition can be set according to actual requirements, and the content of the training stopping condition is not limited in the application.
In practical application, the feature extraction module in the text error correction model can be pre-trained on pre-training data without corresponding labels. After the pre-training of the feature extraction module is completed, the feature extraction module is connected with the recognition module and the error correction module to obtain an initial text error correction model, and the initial text error correction model is fine-tuned based on the training data and the corresponding label set, the fine-tuning being supervised training.
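Steps S801 to S804 amount to a loop that recomputes the first loss after every parameter adjustment and stops once the stop-training condition holds. The following is a minimal sketch, assuming the model sketched above, a compute_first_loss function such as the loss sketches further below, and the loss-threshold variant of the stop-training condition; all names and the optimizer choice are assumptions.

```python
import torch

def train(model, data_loader, compute_first_loss,
          lr=2e-5, loss_threshold=0.01, max_epochs=10):
    """Sketch of S801-S804: iterate until the first loss meets the stop-training condition."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        for input_ids, attention_mask, labels, targets in data_loader:
            # S801: the forward pass yields the first and second prediction results.
            first_pred, second_pred = model(input_ids, attention_mask)
            first_loss = compute_first_loss(first_pred, second_pred, labels, targets)
            # S802/S803: stop once the first loss satisfies the stop-training condition.
            if first_loss.item() < loss_threshold:
                return model
            # S804: adjust the parameters of the text error correction model.
            optimizer.zero_grad()
            first_loss.backward()
            optimizer.step()
    return model
```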
In some embodiments, the structure of the text correction model is as shown in fig. 4, where the inputting the training data into the text correction model to obtain the first prediction result and the second prediction result output by the text correction model includes:
inputting the training data into the feature extraction module to obtain a training feature sequence output by the feature extraction module;
and respectively inputting the training characteristic sequence into the recognition module and the error correction module to obtain the first prediction result output by the recognition module and the second prediction result output by the error correction module.
Here, as shown in fig. 9, training data 901 is input to the feature extraction module 401 to obtain a training feature sequence 902 output by the feature extraction module 401, the training feature sequence 902 is input to the recognition module 402 and the error correction module 403 respectively to obtain a first prediction result 903 output by the recognition module 402 and a second prediction result 904 output by the error correction module 403.
In some embodiments, when the structure of the feature extraction module is as shown in fig. 6, inputting the training data into the feature extraction module to obtain the training feature sequence output by the feature extraction module includes: inputting the training data into the feature vector layer to obtain a training text vector sequence output by the feature vector layer; inputting the training text vector sequence into the at least one conversion layer to obtain a training coding sequence output by each conversion layer in the at least one conversion layer; and carrying out weighted summation of the corresponding training coding sequences according to the weight corresponding to each conversion layer to obtain the training feature sequence.
In an example, the training text vector sequence output by the feature vector layer 4011 may include: word vectors characterizing the characters, position vectors characterizing the positions of the characters, and segment vectors used to distinguish sentences.
In some embodiments, determining the first loss based on the first prediction result and the second prediction result includes one of the following:
in a first manner, obtaining the first loss through a second loss and a third loss;
in a second manner, obtaining the first loss through the second loss, the third loss and a fourth loss;
wherein the second loss is associated with the first prediction result, the third loss is associated with the second prediction result, and the fourth loss is associated with both the first prediction result and the second prediction result.
Determining the first loss of the text error correction model based on the first prediction result and the second prediction result includes: calculating the second loss of the text error correction model based on the first prediction result and the first label corresponding to each character in the training data; calculating the third loss of the text error correction model based on the probability distribution of the second prediction result on a generated dictionary; and determining the first loss according to the second loss and the third loss.
In the first manner, determining the first loss according to the second loss and the third loss includes: directly calculating the first loss through the second loss and the third loss.
In one example, the first loss is calculated by the loss function shown in formula (2):

$$loss = \mu \, loss_g + (1 - \mu) \, loss_p \qquad \text{formula (2)}$$

at this time, the calculated $loss$ is the first loss. Here, $loss_p$ is the second loss and $loss_g$ is the third loss; the function corresponding to the second loss is shown in formula (3), and the function corresponding to the third loss is shown in formula (4):

$$loss_p = -\sum_{j=1}^{N} \left[ y_j \log \hat{y}_j + (1 - y_j) \log \left( 1 - \hat{y}_j \right) \right] \qquad \text{formula (3)}$$

$$loss_g = -\sum_{j=1}^{N} \log P_{vocab}(w_j) \qquad \text{formula (4)}$$

wherein $y_j$ denotes the expected output label of $x_j$; when $x_j$ is a wrongly written character, $y_j = 1$. $\hat{y}_j$ denotes the label actually output by the model for $x_j$, $P_{vocab}(w_j)$ is the probability of the expected word $w_j$ under the probability distribution on the generated dictionary at the predicted position $j$, and $\mu$ is a hyper-parameter of the model, i.e. a preset parameter.
The generated dictionary is the candidate set of output words of the text error correction model. Each output word in the generated dictionary is assigned a probability value, and the words with higher probability values are preferentially output by the text error correction model. For example, the generated dictionary includes 30,000 characters, each assigned a probability value, and these values together form a probability distribution.
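Formulas (2) to (4) translate directly into code. The following is a minimal sketch, assuming first_pred holds the per-character probabilities $\hat{y}_j$, labels the expected labels $y_j$, second_pred the logits over the generated dictionary, and targets the indices of the expected output words $w_j$; these tensor names and the averaging over positions are assumptions of the sketch, not part of the patent text.

```python
import torch.nn.functional as F

def first_loss_v1(first_pred, second_pred, labels, targets, mu=0.5):
    """Sketch of formulas (2)-(4): loss = mu * loss_g + (1 - mu) * loss_p."""
    # Formula (3): binary cross-entropy over the wrong-character labels y_j.
    loss_p = F.binary_cross_entropy(first_pred, labels.float())
    # Formula (4): negative log-likelihood of the expected words w_j under
    # the probability distribution P_vocab on the generated dictionary.
    log_p_vocab = F.log_softmax(second_pred, dim=-1)           # (batch, seq, dict)
    loss_g = F.nll_loss(log_p_vocab.transpose(1, 2), targets)  # targets: (batch, seq)
    # Formula (2): weighted combination with the hyper-parameter mu.
    return mu * loss_g + (1 - mu) * loss_p
```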
In the second manner, determining the first loss includes: determining the first loss according to the second loss, the third loss and the fourth loss.
In one example, the first loss is calculated by the loss function shown in formula (6):

$$LOSS = loss_{kl} + loss \qquad \text{formula (6)}$$

wherein $loss_{kl}$ is the fourth loss and $loss$ is the result of formula (2); at this time, the calculated $LOSS$ is the first loss.

In one example, the fourth loss may be calculated by the function shown in formula (5), a Kullback-Leibler divergence that aligns the error recognition probability $\hat{y}_j$ with the probability $1 - P_{vocab}(x_j)$ that the generated dictionary assigns to replacing the input word:

$$loss_{kl} = \sum_{j=1}^{N} KL\!\left( \hat{y}_j \,\middle\|\, 1 - P_{vocab}(x_j) \right) \qquad \text{formula (5)}$$
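Reading formula (5) as a KL divergence between two Bernoulli distributions at each position, the fourth loss could be sketched as follows, with the tensor names carried over from the previous sketch and input_ids holding the indices of the input words $x_j$; this pairing of distributions is an assumption of the sketch.

```python
import torch.nn.functional as F

def fourth_loss(first_pred, second_pred, input_ids, eps=1e-8):
    """Sketch of formula (5): align the recognition probabilities with the correction head."""
    p_vocab = F.softmax(second_pred, dim=-1)                          # P_vocab at each position j
    p_keep = p_vocab.gather(-1, input_ids.unsqueeze(-1)).squeeze(-1)  # P_vocab(x_j)
    p = (1.0 - p_keep).clamp(eps, 1 - eps)  # correction head's implied P(x_j is wrong)
    q = first_pred.clamp(eps, 1 - eps)      # recognition head's P(x_j is wrong)
    # KL divergence between the two Bernoulli distributions at each position j.
    kl = p * (p / q).log() + (1 - p) * ((1 - p) / (1 - q)).log()
    return kl.sum(dim=-1).mean()
```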
in some embodiments, a method of constructing training data for training a text correction model includes:
determining texts of a set construction proportion in a training corpus as replaced texts; replacing the replaced texts in the training corpus with training texts to obtain the training data; setting the labels corresponding to the characters of the training texts in the training data to a first value, where the first value represents that the corresponding character is a wrongly written character; and setting the labels corresponding to the characters of the texts other than the training texts in the training data to a second value, where the second value represents that the corresponding character is not a wrongly written character.
In the embodiment of the application, the training data and the labels corresponding to the characters in the training data are constructed, so that the error correction model is trained in a supervised manner through the training data and the label set formed by the labels. When the value of a label is the first value, the corresponding character is a wrongly written character; when the value of a label is the second value, the corresponding character is not a wrongly written character. In one example, the first value is 1 and the second value is 0.
Here, the construction proportion is a proportion smaller than 1, such as 10% or 30%. According to the construction proportion, part of the data in the training corpus is replaced with erroneous text to obtain the training data.
In one example, the replaced texts are corpora randomly selected according to the construction proportion. In another example, the replaced texts are selected at fixed intervals in the corpus, such that the proportion of all selected replaced texts in the training corpus equals the construction proportion. The embodiment of the application does not limit the manner of selecting the replaced texts.
In an example, the feature extraction module may be subjected to unsupervised pre-training, the feature extraction module subjected to the pre-training is connected with the recognition module and the error correction module to obtain an untrained text error correction model, that is, an initial text error correction model, and the initial text error correction model is trained through training data and a label set corresponding to the training data to obtain a converged text error correction model.
In some embodiments, the replacing the replaced text in the corpus with a training text to obtain training data includes: acquiring a replacement proportion corresponding to each replacement mode in at least one replacement mode; and replacing the replaced text in the training corpus with the training text according to the corresponding replacement proportion through each replacement mode in the at least one replacement mode.
When replacing the replaced texts in the training corpus, the replacement modes may include one or more of the following: random replacement, homophone replacement, homomorph replacement, and the like. Homophones are different characters with the same pronunciation, such as the character pairs glossed 'constant'/'horizontal' and 'clear'/'light'; homomorphic characters are characters with similar shapes, such as the pairs glossed 'visit'/'spin' and 'please'/'clear'.
When only one replacement mode is used, the replacement proportion of that mode is less than or equal to 1. When multiple replacement modes are used, the sum of the replacement proportions corresponding to the modes is less than or equal to 1. For example, the replacement modes for the replaced texts in the training corpus include random replacement and homophone replacement, with replacement proportions of 60% and 40% respectively. For another example, the replacement modes include random replacement, homophone replacement and homomorph replacement, with replacement proportions of 20%, 40% and 40% respectively.
In one example, a batch of large-scale Chinese text is used as the training corpus, and 10% of the characters in the text are randomly selected as the replaced text and marked as wrongly written characters through the corresponding labels; among the replaced characters, 20% are replaced randomly, 40% are replaced with homophones through a homophone dictionary, and 40% are replaced with homomorphs through a homomorph dictionary.
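This construction procedure, replacing 10% of the characters and splitting the replacements 20%/40%/40% among random, homophone and homomorph substitution, can be sketched as follows; the homophones and homographs dictionaries and the vocab list are hypothetical inputs, not structures defined in the patent.

```python
import random

def build_training_data(text, homophones, homographs, vocab,
                        construct_ratio=0.10, mode_ratios=(0.2, 0.4, 0.4)):
    """Sketch: corrupt a corpus into a (training text, label set) pair.

    Labels: 1 (first value) marks a wrongly written character,
    0 (second value) marks a correct one. homophones / homographs map a
    character to same-sound / similar-shape candidates (assumed inputs).
    """
    chars, labels = list(text), [0] * len(text)
    for j, c in enumerate(chars):
        if random.random() >= construct_ratio:
            continue  # keep roughly 90% of the characters unchanged
        r = random.random()
        if r < mode_ratios[0]:                      # 20%: random replacement
            chars[j] = random.choice(vocab)
        elif r < mode_ratios[0] + mode_ratios[1]:   # 40%: homophone replacement
            chars[j] = random.choice(homophones.get(c, [c]))
        else:                                       # 40%: homomorph replacement
            chars[j] = random.choice(homographs.get(c, [c]))
        labels[j] = 1 if chars[j] != c else 0       # first value marks a wrong char
    return "".join(chars), labels
```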
In the following, the text error correction method provided in the embodiment of the present application is further described by taking a scenario in which the detailed description text information of a product serves as the input and a summary of the product is output as an example.
The text error correction model used in the embodiment of the application may be a BERT + CRF text error correction model. BERT (Bidirectional Encoder Representations from Transformers) is a language representation model that uses a bidirectional Transformer encoder to represent semantics. Since the structure of BERT is easy to train in parallel, the BERT model learns knowledge through pre-training on large-scale unsupervised data and is then fine-tuned on the text error correction task.
In the embodiment of the application, the text error correction task is divided into the wrongly-written character recognition task and the wrongly-written character correction task, and the two tasks can be trained in a multi-task learning mode so as to simplify the task.
The text error correction method provided by the embodiment of the application comprises the following three aspects in the modeling process:
In the first aspect, the actual occurrence characteristics of wrongly written characters are fitted through a method for constructing large-scale training data for the text error correction algorithm.
The pre-trained BERT model is connected with the CRF to obtain an initial text error correction model, and the initial text error correction model is then trained on the training data to obtain the text error correction model. The construction process of the training data is as follows: for a batch of large-scale Chinese texts, 10% of the characters in the texts are randomly replaced and marked as wrongly written characters, where 20% of the replaced characters are replaced randomly, 40% are replaced with homophones through a homophone dictionary, and 40% are replaced with homomorphs through a homomorph dictionary.
In the second aspect, text error correction is performed by comprehensively using the features output by every layer of BERT.
In the embodiment of the application, the text error correction task is divided into two subtasks: the wrong-character recognition task and the error correction task. The wrong-character recognition task judges whether the character at a certain position is correct; it is a binary classification task and is relatively simple. The error correction task identifies the correct word corresponding to a wrongly written character and must classify over the full set of Chinese characters, so it is more difficult than the binary classification task. If the wrong-character correction task were performed directly, i.e., each character in the text were classified over the full set of Chinese characters, the overall correction accuracy would be low. The method of first finding wrongly written characters and then correcting them conforms to the human habit of error correction, can greatly reduce the complexity of text error correction, and improves the accuracy of the text error correction model.
The structure of the text error correction model is shown in fig. 10. The input text sequence is x = ([CLS], x_1, ..., x_N), where (x_1, ..., x_N) is, for example, "we … do". A representation sequence E = (E_{[CLS]}, E_1, ..., E_N), formed by the word vectors of the input words, is extracted through the feature vector (Embedding) layer, i.e. the feature extraction layer; the representation sequence is then encoded by the conversion (Transformer) layers, and the state sequence output by each conversion layer is obtained, the state sequence output by the i-th conversion layer being T_i = (C_i, T_i^1, ..., T_i^N). Finally, the T_i output by the conversion layers are weighted to obtain the final feature H_j corresponding to each input word, forming the feature sequence H = (H_{[CLS]}, H_1, ..., H_N) of the input text sequence. For example, for the word x_1 in the input sequence, its feature H_1 is:

$$H_1 = \sum_{i=1}^{L} \lambda_i T_i^1 \qquad \text{formula (1)}$$

where $\lambda_i$ is the weight corresponding to the i-th conversion layer and $L$ is the number of conversion layers.
The feature sequence H = (H_{[CLS]}, H_1, ..., H_N) is input into the CRF layer 1002 and the fully connected layer 1003 respectively, obtaining the wrong-character recognition labels output by the CRF layer 1002, which represent whether each character is correct or wrong, and the wrong-character correction result, i.e. the correct text, output by the fully connected layer 1003.
The loss function of the text error correction model is:

$$loss = \mu \, loss_g + (1 - \mu) \, loss_p \qquad \text{formula (2)}$$

where $loss_p$ is the loss function that determines the correctness of the word at each position, as shown in formula (3), and $loss_g$ is the prediction loss function of the word at each position, as shown in formula (4):

$$loss_p = -\sum_{j=1}^{N} \left[ y_j \log \hat{y}_j + (1 - y_j) \log \left( 1 - \hat{y}_j \right) \right] \qquad \text{formula (3)}$$

$$loss_g = -\sum_{j=1}^{N} \log P_{vocab}(w_j) \qquad \text{formula (4)}$$

wherein $y_j$ denotes the expected output label of $x_j$; when $x_j$ is a wrongly written character, $y_j = 1$. $\hat{y}_j$ denotes the label actually output by the model for $x_j$, $P_{vocab}(w_j)$ is the probability of the expected word $w_j$ under the probability distribution on the generated dictionary at the predicted position $j$, and $\mu$ is a hyper-parameter of the model.
In a third aspect, a correlation is established between the two tasks of wrongly written word recognition and wrongly written word error correction.
If the CRF in the error correction model judges that $x_j$ is a wrongly written character, the text error correction model should predict a new word to complete the correction of $x_j$; conversely, if the CRF determines that $x_j$ is not wrong, the maximum-probability word predicted by the text error correction model should still be $x_j$. Based on this, a correlation loss function is defined as shown in formula (5):

$$loss_{kl} = \sum_{j=1}^{N} KL\!\left( \hat{y}_j \,\middle\|\, 1 - P_{vocab}(x_j) \right) \qquad \text{formula (5)}$$

At this time, the final loss of the text error correction model is as shown in formula (6):

$$LOSS = loss_{kl} + loss \qquad \text{formula (6)}$$
The text error correction method provided by the embodiment of the application addresses the following problems in the related art:
1. Text error correction algorithms based on supervised training need large-scale training data, and such data is difficult to acquire.
2. Conventional BERT + CRF based models use only the BERT top-layer feature vectors as the CRF input for predicting wrong-character labels, and use the same top-layer feature vectors to correct wrong characters. In fact, the feature vectors of each BERT layer carry different semantics, and using only the top layer results in a loss of information.
3. Judging wrongly written characters and correcting wrongly written characters are highly related tasks, yet conventional models provide insufficient interaction when modeling the two tasks and do not deeply mine their relevance.
Fig. 11 is a schematic structural diagram of a text error correction apparatus according to an embodiment of the present application; as shown in fig. 11, the apparatus 1100 includes:
a receiving unit 1101, configured to obtain a text to be corrected;
the error correction unit 1102 is configured to input the text to be error corrected into a text error correction model, so as to obtain a first output and a second output by the text error correction model; the first output represents an error recognition result of the text to be corrected; the second output represents the error correction result of the text to be corrected; the parameters of the text error correction model are obtained based on a first prediction result and a second prediction result corresponding to training data, and the first prediction result represents an error recognition result of the training data; the second prediction result characterizes an error correction result of the training data.
In some embodiments, the text error correction model comprises: a feature extraction module, a recognition module and an error correction module, and the error correction unit 1102 comprises:
the extraction unit is used for inputting the text to be corrected into the feature extraction module to obtain a feature sequence output by the feature extraction module;
the recognition unit is used for inputting the feature sequence into the recognition module to obtain the first output by the recognition module;
and the correction unit is used for inputting the first output into the error correction module to obtain the second output by the error correction module.
In some embodiments, the feature extraction module comprises: a feature vector layer and at least one translation layer; the extraction unit is configured to:
inputting the text to be corrected into the feature vector layer to obtain a text vector sequence output by the feature vector layer;
inputting the text vector sequence into the at least one conversion layer to obtain a coded sequence output by each conversion layer in the at least one conversion layer;
and carrying out weighted summation on the corresponding coding sequence according to the weight corresponding to each conversion layer to obtain the characteristic sequence.
In some embodiments, the apparatus 1100 further comprises: a training unit to:
inputting the training data into the text error correction model to obtain the first prediction result and the second prediction result output by the text error correction model;
determining a first loss of the text error correction model based on the first prediction result, the second prediction result and a label set corresponding to the training data; the labels in the label set represent whether the corresponding characters in the training data are wrongly written characters;
and when the first loss does not meet the training stopping condition, adjusting the parameters of the text error correction model according to the first loss, and inputting the training data into the text error correction model after the parameters are adjusted to obtain a new first loss until the first loss of the text error correction model meets the training stopping condition.
In some embodiments, the text correction model comprises: the device comprises a feature extraction module, an identification module and an error correction module; a training unit further to:
inputting the training data into the feature extraction module to obtain a training feature sequence output by the feature extraction module;
and respectively inputting the training characteristic sequence into the recognition module and the error correction module to obtain the first prediction result output by the recognition module and the second prediction result output by the error correction module.
In some embodiments, the training unit is further configured to:
calculating a second loss of the text error correction model based on the first prediction result and a first label corresponding to each character in the training data;
calculating a third loss of the text correction model based on a probability distribution of the second prediction result on a generated dictionary;
determining the first loss according to the second loss and the third loss.
In some embodiments, the training unit is further configured to:
calculating a fourth loss of the text correction model based on a probability distribution of the first prediction result and the second prediction result on a generated dictionary;
correspondingly, determining the first loss according to the second loss and the third loss includes:
determining the first loss according to the second loss, the third loss and the fourth loss.
In some embodiments, the apparatus 1100 further comprises: a construction unit for:
determining a text with a set construction proportion in a training corpus as a replaced text;
replacing the replaced text in the training corpus with a training text to obtain training data;
setting labels corresponding to characters in the training text in the training data as a first value; the first value represents that the corresponding character is a wrongly written character;
setting labels corresponding to characters in texts except the training text in the training data as second values; the second value characterizes that the corresponding character is not a wrongly written character.
In some embodiments, the construction unit is further configured to:
acquiring a replacement proportion corresponding to each replacement mode in at least one replacement mode;
and replacing the replaced text in the training corpus with the training text according to the corresponding replacement proportion through each replacement mode in the at least one replacement mode.
It should be noted that the text error correction apparatus provided in the embodiment of the present application includes each included unit, and may be implemented by a processor in an electronic device; of course, the implementation can also be realized through a specific logic circuit; in the implementation process, the Processor may be a Central Processing Unit (CPU), a microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
The above description of the apparatus embodiments, similar to the above description of the method embodiments, has similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be noted that, in the embodiment of the present application, if the text error correction method is implemented in the form of a software functional module and is sold or used as a stand-alone product, the text error correction method may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the related art may be embodied in the form of a software product stored in a storage medium, and including several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, an embodiment of the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program that is executable on the processor, and the processor executes the computer program to implement the steps in the text error correction method provided in the foregoing embodiment. The electronic device can be a client or a server.
Accordingly, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; the computer program, when executed by a processor, implements the steps of the text error correction method provided in the foregoing embodiments.
Here, it should be noted that the above description of the storage medium and device embodiments is similar to the description of the method embodiments, and these embodiments have beneficial effects similar to those of the method embodiments. For technical details not disclosed in the storage medium and device embodiments of the present application, refer to the description of the method embodiments of the present application.
It should be noted that fig. 12 is a schematic diagram of a hardware entity of an electronic device according to an embodiment of the present application. As shown in fig. 12, the electronic device 1200 includes: a processor 1201, at least one communication bus 1202, at least one external communication interface 1204, and a memory 1205, where the communication bus 1202 is configured to enable connection and communication among these components. In an example, the electronic device 1200 further includes a user interface 1203, where the user interface 1203 may include a display screen, and the external communication interface 1204 may include standard wired and wireless interfaces.
The memory 1205 is configured to store instructions and applications executable by the processor 1201, and may also buffer data to be processed or already processed by the processor 1201 and by modules in the electronic device (e.g., image data, audio data, voice communication data, and video communication data); the memory 1205 may be implemented by a flash memory (FLASH) or a Random Access Memory (RAM).
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in some embodiments" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative. For example, the division of the units is only a logical functional division, and there may be other division manners in actual implementation, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the method embodiments may be completed by hardware related to program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments; and the aforementioned storage medium includes various media that can store program code, such as a removable storage device, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the related art may be embodied in the form of a software product stored in a storage medium, and including several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
The above description is only for the embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A method for correcting text, the method comprising:
acquiring a text to be corrected;
inputting the text to be corrected into a text error correction model to obtain a first output and a second output of the text error correction model; the first output represents an error recognition result of the text to be corrected; the second output represents an error correction result of the text to be corrected; parameters of the text error correction model are obtained based on a first prediction result and a second prediction result corresponding to training data, the first prediction result representing an error recognition result of the training data, and the second prediction result representing an error correction result of the training data.
2. The method of claim 1, wherein the text error correction model comprises a feature extraction module, an identification module and an error correction module, and wherein inputting the text to be corrected into the text error correction model to obtain the first output and the second output of the text error correction model comprises:
inputting the text to be corrected into the feature extraction module to obtain a feature sequence output by the feature extraction module;
inputting the feature sequence into the identification module to obtain the first output from the identification module;
and inputting the first output into the error correction module to obtain the second output from the error correction module.
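For illustration, a minimal PyTorch sketch of the module wiring described in claim 2 follows; the layer types and sizes are assumptions, and in particular the error correction module here consumes both the feature sequence and the first output, which is one plausible reading of the claim:

    import torch
    import torch.nn as nn

    class TextCorrector(nn.Module):
        """Hypothetical wiring: features -> identification (first
        output) -> error correction (second output)."""
        def __init__(self, vocab=8000, dim=256):
            super().__init__()
            self.features = nn.Embedding(vocab, dim)  # stand-in feature extraction module
            self.identify = nn.Linear(dim, 2)         # wrongly written / not
            self.correct = nn.Linear(dim + 2, vocab)  # distribution over characters

        def forward(self, token_ids):
            h = self.features(token_ids)               # feature sequence
            first_out = self.identify(h)               # error recognition result
            second_out = self.correct(torch.cat([h, first_out], dim=-1))
            return first_out, second_out

    # Usage: both outputs for a toy three-character input.
    first, second = TextCorrector()(torch.tensor([[3, 7, 42]]))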
3. The method of claim 2, wherein the feature extraction module comprises a feature vector layer and at least one conversion layer, and wherein inputting the text to be corrected into the feature extraction module to obtain the feature sequence output by the feature extraction module comprises:
inputting the text to be corrected into the feature vector layer to obtain a text vector sequence output by the feature vector layer;
inputting the text vector sequence into the at least one conversion layer to obtain an encoded sequence output by each conversion layer in the at least one conversion layer;
and performing weighted summation on the corresponding encoded sequences according to the weight corresponding to each conversion layer to obtain the feature sequence.
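A sketch of claim 3 under common assumptions: the conversion layers are taken to be Transformer encoder layers and the per-layer weights to be learned softmax-normalized scalars, neither of which is fixed by the claim:

    import torch
    import torch.nn as nn

    class WeightedLayerFeatures(nn.Module):
        """Sketch: a feature vector layer followed by a stack of
        conversion layers whose encoded sequences are weight-summed
        into the feature sequence."""
        def __init__(self, vocab=8000, dim=256, n_layers=4):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)  # feature vector layer
            self.layers = nn.ModuleList(
                [nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
                 for _ in range(n_layers)])        # conversion layers
            self.layer_weights = nn.Parameter(torch.ones(n_layers))

        def forward(self, token_ids):
            x = self.embed(token_ids)              # text vector sequence
            encoded = []
            for layer in self.layers:
                x = layer(x)
                encoded.append(x)                  # encoded sequence per layer
            w = torch.softmax(self.layer_weights, dim=0)
            stacked = torch.stack(encoded)         # (layers, batch, seq, dim)
            return (w[:, None, None, None] * stacked).sum(dim=0)  # feature sequence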
4. The method of claim 1, further comprising:
inputting the training data into the text error correction model to obtain the first prediction result and the second prediction result output by the text error correction model;
determining a first loss of the text error correction model based on the first prediction result, the second prediction result and a label set corresponding to the training data; the labels in the label set represent whether corresponding characters in the training data are wrongly written characters;
and when the first loss does not meet a training stopping condition, adjusting the parameters of the text error correction model according to the first loss, and inputting the training data into the text error correction model with the adjusted parameters to obtain a new first loss, until the first loss of the text error correction model meets the training stopping condition.
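A minimal sketch of the training loop in claim 4; the optimizer, learning rate and the loss-threshold form of the training stopping condition are assumptions, since the claim does not fix a particular condition:

    import torch

    def train(model, batch, labels, targets, loss_fn, threshold=0.05):
        """Iterate: predict, compute the first loss, and adjust the
        model parameters until the loss meets the stopping condition."""
        optim = torch.optim.Adam(model.parameters(), lr=1e-4)
        while True:
            first_pred, second_pred = model(batch)
            loss = loss_fn(first_pred, second_pred, labels, targets)
            if loss.item() < threshold:  # assumed training stopping condition
                return model
            optim.zero_grad()
            loss.backward()
            optim.step()                 # adjust parameters by the first loss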
5. The method of claim 4, wherein the text error correction model comprises a feature extraction module, an identification module and an error correction module, and wherein inputting the training data into the text error correction model to obtain the first prediction result and the second prediction result output by the text error correction model comprises:
inputting the training data into the feature extraction module to obtain a training feature sequence output by the feature extraction module;
and respectively inputting the training characteristic sequence into the recognition module and the error correction module to obtain the first prediction result output by the recognition module and the second prediction result output by the error correction module.
6. The method of claim 4, wherein determining the first loss of the text error correction model based on the first prediction result and the second prediction result comprises:
calculating a second loss of the text error correction model based on the first prediction result and the label corresponding to each character in the training data;
calculating a third loss of the text error correction model based on a probability distribution of the second prediction result on a generated dictionary;
determining the first loss according to the second loss and the third loss.
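One common concrete reading of the two loss terms in claim 6, given as a hedged sketch: the second loss as a per-character cross-entropy against the wrong/not-wrong labels, and the third loss as a cross-entropy of the generated-dictionary distribution against the correct characters. The tensor shapes and loss choices below are assumptions:

    import torch.nn.functional as F

    def second_loss(first_pred, char_labels):
        # first_pred: (batch, seq, 2) logits; char_labels: (batch, seq) in {0, 1}
        return F.cross_entropy(first_pred.transpose(1, 2), char_labels)

    def third_loss(second_pred, target_ids):
        # second_pred: (batch, seq, vocab) logits over the generated dictionary
        return F.cross_entropy(second_pred.transpose(1, 2), target_ids)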
7. The method of claim 6, wherein determining the first loss of the text error correction model based on the first prediction result and the second prediction result further comprises:
calculating a fourth loss of the text error correction model based on a probability distribution of the first prediction result and the second prediction result on a generated dictionary;
correspondingly, determining the first loss according to the second loss and the third loss includes:
determining the first loss according to the second loss, the third loss and the fourth loss.
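Claim 7 leaves the exact form of the fourth loss open; one plausible reading, sketched here purely as an assumption, is a consistency term that weights the per-position correction loss by the predicted probability that the character is wrong:

    import torch.nn.functional as F

    def fourth_loss(first_pred, second_pred, target_ids):
        """Assumed consistency term tying the two predictions together."""
        p_wrong = first_pred.softmax(dim=-1)[..., 1]         # (batch, seq)
        nll = F.cross_entropy(second_pred.transpose(1, 2),
                              target_ids, reduction="none")  # (batch, seq)
        return (p_wrong * nll).mean()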
8. The method of claim 4, further comprising:
determining, as a replaced text, text accounting for a set construction proportion of a training corpus;
replacing the replaced text in the training corpus with a training text to obtain training data;
setting labels corresponding to characters in the training text in the training data as a first value; the first value represents that the corresponding character is a wrongly written character;
setting labels corresponding to characters in the training data other than those in the training text as a second value; the second value represents that the corresponding character is not a wrongly written character.
9. The method of claim 8, wherein replacing the replaced text in the training corpus with the training text to obtain the training data comprises:
acquiring a replacement proportion corresponding to each replacement mode in at least one replacement mode;
and replacing the replaced text in the training corpus with the training text according to the corresponding replacement proportion through each replacement mode in the at least one replacement mode.
10. A text correction apparatus, characterized in that the apparatus comprises:
a receiving unit, configured to acquire a text to be corrected;
and an error correction unit, configured to input the text to be corrected into a text error correction model to obtain a first output and a second output of the text error correction model; the first output represents an error recognition result of the text to be corrected; the second output represents an error correction result of the text to be corrected; parameters of the text error correction model are obtained based on a first prediction result and a second prediction result corresponding to training data, the first prediction result representing an error recognition result of the training data, and the second prediction result representing an error correction result of the training data.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the text correction method according to any one of claims 1 to 9 when executing the computer program.
12. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the text correction method according to any one of claims 1 to 9.
CN202011252798.0A 2020-11-11 2020-11-11 Text error correction method and device, equipment and storage medium Pending CN112329476A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011252798.0A CN112329476A (en) 2020-11-11 2020-11-11 Text error correction method and device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112329476A true CN112329476A (en) 2021-02-05

Family

ID=74317777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011252798.0A Pending CN112329476A (en) 2020-11-11 2020-11-11 Text error correction method and device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112329476A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105468468A (en) * 2015-12-02 2016-04-06 北京光年无限科技有限公司 Data error correction method and apparatus facing question answering system
WO2019085779A1 (en) * 2017-11-01 2019-05-09 阿里巴巴集团控股有限公司 Machine processing and text correction method and device, computing equipment and storage media
CN109992765A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 Text error correction method and device, storage medium and electronic equipment
CN110162767A (en) * 2018-02-12 2019-08-23 北京京东尚科信息技术有限公司 The method and apparatus of text error correction
CN111523306A (en) * 2019-01-17 2020-08-11 阿里巴巴集团控股有限公司 Text error correction method, device and system
CN110287479A (en) * 2019-05-20 2019-09-27 平安科技(深圳)有限公司 Name entity recognition method, electronic device and storage medium
CN111062376A (en) * 2019-12-18 2020-04-24 厦门商集网络科技有限责任公司 Text recognition method based on optical character recognition and error correction tight coupling processing

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010635A (en) * 2021-02-19 2021-06-22 网易(杭州)网络有限公司 Text error correction method and device
CN113536776A (en) * 2021-06-22 2021-10-22 深圳价值在线信息科技股份有限公司 Confusion statement generation method, terminal device and computer-readable storage medium
CN114510925A (en) * 2022-01-25 2022-05-17 森纵艾数(北京)科技有限公司 Chinese text error correction method, system, terminal equipment and storage medium
CN115169330A (en) * 2022-07-13 2022-10-11 平安科技(深圳)有限公司 Method, device, equipment and storage medium for correcting and verifying Chinese text
CN115169330B (en) * 2022-07-13 2023-05-02 平安科技(深圳)有限公司 Chinese text error correction and verification method, device, equipment and storage medium
CN115293139A (en) * 2022-08-03 2022-11-04 北京中科智加科技有限公司 Training method of voice transcription text error correction model and computer equipment
CN115293139B (en) * 2022-08-03 2023-06-09 北京中科智加科技有限公司 Training method of speech transcription text error correction model and computer equipment
CN116822498A (en) * 2023-08-30 2023-09-29 深圳前海环融联易信息科技服务有限公司 Text error correction processing method, model processing method, device, equipment and medium
CN116822498B (en) * 2023-08-30 2023-12-01 深圳前海环融联易信息科技服务有限公司 Text error correction processing method, model processing method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN112329476A (en) Text error correction method and device, equipment and storage medium
CN111226222B (en) Depth context-based grammar error correction using artificial neural networks
CN111460807B (en) Sequence labeling method, device, computer equipment and storage medium
CN110580292B (en) Text label generation method, device and computer readable storage medium
CN110134971B (en) Method and device for machine translation and computer readable storage medium
EP3979098A1 (en) Data processing method and apparatus, storage medium, and electronic apparatus
CN114676234A (en) Model training method and related equipment
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN111914825B (en) Character recognition method and device and electronic equipment
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
CN114255159A (en) Handwritten text image generation method and device, electronic equipment and storage medium
CN116152833B (en) Training method of form restoration model based on image and form restoration method
JP7005045B2 (en) Limit attack method against Naive Bayes classifier
CN110942774A (en) Man-machine interaction system, and dialogue method, medium and equipment thereof
CN115700515A (en) Text multi-label classification method and device
CN110298046B (en) Translation model training method, text translation method and related device
CN114444476A (en) Information processing method, apparatus and computer readable storage medium
CN108090044B (en) Contact information identification method and device
CN116629211B (en) Writing method and system based on artificial intelligence
CN112307749A (en) Text error detection method and device, computer equipment and storage medium
CN111666405B (en) Method and device for identifying text implication relationship
CN116432705A (en) Text generation model construction method, text generation device, equipment and medium
CN113761875B (en) Event extraction method and device, electronic equipment and storage medium
CN115796141A (en) Text data enhancement method and device, electronic equipment and storage medium
CN117371447A (en) Named entity recognition model training method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination