CN112380840B - Text error correction method, device, equipment and medium - Google Patents

Text error correction method, device, equipment and medium

Info

Publication number
CN112380840B
CN112380840B
Authority
CN
China
Prior art keywords
word
text
candidate
replacement
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011302530.3A
Other languages
Chinese (zh)
Other versions
CN112380840A (en)
Inventor
郑立颖
徐亮
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011302530.3A priority Critical patent/CN112380840B/en
Publication of CN112380840A publication Critical patent/CN112380840A/en
Priority to PCT/CN2021/084546 priority patent/WO2022105083A1/en
Application granted granted Critical
Publication of CN112380840B publication Critical patent/CN112380840B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/232 - Orthographic correction, e.g. spell checking or vowelisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/166 - Editing, e.g. inserting or deleting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/226 - Validation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of artificial intelligence, and discloses a text error correction method, apparatus, device and medium, wherein the method comprises the following steps: performing word segmentation and error word judgment on the text subjected to engine error correction according to the target dictionary to obtain first potential error word data; inputting the text subjected to engine error correction into a pre-training model for replacement probability prediction to obtain a replacement probability prediction result, and determining second potential error word data according to the replacement probability prediction result; determining candidate replacement sentences according to the text subjected to engine error correction, the first potential error word data and the second potential error word data to obtain a plurality of candidate replacement sentences to be scored; inputting each candidate replacement sentence to be scored into a statistical language model for scoring to obtain a plurality of candidate replacement sentence scoring results; and determining the target candidate replacement sentence according to the candidate replacement sentence scoring results. In this way, error cases both inside and outside the rules are recognized, and the accuracy of text error correction is improved.

Description

Text error correction method, device, equipment and medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text error correction method, apparatus, device, and medium.
Background
Text error correction refers to the automatic recognition and correction of problems that arise in the use of natural language, such as wrongly written words (for example, "pain point" written with a homophonous wrong character), grammatical errors (for example, confusion of the homophonous particles 的, 地 and 得), incorrect word collocations, redundant words, missing words, and the like.
Moreover, a specific business context contains its own related terms and terms of art, such as abbreviations of organizations (for example, "Guangdong branch" written in a shortened form) and abbreviations used inside an organization (for example, a shortened form of "meeting minutes"). As a result, a text error correction model trained on general corpora (such as the Chinese Wikipedia corpus and the People's Daily Chinese corpus) does not correct such text well.
Existing error correction techniques combine a rule engine built for a specific scenario; however, relying only on the rule engine limits the coverage of the text error correction model, error cases outside the rules cannot be handled, and the rule engine is prone to misjudgment.
Disclosure of Invention
The main purpose of the present application is to provide a text error correction method, apparatus, device and medium, aiming to solve the technical problems in the prior art that relying only on a rule engine limits the coverage of the text error correction model, that error cases outside the rules cannot be handled, and that the rule engine easily causes misjudgment.
In order to achieve the above object, the present application proposes a text error correction method, the method comprising:
Obtaining a text to be corrected, inputting the text to be corrected into a correction rule engine for correction processing, and obtaining a text corrected by the engine;
Obtaining a target dictionary, and performing word segmentation and error word judgment on the text subjected to engine error correction according to the target dictionary to obtain first potential error word data;
Inputting the text subjected to error correction by the engine into a pre-training model to perform substitution probability prediction to obtain a substitution probability prediction result, and determining second potential error word data according to the substitution probability prediction result;
determining candidate replacement sentences according to the text subjected to the engine error correction, the first potential error word data and the second potential error word data to obtain a plurality of candidate replacement sentences to be scored;
Inputting each candidate alternative sentence to be scored into a statistical language model to score the candidate alternative sentences, so as to obtain a plurality of candidate alternative sentence scoring results;
and determining target candidate alternative sentences according to the scoring results of the candidate alternative sentences.
Further, before the step of obtaining the target dictionary, the method includes:
acquiring a plurality of business scene text samples;
Word segmentation is carried out on the plurality of business scene text samples to obtain word sets to be counted;
performing word frequency statistics on each word in the word set to be counted to obtain word frequencies of a plurality of words to be analyzed;
Acquiring word frequency threshold values;
judging whether the word frequency of the plurality of words to be analyzed is larger than the word frequency threshold value or not;
When the word frequency of the word to be analyzed is larger than the word frequency threshold, taking the word corresponding to the word frequency of the word to be analyzed as common word data of a business scene;
Performing new word mining on the plurality of business scene text samples by adopting a new word discovery algorithm based on pointwise mutual information and left-right entropy to obtain business scene new word data;
Acquiring specific word data of a service scene and common word data of a general scene;
and determining the target dictionary according to the business scene common word data, the business scene new word data, the business scene specific word data and the general scene common word data.
Further, the step of performing word segmentation and error word judgment on the text subjected to the engine error correction according to the target dictionary to obtain first potential error word data includes:
word segmentation is carried out on the text subjected to the engine error correction to obtain a plurality of words to be judged;
judging whether the plurality of words to be judged exist in the target dictionary or not;
And when the words to be judged do not exist in the target dictionary, taking the plurality of the words to be judged as the first potential error word data.
Further, before the step of inputting the text subjected to error correction by the engine into a pre-training model to perform substitution probability prediction, the method comprises the following steps:
obtaining a plurality of training samples, the training samples comprising: training text sample data and training text sample calibration data;
inputting the training text sample data into a generator to be trained to perform word replacement to obtain a replacement sample sentence;
Inputting the replacement sample sentence into a to-be-trained discriminator for replacement probability prediction to obtain a replacement probability sample predicted value, wherein the to-be-trained discriminator adopts the Discriminator of ELECTRA;
and training the generator to be trained and the discriminator to be trained according to the replacement probability sample predicted value and the training text sample calibration data, and taking the trained discriminator as the pre-training model.
Further, the step of determining second potentially erroneous word data according to the substitution probability prediction result includes:
Acquiring a replacement probability threshold;
Extracting a value larger than the replacement probability threshold value from the replacement probability prediction result to obtain target replacement probability prediction data;
and taking the word corresponding to the target replacement probability prediction data as the second potential error word data.
Further, the step of determining candidate alternative sentences according to the text corrected by the engine, the first latent error word data and the second latent error word data to obtain a plurality of candidate alternative sentences to be scored includes:
acquiring homonym word dictionary;
selecting words matched with the first potential error word data and the second potential error word data from the homonym dictionary as candidate words to obtain a candidate word set;
randomly selecting candidate words in the candidate word set to obtain a plurality of candidate word groups;
And respectively replacing the text subjected to engine error correction by each candidate word group to obtain a plurality of candidate replacement sentences to be scored.
Further, the step of determining the target candidate alternative sentence according to the scoring results of the candidate alternative sentences includes:
extracting candidate alternative sentence scoring results with the largest scoring value from the candidate alternative sentence scoring results as target candidate alternative sentence scoring results;
and taking the candidate alternative sentences corresponding to the scoring result of the target candidate alternative sentences as the target candidate alternative sentences.
The application also provides a text error correction device, which comprises:
The engine correction module is used for acquiring a text to be corrected, inputting the text to be corrected into the correction rule engine for correction processing, and obtaining a text corrected by the engine;
The first potential error word data determining module is used for acquiring a target dictionary, and performing word segmentation and error word judgment on the text subjected to engine error correction according to the target dictionary to obtain first potential error word data;
the second potential erroneous word data determining module is used for inputting the text subjected to error correction by the engine into a pre-training model to perform replacement probability prediction to obtain a replacement probability prediction result, and determining second potential erroneous word data according to the replacement probability prediction result;
The candidate replacement sentence determining module to be scored is used for determining candidate replacement sentences according to the text subjected to the engine error correction, the first potential error word data and the second potential error word data to obtain a plurality of candidate replacement sentences to be scored;
the candidate alternative sentence scoring result determining module is used for respectively inputting each candidate alternative sentence to be scored into the statistical language model to score the candidate alternative sentences so as to obtain a plurality of candidate alternative sentence scoring results;
And the target candidate replacement sentence determining module is used for determining target candidate replacement sentences according to the scoring results of the candidate replacement sentences.
The application also proposes a computer device comprising a memory storing a computer program and a processor implementing the steps of any of the methods described above when the processor executes the computer program.
The application also proposes a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the method of any of the above.
According to the text error correction method, device, equipment and medium, the probability of error position identification is improved by using the rule engine, the target dictionary and the pre-training model in the error detection stage, so that the identification of error conditions inside and outside the rule is realized, and the coverage rate is improved; in the error correction stage, candidate replacement sentences are determined according to the text subjected to engine error correction, the first potential error word data and the second potential error word data to obtain a plurality of candidate replacement sentences to be scored, and then the reasonable degree of the replacement words in the candidate replacement sentences is judged by combining with the statistical language model, so that the misjudgment caused by the error detection stage is reduced, and the accuracy of text error correction is improved.
Drawings
FIG. 1 is a flow chart of a text error correction method according to an embodiment of the application;
FIG. 2 is a schematic block diagram of a text error correction apparatus according to an embodiment of the present application;
Fig. 3 is a schematic block diagram of a computer device according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In order to solve the technical problems that the error correction technology in the prior art only depends on a rule engine to cause limited coverage rate of a text error correction model, and error conditions beyond rules cannot be processed, and the rule engine can easily cause misjudgment, the application provides a text error correction method, which is applied to the technical field of artificial intelligence and is further applied to the technical field of natural language processing of artificial intelligence. According to the text error correction method, an error correction rule engine is adopted to correct errors, a dictionary is used for finding out first potential error word data, a pre-training model is used for determining second potential error word data, candidate replacement sentences are determined according to the first potential error word data and the second potential error word data, the candidate replacement sentences are scored, a text error correction result is determined according to the scoring, and error conditions inside and outside the rules are identified, so that coverage rate is improved, and text error correction accuracy is improved.
Referring to fig. 1, in an embodiment of the present application, there is provided a text error correction method, including:
s1: obtaining a text to be corrected, inputting the text to be corrected into a correction rule engine for correction processing, and obtaining a text corrected by the engine;
S2: obtaining a target dictionary, and performing word segmentation and misuse judgment on the text subjected to engine error correction according to the target dictionary to obtain first potential misuse data;
s3: inputting the text subjected to error correction by the engine into a pre-training model to perform substitution probability prediction to obtain a substitution probability prediction result, and determining second potential error word data according to the substitution probability prediction result;
S4: determining candidate replacement sentences according to the text subjected to the engine error correction, the first potential error word data and the second potential error word data to obtain a plurality of candidate replacement sentences to be scored;
s5: inputting each candidate alternative sentence to be scored into a statistical language model to score the candidate alternative sentences, so as to obtain a plurality of candidate alternative sentence scoring results;
S6: and determining target candidate alternative sentences according to the scoring results of the candidate alternative sentences.
In the embodiment, the possibility of error position identification is improved by using a rule engine, a target dictionary and a pre-training model in the error detection stage, so that the identification of error conditions inside and outside the rule is realized, and the coverage rate is improved; in the error correction stage, candidate replacement sentences are determined according to the text subjected to engine error correction, the first potential error word data and the second potential error word data to obtain a plurality of candidate replacement sentences to be scored, and then the reasonable degree of the replacement words in the candidate replacement sentences is judged by combining with the statistical language model, so that the misjudgment caused by the error detection stage is reduced, and the accuracy of text error correction is improved.
For S1, the text to be corrected may be obtained from a database, or may be a text to be corrected input by the user, or may be a text to be corrected sent by another application system.
The text to be corrected is the text which needs text correction.
And inputting the text to be corrected into an error correction rule engine for error word recognition and error word replacement to obtain the text subjected to engine error correction.
The error correction rule engine is a model obtained by training a neural network by using a general corpus, wherein the general corpus comprises but is not limited to: wikipedia Chinese corpus and people daily Chinese corpus.
It will be appreciated that the text to be corrected is of the same language category as the corpus used to train the error correction rule engine. For example, an error correction rule engine obtained by training a neural network on a Chinese corpus is used to perform error word recognition and error word replacement on a Chinese text to be corrected; this example is not specifically limiting.
For S2, the target dictionary may be obtained from a database, or may be a target dictionary input by the user, or may be a target dictionary sent by another application system.
The target dictionary includes at least one word.
Word segmentation is performed on the text subjected to engine error correction, error word judgment is performed on the word segmentation result using the target dictionary, and the words judged to be erroneous are placed in a set to obtain the first potential error word data. That is, the first potential error word data is a set.
And S3, inputting the text subjected to error correction by the engine into a pre-training model to conduct replacement probability prediction on whether each word is replaced or not, and obtaining a replacement probability prediction result. That is, the substitution probability prediction result includes at least one substitution probability prediction value.
And the second potential error word data is determined according to all the replacement probability prediction values in the replacement probability prediction result.
The pre-training model can be selected from the prior art, or can be a model obtained based on neural network training.
It will be appreciated that the language category of the text subjected to engine error correction is the same as that of the text used to train the pre-training model. For example, a pre-training model obtained by training a neural network on Chinese text is used to perform replacement probability prediction on Chinese text subjected to engine error correction; this example is not specifically limiting.
It will be appreciated that steps S2 and S3 may be performed in parallel, or performed one after the other with S3 before S2, which is not specifically limited herein.
And S4, determining candidate words according to the first potential error word data and the second potential error word data, and then determining candidate alternative sentences according to the candidate words and the text subjected to engine error correction to obtain a plurality of candidate alternative sentences to be scored. All possible alternative sentence combinations are included in the plurality of candidate alternative sentences to be scored.
And S5, inputting each candidate alternative sentence to be scored into a statistical language model to score the candidate alternative sentence, namely, each candidate alternative sentence to be scored corresponds to a candidate alternative sentence scoring result.
The statistical language model can be selected from the prior art, or can be a model based on neural network training.
It is understood that the candidate alternative sentence to be scored is the same as the language category of the text of the statistical language model. For example, a statistical language model obtained by training a neural network with a chinese text is used to score candidate alternative sentences of the chinese candidate alternative sentences to be scored, which is not limited in detail herein.
And for S6, the candidate alternative sentence corresponding to the candidate alternative sentence scoring result with the largest score is extracted from the plurality of candidate alternative sentence scoring results as the target candidate alternative sentence. For example, if candidate alternative sentence B scores 80, candidate alternative sentence A scores 70, and candidate alternative sentence D scores 60, then 80 is the maximum value, and candidate alternative sentence B, corresponding to the score of 80, is taken as the target candidate alternative sentence; this example is not specifically limiting.
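As an illustrative aid only, the control flow of steps S1 to S6 described above can be summarized by the following minimal Python sketch. The helper functions (rule_engine_correct, detect_by_dictionary, detect_by_pretrained_model, build_candidate_sentences, lm_score) are hypothetical stand-ins for the components described in this embodiment and are not part of the original disclosure.

```python
# Hypothetical sketch of the S1-S6 pipeline; all helper functions are placeholders.
def correct_text(text_to_correct, target_dictionary, pretrained_model, language_model):
    # S1: correct the text with the error correction rule engine
    engine_text = rule_engine_correct(text_to_correct)
    # S2: first potential error word data from the target dictionary
    first_errors = detect_by_dictionary(engine_text, target_dictionary)
    # S3: second potential error word data from the replacement probability prediction
    second_errors = detect_by_pretrained_model(engine_text, pretrained_model)
    # S4: candidate replacement sentences to be scored
    candidates = build_candidate_sentences(engine_text, first_errors, second_errors)
    # S5: score every candidate with the statistical language model
    scores = [lm_score(language_model, sentence) for sentence in candidates]
    # S6: the candidate with the largest score is the target candidate replacement sentence
    return candidates[scores.index(max(scores))]
```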
In one embodiment, before the step of obtaining the target dictionary, the method includes:
S021: acquiring a plurality of business scene text samples;
s022: word segmentation is carried out on the plurality of business scene text samples to obtain word sets to be counted;
s023: performing word frequency statistics on each word in the word set to be counted to obtain word frequencies of a plurality of words to be analyzed;
s024: acquiring word frequency threshold values;
s025: judging whether the word frequency of the plurality of words to be analyzed is larger than the word frequency threshold value or not;
S026: when the word frequency of the word to be analyzed is larger than the word frequency threshold, taking the word corresponding to the word frequency of the word to be analyzed as common word data of a business scene;
S027: performing new word mining on the plurality of business scene text samples by adopting a new word discovery algorithm of point-to-point information and left and right entropy to obtain business scene new word data;
s028: acquiring specific word data of a service scene and common word data of a general scene;
s029: and determining the target dictionary according to the business scene common word data, the business scene new word data, the business scene specific word data and the general scene common word data.
The embodiment realizes that the target dictionary is determined according to the business scene common word data, the business scene new word data, the business scene specific word data and the general scene common word data, so that the target dictionary covers various related terms and professional terms of a general scene and a business scene, the possibility of error position identification is improved through the target dictionary in an error detection stage, and the coverage rate is improved.
For S021, a plurality of service scene text samples may be obtained from the database, or may be a plurality of service scene text samples input by the user, or may be a plurality of service scene text samples sent by other application systems.
The business scene text sample is text data used by the business scene.
And S022, word segmentation is carried out on the plurality of business scene text samples, and words obtained by word segmentation are placed in a set to obtain a word set to be counted.
For S023, respectively calculating the occurrence times of each word in the word set to be counted to obtain a plurality of target occurrence times; acquiring the total number of words in the word set to be counted to obtain the total number of target words; dividing each target occurrence frequency by the total number of the target words to obtain word frequencies of a plurality of words to be analyzed. That is, each word in the word set to be counted corresponds to a word frequency of the word to be analyzed.
The word frequency to be analyzed is the word frequency to be analyzed.
For S024, the word frequency threshold may be obtained from the database, or may be a word frequency threshold input by the user, or may be a word frequency threshold sent by another application system.
The word frequency threshold is a specific value of 0 to 1.
And for S025, judging whether each word frequency of the plurality of words to be analyzed is larger than the word frequency threshold value or not.
For S026, when the word frequency of the word to be analyzed is greater than the word frequency threshold, it means that the probability that the word corresponding to the word frequency of the word to be analyzed is a common word of the service scene is relatively high, and at this time, the word corresponding to the word frequency of the word to be analyzed is used as the common word data of the service scene, which is beneficial to improving the accuracy of the common word data of the service scene.
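A minimal sketch of steps S022 to S026 follows, assuming the jieba segmenter (the embodiment does not prescribe a particular word segmentation tool) and a hypothetical freq_threshold argument; word frequency is computed as the number of occurrences divided by the total number of words, as described above.

```python
from collections import Counter

import jieba  # assumption: any Chinese word segmenter may be used


def business_common_words(business_samples, freq_threshold):
    # S022: segment every business scene text sample into the word set to be counted
    words = [w for text in business_samples for w in jieba.lcut(text)]
    counts = Counter(words)
    total = sum(counts.values())  # total number of target words
    # S023-S026: word frequency = occurrences / total; keep the words whose
    # frequency exceeds the threshold (a value between 0 and 1)
    return {w for w, c in counts.items() if c / total > freq_threshold}
```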
For S027, the step of performing new word mining on the plurality of business scene text samples by using a new word discovery algorithm based on pointwise mutual information and left-right entropy to obtain business scene new word data includes:
s0271: generating an n_gram (probability-based discriminant model) dictionary according to the plurality of business scene text samples;
The method for generating the n_gram dictionary according to the plurality of business scene text samples may be selected from the prior art, and will not be described herein.
S0272: screening alternative new words from the n_gram dictionary by adopting an inter-point information method to obtain new word data to be selected;
The method for screening the candidate new words from the n_gram dictionary by adopting the inter-point information method can be selected from the prior art, and is not described herein.
S0273: and selecting new words from the new word data to be selected by adopting a left-right entropy method to obtain new word data of the service scene.
The method for selecting the new word from the new word data to be selected by adopting the left-right entropy method can be selected from the prior art, and will not be described herein.
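For illustration, the following is a simplified sketch of new word discovery based on pointwise mutual information and left-right entropy. It only considers two-character candidates and uses assumed threshold values; a full implementation would build the n_gram dictionary for longer candidates as described in S0271 to S0273.

```python
import math
from collections import Counter, defaultdict


def discover_new_words(business_samples, pmi_min=3.0, entropy_min=1.0):
    """Simplified, two-character-only sketch of PMI + left/right entropy new word
    discovery; the thresholds and the bigram restriction are assumptions."""
    text = "".join(business_samples)
    chars = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = len(text)
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(1, len(text) - 2):
        pair = text[i:i + 2]
        left[pair][text[i - 1]] += 1    # characters observed to the left of the candidate
        right[pair][text[i + 2]] += 1   # characters observed to the right of the candidate

    def entropy(counter):
        n = sum(counter.values())
        return -sum(c / n * math.log(c / n) for c in counter.values()) if n else 0.0

    new_words = set()
    for pair, count in bigrams.items():
        # pointwise mutual information: log p(xy) / (p(x) * p(y))
        pmi = math.log((count / total) / ((chars[pair[0]] / total) * (chars[pair[1]] / total)))
        # keep candidates that are internally cohesive (high PMI) and that combine
        # freely with many different neighbours (high left and right entropy)
        if pmi > pmi_min and min(entropy(left[pair]), entropy(right[pair])) > entropy_min:
            new_words.add(pair)
    return new_words
```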
For S028, the service scenario specific word data may be obtained from a database, or may be service scenario specific word data input by a user, or may be service scenario specific word data sent by another application system.
The general scene common word data can be obtained from a database, can be input by a user, and can be sent by other application systems.
The business scene specific word data consists of words formed by the characteristics of the business scene. For example, fixed expressions used in an organization's internal review opinions may be regarded as business scene specific words; this example is not specifically limiting.
The common word data of the general scene is words which are frequently used in most scenes.
And S029, putting the business scene common word data, the business scene new word data, the business scene specific word data and the common scene common word data into a set, and taking the obtained set as the target dictionary. That is, the target dictionary covers various related terms and terms of the general scene, the business scene.
In one embodiment, the step of performing word segmentation and error word judgment on the text subjected to engine error correction according to the target dictionary to obtain first potential error word data includes:
s21: word segmentation is carried out on the text subjected to the engine error correction to obtain a plurality of words to be judged;
S22: judging whether the plurality of words to be judged exist in the target dictionary or not;
s23: and when the words to be judged do not exist in the target dictionary, taking the plurality of the words to be judged as the first potential error word data.
In this embodiment, error word judgment is carried out according to the target dictionary, and since the target dictionary covers various related terms and professional terms of the general scene and the business scene, the possibility of error position recognition is improved, and the coverage of the first potential error word data is improved.
And S21, segmenting the text subjected to error correction by the engine, and taking the words obtained by segmentation as words to be determined.
The word to be determined refers to the word which needs to be determined whether to misuse the word.
And S22, searching each word to be judged in the plurality of words to be judged in the target dictionary, wherein the word to be judged is a correct word when the same word is searched in the target dictionary, and the word to be judged is an incorrect word when the same word is not searched in the target dictionary, and determining that the word to be judged does not exist in the target dictionary.
For S23, when a word to be judged does not exist in the target dictionary, this means that the word to be judged is an erroneous word, and all the words to be judged that do not exist in the target dictionary are taken as the first potential error word data.
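A minimal sketch of steps S21 to S23, again assuming the jieba segmenter, is given below; words absent from the target dictionary are collected as the first potential error word data.

```python
import jieba  # assumption: the embodiment does not prescribe a particular segmenter


def first_potential_errors(engine_corrected_text, target_dictionary):
    # S21: segment the engine-corrected text into words to be judged
    words_to_judge = jieba.lcut(engine_corrected_text)
    # S22/S23: any word that does not exist in the target dictionary is kept
    # as part of the first potential error word data (a set)
    return {w for w in words_to_judge if w not in target_dictionary}
```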
In one embodiment, before the step of inputting the text after error correction into the pre-training model to perform the substitution probability prediction, the method includes:
s031: obtaining a plurality of training samples, the training samples comprising: training text sample data and training text sample calibration data;
S032: inputting the training text sample data into a generator to be trained to perform word replacement to obtain a replacement sample sentence;
S033: inputting the replacement sample sentence into a to-be-trained discriminator for replacement probability prediction to obtain a replacement probability sample predicted value, wherein the to-be-trained discriminator adopts Discriminator of electric;
S034: and training the generator to be trained and the discriminator to be trained according to the replacement probability sample predicted value and the training text sample calibration data, and taking the trained discriminator as the pre-training model.
The to-be-trained discriminator in this embodiment adopts the Discriminator of ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately). Whereas BERT (Bidirectional Encoder Representations from Transformers, a pre-trained language representation model) predicts the correct value of each masked token, ELECTRA predicts whether each token has been replaced, which improves training efficiency. The trained discriminator serves as the pre-training model, so that the pre-training model can predict the probability that each position has been replaced.
S031, a plurality of training samples can be obtained from a database, or a plurality of training samples input by a user, or a plurality of training samples sent by other application systems.
Each training sample comprises a training text sample data and a training text sample calibration data, wherein the training text sample calibration data is a calibration value for each word in the training text sample data to be replaced.
Training text sample data is text data.
The training text sample calibration data is a one-dimensional vector, and each vector element represents a calibration value of a word replaced in the training text sample data.
S032, inputting the training text sample data into a generator to be trained to perform word replacement to obtain replacement sample sentences, namely, each training text sample data corresponds to one replacement sample sentence.
Preferably, the Generator to be trained uses a Generator model.
Positions in the training text sample data are randomly selected and set to [MASK], and the data is then input into the Generator model, which is responsible for turning each [MASK] into a replacement word. Gradients from the discriminator to be trained are not propagated back into the Generator model; instead, the Generator, like BERT, simply attempts to predict the correct word.
S033, inputting the replacement sample sentences into a to-be-trained discriminator for replacement probability prediction to obtain replacement probability sample predicted values, namely, each replacement sample sentence corresponds to one replacement probability sample predicted value.
Discriminator predicts whether the term at each position in the replacement sample sentence was replaced.
S034, the step of training the generator to be trained and the discriminator to be trained according to the replacement probability sample predicted value and the training text sample calibration data, and taking the trained discriminator as the pre-training model comprises the following steps:
S0341: inputting the predicted value of the substitution probability sample and the calibration data of the training text sample into a loss function to calculate to obtain a target loss value, updating the parameters of the generator to be trained and the parameters of the discriminator to be trained according to the target loss value, wherein the updated generator to be trained and the updated discriminator to be trained are used for calculating the predicted value of the substitution probability sample next time;
S0342: repeating the steps until the loss value reaches a first convergence condition or the iteration number reaches a second convergence condition, and determining the to-be-trained discriminator with the target loss value reaching the first convergence condition or the iteration number reaching the second convergence condition as the pre-training model.
The first convergence condition means that the target loss values calculated in two consecutive iterations satisfy a Lipschitz condition (Lipschitz continuity condition).
The iteration count refers to the number of times the generator to be trained and the discriminator to be trained have been used to calculate the replacement probability sample predicted value; each such calculation increases the iteration count by 1. The second convergence condition is a preset threshold on the iteration count.
The loss function may be selected from the prior art and will not be described in detail herein.
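For illustration, a schematic training step in the ELECTRA style of S031 to S034 is sketched below using the Hugging Face transformers classes ElectraForMaskedLM (generator) and ElectraForPreTraining (discriminator). The checkpoint names, the masking rate and the 50x loss weight are assumptions (the weight follows the ELECTRA paper, not this disclosure); the discriminator kept at the end is what serves as the pre-training model.

```python
import torch
from transformers import ElectraForMaskedLM, ElectraForPreTraining, ElectraTokenizerFast

# Assumed Chinese ELECTRA checkpoints; any compatible generator/discriminator pair works.
tokenizer = ElectraTokenizerFast.from_pretrained("hfl/chinese-electra-small-discriminator")
generator = ElectraForMaskedLM.from_pretrained("hfl/chinese-electra-small-generator")
discriminator = ElectraForPreTraining.from_pretrained("hfl/chinese-electra-small-discriminator")


def training_step(batch_texts, optimizer, mask_rate=0.15):
    enc = tokenizer(batch_texts, return_tensors="pt", padding=True,
                    truncation=True, return_special_tokens_mask=True)
    input_ids = enc["input_ids"]
    special = enc["special_tokens_mask"].bool()
    # S032: randomly select positions, set them to [MASK], and let the generator fill them in
    mask = (torch.rand(input_ids.shape) < mask_rate) & ~special
    masked_ids = input_ids.masked_fill(mask, tokenizer.mask_token_id)
    mlm_labels = input_ids.masked_fill(~mask, -100)  # generator loss only on masked positions
    gen_out = generator(input_ids=masked_ids, attention_mask=enc["attention_mask"],
                        labels=mlm_labels)
    sampled = gen_out.logits.argmax(-1)                    # the generator's replacement words
    replaced_ids = torch.where(mask, sampled, input_ids)   # the replacement sample sentence
    # S033: the discriminator predicts, per token, whether it was replaced;
    # the comparison below plays the role of the training sample calibration data
    labels = (replaced_ids != input_ids).float()
    disc_out = discriminator(input_ids=replaced_ids, attention_mask=enc["attention_mask"],
                             labels=labels)
    # S034: joint loss; no gradient flows from the discriminator into the generator,
    # because the replacements are taken through a hard argmax
    loss = gen_out.loss + 50.0 * disc_out.loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```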
In one embodiment, the step of determining the second potentially erroneous word data according to the substitution probability prediction result includes:
s31: acquiring a replacement probability threshold;
s32: extracting a value larger than the replacement probability threshold value from the replacement probability prediction result to obtain target replacement probability prediction data;
S33: and taking the word corresponding to the target replacement probability prediction data as the second potential misuse word data.
According to the method, the words corresponding to the values larger than the replacement probability threshold in the replacement probability prediction result are used as the second potential misuse word data, so that misjudgment is reduced, and accuracy of the second potential misuse word data is improved.
For S31, the replacement probability threshold may be obtained from the database, or may be a replacement probability threshold input by the user, or may be a replacement probability threshold sent by another application system.
The replacement probability threshold is a specific value of 0 to 1.
For S32, the replacement probability predicted values larger than the replacement probability threshold are extracted from all the replacement probability predicted values in the replacement probability prediction result, and the extracted values are taken as the target replacement probability prediction data. That is, the target replacement probability prediction data may contain one value, a plurality of values, or zero values. By taking the replacement probability predicted values larger than the replacement probability threshold as the target replacement probability prediction data, the predicted values smaller than or equal to the threshold are discarded, which reduces noise data, reduces misjudgment, and improves the accuracy of the second potential error word data.
And for S33, the words corresponding to the target replacement probability prediction data are placed in a set to obtain the second potential error word data.
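A minimal sketch of steps S31 to S33 follows; it assumes the pre-training model's output has already been converted into a list of (word, replacement probability) pairs, for example by applying a sigmoid to the discriminator logits.

```python
def second_potential_errors(word_probabilities, replacement_probability_threshold):
    # S32: extract the values larger than the replacement probability threshold
    # to obtain the target replacement probability prediction data
    targets = [(word, p) for word, p in word_probabilities
               if p > replacement_probability_threshold]
    # S33: the corresponding words form the second potential error word data (a set)
    return {word for word, _ in targets}
```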
In one embodiment, the step of determining candidate alternative sentences according to the text corrected by the engine, the first latent error word data and the second latent error word data to obtain a plurality of candidate alternative sentences to be scored includes:
S41: acquiring homonym word dictionary;
s42: selecting words matched with the first potential misuse word data and the second potential misuse word data from the homonym dictionary as candidate words to obtain a candidate word set;
S43: randomly selecting candidate words in the candidate word set to obtain a plurality of candidate word groups;
S44: and respectively replacing the text subjected to engine error correction by each candidate word group to obtain a plurality of candidate replacement sentences to be scored.
The embodiment determines candidate replacement sentences through the text subjected to error correction by the engine, the first potential error word data and the second potential error word data, so that a data basis is provided for scoring the candidate replacement sentences.
For S41, the homonym dictionary may be obtained from a database, or may be a homonym dictionary input by the user, or may be a homonym dictionary sent by another application system.
The homonym word dictionary includes a homophone sub-dictionary and a similar-form word sub-dictionary. The homophone sub-dictionary contains entries in which the word before the first replacement and the word after the first replacement have the same pronunciation. The similar-form word sub-dictionary contains entries in which the word before the second replacement and the word after the second replacement are similar in shape.
For S42, matching each word in the first potentially incorrect word data in the homonym dictionary, and taking the word matched in the homonym dictionary as a first candidate word; matching each word in the second potentially erroneous word data in the homonym dictionary, and taking the word matched in the homonym dictionary as a second candidate word; and putting all the first candidate words and all the second candidate words into a set to obtain a candidate word set.
For S43, the candidate words in the candidate word set are randomly combined, and each combination is used as a candidate word group. It will be appreciated that a plurality of groupings of candidate words encompasses all possible groupings of candidate words in the set of candidate words.
And for S44, each candidate word group is used for replacing the text subjected to the engine error correction, so that candidate replacement sentences to be scored are obtained. That is, each candidate word group corresponds to a candidate alternative sentence to be scored.
The candidate replacement sentences to be scored refer to candidate replacement sentences needing to be scored.
In another embodiment, the candidate word set may be determined by candidate word matching using a homonym dictionary and a dictionary other than the homonym dictionary, which is not specifically limited herein.
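A minimal sketch of steps S41 to S44 follows. The homonym dictionary is assumed to map a word to its homophone and similar-form candidates; exhaustive enumeration with itertools.product is used, which, as noted above, covers all possible candidate word groupings, and replacing every occurrence of a word in the sentence is a simplification.

```python
from itertools import product


def candidate_sentences(engine_corrected_text, potential_error_words, homonym_dictionary):
    # S42: candidate words matched in the homophone / similar-form dictionary;
    # the original word is kept as well so that "no replacement" is also scored
    options = {w: [w] + homonym_dictionary.get(w, []) for w in potential_error_words}
    sentences = []
    # S43/S44: each combination of candidate words yields one candidate
    # replacement sentence to be scored
    for combo in product(*options.values()):
        sentence = engine_corrected_text
        for old_word, new_word in zip(options, combo):
            sentence = sentence.replace(old_word, new_word)  # simplification: replaces all occurrences
        sentences.append(sentence)
    return sentences
```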
In one embodiment, the step of determining the target candidate alternative sentence according to the scoring result of the candidate alternative sentences includes:
s61: extracting candidate alternative sentence scoring results with the largest scoring value from the candidate alternative sentence scoring results as target candidate alternative sentence scoring results;
s62: and taking the candidate alternative sentences corresponding to the scoring result of the target candidate alternative sentences as the target candidate alternative sentences.
The embodiment realizes that the candidate alternative sentence scoring result corresponding to the maximum value in the candidate alternative sentence scoring results is used as a target candidate alternative sentence scoring result, and the candidate alternative sentence corresponding to the target candidate alternative sentence scoring result is used as the target candidate alternative sentence, so that the accuracy of the determined target candidate alternative sentence is further improved.
And S61, extracting the largest candidate alternative sentence scoring result from the plurality of candidate alternative sentence scoring results, and taking the extracted largest candidate alternative sentence scoring result as a target candidate alternative sentence scoring result.
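Finally, a minimal sketch of scoring (S5) and selection (S61/S62) is given below; it assumes a KenLM n-gram model trained on segmented business scene text (the file name is hypothetical), with jieba used for segmentation. Any statistical language model that assigns a sentence score can be substituted.

```python
import jieba
import kenlm  # assumption: a KenLM n-gram model stands in for the statistical language model

language_model = kenlm.Model("business_scene.arpa")  # hypothetical model file


def pick_target_sentence(candidate_sentences_to_score):
    # S5: score each candidate replacement sentence with the statistical language model
    scores = [language_model.score(" ".join(jieba.lcut(s)))
              for s in candidate_sentences_to_score]
    # S61/S62: the candidate with the largest score is the target candidate replacement sentence
    return candidate_sentences_to_score[scores.index(max(scores))]
```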
Referring to fig. 2, the present application also proposes a text error correction apparatus, the apparatus comprising:
The engine correction module 100 is configured to obtain a text to be corrected, input the text to be corrected into a correction rule engine to perform correction processing, and obtain a text corrected by the engine;
The first potentially erroneous word data determining module 200 is configured to obtain a target dictionary, and perform word segmentation and erroneous word judgment on the text subjected to engine error correction according to the target dictionary to obtain first potentially erroneous word data;
The second potentially erroneous word data determining module 300 is configured to input the text corrected by the engine into a pre-training model to perform substitution probability prediction, obtain a substitution probability prediction result, and determine the second potentially erroneous word data according to the substitution probability prediction result;
the candidate replacement sentence determining module 400 to be scored is configured to determine candidate replacement sentences according to the text corrected by the engine, the first potential erroneous word data and the second potential erroneous word data, so as to obtain a plurality of candidate replacement sentences to be scored;
the candidate alternative sentence scoring result determining module 500 is configured to input each candidate alternative sentence to be scored into a statistical language model to score the candidate alternative sentences, so as to obtain a plurality of candidate alternative sentence scoring results;
the target candidate replacement sentence determining module 600 is configured to determine a target candidate replacement sentence according to the scoring results of the candidate replacement sentences.
In the embodiment, the possibility of error position identification is improved by using a rule engine, a target dictionary and a pre-training model in the error detection stage, so that the identification of error conditions inside and outside the rule is realized, and the coverage rate is improved; in the error correction stage, candidate replacement sentences are determined according to the text subjected to engine error correction, the first potential error word data and the second potential error word data to obtain a plurality of candidate replacement sentences to be scored, and then the reasonable degree of the replacement words in the candidate replacement sentences is judged by combining with the statistical language model, so that the misjudgment caused by the error detection stage is reduced, and the accuracy of text error correction is improved.
Referring to fig. 3, in an embodiment of the present application, there is further provided a computer device, which may be a server, and an internal structure thereof may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and the computer programs in the non-volatile storage medium. The database of the computer device is used for storing data such as the data used by the text error correction method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a text error correction method. The text error correction method comprises the following steps: obtaining a text to be corrected, inputting the text to be corrected into a correction rule engine for correction processing, and obtaining a text corrected by the engine; obtaining a target dictionary, and performing word segmentation and error word judgment on the text subjected to engine error correction according to the target dictionary to obtain first potential error word data; inputting the text subjected to engine error correction into a pre-training model to perform replacement probability prediction to obtain a replacement probability prediction result, and determining second potential error word data according to the replacement probability prediction result; determining candidate replacement sentences according to the text subjected to engine error correction, the first potential error word data and the second potential error word data to obtain a plurality of candidate replacement sentences to be scored; inputting each candidate replacement sentence to be scored into a statistical language model to score the candidate replacement sentences, so as to obtain a plurality of candidate replacement sentence scoring results; and determining the target candidate replacement sentence according to the scoring results of the candidate replacement sentences.
In the embodiment, the possibility of error position identification is improved by using a rule engine, a target dictionary and a pre-training model in the error detection stage, so that the identification of error conditions inside and outside the rule is realized, and the coverage rate is improved; in the error correction stage, candidate replacement sentences are determined according to the text subjected to engine error correction, the first potential error word data and the second potential error word data to obtain a plurality of candidate replacement sentences to be scored, and then the reasonable degree of the replacement words in the candidate replacement sentences is judged by combining with the statistical language model, so that the misjudgment caused by the error detection stage is reduced, and the accuracy of text error correction is improved.
An embodiment of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a text error correction method comprising the steps of: obtaining a text to be corrected, inputting the text to be corrected into a correction rule engine for correction processing, and obtaining a text corrected by the engine; obtaining a target dictionary, and performing word segmentation and error word judgment on the text subjected to engine error correction according to the target dictionary to obtain first potential error word data; inputting the text subjected to engine error correction into a pre-training model to perform replacement probability prediction to obtain a replacement probability prediction result, and determining second potential error word data according to the replacement probability prediction result; determining candidate replacement sentences according to the text subjected to engine error correction, the first potential error word data and the second potential error word data to obtain a plurality of candidate replacement sentences to be scored; inputting each candidate replacement sentence to be scored into a statistical language model to score the candidate replacement sentences, so as to obtain a plurality of candidate replacement sentence scoring results; and determining the target candidate replacement sentence according to the scoring results of the candidate replacement sentences.
According to the text error correction method, the possibility of error position recognition is improved by using the rule engine, the target dictionary and the pre-training model in the error detection stage, so that the recognition of error conditions inside and outside the rule is realized, and the coverage rate is improved; in the error correction stage, candidate replacement sentences are determined according to the text subjected to engine error correction, the first potential error word data and the second potential error word data to obtain a plurality of candidate replacement sentences to be scored, and then the reasonable degree of the replacement words in the candidate replacement sentences is judged by combining with the statistical language model, so that the misjudgment caused by the error detection stage is reduced, and the accuracy of text error correction is improved.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer-readable storage medium which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided by the present application and used in embodiments may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that comprises the element.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application or directly or indirectly applied to other related technical fields are included in the scope of the application.

Claims (8)

1. A method of text correction, the method comprising:
Obtaining a text to be corrected, inputting the text to be corrected into a correction rule engine for correction processing, and obtaining a text corrected by the engine;
Obtaining a target dictionary, and performing word segmentation and error word judgment on the text subjected to engine error correction according to the target dictionary to obtain first potential error word data;
Inputting the text subjected to error correction by the engine into a pre-training model to perform substitution probability prediction to obtain a substitution probability prediction result, and determining second potential error word data according to the substitution probability prediction result;
determining candidate replacement sentences according to the text subjected to the engine error correction, the first potential error word data and the second potential error word data to obtain a plurality of candidate replacement sentences to be scored;
Inputting each candidate alternative sentence to be scored into a statistical language model to score the candidate alternative sentences, so as to obtain a plurality of candidate alternative sentence scoring results;
determining target candidate alternative sentences according to the scoring results of the candidate alternative sentences;
the step of determining the second potentially erroneous word data according to the replacement probability prediction result includes:
Acquiring a replacement probability threshold;
Extracting a value larger than the replacement probability threshold value from the replacement probability prediction result to obtain target replacement probability prediction data;
taking the word corresponding to the target replacement probability prediction data as the second potential error word data;
The step of determining candidate replacement sentences according to the text subjected to the engine error correction, the first potential error word data and the second potential error word data to obtain a plurality of candidate replacement sentences to be scored comprises the following steps:
acquiring homonym word dictionary;
selecting words matched with the first potential misuse word data and the second potential misuse word data from the homonym dictionary as candidate words to obtain a candidate word set;
randomly selecting candidate words in the candidate word set to obtain a plurality of candidate word groups;
And respectively replacing the text subjected to engine error correction by each candidate word group to obtain a plurality of candidate replacement sentences to be scored.
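As one non-limiting way of realizing the candidate-generation steps recited at the end of claim 1, the sketch below builds candidate word groups from a homophone dictionary and substitutes them into the engine-corrected text. The dictionary contents, the cap on randomly drawn groups and the simple string replacement are assumptions of the sketch, not limitations of the claim.

import itertools
import random
from typing import Dict, List


def build_candidate_sentences(
    engine_text: str,
    potential_errors: List[str],            # union of first and second potential erroneous words
    homophone_dict: Dict[str, List[str]],   # word -> homophone / near-homophone candidates
    max_groups: int = 20,                   # assumed cap on randomly drawn candidate word groups
) -> List[str]:
    # For every potential erroneous word found in the text, collect its homophone
    # candidates; the word itself is kept so that "no change" remains an option.
    candidate_sets = {
        word: [word] + homophone_dict.get(word, [])
        for word in potential_errors
        if word in engine_text
    }
    if not candidate_sets:
        return [engine_text]

    words = list(candidate_sets)
    # Every combination of one candidate per word is a candidate word group;
    # at most max_groups of them are kept by random selection, mirroring the claim.
    all_groups = list(itertools.product(*(candidate_sets[w] for w in words)))
    groups = random.sample(all_groups, k=min(max_groups, len(all_groups)))

    sentences = []
    for group in groups:
        sentence = engine_text
        for original, replacement in zip(words, group):
            sentence = sentence.replace(original, replacement)
        sentences.append(sentence)
    return sentences


# Toy usage with a hypothetical homophone entry (医生 and 一生 share the same pronunciation);
# the output contains both the unchanged sentence and the homophone replacement.
print(build_candidate_sentences("他的医生过得很幸福", ["医生"], {"医生": ["一生"]}))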
2. The text error correction method of claim 1, wherein, prior to the step of obtaining the target dictionary, the method comprises:
acquiring a plurality of business scene text samples;
performing word segmentation on the plurality of business scene text samples to obtain a word set to be counted;
performing word frequency statistics on each word in the word set to be counted to obtain word frequencies of a plurality of words to be analyzed;
acquiring a word frequency threshold;
judging whether the word frequencies of the plurality of words to be analyzed are larger than the word frequency threshold;
when the word frequency of a word to be analyzed is larger than the word frequency threshold, taking that word as business scene common word data;
performing new word mining on the plurality of business scene text samples by adopting a new word discovery algorithm based on pointwise mutual information and left/right entropy to obtain business scene new word data;
acquiring business scene specific word data and general scene common word data; and
determining the target dictionary according to the business scene common word data, the business scene new word data, the business scene specific word data and the general scene common word data.
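The word-frequency filtering and the new word mining recited in claim 2 can be illustrated with the following self-contained sketch. The character-bigram granularity, the default thresholds and the absence of a real segmenter are simplifications made for the illustration: pointwise mutual information measures the internal cohesion of a candidate word, and left/right entropy measures how freely its context varies at each boundary.

import math
from collections import Counter, defaultdict
from typing import Iterable, Set


def frequent_words(words: Iterable[str], frequency_threshold: int) -> Set[str]:
    # Business scene common word data: words whose frequency exceeds the threshold.
    counts = Counter(words)
    return {word for word, count in counts.items() if count > frequency_threshold}


def discover_new_words(
    samples: Iterable[str],
    min_count: int = 5,        # assumed minimum frequency for a candidate
    min_pmi: float = 3.0,      # assumed cohesion threshold
    min_entropy: float = 1.0,  # assumed boundary-freedom threshold
) -> Set[str]:
    # Toy new word mining over character bigrams using pointwise mutual information
    # and left/right entropy; a real system would also consider longer n-grams.
    char_count: Counter = Counter()
    pair_count: Counter = Counter()
    left_ctx: defaultdict = defaultdict(Counter)
    right_ctx: defaultdict = defaultdict(Counter)

    for text in samples:
        char_count.update(text)
        for i in range(len(text) - 1):
            pair = text[i:i + 2]
            pair_count[pair] += 1
            if i > 0:
                left_ctx[pair][text[i - 1]] += 1
            if i + 2 < len(text):
                right_ctx[pair][text[i + 2]] += 1

    total_chars = sum(char_count.values())
    total_pairs = sum(pair_count.values()) or 1

    def entropy(neighbors: Counter) -> float:
        n = sum(neighbors.values())
        if n == 0:
            return 0.0
        return -sum((c / n) * math.log(c / n) for c in neighbors.values())

    new_words = set()
    for pair, count in pair_count.items():
        if count < min_count:
            continue
        pmi = math.log(
            (count / total_pairs)
            / ((char_count[pair[0]] / total_chars) * (char_count[pair[1]] / total_chars))
        )
        if (pmi >= min_pmi
                and entropy(left_ctx[pair]) >= min_entropy
                and entropy(right_ctx[pair]) >= min_entropy):
            new_words.add(pair)
    return new_words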
3. The text error correction method of claim 1, wherein the step of performing word segmentation and erroneous-word judgment on the engine-corrected text according to the target dictionary to obtain the first potential erroneous word data comprises:
performing word segmentation on the engine-corrected text to obtain a plurality of words to be judged;
judging whether each of the plurality of words to be judged exists in the target dictionary; and
when a word to be judged does not exist in the target dictionary, taking the word to be judged as part of the first potential erroneous word data.
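A minimal sketch of the dictionary check in claim 3 follows, assuming the jieba segmenter (any segmentation tool could be substituted) and a hypothetical miniature dictionary.

import jieba  # one possible Chinese word segmenter, assumed here for illustration
from typing import List, Set


def first_potential_erroneous_words(engine_text: str, target_dictionary: Set[str]) -> List[str]:
    # Words produced by segmentation that are absent from the target dictionary
    # are kept as the first potential erroneous word data.
    return [w for w in jieba.lcut(engine_text) if w.strip() and w not in target_dictionary]


# Toy usage; a real target dictionary would merge business scene common words,
# business scene new words, business scene specific words and general common words.
tiny_dictionary = {"今天", "天气", "很", "好"}
print(first_potential_erroneous_words("今天天汽很好", tiny_dictionary))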
4. The text error correction method of claim 1, wherein, before the step of inputting the engine-corrected text into a pre-training model for replacement probability prediction, the method comprises:
obtaining a plurality of training samples, the training samples comprising training text sample data and training text sample calibration data;
inputting the training text sample data into a generator to be trained for word replacement to obtain a replacement sample sentence;
inputting the replacement sample sentence into a discriminator to be trained for replacement probability prediction to obtain a replacement probability sample predicted value, wherein the discriminator to be trained adopts the discriminator of ELECTRA; and
training the generator to be trained and the discriminator to be trained according to the replacement probability sample predicted value and the training text sample calibration data, and taking the trained discriminator as the pre-training model.
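An ELECTRA discriminator of the kind recited in claim 4 is available, for example, through the Hugging Face transformers library. The sketch below shows only the inference side used in the error detection stage, where per-token replacement probabilities are thresholded into second potential erroneous word data; the checkpoint name, the English example sentence and the 0.5 threshold are assumptions, and a discriminator trained on Chinese business scene text following the generator/discriminator recipe of claim 4 would be used in practice.

import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

# Assumed checkpoint for illustration; a Chinese ELECTRA discriminator would be used for Chinese text.
CHECKPOINT = "google/electra-small-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(CHECKPOINT)
model = ElectraForPreTraining.from_pretrained(CHECKPOINT)
model.eval()

sentence = "the quick brown fox fake over the lazy dog"  # "fake" stands in for a replaced word
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0]        # one logit per token
probs = torch.sigmoid(logits).tolist()        # replacement probability prediction result

replacement_probability_threshold = 0.5       # assumed value of the acquired threshold
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
second_potential_erroneous_words = [
    tok for tok, p in zip(tokens, probs)
    if p > replacement_probability_threshold and tok not in ("[CLS]", "[SEP]")
]
print(second_potential_erroneous_words)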
5. The text error correction method of claim 1, wherein the step of determining a target candidate replacement sentence according to the plurality of candidate replacement sentence scoring results comprises:
extracting, from the plurality of candidate replacement sentence scoring results, the candidate replacement sentence scoring result with the largest score as a target candidate replacement sentence scoring result; and
taking the candidate replacement sentence corresponding to the target candidate replacement sentence scoring result as the target candidate replacement sentence.
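For the scoring and selection steps (step 5 of claim 1 and claim 5), a production system would typically use a full n-gram statistical language model; the tiny character-bigram model below, with add-one smoothing and a toy training corpus, is only a self-contained illustration of how the highest-scoring candidate replacement sentence becomes the target sentence.

import math
from collections import Counter
from typing import Iterable, List


class BigramLM:
    """Tiny character-bigram language model with add-one smoothing."""

    def __init__(self, corpus: Iterable[str]):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for text in corpus:
            self.unigrams.update(text)
            self.bigrams.update(text[i:i + 2] for i in range(len(text) - 1))
        self.vocab = max(len(self.unigrams), 1)

    def score(self, sentence: str) -> float:
        # Sum of log P(c_i | c_{i-1}) with add-one smoothing; higher means more fluent.
        logp = 0.0
        for i in range(1, len(sentence)):
            num = self.bigrams[sentence[i - 1:i + 1]] + 1
            den = self.unigrams[sentence[i - 1]] + self.vocab
            logp += math.log(num / den)
        return logp


def pick_target_sentence(candidates: List[str], lm: BigramLM) -> str:
    # The target candidate replacement sentence is the one with the largest score.
    return max(candidates, key=lm.score)


# Toy corpus and candidates; the homophone-corrected sentence scores higher.
lm = BigramLM(["他的一生过得很幸福", "他是一名医生"])
print(pick_target_sentence(["他的医生过得很幸福", "他的一生过得很幸福"], lm))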
6. A text error correction apparatus for performing the method of any one of claims 1 to 5, the apparatus comprising:
an engine correction module, configured to obtain a text to be corrected, input the text to be corrected into the error correction rule engine for error correction processing, and obtain an engine-corrected text;
a first potential erroneous word data determining module, configured to obtain a target dictionary, and perform word segmentation and erroneous-word judgment on the engine-corrected text according to the target dictionary to obtain first potential erroneous word data;
a second potential erroneous word data determining module, configured to input the engine-corrected text into a pre-training model for replacement probability prediction to obtain a replacement probability prediction result, and determine second potential erroneous word data according to the replacement probability prediction result;
a candidate replacement sentence to be scored determining module, configured to determine candidate replacement sentences according to the engine-corrected text, the first potential erroneous word data and the second potential erroneous word data to obtain a plurality of candidate replacement sentences to be scored;
a candidate replacement sentence scoring result determining module, configured to input each candidate replacement sentence to be scored into a statistical language model for scoring to obtain a plurality of candidate replacement sentence scoring results; and
a target candidate replacement sentence determining module, configured to determine a target candidate replacement sentence according to the plurality of candidate replacement sentence scoring results.
7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 5.
8. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 5.
CN202011302530.3A (CN112380840B) — Text error correction method, device, equipment and medium — priority and filing date 2020-11-19 — status: Active

Priority Applications (2)

- CN202011302530.3A (CN112380840B) — priority date 2020-11-19, filing date 2020-11-19 — Text error correction method, device, equipment and medium
- PCT/CN2021/084546 (WO2022105083A1) — priority date 2020-11-19, filing date 2021-03-31 — Text error correction method and apparatus, device, and medium

Applications Claiming Priority (1)

- CN202011302530.3A (CN112380840B) — priority date 2020-11-19, filing date 2020-11-19 — Text error correction method, device, equipment and medium

Publications (2)

- CN112380840A — published 2021-02-19
- CN112380840B (granted publication) — published 2024-05-07

Family

ID=74584580

Family Applications (1)

- CN202011302530.3A (CN112380840B, Active) — priority date 2020-11-19, filing date 2020-11-19 — Text error correction method, device, equipment and medium

Country Status (2)

- CN (1): CN112380840B (en)
- WO (1): WO2022105083A1 (en)


Also Published As

- CN112380840A (en) — published 2021-02-19
- WO2022105083A1 (en) — published 2022-05-27


Legal Events

- PB01: Publication
- REG: Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40040480; country of ref document: HK)
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant