CN118036595A

CN118036595A - Text error correction method, device, computer equipment and storage medium

Info

Publication number: CN118036595A
Application number: CN202410159507.5A
Authority: CN
Inventors: 刘杨; 张文斌; 林跃; 卢品吟; 刘赫阳
Original assignee: Shenzhen Dongxin Cloud Technology Co ltd
Current assignee: Shenzhen Dongxin Cloud Technology Co ltd
Priority date: 2024-02-04
Filing date: 2024-02-04
Publication date: 2024-05-14

Abstract

The invention relates to the technical field of natural language processing and discloses a text error correction method, a text error correction apparatus, a computer device and a storage medium. The method comprises: obtaining a preset error correction model, which is trained on a target-domain corpus and comprises a checking network and an error correction network; inputting the text to be processed into the checking network for checking, and determining the text to be corrected; replacing target wrongly written words in the text to be corrected based on a preset replacement dictionary to obtain a dictionary replacement text; if the dictionary replacement text contains non-replaced wrongly written words, inputting the dictionary replacement text into the error correction network for error correction to obtain a candidate word set; and determining, according to a received candidate word selection instruction, a target replacement word corresponding to each non-replaced wrongly written word in the candidate word set, and generating the error-corrected text to be processed from the target replacement word and the dictionary replacement text. The invention ensures the professionalism and domain applicability of text error correction and improves the fluency and accuracy of the corrected text.

Description

Text error correction method, device, computer equipment and storage medium

Technical Field

The present invention relates to the field of natural language processing technologies, and in particular, to a text error correction method, apparatus, computer device, and storage medium.

Background

In the age of information explosion, massive amounts of information are generated on internet platforms every day. Because of the sheer volume of information, the high update speed and possible negligence in the review process, some of this information may contain wrongly written words. Wrongly written words not only hinder reading and understanding of the information, but can also cause errors in content analysis and data processing.

To eliminate the influence of wrongly written words, the prior art corrects information text by several means. For example, a spell checker detects spelling errors in text using predefined spelling rules and dictionaries; a grammar checker detects and corrects grammar errors in text (e.g., erroneous sentence structure or inconsistent grammar); and a deep learning model trained on a labeled text dataset performs error correction.

As social and economic activities become increasingly diverse, different fields place different professional requirements on text information. For example, in the commodity marketing field, commodity sales involve specific brand names and proprietary terms, and commodity services involve information on aspects such as market research, brand positioning, advertising, public relations and channel management. However, each of the existing methods has its own defects, which impair practical application and prevent effective processing of domain-specific text. A spell checker applies poorly to new words or terms of art, because these words may not be in its predefined dictionary. A grammar checker cannot understand complex semantic relationships in context when processing long text. Training a deep learning model generally requires large-scale training data and substantial computing resources, and applying the model to a specific field requires a large amount of resources for retraining.

Disclosure of Invention

Based on the above, it is necessary to provide a text error correction method, apparatus, computer device and storage medium to solve the problems of poor applicability, weak understanding ability of semantic relationship and large resource consumption of the existing text error correction means.

A text error correction method, comprising:

acquiring a preset error correction model, wherein the preset error correction model is a neural network model which is obtained based on target-domain corpus training and comprises a checking network and an error correction network;

inputting a text to be processed into the checking network, checking the text to be processed, and determining a text to be corrected, wherein the text to be corrected comprises at least one target wrongly written word;

acquiring a preset replacement dictionary, and carrying out replacement processing on the target wrongly written word based on the preset replacement dictionary to obtain a dictionary replacement text corresponding to the text to be corrected;

if the dictionary replacement text contains a non-replaced wrongly written word, inputting the dictionary replacement text into the error correction network, and performing error correction processing on the dictionary replacement text to obtain a candidate word set corresponding to the non-replaced wrongly written word;

when a candidate word selection instruction is received, determining a target replacement word corresponding to the non-replaced wrongly written word in the candidate word set according to the candidate word selection instruction, and generating an error-corrected text to be processed according to the target replacement word and the dictionary replacement text.

A text error correction apparatus comprising:

a model acquisition module, used for acquiring a preset error correction model, wherein the preset error correction model is a neural network model which is obtained based on target-domain corpus training and comprises a checking network and an error correction network;

a checking processing module, used for inputting a text to be processed into the checking network, checking the text to be processed, and determining a text to be corrected, wherein the text to be corrected comprises at least one target wrongly written word;

a dictionary processing module, used for acquiring a preset replacement dictionary, and carrying out replacement processing on the target wrongly written word based on the preset replacement dictionary to obtain a dictionary replacement text corresponding to the text to be corrected;

an error correction processing module, used for inputting the dictionary replacement text into the error correction network if a non-replaced wrongly written word exists in the dictionary replacement text, and performing error correction processing on the dictionary replacement text to obtain a candidate word set corresponding to the non-replaced wrongly written word;

a replacement word determining module, used for determining, when a candidate word selection instruction is received, a target replacement word corresponding to the non-replaced wrongly written word in the candidate word set according to the candidate word selection instruction, and generating an error-corrected text to be processed according to the target replacement word and the dictionary replacement text.

A computer device comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, the processor implementing the text error correction method described above when executing the computer readable instructions.

A computer-readable storage medium storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform a text error correction method as described above.

In the above text error correction method, apparatus, computer device and storage medium, the preset error correction model is a neural network model trained on a target-domain corpus, so the model does not need to be trained on the full data, which saves data resources. At the same time, the trained model can focus on text error correction in the specific domain, which improves the professionalism of the correction and its applicability to that domain. In addition, the text to be corrected is determined by the checking network of the preset error correction model, some of the wrongly written words in the text to be corrected are replaced using the preset replacement dictionary, and the remaining wrongly written words that the preset replacement dictionary cannot handle are corrected by the error correction network of the preset error correction model. This improves the ability of the correction to understand semantic relationships, keeps the grammar and syntactic structure correct, avoids errors and omissions, and improves the fluency and accuracy of the corrected text as a whole.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a text error correction method according to an embodiment of the invention;

FIG. 2 is a schematic diagram of a text error correction apparatus according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a computer device in accordance with an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In one embodiment, as shown in fig. 1, a text error correction method is provided, which includes the following steps S10-S50:

S10, acquiring a preset error correction model, wherein the preset error correction model is a neural network model which is obtained based on target-domain corpus training and comprises a checking network and an error correction network.

It is to be understood that the text error correction method in this embodiment is implemented based on communication between a client and a text error correction device, which may be a server, a personal computer, a notebook computer or the like; a server is taken as the example in this embodiment. A neural-network-based error correction model can effectively identify and correct errors in text when processing natural language. The preset error correction model is a neural network model that is trained on a target-domain corpus and comprises a checking network and an error correction network, where the checking network identifies errors in text and the error correction network corrects the identified errors. The target-domain corpus is corpus text collected in a specific domain as needed; the target domain may be, for example, the medical, legal, scientific or marketing domain. The target-domain corpus can be collected from different data sources in different ways: for example, it can be crawled from web pages and social media by web crawlers, or obtained from public datasets and reports of professional institutions.

In this embodiment, the target domain is the marketing domain. The target-domain corpus is obtained by recalling, by brand words, the full marketing-domain corpus data within a specified period (for example, the last year), and then preprocessing the recalled data through cleaning, word segmentation, stop-word removal and the like. The preprocessed target-domain corpus contains the brand names and professional terms of the marketing domain, so that the trained model can concentrate on the context of the marketing domain, which improves the professionalism of text error correction. In the model training process, an initial neural network model comprising a checking network and an error correction network is first constructed, and the target-domain corpus is then used as sample data to train the checking network. The checking network learns to recognize errors in the sample data; when the checking network recognizes that a word is wrong, its output is used to train the error correction network, which learns to correct the error. The trained checking network and error correction network are then integrated into one model, which is determined as the preset error correction model.

S20, inputting the text to be processed into the checking network, checking the text to be processed, and determining the text to be corrected, wherein the text to be corrected comprises at least one target wrongly written word.

Understandably, the text to be processed is text that needs error recognition and correction, for example an article composed of a plurality of sentences, and is sent from the client to the server. The text to be processed may contain no errors, or may contain spelling errors, grammar errors, missing or redundant words and the like. The text to be corrected is the part of the text to be processed in which errors are determined to exist; it may contain a single wrongly written word or a plurality of wrongly written words. A target wrongly written word is a wrongly written word in the text to be corrected that needs to be corrected.

In this embodiment, to identify errors in the text to be processed, the text to be processed is input into the checking network of the preset error correction model. The checking network processes the sentences in the text to be processed through spelling checking, grammar checking, semantic analysis, context checking and the like, identifies the sentences that contain errors, and thereby obtains the text to be corrected. Errors in the text to be corrected may include spelling errors, grammar errors or semantic errors, but all of these errors are embodied in wrongly written words; that is, the wrongly written word is the smallest unit of error in the text to be corrected. The checking network therefore outputs the text to be corrected and marks all target wrongly written words in it.

S30, acquiring a preset replacement dictionary, and carrying out replacement processing on the target wrongly written word based on the preset replacement dictionary to obtain a dictionary replacement text corresponding to the text to be corrected.

The preset replacement dictionary is a preset dictionary data table containing initial words and the replacement words corresponding to them. Depending on the relationship between the initial words and the replacement words, it may include a homonym dictionary, a similar word dictionary, a complex word dictionary, a common error word dictionary and the like. During text error correction, the preset replacement dictionary is used to look up and replace the target wrongly written words in the text to be corrected: when an initial word identical to a target wrongly written word is found in the preset replacement dictionary, the replacement word corresponding to that initial word is taken as the replacement word matching the target wrongly written word, and the target wrongly written word is replaced with it, which completes the correction of that word. The dictionary replacement text is the text obtained after one or more target wrongly written words in the text to be corrected have been replaced with their matching replacement words from the preset replacement dictionary. For example, suppose one target wrongly written word in the text to be corrected is "obtained" in "obtained object", and the preset replacement dictionary contains a group of dictionary data consisting of the initial word "obtained" and its corresponding replacement word "good". The target wrongly written word "obtained" is then replaced with the replacement word "good", so that the dictionary replacement text corresponding to the text to be corrected no longer reads "obtained object" but "good".

S40, if the dictionary replacement text contains a non-replaced wrongly written word, inputting the dictionary replacement text into the error correction network, and performing error correction processing on the dictionary replacement text to obtain a candidate word set corresponding to the non-replaced wrongly written word.

The dictionary replacement text is the text to be corrected after replacement processing based on the preset replacement dictionary. A target wrongly written word can be replaced only when a corresponding replacement word can be found in the preset replacement dictionary. Because the initial words and replacement words recorded in the preset replacement dictionary are limited, a target wrongly written word for which no replacement word can be found cannot be replaced. When the text to be corrected contains a plurality of target wrongly written words and only some of them are replaced based on the preset replacement dictionary, the remaining target wrongly written words are left uncorrected, and the dictionary replacement text then contains non-replaced wrongly written words. A non-replaced wrongly written word is a wrongly written word in the dictionary replacement text that still needs to be corrected, that is, a target wrongly written word in the text to be corrected that could not be replaced based on the preset replacement dictionary. The candidate word set corresponding to the non-replaced wrongly written words is the set of candidate words output by the error correction network for correcting them. To reflect the diversity and rationality of the correction, each non-replaced wrongly written word corresponds to a preset number (e.g., 3) of candidate words; when there are a plurality of non-replaced wrongly written words, the candidate words of all of them together constitute the candidate word set.

In this embodiment, when it is determined that the dictionary replacement text contains non-replaced wrongly written words, the dictionary replacement text is input into the error correction network, which generates a series of possible candidate words for each non-replaced wrongly written word. The candidate words are predicted from the semantics and context information of the text and are used to correct the non-replaced wrongly written words. The error correction network then combines the candidate words of all non-replaced wrongly written words and outputs the candidate word set corresponding to the non-replaced wrongly written words.
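As a minimal sketch of how a candidate word set might be assembled, the following assumes that the error correction network has already produced a score for each possible replacement at each non-replaced position. The function name `build_candidate_set`, the score dictionary and the example values are hypothetical and are not specified by the source; only the preset number of candidates (3) comes from the description.

```python
def build_candidate_set(scores_by_position, k=3):
    """Keep the k highest-scoring candidates for each non-replaced position.

    scores_by_position maps a text position to a dict of
    {candidate: score}; the scores are assumed to come from the
    error correction network (hypothetical input format).
    """
    candidate_set = {}
    for position, scores in scores_by_position.items():
        # Rank candidates from highest to lowest score.
        ranked = sorted(scores, key=scores.get, reverse=True)
        candidate_set[position] = ranked[:k]
    return candidate_set


# Hypothetical scores for one non-replaced position (index 9).
example_scores = {9: {"o": 0.81, "a": 0.10, "u": 0.05, "e": 0.04}}
candidates = build_candidate_set(example_scores)
```

Keeping the candidates ranked makes it possible for a later step to default to the first entry when no explicit selection is given.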

S50, when a candidate word selection instruction is received, determining a target replacement word corresponding to the non-replaced wrongly written word in the candidate word set according to the candidate word selection instruction, and generating an error-corrected text to be processed according to the target replacement word and the dictionary replacement text.

Understandably, the candidate word selection instruction is an instruction for selecting one or more candidate words from the candidate word set. When the dictionary replacement text contains only one non-replaced wrongly written word, the candidate word set contains only the candidate words corresponding to that word, and the candidate word selection instruction selects one candidate word from the set as its target replacement word. When the dictionary replacement text contains a plurality of non-replaced wrongly written words, the candidate word set contains the candidate words corresponding to each of them, and the candidate word selection instruction selects as many candidate words as there are non-replaced wrongly written words, one target replacement word for each. A candidate word selection instruction may be generated manually from a user's selection operation, or automatically according to a selection rule of the system (e.g., selecting the first-ranked candidate by default). The target replacement word is the candidate word selected according to the candidate word selection instruction and used to correct the non-replaced wrongly written word.

In this embodiment, when the candidate word selection instruction is received, the target replacement word corresponding to each non-replaced wrongly written word is determined in the candidate word set according to the instruction. The non-replaced wrongly written words in the dictionary replacement text are then replaced according to the correspondence between target replacement words and non-replaced wrongly written words, which generates the error-corrected text to be processed. The error-corrected text to be processed no longer contains any wrongly written word that needs correction.
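The selection step can be sketched as follows, assuming single-character wrongly written words identified by their position in the text, and a candidate word set that is already ranked. The function `apply_selection` and the shape of the `selection` argument are illustrative assumptions; the source only states that the instruction may come from a user or from a default rule such as taking the first-ranked candidate.

```python
from typing import Optional


def apply_selection(dictionary_replacement_text: str,
                    candidate_set: dict,
                    selection: Optional[dict] = None) -> str:
    """Replace each non-replaced position with its target replacement word.

    candidate_set: {position: [candidates ranked best first]}.
    selection:     optional {position: chosen candidate}, standing in for
                   a manual candidate word selection instruction. Positions
                   missing from it fall back to the first-ranked candidate.
    """
    selection = selection or {}
    chars = list(dictionary_replacement_text)
    for position, ranked in candidate_set.items():
        chars[position] = selection.get(position, ranked[0])
    return "".join(chars)


# Automatic rule: take the first-ranked candidate.
auto_result = apply_selection("d0g", {1: ["o", "a", "i"]})
# Manual instruction: the user picks "i" for position 1.
manual_result = apply_selection("d0g", {1: ["o", "a", "i"]}, {1: "i"})
```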

The preset error correction model of this embodiment is a neural network model trained on a target-domain corpus, so the model does not need to be trained on the full data, which saves data resources. At the same time, the trained model can focus on text error correction in the specific domain, which improves the professionalism of the correction and its applicability to that domain. In addition, the text to be corrected is determined by the checking network of the preset error correction model, some of the wrongly written words in the text to be corrected are replaced using the preset replacement dictionary, and the remaining wrongly written words that the preset replacement dictionary cannot handle are then corrected by the error correction network of the preset error correction model. This improves the ability of the correction to understand semantic relationships, keeps the grammar and syntactic structure correct, avoids errors and omissions, and improves the fluency and accuracy of the corrected text as a whole.

In an embodiment, in step S20, checking the text to be processed and determining the text to be corrected includes:

S201, performing fluency recognition on all sentences in the text to be processed through the checking network to obtain a sentence fluency score for each sentence;

S202, determining the set of sentences whose sentence fluency scores are smaller than a preset score threshold as the text to be corrected;

S203, evaluating the text to be corrected by an N-gram method, and determining the target wrongly written words in the text to be corrected.

Understandably, the checking network is a neural network based on an N-gram language model algorithm. The checking network checks the text to be processed in two steps: it first finds the sentences that may contain errors, namely the text to be corrected, through fluency recognition, and then uses the N-gram method to locate the target wrongly written words in the text to be corrected.

Fluency recognition can be implemented with various machine learning algorithms (such as naive Bayes, support vector machines or deep learning) trained on a dataset labeled with fluency, so that the model learns to distinguish fluent text from non-fluent text. The sentence fluency score is a score that characterizes how fluent a sentence in the text to be processed is. The preset score threshold is a preset sentence fluency score threshold used to judge whether a sentence is fluent. Wrongly written words can be located using an N-gram language model algorithm; N-gram is an algorithm based on a statistical language model. The basic idea of N-gram is to slide a window of size N (e.g., N is 3) over the content of the text byte by byte, forming a sequence of byte fragments of length N. Each byte fragment is called a gram; the occurrence frequencies of all grams are counted and filtered according to a preset threshold to form a key gram list. The N-gram method is a context-based language model that takes into account the collocation relationships and frequencies of adjacent words, and can therefore determine wrongly written words from the context of the words in the text.
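The gram counting and filtering described above can be sketched as follows. This is an illustrative sketch only: it slides the window over Python string characters rather than raw bytes, and the function names and the `min_count` parameter (standing in for the preset threshold) are assumptions not taken from the source.

```python
from collections import Counter


def extract_ngrams(text: str, n: int = 3) -> list:
    """Slide a window of size n over the text, one position at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def key_gram_list(corpus: list, n: int = 3, min_count: int = 2) -> dict:
    """Count every gram in the corpus and keep those at or above min_count."""
    counts = Counter(
        gram for sentence in corpus for gram in extract_ngrams(sentence, n)
    )
    return {gram: count for gram, count in counts.items() if count >= min_count}


grams = extract_ngrams("abcd", 3)
key_grams = key_gram_list(["abcd", "abce"], n=3, min_count=2)
```

Here `"abc"` occurs in both sentences and is therefore the only gram that survives the frequency filter.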

In one embodiment, the fluency of a sentence is represented by a sentence fluency score between 0 and 1, where 0 indicates that the sentence is not fluent and 1 indicates that it is fluent. The preset score threshold is set to 0.75. The checking network performs fluency recognition on the sentences in the text to be processed to obtain their sentence fluency scores, and each score is compared with the preset score threshold. When the sentence fluency score of a sentence is smaller than 0.75, the sentence is not fluent, which means that it contains grammar or semantic errors; when the sentence fluency score is greater than or equal to 0.75, the sentence is fluent. Among the sentences in the text to be processed, the set of sentences whose sentence fluency scores are smaller than the preset score threshold is determined as the text to be corrected. The text to be corrected is then evaluated by the N-gram method: when N is 3, 3 consecutive words are taken as one N-gram, the window slides backwards by one word and the next 3 words are taken as the next N-gram, and so on until the end of the text to be corrected. The target wrongly written words in the text to be corrected are determined from the collocation relationships and frequencies of adjacent words.
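The two checking steps can be sketched together as follows, using the 0.75 score threshold and the window size of 3 from the description. The scoring callable, the gram counts and the rule for flagging a position (every window covering it is rarer than `min_count`) are assumptions made for illustration; the source does not specify how the checking network turns gram frequencies into a decision.

```python
from collections import Counter
from typing import Callable, Mapping


def sentences_to_correct(sentences: list,
                         fluency_score: Callable[[str], float],
                         threshold: float = 0.75) -> list:
    """Keep the sentences whose score is below the preset threshold."""
    return [s for s in sentences if fluency_score(s) < threshold]


def locate_suspects(sentence: str,
                    gram_counts: Mapping[str, int],
                    n: int = 3,
                    min_count: int = 1) -> list:
    """Flag positions for which every covering n-gram is rarer than min_count."""
    last_start = len(sentence) - n
    suspects = []
    for i in range(len(sentence)):
        # Start indices of all windows of size n that contain position i.
        starts = range(max(0, i - n + 1), min(i, last_start) + 1)
        if starts and all(
            gram_counts.get(sentence[s:s + n], 0) < min_count for s in starts
        ):
            suspects.append(i)
    return suspects


# Gram counts from a tiny hypothetical reference corpus.
reference = "the cat sat"
counts = Counter(reference[i:i + 3] for i in range(len(reference) - 2))

# Hypothetical fluency scores standing in for the checking network's output.
scores = {"the cat sat": 0.92, "the cxt sat": 0.41}
flagged = sentences_to_correct(list(scores), scores.get)
positions = [locate_suspects(s, counts) for s in flagged]
```

In this toy example only the second sentence falls below the threshold, and the only position whose three covering windows are all unseen is index 5, the character "x".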

The checking network of this embodiment determines the non-fluent sentences in the text to be processed as the text to be corrected through fluency recognition, which verifies the rationality and fluency of the sentence structure. The checking network also processes the text to be corrected by the N-gram method, so that the wrongly written words that need to be corrected can be determined quickly and accurately.

In an embodiment, in step S30, carrying out replacement processing on the target wrongly written word based on the preset replacement dictionary to obtain the dictionary replacement text corresponding to the text to be corrected includes:

S301, judging whether a to-be-selected replacement word matching each target wrongly written word can be found in the preset replacement dictionary;

S302, if a to-be-selected replacement word matching the target wrongly written word is found, replacing the target wrongly written word with the to-be-selected replacement word to obtain a replaced text to be corrected;

S303, if no to-be-selected replacement word matching the target wrongly written word can be found, determining the target wrongly written word as a non-replaced wrongly written word;

S304, generating the dictionary replacement text according to the replaced text to be corrected and all the non-replaced wrongly written words.

After the checking network outputs the text to be corrected, the text to be corrected is not input directly into the error correction network. Instead, a to-be-selected replacement word matching each target wrongly written word is first looked up in the preset replacement dictionary, and the target wrongly written word is replaced with it to obtain the dictionary replacement text corresponding to the text to be corrected. The preset replacement dictionary consists of a plurality of groups of mutually corresponding initial words and replacement words, and a to-be-selected replacement word is a replacement word in the preset replacement dictionary that is waiting to be selected for correcting a target wrongly written word.

In an embodiment, when the text to be corrected contains a plurality of target wrongly written words, it is judged one by one whether a to-be-selected replacement word matching each target wrongly written word can be found in the preset replacement dictionary. If an initial word identical to the current target wrongly written word is found in the preset replacement dictionary, a matching to-be-selected replacement word can be found: the replacement word corresponding to that initial word is determined as the to-be-selected replacement word, and the target wrongly written word is replaced with it to obtain the replaced text to be corrected. If no initial word identical to the current target wrongly written word is found in the preset replacement dictionary, no matching to-be-selected replacement word can be found, and the current target wrongly written word is marked as a non-replaced wrongly written word. After all target wrongly written words in the text to be corrected have been looked up, the replaced text to be corrected and all the non-replaced wrongly written words are combined to obtain the dictionary replacement text. The dictionary replacement text thus contains both the correct words that were substituted in and the wrongly written words that were not replaced.
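Steps S301 to S304 can be sketched as a dictionary lookup over the marked positions. The sketch assumes single-character wrongly written words identified by position and a dictionary with exactly one replacement per initial word; the function name and the toy data are hypothetical.

```python
def dictionary_replace(text_to_correct: str,
                       target_positions: list,
                       replacement_dict: dict):
    """Replace marked positions found in the dictionary; collect the rest.

    Returns the dictionary replacement text and the positions of the
    non-replaced wrongly written words.
    """
    chars = list(text_to_correct)
    non_replaced = []
    for position in target_positions:
        wrong = chars[position]
        if wrong in replacement_dict:          # S301/S302: match found
            chars[position] = replacement_dict[wrong]
        else:                                  # S303: no match, keep for later
            non_replaced.append(position)
    return "".join(chars), non_replaced        # S304


# Toy example: "@" is in the dictionary, "0" is not.
replaced_text, remaining = dictionary_replace(
    "c@t and d0g", [1, 9], {"@": "a"}
)
```

An empty `remaining` list corresponds to the case in which no further processing by the error correction network is needed.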

In another embodiment, when an initial word identical to the current target wrongly written word is found in the preset replacement dictionary and only one replacement word corresponds to that initial word, the replacement word is directly determined as the to-be-selected replacement word matching the target wrongly written word. When an initial word identical to the current target wrongly written word is found and a plurality of replacement words correspond to it, the word adjacent to the target wrongly written word is first combined with each replacement word to generate the possible combined words, and fluency recognition is performed on the combined words to verify them, which increases the diversity of error correction. The replacement word whose combined word is the most fluent is then used for the replacement. For example, suppose one target wrongly written word in the text to be corrected is "obtained" in "obtained object", and the preset replacement dictionary contains a group of dictionary data consisting of the initial word "obtained" and its corresponding replacement words "goods", "accidents" and "confusion". Fluency recognition is performed on the generated combined words "goods", "accidents" and "confusion"; if the verification shows that "goods" has the highest fluency, "goods" is determined as the to-be-selected replacement word matching the target wrongly written word.
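Choosing among several replacement words can be sketched as scoring each combination with the adjacent word and keeping the best one. The scoring callable and its values are hypothetical stand-ins for the fluency recognition step, and joining the replacement and the neighbor by simple concatenation is an assumption made for this sketch.

```python
from typing import Callable


def choose_replacement(replacements: list,
                       neighbor: str,
                       fluency_score: Callable[[str], float]) -> str:
    """Return the replacement whose combination with the neighbor scores highest."""
    if len(replacements) == 1:
        # A single replacement word is used directly, without scoring.
        return replacements[0]
    return max(replacements, key=lambda r: fluency_score(r + neighbor))


# Hypothetical fluency scores for the three combined words.
combo_scores = {"Ax": 0.91, "Bx": 0.12, "Cx": 0.08}
best = choose_replacement(
    ["A", "B", "C"], "x", lambda w: combo_scores.get(w, 0.0)
)
```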

In this embodiment, the preset replacement dictionary is searched for a to-be-selected replacement word matching each target misplaced word. When one is found, the misplaced word in the text to be corrected is corrected by replacement; when none is found, the misplaced word is retained. This ensures the integrity and accuracy of the processing and reduces the workload of subsequent error correction.

In an embodiment, after step S30, that is, after performing replacement processing on the target misplaced word based on the preset replacement dictionary to obtain the dictionary replacement text corresponding to the text to be corrected, the method further includes:

S305, if the dictionary replacement text does not contain any unreplaced misplaced word, determining the dictionary replacement text as the corrected text to be processed.

Understandably, after the replacement processing based on the preset replacement dictionary yields the dictionary replacement text corresponding to the text to be corrected, it is necessary to determine whether the dictionary replacement text contains any unreplaced misplaced word. If it contains none, every target misplaced word in the text to be corrected has been replaced based on the preset replacement dictionary and nothing remains to be corrected, so the dictionary replacement text is determined as the corrected text to be processed, that is, the correct text corresponding to the text to be processed.

In this embodiment, when the dictionary replacement text is judged to contain no unreplaced misplaced word, it is directly determined as the corrected text to be processed. This avoids redundant processing of the dictionary replacement text by the error correction network and improves the efficiency of text error correction.

In an embodiment, the error correction network comprises a BERT model; step S40, that is, performing error correction processing on the dictionary replacement text to obtain a candidate word set corresponding to the unreplaced misplaced word, includes:

S401, performing mask processing on each unreplaced misplaced word through the BERT model to obtain mask text;

S402, performing prediction processing on the mask text to obtain a plurality of initial candidate words corresponding to each unreplaced misplaced word;

S403, screening all the initial candidate words to determine the candidate word set corresponding to the unreplaced misplaced word.

Understandably, the error correction network is a neural network based on the BERT model. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model. During training it uses a masked language model (MLM) objective: certain positions in the input sequence are masked at random and the model predicts the masked positions, so the predictions reflect the surrounding context. The error correction network processes the dictionary replacement text in two steps. First, the masking capability of the BERT model is used to mask and predict each misplaced word, yielding context-aware candidate correction suggestions, namely the initial candidate words. Second, all the initial candidate words are screened to determine the candidate word set corresponding to the unreplaced misplaced word. The mask text is the dictionary replacement text in which the unreplaced misplaced words have been masked. The initial candidate words are the predictions output by the BERT model for a masked unreplaced misplaced word, and one unreplaced misplaced word corresponds to a plurality of initial candidate words.

In one embodiment, after the dictionary replacement text is input into the error correction network, the BERT model in the error correction network masks each unreplaced misplaced word to generate the mask text. The BERT model then predicts the masked positions from the context information to obtain a plurality of initial candidate words corresponding to each unreplaced misplaced word. To make it easier to select a suitable word from these initial candidate words, all the initial candidate words are screened (for example, only a specified number of initial candidate words is retained for each unreplaced misplaced word), and the error correction network outputs the candidate word set corresponding to the unreplaced misplaced word according to the screened initial candidate words.

In this embodiment, the error correction network processes the unreplaced misplaced words through the BERT model to obtain the initial candidate words, so that context-aware candidate correction suggestions are generated and the grammatical accuracy of text error correction is improved. In addition, screening the initial candidate words reduces the size of the candidate word set, which makes the subsequent selection of a suitable candidate word easier.
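Steps S401 and S402 can be sketched as follows. This is an illustration under stated assumptions rather than the embodiment's implementation: the text is tokenized, `[MASK]` is the mask token found in BERT vocabularies, and the masked language model is represented by a hypothetical callable (`toy_predictor`) that returns a ranked word list for a masked position. A real system would query the BERT model at that point.

```python
from typing import Callable, Dict, List

MASK_TOKEN = "[MASK]"  # mask token used by BERT vocabularies

# Stand-in type for a masked language model:
# (masked tokens, masked position) -> words ranked from most to least likely.
Predictor = Callable[[List[str], int], List[str]]


def mask_tokens(tokens: List[str], unreplaced: List[int]) -> List[str]:
    """S401: hide every unreplaced misplaced word behind the mask token."""
    positions = set(unreplaced)
    return [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]


def predict_initial_candidates(
    tokens: List[str],
    unreplaced: List[int],
    predictor: Predictor,
    top_n: int = 5,
) -> Dict[int, List[str]]:
    """S402: collect the top predictions for each masked position."""
    masked = mask_tokens(tokens, unreplaced)
    return {i: predictor(masked, i)[:top_n] for i in unreplaced}


def toy_predictor(masked: List[str], position: int) -> List[str]:
    # Hypothetical fixed ranking used only to make the sketch runnable.
    return ["customer", "consumer", "client", "market", "user", "buyer"]


tokens = ["marketing", "plan", "for", "custmer", "growth"]
masked_text = mask_tokens(tokens, [3])
initial_candidates = predict_initial_candidates(tokens, [3], toy_predictor, top_n=3)
print(masked_text)         # ['marketing', 'plan', 'for', '[MASK]', 'growth']
print(initial_candidates)  # {3: ['customer', 'consumer', 'client']}
```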

In one embodiment, step S403, that is, screening all the initial candidate words to determine the candidate word set corresponding to the unreplaced misplaced word, includes:

S4031, acquiring a plurality of initial candidate words corresponding to any one of the unreplaced misplaced words;

S4032, replacing the unreplaced misplaced word with each initial candidate word, and performing smoothness recognition processing on the replaced dictionary replacement text to obtain a candidate word smoothness score corresponding to each initial candidate word;

S4033, determining a candidate word group corresponding to the unreplaced misplaced word according to the candidate word smoothness scores, wherein the candidate word group comprises a preset number of initial candidate words;

S4034, generating the candidate word set according to the candidate word groups corresponding to all the unreplaced misplaced words.

Understandably, the candidate word smoothness score characterizes the smoothness of the dictionary replacement text after an unreplaced misplaced word has been replaced by an initial candidate word. The candidate word group corresponding to an unreplaced misplaced word is a set of a preset number of candidate words selected from all the initial candidate words corresponding to that unreplaced misplaced word; the preset number can be a system default (for example, 3) or adjusted according to actual needs. Each unreplaced misplaced word corresponds to one candidate word group, and the candidate word set includes all the candidate word groups together with the correspondence between each unreplaced misplaced word and its candidate word group, so that the target replacement word corresponding to the current unreplaced misplaced word can be determined from the candidate word set according to the received candidate word selection instruction.

In an embodiment, the dictionary replacement text is input into the error correction network. After the BERT model outputs a plurality of initial candidate words for each unreplaced misplaced word, the error correction network does not output all of them directly; instead it screens them based on smoothness recognition, and the candidate word set obtained after screening is the output of the error correction network. For each unreplaced misplaced word, the error correction network acquires the initial candidate words output by the BERT model for that word, replaces the word with each initial candidate word in turn, and performs smoothness recognition processing on the replaced dictionary replacement text (for example, using a trained language model or a semantic analysis tool to evaluate the smoothness of the text) to obtain the candidate word smoothness score corresponding to each initial candidate word. All the initial candidate words are then sorted by candidate word smoothness score; when the preset number is 3, the three initial candidate words ranked highest are selected as the candidate word group corresponding to the unreplaced misplaced word. The candidate word group for every other unreplaced misplaced word is obtained in the same way, and the candidate word set is generated from all the candidate word groups.

In this embodiment, the error correction network obtains the candidate word smoothness scores by performing smoothness recognition after replacing the unreplaced misplaced word, and selects the candidate word set from the initial candidate words based on those scores. This improves screening accuracy, ensures the quality of the candidate word set, and benefits the grammatical accuracy of text error correction.
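The screening in steps S4031 to S4033 can be sketched as a score-and-rank routine. The sketch assumes tokenized text and a smoothness scorer supplied as a function over the whole token list; `screen_candidates`, `toy_smoothness` and the score table are hypothetical, and a higher score is assumed to mean a smoother text.

```python
from typing import Callable, List


def screen_candidates(
    tokens: List[str],
    position: int,
    initial_candidates: List[str],
    smoothness: Callable[[List[str]], float],
    keep: int = 3,
) -> List[str]:
    """Score the text with each candidate substituted and keep the best ones."""
    scored = []
    for candidate in initial_candidates:
        trial = list(tokens)          # copy, then substitute the candidate
        trial[position] = candidate
        scored.append((smoothness(trial), candidate))
    # Sort by score, highest first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:keep]]


# Hypothetical scores standing in for a trained smoothness model.
TOY_SCORES = {"customer": 0.9, "consumer": 0.7, "client": 0.6, "cucumber": 0.1}


def toy_smoothness(sentence: List[str]) -> float:
    return TOY_SCORES.get(sentence[3], 0.0)


candidate_group = screen_candidates(
    ["marketing", "plan", "for", "custmer", "growth"],
    3,
    ["cucumber", "client", "customer", "consumer"],
    toy_smoothness,
)
print(candidate_group)  # ['customer', 'consumer', 'client']
```

Repeating the call for every unreplaced misplaced word and storing the results in a mapping from position to candidate word group would give the candidate word set of step S4034.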

In one embodiment, after step S50, that is, after generating the corrected text to be processed according to the target replacement word and the dictionary replacement text, the method further includes:

S501, acquiring a new corpus of the target field according to a preset time period;

S502, inputting the new corpus of the target field as a training sample into the preset error correction model, and adjusting model parameters of the inspection network and the error correction network to obtain an updated preset error correction model;

S503, extracting new words of the target field from the new corpus of the target field, and updating the preset replacement dictionary according to the new words of the target field to obtain an updated preset replacement dictionary.

Understandably, for the preset error correction model, besides correcting errors of the text to be processed by using the trained preset error correction model, a continuous learning and model optimization process is needed, so that the preset error correction model can continuously adapt to field changes and new data along with the time, and the effectiveness and accuracy of the model are maintained.

The preset time period is a preset fixed time interval, such as daily, weekly or monthly; it can be set to weekly by system default and adjusted according to actual requirements. The new corpus of the target domain is the corpus newly produced in the target domain between the end of the previous time period and the end of the current time period, and it reflects the development and change of the target domain during the current time period. For example, when the target domain is the marketing domain and the preset time period is one week, the system automatically collects the new corpus of the marketing domain (for example, new professional terms) from different sources (such as news websites, professional databases and social media) once a week.

In an embodiment, the server inputs the collected new corpus of the target domain as a training sample into a preset error correction model, and adjusts parameters of the model (such as parameters of the inspection network and parameters of the error correction network) through training, thereby obtaining an updated preset error correction model. The adjustment of the model parameters is achieved by a certain training algorithm (e.g. gradient descent) so that the model can better adapt to the new training samples. In addition to adjusting model parameters, the server also needs to extract new target domain words (emerging vocabularies or new technical terms) of the target domain from the new target domain corpus, and add the new target domain words to the preset replacement dictionary for updating or expanding the existing preset replacement dictionary.

Based on a real-time updating mechanism, the network parameters of the preset error correction model are adjusted by periodically utilizing the new corpus in the target field, and the preset replacement dictionary is updated, so that the real-time adaptability of the model is ensured, the new change and vocabulary growth in the target field are adapted, and the accuracy and error correction effect of text error correction are improved.
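Two small pieces of the periodic update can be sketched as follows: deciding whether the preset time period has elapsed, and extracting words from the new corpus that are not yet known. This is a sketch under assumptions only. The names `update_due` and `extract_new_words`, the weekly default and the simple vocabulary difference are illustrative; model fine-tuning (S502) and the way new words are written into the preset replacement dictionary are not shown because the embodiment does not detail them.

```python
from datetime import datetime, timedelta
from typing import Iterable, Set


def update_due(
    last_update: datetime,
    now: datetime,
    period: timedelta = timedelta(weeks=1),
) -> bool:
    """Return True once a full preset time period has passed."""
    return now - last_update >= period


def extract_new_words(corpus_tokens: Iterable[str], known_vocab: Set[str]) -> Set[str]:
    """Treat any token absent from the known vocabulary as a new domain word."""
    return {token for token in corpus_tokens if token not in known_vocab}


# Toy data (hypothetical dates and vocabulary).
due = update_due(datetime(2024, 1, 1), datetime(2024, 1, 8))
new_words = extract_new_words(["seo", "funnel", "seo", "ugc"], {"funnel"})
print(due)                # True
print(sorted(new_words))  # ['seo', 'ugc']
```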

It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation process of the embodiments of the present invention in any way.

In one embodiment, a text error correction apparatus is provided, which corresponds one-to-one to the text error correction method in the above embodiments. As shown in fig. 2, the text error correction apparatus includes a model acquisition module 10, an inspection processing module 20, a dictionary processing module 30, an error correction processing module 40, and a replacement word determination module 50. The functional modules are described in detail as follows:

the model acquisition module 10 is configured to acquire a preset error correction model, where the preset error correction model is a neural network model that is obtained based on target domain corpus training and includes an inspection network and an error correction network;

the inspection processing module 20 is configured to input a text to be processed into the inspection network, perform inspection processing on the text to be processed, and determine a text to be corrected, where the text to be corrected includes at least one target misplaced word;

the dictionary processing module 30 is configured to obtain a preset replacement dictionary, and perform replacement processing on the target misplaced word based on the preset replacement dictionary to obtain a dictionary replacement text corresponding to the text to be corrected;

the error correction processing module 40 is configured to, if there is an unreplaced misplaced word in the dictionary replacement text, input the dictionary replacement text into the error correction network, and perform error correction processing on the dictionary replacement text to obtain a candidate word set corresponding to the unreplaced misplaced word;

the replacement word determination module 50 is configured to, when a candidate word selection instruction is received, determine a target replacement word corresponding to the unreplaced misplaced word in the candidate word set according to the candidate word selection instruction, and generate the corrected text to be processed according to the target replacement word and the dictionary replacement text.
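How the modules hand data to one another can be sketched with a small class whose fields are plain callables. This is an illustrative composition, not the apparatus itself: the class name `TextCorrector`, the field names and the toy lambdas are hypothetical, constructing the object stands in for the model acquisition module 10, and returning the input unchanged when nothing is flagged is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class TextCorrector:
    # Module 20: tokens -> positions of target misplaced words.
    inspect: Callable[[List[str]], List[int]]
    # Module 30: (tokens, positions) -> (replaced tokens, unreplaced positions).
    dictionary_replace: Callable[[List[str], List[int]], Tuple[List[str], List[int]]]
    # Module 40: (tokens, unreplaced positions) -> candidate word set.
    correct: Callable[[List[str], List[int]], Dict[int, List[str]]]
    # Module 50: candidate word set -> chosen word per position (selection instruction).
    select: Callable[[Dict[int, List[str]]], Dict[int, str]]

    def run(self, tokens: List[str]) -> List[str]:
        targets = self.inspect(tokens)
        if not targets:
            return list(tokens)  # assumption: nothing flagged, nothing to correct
        replaced, unreplaced = self.dictionary_replace(tokens, targets)
        if not unreplaced:
            return replaced  # dictionary alone fixed everything; skip the network
        candidates = self.correct(replaced, unreplaced)
        for position, word in self.select(candidates).items():
            replaced[position] = word
        return replaced


# Toy wiring (hypothetical data) to show the control flow.
corrector = TextCorrector(
    inspect=lambda t: [i for i, w in enumerate(t) if w in {"teh", "wrld"}],
    dictionary_replace=lambda t, idx: (
        ["the" if w == "teh" else w for w in t],
        [i for i in idx if t[i] != "teh"],
    ),
    correct=lambda t, idx: {i: ["world", "word"] for i in idx},
    select=lambda cands: {i: group[0] for i, group in cands.items()},
)
result = corrector.run(["teh", "wrld"])
print(result)  # ['the', 'world']
```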

In one embodiment, the inspection processing module 20 includes:

the sentence smoothness recognition unit is used for performing smoothness recognition on all sentences in the text to be processed through the inspection network to obtain a sentence smoothness score for each sentence;

the text to be corrected determining unit is used for determining the set of sentences whose sentence smoothness scores are smaller than a preset score threshold as the text to be corrected;

the target misplaced word determining unit is used for evaluating the text to be corrected by an N-gram method and determining the target misplaced word in the text to be corrected.
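The three units above can be illustrated with a toy sketch. It assumes a sentence-level smoothness scorer supplied as a function and uses bigram counts as the simplest N-gram evaluation: a token is flagged when no bigram it takes part in has been seen. The names `select_sentences` and `flag_by_bigram`, the counts and the flagging rule are hypothetical; the N-gram evaluation used by the embodiment may differ.

```python
from collections import Counter
from typing import Callable, List, Tuple


def select_sentences(
    sentences: List[str],
    smoothness: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Keep the sentences whose smoothness score is below the threshold."""
    return [s for s in sentences if smoothness(s) < threshold]


def flag_by_bigram(
    tokens: List[str],
    bigram_counts: "Counter[Tuple[str, str]]",
    min_count: int = 1,
) -> List[int]:
    """Flag tokens that do not appear in any sufficiently frequent bigram."""
    flagged = []
    for i, token in enumerate(tokens):
        left_ok = i > 0 and bigram_counts[(tokens[i - 1], token)] >= min_count
        right_ok = (
            i < len(tokens) - 1
            and bigram_counts[(token, tokens[i + 1])] >= min_count
        )
        if not left_ok and not right_ok:
            flagged.append(i)
    return flagged


# Hypothetical sentence scores and bigram counts from a domain corpus.
toy_scores = {"sentence a": 0.9, "sentence b": 0.2}
to_correct = select_sentences(["sentence a", "sentence b"], toy_scores.get, 0.5)

counts = Counter({
    ("grow", "the"): 4, ("the", "market"): 5,
    ("market", "share"): 3, ("share", "fast"): 2,
})
flagged = flag_by_bigram(["grow", "the", "markte", "share", "fast"], counts)
print(to_correct)  # ['sentence b']
print(flagged)     # [2]
```

With this simple rule a token at the start or end of a sentence has only one neighbor, so it is flagged more readily when that neighbor is itself an error; a practical detector would need to handle such cases.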

In one embodiment, dictionary processing module 30 includes:

the to-be-selected replacement word searching unit is used for judging whether a to-be-selected replacement word matching each target misplaced word can be found in the preset replacement dictionary;

the to-be-selected replacement word replacing unit is used for, if the to-be-selected replacement word matching the target misplaced word is found, replacing the target misplaced word with the to-be-selected replacement word to obtain a replaced text to be corrected;

the unreplaced misplaced word determining unit is used for determining the target misplaced word as an unreplaced misplaced word if no to-be-selected replacement word matching the target misplaced word can be found;

and generating the dictionary replacement text according to the replaced text to be corrected and all the unreplaced misplaced words.

In one embodiment, dictionary processing module 30 further includes:

the corrected text determining unit is used for determining the dictionary replacement text as the corrected text to be processed if the dictionary replacement text does not contain any unreplaced misplaced word.

In one embodiment, the error correction processing module 40 includes:

the mask processing unit is used for performing mask processing on each unreplaced misplaced word through the BERT model to obtain mask text;

the prediction processing unit is used for performing prediction processing on the mask text to obtain a plurality of initial candidate words corresponding to each unreplaced misplaced word;

the candidate word set determining unit is used for screening all the initial candidate words and determining the candidate word set corresponding to the unreplaced misplaced word.

In one embodiment, the error correction processing module 40 further includes:

the initial candidate word obtaining unit is used for obtaining a plurality of initial candidate words corresponding to any one of the unreplaced misplaced words;

the initial candidate word replacement unit is used for replacing the unreplaced misplaced word with each initial candidate word, and performing smoothness recognition processing on the replaced dictionary replacement text to obtain a candidate word smoothness score corresponding to each initial candidate word;

the candidate word group determining unit is used for determining a candidate word group corresponding to the unreplaced misplaced word according to the candidate word smoothness scores, where the candidate word group includes a preset number of initial candidate words;

the candidate word set generating unit is used for generating the candidate word set according to the candidate word groups corresponding to all the unreplaced misplaced words.

In one embodiment, the replacement word determination module 50 includes:

the new corpus acquisition unit is used for acquiring the new corpus of the target field according to a preset time period;

the preset error correction model updating unit is used for inputting the new corpus of the target field as a training sample into the preset error correction model, and adjusting model parameters of the inspection network and the error correction network to obtain an updated preset error correction model;

the preset replacement dictionary updating unit is used for extracting new words of the target field from the new corpus of the target field, and updating the preset replacement dictionary according to the new words of the target field to obtain an updated preset replacement dictionary.

For specific limitations of the text error correction apparatus, reference may be made to the above limitations of the text error correction method, which are not repeated here. Each module in the above text error correction apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a readable storage medium, an internal memory. The readable storage medium stores an operating system, computer readable instructions, and a database. The internal memory provides an environment for the execution of an operating system and computer-readable instructions in a readable storage medium. The database of the computer device is used for storing data related to the text error correction method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer readable instructions when executed by a processor implement a text error correction method. The readable storage medium provided by the present embodiment includes a nonvolatile readable storage medium and a volatile readable storage medium.

In one embodiment, a computer device is provided that includes a memory, a processor, and computer readable instructions stored on the memory and executable on the processor, when executing the computer readable instructions, performing the steps of:

In one embodiment, one or more computer-readable storage media are provided having computer-readable instructions stored thereon, the readable storage media provided by the present embodiment including non-volatile readable storage media and volatile readable storage media. The readable storage medium has stored thereon computer readable instructions which when executed by one or more processors perform the steps of:

Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by computer readable instructions instructing the associated hardware. The computer readable instructions may be stored on a non-volatile readable storage medium or a volatile readable storage medium and, when executed, may include the flows of the above method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.

The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims

1. A method for text correction, comprising:

2. The text error correction method of claim 1, wherein performing inspection processing on the text to be processed to determine the text to be corrected includes:

performing, through the inspection network, smoothness recognition on all sentences in the text to be processed to obtain a sentence smoothness score for each sentence;

determining the set of sentences whose sentence smoothness scores are smaller than a preset score threshold as the text to be corrected;

and evaluating the text to be corrected by an N-gram method, and determining the target misplaced word in the text to be corrected.

3. The text error correction method according to claim 1, wherein performing replacement processing on the target misplaced word based on the preset replacement dictionary to obtain the dictionary replacement text corresponding to the text to be corrected includes:

judging whether a to-be-selected replacement word matching each target misplaced word can be found in the preset replacement dictionary;

if the to-be-selected replacement word matching the target misplaced word is found, replacing the target misplaced word with the to-be-selected replacement word to obtain a replaced text to be corrected;

if no to-be-selected replacement word matching the target misplaced word can be found, determining the target misplaced word as an unreplaced misplaced word;

4. The text error correction method according to claim 1, wherein after performing replacement processing on the target misplaced word based on the preset replacement dictionary to obtain the dictionary replacement text corresponding to the text to be corrected, the method further comprises:

if the dictionary replacement text does not contain any unreplaced misplaced word, determining the dictionary replacement text as the corrected text to be processed.

5. The text error correction method of claim 1, wherein the error correction network comprises a BERT model;

performing error correction processing on the dictionary replacement text to obtain a candidate word set corresponding to the unreplaced misplaced word includes:

performing mask processing on each unreplaced misplaced word through the BERT model to obtain mask text;

performing prediction processing on the mask text to obtain a plurality of initial candidate words corresponding to each unreplaced misplaced word;

and screening all the initial candidate words to determine the candidate word set corresponding to the unreplaced misplaced word.

6. The text error correction method of claim 5, wherein screening all the initial candidate words to determine the candidate word set corresponding to the unreplaced misplaced word comprises:

acquiring a plurality of initial candidate words corresponding to any one of the unreplaced misplaced words;

replacing the unreplaced misplaced word with each initial candidate word, and performing smoothness recognition processing on the replaced dictionary replacement text to obtain a candidate word smoothness score corresponding to each initial candidate word;

determining a candidate word group corresponding to the unreplaced misplaced word according to the candidate word smoothness scores, wherein the candidate word group comprises a preset number of initial candidate words;

and generating the candidate word set according to the candidate word groups corresponding to all the unreplaced misplaced words.

7. The text error correction method of claim 1, further comprising, after generating the corrected text to be processed according to the target replacement word and the dictionary replacement text:

acquiring a new corpus of the target field according to a preset time period;

inputting the new corpus of the target field as a training sample into the preset error correction model, and adjusting model parameters of the inspection network and the error correction network to obtain an updated preset error correction model;

extracting new words of the target field from the new corpus of the target field, and updating the preset replacement dictionary according to the new words of the target field to obtain an updated preset replacement dictionary.

8. A text error correction apparatus, comprising:

9. A computer device comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, implements the text error correction method of any of claims 1 to 7.

10. A computer-readable storage medium storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the text error correction method of any of claims 1-7.