CN111859921A - Text error correction method and device, computer equipment and storage medium


Info

Publication number
CN111859921A
Authority
CN
China
Prior art keywords
word
corrected
candidate
text
suspected
Prior art date
Legal status
Granted
Application number
CN202010650353.1A
Other languages
Chinese (zh)
Other versions
CN111859921B (en)
Inventor
吕海峰
宁义双
宁可
Current Assignee
Kingdee Software China Co Ltd
Original Assignee
Kingdee Software China Co Ltd
Priority date
Filing date
Publication date
Application filed by Kingdee Software China Co Ltd
Priority to CN202010650353.1A
Publication of CN111859921A
Application granted
Publication of CN111859921B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F40/00 Handling natural language data › G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/253 Grammatical analysis; Style critique
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a text error correction method and device, a computer device, and a storage medium. The method comprises the following steps: acquiring a text sentence to be corrected; determining an N-gram probability set of the text sentence through an N-gram language model trained on a pre-constructed positive corpus, the N-gram probability set comprising the N-gram probability of each word in the text sentence; identifying suspected erroneous words in the text sentence according to the N-gram probability set; acquiring a candidate correction word set corresponding to each suspected erroneous word; and screening, according to the N-gram language model, a target correction word for each suspected erroneous word from the candidate correction word set, and replacing each suspected erroneous word in the text sentence with its target correction word to obtain a corrected text sentence. The method improves the accuracy of text error correction.

Description

Text error correction method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technology and natural language processing technology, and in particular, to a text error correction method, apparatus, computer device, and storage medium.
Background
With the development of natural language processing technology, text error correction has emerged and found important applications. For example, in text obtained by speech recognition, errors such as homophones, near-homophones, and other wrongly written characters often occur under the influence of environment, accent, equipment, and other factors, so the erroneous characters in the text need to be corrected by a text error correction technique.
In the conventional technology, text error correction generally relies on word segmentation. However, errors in the word segmentation result easily distort the recognition of erroneous characters or words in the text, which reduces the accuracy of text error correction.
Disclosure of Invention
In view of the above, it is necessary to provide a text error correction method, apparatus, computer device and storage medium capable of improving the accuracy of text error correction.
A text error correction method, the method comprising:
acquiring a text sentence to be corrected;
determining an N-gram probability set of the text sentence through an N-gram language model trained on a pre-constructed positive corpus, the N-gram probability set comprising the N-gram probability of each word in the text sentence;
identifying suspected erroneous words in the text sentence according to the N-gram probability set;
acquiring a candidate correction word set corresponding to each suspected erroneous word; and
screening, according to the N-gram language model, a target correction word corresponding to each suspected erroneous word from the candidate correction word set, and replacing each suspected erroneous word in the text sentence with the corresponding target correction word to obtain a corrected text sentence.
In one embodiment, the method further comprises:
determining a similarity between the corrected text sentence and each document in the positive corpus;
sorting the documents in descending order of similarity and selecting a preset number of the top-ranked documents as candidate documents;
forming a candidate dictionary from the candidate documents; and
when a target correction word that replaced a suspected erroneous word does not exist in the candidate dictionary, restoring the target correction word to the corresponding suspected erroneous word from before error correction.
In one embodiment, the determining the N-gram probability set of the text sentence through the N-gram language model trained on the pre-constructed positive corpus comprises:
determining an N-gram set of the text sentence to be corrected;
determining, through the N-gram language model, the N-gram probability of each candidate item in the N-gram set; and
determining, from the N-gram probabilities of the candidate items, the N-gram probability corresponding to each word in the text sentence, to obtain the N-gram probability set of the text sentence.
In one embodiment, the identifying the suspected erroneous words in the text sentence according to the N-gram probability set comprises:
determining the average, the absolute deviations, and the mean absolute deviation of the N-gram probabilities in the N-gram probability set;
determining a probability critical value corresponding to each N-gram probability in the set from the ratio of its absolute deviation to the mean absolute deviation; and
when an N-gram probability in the set is smaller than the average and its probability critical value is larger than a preset threshold, determining that the word corresponding to that N-gram probability in the text sentence is a suspected erroneous word.
In one embodiment, the acquiring the candidate correction word set corresponding to the suspected erroneous word comprises:
determining a candidate word set corresponding to the suspected erroneous word, the pinyin of each candidate word in the set being the same as or similar to the pinyin of the suspected erroneous word;
sorting the candidate words in descending order of their word frequency in a pre-constructed standard word dictionary, the standard word dictionary being built in advance from the positive corpus and comprising each word in the positive corpus and its word frequency; and
selecting a preset number of the top-ranked candidate words to form the candidate correction word set of the suspected erroneous word.
In one embodiment, the screening, according to the N-gram language model, the target correction word corresponding to the suspected erroneous word from the candidate correction word set, and replacing each suspected erroneous word in the text sentence with the corresponding target correction word to obtain the corrected text sentence comprises:
replacing the suspected erroneous word to be corrected in the text sentence with each candidate correction word in its candidate correction word set in turn, to obtain a candidate text sentence set corresponding to the suspected erroneous word; and
determining, through the N-gram language model, the perplexity of each candidate text sentence in the set, and selecting, according to the perplexities, the corrected text sentence corresponding to the suspected erroneous word from the candidate text sentences, the corrected text sentence being the one in which the suspected erroneous word has been replaced with its target correction word.
In one embodiment, there are multiple suspected erroneous words, and the screening and replacing step further comprises:
selecting, from the remaining suspected erroneous words in the corrected text sentence obtained for the previous suspected erroneous word, a new suspected erroneous word to be corrected; taking that corrected text sentence as the text sentence to be corrected; and returning to the step of replacing the suspected erroneous word to be corrected with each candidate correction word in its candidate correction word set, continuing until no suspected erroneous word remains in the corrected text sentence, to obtain the final corrected text sentence.
In one embodiment, the training step of the N-gram language model comprises:
constructing a positive corpus comprising a plurality of documents that do not require error correction;
preprocessing the positive corpus, the preprocessing comprising at least one of removing noise characters from the positive corpus and adjusting the format of its documents to match the input format of a language model training tool; and
training and generating the N-gram language model through the language model training tool from the preprocessed positive corpus.
A text error correction apparatus, the apparatus comprising:
an acquisition module, configured to acquire a text sentence to be corrected;
a probability determination module, configured to determine an N-gram probability set of the text sentence through an N-gram language model trained on a pre-constructed positive corpus, the set comprising the N-gram probability of each word in the text sentence;
a suspected erroneous word identification module, configured to identify the suspected erroneous words in the text sentence according to the N-gram probability set;
a candidate correction word determination module, configured to acquire the candidate correction word set corresponding to each suspected erroneous word; and
an error correction module, configured to screen, according to the N-gram language model, the target correction word corresponding to each suspected erroneous word from the candidate correction word set, and to replace each suspected erroneous word in the text sentence with the corresponding target correction word to obtain the corrected text sentence.
A computer device comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, causes the processor to perform the steps of the text correction method according to embodiments of the present application.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to perform the steps of a text correction method as described in embodiments of the present application.
According to the text error correction method, apparatus, computer device, and storage medium, the N-gram probability set of the text sentence to be corrected is determined through an N-gram language model trained on a pre-constructed positive corpus, the set comprising the N-gram probability of each word in the text sentence, so suspected erroneous words can be identified from the N-gram probability set: erroneous-word recognition is performed at the level of individual words, without word segmentation. A candidate correction word set is acquired for each suspected erroneous word, the target correction word is screened from the set according to the N-gram language model, and each suspected erroneous word in the text sentence is replaced with its target correction word to obtain the corrected text sentence. Because error correction is based on a word-level language model and does not depend on word segmentation, segmentation errors cannot degrade the error correction result, and the accuracy of text error correction is improved.
Drawings
FIG. 1 is a flow diagram illustrating a text correction method according to one embodiment;
FIG. 2 is a flow diagram illustrating error correction result recall in one embodiment;
FIG. 3 is a schematic flow chart illustrating the determination of bigram probabilities in one embodiment;
FIG. 4 is a flow diagram illustrating the identification of suspected miswords in a textual statement in one embodiment;
FIG. 5 is a flowchart illustrating an embodiment of error correction for suspected erroneous words;
FIG. 6 is a schematic diagram illustrating an overall flowchart of a text correction method according to an embodiment;
FIG. 7 is a block diagram showing the structure of a text correction apparatus according to an embodiment;
FIG. 8 is a block diagram showing the structure of a text correction apparatus in another embodiment;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In an embodiment, as shown in fig. 1, a text error correction method is provided, and this embodiment is illustrated by applying the method to a server, and it is to be understood that the method may also be applied to a terminal, and may also be applied to a system including the terminal and the server, and is implemented by interaction between the terminal and the server. In this embodiment, the method includes the steps of:
S102, obtaining a text sentence to be corrected.
The text sentence is a sentence in text form. The text sentence to be corrected is one that contains erroneous words (for example, homophones, near-homophones, or other wrongly written characters) and therefore requires error correction.
In one embodiment, the text sentence to be corrected may be obtained by converting speech into text through speech recognition. It can be understood that, during speech recognition, under the influence of environmental noise, device performance, accent, and other factors, the resulting text sentence may contain erroneous words such as homophones, near-homophones, or other wrongly written characters.
In another embodiment, the text sentence to be corrected may also be a text sentence obtained by manual input. It is understood that in the process of manually inputting text, the text sentence may contain wrong words due to human input errors and the like.
In other embodiments, the text sentence to be corrected may also be a text sentence obtained through other approaches, which is not limited.
S104, determining an N-gram probability set of the text sentence through an N-gram language model trained based on a pre-constructed positive corpus; the N-gram probability set comprises N-gram probabilities of each word in the text sentence.
The positive corpus is composed of text corpora that do not require error correction, that is, text without erroneous characters or words; it contains a plurality of documents. The N-gram language model is a statistics-based language model. N may be any positive integer and is typically 2 or 3: when N is 2, the model is a 2-gram (bigram) model; when N is 3, it is a 3-gram (trigram) model. An N-gram probability is the probability the N-gram language model assigns to each word of the text sentence after the sentence has been converted into N-gram form. Throughout this application, a word refers to an individual character rather than a multi-character word; for example, '我' ('I') and '们' (a plural suffix) are each individual characters, while '我们' ('we') is a word.
In one embodiment, the positive corpus may be composed of domain-specific text corpora that do not require error correction. For error correction of text sentences in a specific domain, constructing a positive corpus of that domain makes the search space of the error correction process more targeted and can improve accuracy. For example, the specific domain may be an intelligent customer service robot scenario: improving the quality of the speech-recognized text during a dialogue improves the accuracy of the robot's intent recognition in that domain, and in turn the fluency of the dialogue. In other embodiments, the positive corpus may also include text corpora of multiple domains rather than one specific domain.
Specifically, the server may convert the text sentence into N-gram form and then determine, through the N-gram language model, the N-gram probability corresponding to each word in the text sentence.
S106, identifying suspected erroneous words in the text sentence according to the N-gram probability set.
The suspected erroneous words are words in the text sentence that, judging from the N-gram probability set, may be wrong. It is understood that the number of suspected erroneous words in the text sentence may be one or more.
Specifically, the server may determine the suspected erroneous words in the text sentence according to the N-gram probability corresponding to each word in the N-gram probability set.
S108, acquiring a candidate correction word set corresponding to the suspected erroneous word.
A correction word is a word used to replace a suspected erroneous word in the text sentence. A candidate correction word is a candidate for that replacement, and the candidate correction word set is the set of such candidates.
Specifically, the server may first determine a number of candidate words whose pinyin is the same as or similar to that of the suspected erroneous word, and then select the candidate correction words from them according to each candidate word's frequency in the positive corpus.
S110, screening, according to the N-gram language model, the target correction word corresponding to each suspected erroneous word from the candidate correction word set, and replacing each suspected erroneous word in the text sentence with the corresponding target correction word to obtain the corrected text sentence.
The target correction word is the correction word finally selected from the candidate correction word set to replace a suspected erroneous word in the text sentence. The corrected text sentence is obtained by replacing every suspected erroneous word in the text sentence to be corrected with its target correction word.
In one embodiment, when there is a single suspected erroneous word in the text sentence to be corrected, the server directly determines its target correction word and replaces the suspected erroneous word with it to obtain the corrected text sentence.
In another embodiment, when there are multiple suspected erroneous words, the server determines the target correction words one at a time: the text sentence obtained after replacing the previous suspected erroneous word serves as the current text sentence to be corrected when determining and applying the target correction word for the current suspected erroneous word. This repeats until every suspected erroneous word has been replaced with its target correction word, yielding the final corrected text sentence.
In this text error correction method, the N-gram probability set of the text sentence to be corrected is determined through an N-gram language model trained on a pre-constructed positive corpus, so suspected erroneous words can be recognized at the level of individual words without word segmentation; the target correction words are then screened from the candidate correction word sets according to the N-gram language model and substituted into the sentence. Error correction therefore rests on a word-level language model rather than a word segmentation technique, so segmentation errors cannot degrade the result, and accuracy improves. In addition, unlike methods that must construct a negative corpus, this method only requires a positive corpus, avoiding the difficulty of anticipating negative samples (text sentences containing erroneous characters or words) and improving practicality.
In one embodiment, the method further comprises the steps of: determining the similarity between the corrected text sentence and each document in the positive corpus; sorting the documents in descending order of similarity and selecting a preset number of the top-ranked documents as candidate documents; forming a candidate dictionary from the candidate documents; and, when a target correction word that replaced a suspected erroneous word does not exist in the candidate dictionary, restoring it to the corresponding suspected erroneous word from before error correction.
The candidate documents are the documents selected from the positive corpus by similarity and used to form the candidate dictionary. The candidate dictionary is used to judge whether each target correction word, and thus the text correction result, is accurate.
In an embodiment, the preset number of top-ranked documents to select may be set in advance according to the actual situation; the specific number is not limited.
In one embodiment, the server may employ the BM25 algorithm (an algorithm for determining the relevance between search terms and documents) to determine the similarity between the corrected text sentence and each document in the positive corpus. It can be understood that the larger the BM25 score, the greater the relevance, and hence the similarity, between the corrected text sentence and the corresponding document; conversely, the smaller the BM25 score, the smaller the similarity. In other embodiments, the server may use other methods to determine the similarity, which is not limited.
In one embodiment, the candidate dictionary may be composed of all words in each candidate document.
In one embodiment, the server may detect whether each target correction word that replaced a suspected erroneous word exists in the candidate dictionary, and then recall or confirm the correction according to the result. For each target correction word: when it does not exist in the candidate dictionary, the correction is judged wrong and the word is restored to the corresponding suspected erroneous word from before error correction (error correction result recall); when it does exist, the correction is judged right and the word is kept. Once every target correction word in the text sentence has been checked and recalled or confirmed, the resulting text sentence is output.
FIG. 2 shows the recall flow: first, the BM25 score between the corrected text sentence and each document in the positive corpus is computed; the documents are sorted by BM25 score in descending order and the top preset number are output as candidate documents, from which the candidate dictionary is generated. Then each target correction word is checked against the dictionary: if it is present, the correction is correct and the corrected text sentence is output; if not, the correction is wrong and the target correction word is restored to the suspected erroneous word from before error correction.
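The recall step can be summarized in a short sketch. The following Python fragment is a minimal illustration, not the patent's own code: it assumes the positive corpus is available as a list of document strings, uses the third-party rank_bm25 package for the BM25 scores, and records each replacement as an (index, original character) pair; all function and variable names are illustrative.

```python
from rank_bm25 import BM25Okapi  # third-party BM25 implementation

def recall_corrections(corrected, replacements, corpus_docs, top_k=10):
    """Restore corrections that the candidate dictionary does not support.

    corrected: corrected text sentence; replacements: (index, original char)
    pairs recording where suspected erroneous words were replaced.
    """
    bm25 = BM25Okapi([list(doc) for doc in corpus_docs])  # character-level tokens
    scores = bm25.get_scores(list(corrected))             # similarity to each document
    top = sorted(range(len(corpus_docs)),
                 key=lambda i: scores[i], reverse=True)[:top_k]
    candidate_dict = {ch for i in top for ch in corpus_docs[i]}
    chars = list(corrected)
    for index, original in replacements:
        if chars[index] not in candidate_dict:            # correction judged wrong
            chars[index] = original                       # recall: restore the original
    return "".join(chars)
```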
In this embodiment, candidate documents are selected by the similarity between the corrected text sentence and each document in the positive corpus, a candidate dictionary is formed from them, and whether to recall an error correction result is decided by whether the target correction word exists in the candidate dictionary. Recalling mistaken corrections reduces the misjudgment rate of text error correction and improves its accuracy.
In one embodiment, step S104 of determining the N-gram probability set of the text sentence through the N-gram language model trained on the pre-constructed positive corpus specifically comprises the following steps: determining the N-gram set of the text sentence to be corrected; determining, through the N-gram language model, the N-gram probability of each candidate item in the N-gram set; and determining, from these, the N-gram probability corresponding to each word in the text sentence, to obtain the N-gram probability set of the text sentence.
The N-gram set is the set of candidate items produced by expressing the text sentence to be corrected in N-gram form.
In one embodiment, the server performs a character-wise sliding-window operation of size N over the content of the text sentence, generating a number of character segments of length N, each of which is a candidate item in the N-gram set. For example, for the text sentence '明天去北京出差' ('going to Beijing on a business trip tomorrow') with N = 2, the bigram set of the sentence is ['明天', '天去', '去北', '北京', '京出', '出差'].
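A sliding window of this kind is straightforward to implement. The Python sketch below assumes the sentence is a plain string of characters; the function name is illustrative.

```python
def char_ngrams(sentence, n=2):
    """Slide a window of size n over the characters of the sentence."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

# char_ngrams("明天去北京出差", 2)
# -> ['明天', '天去', '去北', '北京', '京出', '出差']
```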
In one embodiment, the server may load the N-gram language model and then determine, from its model file, the N-gram probability of each candidate item in the N-gram set in turn. For each candidate item: it is first looked up directly in the model file; if found, its N-gram probability is taken from the value of the probability field for that entry. If not found, the candidate item is split into (N-1)-grams, and the resulting sub-candidates are looked up; if at least one sub-candidate is found, the N-gram probability of the original candidate item is determined from the values of the probability fields of the sub-candidates found in the model file. If none is found, the candidate item is split into (N-2)-grams and the lookup repeats, and so on, until split sub-candidates are found and the candidate item's N-gram probability is determined from their probability fields.
For example, with N = 2 and the bigram set ['明天', '天去', '去北', '北京', '京出', '出差'], the bigram probability of the candidate item '明天' is determined as follows:
S1, load the binary language model; its model file contains three fields: pro, word, and back_pro.
S2, look up '明天' directly in the model file. If it is found, the value of the pro field for '明天' is its bigram probability and the search ends. If it is not found, go to step S3.
S3, split '明天' into unigrams, namely '明' and '天', and look each up in the model file. If both are found, the product of the back_pro value of '明' and the pro value of '天' in the model file, i.e. back_pro('明') * pro('天'), is the bigram probability of '明天'. If only '明' is found, back_pro('明') * pro('unk') is used, i.e. the missing sub-candidate falls back to the unknown-word entry. If only '天' is found, back_pro('unk') * pro('天') is used. Here unk denotes an unregistered (out-of-vocabulary) word.
It can be understood that the bigram probabilities of the other candidate items in the bigram set ['明天', '天去', '去北', '北京', '京出', '出差'] are determined in the same way.
FIG. 3 is a schematic flow chart of determining the bigram probability of each candidate item in the bigram set through the binary language model; the specific process is the same as the above example.
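A minimal sketch of this lookup for the bigram case follows. It assumes the model file has already been parsed into two dicts: bigram_pro maps a two-character string to its pro value, and unigram maps a single character to its (pro, back_pro) pair, with an "<unk>" entry for unregistered words. The dict layout and names are assumptions for illustration, not the model file format itself.

```python
def bigram_prob(candidate, bigram_pro, unigram):
    """Return the bigram probability of a two-character candidate such as '明天'."""
    if candidate in bigram_pro:                  # found directly in the model file
        return bigram_pro[candidate]
    first, second = candidate[0], candidate[1]   # split into unigram sub-candidates
    _, back_first = unigram.get(first, unigram["<unk>"])
    pro_second, _ = unigram.get(second, unigram["<unk>"])
    # back_pro(first) * pro(second), as in the example above; if the model
    # stores log10 probabilities (the ARPA convention), add instead of multiply
    return back_first * pro_second
```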
In an embodiment, the server may obtain the N-gram probability set of the text sentence by padding the set of candidate-item N-gram probabilities at the head and tail and then averaging adjacent items, yielding the N-gram probability corresponding to each word in the text sentence. For example, suppose the set of bigram probabilities for the bigram set ['明天', '天去', '去北', '北京', '京出', '出差'] is [a, b, c, d, e, f], where each element is the bigram probability of the corresponding candidate item. The server pads [a, b, c, d, e, f] by repeating the first bigram probability at the head (adding a before a) and the last at the tail (adding f after f), giving [a, a, b, c, d, e, f, f], and then averages adjacent items: [(a+a)/2, (a+b)/2, (b+c)/2, (c+d)/2, (d+e)/2, (e+f)/2, (f+f)/2] = [q, w, e, r, t, u, i]. Each element of [q, w, e, r, t, u, i] is the bigram probability of one word of the text sentence '明天去北京出差', and [q, w, e, r, t, u, i] is the sentence's bigram probability set.
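The padding and neighbour averaging can be written in a few lines. This sketch assumes the per-candidate bigram probabilities arrive as a plain list; names are illustrative.

```python
def per_char_probs(bigram_probs):
    """Map per-bigram probabilities to one probability per character.

    e.g. [a, b, c, d, e, f] for a 7-character sentence -> 7 averaged values.
    """
    padded = [bigram_probs[0]] + bigram_probs + [bigram_probs[-1]]  # head/tail padding
    # average each pair of adjacent items: one value per character
    return [(padded[i] + padded[i + 1]) / 2 for i in range(len(padded) - 1)]
```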
In this embodiment, the N-gram probability set of the text sentence is determined by an N-gram language model trained on the pre-constructed positive corpus, so suspected erroneous words can be identified from the N-gram probability set at the level of individual words, without word segmentation. This prevents segmentation errors from degrading the error correction result and thus improves the accuracy of text error correction.
In one embodiment, step S106 of identifying suspected erroneous words in the text sentence according to the N-gram probability set specifically comprises the following steps: determining the average, the absolute deviations, and the mean absolute deviation of the N-gram probabilities in the N-gram probability set; determining the probability critical value corresponding to each N-gram probability from the ratio of its absolute deviation to the mean absolute deviation; and, when an N-gram probability in the set is smaller than the average and its probability critical value is larger than a preset threshold, judging that the word corresponding to that N-gram probability in the text sentence is a suspected erroneous word.
Here the absolute deviation (AD) measures how far each individual observation lies from the arithmetic mean, and the mean absolute deviation (MAD) is the average of those deviations. Because the deviations cannot cancel each other out, the MAD accurately reflects the size of the actual prediction error. The probability critical value indicates how likely the corresponding word of the text sentence is to be a suspected erroneous word: the larger the critical value, the more likely the word is erroneous; the smaller the critical value, the more likely the word is correct. Since the critical values are obtained from the AD and MAD, one critical value corresponds to each N-gram probability in the set, and hence to each word of the text sentence.
In one embodiment, the server may add one dimension to each N-gram probability in the set (wrapping each value in its own list) and compute the absolute deviations from these wrapped values. Each entry of the resulting AD corresponds to one N-gram probability in the set and hence to one word of the text sentence. It can be understood that the added dimension serves only the AD computation; elsewhere, the N-gram probability set is used without it.
For example, suppose the N-gram probability set is [q, w, e, r, t, u, i]. Adding one dimension gives sc = [[q], [w], [e], [r], [t], [u], [i]], and the average is avg = (q + w + e + r + t + u + i) / 7. The absolute deviation is AD = [(x - avg)**2 for x in sc] = [c1, c2, c3, c4, c5, c6, c7], that is, each entry is the square of the difference between the corresponding term of sc and the average avg. AD contains 7 entries, each corresponding to one entry of the N-gram probability set [q, w, e, r, t, u, i] and to one word of the text sentence '明天去北京出差'. The mean absolute deviation is the average of the entries of AD, i.e. MAD = (c1 + c2 + c3 + c4 + c5 + c6 + c7) / 7. It is understood that avg is likewise used with the added dimension only for the AD computation; in other embodiments it is the plain average.
In one embodiment, the server may determine the set of probability critical values according to the following formula:
y=ratio*AD/MAD;
where y is the set of probability critical values, ratio is a preset hyperparameter, AD is the absolute deviation, and MAD is the mean absolute deviation. Since the number of entries in AD equals the number of words in the text sentence, the critical value set y obtained from this formula has the same number of entries as AD. For example, if AD = [c1, c2, c3, c4, c5, c6, c7] has 7 entries, then y also has 7 entries, and each probability critical value corresponds to one word of the text sentence.
In one embodiment, when an N-gram probability in the set is smaller than the average of the N-gram probabilities and its probability critical value is greater than the preset threshold, the word corresponding to that N-gram probability in the text sentence is judged to be a suspected erroneous word. Words whose N-gram probabilities do not satisfy both conditions are not suspected erroneous words.
In one embodiment, the server may determine the index (the position in the text sentence) of each suspected erroneous word from the N-gram probabilities that satisfy the judgment conditions.
FIG. 4 shows the flow of identifying suspected erroneous words from the N-gram probability set: first determine the average, absolute deviations, and mean absolute deviation of the N-gram probabilities in the set; then determine the probability critical values; then test each word against the two judgment conditions, namely that its N-gram probability is smaller than the average and that its probability critical value is greater than the preset threshold (for example, 1). When an N-gram probability satisfies both conditions at once, the corresponding word of the text sentence is a suspected erroneous word, and its index in the text sentence is output.
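Putting the statistics and the two judgment conditions together gives the following sketch. The patent does not fix the values of ratio or the preset threshold, so the defaults here are placeholders; note also that, following the patent's example, AD is computed with squared rather than absolute differences.

```python
def detect_suspects(char_probs, ratio=0.5, threshold=1.0):
    """Return the indices of suspected erroneous characters in the sentence."""
    n = len(char_probs)
    avg = sum(char_probs) / n
    ad = [(p - avg) ** 2 for p in char_probs]    # squared deviation per character
    mad = sum(ad) / n                            # mean absolute deviation (MAD)
    y = [ratio * d / mad for d in ad]            # probability critical values
    # suspect: probability below the average AND critical value above the threshold
    return [i for i in range(n) if char_probs[i] < avg and y[i] > threshold]
```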
In this embodiment, the suspected erroneous words of the text sentence are determined from the N-gram probabilities in the N-gram probability set, so erroneous words are identified at the level of individual words without word segmentation; segmentation errors therefore cannot degrade the error correction result, and the accuracy of text error correction improves.
In one embodiment, step S108 of acquiring the candidate correction word set corresponding to the suspected erroneous word specifically comprises the following steps: determining a candidate word set corresponding to the suspected erroneous word, the pinyin of each candidate word in the set being the same as or similar to that of the suspected erroneous word; sorting the candidate words in descending order of their word frequency in a pre-constructed standard word dictionary, the standard word dictionary being built in advance from the positive corpus and containing each word of the positive corpus together with its word frequency; and selecting a preset number of the top-ranked candidate words to form the candidate correction word set of the suspected erroneous word.
The word frequency represents how often a word of the standard word dictionary appears in the positive corpus.
In one embodiment, a pinyin-to-Chinese-character lookup dictionary may be constructed in advance from a collected pinyin table of common Chinese characters (e.g., the 3500 common characters). The dictionary maps each pinyin to the Chinese characters pronounced that way. The server converts the suspected erroneous word into its pinyin, looks up that pinyin and similar pinyins in the dictionary, determines the Chinese characters corresponding to the pinyins found, and forms those characters into the candidate word set. It can be understood that, because the pinyins looked up are the suspected erroneous word's own pinyin and pinyins similar to it, the pinyin of each candidate word in the set is the same as or similar to that of the suspected erroneous word.
In one embodiment, the server may judge pinyins similar according to a preset threshold or a decision condition such as matching initials. For example, the pinyin of '京' (as in 北京, Beijing) is jing and the pinyin of '津' (as in 天津, Tianjin) is jin, and the two may be treated as similar pinyins.
In one embodiment, the pinyin-to-character lookup dictionary may take the form { …, "shi": "是时实十世师识失施湿诗尸石…", … }, where each pinyin maps to the characters sharing that pronunciation, and the ellipses stand for the other pinyins and their characters, in the same form as the entry shown for shi.
In one embodiment, the words in the standard word dictionary may be arranged in descending order of word frequency. The standard word dictionary may be expressed in the form { word: word frequency }, such as { …, "明": 30, "天": 15, … }, where the ellipses stand for the other words and word frequencies, not shown here.
In an embodiment, the server may look up each candidate word of the candidate word set in the standard word dictionary in turn, determine the word frequency of each candidate word found, and then sort the candidate words in descending order of word frequency.
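The candidate generation can be sketched as follows, assuming the pinyin-to-character lookup dictionary (py2chars) and the standard word dictionary (char_freq) described above, and using the third-party pypinyin package to obtain the suspect character's pinyin. The similar-pinyin rule shown (treating pairs such as jing/jin as similar) is only one possible decision condition, and top_k stands for the preset number discussed next.

```python
from pypinyin import lazy_pinyin  # third-party pinyin conversion package

def candidate_corrections(suspect, py2chars, char_freq, top_k=5):
    """Return the top_k candidate correction characters for a suspect character."""
    py = lazy_pinyin(suspect)[0]                       # pinyin of the suspect character
    similar = [p for p in py2chars
               if p == py or p.rstrip("g") == py.rstrip("g")]  # e.g. jing ~ jin
    candidates = {ch for p in similar for ch in py2chars[p]}
    ranked = sorted(candidates, key=lambda ch: char_freq.get(ch, 0), reverse=True)
    return ranked[:top_k]                              # highest word frequency first
```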
In an embodiment, the preset number of top-ranked candidate words to select as candidate correction words may be set in advance according to the actual situation; the specific number is not limited.
In this embodiment, candidate words whose pinyin is the same as or similar to that of the suspected erroneous word are determined first, and the candidate correction words are then selected by each candidate word's frequency in the standard word dictionary, so the candidate correction words can be determined more accurately.
In one embodiment, step S110 includes: replacing the suspected erroneous word to be corrected in the text sentence with each candidate correction word of its candidate correction word set in turn, obtaining the candidate text sentence set corresponding to the suspected erroneous word; determining, through the N-gram language model, the perplexity of each candidate text sentence in the set; and selecting, according to the perplexities, the corrected text sentence corresponding to the suspected erroneous word from the candidate text sentences, i.e. the text sentence in which the suspected erroneous word has been replaced with its target correction word.
The suspected erroneous word to be corrected is the one currently being corrected. A candidate text sentence is obtained by replacing that word in the current text sentence with one candidate correction word, so the candidate text sentences correspond one-to-one with the candidate correction words. Perplexity is an index used in the natural language processing (NLP) field to measure the quality of a language model; the smaller the perplexity of a candidate text sentence, the greater its probability, and the more likely it is correct.
In one embodiment, the server may take the candidate text sentence with the smallest perplexity as the corrected text sentence corresponding to the suspected erroneous word currently being corrected.
It can be understood that, when there is a single suspected erroneous word in the text sentence to be corrected, the correction processing of this embodiment yields the corrected text sentence for that word.
In this embodiment, the perplexity of each candidate text sentence is obtained through the N-gram language model and the corrected text sentence is selected by perplexity, so the suspected erroneous word is corrected and the corrected text sentence is determined more accurately and conveniently.
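With the kenlm Python bindings, selecting the target correction word by perplexity can look like the following sketch. kenlm scores space-separated tokens, which matches the character-separated training format described later; the model path and function names are illustrative.

```python
import kenlm  # Python bindings of the kenlm language model toolkit

def best_correction(sentence, index, candidates, model):
    """Try each candidate character at `index`; keep the least-perplexing sentence."""
    chars = list(sentence)
    scored = []
    for cand in candidates:
        chars[index] = cand
        scored.append((model.perplexity(" ".join(chars)), "".join(chars)))
    return min(scored)[1]                    # smallest perplexity wins

# model = kenlm.Model("char_bigram.arpa")   # illustrative model path
```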
In one embodiment, there are multiple suspected erroneous words, and step S110 further comprises: selecting, from the remaining suspected erroneous words in the corrected text sentence obtained for the previous suspected erroneous word, the next suspected erroneous word to be corrected; taking that corrected text sentence as the text sentence to be corrected; and returning to the replacement step above, continuing until no suspected erroneous word remains in the corrected text sentence, which is then the final corrected text sentence.
It can be understood that the previous suspected erroneous word is the one already corrected in the text sentence to be corrected.
It will be appreciated that the essence of the above embodiment is: when there are multiple suspected erroneous words in the text sentence to be corrected, each is replaced in turn with its target correction word, the target correction word of the current suspected erroneous word being determined from the text sentence obtained after replacing the previous one; the iteration repeats until every suspected erroneous word has been replaced with its target correction word, giving the final corrected text sentence.
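As a sketch, the iteration can be expressed as a loop over the hypothetical helpers outlined above (per_char_probs, detect_suspects, candidate_corrections, best_correction); bigram_probs_of stands in for the model-file lookup and, like the whole arrangement, is an assumption for illustration.

```python
def correct_all(sentence, model, py2chars, char_freq, max_rounds=10):
    """Correct suspected erroneous characters one per round until none remain."""
    for _ in range(max_rounds):                        # guard against non-termination
        probs = per_char_probs(bigram_probs_of(sentence))  # hypothetical lookup helper
        suspects = detect_suspects(probs)
        if not suspects:                               # no suspected characters remain
            break
        index = suspects[0]                            # correct one character per round
        cands = candidate_corrections(sentence[index], py2chars, char_freq)
        sentence = best_correction(sentence, index, cands, model)
    return sentence
```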
In this embodiment, the corrected text sentence for each suspected erroneous word is determined by the perplexities of the candidate text sentences, so the suspected erroneous words are replaced one after another to obtain the corrected text sentence. Determining the corrected text sentence by perplexity is accurate and convenient, and it handles the case of multiple suspected erroneous words in one text sentence.
FIG. 5 shows the flow of determining the candidate correction word set and then selecting the corrected text sentence by perplexity in the above embodiment: first determine the candidate word set with the same or similar pinyin as the suspected erroneous word; sort the candidate words by word frequency in descending order and take a preset number of them as candidate correction words; replace the suspected erroneous word with each candidate correction word to obtain the candidate text sentence set; compute the perplexity of each candidate text sentence; and select the candidate text sentence with the smallest perplexity as the corrected text sentence.
In one embodiment, the training of the N-gram language model comprises: constructing a positive corpus comprising a plurality of documents that do not require error correction; preprocessing the positive corpus, the preprocessing comprising at least one of removing noise characters and adjusting the document format to match the input format of the language model training tool; and training and generating the N-gram language model through the language model training tool from the preprocessed positive corpus.
Noise characters are the characters in the positive corpus other than Chinese characters. A language model training tool is a tool for training a language model; its input format is the format it requires of the training data (for example, with the characters separated from one another).
In one embodiment, the noise characters may include at least one of English letters, numbers, punctuation marks, and the like.
It will be appreciated that, because the present application performs word-level processing, the original format of the positive corpus is raw text data that has not been subjected to Chinese word segmentation.
In one embodiment, the server may remove the noise characters from the positive corpus, separate the characters of each line with spaces, and then train on the text data of the preprocessed documents with the language model training tool kenlm to generate the N-gram language model, where the value of N can be set as required.
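A minimal sketch of this preprocessing and training flow follows; lmplz is kenlm's command-line estimator, and the file names and the character filter are illustrative assumptions.

```python
import re
import subprocess

def preprocess(lines):
    """Strip non-Chinese noise characters and space-separate the characters."""
    for line in lines:
        chars = re.sub(r"[^\u4e00-\u9fa5]", "", line)   # keep CJK characters only
        if chars:
            yield " ".join(chars)

with open("corpus.txt", encoding="utf-8") as src, \
        open("train.txt", "w", encoding="utf-8") as dst:
    dst.write("\n".join(preprocess(src)))

# train a character-level bigram model (the order N is set with -o)
subprocess.run("lmplz -o 2 < train.txt > char_bigram.arpa", shell=True, check=True)
```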
In this embodiment, the N-gram language model is generated by training on the preprocessed positive corpus, so word-level processing can be performed through the model; segmentation errors therefore cannot degrade the error correction result, and the accuracy of text error correction improves.
FIG. 6 shows the overall flow of the text error correction method of the above embodiments: a text sentence to be corrected is input; suspected erroneous words in it are detected through the generated word-level N-gram language model; the suspected erroneous words are replaced with target correction words according to the pre-constructed pinyin-to-character lookup dictionary, positive corpus, and standard word dictionary together with the generated N-gram language model; the correction result is then checked, mistaken corrections are recalled, and finally the text sentence after recall is output.
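End to end, the flow of FIG. 6 amounts to chaining the hypothetical helpers sketched above: correct the sentence, record which positions changed, then recall corrections the candidate dictionary does not support.

```python
def text_error_correction(sentence, model, py2chars, char_freq, corpus_docs):
    """Correct a sentence, then recall corrections absent from the candidate dictionary."""
    corrected = correct_all(sentence, model, py2chars, char_freq)
    replacements = [(i, sentence[i]) for i in range(len(sentence))
                    if sentence[i] != corrected[i]]     # positions that were replaced
    return recall_corrections(corrected, replacements, corpus_docs)
```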
It should be understood that, although the steps in the flowchart of FIG. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly ordered and may be performed in other orders. Moreover, at least some of the steps in FIG. 1 may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be performed at different times, and not necessarily in sequence; they may be performed in turn or alternately with other steps or with sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided a text correction apparatus 700 including: an obtaining module 702, a probability determining module 704, a suspected erroneous word identifying module 706, a candidate corrected word determining module 708, and an error correcting module 710, wherein:
an obtaining module 702, configured to obtain a text sentence to be corrected;

a probability determining module 704, configured to determine an N-gram probability set of the text sentence through an N-gram language model trained based on a pre-constructed positive corpus, where the N-gram probability set comprises the N-gram probability of each word in the text sentence;

a suspected erroneous word identifying module 706, configured to identify suspected erroneous words in the text sentence according to the N-gram probability set;

a candidate corrected word determining module 708, configured to acquire a candidate corrected word set corresponding to the suspected erroneous word; and

an error correcting module 710, configured to screen a target corrected word corresponding to the suspected erroneous word from the candidate corrected word set according to the N-gram language model, and replace each suspected erroneous word in the text sentence with the corresponding target corrected word to obtain a corrected text sentence.
In one embodiment, the text correction apparatus 700 further includes:
an error correction result recall module 712, configured to determine the similarity between the corrected text sentence and each document in the positive corpus; sort the documents in descending order of similarity and select a preset number of documents as candidate documents; form a candidate dictionary from the candidate documents; and, when a target corrected word that replaced a suspected erroneous word does not exist in the candidate dictionary, restore the target corrected word to the corresponding suspected erroneous word before error correction.
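A minimal sketch of this recall step is given below. The Jaccard character-overlap similarity, the top_n parameter, and the assumption that correction replaces characters one-for-one (so the original and corrected sentences have equal length) are illustrative choices, not details fixed by the original disclosure.

```python
def recall_corrections(original: str, corrected: str,
                       corpus_docs: list[str], top_n: int = 10) -> str:
    # Rank corpus documents by a simple character-overlap (Jaccard)
    # similarity to the corrected sentence.
    def similarity(doc: str) -> float:
        a, b = set(corrected), set(doc)
        return len(a & b) / len(a | b) if a | b else 0.0

    ranked = sorted(corpus_docs, key=similarity, reverse=True)
    candidate_dict = set("".join(ranked[:top_n]))  # candidate dictionary

    # Restore any replacement character that does not occur in the
    # candidate dictionary back to the pre-correction character.
    # Assumes len(original) == len(corrected), i.e. one-for-one
    # character replacement.
    return "".join(
        c if c == o or c in candidate_dict else o
        for o, c in zip(original, corrected)
    )
```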
In one embodiment, the probability determining module 704 is further configured to determine an N-gram set of the text sentence to be corrected; determine the N-gram probability of each candidate item in the N-gram set through an N-gram language model trained based on a pre-constructed positive corpus; and determine the N-gram probability corresponding to each word in the text sentence according to the N-gram probability of each candidate item, to obtain the N-gram probability set of the text sentence.
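One simple realization of this step is sketched below using the kenlm Python bindings, whose full_scores method reports the conditional log10 probability of each token given its preceding context. The model file name is the one assumed in the training sketch above; treating the full_scores output directly as the per-character N-gram probabilities is an illustrative simplification.

```python
import kenlm  # Python bindings for the kenlm toolkit

model = kenlm.Model("char_ngram.arpa")  # character-level model from the training sketch

def char_ngram_probs(sentence: str) -> list[float]:
    # Space-separate the characters to match the training format, then
    # read off the log10 probability kenlm assigns to each character
    # given its preceding (N-1)-character context.
    spaced = " ".join(sentence)
    scores = list(model.full_scores(spaced, bos=True, eos=True))
    # The final tuple scores the end-of-sentence token; drop it so the
    # result aligns one-to-one with the characters of the sentence.
    return [log_prob for log_prob, ngram_len, oov in scores[:-1]]
```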
In one embodiment, the suspected erroneous word identifying module 706 is further configured to determine the average value of the N-gram probabilities in the N-gram probability set, the absolute error of each N-gram probability from that average, and the average absolute error; determine a probability critical value corresponding to each N-gram probability in the N-gram probability set according to the ratio of the absolute error to the average absolute error; and, when an N-gram probability in the N-gram probability set is smaller than the average value and the probability critical value corresponding to that N-gram probability is larger than a preset threshold, determine that the word corresponding to that N-gram probability in the text sentence is a suspected erroneous word.
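This detection rule is essentially a mean-absolute-deviation outlier test on the per-character probabilities. A minimal sketch follows; taking the critical value to be exactly the ratio of absolute error to average absolute error, and the default threshold of 1.5, are illustrative assumptions.

```python
def detect_suspected_errors(log_probs: list[float],
                            threshold: float = 1.5) -> list[int]:
    # Average of the per-character log probabilities.
    mean = sum(log_probs) / len(log_probs)
    # Absolute error of each probability from the average, and the
    # average absolute error across the sentence.
    abs_errors = [abs(p - mean) for p in log_probs]
    mad = sum(abs_errors) / len(abs_errors)
    suspects = []
    for i, p in enumerate(log_probs):
        critical = abs_errors[i] / mad if mad > 0 else 0.0
        # Flag characters whose probability is below the average AND
        # whose critical value exceeds the preset threshold.
        if p < mean and critical > threshold:
            suspects.append(i)
    return suspects
```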
In one embodiment, the candidate corrected word determining module 708 is further configured to determine a candidate word set corresponding to the suspected erroneous word, where the pinyin of each candidate word in the candidate word set is the same as or similar to the pinyin of the suspected erroneous word; sort the candidate words in descending order of their word frequency in a pre-constructed standard word dictionary, where the standard word dictionary is constructed in advance based on the positive corpus and comprises each word in the positive corpus and its corresponding word frequency; and select a preset number of candidate corrected words from the sorted candidate words to form the candidate corrected word set of the suspected erroneous word.
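A minimal sketch of candidate generation follows. The pypinyin library is one possible way to obtain a character's pinyin; the structures of pinyin_table (the pinyin-to-Chinese-character comparison table) and char_freq (the standard word dictionary as a frequency map), as well as the top_k cutoff, are assumptions for illustration.

```python
from pypinyin import lazy_pinyin  # one possible pinyin library

def candidate_corrections(suspect: str,
                          pinyin_table: dict[str, list[str]],
                          char_freq: dict[str, int],
                          top_k: int = 5) -> list[str]:
    # Look up characters sharing the suspect character's pinyin in the
    # pre-built pinyin-to-characters comparison table.
    pinyin = lazy_pinyin(suspect)[0]
    candidates = pinyin_table.get(pinyin, [])
    # Rank candidates by their frequency in the positive corpus and
    # keep a preset number of the most frequent ones.
    ranked = sorted(candidates, key=lambda c: char_freq.get(c, 0), reverse=True)
    return ranked[:top_k]
```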
In one embodiment, the error correcting module 710 is further configured to replace the suspected erroneous word in the text sentence to be corrected with each candidate corrected word in the corresponding candidate corrected word set, respectively, to obtain a candidate text sentence set corresponding to the suspected erroneous word; determine the perplexity of each candidate text sentence in the candidate text sentence set through the N-gram language model; and select, according to the perplexities, the corrected text sentence corresponding to the suspected erroneous word from the candidate text sentences, where the corrected text sentence is the text sentence obtained after the suspected erroneous word is replaced with the target corrected word.
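A minimal sketch of this selection step, reusing the kenlm model loaded in the earlier sketch; choosing the minimum-perplexity candidate sentence is the natural reading of this step, and the helper names are assumptions.

```python
def best_correction(sentence: str, pos: int, candidates: list[str]) -> str:
    # Substitute each candidate character at the suspect position and
    # keep the candidate sentence with the lowest perplexity.
    def perplexity(s: str) -> float:
        return model.perplexity(" ".join(s))  # same spacing as training

    scored = [
        (perplexity(sentence[:pos] + cand + sentence[pos + 1:]),
         sentence[:pos] + cand + sentence[pos + 1:])
        for cand in candidates
    ]
    return min(scored)[1]
```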
In one embodiment, there are a plurality of suspected erroneous words. The error correcting module 710 is further configured to reselect a suspected erroneous word, from the suspected erroneous words contained in the corrected text sentence corresponding to the previous suspected erroneous word, as the suspected erroneous word to be corrected; take that corrected text sentence as the text sentence to be corrected; and return to the step of replacing the suspected erroneous word in the text sentence to be corrected with each candidate corrected word in the corresponding candidate corrected word set, until no suspected erroneous word exists in the corrected text sentence, so as to obtain the final corrected text sentence.
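Combining the sketches above, the iterative loop over multiple suspected erroneous words can be drafted as follows. The handled set is an added safeguard (not from the original disclosure) so that a position that resists correction cannot loop forever.

```python
def correct_sentence(sentence: str,
                     pinyin_table: dict[str, list[str]],
                     char_freq: dict[str, int]) -> str:
    # Correct one suspected character per round; the corrected sentence
    # of each round becomes the sentence to be corrected in the next,
    # until no suspected characters remain.
    handled: set[int] = set()
    while True:
        probs = char_ngram_probs(sentence)
        suspects = [i for i in detect_suspected_errors(probs)
                    if i not in handled]
        if not suspects:
            return sentence
        pos = suspects[0]
        handled.add(pos)  # safeguard: visit each position at most once
        cands = candidate_corrections(sentence[pos], pinyin_table, char_freq)
        if cands:
            sentence = best_correction(sentence, pos, cands)
```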
In one embodiment, as shown in fig. 8, the text correction apparatus 700 further includes:
a model training module 714, configured to construct a positive corpus, where the positive corpus comprises a plurality of documents that do not require error correction; preprocess the positive corpus, where the preprocessing comprises at least one of removing noise characters from the positive corpus and adjusting the format of the documents in the positive corpus to conform to the input format of a language model training tool; and train, through the language model training tool, the N-gram language model according to the preprocessed positive corpus.
In the text error correction apparatus, the N-gram probability set of the text sentence to be corrected is determined through an N-gram language model trained based on a pre-constructed positive corpus, and the N-gram probability set comprises the N-gram probability of each character in the text sentence, so the suspected erroneous characters in the text sentence can be identified according to the N-gram probability set; erroneous-character identification is thus performed at the character level, without word segmentation. A candidate corrected character set corresponding to each suspected erroneous character is acquired, target corrected characters corresponding to the suspected erroneous characters are screened from the candidate corrected character set according to the N-gram language model, and each suspected erroneous character in the text sentence is replaced with the corresponding target corrected character to obtain the corrected text sentence. Error correction is therefore performed with a character-level language model that does not rely on word segmentation, so word segmentation errors cannot degrade the error correction result, and the accuracy of text error correction is improved. In addition, unlike methods that construct a negative corpus, the apparatus only needs to construct a positive corpus, which avoids the difficulty that a negative corpus (that is, text sentences containing erroneous characters or words) is hard to anticipate, and improves the practicability of the apparatus.
For specific limitations of the text error correction apparatus, reference may be made to the above limitations of the text error correction method, which are not repeated here. Each module in the text error correction apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them to perform the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing corpus data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a text correction method.
Those skilled in the art will appreciate that the structure shown in fig. 9 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution of the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention patent. It should be noted that several variations and modifications can be made by those of ordinary skill in the art without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (11)

1. A method for correcting text, the method comprising:
acquiring a text sentence to be corrected;
determining an N-gram probability set of the text sentence through an N-gram language model trained based on a pre-constructed positive corpus, wherein the N-gram probability set comprises the N-gram probability of each word in the text sentence;

identifying suspected erroneous words in the text sentence according to the N-gram probability set;

acquiring a candidate corrected word set corresponding to the suspected erroneous word;

and screening, according to the N-gram language model, a target corrected word corresponding to the suspected erroneous word from the candidate corrected word set, and replacing each suspected erroneous word in the text sentence with the corresponding target corrected word to obtain a corrected text sentence.
2. The method of claim 1, further comprising:
determining a similarity between the corrected text sentence and each document in the positive corpus;
sorting the documents in descending order of similarity according to the determined similarities, and selecting a preset number of documents as candidate documents;

forming a candidate dictionary according to the candidate documents;

and when a target corrected word that replaced a suspected erroneous word does not exist in the candidate dictionary, restoring the target corrected word to the corresponding suspected erroneous word before error correction.
3. The method of claim 1, wherein determining the N-gram probability set for the text sentence through an N-gram language model trained based on a pre-constructed positive corpus comprises:
determining an N-gram set of the text sentence to be corrected;

determining the N-gram probability of each candidate item in the N-gram set through the N-gram language model trained based on the pre-constructed positive corpus;

and determining the N-gram probability corresponding to each word in the text sentence according to the N-gram probability of each candidate item, to obtain the N-gram probability set of the text sentence.
4. The method of claim 1, wherein the identifying suspected erroneous words in the text sentence according to the N-gram probability set comprises:

determining the average value of the N-gram probabilities in the N-gram probability set, the absolute error of each N-gram probability from the average value, and the average absolute error;

determining a probability critical value corresponding to each N-gram probability in the N-gram probability set according to the ratio of the absolute error to the average absolute error;

and when an N-gram probability in the N-gram probability set is smaller than the average value and the probability critical value corresponding to that N-gram probability is larger than a preset threshold value, determining that the word corresponding to that N-gram probability in the text sentence is a suspected erroneous word.
5. The method of claim 1, wherein the acquiring a candidate corrected word set corresponding to the suspected erroneous word comprises:

determining a candidate word set corresponding to the suspected erroneous word, wherein the pinyin of each candidate word in the candidate word set is the same as or similar to the pinyin of the suspected erroneous word;

sorting the candidate words in descending order of their word frequency in a pre-constructed standard word dictionary, wherein the standard word dictionary is constructed in advance based on the positive corpus and comprises each word in the positive corpus and its corresponding word frequency;

and selecting a preset number of candidate corrected words from the sorted candidate words to form the candidate corrected word set of the suspected erroneous word.
6. The method according to claim 1, wherein the screening, according to the N-gram language model, a target corrected word corresponding to the suspected erroneous word from the candidate corrected word set, and replacing each suspected erroneous word in the text sentence with the corresponding target corrected word to obtain a corrected text sentence comprises:

replacing the suspected erroneous word in the text sentence to be corrected with each candidate corrected word in the candidate corrected word set corresponding to the suspected erroneous word, respectively, to obtain a candidate text sentence set corresponding to the suspected erroneous word;

and determining the perplexity of each candidate text sentence in the candidate text sentence set through the N-gram language model, and selecting, according to the perplexities, the corrected text sentence corresponding to the suspected erroneous word from the candidate text sentences, wherein the corrected text sentence is the text sentence obtained after the suspected erroneous word is replaced with the target corrected word.
7. The method of claim 6, wherein there are a plurality of suspected erroneous words;

the screening, according to the N-gram language model, a target corrected word corresponding to the suspected erroneous word from the candidate corrected word set, and replacing each suspected erroneous word in the text sentence with the corresponding target corrected word to obtain a corrected text sentence further comprises:

reselecting a suspected erroneous word, from the suspected erroneous words contained in the corrected text sentence corresponding to the previous suspected erroneous word, as the suspected erroneous word to be corrected; taking that corrected text sentence as the text sentence to be corrected; and returning to the step of replacing the suspected erroneous word in the text sentence to be corrected with each candidate corrected word in the corresponding candidate corrected word set, until no suspected erroneous word exists in the corrected text sentence, to obtain the final corrected text sentence.
8. The method of claim 1, wherein the training of the N-gram language model comprises:
constructing a positive corpus, wherein the positive corpus comprises a plurality of documents that do not require error correction;

preprocessing the positive corpus, wherein the preprocessing comprises at least one of removing noise characters from the positive corpus and adjusting the format of the documents in the positive corpus to conform to the input format of a language model training tool;

and training, through the language model training tool, the N-gram language model according to the preprocessed positive corpus.
9. A text correction apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a text sentence to be corrected;
the probability determining module is used for determining an N-gram probability set of the text sentence through an N-gram language model trained based on a pre-constructed positive corpus, wherein the N-gram probability set comprises the N-gram probability of each word in the text sentence;

the suspected erroneous word identifying module is used for identifying suspected erroneous words in the text sentence according to the N-gram probability set;

the candidate corrected word determining module is used for acquiring a candidate corrected word set corresponding to the suspected erroneous word;

and the error correcting module is used for screening, according to the N-gram language model, a target corrected word corresponding to the suspected erroneous word from the candidate corrected word set, and replacing each suspected erroneous word in the text sentence with the corresponding target corrected word to obtain a corrected text sentence.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 8.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN202010650353.1A 2020-07-08 2020-07-08 Text error correction method, apparatus, computer device and storage medium Active CN111859921B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010650353.1A CN111859921B (en) 2020-07-08 2020-07-08 Text error correction method, apparatus, computer device and storage medium

Publications (2)

Publication Number Publication Date
CN111859921A true CN111859921A (en) 2020-10-30
CN111859921B CN111859921B (en) 2024-03-08

Family

ID=73152921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010650353.1A Active CN111859921B (en) 2020-07-08 2020-07-08 Text error correction method, apparatus, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN111859921B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9037967B1 (en) * 2014-02-18 2015-05-19 King Fahd University Of Petroleum And Minerals Arabic spell checking technique
KR20160034678A (en) * 2014-09-22 2016-03-30 포항공과대학교 산학협력단 Apparatus for grammatical error correction and method using the same
CN105976818A (en) * 2016-04-26 2016-09-28 Tcl集团股份有限公司 Instruction identification processing method and apparatus thereof
CN107122346A (en) * 2016-12-28 2017-09-01 平安科技(深圳)有限公司 The error correction method and device of a kind of read statement
CN108491392A (en) * 2018-03-29 2018-09-04 广州视源电子科技股份有限公司 Modification method, system, computer equipment and the storage medium of word misspelling
CN108959250A (en) * 2018-06-27 2018-12-07 众安信息技术服务有限公司 A kind of error correction method and its system based on language model and word feature
CN109086266A (en) * 2018-07-02 2018-12-25 昆明理工大学 A kind of error detection of text nearly word form and proofreading method
CN110929502A (en) * 2018-08-30 2020-03-27 北京嘀嘀无限科技发展有限公司 Text error detection method and device
CN110276077A (en) * 2019-06-25 2019-09-24 上海应用技术大学 The method, device and equipment of Chinese error correction
CN110427625A (en) * 2019-07-31 2019-11-08 腾讯科技(深圳)有限公司 Sentence complementing method, device, medium and dialog process system
CN111369996A (en) * 2020-02-24 2020-07-03 网经科技(苏州)有限公司 Method for correcting text error in speech recognition in specific field

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HE Min et al.: "A New Word Recognition Method Based on Large-Scale Corpora", Computer Engineering and Applications, no. 21, 21 July 2007 (2007-07-21), pages 157-159 *
ZHAO Yan; WANG Xiaolong; LIU Bingquan; GUAN Yi: "A Maximum Entropy Part-of-Speech Tagging Model Incorporating Clustered Trigger-Pair Features", Journal of Computer Research and Development, no. 02, 28 February 2006 (2006-02-28), pages 86-92 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001169A (en) * 2020-07-17 2020-11-27 北京百度网讯科技有限公司 Text error correction method and device, electronic equipment and readable storage medium
JP7267365B2 (en) 2020-12-11 2023-05-01 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Text error correction method, apparatus, electronic equipment and storage medium
JP2022003539A (en) * 2020-12-11 2022-01-11 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Method, apparatus, electronic device and storage medium for correcting text errors
US11423222B2 (en) * 2020-12-11 2022-08-23 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus for text error correction, electronic device and storage medium
CN112800987A (en) * 2021-02-02 2021-05-14 中国联合网络通信集团有限公司 Chinese character processing method and device
CN112800987B (en) * 2021-02-02 2023-07-21 中国联合网络通信集团有限公司 Chinese character processing method and device
CN113011406A (en) * 2021-03-24 2021-06-22 浪潮云信息技术股份公司 Single-template working flow optimization method
CN113192497A (en) * 2021-04-28 2021-07-30 平安科技(深圳)有限公司 Speech recognition method, apparatus, device and medium based on natural language processing
CN113192497B (en) * 2021-04-28 2024-03-01 平安科技(深圳)有限公司 Speech recognition method, device, equipment and medium based on natural language processing
CN113673294A (en) * 2021-05-11 2021-11-19 苏州超云生命智能产业研究院有限公司 Method and device for extracting key information of document, computer equipment and storage medium
CN113449090A (en) * 2021-06-23 2021-09-28 山东新一代信息产业技术研究院有限公司 Error correction method, device and medium for intelligent question answering
CN113591441A (en) * 2021-07-30 2021-11-02 交互未来(北京)科技有限公司 Voice editing method and device, storage medium and electronic equipment
CN115146636A (en) * 2022-09-05 2022-10-04 华东交通大学 Method, system and storage medium for correcting errors of Chinese wrongly written characters
CN116502629A (en) * 2023-06-20 2023-07-28 神州医疗科技股份有限公司 Medical direct reporting method and system based on self-training text error correction and text matching
CN116502629B (en) * 2023-06-20 2023-08-18 神州医疗科技股份有限公司 Medical direct reporting method and system based on self-training text error correction and text matching

Also Published As

Publication number Publication date
CN111859921B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN111859921B (en) Text error correction method, apparatus, computer device and storage medium
CN110210029B (en) Method, system, device and medium for correcting error of voice text based on vertical field
US7478033B2 (en) Systems and methods for translating Chinese pinyin to Chinese characters
CN111753531A (en) Text error correction method and device based on artificial intelligence, computer equipment and storage medium
US20060015326A1 (en) Word boundary probability estimating, probabilistic language model building, kana-kanji converting, and unknown word model building
WO2009035863A2 (en) Mining bilingual dictionaries from monolingual web pages
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
CN111368918B (en) Text error correction method and device, electronic equipment and storage medium
TWI567569B (en) Natural language processing systems, natural language processing methods, and natural language processing programs
CN111651978A (en) Entity-based lexical examination method and device, computer equipment and storage medium
Zhang et al. Automatic detecting/correcting errors in Chinese text by an approximate word-matching algorithm
Saluja et al. Error detection and corrections in Indic OCR using LSTMs
CN113627158A (en) Chinese spelling error correction method and device based on multiple characteristics and multiple pre-training models
JP6145059B2 (en) Model learning device, morphological analysis device, and method
Yang et al. Spell Checking for Chinese.
Das et al. A cost efficient approach to correct OCR errors in large document collections
Doush et al. Improving post-processing optical character recognition documents with Arabic language using spelling error detection and correction
Byambakhishig et al. Error correction of automatic speech recognition based on normalized web distance.
Mohapatra et al. Spell checker for OCR
JP3975825B2 (en) Character recognition error correction method, apparatus and program
CN114548075A (en) Text processing method, text processing device, storage medium and electronic equipment
JP3080066B2 (en) Character recognition device, method and storage medium
CN114528824A (en) Text error correction method and device, electronic equipment and storage medium
JP4047895B2 (en) Document proofing apparatus and program storage medium
CN110399608A (en) A kind of conversational system text error correction system and method based on phonetic

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant