CN111859921A - Text error correction method and device, computer equipment and storage medium


Info

Publication number
CN111859921A
Authority
CN
China
Prior art keywords
word
corrected
candidate
text
suspected
Prior art date
Legal status
Granted
Application number
CN202010650353.1A
Other languages
Chinese (zh)
Other versions
CN111859921B (en)
Inventor
吕海峰
宁义双
宁可
Current Assignee
Kingdee Software China Co Ltd
Original Assignee
Kingdee Software China Co Ltd
Priority date
Filing date
Publication date
Application filed by Kingdee Software China Co Ltd
Priority to CN202010650353.1A
Publication of CN111859921A
Application granted
Publication of CN111859921B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F40/00 Handling natural language data › G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/253 Grammatical analysis; Style critique
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a text error correction method and device, a computer device, and a storage medium. The method comprises the following steps: acquiring a text sentence to be corrected; determining an N-gram probability set of the text sentence through an N-gram language model trained on a pre-constructed positive corpus, the N-gram probability set comprising the N-gram probability of each word in the text sentence; identifying suspected erroneous words in the text sentence according to the N-gram probability set; acquiring a candidate correction word set corresponding to each suspected erroneous word; and screening, according to the N-gram language model, a target correction word for each suspected erroneous word from the candidate correction word set, and replacing each suspected erroneous word in the text sentence with its target correction word to obtain a corrected text sentence. The method improves the accuracy of text error correction.

Description

Text error correction method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technology and natural language processing technology, and in particular, to a text error correction method, apparatus, computer device, and storage medium.
Background
With the development of natural language processing technology, text error correction has emerged and found important applications. For example, in text obtained by speech recognition, errors such as homophones, near-homophones, and other wrongly written characters often occur under the influence of environment, accent, equipment, and other factors, so the erroneous characters in the text need to be corrected by a text error correction technique.
In the conventional technology, text error correction generally relies on word segmentation. However, errors in the word segmentation result easily distort the recognition of erroneous characters or words in the text, which reduces the accuracy of text error correction.
Disclosure of Invention
In view of the above, it is necessary to provide a text error correction method, apparatus, computer device and storage medium capable of improving the accuracy of text error correction.
A text error correction method, the method comprising:
acquiring a text sentence to be corrected;
determining an N-gram probability set of the text sentence through an N-gram language model trained on a pre-constructed positive corpus, the N-gram probability set comprising the N-gram probability of each word in the text sentence;
identifying suspected erroneous words in the text sentence according to the N-gram probability set;
acquiring a candidate correction word set corresponding to each suspected erroneous word; and
screening, according to the N-gram language model, a target correction word corresponding to each suspected erroneous word from the candidate correction word set, and replacing each suspected erroneous word in the text sentence with the corresponding target correction word to obtain a corrected text sentence.
In one embodiment, the method further comprises:
determining a similarity between the corrected text sentence and each document in the positive corpus;
sorting the documents in descending order of similarity and selecting a preset number of the top-ranked documents as candidate documents;
forming a candidate dictionary from the candidate documents; and
when a target correction word that replaced a suspected erroneous word does not exist in the candidate dictionary, restoring the target correction word to the corresponding suspected erroneous word from before error correction.
In one embodiment, the determining the N-gram probability set of the text sentence through the N-gram language model trained on the pre-constructed positive corpus comprises:
determining an N-gram set of the text sentence to be corrected;
determining, through the N-gram language model, the N-gram probability of each candidate item in the N-gram set; and
determining, from the N-gram probabilities of the candidate items, the N-gram probability corresponding to each word in the text sentence, to obtain the N-gram probability set of the text sentence.
In one embodiment, the identifying the suspected erroneous words in the text sentence according to the N-gram probability set comprises:
determining the average, the absolute deviations, and the mean absolute deviation of the N-gram probabilities in the N-gram probability set;
determining a probability critical value corresponding to each N-gram probability in the set from the ratio of its absolute deviation to the mean absolute deviation; and
when an N-gram probability in the set is smaller than the average and its probability critical value is larger than a preset threshold, determining that the word corresponding to that N-gram probability in the text sentence is a suspected erroneous word.
In one embodiment, the acquiring the candidate correction word set corresponding to the suspected erroneous word comprises:
determining a candidate word set corresponding to the suspected erroneous word, the pinyin of each candidate word in the set being the same as or similar to the pinyin of the suspected erroneous word;
sorting the candidate words in descending order of their word frequency in a pre-constructed standard word dictionary, the standard word dictionary being built in advance from the positive corpus and comprising each word in the positive corpus and its word frequency; and
selecting a preset number of the top-ranked candidate words to form the candidate correction word set of the suspected erroneous word.
In one embodiment, the screening, according to the N-gram language model, the target correction word corresponding to the suspected erroneous word from the candidate correction word set, and replacing each suspected erroneous word in the text sentence with the corresponding target correction word to obtain the corrected text sentence comprises:
replacing the suspected erroneous word to be corrected in the text sentence with each candidate correction word in its candidate correction word set in turn, to obtain a candidate text sentence set corresponding to the suspected erroneous word; and
determining, through the N-gram language model, the perplexity of each candidate text sentence in the set, and selecting, according to the perplexities, the corrected text sentence corresponding to the suspected erroneous word from the candidate text sentences, the corrected text sentence being the one in which the suspected erroneous word has been replaced with its target correction word.
In one embodiment, there are multiple suspected erroneous words, and the screening and replacing step further comprises:
selecting, from the remaining suspected erroneous words in the corrected text sentence obtained for the previous suspected erroneous word, a new suspected erroneous word to be corrected; taking that corrected text sentence as the text sentence to be corrected; and returning to the step of replacing the suspected erroneous word to be corrected with each candidate correction word in its candidate correction word set, continuing until no suspected erroneous word remains in the corrected text sentence, to obtain the final corrected text sentence.
In one embodiment, the training step of the N-gram language model comprises:
constructing a positive corpus comprising a plurality of documents that do not require error correction;
preprocessing the positive corpus, the preprocessing comprising at least one of removing noise characters from the positive corpus and adjusting the format of its documents to match the input format of a language model training tool; and
training and generating the N-gram language model through the language model training tool from the preprocessed positive corpus.
A text error correction apparatus, the apparatus comprising:
an acquisition module, configured to acquire a text sentence to be corrected;
a probability determination module, configured to determine an N-gram probability set of the text sentence through an N-gram language model trained on a pre-constructed positive corpus, the set comprising the N-gram probability of each word in the text sentence;
a suspected erroneous word identification module, configured to identify the suspected erroneous words in the text sentence according to the N-gram probability set;
a candidate correction word determination module, configured to acquire the candidate correction word set corresponding to each suspected erroneous word; and
an error correction module, configured to screen, according to the N-gram language model, the target correction word corresponding to each suspected erroneous word from the candidate correction word set, and to replace each suspected erroneous word in the text sentence with the corresponding target correction word to obtain the corrected text sentence.
A computer device comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, causes the processor to perform the steps of the text correction method according to embodiments of the present application.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to perform the steps of a text correction method as described in embodiments of the present application.
According to the text error correction method, apparatus, computer device, and storage medium, the N-gram probability set of the text sentence to be corrected is determined through an N-gram language model trained on a pre-constructed positive corpus, the set comprising the N-gram probability of each word in the text sentence, so suspected erroneous words can be identified from the N-gram probability set: erroneous-word recognition is performed at the level of individual words, without word segmentation. A candidate correction word set is acquired for each suspected erroneous word, the target correction word is screened from the set according to the N-gram language model, and each suspected erroneous word in the text sentence is replaced with its target correction word to obtain the corrected text sentence. Because error correction is based on a word-level language model and does not depend on word segmentation, segmentation errors cannot degrade the error correction result, and the accuracy of text error correction is improved.
Drawings
FIG. 1 is a flow diagram illustrating a text correction method according to one embodiment;
FIG. 2 is a flow diagram illustrating error correction result recall in one embodiment;
FIG. 3 is a schematic flow chart illustrating the determination of bigram probabilities in one embodiment;
FIG. 4 is a flow diagram illustrating the identification of suspected miswords in a textual statement in one embodiment;
FIG. 5 is a flowchart illustrating an embodiment of error correction for suspected erroneous words;
FIG. 6 is a schematic diagram illustrating an overall flowchart of a text correction method according to an embodiment;
FIG. 7 is a block diagram showing the structure of a text correction apparatus according to an embodiment;
FIG. 8 is a block diagram showing the structure of a text correction apparatus in another embodiment;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In an embodiment, as shown in fig. 1, a text error correction method is provided, and this embodiment is illustrated by applying the method to a server, and it is to be understood that the method may also be applied to a terminal, and may also be applied to a system including the terminal and the server, and is implemented by interaction between the terminal and the server. In this embodiment, the method includes the steps of:
S102, obtaining a text sentence to be corrected.
The text sentence is a sentence in text form. The text sentence to be corrected is one that contains erroneous words (for example, homophones, near-homophones, or other wrongly written characters) and therefore requires error correction.
In one embodiment, the text sentence to be corrected may be obtained by converting speech into text through speech recognition. It can be understood that, during speech recognition, under the influence of environmental noise, device performance, accent, and other factors, the resulting text sentence may contain erroneous words such as homophones, near-homophones, or other wrongly written characters.
In another embodiment, the text sentence to be corrected may also be a text sentence obtained by manual input. It is understood that in the process of manually inputting text, the text sentence may contain wrong words due to human input errors and the like.
In other embodiments, the text sentence to be corrected may also be a text sentence obtained through other approaches, which is not limited.
S104, determining an N-gram probability set of the text sentence through an N-gram language model trained based on a pre-constructed positive corpus; the N-gram probability set comprises N-gram probabilities of each word in the text sentence.
The positive corpus is composed of text corpora that do not require error correction, that is, text without erroneous characters or words; it contains a plurality of documents. The N-gram language model is a statistics-based language model. N may be any positive integer and is typically 2 or 3: when N is 2, the model is a 2-gram (bigram) model; when N is 3, it is a 3-gram (trigram) model. An N-gram probability is the probability the N-gram language model assigns to each word of the text sentence after the sentence has been converted into N-gram form. Throughout this application, a word refers to an individual character rather than a multi-character word; for example, '我' ('I') and '们' (a plural suffix) are each individual characters, while '我们' ('we') is a word.
In one embodiment, the positive corpus may be composed of domain-specific text corpora that do not require error correction. For error correction of text sentences in a specific domain, constructing a positive corpus of that domain makes the search space of the error correction process more targeted and can improve accuracy. For example, the specific domain may be an intelligent customer service robot scenario: improving the quality of the speech-recognized text during a dialogue improves the accuracy of the robot's intent recognition in that domain, and in turn the fluency of the dialogue. In other embodiments, the positive corpus may also include text corpora of multiple domains rather than one specific domain.
Specifically, the server may convert the text sentence into N-gram form and then determine, through the N-gram language model, the N-gram probability corresponding to each word in the text sentence.
S106, identifying suspected erroneous words in the text sentence according to the N-gram probability set.
The suspected erroneous words are words in the text sentence that, judging from the N-gram probability set, may be wrong. It is understood that the number of suspected erroneous words in the text sentence may be one or more.
Specifically, the server may determine the suspected erroneous words in the text sentence according to the N-gram probability corresponding to each word in the N-gram probability set.
S108, acquiring a candidate correction word set corresponding to the suspected erroneous word.
A correction word is a word used to replace a suspected erroneous word in the text sentence. A candidate correction word is a candidate for that replacement, and the candidate correction word set is the set of such candidates.
Specifically, the server may first determine a number of candidate words whose pinyin is the same as or similar to that of the suspected erroneous word, and then select the candidate correction words from them according to each candidate word's frequency in the positive corpus.
S110, screening, according to the N-gram language model, the target correction word corresponding to each suspected erroneous word from the candidate correction word set, and replacing each suspected erroneous word in the text sentence with the corresponding target correction word to obtain the corrected text sentence.
The target correction word is the correction word finally selected from the candidate correction word set to replace a suspected erroneous word in the text sentence. The corrected text sentence is obtained by replacing every suspected erroneous word in the text sentence to be corrected with its target correction word.
In one embodiment, when there is a single suspected erroneous word in the text sentence to be corrected, the server directly determines its target correction word and replaces the suspected erroneous word with it to obtain the corrected text sentence.
In another embodiment, when there are multiple suspected erroneous words, the server determines the target correction words one at a time: the text sentence obtained after replacing the previous suspected erroneous word serves as the current text sentence to be corrected when determining and applying the target correction word for the current suspected erroneous word. This repeats until every suspected erroneous word has been replaced with its target correction word, yielding the final corrected text sentence.
In this text error correction method, the N-gram probability set of the text sentence to be corrected is determined through an N-gram language model trained on a pre-constructed positive corpus, so suspected erroneous words can be recognized at the level of individual words without word segmentation; the target correction words are then screened from the candidate correction word sets according to the N-gram language model and substituted into the sentence. Error correction therefore rests on a word-level language model rather than a word segmentation technique, so segmentation errors cannot degrade the result, and accuracy improves. In addition, unlike methods that must construct a negative corpus, this method only requires a positive corpus, avoiding the difficulty of anticipating negative samples (text sentences containing erroneous characters or words) and improving practicality.
In one embodiment, the method further comprises the steps of: determining the similarity between the corrected text sentence and each document in the positive corpus; sorting the documents in descending order of similarity and selecting a preset number of the top-ranked documents as candidate documents; forming a candidate dictionary from the candidate documents; and, when a target correction word that replaced a suspected erroneous word does not exist in the candidate dictionary, restoring it to the corresponding suspected erroneous word from before error correction.
The candidate documents are the documents selected from the positive corpus by similarity and used to form the candidate dictionary. The candidate dictionary is used to judge whether each target correction word, and thus the text correction result, is accurate.
In an embodiment, the preset number of top-ranked documents to select may be set in advance according to the actual situation; the specific number is not limited.
In one embodiment, the server may employ the BM25 algorithm (an algorithm for determining the relevance between search terms and documents) to determine the similarity between the corrected text sentence and each document in the positive corpus. It can be understood that the larger the BM25 score, the greater the relevance, and hence the similarity, between the corrected text sentence and the corresponding document; conversely, the smaller the BM25 score, the smaller the similarity. In other embodiments, the server may use other methods to determine the similarity, which is not limited.
In one embodiment, the candidate dictionary may be composed of all words in each candidate document.
In one embodiment, the server may detect whether each target correction word that replaced a suspected erroneous word exists in the candidate dictionary, and then recall or confirm the correction according to the result. For each target correction word: when it does not exist in the candidate dictionary, the correction is judged wrong and the word is restored to the corresponding suspected erroneous word from before error correction (error correction result recall); when it does exist, the correction is judged right and the word is kept. Once every target correction word in the text sentence has been checked and recalled or confirmed, the resulting text sentence is output.
FIG. 2 shows the recall flow: first, the BM25 score between the corrected text sentence and each document in the positive corpus is computed; the documents are sorted by BM25 score in descending order and the top preset number are output as candidate documents, from which the candidate dictionary is generated. Then each target correction word is checked against the dictionary: if it is present, the correction is correct and the corrected text sentence is output; if not, the correction is wrong and the target correction word is restored to the suspected erroneous word from before error correction.
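The recall step can be summarized in a short sketch. The following Python fragment is a minimal illustration, not the patent's own code: it assumes the positive corpus is available as a list of document strings, uses the third-party rank_bm25 package for the BM25 scores, and records each replacement as an (index, original character) pair; all function and variable names are illustrative.

```python
from rank_bm25 import BM25Okapi  # third-party BM25 implementation

def recall_corrections(corrected, replacements, corpus_docs, top_k=10):
    """Restore corrections that the candidate dictionary does not support.

    corrected: corrected text sentence; replacements: (index, original char)
    pairs recording where suspected erroneous words were replaced.
    """
    bm25 = BM25Okapi([list(doc) for doc in corpus_docs])  # character-level tokens
    scores = bm25.get_scores(list(corrected))             # similarity to each document
    top = sorted(range(len(corpus_docs)),
                 key=lambda i: scores[i], reverse=True)[:top_k]
    candidate_dict = {ch for i in top for ch in corpus_docs[i]}
    chars = list(corrected)
    for index, original in replacements:
        if chars[index] not in candidate_dict:            # correction judged wrong
            chars[index] = original                       # recall: restore the original
    return "".join(chars)
```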
In this embodiment, candidate documents are selected by the similarity between the corrected text sentence and each document in the positive corpus, a candidate dictionary is formed from them, and whether to recall an error correction result is decided by whether the target correction word exists in the candidate dictionary. Recalling mistaken corrections reduces the misjudgment rate of text error correction and improves its accuracy.
In one embodiment, step S104 of determining the N-gram probability set of the text sentence through the N-gram language model trained on the pre-constructed positive corpus specifically comprises the following steps: determining the N-gram set of the text sentence to be corrected; determining, through the N-gram language model, the N-gram probability of each candidate item in the N-gram set; and determining, from these, the N-gram probability corresponding to each word in the text sentence, to obtain the N-gram probability set of the text sentence.
The N-gram set is the set of candidate items produced by expressing the text sentence to be corrected in N-gram form.
In one embodiment, the server performs a character-wise sliding-window operation of size N over the content of the text sentence, generating a number of character segments of length N, each of which is a candidate item in the N-gram set. For example, for the text sentence '明天去北京出差' ('going to Beijing on a business trip tomorrow') with N = 2, the bigram set of the sentence is ['明天', '天去', '去北', '北京', '京出', '出差'].
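A sliding window of this kind is straightforward to implement. The Python sketch below assumes the sentence is a plain string of characters; the function name is illustrative.

```python
def char_ngrams(sentence, n=2):
    """Slide a window of size n over the characters of the sentence."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

# char_ngrams("明天去北京出差", 2)
# -> ['明天', '天去', '去北', '北京', '京出', '出差']
```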
In one embodiment, the server may load the N-gram language model and then determine, from its model file, the N-gram probability of each candidate item in the N-gram set in turn. For each candidate item: it is first looked up directly in the model file; if found, its N-gram probability is taken from the value of the probability field for that entry. If not found, the candidate item is split into (N-1)-grams, and the resulting sub-candidates are looked up; if at least one sub-candidate is found, the N-gram probability of the original candidate item is determined from the values of the probability fields of the sub-candidates found in the model file. If none is found, the candidate item is split into (N-2)-grams and the lookup repeats, and so on, until split sub-candidates are found and the candidate item's N-gram probability is determined from their probability fields.
For example, with N = 2 and the bigram set ['明天', '天去', '去北', '北京', '京出', '出差'], the bigram probability of the candidate item '明天' is determined as follows:
S1, load the binary language model; its model file contains three fields: pro, word, and back_pro.
S2, look up '明天' directly in the model file. If it is found, the value of the pro field for '明天' is its bigram probability and the search ends. If it is not found, go to step S3.
S3, split '明天' into unigrams, namely '明' and '天', and look each up in the model file. If both are found, the product of the back_pro value of '明' and the pro value of '天' in the model file, i.e. back_pro('明') * pro('天'), is the bigram probability of '明天'. If only '明' is found, back_pro('明') * pro('unk') is used, i.e. the missing sub-candidate falls back to the unknown-word entry. If only '天' is found, back_pro('unk') * pro('天') is used. Here unk denotes an unregistered (out-of-vocabulary) word.
It can be understood that the bigram probabilities of the other candidate items in the bigram set ['明天', '天去', '去北', '北京', '京出', '出差'] are determined in the same way.
FIG. 3 is a schematic flow chart of determining the bigram probability of each candidate item in the bigram set through the binary language model; the specific process is the same as the above example.
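A minimal sketch of this lookup for the bigram case follows. It assumes the model file has already been parsed into two dicts: bigram_pro maps a two-character string to its pro value, and unigram maps a single character to its (pro, back_pro) pair, with an "<unk>" entry for unregistered words. The dict layout and names are assumptions for illustration, not the model file format itself.

```python
def bigram_prob(candidate, bigram_pro, unigram):
    """Return the bigram probability of a two-character candidate such as '明天'."""
    if candidate in bigram_pro:                  # found directly in the model file
        return bigram_pro[candidate]
    first, second = candidate[0], candidate[1]   # split into unigram sub-candidates
    _, back_first = unigram.get(first, unigram["<unk>"])
    pro_second, _ = unigram.get(second, unigram["<unk>"])
    # back_pro(first) * pro(second), as in the example above; if the model
    # stores log10 probabilities (the ARPA convention), add instead of multiply
    return back_first * pro_second
```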
In an embodiment, the server may obtain the N-gram probability set of the text sentence by padding the set of candidate-item N-gram probabilities at the head and tail and then averaging adjacent items, yielding the N-gram probability corresponding to each word in the text sentence. For example, suppose the set of bigram probabilities for the bigram set ['明天', '天去', '去北', '北京', '京出', '出差'] is [a, b, c, d, e, f], where each element is the bigram probability of the corresponding candidate item. The server pads [a, b, c, d, e, f] by repeating the first bigram probability at the head (adding a before a) and the last at the tail (adding f after f), giving [a, a, b, c, d, e, f, f], and then averages adjacent items: [(a+a)/2, (a+b)/2, (b+c)/2, (c+d)/2, (d+e)/2, (e+f)/2, (f+f)/2] = [q, w, e, r, t, u, i]. Each element of [q, w, e, r, t, u, i] is the bigram probability of one word of the text sentence '明天去北京出差', and [q, w, e, r, t, u, i] is the sentence's bigram probability set.
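The padding and neighbour averaging can be written in a few lines. This sketch assumes the per-candidate bigram probabilities arrive as a plain list; names are illustrative.

```python
def per_char_probs(bigram_probs):
    """Map per-bigram probabilities to one probability per character.

    e.g. [a, b, c, d, e, f] for a 7-character sentence -> 7 averaged values.
    """
    padded = [bigram_probs[0]] + bigram_probs + [bigram_probs[-1]]  # head/tail padding
    # average each pair of adjacent items: one value per character
    return [(padded[i] + padded[i + 1]) / 2 for i in range(len(padded) - 1)]
```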
In this embodiment, the N-gram probability set of the text sentence is determined by an N-gram language model trained on the pre-constructed positive corpus, so suspected erroneous words can be identified from the N-gram probability set at the level of individual words, without word segmentation. This prevents segmentation errors from degrading the error correction result and thus improves the accuracy of text error correction.
In one embodiment, step S106 of identifying suspected erroneous words in the text sentence according to the N-gram probability set specifically comprises the following steps: determining the average, the absolute deviations, and the mean absolute deviation of the N-gram probabilities in the N-gram probability set; determining the probability critical value corresponding to each N-gram probability from the ratio of its absolute deviation to the mean absolute deviation; and, when an N-gram probability in the set is smaller than the average and its probability critical value is larger than a preset threshold, judging that the word corresponding to that N-gram probability in the text sentence is a suspected erroneous word.
Here the absolute deviation (AD) measures how far each individual observation lies from the arithmetic mean, and the mean absolute deviation (MAD) is the average of those deviations. Because the deviations cannot cancel each other out, the MAD accurately reflects the size of the actual prediction error. The probability critical value indicates how likely the corresponding word of the text sentence is to be a suspected erroneous word: the larger the critical value, the more likely the word is erroneous; the smaller the critical value, the more likely the word is correct. Since the critical values are obtained from the AD and MAD, one critical value corresponds to each N-gram probability in the set, and hence to each word of the text sentence.
In one embodiment, the server may add one dimension to each N-gram probability in the set (wrapping each value in its own list) and compute the absolute deviations from these wrapped values. Each entry of the resulting AD corresponds to one N-gram probability in the set and hence to one word of the text sentence. It can be understood that the added dimension serves only the AD computation; elsewhere, the N-gram probability set is used without it.
For example, suppose the N-gram probability set is [q, w, e, r, t, u, i]. Adding one dimension gives sc = [[q], [w], [e], [r], [t], [u], [i]], and the average is avg = (q + w + e + r + t + u + i) / 7. The absolute deviation is AD = [(x - avg)**2 for x in sc] = [c1, c2, c3, c4, c5, c6, c7], that is, each entry is the square of the difference between the corresponding term of sc and the average avg. AD contains 7 entries, each corresponding to one entry of the N-gram probability set [q, w, e, r, t, u, i] and to one word of the text sentence '明天去北京出差'. The mean absolute deviation is the average of the entries of AD, i.e. MAD = (c1 + c2 + c3 + c4 + c5 + c6 + c7) / 7. It is understood that avg is likewise used with the added dimension only for the AD computation; in other embodiments it is the plain average.
In one embodiment, the server may determine the set of probability critical values according to the following formula:
y=ratio*AD/MAD;
where y is the set of probability critical values, ratio is a preset hyperparameter, AD is the absolute deviation, and MAD is the mean absolute deviation. Since the number of entries in AD equals the number of words in the text sentence, the critical value set y obtained from this formula has the same number of entries as AD. For example, if AD = [c1, c2, c3, c4, c5, c6, c7] has 7 entries, then y also has 7 entries, and each probability critical value corresponds to one word of the text sentence.
In one embodiment, when an N-gram probability in the set is smaller than the average of the N-gram probabilities and its probability critical value is greater than the preset threshold, the word corresponding to that N-gram probability in the text sentence is judged to be a suspected erroneous word. Words whose N-gram probabilities do not satisfy both conditions are not suspected erroneous words.
In one embodiment, the server may determine the index (the position in the text sentence) of each suspected erroneous word from the N-gram probabilities that satisfy the judgment conditions.
FIG. 4 shows the flow of identifying suspected erroneous words from the N-gram probability set: first determine the average, absolute deviations, and mean absolute deviation of the N-gram probabilities in the set; then determine the probability critical values; then test each word against the two judgment conditions, namely that its N-gram probability is smaller than the average and that its probability critical value is greater than the preset threshold (for example, 1). When an N-gram probability satisfies both conditions at once, the corresponding word of the text sentence is a suspected erroneous word, and its index in the text sentence is output.
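Putting the statistics and the two judgment conditions together gives the following sketch. The patent does not fix the values of ratio or the preset threshold, so the defaults here are placeholders; note also that, following the patent's example, AD is computed with squared rather than absolute differences.

```python
def detect_suspects(char_probs, ratio=0.5, threshold=1.0):
    """Return the indices of suspected erroneous characters in the sentence."""
    n = len(char_probs)
    avg = sum(char_probs) / n
    ad = [(p - avg) ** 2 for p in char_probs]    # squared deviation per character
    mad = sum(ad) / n                            # mean absolute deviation (MAD)
    y = [ratio * d / mad for d in ad]            # probability critical values
    # suspect: probability below the average AND critical value above the threshold
    return [i for i in range(n) if char_probs[i] < avg and y[i] > threshold]
```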
In this embodiment, the suspected erroneous words of the text sentence are determined from the N-gram probabilities in the N-gram probability set, so erroneous words are identified at the level of individual words without word segmentation; segmentation errors therefore cannot degrade the error correction result, and the accuracy of text error correction improves.
In one embodiment, step S108 of acquiring the candidate correction word set corresponding to the suspected erroneous word specifically comprises the following steps: determining a candidate word set corresponding to the suspected erroneous word, the pinyin of each candidate word in the set being the same as or similar to that of the suspected erroneous word; sorting the candidate words in descending order of their word frequency in a pre-constructed standard word dictionary, the standard word dictionary being built in advance from the positive corpus and containing each word of the positive corpus together with its word frequency; and selecting a preset number of the top-ranked candidate words to form the candidate correction word set of the suspected erroneous word.
The word frequency represents how often a word of the standard word dictionary appears in the positive corpus.
In one embodiment, a pinyin-to-Chinese-character lookup dictionary may be constructed in advance from a collected pinyin table of common Chinese characters (e.g., the 3500 common characters). The dictionary maps each pinyin to the Chinese characters pronounced that way. The server converts the suspected erroneous word into its pinyin, looks up that pinyin and similar pinyins in the dictionary, determines the Chinese characters corresponding to the pinyins found, and forms those characters into the candidate word set. It can be understood that, because the pinyins looked up are the suspected erroneous word's own pinyin and pinyins similar to it, the pinyin of each candidate word in the set is the same as or similar to that of the suspected erroneous word.
In one embodiment, the server may judge pinyins similar according to a preset threshold or a decision condition such as matching initials. For example, the pinyin of '京' (as in 北京, Beijing) is jing and the pinyin of '津' (as in 天津, Tianjin) is jin, and the two may be treated as similar pinyins.
In one embodiment, the pinyin-to-character lookup dictionary may take the form { …, "shi": "是时实十世师识失施湿诗尸石…", … }, where each pinyin maps to the characters sharing that pronunciation, and the ellipses stand for the other pinyins and their characters, in the same form as the entry shown for shi.
In one embodiment, the words in the standard word dictionary may be arranged in descending order of word frequency. The standard word dictionary may be expressed in the form { word: word frequency }, such as { …, "明": 30, "天": 15, … }, where the ellipses stand for the other words and word frequencies, not shown here.
In an embodiment, the server may look up each candidate word of the candidate word set in the standard word dictionary in turn, determine the word frequency of each candidate word found, and then sort the candidate words in descending order of word frequency.
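The candidate generation can be sketched as follows, assuming the pinyin-to-character lookup dictionary (py2chars) and the standard word dictionary (char_freq) described above, and using the third-party pypinyin package to obtain the suspect character's pinyin. The similar-pinyin rule shown (treating pairs such as jing/jin as similar) is only one possible decision condition, and top_k stands for the preset number discussed next.

```python
from pypinyin import lazy_pinyin  # third-party pinyin conversion package

def candidate_corrections(suspect, py2chars, char_freq, top_k=5):
    """Return the top_k candidate correction characters for a suspect character."""
    py = lazy_pinyin(suspect)[0]                       # pinyin of the suspect character
    similar = [p for p in py2chars
               if p == py or p.rstrip("g") == py.rstrip("g")]  # e.g. jing ~ jin
    candidates = {ch for p in similar for ch in py2chars[p]}
    ranked = sorted(candidates, key=lambda ch: char_freq.get(ch, 0), reverse=True)
    return ranked[:top_k]                              # highest word frequency first
```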
In an embodiment, the preset number of top-ranked candidate words to select as candidate correction words may be set in advance according to the actual situation; the specific number is not limited.
In this embodiment, candidate words whose pinyin is the same as or similar to that of the suspected erroneous word are determined first, and the candidate correction words are then selected by each candidate word's frequency in the standard word dictionary, so the candidate correction words can be determined more accurately.
In one embodiment, step S110 includes: replacing the suspected erroneous word to be corrected in the text sentence with each candidate correction word of its candidate correction word set in turn, obtaining the candidate text sentence set corresponding to the suspected erroneous word; determining, through the N-gram language model, the perplexity of each candidate text sentence in the set; and selecting, according to the perplexities, the corrected text sentence corresponding to the suspected erroneous word from the candidate text sentences, i.e. the text sentence in which the suspected erroneous word has been replaced with its target correction word.
The suspected erroneous word to be corrected is the one currently being corrected. A candidate text sentence is obtained by replacing that word in the current text sentence with one candidate correction word, so the candidate text sentences correspond one-to-one with the candidate correction words. Perplexity is an index used in the natural language processing (NLP) field to measure the quality of a language model; the smaller the perplexity of a candidate text sentence, the greater its probability, and the more likely it is correct.
In one embodiment, the server may take the candidate text sentence with the smallest perplexity as the corrected text sentence corresponding to the suspected erroneous word currently being corrected.
It can be understood that, when there is a single suspected erroneous word in the text sentence to be corrected, the correction processing of this embodiment yields the corrected text sentence for that word.
In this embodiment, the perplexity of each candidate text sentence is obtained through the N-gram language model and the corrected text sentence is selected by perplexity, so the suspected erroneous word is corrected and the corrected text sentence is determined more accurately and conveniently.
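With the kenlm Python bindings, selecting the target correction word by perplexity can look like the following sketch. kenlm scores space-separated tokens, which matches the character-separated training format described later; the model path and function names are illustrative.

```python
import kenlm  # Python bindings of the kenlm language model toolkit

def best_correction(sentence, index, candidates, model):
    """Try each candidate character at `index`; keep the least-perplexing sentence."""
    chars = list(sentence)
    scored = []
    for cand in candidates:
        chars[index] = cand
        scored.append((model.perplexity(" ".join(chars)), "".join(chars)))
    return min(scored)[1]                    # smallest perplexity wins

# model = kenlm.Model("char_bigram.arpa")   # illustrative model path
```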
In one embodiment, there are multiple suspected erroneous words, and step S110 further comprises: selecting, from the remaining suspected erroneous words in the corrected text sentence obtained for the previous suspected erroneous word, the next suspected erroneous word to be corrected; taking that corrected text sentence as the text sentence to be corrected; and returning to the replacement step above, continuing until no suspected erroneous word remains in the corrected text sentence, which is then the final corrected text sentence.
It can be understood that the previous suspected erroneous word is the one already corrected in the text sentence to be corrected.
It will be appreciated that the essence of the above embodiment is: when there are multiple suspected erroneous words in the text sentence to be corrected, each is replaced in turn with its target correction word, the target correction word of the current suspected erroneous word being determined from the text sentence obtained after replacing the previous one; the iteration repeats until every suspected erroneous word has been replaced with its target correction word, giving the final corrected text sentence.
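As a sketch, the iteration can be expressed as a loop over the hypothetical helpers outlined above (per_char_probs, detect_suspects, candidate_corrections, best_correction); bigram_probs_of stands in for the model-file lookup and, like the whole arrangement, is an assumption for illustration.

```python
def correct_all(sentence, model, py2chars, char_freq, max_rounds=10):
    """Correct suspected erroneous characters one per round until none remain."""
    for _ in range(max_rounds):                        # guard against non-termination
        probs = per_char_probs(bigram_probs_of(sentence))  # hypothetical lookup helper
        suspects = detect_suspects(probs)
        if not suspects:                               # no suspected characters remain
            break
        index = suspects[0]                            # correct one character per round
        cands = candidate_corrections(sentence[index], py2chars, char_freq)
        sentence = best_correction(sentence, index, cands, model)
    return sentence
```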
In this embodiment, the corrected text sentence for each suspected erroneous word is determined by the perplexities of the candidate text sentences, so the suspected erroneous words are replaced one after another to obtain the corrected text sentence. Determining the corrected text sentence by perplexity is accurate and convenient, and it handles the case of multiple suspected erroneous words in one text sentence.
FIG. 5 shows the flow of determining the candidate correction word set and then selecting the corrected text sentence by perplexity in the above embodiment: first determine the candidate word set with the same or similar pinyin as the suspected erroneous word; sort the candidate words by word frequency in descending order and take a preset number of them as candidate correction words; replace the suspected erroneous word with each candidate correction word to obtain the candidate text sentence set; compute the perplexity of each candidate text sentence; and select the candidate text sentence with the smallest perplexity as the corrected text sentence.
In one embodiment, the training of the N-gram language model comprises: constructing a positive corpus comprising a plurality of documents that do not require error correction; preprocessing the positive corpus, the preprocessing comprising at least one of removing noise characters and adjusting the document format to match the input format of the language model training tool; and training and generating the N-gram language model through the language model training tool from the preprocessed positive corpus.
Noise characters are the characters in the positive corpus other than Chinese characters. A language model training tool is a tool for training a language model; its input format is the format it requires of the training data (for example, with the characters separated from one another).
In one embodiment, the noise characters may include at least one of English letters, numbers, punctuation marks, and the like.
It will be appreciated that, because the present application performs word-level processing, the original format of the positive corpus is raw text data that has not been subjected to Chinese word segmentation.
In one embodiment, the server may remove the noise characters from the positive corpus, separate the characters of each line with spaces, and then train on the text data of the preprocessed documents with the language model training tool kenlm to generate the N-gram language model, where the value of N can be set as required.
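A minimal sketch of this preprocessing and training flow follows; lmplz is kenlm's command-line estimator, and the file names and the character filter are illustrative assumptions.

```python
import re
import subprocess

def preprocess(lines):
    """Strip non-Chinese noise characters and space-separate the characters."""
    for line in lines:
        chars = re.sub(r"[^\u4e00-\u9fa5]", "", line)   # keep CJK characters only
        if chars:
            yield " ".join(chars)

with open("corpus.txt", encoding="utf-8") as src, \
        open("train.txt", "w", encoding="utf-8") as dst:
    dst.write("\n".join(preprocess(src)))

# train a character-level bigram model (the order N is set with -o)
subprocess.run("lmplz -o 2 < train.txt > char_bigram.arpa", shell=True, check=True)
```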
In this embodiment, the N-gram language model is generated by training on the preprocessed positive corpus, so word-level processing can be performed through the model; segmentation errors therefore cannot degrade the error correction result, and the accuracy of text error correction improves.
FIG. 6 shows the overall flow of the text error correction method of the above embodiments: a text sentence to be corrected is input; suspected erroneous words in it are detected through the generated word-level N-gram language model; the suspected erroneous words are replaced with target correction words according to the pre-constructed pinyin-to-character lookup dictionary, positive corpus, and standard word dictionary together with the generated N-gram language model; the correction result is then checked, mistaken corrections are recalled, and finally the text sentence after recall is output.
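End to end, the flow of FIG. 6 amounts to chaining the hypothetical helpers sketched above: correct the sentence, record which positions changed, then recall corrections the candidate dictionary does not support.

```python
def text_error_correction(sentence, model, py2chars, char_freq, corpus_docs):
    """Correct a sentence, then recall corrections absent from the candidate dictionary."""
    corrected = correct_all(sentence, model, py2chars, char_freq)
    replacements = [(i, sentence[i]) for i in range(len(sentence))
                    if sentence[i] != corrected[i]]     # positions that were replaced
    return recall_corrections(corrected, replacements, corpus_docs)
```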
It should be understood that, although the steps in the flowchart of FIG. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly ordered and may be performed in other orders. Moreover, at least some of the steps in FIG. 1 may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be performed at different times, and not necessarily in sequence; they may be performed in turn or alternately with other steps or with sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided a text correction apparatus 700 including: an obtaining module 702, a probability determining module 704, a suspected erroneous word identifying module 706, a candidate corrected word determining module 708, and an error correcting module 710, wherein:
an obtaining module 702, configured to obtain a text sentence to be corrected;

a probability determining module 704, configured to determine an N-gram probability set of the text sentence through an N-gram language model trained based on a pre-constructed positive corpus, where the N-gram probability set comprises the N-gram probability of each word in the text sentence;

a suspected erroneous word identifying module 706, configured to identify suspected erroneous words in the text sentence according to the N-gram probability set;

a candidate corrected word determining module 708, configured to acquire a candidate corrected word set corresponding to the suspected erroneous word; and

an error correcting module 710, configured to screen a target corrected word corresponding to the suspected erroneous word from the candidate corrected word set according to the N-gram language model, and replace each suspected erroneous word in the text sentence with the corresponding target corrected word to obtain a corrected text sentence.
In one embodiment, the text correction apparatus 700 further includes:
an error correction result recall module 712, configured to determine the similarity between the corrected text sentence and each document in the positive corpus; sort the documents in descending order of similarity and select a preset number of documents as candidate documents; form a candidate dictionary from the candidate documents; and, when a target corrected word that replaced a suspected erroneous word does not exist in the candidate dictionary, restore the target corrected word to the corresponding suspected erroneous word before error correction.
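A minimal sketch of this recall step is given below. The Jaccard character-overlap similarity, the top_n parameter, and the assumption that correction replaces characters one-for-one (so the original and corrected sentences have equal length) are illustrative choices, not details fixed by the original disclosure.

```python
def recall_corrections(original: str, corrected: str,
                       corpus_docs: list[str], top_n: int = 10) -> str:
    # Rank corpus documents by a simple character-overlap (Jaccard)
    # similarity to the corrected sentence.
    def similarity(doc: str) -> float:
        a, b = set(corrected), set(doc)
        return len(a & b) / len(a | b) if a | b else 0.0

    ranked = sorted(corpus_docs, key=similarity, reverse=True)
    candidate_dict = set("".join(ranked[:top_n]))  # candidate dictionary

    # Restore any replacement character that does not occur in the
    # candidate dictionary back to the pre-correction character.
    # Assumes len(original) == len(corrected), i.e. one-for-one
    # character replacement.
    return "".join(
        c if c == o or c in candidate_dict else o
        for o, c in zip(original, corrected)
    )
```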
In one embodiment, the probability determining module 704 is further configured to determine an N-gram set of the text sentence to be corrected; determine the N-gram probability of each candidate item in the N-gram set through an N-gram language model trained based on a pre-constructed positive corpus; and determine the N-gram probability corresponding to each word in the text sentence according to the N-gram probability of each candidate item, to obtain the N-gram probability set of the text sentence.
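One simple realization of this step is sketched below using the kenlm Python bindings, whose full_scores method reports the conditional log10 probability of each token given its preceding context. The model file name is the one assumed in the training sketch above; treating the full_scores output directly as the per-character N-gram probabilities is an illustrative simplification.

```python
import kenlm  # Python bindings for the kenlm toolkit

model = kenlm.Model("char_ngram.arpa")  # character-level model from the training sketch

def char_ngram_probs(sentence: str) -> list[float]:
    # Space-separate the characters to match the training format, then
    # read off the log10 probability kenlm assigns to each character
    # given its preceding (N-1)-character context.
    spaced = " ".join(sentence)
    scores = list(model.full_scores(spaced, bos=True, eos=True))
    # The final tuple scores the end-of-sentence token; drop it so the
    # result aligns one-to-one with the characters of the sentence.
    return [log_prob for log_prob, ngram_len, oov in scores[:-1]]
```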
In one embodiment, the suspected erroneous word identifying module 706 is further configured to determine the average value of the N-gram probabilities in the N-gram probability set, the absolute error of each N-gram probability from that average, and the average absolute error; determine a probability critical value corresponding to each N-gram probability in the N-gram probability set according to the ratio of the absolute error to the average absolute error; and, when an N-gram probability in the N-gram probability set is smaller than the average value and the probability critical value corresponding to that N-gram probability is larger than a preset threshold, determine that the word corresponding to that N-gram probability in the text sentence is a suspected erroneous word.
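This detection rule is essentially a mean-absolute-deviation outlier test on the per-character probabilities. A minimal sketch follows; taking the critical value to be exactly the ratio of absolute error to average absolute error, and the default threshold of 1.5, are illustrative assumptions.

```python
def detect_suspected_errors(log_probs: list[float],
                            threshold: float = 1.5) -> list[int]:
    # Average of the per-character log probabilities.
    mean = sum(log_probs) / len(log_probs)
    # Absolute error of each probability from the average, and the
    # average absolute error across the sentence.
    abs_errors = [abs(p - mean) for p in log_probs]
    mad = sum(abs_errors) / len(abs_errors)
    suspects = []
    for i, p in enumerate(log_probs):
        critical = abs_errors[i] / mad if mad > 0 else 0.0
        # Flag characters whose probability is below the average AND
        # whose critical value exceeds the preset threshold.
        if p < mean and critical > threshold:
            suspects.append(i)
    return suspects
```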
In one embodiment, the candidate corrected word determining module 708 is further configured to determine a candidate word set corresponding to the suspected erroneous word, where the pinyin of each candidate word in the candidate word set is the same as or similar to the pinyin of the suspected erroneous word; sort the candidate words in descending order of their word frequency in a pre-constructed standard word dictionary, where the standard word dictionary is constructed in advance based on the positive corpus and comprises each word in the positive corpus and its corresponding word frequency; and select a preset number of candidate corrected words from the sorted candidate words to form the candidate corrected word set of the suspected erroneous word.
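A minimal sketch of candidate generation follows. The pypinyin library is one possible way to obtain a character's pinyin; the structures of pinyin_table (the pinyin-to-Chinese-character comparison table) and char_freq (the standard word dictionary as a frequency map), as well as the top_k cutoff, are assumptions for illustration.

```python
from pypinyin import lazy_pinyin  # one possible pinyin library

def candidate_corrections(suspect: str,
                          pinyin_table: dict[str, list[str]],
                          char_freq: dict[str, int],
                          top_k: int = 5) -> list[str]:
    # Look up characters sharing the suspect character's pinyin in the
    # pre-built pinyin-to-characters comparison table.
    pinyin = lazy_pinyin(suspect)[0]
    candidates = pinyin_table.get(pinyin, [])
    # Rank candidates by their frequency in the positive corpus and
    # keep a preset number of the most frequent ones.
    ranked = sorted(candidates, key=lambda c: char_freq.get(c, 0), reverse=True)
    return ranked[:top_k]
```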
In one embodiment, the error correcting module 710 is further configured to replace the suspected erroneous word in the text sentence to be corrected with each candidate corrected word in the corresponding candidate corrected word set, respectively, to obtain a candidate text sentence set corresponding to the suspected erroneous word; determine the perplexity of each candidate text sentence in the candidate text sentence set through the N-gram language model; and select, according to the perplexities, the corrected text sentence corresponding to the suspected erroneous word from the candidate text sentences, where the corrected text sentence is the text sentence obtained after the suspected erroneous word is replaced with the target corrected word.
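A minimal sketch of this selection step, reusing the kenlm model loaded in the earlier sketch; choosing the minimum-perplexity candidate sentence is the natural reading of this step, and the helper names are assumptions.

```python
def best_correction(sentence: str, pos: int, candidates: list[str]) -> str:
    # Substitute each candidate character at the suspect position and
    # keep the candidate sentence with the lowest perplexity.
    def perplexity(s: str) -> float:
        return model.perplexity(" ".join(s))  # same spacing as training

    scored = [
        (perplexity(sentence[:pos] + cand + sentence[pos + 1:]),
         sentence[:pos] + cand + sentence[pos + 1:])
        for cand in candidates
    ]
    return min(scored)[1]
```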
In one embodiment, there are a plurality of suspected erroneous words. The error correcting module 710 is further configured to reselect a suspected erroneous word, from the suspected erroneous words contained in the corrected text sentence corresponding to the previous suspected erroneous word, as the suspected erroneous word to be corrected; take that corrected text sentence as the text sentence to be corrected; and return to the step of replacing the suspected erroneous word in the text sentence to be corrected with each candidate corrected word in the corresponding candidate corrected word set, until no suspected erroneous word exists in the corrected text sentence, so as to obtain the final corrected text sentence.
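Combining the sketches above, the iterative loop over multiple suspected erroneous words can be drafted as follows. The handled set is an added safeguard (not from the original disclosure) so that a position that resists correction cannot loop forever.

```python
def correct_sentence(sentence: str,
                     pinyin_table: dict[str, list[str]],
                     char_freq: dict[str, int]) -> str:
    # Correct one suspected character per round; the corrected sentence
    # of each round becomes the sentence to be corrected in the next,
    # until no suspected characters remain.
    handled: set[int] = set()
    while True:
        probs = char_ngram_probs(sentence)
        suspects = [i for i in detect_suspected_errors(probs)
                    if i not in handled]
        if not suspects:
            return sentence
        pos = suspects[0]
        handled.add(pos)  # safeguard: visit each position at most once
        cands = candidate_corrections(sentence[pos], pinyin_table, char_freq)
        if cands:
            sentence = best_correction(sentence, pos, cands)
```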
In one embodiment, as shown in fig. 8, the text correction apparatus 700 further includes:
a model training module 714, configured to construct a positive corpus, where the positive corpus comprises a plurality of documents that do not require error correction; preprocess the positive corpus, where the preprocessing comprises at least one of removing noise characters from the positive corpus and adjusting the format of the documents in the positive corpus to conform to the input format of a language model training tool; and train, through the language model training tool, the N-gram language model according to the preprocessed positive corpus.
In the text error correction apparatus, the N-gram probability set of the text sentence to be corrected is determined through an N-gram language model trained based on a pre-constructed positive corpus, and the N-gram probability set comprises the N-gram probability of each character in the text sentence, so the suspected erroneous characters in the text sentence can be identified according to the N-gram probability set; erroneous-character identification is thus performed at the character level, without word segmentation. A candidate corrected character set corresponding to each suspected erroneous character is acquired, target corrected characters corresponding to the suspected erroneous characters are screened from the candidate corrected character set according to the N-gram language model, and each suspected erroneous character in the text sentence is replaced with the corresponding target corrected character to obtain the corrected text sentence. Error correction is therefore performed with a character-level language model that does not rely on word segmentation, so word segmentation errors cannot degrade the error correction result, and the accuracy of text error correction is improved. In addition, unlike methods that construct a negative corpus, the apparatus only needs to construct a positive corpus, which avoids the difficulty that a negative corpus (that is, text sentences containing erroneous characters or words) is hard to anticipate, and improves the practicability of the apparatus.
For specific limitations of the text error correction apparatus, reference may be made to the above limitations of the text error correction method, which are not repeated here. Each module in the text error correction apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them to perform the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing corpus data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a text correction method.
Those skilled in the art will appreciate that the structure shown in fig. 9 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution of the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention patent. It should be noted that several variations and modifications can be made by those of ordinary skill in the art without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (11)

1. A method for correcting text, the method comprising:
acquiring a text sentence to be corrected;
determining an N-gram probability set of the text sentence through an N-gram language model trained based on a pre-constructed positive corpus, wherein the N-gram probability set comprises the N-gram probability of each word in the text sentence;

identifying suspected erroneous words in the text sentence according to the N-gram probability set;

acquiring a candidate corrected word set corresponding to the suspected erroneous word;

and screening, according to the N-gram language model, a target corrected word corresponding to the suspected erroneous word from the candidate corrected word set, and replacing each suspected erroneous word in the text sentence with the corresponding target corrected word to obtain a corrected text sentence.
2. The method of claim 1, further comprising:
determining a similarity between the corrected text sentence and each document in the positive corpus;
sorting the documents in descending order of similarity according to the determined similarities, and selecting a preset number of documents as candidate documents;

forming a candidate dictionary according to the candidate documents;

and when a target corrected word that replaced a suspected erroneous word does not exist in the candidate dictionary, restoring the target corrected word to the corresponding suspected erroneous word before error correction.
3. The method of claim 1, wherein determining the N-gram probability set for the text sentence through an N-gram language model trained based on a pre-constructed positive corpus comprises:
determining an N-gram set of the text sentence to be corrected;

determining the N-gram probability of each candidate item in the N-gram set through the N-gram language model trained based on the pre-constructed positive corpus;

and determining the N-gram probability corresponding to each word in the text sentence according to the N-gram probability of each candidate item, to obtain the N-gram probability set of the text sentence.
4. The method of claim 1, wherein the identifying suspected erroneous words in the text sentence according to the N-gram probability set comprises:

determining the average value of the N-gram probabilities in the N-gram probability set, the absolute error of each N-gram probability from the average value, and the average absolute error;

determining a probability critical value corresponding to each N-gram probability in the N-gram probability set according to the ratio of the absolute error to the average absolute error;

and when an N-gram probability in the N-gram probability set is smaller than the average value and the probability critical value corresponding to that N-gram probability is larger than a preset threshold value, determining that the word corresponding to that N-gram probability in the text sentence is a suspected erroneous word.
5. The method of claim 1, wherein the acquiring a candidate corrected word set corresponding to the suspected erroneous word comprises:

determining a candidate word set corresponding to the suspected erroneous word, wherein the pinyin of each candidate word in the candidate word set is the same as or similar to the pinyin of the suspected erroneous word;

sorting the candidate words in descending order of their word frequency in a pre-constructed standard word dictionary, wherein the standard word dictionary is constructed in advance based on the positive corpus and comprises each word in the positive corpus and its corresponding word frequency;

and selecting a preset number of candidate corrected words from the sorted candidate words to form the candidate corrected word set of the suspected erroneous word.
6. The method according to claim 1, wherein the screening, according to the N-gram language model, a target corrected word corresponding to the suspected erroneous word from the candidate corrected word set, and replacing each suspected erroneous word in the text sentence with the corresponding target corrected word to obtain a corrected text sentence comprises:

replacing the suspected erroneous word in the text sentence to be corrected with each candidate corrected word in the candidate corrected word set corresponding to the suspected erroneous word, respectively, to obtain a candidate text sentence set corresponding to the suspected erroneous word;

and determining the perplexity of each candidate text sentence in the candidate text sentence set through the N-gram language model, and selecting, according to the perplexities, the corrected text sentence corresponding to the suspected erroneous word from the candidate text sentences, wherein the corrected text sentence is the text sentence obtained after the suspected erroneous word is replaced with the target corrected word.
7. The method of claim 6, wherein there are a plurality of suspected erroneous words;

the screening, according to the N-gram language model, a target corrected word corresponding to the suspected erroneous word from the candidate corrected word set, and replacing each suspected erroneous word in the text sentence with the corresponding target corrected word to obtain a corrected text sentence further comprises:

reselecting a suspected erroneous word, from the suspected erroneous words contained in the corrected text sentence corresponding to the previous suspected erroneous word, as the suspected erroneous word to be corrected; taking that corrected text sentence as the text sentence to be corrected; and returning to the step of replacing the suspected erroneous word in the text sentence to be corrected with each candidate corrected word in the corresponding candidate corrected word set, until no suspected erroneous word exists in the corrected text sentence, to obtain the final corrected text sentence.
8. The method of claim 1, wherein the training of the N-gram language model comprises:
constructing a positive corpus, wherein the positive corpus comprises a plurality of documents that do not require error correction;

preprocessing the positive corpus, wherein the preprocessing comprises at least one of removing noise characters from the positive corpus and adjusting the format of the documents in the positive corpus to conform to the input format of a language model training tool;

and training, through the language model training tool, the N-gram language model according to the preprocessed positive corpus.
9. A text correction apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a text sentence to be corrected;
the probability determining module is used for determining an N-gram probability set of the text sentence through an N-gram language model trained based on a pre-constructed positive corpus, wherein the N-gram probability set comprises the N-gram probability of each word in the text sentence;

the suspected erroneous word identifying module is used for identifying suspected erroneous words in the text sentence according to the N-gram probability set;

the candidate corrected word determining module is used for acquiring a candidate corrected word set corresponding to the suspected erroneous word;

and the error correcting module is used for screening, according to the N-gram language model, a target corrected word corresponding to the suspected erroneous word from the candidate corrected word set, and replacing each suspected erroneous word in the text sentence with the corresponding target corrected word to obtain a corrected text sentence.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 8.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN202010650353.1A 2020-07-08 2020-07-08 Text error correction method, apparatus, computer device and storage medium Active CN111859921B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010650353.1A CN111859921B (en) 2020-07-08 2020-07-08 Text error correction method, apparatus, computer device and storage medium

Publications (2)

Publication Number Publication Date
CN111859921A true CN111859921A (en) 2020-10-30
CN111859921B CN111859921B (en) 2024-03-08

Family

ID=73152921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010650353.1A Active CN111859921B (en) 2020-07-08 2020-07-08 Text error correction method, apparatus, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN111859921B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9037967B1 (en) * 2014-02-18 2015-05-19 King Fahd University Of Petroleum And Minerals Arabic spell checking technique
KR20160034678A (en) * 2014-09-22 2016-03-30 포항공과대학교 산학협력단 Apparatus for grammatical error correction and method using the same
CN105976818A (en) * 2016-04-26 2016-09-28 Tcl集团股份有限公司 Instruction identification processing method and apparatus thereof
CN107122346A (en) * 2016-12-28 2017-09-01 平安科技(深圳)有限公司 The error correction method and device of a kind of read statement
CN108491392A (en) * 2018-03-29 2018-09-04 广州视源电子科技股份有限公司 Modification method, system, computer equipment and the storage medium of word misspelling
CN108959250A (en) * 2018-06-27 2018-12-07 众安信息技术服务有限公司 A kind of error correction method and its system based on language model and word feature
CN109086266A (en) * 2018-07-02 2018-12-25 昆明理工大学 A kind of error detection of text nearly word form and proofreading method
CN110929502A (en) * 2018-08-30 2020-03-27 北京嘀嘀无限科技发展有限公司 Text error detection method and device
CN110276077A (en) * 2019-06-25 2019-09-24 上海应用技术大学 The method, device and equipment of Chinese error correction
CN110427625A (en) * 2019-07-31 2019-11-08 腾讯科技(深圳)有限公司 Sentence complementing method, device, medium and dialog process system
CN111369996A (en) * 2020-02-24 2020-07-03 网经科技(苏州)有限公司 Method for correcting text error in speech recognition in specific field

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HE Min et al.: "A New Word Recognition Method Based on Large-Scale Corpora", Computer Engineering and Applications, no. 21, 21 July 2007 (2007-07-21), pages 157-159 *
ZHAO Yan; WANG Xiaolong; LIU Bingquan; GUAN Yi: "A Maximum Entropy Part-of-Speech Tagging Model Incorporating Clustered Trigger-Pair Features", Journal of Computer Research and Development, no. 02, 28 February 2006 (2006-02-28), pages 86-92 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001169A (en) * 2020-07-17 2020-11-27 北京百度网讯科技有限公司 Text error correction method and device, electronic equipment and readable storage medium
JP7267365B2 (en) 2020-12-11 2023-05-01 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Text error correction method, apparatus, electronic equipment and storage medium
JP2022003539A (en) * 2020-12-11 2022-01-11 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Method, apparatus, electronic device and storage medium for correcting text errors
US11423222B2 (en) * 2020-12-11 2022-08-23 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus for text error correction, electronic device and storage medium
CN112800987A (en) * 2021-02-02 2021-05-14 中国联合网络通信集团有限公司 Chinese character processing method and device
CN112800987B (en) * 2021-02-02 2023-07-21 中国联合网络通信集团有限公司 Chinese character processing method and device
CN113011406A (en) * 2021-03-24 2021-06-22 浪潮云信息技术股份公司 Single-template working flow optimization method
CN113192497A (en) * 2021-04-28 2021-07-30 平安科技(深圳)有限公司 Speech recognition method, apparatus, device and medium based on natural language processing
CN113192497B (en) * 2021-04-28 2024-03-01 平安科技(深圳)有限公司 Speech recognition method, device, equipment and medium based on natural language processing
CN113673294A (en) * 2021-05-11 2021-11-19 苏州超云生命智能产业研究院有限公司 Method and device for extracting key information of document, computer equipment and storage medium
CN113449090A (en) * 2021-06-23 2021-09-28 山东新一代信息产业技术研究院有限公司 Error correction method, device and medium for intelligent question answering
CN113591441A (en) * 2021-07-30 2021-11-02 交互未来(北京)科技有限公司 Voice editing method and device, storage medium and electronic equipment
CN115146636A (en) * 2022-09-05 2022-10-04 华东交通大学 Method, system and storage medium for correcting errors of Chinese wrongly written characters
CN116502629A (en) * 2023-06-20 2023-07-28 神州医疗科技股份有限公司 Medical direct reporting method and system based on self-training text error correction and text matching
CN116502629B (en) * 2023-06-20 2023-08-18 神州医疗科技股份有限公司 Medical direct reporting method and system based on self-training text error correction and text matching

Also Published As

Publication number Publication date
CN111859921B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN111859921B (en) Text error correction method, apparatus, computer device and storage medium
CN110210029B (en) Method, system, device and medium for correcting error of voice text based on vertical field
US7478033B2 (en) Systems and methods for translating Chinese pinyin to Chinese characters
CN111753531A (en) Text error correction method and device based on artificial intelligence, computer equipment and storage medium
US20060015326A1 (en) Word boundary probability estimating, probabilistic language model building, kana-kanji converting, and unknown word model building
WO2009035863A2 (en) Mining bilingual dictionaries from monolingual web pages
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
CN111368918B (en) Text error correction method and device, electronic equipment and storage medium
TWI567569B (en) Natural language processing systems, natural language processing methods, and natural language processing programs
CN111651978A (en) Entity-based lexical examination method and device, computer equipment and storage medium
Zhang et al. Automatic detecting/correcting errors in Chinese text by an approximate word-matching algorithm
Saluja et al. Error detection and corrections in Indic OCR using LSTMs
CN113627158A (en) Chinese spelling error correction method and device based on multiple characteristics and multiple pre-training models
JP6145059B2 (en) Model learning device, morphological analysis device, and method
Yang et al. Spell Checking for Chinese.
Das et al. A cost efficient approach to correct OCR errors in large document collections
Doush et al. Improving post-processing optical character recognition documents with Arabic language using spelling error detection and correction
Byambakhishig et al. Error correction of automatic speech recognition based on normalized web distance.
Mohapatra et al. Spell checker for OCR
JP3975825B2 (en) Character recognition error correction method, apparatus and program
CN114548075A (en) Text processing method, text processing device, storage medium and electronic equipment
JP3080066B2 (en) Character recognition device, method and storage medium
CN114528824A (en) Text error correction method and device, electronic equipment and storage medium
JP4047895B2 (en) Document proofing apparatus and program storage medium
CN110399608A (en) A kind of conversational system text error correction system and method based on phonetic

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant