CN113919326A - Text error correction method and device - Google Patents

Text error correction method and device

Info

Publication number
CN113919326A
Authority
CN
China
Prior art keywords
text
characters
candidate replacement
character
error correction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010646254.6A
Other languages
Chinese (zh)
Inventor
包祖贻
李辰
王睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Alibaba Group Holding Ltd
Priority to CN202010646254.6A
Publication of CN113919326A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

A text error correction method and apparatus are disclosed. A text is character-segmented to obtain a plurality of original characters. Semantic candidate replacement characters are generated for each Chinese character among the plurality of original characters, and the semantic candidate replacement characters are ranked and decoded to generate an error correction result. For example, for at least one original character, a language model may be used to obtain one or more candidate replacement characters, together with the conditional probabilities of the original character and of its candidate replacement characters, based on the context in the text. For each of the at least one original character, a selected character is then determined from the original character and its candidate replacement characters based on the conditional probabilities in combination with the context, thereby generating an error correction result text. Because candidates are generated from context semantics, the text error correction scheme has good coverage of spelling errors that involve neither similar pronunciations nor similar glyphs, and can effectively improve error correction efficiency and performance.

Description

Text error correction method and device
Technical Field
The present disclosure relates to text processing and, more particularly, to text error correction methods and apparatus.
Background
When people use various input tools to enter text, input errors occur, including misspellings and errors involving visually similar or phonetically similar characters.
On the one hand, misspellings make text prone to misunderstanding and reduce the efficiency of written communication. On the other hand, in many formal document scenarios, such as judicial documents and contracts, the tolerance for spelling errors is very low, and manually proofreading the entered text is time-consuming and labor-intensive. There is therefore a growing demand for automatic spell checking and correction of text.
Spell correction is a technique for automatically correcting spelling errors in text to produce correct text.
Spelling correction systems for Western European languages such as English are well established; they rely primarily on word-granularity checking and correction.
Chinese differs greatly from Western European languages such as English.
First, the number of Chinese characters is very large: there are more than 3,000 commonly used Chinese characters alone. This makes the search space of a Chinese error correction system far larger than that of an English one.
Moreover, Chinese words are generally short, so a misspelled character often changes the meaning of the word and of its context considerably.
To cope with these problems, conventional Chinese spelling correction systems mostly rely on confusion sets of similar pronunciations and similar glyphs (pronunciation-shape confusion sets) to relate characters with similar pinyin and/or similar glyphs, and they restrict the search space to the set of characters whose pronunciation and/or glyph is similar to the object being checked (the correction object/check object), so as to reduce the search space. Here, a confusion set is the set of candidates considered as possible replacements for a character of the sentence being corrected.
However, this reliance on pronunciation-shape confusion sets prevents conventional Chinese spelling correction systems from handling spelling errors that involve neither similar pronunciations nor similar glyphs, which also limits their error correction performance.
Accordingly, there remains a need for an improved text correction scheme.
Disclosure of Invention
One technical problem to be solved by the present disclosure is to provide a text error correction scheme that is capable of finding and correcting spelling errors that do not involve similar pronunciations nor similar glyphs.
According to a first aspect of the present disclosure, a text error correction method is provided, including: performing character segmentation on a text to be corrected to obtain a plurality of original characters; generating semantic candidate replacement characters for each Chinese character among the plurality of original characters; and ranking and decoding the semantic candidate replacement characters to generate an error correction result.
Optionally, the step of generating semantic candidate replacement characters for each Chinese character among the plurality of original characters includes: obtaining candidate replacement characters for each Chinese character based on context information in the text to be corrected.
Optionally, the method may further include: calculating the conditional probability of each Chinese character and of its semantic candidate replacement characters, wherein the step of ranking and decoding the semantic candidate replacement characters to generate an error correction result includes: ranking and decoding the semantic candidate replacement characters of each Chinese character according to the conditional probabilities and determining the replacement characters, thereby generating the error correction result.
Optionally, the step of performing character segmentation on the text to be corrected includes: dividing Chinese characters and punctuation marks in the text to be corrected into single characters; and/or segmenting phonograms by words; and/or segmenting numbers according to a number expression specification.
Optionally, after the step of generating semantic candidate replacement characters for each Chinese character, the method further includes: filtering the semantic candidate replacement characters based on a lexicon.
Optionally, the step of filtering the semantic candidate replacement characters based on the lexicon includes: deleting a first semantic candidate replacement character if no word composed of the first semantic candidate replacement character and an adjacent semantic candidate replacement character exists in the lexicon.
Optionally, the step of generating the error correction result text includes: selecting, based on the conditional probabilities, the replacement character sequence that maximizes the joint probability as the error correction result text.
Optionally, the step of generating the error correction result text includes: selecting, based on the conditional probabilities and using a beam search method, the replacement character sequence that maximizes the joint probability as the error correction result text.
Optionally, the step of generating semantic candidate replacement characters for each Chinese character among the plurality of original characters further includes: for at least one original character, further obtaining one or more candidate replacement characters based on similar pronunciation or similar glyphs.
According to a second aspect of the present disclosure, a voice input method is provided, including: receiving speech input by a user; recognizing the received speech as text; performing character segmentation on the text to obtain a plurality of original characters; generating semantic candidate replacement characters for each Chinese character among the plurality of original characters; and ranking and decoding the semantic candidate replacement characters to generate an error correction result.
According to a third aspect of the present disclosure, a text input method is provided, including: receiving text input by a user; performing character segmentation on the text to obtain a plurality of original characters; generating semantic candidate replacement characters for each Chinese character among the plurality of original characters; and ranking and decoding the semantic candidate replacement characters to generate an error correction result.
According to a fourth aspect of the present disclosure, an article error correction method is provided, including: extracting text from an article, wherein the text is a sentence, a phrase, or a text fragment containing a predetermined number of characters; performing character segmentation on the text to obtain a plurality of original characters; generating semantic candidate replacement characters for each Chinese character among the plurality of original characters; and ranking and decoding the semantic candidate replacement characters to generate an error correction result.
According to a fifth aspect of the present disclosure, a text error correction apparatus is provided, including: a character segmentation device configured to perform character segmentation on a text to be corrected to obtain a plurality of original characters; a candidate generation device configured to generate semantic candidate replacement characters for each Chinese character among the plurality of original characters; and a decoding device configured to rank and decode the semantic candidate replacement characters to generate an error correction result.
According to a sixth aspect of the present disclosure, there is provided a text error correction method including: performing character segmentation on the text to obtain a plurality of original characters; for at least one original character, using a language model to respectively obtain one or more candidate replacement characters and conditional probabilities of the original character and the candidate replacement characters thereof based on the context in the text; and determining a selected character from the original character and the candidate replacement character thereof based on the conditional probability for each of the at least one original character in combination with the context, thereby generating an error correction result text, wherein the selected character is used for replacing the corresponding original character in the text.
Optionally, the step of performing character segmentation on the text includes: segmenting ideographic characters and punctuation marks in the text into single characters.
Optionally, the ideographic characters are Chinese characters.
Optionally, the step of performing character segmentation on the text further includes: segmenting phonograms by words; and/or segmenting numbers according to a number expression specification.
Optionally, the method may further include: filtering the candidate replacement characters based on a lexicon before the step of determining the selected characters from the original characters and their candidate replacement characters.
Optionally, the step of filtering the candidate replacement characters based on the lexicon includes: deleting, for each of the at least one original character, candidate replacement characters that cannot form a word in the lexicon with the adjacent characters of the original character and/or with the candidate replacement characters of those adjacent characters in the text.
Optionally, the step of selecting a selected character from the original character and its candidate replacement characters to generate the error correction result text includes: selecting, based on the conditional probabilities, the selected character sequence that maximizes the joint probability as the error correction result text.
Optionally, the step of selecting a selected character from the original character and its candidate replacement characters to generate the error correction result text includes: selecting, based on the conditional probabilities and using a beam search method, the selected character sequence that maximizes the joint probability as the error correction result text.
Optionally, the language model is a multi-layer bidirectional LSTM network model or a CNN model.
Optionally, the method may further include: training the language model using an unsupervised corpus.
Optionally, the language model is a multi-layer bidirectional LSTM network model, and the step of training the language model using an unsupervised corpus includes: performing character segmentation on the unsupervised corpus and adding a start marker symbol and an end marker symbol to form a training character sequence; and, for a character in the training character sequence: inputting the forward sequence of the characters preceding the character into a forward LSTM to obtain a first hidden-layer representation, inputting the reverse sequence of the characters following the character into a backward LSTM to obtain a second hidden-layer representation, concatenating the first hidden-layer representation and the second hidden-layer representation to obtain a third hidden-layer representation, predicting the conditional probability of the character using a feed-forward network and softmax based on the third hidden-layer representation, and updating the language model using a back-propagation algorithm.
Optionally, the text is a sentence or phrase or a text fragment containing a predetermined number of characters.
Optionally, the method may further: one or more candidate replacement characters are further obtained for at least one original character based on the similar voice or the similar font.
According to a seventh aspect of the present disclosure, there is provided a text error correction method including: segmenting the text to obtain a plurality of text elements; for at least one text element, respectively obtaining one or more candidate replacement elements and conditional probabilities of the text element and its candidate replacement elements based on context in the text using a language model; and selecting a selected element from the text element and candidate replacement elements thereof for each of the at least one text element based on the conditional probability in combination with the context, thereby generating an error correction result text, wherein the selected element is used for replacing the corresponding text element in the text.
According to an eighth aspect of the present disclosure, there is provided a voice input method comprising: receiving voice input by a user; recognizing the received speech as text; performing character segmentation on the text to obtain a plurality of original characters; for at least one original character, using a language model to respectively obtain one or more candidate replacement characters and conditional probabilities of the original character and the candidate replacement characters thereof based on the context in the text; and determining a selected character from the original character and the candidate replacement character thereof based on the conditional probability for each of the at least one original character in combination with the context, thereby generating an error correction result text, wherein the selected character is used for replacing the corresponding original character in the text.
According to a ninth aspect of the present disclosure, there is provided a text input method, comprising: receiving text input by a user; performing character segmentation on the text to obtain a plurality of original characters; for at least one original character, using a language model to respectively obtain one or more candidate replacement characters and conditional probabilities of the original character and the candidate replacement characters thereof based on the context in the text; and determining a selected character from the original character and the candidate replacement character thereof based on the conditional probability for each of the at least one original character in combination with the context, thereby generating an error correction result text, wherein the selected character is used for replacing the corresponding original character in the text.
According to a tenth aspect of the present disclosure, there is provided an article error correction method, including: extracting text from the article, wherein the text is a sentence or a phrase or a text fragment containing a predetermined number of characters; performing character segmentation on the text to obtain a plurality of original characters; for at least one original character, using a language model to respectively obtain one or more candidate replacement characters and conditional probabilities of the original character and the candidate replacement characters thereof based on the context in the text; and determining a selected character from the original character and the candidate replacement character thereof based on the conditional probability for each of the at least one original character in combination with the context, thereby generating an error correction result text, wherein the selected character is used for replacing the corresponding original character in the text.
According to an eleventh aspect of the present disclosure, there is provided a text error correction apparatus including: the character segmentation device is used for carrying out character segmentation on the text to obtain a plurality of original characters; candidate generating means for obtaining, for at least one original character, one or more candidate replacement characters and conditional probabilities of the original character and its candidate replacement characters, respectively, based on a context in the text using a language model; and decoding means for determining, for each of the at least one original character, a selected character from among the original character and candidate replacement characters thereof based on the conditional probability in combination with the context, thereby generating an error correction result text in which the selected character is used in place of the corresponding original character.
According to a twelfth aspect of the present disclosure, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described in the first to fourth, sixth to tenth aspects above.
According to a thirteenth aspect of the present disclosure, there is provided a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method as described in the first to fourth, sixth to tenth aspects above.
Thus, a semantic confusion set is provided: effective candidates are generated from context semantics, giving good coverage of spelling errors that involve neither similar pronunciations nor similar glyphs, which can effectively improve error correction efficiency and performance.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
Fig. 1 is a schematic diagram of an example of a text correction scheme according to the present disclosure.
Fig. 2 is a schematic flow diagram of a text correction method according to the present disclosure.
Fig. 3 is a schematic block diagram of a text correction apparatus according to the present disclosure.
FIG. 4 is a schematic flow chart of a training method for a language model in the embodiment.
Fig. 5 is a schematic structural diagram of a computing device that can be used to implement the text error correction method according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The method provides a semantic confusion set based on a language model. Candidates are generated from the contextual semantic information of the text, so the method is not constrained by pronunciation or glyph, can effectively handle spelling errors that are neither phonetically nor visually similar to the intended characters, and can serve as an effective supplement to conventional pronunciation and glyph confusion sets.
Fig. 1 is a schematic diagram of an example of a text correction scheme according to the present disclosure.
In the example shown in Fig. 1, the user wishes to input a sentence meaning "countless storms in life".
The text actually received by the system, entered by voice or keyboard, instead contains the character "exhaust" where the character "count" was intended.
After context-based analysis using the language model, the following is found:
a candidate replacement (confusion set entry) for the character "person" may be "one";
candidate replacements for the character "middle" may be "inner" and the like;
a candidate replacement for the character "wind" may be "snow";
a candidate replacement for the character "exhaust" may be "count".
After the conditional-probability decoding analysis, it is determined that "exhaust" should be corrected to "count", while the other characters remain unchanged.
The sentence meaning "countless storms in life" is thereby output as the text error correction result.
The text error correction scheme according to the present disclosure is described in further detail below with reference to fig. 2 to 4.
Fig. 2 is a schematic flow diagram of a text correction method according to the present disclosure.
Fig. 3 is a schematic block diagram of a text correction apparatus according to the present disclosure.
As shown in fig. 3, the text error correction apparatus 300 according to the present disclosure may include a character segmentation device 310, a candidate generation device 320, and a decoding device 340. In a preferred embodiment, a lexicon filtering device 330 may further be included.
As shown in fig. 2, in step S210, the text to be corrected may be character-segmented, for example by the character segmentation device 310, to obtain a plurality of original characters.
An "original character" is a character originally in the text.
Here, "text" may be a sentence or a phrase.
Alternatively, in some cases, the "text" may also be a text fragment containing a predetermined number of characters. For example, when a whole sentence is long, a part of the sentence may be cut out as the "text" for analysis and correction (the sentence may be split at punctuation marks such as commas, or at specific characters).
When the characters of the text are segmented, the ideograms and punctuations in the text can be segmented into single characters.
In particular, in the context of Chinese input, the ideographs herein may be Chinese, with the corresponding characters being Chinese characters.
On the other hand, when character segmentation is performed on the text, phonographic words such as English words can be segmented as whole words.
In addition, numbers of various kinds can be segmented according to a number expression specification.
For example, a text meaning "I eat KFC at 11 o'clock today" is segmented as "I | to- | -day | 11 | o'clock | , | eat | KFC": each Chinese character and each punctuation mark becomes a single unit, the number "11" is kept together according to the number expression specification, and the English word "KFC" is kept whole.
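As an illustration only (not the patent's own implementation), the segmentation rule just described can be sketched in Python as follows; the regular expression, the function name, and the reconstructed example sentence are assumptions introduced for this sketch.

```python
import re

# Minimal sketch of the character segmentation rule described above (an assumption,
# not the patent's code): each Chinese character and each punctuation mark becomes
# its own token, while runs of Latin letters or digits are kept whole.
_TOKEN_RE = re.compile(
    r"[A-Za-z]+"          # phonographic words such as "KFC" kept whole
    r"|\d+(?:\.\d+)?"     # numbers kept together, e.g. "11"
    r"|[\u4e00-\u9fff]"   # each Chinese character as a single token
    r"|[^\sA-Za-z0-9]"    # punctuation marks as single tokens
)

def segment_characters(text: str) -> list[str]:
    """Split mixed Chinese/Latin/numeric text into original characters/tokens."""
    return _TOKEN_RE.findall(text)

if __name__ == "__main__":
    # Reconstructed example sentence (the exact original wording is not given in
    # the translation): "I eat KFC at 11 o'clock today".
    print(segment_characters("我今天11点，吃KFC"))
    # -> ['我', '今', '天', '11', '点', '，', '吃', 'KFC']
```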
Then, in step S220, semantic candidate replacement characters may be generated for each of the plurality of original characters obtained by the segmentation, for example by the candidate generation device 320.
Here, candidate replacement characters may be obtained for each Chinese character obtained by the segmentation, based on the context information in the text to be corrected.
In some embodiments, the conditional probability of each Chinese character and of its semantic candidate replacement characters may also be calculated, for subsequent use in determining the final replacement character.
In some embodiments, a language model may be used to generate semantic candidate replacement characters for each Chinese character among the original characters. Alternatively, in some embodiments, a language model may also be used to generate semantic candidate replacement characters for all kinds of characters among the original characters, including Chinese characters, phonographic words, numbers, and the like. In the following description of the original characters and their corresponding semantic candidate replacement characters, if the error correction processing according to the present disclosure is applied only to the Chinese characters, the description may be understood as referring to the Chinese characters among the original characters and their semantic candidate replacement characters.
For example, for at least one original character (in particular, a Chinese character), a language model may be used to obtain one or more semantic candidate replacement characters, together with the conditional probabilities of the original character and of its semantic candidate replacement characters, based on the context in the text.
A "conditional probability" here may refer to the probability that a character is selected given its context, in particular given the preceding character (or character sequence). In other words, the conditional probability of the current character may differ when different context characters (especially preceding characters) are chosen.
A language model is a machine learning model used to model the probability distribution of a continuous sequence (e.g., text).
The language model used in the embodiments may be various network model structures, such as a multi-layer bidirectional LSTM network model, a CNN model, and the like.
Thus, for the input text, the language model can be used to generate a semantic confusion set (a pronunciation confusion set and/or a glyph confusion set may additionally be attached): semantic candidates are generated for each Chinese character, i.e., candidate Chinese characters that, judged by the semantics, could plausibly replace that character.
The language model may be trained on an unsupervised corpus so that it can effectively generate semantic candidate replacement characters using contextual semantic information.
The following describes a process for training a language model in a simplified manner with reference to fig. 4, taking a multi-layer bidirectional LSTM network model as an example.
FIG. 4 is a schematic flow chart of a training method for a language model in the embodiment.
First, the unsupervised corpus is character-segmented, and a START marker (e.g., "START") and an END marker (e.g., "END") are added to form a training character sequence.
For example, a training sentence meaning "countless storms in life" is taken as unsupervised corpus, character-segmented, and the START and END marker symbols are added, giving [START, person, life, middle, wind, rain, without, count, END] (each element glossing one Chinese character).
Then, each character in the sentence can be predicted using the multi-layer bidirectional LSTM network model.
As shown in fig. 4, the following operations may be performed for each character in the training character sequence; the character "middle" in the sequence above is taken as an example.
In step S410, the forward sequence of the characters preceding the character "middle" (i.e., [START, person, life]) is input into the forward LSTM to obtain a first hidden-layer representation h1.
In step S420, the reverse sequence of the characters following the character "middle" (i.e., [END, count, without, rain, wind]) is input into the backward LSTM to obtain a second hidden-layer representation h2.
In step S430, the first hidden-layer representation h1 and the second hidden-layer representation h2 are concatenated to obtain a third hidden-layer representation h3 = [h1, h2].
In step S440, based on the third hidden-layer representation h3, the conditional probability of the character ("middle") is predicted, for example using a feed-forward network and softmax.
The language model is then updated in step S450, for example using a back-propagation algorithm.
Thus, a language model can be trained based on a large number of unsupervised corpora, enabling it to generate candidates efficiently using contextual semantic information.
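The training procedure of steps S410 to S450 can be illustrated with the following PyTorch-style sketch. This is not the patent's own code: the class name, layer sizes, optimizer, and vocabulary size are assumptions made for illustration, and batching details are simplified.

```python
import torch
import torch.nn as nn

class BiLSTMLanguageModel(nn.Module):
    """Sketch of the multi-layer bidirectional LSTM language model described above.
    Dimensions and names are illustrative assumptions, not values from the patent."""
    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden: int = 256, layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.fwd_lstm = nn.LSTM(emb_dim, hidden, num_layers=layers, batch_first=True)
        self.bwd_lstm = nn.LSTM(emb_dim, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(2 * hidden, vocab_size)  # forward network before softmax

    def forward(self, left_ctx, right_ctx_reversed):
        # left_ctx:           [START, c_1, ..., c_{i-1}]  in forward order  (step S410)
        # right_ctx_reversed: [END, c_n, ..., c_{i+1}]    in reverse order  (step S420)
        _, (h1, _) = self.fwd_lstm(self.embed(left_ctx))            # first hidden-layer representation
        _, (h2, _) = self.bwd_lstm(self.embed(right_ctx_reversed))  # second hidden-layer representation
        h3 = torch.cat([h1[-1], h2[-1]], dim=-1)                    # concatenation (step S430)
        return self.proj(h3)  # logits; softmax gives P(c_i | context) (step S440)

model = BiLSTMLanguageModel(vocab_size=6000)  # vocabulary size is an assumption
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(left_ctx, right_ctx_reversed, target_char_id):
    """One training step: cross-entropy on the target character, then back-propagation (step S450)."""
    logits = model(left_ctx, right_ctx_reversed)
    loss = loss_fn(logits, target_char_id)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```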
When text error correction is actually performed, an input sentence is character-segmented in step S210 and the START and END marker symbols are added. For example, for an erroneous input in which the character "exhaust" has been entered where "count" was intended (cf. the example of Fig. 1), this gives [START, person, life, middle, wind, rain, without, exhaust, END].
Semantic candidates can then be generated for each Chinese character using the trained language model: for example, besides "person" itself, the generated candidates include "one" and the like; the candidates for "middle" include "inner" and the like; a candidate for "wind" is "snow"; and the candidates for "exhaust" include "count", "like", and the like.
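Continuing the sketch above (again an illustration with assumed parameter names, threshold, and top-k value, not the patent's code), semantic candidates for one position can be read off the model's context-conditioned distribution:

```python
import torch

def semantic_candidates(model, char_ids, index, id2char, top_k: int = 5, min_prob: float = 1e-3):
    """Propose semantic candidate replacement characters for the character at `index`,
    using the bidirectional language model's context-based distribution.
    `top_k` and `min_prob` are illustrative knobs, not values from the patent."""
    left_ctx = torch.tensor([char_ids[:index]])                      # [START, ..., c_{i-1}]
    right_ctx_reversed = torch.tensor([char_ids[index + 1:][::-1]])  # [END, ..., c_{i+1}]
    with torch.no_grad():
        probs = torch.softmax(model(left_ctx, right_ctx_reversed), dim=-1)[0]
    candidates = []
    for char_id in torch.topk(probs, top_k).indices.tolist():
        if char_id != char_ids[index] and probs[char_id] >= min_prob:
            candidates.append((id2char[char_id], probs[char_id].item()))
    # The original character's own conditional probability is also returned,
    # because it competes with its candidates during decoding.
    return candidates, probs[char_ids[index]].item()
```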
Optionally, in step S230, the candidate replacement characters may be filtered based on a lexicon, for example by the lexicon filtering device 330. Step S230 may be performed, for example, before a selected character is finally determined from the original character and its semantic candidate replacement characters. Step S230 and the lexicon filtering device 330 are not required for implementing the text error correction method of the present disclosure and are therefore drawn with dashed boxes in the figures.
Here, for each of the at least one original character, semantic candidate replacement characters that cannot form a word in the lexicon with the adjacent characters of the original character and/or with the semantic candidate replacement characters of those adjacent characters in the text may be deleted.
In other words, for a given semantic candidate replacement character, referred to as a "first semantic candidate replacement character", if the lexicon contains no word formed by the first candidate replacement character together with an adjacent character or an adjacent candidate replacement character, the first candidate replacement character may be deleted.
Words in the lexicon may originate from dictionaries, word segmentation of text corpora, manual collection, automated collection from the network, and so on.
For example, both "life" and "lifetime" may be words in the lexicon, so both corresponding characters are retained. The lexicon contains no word corresponding to "endless number" but does contain "countless"; "exhaust" is therefore deleted from the candidate set (the semantic confusion set) at that position, and "count" is kept.
In this way, further screening against the lexicon removes some inappropriate candidate replacement characters (or replacement words) and reduces the workload of the subsequent decoding processing.
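A minimal sketch of the lexicon-based screening follows, under the simplifying assumption that only the immediately adjacent tokens are checked; the function name and the one-token window are assumptions introduced here.

```python
def filter_by_lexicon(position, candidates, tokens, lexicon):
    """Keep a candidate only if it forms a word found in the lexicon with an
    adjacent token. A fuller implementation would also try the adjacent
    positions' own candidates, as described above."""
    kept = []
    for cand, prob in candidates:
        forms_word = (
            (position > 0 and tokens[position - 1] + cand in lexicon)
            or (position + 1 < len(tokens) and cand + tokens[position + 1] in lexicon)
        )
        if forms_word:
            kept.append((cand, prob))
    return kept
```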
In addition, if the lexicon were used directly in step S230 to determine candidates, without the analysis by the language model, a very large amount of work would be required.
By using the language model, the present disclosure can significantly reduce the workload of the subsequent lexicon lookup and/or decoding processes.
Then, in step S240, the semantic candidate replacement characters may be ranked and decoded, for example by the decoding device 340, to generate an error correction result.
In some embodiments, for example where the conditional probability of each Chinese character and of its semantic candidate replacement characters has been calculated in step S220 (at the same time as, or after, the candidates are generated), step S240 may rank and decode the semantic candidate replacement characters of each Chinese character according to the conditional probabilities and determine the replacement characters, thereby generating the error correction result.
For example, in some embodiments, a selected character (replacement character) may be determined, for each of the at least one original character, from the original character and its candidate replacement characters based on the conditional probabilities in combination with the context, thereby generating the error correction result text.
Here, the error correction result text is generated by replacing the corresponding original characters in the text with the selected characters (replacement characters).
Based on the conditional probability of each selected character, a joint probability of the sequence of selected characters may be calculated. For example, the joint probability of the entire character sequence may be the product of the conditional probabilities of the individual characters under the current context conditions.
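Written out (with notation introduced here only for illustration), the joint probability of a selected character sequence $c_1, \ldots, c_n$ is taken as the product of the per-character conditional probabilities,

$P(c_1, \ldots, c_n) = \prod_{i=1}^{n} P(c_i \mid \text{context}_i)$,

and the decoding step searches for the sequence that maximizes this quantity.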
Adjusting any one character may change the conditional probabilities of the other characters and thus the joint probability of the entire character sequence.
Here, for example, a beam search method may be used to select the selected character sequence that maximizes the joint probability.
The selected character sequence that maximizes the joint probability may then be taken as the error correction result text.
In addition, before the joint probabilities of character sequences are calculated, screening may be performed according to the conditional probabilities of the Chinese characters and of their replacement characters, in particular for Chinese characters with many replacement candidates, retaining only the characters or replacement characters with higher conditional probabilities. This reduces the number of character sequences whose joint probabilities have to be calculated, lowers the computational complexity of determining the final error correction result text, and improves efficiency.
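The beam-search decoding can be sketched as follows. This is a simplified illustration: it treats the per-position conditional probabilities as fixed, whereas, as noted above, selecting a different character can change the conditional probabilities of the other positions, so a fuller implementation would re-score each expanded hypothesis with the language model. The beam width is an assumed value.

```python
import math

def beam_search_decode(positions, beam_width: int = 5):
    """`positions` is a list where each entry holds (character, conditional_probability)
    pairs -- the original character plus its surviving candidates after screening.
    Log-probabilities are summed, which corresponds to maximizing the joint
    probability (the product of conditional probabilities) described above."""
    beams = [([], 0.0)]  # (partial character sequence, log joint probability)
    for choices in positions:
        expanded = []
        for seq, score in beams:
            for char, prob in choices:
                expanded.append((seq + [char], score + math.log(max(prob, 1e-12))))
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]  # keep only the best partial sequences
    best_sequence, _ = beams[0]
    return "".join(best_sequence)
```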
Thus, a semantic confusion set is provided based on context semantics, and spelling errors can be found and corrected without relying on similar glyphs or similar pronunciations.
In addition, a pronunciation confusion set or a glyph confusion set may further be provided based on similar pronunciation and/or similar glyphs. That is, for at least one original character, one or more candidate replacement characters may further be obtained based on similar pronunciation or similar glyphs. This allows spelling errors to be further checked and corrected on the basis of pronunciation or glyph.
The text error correction method has been described above mainly with the Chinese character as the error correction unit. It should be understood that the error correction unit may also be another text element, such as a word or a phrase.
Accordingly, in the text error correction process, the text may be segmented to obtain a plurality of text elements.
For at least one such text element, one or more candidate replacement elements, and the conditional probabilities of the text element and of its candidate replacement elements, may be obtained based on the context in the text using a language model. The language model may be substantially the same as the language model described above, except that it is trained on such text elements.
Then, in combination with the context, for each of the at least one text element, a selected element may be chosen from the text element and its candidate replacement elements based on the conditional probabilities, thereby generating an error correction result text in which the corresponding text element in the text is replaced by the selected element.
As an application example, the text error correction method provided by the present disclosure may be applied to error correction of a text obtained by speech recognition.
For example, an error correction option may be provided in association with the speech input module or the speech text entry box for the user to select whether to perform automatic error correction of the speech input text. In the case where the user selects to perform automatic error correction, the text input by the user through voice may be corrected using the text error correction method according to the present disclosure.
Thus, in the speech recognition process, speech input by the user is received and recognized as text.
The text is then character-segmented to obtain a plurality of original characters.
Semantic candidate replacement characters are generated for each Chinese character among the plurality of original characters, and the semantic candidate replacement characters are ranked and decoded to generate an error correction result.
In some embodiments, for at least one original character, one or more candidate replacement characters and conditional probabilities of the original character and its candidate replacement characters, respectively, may be obtained based on context in the text, e.g., using a language model.
Then, in conjunction with the context, for each of the at least one original character, a selected character may be determined from the original character and its candidate replacement characters based on the conditional probability, thereby generating an error correction result text in which the selected character is used to replace the corresponding original character.
Alternatively, the text error correction method provided by the present disclosure may also be applied to error correction of a text input by a user through various manners such as a keyboard, a tablet, and the like.
For example, a correction option may be provided in association with a text entry module or a text entry box for the user to select whether to perform automatic correction of the entered text. If the user selects automatic error correction, the text error correction method according to the present disclosure may be used to correct text entered via a keyboard, a tablet, or the like.
While the user enters text in any of these ways, the entered text is received and character-segmented to obtain a plurality of original characters.
Semantic candidate replacement characters are generated for each Chinese character among the plurality of original characters, and the semantic candidate replacement characters are ranked and decoded to generate an error correction result.
In some embodiments, for at least one original character, one or more candidate replacement characters and conditional probabilities of the original character and its candidate replacement characters, respectively, may be obtained based on context in the text, e.g., using a language model.
Then, in conjunction with the context, for each of the at least one original character, a selected character may be determined from the original character and its candidate replacement characters based on the conditional probability, thereby generating an error correction result text in which the selected character is used to replace the corresponding original character.
Alternatively, the text error correction method provided by the present disclosure may also be applied to finding and correcting misspelled characters and similar errors in an existing article.
The term "article" here may be construed broadly to include various forms of textual content that have been entered into a computing device and stored on a storage medium, such as a paper, a letter, advertisement copy, a notice, or a novel. The "article" may or may not be complete.
When article error correction is performed, text may be extracted from the article, the text being a sentence, a phrase, or a text fragment containing a predetermined number of characters.
The text is character-segmented to obtain a plurality of original characters.
Semantic candidate replacement characters are generated for each Chinese character among the plurality of original characters, and the semantic candidate replacement characters are ranked and decoded to generate an error correction result.
In some embodiments, for at least one original character, one or more candidate replacement characters and conditional probabilities of the original character and its candidate replacement characters, respectively, may be obtained based on context in the text, e.g., using a language model.
Then, in conjunction with the context, for each of the at least one original character, a selected character may be determined from the original character and its candidate replacement characters based on the conditional probability, thereby generating an error correction result text in which the selected character is used to replace the corresponding original character.
The error correction result text then replaces the corresponding text in the article. Once the error correction processing has been completed for all the text (sentences, phrases, or text fragments) that can be, or is intended to be, extracted from the article, error correction of the entire article is achieved.
The method provides a semantic confusion set based on a language model: candidates (the semantic confusion set) are generated from the contextual semantic information of the text, without being constrained by pronunciation or glyph, so spelling errors that are neither phonetically nor visually similar to the intended characters can be handled effectively.
In addition, pronunciation and glyph confusion sets may further be combined so that pronunciation similarity and glyph similarity are also taken into account, which can further improve error correction efficiency and performance.
The technical solution of the present disclosure can therefore serve as an effective supplement to conventional pronunciation and glyph confusion sets, while greatly reducing the manual effort needed to curate pinyin and glyph relationships.
Fig. 5 is a schematic structural diagram of a computing device that can be used to implement the text error correction method according to an embodiment of the present invention.
Referring to fig. 5, computing device 500 includes memory 510 and processor 520.
The processor 520 may be a multi-core processor or may include a plurality of processors. In some embodiments, processor 520 may include a general-purpose host processor and one or more special coprocessors such as a Graphics Processor (GPU), a Digital Signal Processor (DSP), or the like. In some embodiments, processor 520 may be implemented using custom circuitry, such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).
The memory 510 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions for the processor 520 or other modules of the computer. The permanent storage may be a readable and writable storage device, and may be a non-volatile device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the permanent storage. In other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable memory device or a volatile readable and writable memory device, such as dynamic random access memory, and may store instructions and data needed by some or all of the processors at runtime. Further, the memory 510 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), and magnetic and/or optical disks may also be employed. In some embodiments, the memory 510 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., an SD card, a mini SD card, a Micro-SD card, etc.), or a magnetic floppy disk. Computer-readable storage media do not include carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 510 has stored thereon executable code, which when processed by the processor 520, causes the processor 520 to perform the methods described above.
The text error correction scheme according to the present invention has been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (15)

1. A text error correction method comprising:
performing character segmentation on a text to be corrected to obtain a plurality of original characters;
generating semantic candidate replacement characters for each Chinese character in the plurality of original characters; and
ranking and decoding the semantic candidate replacement characters to generate an error correction result.
2. The method of claim 1, wherein generating semantic candidate replacement words for each of the plurality of raw characters comprises:
and obtaining candidate replacement characters for each Chinese character based on the context information in the text to be corrected.
3. The method of claim 1, further comprising:
calculating the conditional probability of each Chinese character and of its semantic candidate replacement characters,
wherein the step of ranking and decoding the semantic candidate replacement characters to generate an error correction result comprises:
ranking and decoding the semantic candidate replacement characters of each Chinese character according to the conditional probabilities, and determining the replacement characters so as to generate the error correction result.
4. The method of claim 1, wherein the character segmentation of the text to be corrected comprises:
dividing Chinese characters and punctuation marks in the text to be corrected into single characters; and/or
segmenting phonograms by words; and/or
segmenting numbers according to a number expression specification.
5. The method of claim 1, further comprising:
and screening semantic candidate replacement words based on the word bank.
6. The method of claim 5, wherein the step of filtering the semantic candidate replacement characters based on the lexicon comprises:
deleting a first semantic candidate replacement character if no word composed of the first semantic candidate replacement character and an adjacent semantic candidate replacement character exists in the lexicon.
7. The method of claim 1, wherein the generating of the error correction result text comprises:
selecting, based on the conditional probabilities, the replacement character sequence that maximizes the joint probability as the error correction result text.
8. The method of claim 1, wherein the generating of the error correction result text comprises:
selecting, based on the conditional probabilities and using a beam search method, the replacement character sequence that maximizes the joint probability as the error correction result text.
9. The method of claim 1, wherein generating semantic candidate replacement characters for each Chinese character in the plurality of original characters further comprises:
further obtaining, for at least one original character, one or more semantic candidate replacement characters based on similar pronunciation or similar glyphs.
10. A voice input method comprising:
receiving voice input by a user;
recognizing the received speech as text;
performing character segmentation on the text to obtain a plurality of original characters;
generating semantic candidate replacement characters for each Chinese character in the plurality of original characters; and
ranking and decoding the semantic candidate replacement characters to generate an error correction result.
11. A text entry method comprising:
receiving text input by a user;
performing character segmentation on the text to obtain a plurality of original characters;
generating semantic candidate replacement characters for each Chinese character in the plurality of original characters; and
ranking and decoding the semantic candidate replacement characters to generate an error correction result.
12. An article error correction method comprising:
extracting text from the article, wherein the text is a sentence or a phrase or a text fragment containing a predetermined number of characters;
performing character segmentation on the text to obtain a plurality of original characters;
generating semantic candidate replacement characters for each Chinese character in the plurality of original characters; and
ranking and decoding the semantic candidate replacement characters to generate an error correction result.
13. A text correction apparatus comprising:
a character segmentation device configured to perform character segmentation on a text to be corrected to obtain a plurality of original characters;
a candidate generation device configured to generate semantic candidate replacement characters for each Chinese character in the plurality of original characters; and
a decoding device configured to rank and decode the semantic candidate replacement characters to generate an error correction result.
14. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1 to 12.
15. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-12.
Application CN202010646254.6A, filed 2020-07-07: Text error correction method and device (publication CN113919326A, pending)

Priority Applications (1)

Application Number: CN202010646254.6A (publication CN113919326A); Priority Date: 2020-07-07; Filing Date: 2020-07-07; Title: Text error correction method and device


Publications (1)

CN113919326A, published 2022-01-11

Family

ID=79231291

Family Applications (1)

CN202010646254.6A (CN113919326A, pending): Text error correction method and device; priority date 2020-07-07, filing date 2020-07-07

Country Status (1)

Country Link
CN (1) CN113919326A (en)

Citations (2)

* Cited by examiner, † Cited by third party
CN109885828A * (平安科技(深圳)有限公司), priority date 2019-01-14, published 2019-06-14: Word error correction method, device, computer equipment and medium based on language model
CN111090986A * (福建亿榕信息技术有限公司), priority date 2019-11-29, published 2020-05-01: Method for correcting errors of official document


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JUNJIE YU et al., "Chinese Spelling Error Detection and Correction Based on Language Model, Pronunciation, and Shape", Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, 31 December 2014, pages 220-223 *


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination