CN110502754B - Text processing method and device

Info

Publication number: CN110502754B
Application number: CN201910791618.7A
Authority: CN (China)
Prior art keywords: text, word, candidate, corrected, initial
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN110502754A (en)
Inventors: 方俊, 林炳怀, 黄江泉
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910791618.7A
Publication of CN110502754A
Application granted
Publication of CN110502754B

Abstract

The application relates to a text processing method and a text processing device, wherein the method comprises the following steps: acquiring an initial text to be corrected; performing spelling error correction on the initial text to obtain a corresponding corrected text; performing grammar error correction on the corrected text through a pre-trained forward codec and a pre-trained reverse codec, respectively, to obtain a first candidate text and a second candidate text; performing language evaluation based on the first candidate text and the second candidate text to obtain an evaluation score; and determining a target text corresponding to the initial text according to the evaluation score, the first candidate text and the second candidate text. The scheme provided by the application can improve text error correction accuracy.

Description

Text processing method and device
Technical Field
The present application relates to the field of language processing technologies, and in particular, to a text processing method and apparatus.
Background
With the development of computer technology, language correction technology has emerged. Language correction refers to correcting unreasonable combinations in sentences; for example, in the English phrase "a orange", the indefinite article before a singular noun beginning with a vowel sound should be "an" rather than "a".
Traditional language correction techniques are rule-based: obtaining good results requires building many complex rules, and such methods have by now been surpassed by deep learning models. Most current deep learning models adopt a forward encoding-decoding scheme. In most scenarios, forward decoding corrects errors well, because language is largely produced from left to right. In some scenarios, however, the information that determines a correction lies after the point being corrected; the English article "a/an", for instance, is usually determined by the form of the word that follows it. As a result, current language error correction methods often overlook such information and suffer from insufficient error correction accuracy.
Disclosure of Invention
Therefore, it is necessary to provide a text processing method, a text processing device, a computer-readable storage medium, and a computer device for solving the technical problem that the current language correction method has insufficient correction accuracy.
A text processing method, comprising:
acquiring an initial text to be corrected;
carrying out spelling error correction processing on the initial text to obtain a corresponding corrected text;
respectively carrying out syntax error correction processing on the corrected text through a pre-trained forward codec and a pre-trained reverse codec to obtain a first candidate text and a second candidate text;
performing language evaluation processing based on the first candidate text and the second candidate text to obtain an evaluation score;
and determining a target text corresponding to the initial text according to the evaluation score, the first candidate text and the second candidate text.
A text processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring an initial text to be corrected;
the spelling error correction module is used for carrying out spelling error correction processing on the initial text to obtain a corresponding corrected text;
the grammar error correction module is used for respectively carrying out grammar error correction processing on the corrected text through a pre-trained forward codec and a pre-trained reverse codec to obtain a first candidate text and a second candidate text;
the language evaluation module is used for carrying out language evaluation processing on the basis of the first candidate text and the second candidate text to obtain an evaluation score;
and the determining module is used for determining a target text corresponding to the initial text according to the evaluation score, the first candidate text and the second candidate text.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring an initial text to be corrected;
carrying out spelling error correction processing on the initial text to obtain a corresponding corrected text;
respectively carrying out syntax error correction processing on the corrected text through a pre-trained forward codec and a pre-trained reverse codec to obtain a first candidate text and a second candidate text;
performing language evaluation processing based on the first candidate text and the second candidate text to obtain an evaluation score;
and determining a target text corresponding to the initial text according to the evaluation score, the first candidate text and the second candidate text.
A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
acquiring an initial text to be corrected;
carrying out spelling error correction processing on the initial text to obtain a corresponding corrected text;
respectively carrying out syntax error correction processing on the corrected text through a pre-trained forward codec and a pre-trained reverse codec to obtain a first candidate text and a second candidate text;
performing language evaluation processing based on the first candidate text and the second candidate text to obtain an evaluation score;
and determining a target text corresponding to the initial text according to the evaluation score, the first candidate text and the second candidate text.
According to the text processing method, the text processing device, the computer-readable storage medium and the computer equipment, spelling error correction is performed on the initial text to be corrected to obtain the corresponding corrected text, which reduces the difficulty of the subsequent grammar error correction and improves error correction accuracy. Grammar error correction is then performed on the corrected text through a pre-trained forward codec and a pre-trained reverse codec respectively to obtain a first candidate text and a second candidate text, and an error-corrected target text for the initial text is generated based on the evaluation scores corresponding to the first candidate text and the second candidate text. Through parallel forward and reverse decoding, both forward and reverse decoding information is taken into account and the accumulation of grammar errors caused by sequential decoding is avoided, so that the accuracy of error correction on the initial text is greatly improved.
Drawings
FIG. 1 is a diagram of an application environment of a text processing method in one embodiment;
FIG. 2 is a flow diagram that illustrates a method for text processing in one embodiment;
FIG. 3 is a flowchart illustrating a text processing method according to another embodiment;
FIG. 4 is a flowchart illustrating a text processing method according to yet another embodiment;
FIG. 5 is a flowchart illustrating steps of performing spell correction processing on an initial text to obtain a corresponding corrected text in one embodiment;
FIG. 6 is an interface presentation diagram of text correction in one embodiment;
FIG. 7 is a flowchart showing a text processing method in still another embodiment;
FIG. 8 is a flowchart illustrating the steps of the spell correction process in one embodiment;
FIG. 9 is a flowchart illustrating a method for text processing in an exemplary embodiment;
FIG. 10 is a block diagram showing a configuration of a text processing apparatus according to an embodiment;
FIG. 11 is a block diagram showing a configuration of a text processing apparatus according to another embodiment;
FIG. 12 is a block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
FIG. 1 is a diagram of an application environment of a text processing method in one embodiment. Referring to fig. 1, the text processing method is applied to a text processing system. The text processing system includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers. Both the terminal 110 and the server 120 can be independently used to perform the text processing method provided in the embodiments of the present application. The terminal 110 and the server 120 may also be cooperatively used to execute the text processing method provided in the embodiment of the present application.
It should be noted that the embodiments of the present application involve several machine learning models. A machine learning model is a model that acquires a certain capability after learning from samples. In the embodiments of the application, one machine learning model is a spelling error correction model that acquires spelling error correction capability through sample learning. Another is the forward codec, which acquires grammar error correction capability through sample learning. Yet another is the reverse codec, which likewise acquires grammar error correction capability through sample learning. A further machine learning model is a language evaluation model that acquires language evaluation capability through sample learning, where language evaluation is the process of measuring the linguistic reasonableness of an input text.
The machine learning model may adopt a neural network model, such as a CNN (Convolutional Neural Network) model, an RNN (Recurrent Neural Network) model, or a Transformer model. Of course, other types of models may also be used, and the embodiments of the present application are not limited herein.
The codec is a deep learning Transformer model. The codec includes an encoder for extracting information from the input text and a decoder for decoding the effective information from the information extracted by the encoder while removing noise, so as to obtain the output text. The encoder takes the whole sentence as input, while the decoder decodes only one word at a time; the whole sentence is obtained through multiple decoding steps performed in sequence. When the decoder decodes forward in left-to-right order (i.e., the reading order of the text), the codec is a forward codec. When the decoder decodes in reverse in right-to-left order (i.e., the reverse order of the text), the codec is a reverse codec.
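The patent gives no code, so the following is a minimal Python sketch of one common way a reverse codec can be realized on top of an ordinary sequence-to-sequence framework: the target sentence is reversed for training and the decoded output is reversed back into reading order. The function names and the example sentence pair are illustrative assumptions, not the patent's implementation.

```python
# Minimal sketch (assumption, not the patent's code): a "reverse codec" can be
# trained like a forward codec by reversing the target token order, so that the
# decoder predicts the sentence right-to-left; its output is reversed back.

def to_reverse_target(tokens):
    """Reverse the decoding order so the model learns to emit the sentence right-to-left."""
    return list(reversed(tokens))

def from_reverse_output(tokens):
    """Undo the reversal after decoding to recover normal reading order."""
    return list(reversed(tokens))

# Illustrative training pair for the reverse codec (source with a grammar error,
# target with the error corrected):
source = ["they", "come", "here", "not", "for", "money", "and", "for", "life"]
target = ["they", "come", "here", "not", "for", "money", "but", "for", "life"]

reverse_target = to_reverse_target(target)
print(reverse_target)   # ['life', 'for', 'but', 'money', 'for', 'not', 'here', 'come', 'they']
print(from_reverse_output(reverse_target) == target)  # True
```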
As shown in FIG. 2, in one embodiment, a text processing method is provided. This embodiment is mainly described taking the case where the method is executed by a computer device, which may specifically be the terminal 110 or the server 120 in FIG. 1. Referring to FIG. 2, the text processing method specifically includes the following steps:
s202, acquiring an initial text to be corrected.
The initial text is text that has not yet been error-corrected, and it may contain noise, such as a grammar error in a sentence or a spelling error in a word. The initial text may specifically be an article, a paragraph, a sentence or a word, etc. The initial text may be in English, Chinese, French or another language, and the embodiments of the present application are not limited herein. It is understood that when the initial text is in a certain language, the models used to process it, such as the spell correction model, the forward codec, the reverse codec and the language evaluation model mentioned below, are also trained on that same language.
Specifically, a computer device may obtain initial text to be corrected that is input locally or communicated to the computer device over a network. When the computer equipment is a terminal, the terminal can display the text input box, a user can input the initial text to be corrected into the text input box through equipment such as an external keyboard, and the terminal obtains the initial text. When the computer device is a server, the server can receive the initial text fed back by the terminal through network connection.
And S204, carrying out spelling error correction processing on the initial text to obtain a corresponding corrected text.
The spell correction processing is a process of correcting a word or phrase in which a misspelling occurs in an initial text. Specifically, the computer device may perform spell correction processing on the initial text to obtain a corrected text that corrects the initial text. For example, the computer device may perform spell correction processing on the initial text through a machine learning model, or perform spell correction processing on the initial text through a table lookup correction, or perform spell correction processing on the initial text in combination with multiple manners, and the like, and the embodiments of the present application are not limited herein.
In one embodiment, the computer device may segment the whole sentence in the initial text into individual initial words. For example, the computer device may tokenize the whole sentence, separating each word in the sentence from punctuation marks by a preset character (such as a space or quotation mark) to form a plurality of initial words. Abbreviated parts of the sentence may also be split; for example, "I'm" in an English text may be split into "I" and "m". The computer device can then perform spelling error correction on the initial words obtained after splitting, correcting the misspelled words to obtain corrected words. The computer device can then splice the unmodified initial words and the corrected words to obtain the corrected text.
In one embodiment, the computer device may perform spell correction processing on the initial text by performing table lookup corrections. For example, the computer device may sequentially search for an initial word in an initial text, and when the initial word is found in a preset word list, determine that the initial word is a correct word; and when the initial word is not found in the preset word list, determining the word as a word to be corrected. Therefore, the word with the minimum editing distance with the word to be corrected is selected to replace the word to be corrected, and the effect of spelling error correction is achieved.
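As a concrete illustration of the table-lookup idea above, the following Python sketch (an assumption for illustration, not the patent's code) keeps words found in a preset word list and replaces out-of-list words with the in-list word at the minimum edit distance; the word list here is a toy example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a single rolling row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

WORD_LIST = {"they", "come", "here", "not", "for", "money", "but", "and", "life"}  # toy list

def lookup_correct(word: str) -> str:
    """Keep in-list words; otherwise return the closest in-list word by edit distance."""
    if word.lower() in WORD_LIST:
        return word
    return min(WORD_LIST, key=lambda w: edit_distance(word.lower(), w))

print(lookup_correct("ceme"))   # -> "come"
print(lookup_correct("money"))  # unchanged, already in the list
```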
In one embodiment, the computer device may perform spell correction processing on the initial text through a spell correction model. The computer device may train the spell correction model in advance with sample texts to be corrected and the corresponding corrected label texts, so that the spell correction model acquires spell correction capability. The initial text is then processed through the trained spell correction model to obtain the corrected text. The training steps and the specific execution steps of the spell correction model will be described in detail in the following embodiments.
S206, grammar error correction processing is respectively carried out on the corrected text through the pre-trained forward codec and the pre-trained reverse codec, and a first candidate text and a second candidate text are obtained.
Specifically, the computer device may input the corrected text obtained through spell correction into the forward codec, encode the corrected text through the forward codec, and perform a forward decoding operation to obtain a first candidate text. The computer device may likewise input the corrected text into the reverse codec, encode the corrected text through the reverse codec, and perform a reverse decoding operation to obtain a second candidate text.
In an embodiment, the step S206, that is, performing syntax error correction processing on the modified text through the pre-trained forward codec and the pre-trained backward codec to obtain the first candidate text and the second candidate text specifically includes: carrying out compression coding processing on the corrected text to obtain a coded text; respectively inputting the coded texts into a pre-trained forward codec and a pre-trained reverse codec; carrying out syntax error correction processing on an input coded text through a forward codec to obtain a corresponding first output text; carrying out syntax error correction processing on the input coded text through a reverse codec to obtain a corresponding second output text; and respectively carrying out decompression decoding processing on the first output text and the second output text to obtain a first candidate text and a second candidate text.
The compression encoding is an encoding method capable of compressing the data size, such as BPE (byte pair encoding). BPE coding, also known as digram coding, is mainly used for data compression. BPE coding is a layer-by-layer iterative process in which the most frequent pair of characters in a string is replaced by a character that does not appear in the string. For example, when the word in the initial text is "student", the characters "stu" can be replaced with the character "A" and the characters "dent" with the character "B", so the word "student" can be encoded as "AB". It can be understood that the computer device can perform compression coding in units of words, phrases, sentences and the like to obtain a coded text corresponding to the whole corrected text. The compression coding mentioned in the embodiments of the present application may also be implemented by other coding methods as long as the compression function can be realized, and the embodiments of the present application are not limited herein.
The decompression decoding is a decoding method corresponding to compression encoding. That is, when the compression encoding method is BPE encoding, the corresponding decompression decoding may be BPE decoding. Accordingly, the character string "AB" obtained by the above-mentioned compression encoding method can be correspondingly decoded to obtain "student".
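A tiny Python sketch of the compression idea in the "student" example above follows; it uses a fixed toy substitution table rather than real BPE, which learns its merge rules from corpus statistics, so everything here is an illustrative assumption.

```python
# Toy illustration of compression encoding/decoding, mirroring the example above.
# Real BPE derives the substitution table from character-pair frequencies.
MERGES = {"stu": "A", "dent": "B"}   # assumed toy table

def compress(text: str) -> str:
    for piece, symbol in MERGES.items():
        text = text.replace(piece, symbol)
    return text

def decompress(code: str) -> str:
    for piece, symbol in MERGES.items():
        code = code.replace(symbol, piece)
    return code

print(compress("student"))   # "AB"
print(decompress("AB"))      # "student"
```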
Specifically, the computer device may segment the whole sentence in the corrected text into single words, and perform compression coding on each word in the corrected text, thereby obtaining a coded text corresponding to the corrected text. Further, the computer equipment can respectively input the coded text into a pre-trained forward codec and a pre-trained reverse codec, and syntax error correction processing is carried out on the input coded text through the forward codec to obtain a corresponding first output text; and carrying out syntax error correction processing on the input coded text through a reverse codec to obtain a corresponding second output text. And performing corresponding decompression decoding processing on the first output text and the second output text respectively to obtain a first candidate text and a second candidate text.
In one embodiment, the computer device inputs the encoded text into the forward codec, performs semantic encoding processing on the modified text through an encoder in the forward codec, and extracts corresponding text information. The decoder performs decoding processing on the decoded word output last time and the text extracted by the current encoding to obtain the current decoded word. The current decoded word is available for the next decoding process. And decoding in sequence to obtain a first output text corresponding to the corrected text. The first output text is the text with the noise removed, that is, the text after the grammar error correction processing. The specific encoding and decoding modes of the encoder and the decoder may also adopt other modes, such as a processing mode that combines an attention mechanism, and the like, and the embodiment of the present application is not limited herein. It can be understood that the processing mode of the reverse codec is similar to that of the forward codec, except that the decoder in the reverse codec decodes to obtain a word at the rear end of the sentence first when decoding, and decodes to obtain a word at the front end of the sentence according to the word at the rear end of the sentence, so as to obtain the second output text.
In one embodiment, a computer device may obtain a first noisy text sample having a grammatical error and a corresponding first reference text with the grammatical error corrected. The computer device may input the first noisy text sample and the corresponding first reference text as a corpus pair into the forward codec, outputting a predicted text. The forward codec is trained by adjusting model parameters by the difference between the predicted text and the first reference text. And stopping training when the training stopping condition is met, and obtaining the trained forward codec. The training stopping condition may specifically be that the difference is smaller than a preset difference, or that the number of iterations is reached, or the like. It can be understood that the training mode of the reverse codec is the same as the training mode of the forward codec, and the embodiments of the present application are not described herein again.
In the above embodiment, the computer device may perform compression encoding on the modified text and then input the modified text to the forward codec and the reverse codec, respectively. And then carrying out decompression decoding processing on the first output text and the second output text output by the forward coder-decoder and the backward coder-decoder to obtain a first candidate text and a second candidate text. Thus, the processing pressure of the forward codec and the backward codec can be greatly reduced by performing the compression coding processing before the syntax error correction processing is performed, thereby improving the text processing efficiency.
And S208, performing language evaluation processing based on the first candidate text and the second candidate text to obtain an evaluation score.
Among them, the language evaluation processing is processing for measuring the language reasonableness of the input text. Specifically, the computer device may perform language evaluation processing based on the first candidate text and the second candidate text through the language evaluation model to obtain an evaluation score of the input text input into the language evaluation model. Wherein the evaluation score measures a score of the language reasonableness of the input text. For example, the higher the evaluation score is, the higher the degree of reasonableness of the input text is; the lower the evaluation score is set, the lower the degree of reasonableness of the text to be input is represented. Of course, the reverse arrangement is also possible, and the embodiments of the present application are not limited herein.
In one embodiment, the computer device may perform language evaluation processing on the first candidate text and the second candidate text respectively through a language evaluation model to obtain evaluation scores corresponding to the first candidate text and the second candidate text respectively. Therefore, the computer equipment can screen out the target text with better reasonable degree according to the respective evaluation scores of the first candidate text and the second candidate text.
During processing, the language evaluation model may specifically calculate probability values of respective words appearing in the text scene in the input candidate text, and determine the probability values corresponding to the input candidate text based on the probability values corresponding to the respective words. The probability value can measure the reasonable degree of the input candidate text. The larger the probability value, the more reasonable the representation. An evaluation score corresponding to the input candidate text may thus be determined based on the probability value, such as multiplying the probability value by a value of one hundred to obtain the evaluation score.
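A minimal Python sketch of the scoring rule just described follows; the per-word probabilities are assumed inputs from a language model, and the scaling by one hundred follows the example above.

```python
import math

def evaluation_score(word_probs):
    """Sentence probability as the product of per-word probabilities, scaled to a score."""
    return math.prod(word_probs) * 100.0

# Probabilities an (assumed) language model assigns to each word in its context:
print(evaluation_score([0.9, 0.8, 0.95]))   # ~68.4
```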
And S210, determining a target text corresponding to the initial text according to the evaluation score, the first candidate text and the second candidate text.
Specifically, the computer device may screen out the target text from the first candidate text and the second candidate text according to the evaluation score. Or the computer equipment can recombine each word in the first candidate text and the second candidate text according to the evaluation score to obtain the target text corresponding to the initial text.
In one embodiment, the text processing method comprises the steps of:
s302, acquiring an initial text to be corrected.
S304, spelling error correction processing is carried out on the initial text to obtain a corresponding corrected text.
S306, grammar error correction processing is respectively carried out on the corrected text through the pre-trained forward codec and the pre-trained reverse codec, and a first candidate text and a second candidate text are obtained.
And S308, respectively carrying out language evaluation processing on the first candidate text and the second candidate text through a language evaluation model to obtain evaluation scores corresponding to the first candidate text and the second candidate text.
S310, screening out candidate texts of which the corresponding evaluation scores meet the first target condition from the first candidate texts and the second candidate texts as target texts corresponding to the initial texts.
Specifically, the computer device may screen out, from the first candidate text and the second candidate text, the candidate text whose evaluation score satisfies the first target condition as the target text corresponding to the initial text. Satisfying the first target condition may specifically mean having the higher evaluation score, or the higher evaluation score after being processed by a corresponding weight.
In one embodiment, the evaluation score comprises a first evaluation score corresponding to the first candidate text and a second evaluation score corresponding to the second candidate text; screening out, from the first candidate text and the second candidate text, the candidate text whose evaluation score satisfies the first target condition as the target text corresponding to the initial text comprises the following steps: multiplying the second evaluation score by a preset weight to obtain a third evaluation score, the preset weight being smaller than one; when the first evaluation score is greater than or equal to the third evaluation score, taking the first candidate text as the target text corresponding to the initial text; and when the first evaluation score is smaller than the third evaluation score, taking the second candidate text as the target text corresponding to the initial text.
Specifically, the computer device may multiply a second evaluation score corresponding to the second candidate text by a preset weight to obtain a third evaluation score; the predetermined weight is smaller than a value of one, such as 0.8. When the first evaluation score is greater than or equal to the third evaluation score, the computer device may treat the first candidate text as a target text corresponding to the initial text. When the first evaluation score is less than the third evaluation score, the computer device may treat the second candidate text as a target text corresponding to the initial text.
In one embodiment, forward decoding may be considered more reliable based on the language being expressed forward from left to right, but backward decoding may cover a portion of the problem that forward decoding is difficult to cover. Since more noise is introduced by the backward decoding, the second evaluation score corresponding to the second candidate text obtained by the backward decoding may be weighted down, so that the backward decoding result, that is, the second candidate text, is selected as the target text only when the backward decoding result is significantly better than the forward decoding result.
In the above embodiment, the weight is reduced for the second evaluation score corresponding to the second candidate text obtained by the reverse decoding, so that the reverse decoding result is selected as the target text only when the reverse decoding result is significantly better than the forward decoding result, and the reliability and accuracy of the target text are greatly improved on the basis of comprehensively considering the forward decoding result and the reverse decoding result.
In one embodiment, the computer device may multiply the first evaluation score corresponding to the first candidate text and the second evaluation score corresponding to the second candidate text by different weights, respectively. The weights may be determined according to the respective confidence levels of the first candidate text and the second candidate text. For example, the result obtained by forward decoding is generally more reliable than the result obtained by reverse decoding, so the first evaluation score may be given a higher weight than the second evaluation score. The weighted scores can then be compared, and the candidate text with the higher weighted score is taken as the target text.
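A minimal Python sketch of the selection rule described in the preceding paragraphs follows; the 0.8 weight is the example value given above and, like the function name, is an assumption for illustration.

```python
def select_target(first_text, first_score, second_text, second_score, weight=0.8):
    """Down-weight the reverse-decoding score, then keep the better candidate."""
    third_score = second_score * weight
    return first_text if first_score >= third_score else second_text

# The reverse result wins only when it is clearly better than the forward result:
print(select_target("forward candidate", 70.0, "reverse candidate", 80.0))  # forward (80*0.8 = 64)
print(select_target("forward candidate", 70.0, "reverse candidate", 95.0))  # reverse (95*0.8 = 76)
```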
In one embodiment, before performing spell correction, the computer device may segment the whole sentence in the initial text into individual initial words, for example by tokenizing the whole sentence and separating each word in the sentence from punctuation marks by a preset character (such as a space or quotation mark) to form a plurality of initial words, and then perform the spelling error correction and grammar error correction. The first candidate text and the second candidate text processed by the forward codec and the reverse codec are still in tokenized form. The target text determined by the computer device based on the first candidate text and the second candidate text is the normal output obtained after the sentence is de-tokenized.
According to the text processing method, spelling error correction is performed on the initial text to be corrected to obtain the corresponding corrected text, which reduces the difficulty of the subsequent grammar error correction and improves error correction accuracy. Grammar error correction is then performed on the corrected text through a pre-trained forward codec and a pre-trained reverse codec respectively to obtain a first candidate text and a second candidate text, and an error-corrected target text for the initial text is generated based on the evaluation scores corresponding to the first candidate text and the second candidate text. Through parallel forward and reverse decoding, both forward and reverse decoding information is taken into account and the accumulation of grammar errors caused by sequential decoding is avoided, so that the accuracy of error correction on the initial text is greatly improved.
In one embodiment, referring to fig. 4, the text processing method includes the steps of:
s402, acquiring an initial text to be corrected.
S404, spelling error correction processing is carried out on the initial text to obtain a corresponding corrected text.
S406, syntax error correction processing is respectively carried out on the corrected text through the pre-trained forward codec and the pre-trained reverse codec, so that a first candidate text and a second candidate text are obtained.
S408, comparing the initial text with the first candidate text, and determining a first difference set formed by the differences between the initial text and the first candidate text.
In particular, the computer device may compare the initial text with the first candidate text, determine the differences between the initial text and the first candidate text, and form the existing differences into a first difference set. For example, if the initial text is "They ceme here not for money and for life" and the first candidate text is "They come here not for money but for life", the differences between the initial text and the first candidate text include the word pair consisting of "ceme" and "come" and the word pair consisting of "and" and "but".
S410, comparing the initial text with the second candidate text, and determining a second difference set formed by the differences between the initial text and the second candidate text.
In particular, the computer device may compare the initial text with the second candidate text, determine the differences between the initial text and the second candidate text, and form the existing differences into a second difference set. For example, if the initial text is "They ceme here not for money and for life" and the second candidate text is "They come here not for money and life", the differences between the initial text and the second candidate text include the word pair consisting of "ceme" and "come" and the word pair consisting of "for life" and "life".
S412, determining more than one group of combined texts corresponding to the initial texts by combining the differences in the first difference set and the second difference set.
Specifically, the computer device may combine the differences in the first difference set and the second difference set to determine all of the combined texts corresponding to the initial text. For example, the computer device may obtain a plurality of combined texts by making various combinations of the word pair consisting of "ceme" and "come", the word pair consisting of "and" and "but", and the word pair consisting of "for life" and "life", such as combined text 1: "They come here not for money and for life", combined text 2: "They come here not for money but for life", combined text 3: "They come here not for money and life", and the like.
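The enumeration of combined texts can be pictured with the following Python sketch, an illustrative assumption: each difference is treated as an optional string replacement and every on/off combination is generated, whereas a real implementation would track word positions rather than use plain substring replacement.

```python
from itertools import product

initial = "They ceme here not for money and for life"
differences = [("ceme", "come"), ("and", "but"), ("for life", "life")]  # from both difference sets

combined_texts = set()
for switches in product([False, True], repeat=len(differences)):
    text = initial
    for apply_edit, (old, new) in zip(switches, differences):
        if apply_edit:
            text = text.replace(old, new)
    combined_texts.add(text)

for t in sorted(combined_texts):
    print(t)   # every combination, to be scored by the language evaluation model
```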
And S414, respectively carrying out language evaluation processing on each combined text through the language evaluation model to obtain the evaluation score corresponding to each combined text.
Specifically, the computer device may input each combined text into the language evaluation model for language evaluation processing, so as to obtain an evaluation score corresponding to each combined text.
And S416, taking the combined text with the evaluation score meeting the second target condition as the target text corresponding to the initial text.
Specifically, the computer device may take, as the target text corresponding to the initial text, the combined text whose evaluation score satisfies the second target condition. The second target condition may specifically be ranking first, or within a preset top N, when the evaluation scores are sorted from large to small.
In the embodiment, the most reasonable combined text is selected from the combined texts combined with the differences through the language evaluation model to serve as the target text after the initial text is modified, the forward encoding and decoding result and the reverse encoding and decoding result can be integrated, the target text with higher accuracy after error correction is selected, and the error correction accuracy is greatly improved.
In one embodiment, the step S204, that is, performing spell correction processing on the initial text to obtain a corresponding corrected text specifically includes the following steps:
s502, the whole sentence in the initial text is cut into single initial words, and at least one group of multi-element groups formed by the initial words are determined.
Specifically, the computer device may segment the whole sentence in the initial text into single initial words; for example, the computer device may tokenize the whole sentence and separate each word in the sentence from punctuation marks by a preset character (such as a space or quotation mark) to form a plurality of initial words. Abbreviated parts of the sentence may also be split; for example, "I'm" in an English text may be split into "I" and "m".
Furthermore, the computer device can determine the tuples corresponding to the whole sentence according to the initial words in the whole sentence, where a tuple is a combination of several initial words determined based on the position of each initial word in the whole sentence. For example, the computer device can determine the list of n-grams corresponding to the whole sentence based on the n-gram principle; when n is 3, the computer device determines triplets consisting of the initial words. For the whole sentence "The cat is on the table" in the initial text, the computer device may tokenize it to obtain the list: [the, cat, is, on, the, table], and determine the corresponding n-gram list on this basis as: [the, cat, is], [cat, is, on], [is, on, the], [on, the, table].
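The trigram list in the example above can be produced with a small helper such as the following Python sketch (n = 3 is the assumed setting from the example):

```python
def ngrams(tokens, n=3):
    """Sliding window of n consecutive tokens over the tokenized sentence."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

print(ngrams(["the", "cat", "is", "on", "the", "table"]))
# [['the', 'cat', 'is'], ['cat', 'is', 'on'], ['is', 'on', 'the'], ['on', 'the', 'table']]
```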
In one embodiment, the computer device may obtain a pre-trained spell correction model, input the initial text into the trained spell correction model, and perform a spell correction process on the initial text through the spell correction model to obtain a corresponding corrected text. That is, the computer device may segment the entire sentence in the initial text into individual initial words through the spell correction model and determine at least one set of tuples consisting of the initial words.
In one embodiment, the computer device may pre-acquire a corpus comprising a second noisy text sample with spelling errors and a corresponding second reference text with the spelling errors corrected. The computer device may input the second noisy text sample into the spell correction model to be trained, perform spell correction processing on the input second noisy text sample through the spell correction model, and output a predicted spell-corrected text. The model parameters are adjusted by comparing the differences between the predicted spell-corrected text and the second reference text, so as to train the spell correction model. Training stops when the training stop condition is met, and the trained spell correction model is obtained. The training stop condition may specifically be that the difference is smaller than a preset difference, or that a number of iterations is reached, or the like.
S504, based on the tuple, calculating a first context probability value of each initial word appearing in the whole sentence.
Specifically, for each complete sentence in the initial text, the computer device may process the initial text to obtain at least one group of tuples corresponding to each complete sentence. The computer device may then count, based on the at least one group of tuples, the conditional probability value of each element appearing within its tuple, and determine the first context probability value of each initial word appearing in the whole sentence according to the product of the probability value of the tuple itself appearing and the conditional probability value of the element appearing within the tuple.
In one embodiment, the computer device may segment the entire sentence in the initial text into individual initial words via a spell correction model and determine at least one set of tuples consisting of the initial words. And calculating a first context probability value of each initial word appearing in the whole sentence based on the tuple through the spell correction model, wherein the first context probability value is a middle calculation result of the spell correction model.
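A minimal Python sketch of estimating such a context probability from n-gram counts follows; the counts below are toy values assumed for illustration, whereas a trained spell correction model would gather them from its corpus.

```python
from collections import Counter

# Assumed toy counts gathered from a corpus:
trigram_counts = Counter({("is", "on", "the"): 40, ("is", "on", "a"): 10})
bigram_counts = Counter({("is", "on"): 50})

def context_prob(prev2, prev1, word):
    """P(word | prev2 prev1) estimated from the counts; 0.0 for an unseen context."""
    denom = bigram_counts[(prev2, prev1)]
    return trigram_counts[(prev2, prev1, word)] / denom if denom else 0.0

print(context_prob("is", "on", "the"))   # 0.8
print(context_prob("is", "on", "cat"))   # 0.0
```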
S506, when the first context probability value is smaller than the preset probability value, determining the initial word as a word to be corrected and determining a candidate word corresponding to the word to be corrected.
Specifically, the computer device may compare the first context probability value corresponding to the initial word with a preset probability value, and when the first context probability value is smaller than the preset probability value, may determine that the initial word is a word to be corrected. That is, when the first context probability value is smaller than the preset probability value, the probability that the initial word appears in the current text can be considered very low, so the word is highly likely to be wrong, i.e., it is a word to be corrected. When the first context probability value is greater than the preset probability value, the probability that the initial word appears in the current text is considered very high, and the initial word is highly likely to be correct and needs no correction.
Further, the computer device may determine candidate words respectively corresponding to the words to be corrected from a preset word list. In one embodiment, the computer device may determine the edit distance between each word in the preset word list and the word to be corrected, sort the words from small to large by edit distance, and select the words ranked within a second preset number of places as candidate words for the word to be corrected. It is to be understood that the number of candidate words corresponding to one word to be corrected may be one or more than one, and the embodiments of the present application are not limited herein.
And S508, calculating a second context probability value of the candidate word appearing in the whole sentence based on the tuple.
Specifically, the computer device may replace the candidate word with the corresponding word to be corrected to form a new tuple. And based on the context scene provided by the multi-element group, counting the conditional probability value of the candidate word appearing in the multi-element group in the new multi-element group, so as to determine the second context probability value of the candidate word appearing in the whole sentence according to the product of the probability value of the multi-element group appearing and the conditional probability value of the candidate word appearing in the multi-element group.
In one embodiment, the computer device may calculate, by the spell correction model, a second context probability value for the candidate word appearing in the whole sentence based on the tuple and the candidate word corresponding to the word to be corrected in the tuple, the second context probability value being an intermediate calculation result of the spell correction model.
S510, according to the second context probability value, selecting a correction word meeting the correction condition from the candidate words, and outputting a correction text corresponding to the initial text based on the correction word.
Specifically, for each word to be corrected, the computer device may screen out a corrected word meeting the correction condition from the candidate words according to the second context probability value corresponding to each candidate word, and replace the corresponding word to be corrected with the corrected word to obtain a corrected text corresponding to the initial text. The correction condition may specifically be having the maximum second context probability value among the second context probability values corresponding to the candidate words.
In one embodiment, the step S510 of selecting a corrected word satisfying the correction condition from the candidate words according to the second context probability value, and outputting a corrected text corresponding to the initial text based on the corrected word includes: determining the editing distance between the word to be corrected and each corresponding candidate word; determining an editing probability value corresponding to each candidate word according to the editing distance; determining a correction probability value according to a second context probability value and an editing probability value corresponding to each candidate word; for each word to be corrected, screening out a corrected word meeting the correction condition from the candidate words according to the correction probability value of the corresponding candidate word; and outputting a corrected text corresponding to the initial text based on the corrected words.
Specifically, the computer device may determine the edit distance between the word to be corrected and each of the corresponding candidate words, and determine the editing probability value corresponding to each candidate word according to the edit distance. There are various ways of determining the editing probability value from the edit distance; for example, the editing probability value may be obtained by dividing a preset value by the edit distance, or by any other decreasing function, as long as a larger edit distance yields a smaller editing probability value, and the embodiments of the present application are not limited herein.
Further, the computer device may determine a modification probability value corresponding to the candidate word according to the second context probability value and the edit probability value corresponding to the candidate word. The computer device determines a correction probability value corresponding to the candidate word according to the second context probability value and the editing probability value, for example, the second context probability value and the editing probability value may be multiplied to obtain a corresponding correction probability value; or, the computer device may further perform weighted summation on the second context probability value and the editing probability value to obtain a corresponding correction probability value, and the like, which is not limited herein in the embodiment of the present application.
Further, for each word to be corrected, the computer device may screen out, from the candidate words, a candidate word that satisfies the correction condition as a corrected word corresponding to the word to be corrected according to the correction probability value of the corresponding candidate word. The candidate word that satisfies the correction condition may be a candidate word with the highest correction probability value among the candidate words. And then, the computer equipment can replace the corresponding word to be corrected by the correction word to obtain the corrected text corresponding to the initial text.
In the above embodiment, when the spelling correction is performed on the initial text, the editing distance between the word to be corrected and the candidate word is considered, and the semantic information between contexts is also considered, so that the accuracy of the spelling correction is greatly improved.
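A minimal Python sketch of combining the two signals above follows: a context probability and an editing probability that shrinks with edit distance (1 / (1 + distance) is an assumed choice), multiplied into a correction probability from which the best candidate is picked. The candidate values are assumptions for illustration.

```python
def correction_score(context_prob, edit_dist):
    """Correction probability as context probability times an editing probability."""
    edit_prob = 1.0 / (1 + edit_dist)   # assumed decreasing function of edit distance
    return context_prob * edit_prob

# Assumed (context probability, edit distance) for two candidates of a misspelled word:
candidates = {"come": (0.30, 1), "came": (0.05, 2)}
best = max(candidates, key=lambda w: correction_score(*candidates[w]))
print(best)   # "come"
```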
In one embodiment, the computer device may train the spell correction model with a corpus consisting of a second noisy text sample with spelling errors and a corresponding second reference text with the spelling errors corrected. The computer device may input the second noisy text sample into the spell correction model to be trained, segment each complete sentence in the input second noisy text sample into individual words through the spell correction model, and determine the list of n-grams composed of those words. The computer device may obtain a similar list for each sentence in the corpus and count the conditional probability of each element appearing in its n-gram. When the conditional probability is smaller than the preset probability value, the word is determined to be a word to be corrected, and the candidate words corresponding to the word to be corrected are determined. The spell correction model may determine a corresponding predicted correction probability based on the conditional probability and the editing probability corresponding to each candidate word, and output the predicted corrected text with the candidate word having the highest predicted correction probability taken as the corrected word. The computer device may adjust the model parameters based on the difference between the predicted corrected text and the second reference text and iterate continually to train the spell correction model. Training stops when the training stop condition is met, and the trained spell correction model is obtained. The training stop condition may specifically be that the difference is smaller than a preset difference, or that a number of iterations is reached, or the like.
In the above embodiment, in the process of performing spelling error correction on the initial text, semantic information is considered, that is, determining which words to be corrected are according to the first context probability value of each initial word appearing in the whole sentence, and then screening out accurate corrected words from the candidate words according to the second context probability value of the candidate words corresponding to the words to be corrected appearing in the whole sentence, so as to perform spelling error correction on the initial text, thereby greatly improving the accuracy of spelling error correction.
In one embodiment, the step of outputting the corrected text corresponding to the initial text based on the corrected word specifically includes: determining a candidate word pair consisting of a word to be corrected and a corresponding correction word; when the word to be corrected in the candidate word pair is not found from the preset word list, determining the candidate word pair as a target word pair; and outputting a corrected text corresponding to the initial text according to the target word pair determined from the candidate word pairs.
It will be appreciated that the computer device corrects the spelling of the initial text through the spell correction model, but there are sometimes some undesirable modifications. For example, some rare collocations may be treated as misspellings and modified in some scenarios, which may change the semantics; such spelling modifications are undesirable. For this reason, the embodiments of the present application provide a method that combines the spelling error correction model with a word-list review, which can greatly improve the accuracy of spelling error correction and avoid unexpected modifications.
In one embodiment, after the computer device determines the corrected word corresponding to a word to be corrected, the computer device may form a candidate word pair consisting of the word to be corrected and the corresponding corrected word. For each candidate word pair, the computer device can look up the word to be corrected in the preset word list; when the word to be corrected is found in the preset word list, the word is considered correctly spelled and should not be modified, so the candidate word pair can be deleted. When the word to be corrected in a candidate word pair is not found in the preset word list, the word is determined to be misspelled and should be modified, so the candidate word pair is determined to be a target word pair and is retained. The preset word list may be a list of commonly used, correctly spelled words. Further, the computer device may apply the retained target word pairs to the initial text to obtain the corrected text corresponding to the initial text.
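A minimal Python sketch of this word-list review follows (the word list and the candidate pairs are assumed examples): only pairs whose original word is absent from the preset list are kept as target word pairs.

```python
WORD_LIST = {"they", "come", "here", "not", "for", "money", "but", "and", "life"}  # assumed

def review(candidate_pairs):
    """Keep a (word_to_correct, corrected_word) pair only if the original word is out of list."""
    return [(old, new) for old, new in candidate_pairs if old.lower() not in WORD_LIST]

print(review([("ceme", "come"), ("life", "live")]))
# [('ceme', 'come')]  -- "life" is correctly spelled, so that modification is discarded
```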
In the embodiment, the revising words determined in the last step are reviewed through the preset word list, so that unnecessary revising from the correctly spelled word to another correctly spelled word can be eliminated, the revising accuracy is improved, and the unexpected revising occurrence rate is reduced.
In one embodiment, the step S202, namely obtaining the initial text to be corrected, includes: displaying a text input box; and acquiring initial text to be corrected input into the text input box. The text processing method also comprises the steps of displaying the text error correction box at the peripheral position of the text input box in parallel; and outputting the target text through the text error correction box.
In one embodiment, the computer device may present the text entry box through a browser or application. The user can input the initial text to be corrected into the text input box through the external input device. When the computer device detects a triggering operation for instructing text error correction, the text processing method mentioned in the embodiment of the present application may be triggered to be executed to obtain a target text for error correction processing on the initial text. The computer equipment can display the text error correction box in parallel at the peripheral position of the text input box, and output and display the target text through the text error correction box. The peripheral position of the text input box may be specifically the left side, the right side, the upper side or the lower side of the text input box.
The triggering operation is a preset operation for triggering execution of the text processing method, and the triggering operation may specifically be a touch operation, a cursor operation, a key operation, or a voice operation. The touch operation can be touch click operation, touch press operation or touch slide operation, and the touch operation can be single-point touch operation or multi-point touch operation; the cursor operation can be an operation of controlling a cursor to click or an operation of controlling the cursor to press; the key operation may be a virtual key operation or a physical key operation, etc.
In a specific application scenario, referring to FIG. 6, which is an example of an interface presentation diagram of text correction in one embodiment, the computer device may present a text input box 601 in the upper part of the presentation interface and a text correction box 602 in the lower part. The user can input the initial text to be detected in the text input box, such as "They come here not for money and for life". The user may click the "execute" control 603 to trigger execution of the text processing method, so that error correction is performed on the input initial text to obtain a target text, which may be presented in the text correction box.
In practical applications, when the initial text is an English text, the text processing method can be widely applied to English teaching, such as English composition correction, English writing practice, daily English text communication, daily English mail, and the like. By executing the text processing method in the embodiments of the present application, error correction accuracy can be greatly improved. It should be understood that the English-language scenarios above are only exemplary; the text processing method mentioned in the embodiments of the present application may also be applied to error correction in other languages, which is not limited in the embodiments of the present application.
In a particular embodiment, referring to FIG. 7, FIG. 7 shows a flow diagram of the text processing method in one embodiment. As shown in FIG. 7, the computer device may obtain in advance a trained forward codec L-R, a trained reverse codec R-L, and a language evaluation model LM (language model). The input and output of each codec are the complete sentence before and after decoding; the input of the language evaluation model LM is a sentence, and its output is a score indicating how well-formed the sentence is.
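Viewed as software components, the two codecs and the language model reduce to two simple interfaces. The sketch below is illustrative only; the names Codec and LanguageModel are not from the patent:

```python
from typing import Protocol

class Codec(Protocol):
    """A grammar-correction codec: maps a complete (encoded) sentence to its corrected form."""
    def __call__(self, sentence: str) -> str: ...

class LanguageModel(Protocol):
    """Scores how well-formed a sentence is; a higher score means a more reasonable sentence."""
    def __call__(self, sentence: str) -> float: ...
```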
As shown in the flowchart of fig. 7, the text processing method includes the following steps:
(1) Process an input sentence in the initial text to be corrected into the format required by the codecs. The steps here specifically include: a) tokenize the original sentence, so that each word and each punctuation mark in the sentence is separated by a space; b) perform spelling correction processing to obtain a corrected text; c) apply compression coding (such as BPE, byte pair encoding) to the tokenized sentence to obtain a compressed text (a sentence re-represented by BPE sub-words).
(2) Input the BPE-encoded sentence to the forward codec L-R and the reverse codec R-L respectively, and decode to obtain candidate texts. The steps here specifically include: a) input the BPE-encoded sentence to the forward codec L-R and the reverse codec R-L respectively; b) each model decodes to produce a BPE-encoded sentence; c) perform decompression decoding (BPE decoding) to obtain the first candidate text and the second candidate text.
(3) Input the first candidate text into the language evaluation model to obtain the corresponding first evaluation score, and input the second candidate text into the language evaluation model to obtain the corresponding second evaluation score.
(4) Input the first evaluation score to a selection module, and input the second evaluation score multiplied by a weight less than 1 to the selection module. The selection module compares the correction results of the forward codec L-R and the reverse codec R-L, determines which weighted score is higher, and outputs the candidate text produced by the higher-scoring model as the final output.
(5) Detokenize the finally output sentence into a normal sentence, which is output as the grammar-corrected target text. A minimal end-to-end sketch of this pipeline is given after this list.
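The sketch below walks through the Fig. 7 flow under stated assumptions: tokenize/detokenize are toy regex helpers, the spell_correct, bpe_encode, bpe_decode, forward_codec, reverse_codec, and lm_score callables are assumed to be supplied by the models described above, and reverse_weight = 0.9 is merely one example of a weight less than 1 (the patent does not fix its value):

```python
import re

def tokenize(sentence: str) -> str:
    """Step 1a: separate every word and punctuation mark with a space."""
    return " ".join(re.findall(r"\w+|[^\w\s]", sentence))

def detokenize(sentence: str) -> str:
    """Step 5: re-attach punctuation to the preceding word."""
    return re.sub(r"\s+([^\w\s])", r"\1", sentence)

def correct_text(sentence, spell_correct, bpe_encode, bpe_decode,
                 forward_codec, reverse_codec, lm_score, reverse_weight=0.9):
    corrected = spell_correct(tokenize(sentence))    # steps 1a-1b: tokenize, spell-correct
    encoded = bpe_encode(corrected)                  # step 1c: compression coding (BPE)
    cand_fwd = bpe_decode(forward_codec(encoded))    # step 2: forward L-R decoding
    cand_rev = bpe_decode(reverse_codec(encoded))    # step 2: reverse R-L decoding
    score_fwd = lm_score(cand_fwd)                   # step 3: language evaluation
    score_rev = lm_score(cand_rev) * reverse_weight  # step 4: down-weight the reverse result
    best = cand_fwd if score_fwd >= score_rev else cand_rev
    return detokenize(best)                          # step 5: detokenize the final output
```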
In one embodiment, the specific steps of the spelling correction processing in the above embodiment may refer to fig. 8. As shown in fig. 8, the spelling correction processing specifically includes the following steps (a minimal code sketch follows the list):
(1) Obtain the n-gram list of the words through a context-based n-gram statistical language model.
(2) Using the context information provided by the n-grams together with the spelling correction model, calculate a first context probability value that each initial word appears in its context.
(3) Take each initial word whose first context probability value is smaller than the preset probability value as a word to be corrected, and determine a second context probability value that each candidate word corresponding to the word to be corrected appears in that context.
(4) Calculate the editing distance between each word to be corrected and its candidate words, and determine the corresponding editing probability value according to the editing distance.
(5) Multiply the second context probability value by the editing probability value to obtain a correction probability value for each candidate word, take the candidate word with the largest correction probability value as the correction word of the word to be corrected, and form a candidate word pair consisting of the word to be corrected and the correction word.
(6) Recheck the word to be corrected in each candidate word pair obtained in the previous step: if the word to be corrected is in the word list, delete the candidate word pair; otherwise, retain the candidate word pair as a target word pair.
(7) Apply the retained target word pairs to the initial text to obtain the corrected text.
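A minimal sketch of steps (2)–(6), assuming a context_prob callable backed by the n-gram language model; the threshold value and the exponential mapping from editing distance to editing probability are assumptions, since the patent does not specify them:

```python
import math

def edit_distance(a: str, b: str) -> int:
    """Step 4: Levenshtein distance between two words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def correction_probability(context_prob_value: float, distance: int, alpha: float = 1.0) -> float:
    """Step 5: the editing probability decays with editing distance (assumed exponential form),
    then combines multiplicatively with the candidate's second context probability."""
    return context_prob_value * math.exp(-alpha * distance)

def correct_word(word, candidates, context_prob, vocabulary, threshold=1e-4):
    """Steps 2, 3, 5, 6: flag a low-probability word, rank its candidates,
    and keep a correction only if the original word is not in the word list."""
    if context_prob(word) >= threshold or word in vocabulary:
        return word                                   # confident enough, or a known word
    scored = [(correction_probability(context_prob(c), edit_distance(word, c)), c)
              for c in candidates]
    return max(scored)[1] if scored else word
```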
As shown in fig. 9, in a specific embodiment, the text processing method is described by taking the terminal in fig. 1 as an example, and includes the following steps:
S902, display the text input box.
S904, acquire the initial text to be corrected that is input into the text input box.
S906, segment the whole sentence in the initial text into individual initial words, and determine at least one group of multi-element groups consisting of the initial words.
S908, based on the tuples, calculate a first context probability value that each initial word appears in the whole sentence.
S910, when the first context probability value is smaller than the preset probability value, determine the initial word as a word to be corrected and determine candidate words corresponding to the word to be corrected.
S912, based on the tuples, calculate a second context probability value that each candidate word appears in the whole sentence.
S914, determine the editing distance between the word to be corrected and each corresponding candidate word.
S916, determine the editing probability value corresponding to each candidate word according to the editing distance.
S918, determine a correction probability value according to the second context probability value and the editing probability value corresponding to each candidate word.
S920, for each word to be corrected, select a correction word satisfying the correction condition from the candidate words according to the correction probability values of the corresponding candidate words.
S922, determine a candidate word pair consisting of the word to be corrected and the corresponding correction word.
S924, when the word to be corrected in a candidate word pair is not found in the preset word list, determine that candidate word pair as a target word pair.
S926, output the corrected text corresponding to the initial text according to the target word pairs determined from the candidate word pairs.
S928, perform compression coding processing on the corrected text to obtain an encoded text.
S930, input the encoded text to the pre-trained forward codec and the pre-trained reverse codec, respectively.
S932, perform syntax error correction processing on the input encoded text through the forward codec to obtain a corresponding first output text.
S934, perform syntax error correction processing on the input encoded text through the reverse codec to obtain a corresponding second output text.
S936, perform decompression decoding processing on the first output text and the second output text respectively to obtain a first candidate text and a second candidate text.
S938, perform language evaluation processing on the first candidate text and the second candidate text respectively through the language evaluation model to obtain a first evaluation score corresponding to the first candidate text and a second evaluation score corresponding to the second candidate text.
S940, multiply the second evaluation score by a preset weight to obtain a third evaluation score; the preset weight is less than 1.
S942, when the first evaluation score is greater than or equal to the third evaluation score, take the first candidate text as the target text corresponding to the initial text.
S944, when the first evaluation score is smaller than the third evaluation score, take the second candidate text as the target text corresponding to the initial text.
S946, display the text error correction box in parallel at a peripheral position of the text input box.
S948, output the target text through the text error correction box.
According to the above text processing method, spelling error correction processing is performed on the initial text to be corrected to obtain the corresponding corrected text, which reduces the difficulty of subsequent grammar error correction and improves error correction accuracy. Grammar error correction processing is then performed on the corrected text through a pre-trained forward codec and a pre-trained reverse codec respectively to obtain a first candidate text and a second candidate text, and the error-corrected target text for the initial text is generated based on the evaluation scores corresponding to the first candidate text and the second candidate text. In this way, through parallel forward and reverse decoding, both forward and reverse decoding information is taken into account, the accumulation of grammatical errors caused by serial decoding is avoided, and the accuracy of error correction on the initial text is greatly improved.
FIG. 9 is a flowchart illustrating a text processing method in one embodiment. It should be understood that, although the steps in the flowchart of fig. 9 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise, the order of execution of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 9 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and these sub-steps or stages are not necessarily performed sequentially; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
For the text processing method mentioned in the embodiments of the present application, a set of experiments was performed to verify that executing the method can effectively improve the accuracy of text error correction. In these experiments, the codecs were based on the Transformer model, and the test set was the open CoNLL-2014 test set for grammatical error correction.
The experimental comparison results are as follows:
[Table: experimental comparison results, reproduced as an image in the original patent]
Note: precision is the ratio of correct modifications to the total number of modifications, recall is the ratio of correct modifications to the total number of errors, and F_0.5 is a balance between precision and recall; in general, the higher F_0.5 is, the better the error correction effect.
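For reference, these metrics can be written explicitly (standard definitions of precision, recall, and the F_0.5 measure; the patent states them only in words):

```latex
P = \frac{\#\,\text{correct modifications}}{\#\,\text{modifications}}, \qquad
R = \frac{\#\,\text{correct modifications}}{\#\,\text{errors}}, \qquad
F_{0.5} = \frac{(1 + 0.5^2)\, P R}{0.5^2\, P + R}
```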
Experimental conclusions:
1) Experiments 1 and 2 are the results of conventional one-way decoding; it is clear that forward decoding is indeed superior to reverse decoding.
2) Experiments 3 and 4 follow the serial decoding idea; it can be seen that serial decoding makes the final result worse than forward decoding alone, that is, the second decoding pass introduces additional noise.
3) Experiment 5 is parallel decoding that ignores the fact that reverse decoding is worse than forward decoding; it can be seen that, without down-weighting the evaluation score of the reverse decoding result, the actual decoding effect may deteriorate.
4) Experiment 6 is parallel decoding that takes into account the fact that reverse decoding is worse than forward decoding and reduces its weight; the results show that precision, recall, and F_0.5 for grammar error correction are all significantly improved.
5) Experiment 7 adds the spell check; it can be seen that the spell check greatly improves the final result.
The results of the spell check module are as follows:
The test uses the English vocabulary of Enchant (an open-source spell-checking tool) and the open-source spelling-correction model provided by Jamspell (an open-source spell checker on GitHub that uses a language model to detect potentially erroneous words and propose modifications). The test set is again CoNLL-2014; although it was originally built for grammatical error detection, it also contains some spelling errors.
The experimental results are as follows:
[Table: spell-check module results, reproduced as an image in the original patent]
Experimental conclusion: it can be seen that the spelling correction method mentioned in the embodiments of the present application significantly improves the accuracy of the modifications and discards the unreasonable modifications made by Jamspell, which are mainly changes from one correctly spelled word to another correctly spelled word. Such a modification may be contextual, which amounts to correcting a grammatical error. By limiting, through the word list, the modification of words that were originally correct, the accuracy of the spelling correction is greatly improved. As an example, for the sentence "As a result, if the location keep go on in this summary trip, it will use a bad effect on the sounding generation", Jamspell modifies "if" to "of", because "of the location" is more common in the corpus than "if the location". However, "if" is actually used here in the subjunctive mood and is the correct usage, so this modification is discarded by the correction method in this application. The experiments show that the text processing method provided by the embodiments of the present application can greatly improve the accuracy of text error correction.
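A hedged illustration of this word-list filter applied on top of a Jamspell correction, assuming the pyenchant and jamspell Python bindings (enchant.Dict.check, TSpellCorrector.FixFragment), whitespace tokenization, and a pretrained model file whose path "en.bin" is hypothetical; this is a sketch of the idea, not the patent's implementation:

```python
import enchant
import jamspell

checker = enchant.Dict("en_US")            # Enchant English word list
corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel("en.bin")          # path to a pretrained Jamspell model (assumed)

def filtered_correction(sentence: str) -> str:
    """Accept a Jamspell modification only when the original token is not a
    dictionary word, mirroring the word-list recheck described above."""
    original = sentence.split()
    suggested = corrector.FixFragment(sentence).split()
    if len(original) != len(suggested):    # tokenization changed; fall back to the original
        return sentence
    return " ".join(s if (o != s and not checker.check(o)) else o
                    for o, s in zip(original, suggested))
```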
As shown in fig. 10, in one embodiment, a text processing apparatus 1000 is provided, comprising an acquisition module 1001, a spell correction module 1002, a grammar correction module 1003, a language evaluation module 1004, and a determination module 1005, wherein,
an obtaining module 1001 is configured to obtain an initial text to be corrected.
And a spell correction module 1002, configured to perform spell correction processing on the initial text to obtain a corresponding corrected text.
And the syntax error correction module 1003 is configured to perform syntax error correction processing on the corrected text through a pre-trained forward codec and a pre-trained backward codec, respectively, to obtain a first candidate text and a second candidate text.
The language evaluation module 1004 is configured to perform language evaluation processing based on the first candidate text and the second candidate text to obtain an evaluation score.
A determining module 1005, configured to determine, according to the evaluation score and according to the first candidate text and the second candidate text, a target text corresponding to the initial text.
In one embodiment, the spell correction module 1002 is further configured to segment the whole sentence in the initial text into individual initial words and determine at least one group of multiple groups of the initial words; calculating a first context probability value of each initial word appearing in the whole sentence based on the tuple; when the first context probability value is smaller than the preset probability value, determining the initial word as a word to be corrected and determining a candidate word corresponding to the word to be corrected; calculating a second context probability value of the candidate word appearing in the whole sentence based on the tuple; and screening out the corrected words meeting the correction conditions from the candidate words according to the second context probability value, and outputting the corrected text corresponding to the initial text based on the corrected words.
In one embodiment, the spelling correction module 1002 is further configured to determine an edit distance between the word to be corrected and each corresponding candidate word; determining an editing probability value corresponding to each candidate word according to the editing distance; determining a correction probability value according to a second context probability value and an editing probability value corresponding to each candidate word; for each word to be corrected, screening out a corrected word meeting the correction condition from the candidate words according to the correction probability value of the corresponding candidate word; and outputting a corrected text corresponding to the initial text based on the corrected words.
In one embodiment, the spell correction module 1002 is further configured to determine a candidate word pair consisting of a word to be corrected and a corresponding corrected word; when the word to be corrected in the candidate word pair is not found from the preset word list, determining the candidate word pair as a target word pair; and outputting a corrected text corresponding to the initial text according to the target word pair determined from the candidate word pairs.
In one embodiment, the syntax correcting module 1003 is further configured to perform compression coding processing on the modified text to obtain a coded text; respectively inputting the coded texts into a pre-trained forward codec and a pre-trained reverse codec; carrying out syntax error correction processing on an input coded text through a forward codec to obtain a corresponding first output text; carrying out syntax error correction processing on the input coded text through a reverse codec to obtain a corresponding second output text; and respectively carrying out decompression decoding processing on the first output text and the second output text to obtain a first candidate text and a second candidate text.
In one embodiment, the language evaluation module 1004 is further configured to perform language evaluation processing on the first candidate text and the second candidate text respectively through a language evaluation model to obtain evaluation scores corresponding to the first candidate text and the second candidate text respectively; the determining module 1005 is further configured to filter out, from the first candidate text and the second candidate text, candidate texts whose respective evaluation scores satisfy the first target condition as target texts corresponding to the initial text.
In one embodiment, the evaluation score comprises a first evaluation score corresponding to the first candidate text and a second evaluation score corresponding to the second candidate text; the determining module 1005 is further configured to multiply the second evaluation score by a preset weight to obtain a third evaluation score; the preset weight is smaller than a numerical value one; when the first evaluation score is larger than or equal to the third evaluation score, the first candidate text is used as a target text corresponding to the initial text; and when the first evaluation score is smaller than the third evaluation score, the second candidate text is taken as the target text corresponding to the initial text.
In one embodiment, the language assessment module 1004 is further configured to compare the initial text with the first candidate text, determine a first set of differences formed by differences between the initial text and the first candidate text; comparing the initial text with the second candidate text, and determining a second difference set formed by the differences between the initial text and the second candidate text; determining more than one group of combined texts corresponding to the initial texts by combining the differences in the first difference set and the second difference set; respectively carrying out language evaluation processing on each combined text through a language evaluation model to obtain an evaluation score corresponding to each combined text; the determining module 1005 is further configured to determine the combined text with the evaluation score satisfying the second target condition as the target text corresponding to the initial text.
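A minimal sketch of this difference-combination strategy, assuming whitespace tokenization, difflib word-level opcodes as the notion of a "difference", and a callable lm_score standing in for the language evaluation model; the patent does not fix any of these details:

```python
import difflib
from itertools import chain, combinations

def edits_between(src_tokens, cand_tokens):
    """Word-level edits (start, end, replacement) that turn src into the candidate."""
    ops = difflib.SequenceMatcher(None, src_tokens, cand_tokens).get_opcodes()
    return [(i1, i2, tuple(cand_tokens[j1:j2]))
            for tag, i1, i2, j1, j2 in ops if tag != "equal"]

def apply_edits(src_tokens, edits):
    """Apply non-overlapping edits right-to-left so earlier indices stay valid."""
    tokens = list(src_tokens)
    for i1, i2, repl in sorted(edits, reverse=True):
        tokens[i1:i2] = repl
    return tokens

def best_combination(initial, cand_a, cand_b, lm_score):
    src = initial.split()
    pooled = sorted(set(edits_between(src, cand_a.split())) |
                    set(edits_between(src, cand_b.split())))
    best_text, best_score = initial, lm_score(initial)
    # enumerate every subset of the pooled differences (fine for a handful of edits)
    for subset in chain.from_iterable(combinations(pooled, k) for k in range(1, len(pooled) + 1)):
        spans = sorted((i1, i2) for i1, i2, _ in subset)
        if any(a_end > b_start for (_, a_end), (b_start, _) in zip(spans, spans[1:])):
            continue                       # skip combinations whose edit spans overlap
        candidate = " ".join(apply_edits(src, subset))
        score = lm_score(candidate)
        if score > best_score:
            best_text, best_score = candidate, score
    return best_text
```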
In one embodiment, referring to fig. 11, the text processing apparatus 1000 further includes:
a display module 1006, configured to display the text input box.
The obtaining module 1001 is further configured to obtain an initial text to be corrected, which is input into the text input box.
The display module 1006 is further configured to display the text correction box in parallel at a peripheral position of the text input box; and outputting the target text through the text error correction box.
The above text processing apparatus performs spelling error correction processing on the initial text to be corrected to obtain the corresponding corrected text, which reduces the difficulty of subsequent grammar error correction and improves error correction accuracy. Grammar error correction processing is then performed on the corrected text through a pre-trained forward codec and a pre-trained reverse codec respectively to obtain a first candidate text and a second candidate text, and the error-corrected target text for the initial text is generated based on the evaluation scores corresponding to the first candidate text and the second candidate text. In this way, through parallel forward and reverse decoding, both forward and reverse decoding information is taken into account, the accumulation of grammatical errors caused by serial decoding is avoided, and the accuracy of error correction on the initial text is greatly improved.
FIG. 12 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 or the server 120 in fig. 1. As shown in fig. 12, the computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the text processing method. The internal memory may also have stored therein a computer program that, when executed by the processor, causes the processor to perform a text processing method.
Those skilled in the art will appreciate that the architecture shown in fig. 12 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, the text processing apparatus provided in the present application may be implemented in the form of a computer program that is executable on a computer device such as the one shown in fig. 12. The memory of the computer device may store therein various program modules constituting the text processing apparatus, such as an acquisition module 1001, a spelling correction module 1002, a grammar correction module 1003, a language evaluation module 1004, and a determination module 1005 shown in fig. 10. The computer program constituted by the respective program modules causes the processor to execute the steps in the text processing method of the respective embodiments of the present application described in the present specification.
For example, the computer device shown in fig. 12 may perform step S202 through the acquisition module in the text processing apparatus shown in fig. 10. The computer device may perform step S204 through the spell correction module, step S206 through the syntax error correction module, step S208 through the language evaluation module, and step S210 through the determination module.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the text processing method described above. Here, the steps of the text processing method may be steps in the text processing methods of the above-described respective embodiments.
In one embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the steps of the above-mentioned text processing method. Here, the steps of the text processing method may be steps in the text processing methods of the above-described respective embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (14)

1. A text processing method, comprising:
acquiring an initial text to be corrected;
determining words to be corrected according to a first context probability value of each initial word in the initial text appearing in the whole sentence by a spelling correction model, and screening out correction words meeting correction conditions from corresponding candidate words according to a second context probability value of the candidate words corresponding to the determined words to be corrected appearing in the whole sentence;
when the word to be corrected is not found in a preset word list, modifying the word to be corrected through the correction word, otherwise, reserving the word to be corrected to obtain a corrected text corresponding to the initial text;
respectively carrying out syntax error correction processing on the corrected text through a pre-trained forward codec and a pre-trained reverse codec to obtain a first candidate text and a second candidate text;
respectively performing language evaluation processing on the first candidate text and the second candidate text through a language evaluation model to obtain a first evaluation score corresponding to the first candidate text and a second evaluation score corresponding to the second candidate text;
multiplying the second evaluation score by a preset weight to obtain a third evaluation score; the preset weight is smaller than a numerical value one;
when the first evaluation score is greater than or equal to the third evaluation score, taking a first candidate text as a target text corresponding to the initial text;
and when the first evaluation score is smaller than the third evaluation score, taking a second candidate text as a target text corresponding to the initial text.
2. The method according to claim 1, wherein the determining, by the spelling correction model, words to be corrected according to a first context probability value that each initial word in the initial text appears in the whole sentence, and screening out a corrected word satisfying a correction condition from corresponding candidate words according to a second context probability value that a candidate word corresponding to the determined words to be corrected appears in the whole sentence, includes:
segmenting the whole sentence in the initial text into single initial words, and determining at least one group of multi-element groups consisting of the initial words;
calculating a first context probability value of each initial word appearing in the whole sentence based on the tuple;
when the first context probability value is smaller than a preset probability value, determining the initial word as a word to be corrected and determining a candidate word corresponding to the word to be corrected;
calculating a second context probability value that the candidate word appears in the whole sentence based on the tuple;
and screening out the correction words meeting the correction conditions from the candidate words according to the second context probability value.
3. The method of claim 2, wherein the selecting a corrected word satisfying a correction condition from the candidate words according to the second context probability value comprises:
determining the editing distance between the word to be corrected and each corresponding candidate word;
determining an editing probability value corresponding to each candidate word according to the editing distance;
determining a correction probability value according to a second context probability value and an editing probability value corresponding to each candidate word;
and for each word to be corrected, screening out the corrected words meeting the correction conditions from the candidate words according to the correction probability values of the corresponding candidate words.
4. The method according to claim 1, wherein the modifying the word to be corrected by the modifying word when the word to be corrected is not found in a preset word list, and otherwise, the retaining the word to be corrected to obtain a modified text corresponding to the initial text comprises:
determining a candidate word pair consisting of the word to be corrected and the corresponding correction word;
when the word to be corrected in the candidate word pair is not found from a preset word list, determining the candidate word pair as a target word pair;
and outputting a corrected text corresponding to the initial text according to the target word pair determined from the candidate word pair.
5. The method of claim 1, wherein the performing syntax error correction processing on the modified text by the pre-trained forward codec and the pre-trained backward codec to obtain a first candidate text and a second candidate text respectively comprises:
carrying out compression coding processing on the corrected text to obtain a coded text;
inputting the coded texts into a pre-trained forward codec and a pre-trained reverse codec respectively;
carrying out syntax error correction processing on the input coded text through the forward codec to obtain a corresponding first output text;
carrying out syntax error correction processing on the input coded text through the reverse codec to obtain a corresponding second output text;
and respectively carrying out decompression decoding processing on the first output text and the second output text to obtain a first candidate text and a second candidate text.
6. The method according to any one of claims 1 to 5, wherein the obtaining of the initial text to be corrected comprises:
displaying a text input box;
acquiring an initial text to be corrected, which is input into the text input box;
the method further comprises the following steps:
displaying the text error correction box at the peripheral position of the text input box in parallel;
and outputting the target text through the text error correction box.
7. A text processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring an initial text to be corrected;
the spelling correction module is used for determining words to be corrected according to a first context probability value of each initial word in the initial text appearing in the whole sentence through a spelling correction model, and screening out correction words meeting correction conditions from corresponding candidate words according to a second context probability value of the candidate words corresponding to the determined words to be corrected appearing in the whole sentence; when the word to be corrected is not found in a preset word list, modifying the word to be corrected through the correction word, otherwise, reserving the word to be corrected to obtain a corrected text corresponding to the initial text;
the grammar error correction module is used for respectively carrying out grammar error correction processing on the corrected text through a pre-trained forward codec and a pre-trained reverse codec to obtain a first candidate text and a second candidate text;
the language evaluation module is used for respectively carrying out language evaluation processing on the first candidate text and the second candidate text through a language evaluation model to obtain a first evaluation score corresponding to the first candidate text and a second evaluation score corresponding to the second candidate text;
the determining module is used for multiplying the second evaluation score by a preset weight to obtain a third evaluation score; the preset weight is smaller than a numerical value one; when the first evaluation score is greater than or equal to the third evaluation score, taking a first candidate text as a target text corresponding to the initial text; and when the first evaluation score is smaller than the third evaluation score, taking a second candidate text as a target text corresponding to the initial text.
8. The apparatus of claim 7, wherein the spell correction module is further configured to segment the entire sentence in the initial text into individual initial words and determine at least one group of multi-element groups of the initial words; calculating a first context probability value of each initial word appearing in the whole sentence based on the tuple; when the first context probability value is smaller than a preset probability value, determining the initial word as a word to be corrected and determining a candidate word corresponding to the word to be corrected; calculating a second context probability value that the candidate word appears in the whole sentence based on the tuple; and screening out the correction words meeting the correction conditions from the candidate words according to the second context probability value.
9. The apparatus according to claim 8, wherein the spelling correction module is further configured to determine an edit distance between the word to be corrected and each corresponding candidate word; determining an editing probability value corresponding to each candidate word according to the editing distance; determining a correction probability value according to a second context probability value and an editing probability value corresponding to each candidate word; and for each word to be corrected, screening out the corrected words meeting the correction conditions from the candidate words according to the correction probability values of the corresponding candidate words.
10. The apparatus of claim 7, wherein the spell correction module is further configured to determine a candidate word pair consisting of the word to be corrected and the corresponding corrected word; when the word to be corrected in the candidate word pair is not found from a preset word list, determining the candidate word pair as a target word pair; and outputting a corrected text corresponding to the initial text according to the target word pair determined from the candidate word pair.
11. The apparatus according to claim 7, wherein the syntax correcting module is further configured to perform compression coding processing on the modified text to obtain a coded text; inputting the coded texts into a pre-trained forward codec and a pre-trained reverse codec respectively; carrying out syntax error correction processing on the input coded text through the forward codec to obtain a corresponding first output text; carrying out syntax error correction processing on the input coded text through the reverse codec to obtain a corresponding second output text; and respectively carrying out decompression decoding processing on the first output text and the second output text to obtain a first candidate text and a second candidate text.
12. The apparatus of any one of claims 7 to 11, further comprising:
the display module is used for displaying the text input box;
the acquisition module is further used for acquiring the initial text to be corrected, which is input into the text input box;
the display module is also used for displaying the text error correction box in parallel at the peripheral position of the text input box; and outputting the target text through the text error correction box.
13. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 6.
14. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 6.
CN201910791618.7A 2019-08-26 2019-08-26 Text processing method and device Active CN110502754B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910791618.7A CN110502754B (en) 2019-08-26 2019-08-26 Text processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910791618.7A CN110502754B (en) 2019-08-26 2019-08-26 Text processing method and device

Publications (2)

Publication Number Publication Date
CN110502754A CN110502754A (en) 2019-11-26
CN110502754B true CN110502754B (en) 2021-05-28

Family

ID=68589610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910791618.7A Active CN110502754B (en) 2019-08-26 2019-08-26 Text processing method and device

Country Status (1)

Country Link
CN (1) CN110502754B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297833A (en) * 2020-02-21 2021-08-24 华为技术有限公司 Text error correction method and device, terminal equipment and computer storage medium
CN111460794A (en) * 2020-03-11 2020-07-28 云知声智能科技股份有限公司 Grammar error correction method for increasing spelling error correction function
CN111626049B (en) * 2020-05-27 2022-12-16 深圳市雅阅科技有限公司 Title correction method and device for multimedia information, electronic equipment and storage medium
CN112001169B (en) * 2020-07-17 2022-03-25 北京百度网讯科技有限公司 Text error correction method and device, electronic equipment and readable storage medium
CN111897535A (en) * 2020-07-30 2020-11-06 平安科技(深圳)有限公司 Grammar error correction method, device, computer system and readable storage medium
CN114372441B (en) * 2022-03-23 2022-06-03 中电云数智科技有限公司 Automatic error correction method and device for Chinese text
CN114861635B (en) * 2022-05-10 2023-04-07 广东外语外贸大学 Chinese spelling error correction method, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7389220B2 (en) * 2000-10-20 2008-06-17 Microsoft Corporation Correcting incomplete negation errors in French language text
CN103198149B (en) * 2013-04-23 2017-02-08 中国科学院计算技术研究所 Method and system for query error correction
CN107357775A (en) * 2017-06-05 2017-11-17 百度在线网络技术(北京)有限公司 The text error correction method and device of Recognition with Recurrent Neural Network based on artificial intelligence
CN108694167B (en) * 2018-04-11 2022-09-06 广州视源电子科技股份有限公司 Candidate word evaluation method, candidate word ordering method and device

Also Published As

Publication number Publication date
CN110502754A (en) 2019-11-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant