CN116484842A - Statement error correction method and device, electronic equipment and storage medium

Info

Publication number
CN116484842A
Authority
CN
China
Prior art keywords
sentence
statement
generated
corrected
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210041159.2A
Other languages
Chinese (zh)
Inventor
伍正祥
王浪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Wuhan Kingsoft Office Software Co Ltd
Original Assignee
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Wuhan Kingsoft Office Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Office Software Inc, Zhuhai Kingsoft Office Software Co Ltd, Wuhan Kingsoft Office Software Co Ltd
Priority to CN202210041159.2A
Publication of CN116484842A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/216 Parsing using statistical methods
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/30 Semantic analysis

Abstract

The application relates to the technical field of text processing and discloses a sentence error correction method comprising: performing word block replacement processing on a sentence to be corrected to obtain at least one generated sentence; combining each generated sentence with the sentence to be corrected to form at least one sentence pair in one-to-one correspondence with the at least one generated sentence; performing feature extraction on each sentence pair to obtain sentence pair features of each sentence pair; scoring each generated sentence according to the sentence pair features of each sentence pair to obtain a score for each generated sentence; and correcting the sentence to be corrected according to the scores to obtain a corrected sentence. Because the relation between each generated sentence and the original sentence is taken into account, the final corrected sentence can be determined among the generated sentences on the basis of that relation, which improves the sentence error correction effect. The application also discloses a sentence error correction device, an electronic device and a storage medium.

Description

Statement error correction method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of text processing technologies, for example, to a sentence error correction method and device, an electronic device, and a storage medium.
Background
At present, sentence processing is involved in daily life and work across all industries. Because the words, pronunciation, characters, grammar and word order of sentences are complex, sentences often contain errors, and there is a great demand for sentence error correction. In the prior art, sentence error correction obtains a plurality of candidate corrected sentences, extracts features of the candidate corrected sentences, and determines the final corrected sentence according to those features.
In the prior art, when features of the candidate corrected sentences are extracted, usually only quantities such as the conditional probability that two words co-occur in a candidate corrected sentence and the frequency with which the words of the candidate corrected sentence occur in a corpus are considered. Because only features of the candidate corrected sentences themselves are considered, the error correction effect on sentences is poor.
Disclosure of Invention
The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview, and is intended to neither identify key/critical elements nor delineate the scope of such embodiments, but is intended as a prelude to the more detailed description that follows.
The embodiments of the disclosure provide a sentence error correction method and device, an electronic device and a storage medium, so as to improve the effect of sentence error correction.
In some embodiments, a sentence error correction method includes: performing word block replacement processing on a sentence to be corrected to obtain at least one generated sentence; combining each generated sentence with the sentence to be corrected to form at least one sentence pair in one-to-one correspondence with the at least one generated sentence; performing feature extraction processing on each sentence pair to obtain sentence pair features of each sentence pair; scoring each generated sentence according to the sentence pair features of each sentence pair to obtain a score for each generated sentence; and correcting the sentence to be corrected according to the scores to obtain a corrected sentence.
In some embodiments, performing word block replacement processing on the sentence to be corrected to obtain at least one generated sentence includes: acquiring a first candidate word block, the first candidate word block being formed by splicing a plurality of consecutive characters in the sentence to be corrected; replacing at least one character in the first candidate word block with a confusion character from a preset confusion character set to obtain at least one second candidate word block; if a preset word block table does not contain the first candidate word block, replacing a first character in the sentence to be corrected with a confusion character from the confusion character set to obtain a generated sentence of the sentence to be corrected, the first character being any character in the first candidate word block; and if the preset word block table contains the second candidate word block, replacing the first candidate word block in the sentence to be corrected with the second candidate word block to obtain a generated sentence of the sentence to be corrected.
In some embodiments, performing feature extraction processing on each sentence pair to obtain the sentence pair features of each sentence pair includes performing the following processing for each sentence pair: extracting one or more of a score feature, an edit distance feature, a confusion score (perplexity) feature and a word block count difference feature of the sentence pair; and determining the extracted one or more features as the sentence pair features of the sentence pair.
In some embodiments, extracting the score feature of a sentence pair includes: acquiring a first character probability of the character at the position subjected to word block replacement processing in the sentence to be corrected of the sentence pair; acquiring a second character probability of the character at that position in the generated sentence of the sentence pair after word block replacement; calculating a first difference between the second character probability and the first character probability; and determining the first character probability, the second character probability and the first difference as the score feature of the sentence pair.
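The score feature described above can be sketched as follows. This is a minimal illustration only: the function name `score_features` is hypothetical, and the sketch assumes the two character probabilities have already been obtained from some language model, which the text does not specify at this point.

```python
from typing import List


def score_features(first_char_prob: float, second_char_prob: float) -> List[float]:
    """Return [first probability, second probability, first difference].

    first_char_prob:  probability of the original character at the replaced
                      position in the sentence to be corrected.
    second_char_prob: probability of the replacement character at the same
                      position in the generated sentence.
    """
    # The first difference is defined as second probability minus first.
    first_difference = second_char_prob - first_char_prob
    return [first_char_prob, second_char_prob, first_difference]


print(score_features(0.25, 0.75))  # [0.25, 0.75, 0.5]
```

A positive first difference indicates that the replacement character is more probable in context than the original character.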
In some embodiments, extracting the edit distance feature of a sentence pair includes: calculating the Chinese character edit distance between the generated sentence and the sentence to be corrected in the sentence pair; converting the generated sentence and the sentence to be corrected in the sentence pair into pinyin sequences, and calculating the pinyin edit distance between the generated sentence and the sentence to be corrected based on the converted pinyin sequences; and determining the Chinese character edit distance and the pinyin edit distance corresponding to the sentence pair as the edit distance feature of the sentence pair.
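A minimal sketch of the two edit distances follows. It assumes Levenshtein distance, and it assumes the pinyin distance is computed over syllable sequences; the text fixes neither choice. The lookup table `TOY_PINYIN` and the example characters are illustrative stand-ins chosen for this sketch, not data from the application; a real system would use a full pinyin converter.

```python
from typing import Dict, List, Sequence

# Hypothetical toy lookup; stands in for a real character-to-pinyin converter.
TOY_PINYIN: Dict[str, str] = {"形": "xing", "象": "xiang", "像": "xiang", "想": "xiang"}


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def to_pinyin(sentence: str) -> List[str]:
    """Map each character to a syllable; unknown characters map to themselves."""
    return [TOY_PINYIN.get(ch, ch) for ch in sentence]


def edit_distance_features(to_be_corrected: str, generated: str) -> List[int]:
    """Return [Chinese character edit distance, pinyin edit distance]."""
    hanzi = edit_distance(to_be_corrected, generated)
    pinyin = edit_distance(to_pinyin(to_be_corrected), to_pinyin(generated))
    return [hanzi, pinyin]


print(edit_distance_features("形像", "形象"))  # [1, 0]
```

In this toy case the two sentences differ by one character but share the same syllables, so the character distance is 1 and the pinyin distance is 0, which is the kind of signal that separates homophone substitutions from unrelated ones.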
In some embodiments, extracting the confusion score feature of a sentence pair includes: acquiring a first confusion score of the sentence to be corrected in the sentence pair; acquiring a second confusion score of the generated sentence in the sentence pair; obtaining a second difference between the second confusion score and the first confusion score of the sentence pair; and determining the first confusion score, the second confusion score and the second difference corresponding to the sentence pair as the confusion score feature of the sentence pair.
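The following sketch assumes that the "confusion score" is the perplexity of a sentence under a language model, computed from per-token probabilities as the exponential of the mean negative log-probability. That reading, the function names and the probability inputs are assumptions made for illustration; the text does not say how the score is computed.

```python
import math
from typing import List, Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity = exp(-(1/N) * sum(log p_i)) over the N token probabilities."""
    if not token_probs:
        raise ValueError("at least one token probability is required")
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-mean_log_prob)


def confusion_score_features(original_probs: Sequence[float],
                             generated_probs: Sequence[float]) -> List[float]:
    """Return [first confusion score, second confusion score, second difference]."""
    first = perplexity(original_probs)    # sentence to be corrected
    second = perplexity(generated_probs)  # generated sentence
    return [first, second, second - first]


print(confusion_score_features([0.25, 0.25], [0.5, 0.5]))
```

Under this reading, a negative second difference means the generated sentence is less surprising to the language model than the sentence to be corrected.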
In some embodiments, obtaining the word block count difference feature of a sentence pair includes: acquiring a first word block count of the sentence to be corrected in the sentence pair; acquiring a second word block count of the generated sentence in the sentence pair; obtaining a third difference between the second word block count and the first word block count of the sentence pair; and determining the third difference corresponding to the sentence pair as the word block count difference feature of the sentence pair.
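A sketch of the word block count difference is given below. The text does not say how a sentence is segmented into word blocks, so this sketch assumes greedy forward maximum matching against a word block table; the `segment` helper, the `max_len` cap and the toy table are hypothetical.

```python
from typing import List, Set


def segment(sentence: str, table: Set[str], max_len: int = 4) -> List[str]:
    """Greedy forward maximum matching; unmatched characters become single blocks."""
    blocks: List[str] = []
    i = 0
    while i < len(sentence):
        # Try the longest window first and shrink until a table entry matches.
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + n]
            if n == 1 or piece in table:
                blocks.append(piece)
                i += n
                break
    return blocks


def word_block_count_difference(to_be_corrected: str, generated: str,
                                table: Set[str]) -> int:
    """Third difference: second word block count minus first word block count."""
    first_count = len(segment(to_be_corrected, table))
    second_count = len(segment(generated, table))
    return second_count - first_count


toy_table = {"形象", "代言人"}  # illustrative entries only
print(word_block_count_difference("形像代言人", "形象代言人", toy_table))  # -1
```

An erroneous character tends to break a word into single-character blocks, so a generated sentence that restores the word usually has fewer blocks and a negative difference.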
In some embodiments, scoring each generated sentence according to the sentence pair features of each sentence pair to obtain a score for each generated sentence includes: acquiring a weight vector for the sentence pair features; and, for each sentence pair, scoring the generated sentence in the sentence pair according to the sentence pair features of the sentence pair and the weight vector to obtain the score of the generated sentence in the sentence pair.
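One simple way to combine the sentence pair features with a weight vector is a dot product, sketched below. The linear form is an assumption made for illustration; the text only states that the score is obtained from the features and the weight vector. The function name and the sample values are hypothetical.

```python
from typing import Sequence


def score_generated_sentence(features: Sequence[float],
                             weights: Sequence[float]) -> float:
    """Score a generated sentence as the dot product of features and weights."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * f for w, f in zip(weights, features))


# Two features with hypothetical weights.
print(score_generated_sentence([1.0, 2.0], [0.5, 0.25]))  # 1.0
```

With a linear score, the sign of each weight decides whether a larger feature value favours or penalises a generated sentence, so a negative weight would suit features such as edit distance.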
In some embodiments, correcting the sentence to be corrected according to the scores to obtain a corrected sentence includes: selecting the generated sentence with the highest score as a candidate sentence; and determining the candidate sentence as the corrected sentence corresponding to the sentence to be corrected.
In some embodiments, a sentence error correction device includes: a replacement module configured to perform word block replacement processing on a sentence to be corrected to obtain at least one generated sentence; a sentence pair generation module configured to combine each generated sentence with the sentence to be corrected to form at least one sentence pair in one-to-one correspondence with the at least one generated sentence; a feature extraction module configured to perform feature extraction processing on each sentence pair to obtain sentence pair features of each sentence pair; a scoring module configured to score each generated sentence according to the sentence pair features of each sentence pair to obtain a score for each generated sentence; and an error correction module configured to correct the sentence to be corrected according to the scores to obtain a corrected sentence.
In some embodiments, the replacement module performs word block replacement processing on the sentence to be corrected to obtain at least one generated sentence by: acquiring a first candidate word block, the first candidate word block being formed by splicing a plurality of consecutive characters in the sentence to be corrected; replacing at least one character in the first candidate word block with a confusion character from a preset confusion character set to obtain at least one second candidate word block; if a preset word block table does not contain the first candidate word block, replacing a first character in the sentence to be corrected with a confusion character from the confusion character set to obtain a generated sentence of the sentence to be corrected, the first character being any character in the first candidate word block; and if the preset word block table contains the second candidate word block, replacing the first candidate word block in the sentence to be corrected with the second candidate word block to obtain a generated sentence of the sentence to be corrected.
In some embodiments, the feature extraction module performs feature extraction processing on each sentence pair to obtain the sentence pair features of each sentence pair by performing the following processing for each sentence pair: extracting one or more of a score feature, an edit distance feature, a confusion score (perplexity) feature and a word block count difference feature of the sentence pair; and determining the extracted one or more features as the sentence pair features of the sentence pair.
In some embodiments, the feature extraction module extracts the score feature of a sentence pair by: acquiring a first character probability of the character at the position subjected to word block replacement processing in the sentence to be corrected of the sentence pair; acquiring a second character probability of the character at that position in the generated sentence of the sentence pair after word block replacement; calculating a first difference between the second character probability and the first character probability; and determining the first character probability, the second character probability and the first difference as the score feature of the sentence pair.
In some embodiments, the feature extraction module extracts the edit distance feature of a sentence pair by: calculating the Chinese character edit distance between the generated sentence and the sentence to be corrected in the sentence pair; converting the generated sentence and the sentence to be corrected in the sentence pair into pinyin sequences, and calculating the pinyin edit distance between the generated sentence and the sentence to be corrected based on the converted pinyin sequences; and determining the Chinese character edit distance and the pinyin edit distance corresponding to the sentence pair as the edit distance feature of the sentence pair.
In some embodiments, the feature extraction module extracts the confusion score feature of a sentence pair by: acquiring a first confusion score of the sentence to be corrected in the sentence pair; acquiring a second confusion score of the generated sentence in the sentence pair; obtaining a second difference between the second confusion score and the first confusion score of the sentence pair; and determining the first confusion score, the second confusion score and the second difference corresponding to the sentence pair as the confusion score feature of the sentence pair.
In some embodiments, the feature extraction module obtains the word block count difference feature of a sentence pair by: acquiring a first word block count of the sentence to be corrected in the sentence pair; acquiring a second word block count of the generated sentence in the sentence pair; obtaining a third difference between the second word block count and the first word block count of the sentence pair; and determining the third difference corresponding to the sentence pair as the word block count difference feature of the sentence pair.
In some embodiments, the scoring module scores each generated sentence according to the sentence pair features of each sentence pair to obtain a score for each generated sentence by: acquiring a weight vector for the sentence pair features; and, for each sentence pair, scoring the generated sentence in the sentence pair according to the sentence pair features of the sentence pair and the weight vector to obtain the score of the generated sentence in the sentence pair.
In some embodiments, the error correction module corrects the sentence to be corrected according to the scores to obtain a corrected sentence by: selecting the generated sentence with the highest score as a candidate sentence; and determining the candidate sentence as the corrected sentence corresponding to the sentence to be corrected.
In some embodiments, an electronic device includes a processor and a memory storing program instructions, the processor being configured to perform the sentence error correction method described above when executing the program instructions.
In some embodiments, a storage medium stores program instructions that, when executed, perform the sentence error correction method described above.
The sentence error correction method and device, the electronic device and the storage medium provided by the embodiments of the disclosure can achieve the following technical effects:
Sentence pairs are formed from the sentence to be corrected and the generated sentences obtained by word block replacement processing. Word block replacement processing can be performed on the same sentence multiple times, in which case multiple sentence pairs are generated, and the generated sentences are scored using the sentence pair features of their sentence pairs, so that different generated sentences can be selected by score to suit different error correction requirements. The corrected sentence obtained by this method takes into account the relation between the generated sentence and the original sentence, and the final corrected sentence can be determined among the generated sentences based on that relation, which improves the sentence error correction effect.
The foregoing general description and the following description are exemplary and explanatory only and are not restrictive of the application.
Drawings
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like reference numerals refer to similar elements, and in which:
FIG. 1 is a schematic diagram of a sentence error correction method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a sentence to be corrected and a generated sentence in an embodiment of the disclosure;
FIG. 3 is a schematic diagram of another sentence error correction method provided by an embodiment of the disclosure;
FIG. 4 is a schematic illustration of one application of an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a sentence error correction device provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
So that the features and technical content of the disclosed embodiments can be understood in more detail, the embodiments of the disclosure are described below with reference to the appended drawings, which are provided for illustration only and are not intended to limit the embodiments of the disclosure. In the following technical description, for purposes of explanation, numerous details are set forth in order to provide a thorough understanding of the disclosed embodiments. However, one or more embodiments may still be practiced without these details. In other instances, well-known structures and devices may be shown in simplified form in order to simplify the drawings.
The terms first, second and the like in the description and in the claims of the embodiments of the disclosure and in the above-described figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe embodiments of the present disclosure. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion.
The term "plurality" means two or more, unless otherwise indicated.
In the embodiments of the present disclosure, the character "/" indicates an "or" relationship between the objects before and after it. For example, A/B represents: A or B.
The term "and/or" describes an association between objects and means that three relationships are possible. For example, A and/or B represents: A, or B, or A and B.
The term "corresponding" may refer to an association or binding relationship; that A corresponds to B means that there is an association or binding relationship between A and B.
Referring to FIG. 1, an embodiment of the disclosure provides a sentence error correction method, including:
Step S101, performing word block replacement processing on a sentence to be corrected to obtain at least one generated sentence;
Step S102, combining each generated sentence with the sentence to be corrected to form at least one sentence pair in one-to-one correspondence with the at least one generated sentence;
Step S103, performing feature extraction processing on each sentence pair to obtain sentence pair features of each sentence pair;
Step S104, scoring each generated sentence according to the sentence pair features of each sentence pair to obtain a score for each generated sentence;
Step S105, correcting the sentence to be corrected according to the scores to obtain a corrected sentence.
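Steps S101 to S105 can be sketched as a small pipeline with pluggable parts. Everything named here (`correct_sentence`, the `generate`, `extract_features` and `score` callables, and the toy English example) is hypothetical and only illustrates the control flow; returning the input unchanged when no sentence is generated is an added assumption, since the method itself always produces at least one generated sentence.

```python
from typing import Callable, List, Sequence, Tuple


def correct_sentence(sentence: str,
                     generate: Callable[[str], List[str]],
                     extract_features: Callable[[str, str], Sequence[float]],
                     score: Callable[[Sequence[float]], float]) -> str:
    """Run steps S101-S105 and return the corrected sentence."""
    generated = generate(sentence)                                 # S101
    if not generated:
        return sentence  # assumption: nothing to choose from, keep the input
    pairs: List[Tuple[str, str]] = [(sentence, g) for g in generated]  # S102
    features = [extract_features(orig, gen) for orig, gen in pairs]    # S103
    scores = [score(f) for f in features]                              # S104
    best = max(range(len(pairs)), key=lambda i: scores[i])             # S105
    return pairs[best][1]


# Toy stand-ins for the real components, for illustration only.
def toy_generate(s: str) -> List[str]:
    return [s.replace("teh", "the"), s.replace("teh", "ten")]


def toy_features(orig: str, gen: str) -> List[float]:
    return [1.0 if "the" in gen.split() else 0.0]


def toy_score(f: Sequence[float]) -> float:
    return f[0]


print(correct_sentence("teh cat", toy_generate, toy_features, toy_score))  # the cat
```

Keeping generation, feature extraction and scoring as separate callables mirrors the module split of the device embodiments and lets each part be swapped independently.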
With the sentence error correction method provided by the embodiments of the disclosure, sentence pairs are formed from the sentence to be corrected and the generated sentences obtained by word block replacement processing. Word block replacement processing can be performed on the same sentence multiple times, in which case multiple sentence pairs are generated, and the generated sentences are scored using the sentence pair features of their sentence pairs, so that different generated sentences can be selected by score to suit different error correction requirements. The corrected sentence obtained by this method takes into account the relation between the generated sentence and the original sentence, and the final corrected sentence can be determined among the generated sentences based on that relation, which improves the sentence error correction effect.
Optionally, performing word block replacement processing on the sentence to be corrected to obtain at least one generated sentence includes: acquiring a first candidate word block, the first candidate word block being formed by splicing a plurality of consecutive characters in the sentence to be corrected; replacing at least one character in the first candidate word block with a confusion character from a preset confusion character set to obtain at least one second candidate word block; if the preset word block table does not contain the first candidate word block, replacing a first character in the sentence to be corrected with a confusion character from the confusion character set to obtain a generated sentence of the sentence to be corrected, the first character being any character in the first candidate word block; and if the preset word block table contains a second candidate word block, replacing the first candidate word block in the sentence to be corrected with the second candidate word block to obtain a generated sentence of the sentence to be corrected.
A plurality of consecutive characters in the sentence to be corrected are spliced into the first candidate word block, where the number of consecutive characters is greater than or equal to 2 and less than or equal to the total number of characters in the sentence to be corrected.
In some embodiments, the first character may be the first character in the first candidate word block.
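The word block replacement described above can be sketched as follows. The sketch enumerates every run of at least two consecutive characters as a first candidate word block, substitutes one character at a time to form second candidate word blocks, and keeps a substitution when either branch condition holds. Substituting a single character per candidate, the `max_block_len` cap and the example characters are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Set


def generate_sentences(sentence: str,
                       confusion: Dict[str, Sequence[str]],
                       table: Set[str],
                       max_block_len: int = 4) -> List[str]:
    """Return the generated sentences obtained by word block replacement."""
    n = len(sentence)
    generated: List[str] = []
    for length in range(2, min(max_block_len, n) + 1):
        for start in range(n - length + 1):
            block = sentence[start:start + length]  # first candidate word block
            for offset, ch in enumerate(block):
                for sub in confusion.get(ch, ()):
                    # Second candidate word block: one character substituted.
                    second = block[:offset] + sub + block[offset + 1:]
                    # Branch 1: the first block is not in the table, so the
                    #           character in the sentence is replaced.
                    # Branch 2: the second block is in the table, so the whole
                    #           block in the sentence is replaced.
                    if block not in table or second in table:
                        cand = sentence[:start] + second + sentence[start + length:]
                        if cand != sentence and cand not in generated:
                            generated.append(cand)
    return generated


toy_confusion = {"像": ["象", "想"]}  # illustrative confusion set
toy_table = {"形象", "代言人"}        # illustrative word block table
print(generate_sentences("形像代言人", toy_confusion, toy_table))
```

With single-character substitution both branches yield the same sentence when they fire together, so the sketch merges them into one condition and de-duplicates the results.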
In this way, a plurality of consecutive characters in the sentence to be corrected are spliced into a first candidate word block, the characters in the first candidate word block are replaced with confusion characters to obtain a plurality of second candidate word blocks, the first candidate word block and the second candidate word blocks are looked up in the word block table, and the characters in the sentence to be corrected are replaced according to the lookup result, producing a plurality of generated sentences corresponding to the sentence to be corrected. Because confusion characters are used to replace characters in the sentence to be corrected, more errors in the sentence can be found and a large number of generated sentences can be produced. Because the generated sentence in each sentence pair is derived from the original sentence (that is, the sentence to be corrected), the relation between the generated sentence and the original sentence can be taken into account, and the final corrected sentence can be determined among the generated sentences based on that relation, which improves the sentence error correction effect.
If the sentence to be corrected is itself a correct sentence, the generated sentences obtained by performing word block replacement processing on it are all erroneous sentences. Swapping the sentence to be corrected and the generated sentence in a sentence pair yields a swapped first sentence pair: the sentence to be corrected in the original pair becomes the generated sentence of the first sentence pair, and the generated sentence in the original pair becomes the sentence to be corrected of the first sentence pair, so the generated sentence in the swapped first sentence pair is a correct sentence. Therefore, the sentences to be corrected and the generated sentences do not need to be labeled manually, which can greatly reduce the labeling cost.
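The swapping idea can be sketched as a helper that turns a known-correct sentence and its generated sentences into pairs whose second element is correct. The function name and the tuple layout `(sentence to be corrected, generated sentence)` are assumptions for illustration; how the resulting pairs are consumed (for example, to fit the weight vector) is not specified here.

```python
from typing import List, Sequence, Tuple


def swapped_sentence_pairs(correct_sentence: str,
                           generated_sentences: Sequence[str]) -> List[Tuple[str, str]]:
    """Build swapped pairs (sentence to be corrected, generated sentence).

    Each sentence generated from a correct sentence is treated as erroneous,
    so after swapping it takes the role of the sentence to be corrected and
    the known-correct sentence takes the role of the generated sentence.
    """
    return [(erroneous, correct_sentence)
            for erroneous in generated_sentences
            if erroneous != correct_sentence]


print(swapped_sentence_pairs("the cat", ["teh cat", "the cot"]))
```

Because the correct side of each swapped pair is known by construction, no manual annotation step is needed to obtain it.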
Optionally, the word block replacement processing is performed on the sentence to be corrected by using a Viterbi algorithm, so as to obtain at least one generated sentence.
In some embodiments, the preset word block table includes a plurality of preset word blocks, where a "word block" is a word consisting of one or more characters.
A subsequence formed by splicing a plurality of consecutive characters in the sentence to be corrected is determined as a first candidate word block. For example, if the sentence to be corrected is "he is the visual speaker of the product", first candidate word blocks such as "character", "image generation", "speech generation" and "speaker" can be obtained.
Optionally, the preset confusion character set includes at least one confusion character corresponding to each preset character. Replacing at least one character in the first candidate word block with a confusion character from the preset confusion character set to obtain at least one second candidate word block includes: searching the confusion character set for a confusion character corresponding to at least one character in the first candidate word block, and replacing the corresponding character in the first candidate word block with the confusion character found, to obtain at least one second candidate word block. For example, if the first candidate word block is "image" and confusion characters such as "image" and "want" are found in the confusion character set for a character of that word block, those confusion characters replace the character in the first candidate word block to obtain second candidate word blocks such as "image" and "want".
Optionally, replacing the first character in the sentence to be corrected with a confusion character from the confusion character set includes: searching the confusion character set for the preset character identical to the first character, and determining the confusion character corresponding to that preset character as the confusion character corresponding to the first character; and replacing the first character in the sentence to be corrected with the confusion character corresponding to the first character.
In some embodiments, the confusion characters corresponding to a preset character include semantic confusion characters, homophone confusion characters, near-homophone confusion characters, similar-shape confusion characters and the like corresponding to the preset character. For example, for the preset character "change", the corresponding semantic confusion characters include "correction" and "check"; the corresponding homophone confusion characters include "the", "the lid", "the general" and "beggar"; the corresponding near-homophone confusion characters include "high", "notice", "give", "sea" and "return"; and the corresponding similar-shape confusion characters include "administrative", "one" and "put".
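A confusion character set that draws on several categories can be sketched as a merge of per-category tables. The function name, the dictionary layout and the sample entries are hypothetical and chosen only to show the structure; they are not the application's data and make no linguistic claim.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def build_confusion_set(*category_tables: Dict[str, Sequence[str]]) -> Dict[str, List[str]]:
    """Merge per-category tables into one map: preset character -> confusion characters.

    Duplicates across categories are kept once, in first-seen order, and a
    character is never listed as a confusion character of itself.
    """
    merged: Dict[str, List[str]] = defaultdict(list)
    for table in category_tables:
        for preset, subs in table.items():
            for sub in subs:
                if sub != preset and sub not in merged[preset]:
                    merged[preset].append(sub)
    return dict(merged)


# Illustrative category tables (semantic, homophone, similar shape).
semantic = {"改": ["纠", "校"]}
homophone = {"改": ["该", "纠"]}
similar_shape = {"改": ["政"]}
print(build_confusion_set(semantic, homophone, similar_shape))
```

Keeping the categories as separate inputs makes it easy to add or drop a category, for example to compare results with and without semantic confusion characters.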
Optionally, before replacing the first character in the sentence to be corrected with the confusion character in the confusion character set, obtaining the confusion character corresponding to the preset character in the confusion character set is further included.
Optionally, the semantic confusion characters of a preset character are obtained by using a preset deep language model. Optionally, the deep language model is a BERT model. Optionally, a preset corpus is input into the BERT model and a masking (MASK) operation is performed on the preset character in the corpus, that is, the preset character is replaced with the mark [MASK] (also called the MASK mark herein). The BERT model is used to obtain, for each [MASK] position, the probability distribution over the candidate characters in a preset dictionary, where a [MASK] position is a position of the preset character in the preset corpus. The following is performed for each [MASK] position in turn: the candidate character with the maximum probability in the probability distribution for that [MASK] position is selected as a semantic confusion character of the preset character. For example, for the preset character "change", the semantic confusion characters "correction" and "check" are obtained.
The preset dictionary stores a plurality of candidate characters, such as existing Chinese characters, pinyin, punctuation marks, and the like.
The confusion characters of a preset character include not only homophones, near-homophones, and similar-shape characters but also characters with the same semantics. Using these confusion characters to replace characters in the first candidate word block therefore uncovers more errors in the sentence to be corrected and yields more generated sentences, from which a generated sentence that better fits the meaning of the original sentence can be obtained. For example, the sentence to be corrected is "it is necessary to build a highway", in which the confusion characters of "whisker" include the semantic confusion character "want"; one of the generated sentences of this sentence is therefore "build a highway".
Optionally, performing word block replacement processing on the sentence to be corrected to obtain at least one generated sentence includes: determining a subsequence formed by splicing a plurality of consecutive characters in the sentence to be corrected as a first candidate word block; replacing at least one character in the first candidate word block with a confusion character from the preset confusion character set to obtain at least one second candidate word block; if the preset word block list contains neither the first candidate word block nor a second candidate word block, replacing a first character in the sentence to be corrected with a confusion character from the confusion character set to obtain a generated sentence of the sentence to be corrected, the first character being any character in the first candidate word block; and if the preset word block list contains a second candidate word block, replacing the first candidate word block in the sentence to be corrected with that second candidate word block to obtain a generated sentence of the sentence to be corrected.
In some embodiments, as shown in Fig. 2, which is a schematic diagram of a sentence to be corrected and its generated sentences, the sentence to be corrected is "he is not only a spear but also a humour". The characters are traversed in order, starting from the character "he". Neither the first candidate word block formed by splicing "he" with the consecutive characters that follow it, nor any second candidate word block obtained by replacing at least one character of that first candidate word block with a confusion character, is a word block in the word block list; "he" is therefore replaced with its confusion character "I", giving generated sentence 1, "I is a spear and humour". Next, the character "both" is traversed. The first candidate word block formed by splicing "both" with the following character "warm" is "both warm"; replacing the two characters in "both warm" with confusion characters gives the second candidate word block "kissing", so "both warm" is replaced with "kissing", giving generated sentence 2, "other kissing spear". Then the character "warm" is traversed. The first candidate word block "Wen Mao", formed by splicing "warm" with the following character "spear", is not a word block in the word block list, but the second candidate word block "gentle", obtained by replacing "spear" in "Wen Mao" with the confusion character "soft", is a word block in the word block list; "Wen Mao" is therefore replaced with "gentle", giving generated sentence N, "he is gentle and humorous". The whole sentence to be corrected is traversed in this way to obtain N generated sentences, and hence N sentence pairs, each consisting of the original sentence and one generated sentence, for example the pair formed by the sentence to be corrected and the generated sentence "he is gentle and humorous".
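The word block replacement described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the function name `generate_sentences`, the fixed block length `block_len`, and the toy data in the usage note are hypothetical, and only the leading character of a block is replaced in the fallback case, whereas the text allows any character of the block.

```python
from itertools import product


def generate_sentences(sentence, confusion, vocab, block_len=2):
    """Generate candidate sentences by word block replacement (sketch).

    confusion: dict mapping a character to its list of confusion characters.
    vocab:     set of known word blocks (stands in for the word block list).
    block_len: assumed fixed length of a first candidate word block.
    """
    generated = []
    for start in range(len(sentence) - block_len + 1):
        first = sentence[start:start + block_len]
        # Each position keeps its character or takes one of its confusion characters.
        options = [[ch] + confusion.get(ch, []) for ch in first]
        seconds = ["".join(c) for c in product(*options)]
        seconds = [s for s in seconds if s != first]
        hits = [s for s in seconds if s in vocab]
        if hits:
            # A second candidate word block is in the list: replace the whole block.
            for s in hits:
                generated.append(sentence[:start] + s + sentence[start + block_len:])
        elif first not in vocab:
            # Neither block is in the list: replace a single character instead.
            for sub in confusion.get(first[0], []):
                generated.append(sentence[:start] + sub + sentence[start + 1:])
    # Drop duplicates while keeping the order of first appearance.
    return list(dict.fromkeys(generated))
```

With the toy inputs `generate_sentences("abcd", {"a": ["e"], "c": ["k"]}, {"bk"})`, the block "bc" is replaced by the listed block "bk", while "a" falls back to a single-character replacement.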
Optionally, performing feature extraction processing on each sentence pair to obtain sentence pair features of each sentence pair, including: the following processing is performed for each sentence pair: extracting one or more of score characteristics, editing distance characteristics, confusion score characteristics and word block quantity difference characteristics of each sentence pair; one or more of the score feature, the edit distance feature, the confusion score feature and the word block number difference feature of the sentence pair are determined as the sentence pair feature of the sentence pair.
Optionally, extracting the score feature of a sentence pair includes: acquiring a first character probability for the character position at which word block replacement is performed in the sentence to be corrected of the sentence pair; acquiring a second character probability for the character position that has undergone word block replacement in the generated sentence of the sentence pair; calculating a first difference between the second character probability and the first character probability; and determining the first character probability, the second character probability, and the first difference as the score feature of the sentence pair.
Optionally, the score feature of the sentence pair is obtained by using a preset deep language model. Optionally, the deep language model is a BERT model.
Optionally, acquiring the first character probability for the character position at which word block replacement is performed in the sentence to be corrected of the sentence pair includes: changing a third character in the sentence to be corrected into the mark [MASK], the third character being a character on which word block replacement is performed; using the BERT language model to obtain a first probability distribution over the candidate characters in the preset dictionary at the [MASK] position in the sentence to be corrected; selecting the highest probability in the first probability distribution as the probability of the third character; and determining the probability of the third character as the first character probability. If there are at least two third characters, the product of the probabilities of all the third characters is determined as the first character probability.
Optionally, acquiring the second character probability for the character position that has undergone word block replacement in the generated sentence of the sentence pair includes: changing a fourth character in the generated sentence into the mark [MASK], the fourth character being a character that has undergone word block replacement; using the BERT language model to obtain a second probability distribution over the candidate characters in the preset dictionary at the [MASK] position in the generated sentence; selecting the highest probability in the second probability distribution as the probability of the fourth character; and determining the probability of the fourth character as the second character probability. If there are at least two fourth characters, the product of the probabilities of all the fourth characters is determined as the second character probability.
The characters subjected to word block replacement in the sentence to be corrected and in the generated sentence are changed into the mark [MASK], and the probability distribution at each [MASK] position is obtained through the BERT model. Because the BERT model takes the semantics of the characters before and after a position into account when predicting the probability distribution at that position, the ability to recognize erroneous characters is improved. Taking the score feature of the sentence pair into account when scoring the generated sentence therefore makes the score more accurate and improves the accuracy of sentence error correction.
In some embodiments, the sentence to be corrected in the sentence pair is "he is gentle and humorous" and the generated sentence is "he is gentle and humorous". Using the BERT model, the first character probability corresponding to the sentence to be corrected is $1.7 \times 10^{-6}$, the second character probability corresponding to the generated sentence is 0.992, and the first difference between the second character probability and the first character probability is approximately 0.992. The score feature of the sentence pair is "$1.7 \times 10^{-6}$, 0.992, 0.992".
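As a minimal sketch of how the score feature could be assembled once the per-character probabilities are available: the function name `score_features` is hypothetical, and the probabilities are assumed to have been read off a masked language model beforehand, so no model call is shown.

```python
from math import prod


def score_features(src_char_probs, gen_char_probs):
    """Return (first probability, second probability, first difference).

    src_char_probs: probabilities of the replaced characters in the
                    sentence to be corrected, one per masked position.
    gen_char_probs: probabilities of the replacing characters in the
                    generated sentence, one per masked position.
    """
    # Several replaced characters are combined by multiplying their probabilities.
    first = prod(src_char_probs)
    second = prod(gen_char_probs)
    return (first, second, second - first)
```

Called with the values of the example, `score_features([1.7e-6], [0.992])`, the third component is about 0.992, in line with the text.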
Optionally, extracting the edit distance feature of the sentence pair includes: calculating the Chinese character editing distance between the generated sentence and the sentence to be corrected in the sentence pair; converting the generated sentence and the sentence to be corrected in the sentence pair into pinyin sequences, and respectively calculating pinyin editing distances between the generated sentence and the sentence to be corrected in the sentence pair based on the pinyin sequences after conversion; and determining the Chinese character editing distance and the pinyin editing distance corresponding to the sentence pair as the editing distance characteristic of the sentence pair.
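Optionally, the Chinese character edit distance between the sentence to be corrected and the generated sentence in the sentence pair is obtained by calculating

$$\mathrm{lev}_{a,b}(i,j) = \begin{cases} \max(i,j), & \text{if } \min(i,j) = 0, \\ \min\bigl(\mathrm{lev}_{a,b}(i-1,j) + 1,\ \mathrm{lev}_{a,b}(i,j-1) + 1,\ \mathrm{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}\bigr), & \text{otherwise,} \end{cases}$$

where $\mathrm{lev}_{a,b}(i,j)$ is the Chinese character edit distance between the sentence a to be corrected and the generated sentence b in the sentence pair, i is the length of the sentence a to be corrected, j is the length of the generated sentence b, and $1_{(a_i \neq b_j)}$ equals 1 when the i-th character of a differs from the j-th character of b and 0 otherwise.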
Optionally, the pinyin edit distance between the sentence to be corrected and the generated sentence in the sentence pair is obtained by calculating

$$\mathrm{lev}'_{a,b}(i',j') = \begin{cases} \max(i',j'), & \text{if } \min(i',j') = 0, \\ \min\bigl(\mathrm{lev}'_{a,b}(i'-1,j') + 1,\ \mathrm{lev}'_{a,b}(i',j'-1) + 1,\ \mathrm{lev}'_{a,b}(i'-1,j'-1) + 1_{(a_{i'} \neq b_{j'})}\bigr), & \text{otherwise,} \end{cases}$$

where $\mathrm{lev}'_{a,b}(i',j')$ is the pinyin edit distance between the sentence a to be corrected and the generated sentence b in the sentence pair, i' is the length of the pinyin sequence of the sentence a to be corrected, j' is the length of the pinyin sequence of the generated sentence b, and the indicator compares the i'-th and j'-th symbols of the two pinyin sequences. For example, the sentence to be corrected in the sentence pair is "he has mild spear and humor" and the generated sentence is "he has mild and humor"; the Chinese character edit distance between the sentence to be corrected and the generated sentence is 1, and the pinyin edit distance between them is 3. The edit distance feature of the sentence pair is "1, 3".
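The two edit distances can be computed with the standard Levenshtein dynamic program, sketched below. The function names are hypothetical, and the conversion of a sentence into its pinyin sequence is assumed to be done by the caller; the romanized strings in the usage note are toy inputs chosen only for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_distance_features(src_chars, gen_chars, src_pinyin, gen_pinyin):
    """Return (character edit distance, pinyin edit distance) for one pair."""
    return (edit_distance(src_chars, gen_chars),
            edit_distance(src_pinyin, gen_pinyin))
```

For instance, `edit_distance("wenmao", "wenrou")` gives 3: a single changed character can account for a character distance of 1 and a larger pinyin distance.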
The Chinese character editing distance and the pinyin editing distance between the sentence to be corrected and the generated sentence can reflect the Chinese character difference and the pinyin difference between the two sentences, and the editing distance characteristics of sentence pairs are considered when the generated sentence is scored, so that the score of the generated sentence is more accurate, and the accuracy of sentence correction is improved.
Optionally, extracting the confusion (perplexity, PPL) score feature of a sentence pair includes: obtaining a first confusion score of the sentence to be corrected in the sentence pair; obtaining a second confusion score of the generated sentence in the sentence pair; obtaining a second difference between the second confusion score and the first confusion score; and determining the first confusion score, the second confusion score, and the second difference corresponding to the sentence pair as the confusion (PPL) score feature of the sentence pair.
Optionally, the first confusion score of the sentence to be corrected in the sentence pair is obtained by using a KenLM language model. Optionally, the second confusion score of the generated sentence in the sentence pair is obtained by using a KenLM language model. For example, the sentence to be corrected in the sentence pair is "he is gentle and humorous" and the generated sentence is "he is gentle and humorous"; the first confusion score of the sentence to be corrected is 106.3, the second confusion score of the generated sentence is 42.7, and the second difference between the second confusion score and the first confusion score is -63.6. The confusion score feature of the sentence pair is "106.3, 42.7, -63.6".
The PPL score feature reflects the influence of correct and erroneous words on the overall fluency of a sentence. Taking the confusion score feature of the sentence pair into account when scoring the generated sentence therefore accounts for the overall fluency of the sentence and improves the accuracy of sentence error correction.
Optionally, obtaining the word block quantity difference feature of the sentence pair includes: acquiring the number of first word blocks of a sentence to be corrected in a sentence pair; obtaining the number of second word blocks of the sentence generated in the sentence pair; obtaining a third difference value between the number of second word blocks and the number of first word blocks in the sentence pair; and determining the third difference value corresponding to the sentence pair as a word block quantity difference characteristic of the sentence pair.
For example, the sentence to be corrected in the sentence pair is "he is warm spear and humor" and the generated sentence is "he is gentle and humor". The number of word blocks of the sentence to be corrected is 6, namely "he", "warm", "spear", "and", "humor", and the number of word blocks of the generated sentence is 5, namely "he", "warm", "gentle", "and", "humor". The third difference between the number of second word blocks and the number of first word blocks is -1, and the word block number difference feature is "-1".
When a character in a correct sentence is replaced, a correct word block containing that character is often split into several word blocks, so the number of word blocks in the generated sentence increases; when an erroneous character in an erroneous sentence is corrected, the corrected character can combine with adjacent characters to form a word block, so the number of word blocks in the generated sentence decreases. Taking the change in the number of word blocks between the original sentence and the generated sentence into account when scoring the generated sentence therefore makes the score more accurate and improves the accuracy of sentence error correction.
The sentence pair feature of a sentence pair is determined from the score feature, the edit distance feature, the confusion score feature, and the word block number difference feature of the sentence pair. In this way, scoring a generated sentence takes into account the score feature of the sentence pair, the presence of erroneous characters in the sentence to be corrected and the generated sentence, the influence of correct and erroneous characters on the overall fluency of the two sentences, and the change in the number of word blocks between them, so that the score of the generated sentence is more accurate and the accuracy of text error correction is improved.
In some embodiments, the sentence to be corrected in the sentence pair is "he is gentle and humorous" and the generated sentence is "he is gentle and humorous". The first character probability is $1.7 \times 10^{-6}$, the second character probability corresponding to the generated sentence is 0.992, and the first difference between the second character probability and the first character probability is approximately 0.992, so the score feature of the sentence pair is "$1.7 \times 10^{-6}$, 0.992, 0.992". The Chinese character edit distance between the sentence to be corrected and the generated sentence is 1 and the pinyin edit distance is 3, so the edit distance feature of the sentence pair is "1, 3". The first confusion score of the sentence to be corrected is 106.3, the second confusion score of the generated sentence is 42.7, and the second difference between them is -63.6, so the confusion score feature of the sentence pair is "106.3, 42.7, -63.6". The number of word blocks of the sentence to be corrected is 6 and that of the generated sentence is 5, so the third difference is -1 and the word block number difference feature is "-1". The sentence pair feature of the sentence pair is therefore (1.7e-6, 0.992, 0.992, 1, 3, 106.3, 42.7, -63.6, -1), which is a one-dimensional feature vector.
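The assembly of the nine-element sentence pair feature can be sketched as below. The function and parameter names are hypothetical; the probabilities, edit distances, perplexity scores, and word block counts are assumed to have been computed by the steps described earlier.

```python
def sentence_pair_features(p_src, p_gen, char_dist, pinyin_dist,
                           ppl_src, ppl_gen, blocks_src, blocks_gen):
    """Concatenate the four feature groups into one flat feature vector."""
    return [
        p_src, p_gen, p_gen - p_src,        # score feature
        char_dist, pinyin_dist,             # edit distance feature
        ppl_src, ppl_gen, ppl_gen - ppl_src,  # confusion (PPL) score feature
        blocks_gen - blocks_src,            # word block number difference
    ]
```

Feeding in the figures of the example, `sentence_pair_features(1.7e-6, 0.992, 1, 3, 106.3, 42.7, 6, 5)`, yields a vector whose differences round to 0.992, -63.6 and -1.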
Optionally, scoring each generated sentence according to the sentence pair feature of each sentence pair to obtain a score of each generated sentence, including: obtaining weight vectors of sentence pair features; and performs the following processing for each sentence pair: and scoring the generated sentences in the sentence pairs according to the sentence pair characteristics and the weight vectors of the sentence pairs to obtain the scores of the generated sentences in the sentence pairs.
Optionally, obtaining a weight vector of sentence-pair features includes: obtaining a preset sample sentence, carrying out word block replacement processing on the sample sentence, and obtaining at least one sample generation sentence corresponding to the sample sentence; combining each sample generation sentence with the sample sentences to form at least one sample sentence pair corresponding to the at least one sample generation sentence one by one, and carrying out feature extraction processing on each sample sentence pair to obtain sentence pair features of each sample sentence pair; and calculating the sentence-pair features by using a gradient descent algorithm to obtain weight vectors of the sentence-pair features.
Optionally, a predicted loss value is obtained by calculation, where L is the predicted loss value, N is the number of sample sentence pairs, $x_n$ is the sentence pair feature of the n-th sample sentence pair, $y_n$ is the label of the sample generated sentence in the n-th sample sentence pair, W is the weight vector of the sentence pair feature, T denotes the transpose operation, and n is an integer with 1 ≤ n ≤ N.
Optionally, the weight vector of the sentence pair feature is initialized randomly and is then updated iteratively during training; training stops when a preset number of training iterations is reached or the gradient value reaches a set threshold, giving the weight vector of the sentence pair feature.
Optionally, in the case that the sample generation statement is a correct statement, the label of the sample generation statement is 1; in the case where the sample generation statement is an error statement, the tag of the sample generation statement is 0.
Thus, when the sample sentence is a correct sentence, the sample generated sentences obtained by performing word block replacement on it are all erroneous sentences, and their labels are all 0. Exchanging the sample sentence and the sample generated sentence in a sample sentence pair gives an exchanged first sample sentence pair, in which the original sample sentence serves as the sample generated sentence and the original sample generated sentence serves as the sample sentence; the sample generated sentence in the exchanged first sample sentence pair is a correct sentence, so its label is 1. The sample sentences and sample generated sentences therefore need no manual labeling, which greatly reduces labeling cost.
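The automatic labeling by exchange can be illustrated with a short sketch. The function name `build_training_pairs` and the tuple layout (sentence to be corrected, generated sentence, label) are assumptions made for this example.

```python
def build_training_pairs(correct_sentence, corrupted_sentences):
    """Build labeled sample sentence pairs from one correct sample sentence.

    Each returned item is (sentence_to_correct, generated_sentence, label).
    """
    pairs = []
    for bad in corrupted_sentences:
        # Original order: the generated sentence is erroneous, so the label is 0.
        pairs.append((correct_sentence, bad, 0))
        # Exchanged order: the generated sentence is the correct one, so the label is 1.
        pairs.append((bad, correct_sentence, 1))
    return pairs
```

Every corrupted sentence thus contributes one negative and one positive example without any manual annotation.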
Optionally, the gradient value is obtained by calculation, where θ is the activation function, $x_n$ is the sentence pair feature of the n-th sample sentence pair, $y_n$ is the label of the sample generated sentence in the n-th sample sentence pair, W is the weight vector of the sentence pair feature, T denotes the transpose operation, N is the number of sample sentence pairs, and n is an integer with 1 ≤ n ≤ N. Optionally, the activation function is $\theta(s) = \frac{1}{1 + e^{-s}}$, where s is the input of the activation function and e is the natural constant.
Optionally, the weight vector of the sentence pair feature is obtained by calculating $W_{t+1} = W_t - \alpha \nabla L(W_t)$, where α is the learning rate, $\nabla L(W_t)$ is the gradient value at the t-th iteration, $W_t$ is the weight vector of the t-th iteration, and $W_{t+1}$ is the weight vector of the (t+1)-th iteration. For example, if the preset number of training iterations is t+1, the weight vector obtained in the (t+1)-th iteration is determined as the weight vector $W_o$ of the sentence pair feature.
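A minimal sketch of such a training loop is given below. It assumes a standard logistic (cross-entropy) loss with 0/1 labels and a sigmoid activation; the exact loss and gradient used by the method are not reproduced in this text, so the loss, the names `sigmoid` and `train_weights`, and all hyperparameter values are assumptions made for illustration.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function 1 / (1 + e^(-s))."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    z = math.exp(s)
    return z / (1.0 + z)


def train_weights(features, labels, lr=0.1, epochs=500, tol=1e-6, seed=0):
    """Fit a weight vector by batch gradient descent (assumed logistic loss)."""
    rng = random.Random(seed)
    dim = len(features[0])
    n = len(features)
    # Random initialization of the weight vector.
    w = [rng.uniform(-0.01, 0.01) for _ in range(dim)]
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for k in range(dim):
                grad[k] += err * x[k] / n
        # Stop early once the gradient is small enough.
        if max(abs(g) for g in grad) < tol:
            break
        # Gradient descent update: W_{t+1} = W_t - lr * gradient.
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w
```

Training stops either after the preset number of iterations (`epochs`) or when the largest gradient component falls below `tol`, mirroring the two stopping conditions mentioned in the text.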
Optionally, scoring the generated sentence in a sentence pair according to the sentence pair feature of the sentence pair and the corresponding weight vector to obtain the score of the generated sentence includes: obtaining the score of the generated sentence by calculating $score_m = W_o \cdot X_m$, where $score_m$ is the score of the generated sentence in the m-th sentence pair, $W_o$ is the weight vector corresponding to the sentence pair feature, and $X_m$ is the sentence pair feature of the m-th sentence pair.
Optionally, correcting the sentence to be corrected according to the scores of the generated sentences to obtain the corrected sentence includes: selecting the generated sentence with the highest score as the candidate sentence; and determining the candidate sentence as the corrected sentence corresponding to the sentence to be corrected. The generated sentence with the highest score is the optimal error correction sentence, so error correction sentences can be selected according to different error correction requirements. Because the error correction method takes the relation between each generated sentence and the original sentence into account, when a sentence in the text to be corrected admits several possible corrections, the final error correction sentence can be determined among the generated sentences on the basis of that relation, which improves the text correction effect.
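Scoring and selection amount to a dot product followed by an argmax, as the following sketch shows. The function names `score` and `pick_correction` and the toy weights in the usage note are hypothetical.

```python
def score(weights, features):
    """Dot product of the weight vector and one sentence pair feature."""
    return sum(w * x for w, x in zip(weights, features))


def pick_correction(generated, feature_vectors, weights):
    """Return the highest scoring generated sentence and all scores.

    generated:       list of generated sentences.
    feature_vectors: sentence pair features, aligned with `generated`.
    """
    scores = [score(weights, x) for x in feature_vectors]
    best = max(range(len(generated)), key=scores.__getitem__)
    return generated[best], scores
```

For example, with weights `[1.0, -1.0]` and features `[[0.2, 0.5], [0.9, 0.1]]`, the second generated sentence is selected because its score is the larger of the two.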
As shown in connection with fig. 3, an embodiment of the present disclosure provides another method for statement error correction, including:
Step S301, determining a subsequence formed by splicing a plurality of consecutive characters in the sentence to be corrected as a first candidate word block.
Step S302, replacing at least one character in the first candidate word block by using the confusion character in the preset confusion character set to obtain at least one second candidate word block.
Step S303, if the preset word block list does not contain the first candidate word block, replacing a first character in the sentence to be corrected with the confusion character to obtain a generated sentence of the sentence to be corrected; if the word block list contains a second candidate word block, replacing the first candidate word block in the sentence to be corrected with the second candidate word block to obtain a generated sentence of the sentence to be corrected.
Step S304, each generated sentence is respectively combined with the sentences to be corrected to form at least one sentence pair corresponding to at least one generated sentence one by one.
Step S305, carrying out feature extraction processing on each sentence pair to obtain sentence pair features of each sentence pair.
Step S306, scoring each generated sentence according to the sentence pair feature of each sentence pair to obtain the score of each generated sentence.
Step S307, selecting the generated sentence corresponding to the highest score as the candidate sentence.
Step S308, determining the alternative sentence as the sentence after error correction corresponding to the sentence to be error corrected.
Sentence pairs are formed from the sentence to be corrected and the generated sentences obtained by performing word block replacement on it. Word block replacement can be performed on the same sentence several times, in which case several sentence pairs are generated, and the generated sentences are scored using the sentence pair features of the sentence pairs, so that different error correction sentences can be selected by score according to different error correction requirements. Because the error correction method takes the relation between each generated sentence and the original sentence into account, when the sentence to be corrected admits several possible corrections, the final error correction sentence can be determined among the generated sentences on the basis of that relation, which improves the sentence correction effect.
Optionally, an optimal error correction sentence is obtained from the generated sentences corresponding to the sentence to be corrected by means of beam search, the optimal error correction sentence being the corrected sentence corresponding to the sentence to be corrected. Specifically, a search beam is initialized, that is, beam = [Root]. Each correction fragment in the beam is traversed, where a correction fragment is a run of consecutive characters of the sentence to be corrected that includes the first character of that sentence; when a correction fragment contains only one character, that character is the first character of the sentence to be corrected. While the length of a correction fragment differs from the length of the sentence to be corrected, the following operations are performed on the first character after the correction fragment, until the two lengths are the same: splicing the first character after the correction fragment with a number of consecutive characters to obtain a first candidate sequence; replacing at least one character in the first candidate sequence with a confusion character from the preset confusion character set to obtain at least one second candidate sequence; if the word block list does not contain the first candidate sequence, determining the first character of the first candidate sequence as a candidate word block; if the word block list contains at least one second candidate sequence, determining the at least one second candidate sequence as a candidate word block; replacing the characters at the corresponding positions in the sentence to be corrected with each candidate word block to obtain at least one candidate generated sentence; obtaining the score of each candidate generated sentence; determining the candidate word block corresponding to the candidate generated sentence with the highest score as an optimal candidate word block; splicing the optimal candidate word blocks after the respective correction fragments; and storing each spliced correction fragment in the beam.
Then, each correction fragment whose length equals that of the sentence to be corrected is determined as a generated sentence corresponding to the sentence to be corrected; each generated sentence is combined with the sentence to be corrected to form sentence pairs in one-to-one correspondence with the generated sentences; feature extraction is performed on each sentence pair to obtain the sentence pair feature of each sentence pair; each generated sentence is scored according to the sentence pair feature of its sentence pair to obtain the score of each generated sentence; and the generated sentence with the highest score is determined as the optimal error correction sentence, that is, the corrected sentence corresponding to the sentence to be corrected.
In some embodiments, as shown in Fig. 4, the optimal error correction sentence is obtained from the generated sentences of the sentence to be corrected by means of beam search. First, the sentence to be corrected (input), "he is a gentle spear and a gentle man", is input and beam = [Root] is initialized. Each correction fragment in the beam is then traversed to generate candidate word blocks; the candidate word blocks replace the characters at the corresponding positions in the sentence to be corrected to obtain candidate generated sentences; each candidate generated sentence is combined with the sentence to be corrected to form a sentence pair; the score of each candidate generated sentence is obtained from the sentence pair feature of its sentence pair; the candidate word block corresponding to the candidate generated sentence with the highest score is determined as the optimal candidate word block; and the optimal candidate word blocks are spliced onto the correction fragments to obtain the updated set of correction fragments, beam. The beam is updated in this way until the correction fragments in it are as long as the sentence to be corrected. The correction fragments whose length equals that of the sentence to be corrected are determined as the generated sentences corresponding to the sentence to be corrected, and the score of each generated sentence is obtained. The generated sentence with the highest score is determined as the optimal error correction sentence, that is, the corrected sentence corresponding to the sentence to be corrected, and the corrected (correction) sentence "he is a gentle and humorous person" is output.
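The beam search procedure can be approximated by the simplified sketch below. It is an illustration under assumptions rather than the procedure of Fig. 4: `candidates_at` and `score_sentence` are hypothetical callables supplied by the caller, `beam_width` is an assumed parameter, and every candidate word block is assumed to occupy the same number of characters as the span it replaces, so that positions in the fragment and in the sentence stay aligned.

```python
def beam_search(sentence, candidates_at, score_sentence, beam_width=2):
    """Left-to-right beam search over correction fragments (sketch).

    candidates_at(sentence, pos): candidate word blocks starting at pos;
        assumed to include the original character sentence[pos].
    score_sentence(source, generated): score of a candidate generated sentence.
    """
    beam = [""]      # Root: the empty correction fragment
    finished = []
    while beam:
        extended = []
        for frag in beam:
            pos = len(frag)
            if pos == len(sentence):
                finished.append(frag)   # fragment already covers the sentence
                continue
            for block in candidates_at(sentence, pos):
                new_frag = frag + block
                if len(new_frag) > len(sentence):
                    continue            # block would run past the end
                # Complete the fragment with the untouched tail before scoring.
                full = new_frag + sentence[len(new_frag):]
                extended.append((score_sentence(sentence, full), new_frag))
        extended.sort(key=lambda item: item[0], reverse=True)
        beam = [frag for _, frag in extended[:beam_width]]
    return max(finished, key=lambda s: score_sentence(sentence, s))
```

In practice `score_sentence` would wrap feature extraction and the weighted score of a sentence pair; here it is left abstract so that the search logic can be read on its own.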
Referring to fig. 5, an embodiment of the disclosure provides an apparatus for statement error correction, including: a replacing module 501, a sentence pair generating module 502, a feature extraction module 503, a scoring module 504, and an error correction module 505. The replacing module 501 is configured to perform word block replacement processing on the statement to be corrected to obtain at least one generated statement; the sentence pair generating module 502 is configured to combine each generated statement with the statement to be corrected to form at least one sentence pair in one-to-one correspondence with the at least one generated statement; the feature extraction module 503 is configured to perform feature extraction processing on each sentence pair to obtain the sentence pair features of each sentence pair; the scoring module 504 is configured to score each generated statement according to the sentence pair features of each sentence pair to obtain a score for each generated statement; and the error correction module 505 is configured to correct the statement to be corrected according to the scores to obtain an error-corrected statement.
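As a rough illustration of how the five modules could be wired together, the sketch below composes them as plain callables. The class name `SentenceCorrector`, its field names, and the fallback to the input when no statement is generated are assumptions made for this example; only the module numbering 501 to 505 comes from the description.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class SentenceCorrector:
    replace: Callable[[str], List[str]]             # replacing module 501
    extract: Callable[[str, str], Sequence[float]]  # feature extraction module 503
    score: Callable[[Sequence[float]], float]       # scoring module 504

    def correct(self, source: str) -> str:
        generated = self.replace(source)                     # 501: generated statements
        pairs = [(source, g) for g in generated]             # 502: one pair per statement
        features = [self.extract(s, g) for s, g in pairs]    # 503: sentence pair features
        scores = [self.score(f) for f in features]           # 504: one score per statement
        if not scores:
            return source  # assumption: nothing generated, keep the input
        best = max(range(len(scores)), key=scores.__getitem__)
        return pairs[best][1]                                # 505: highest score wins


corrector = SentenceCorrector(
    replace=lambda s: [s, s.replace("teh", "the")],      # hypothetical generator
    extract=lambda s, g: [float("the" in g.split())],    # hypothetical single feature
    score=lambda f: f[0],                                # hypothetical scorer
)
print(corrector.correct("teh cat"))
```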
With the statement error correction apparatus provided by the embodiments of the disclosure, sentence pairs are formed from the statement to be corrected and the generated statements obtained by word block replacement processing. Word block replacement processing can therefore be performed on the same statement multiple times, producing multiple sentence pairs, and each generated statement is scored according to the sentence pair features of its sentence pair, so that different error correction statements can be selected by score to suit different error correction requirements. Because the scoring takes into account the relation between each generated statement and the original statement, the final error correction statement can be determined among the generated statements on the basis of that relation, which improves the error correction effect on the statement.
Optionally, the replacing module performs word block replacement processing on the statement to be corrected to obtain at least one generated statement in the following manner: acquiring a first alternative word block, wherein the first alternative word block is formed by splicing a plurality of consecutive characters in the statement to be corrected; replacing at least one character in the first alternative word block with a mixed character from a preset mixed character set to obtain at least one second alternative word block; if the preset word block list does not contain the first alternative word block, replacing a first character in the statement to be corrected with a mixed character from the mixed character set to obtain a generated statement of the statement to be corrected, wherein the first character is any character in the first alternative word block; and if the preset word block list contains a second alternative word block, replacing the first alternative word block in the statement to be corrected with that second alternative word block to obtain a generated statement of the statement to be corrected.
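The two replacement branches above can be illustrated with a small sketch. It makes several assumptions that the description does not state: the mixed character set is modelled as a mapping from a character to the characters it may be replaced by, only one character is replaced per second alternative word block, and the name `generate_sentences` and its parameters are hypothetical.

```python
from typing import Dict, List, Set


def generate_sentences(
    sentence: str,
    start: int,
    length: int,
    mixed_chars: Dict[str, List[str]],  # assumed shape of the mixed character set
    word_blocks: Set[str],              # the preset word block list
) -> List[str]:
    """Generate statements by replacing the block sentence[start:start+length]."""
    first = sentence[start:start + length]  # first alternative word block
    generated: List[str] = []

    # Second alternative word blocks: replace one character of the first block.
    seconds = [
        first[:i] + alt + first[i + 1:]
        for i, ch in enumerate(first)
        for alt in mixed_chars.get(ch, [])
    ]

    if first not in word_blocks:
        # Character-level branch: replace any single character of the first block.
        for i, ch in enumerate(first):
            for alt in mixed_chars.get(ch, []):
                pos = start + i
                generated.append(sentence[:pos] + alt + sentence[pos + 1:])

    for second in seconds:
        if second in word_blocks:
            # Block-level branch: swap the whole first block for a listed second block.
            generated.append(sentence[:start] + second + sentence[start + length:])

    return list(dict.fromkeys(generated))  # drop duplicates, keep order


print(generate_sentences("a cot sat", 2, 3, {"o": ["a", "u"]}, {"cat", "cut", "sat"}))
```

With single-character replacement the two branches can yield the same string, which is why the sketch removes duplicates before returning.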
Optionally, the feature extraction module performs feature extraction processing on each sentence pair to obtain a sentence pair feature of each sentence pair in the following manner: the following processing is performed for each sentence pair: extracting one or more of score features, edit distance features, confusion score features and word block quantity difference features of sentence pairs; one or more of the score feature, the edit distance feature, the confusion score feature and the word block number difference feature of the sentence pair are determined as the sentence pair feature of the sentence pair.
Optionally, the feature extraction module extracts the score feature of a sentence pair in the following manner: acquiring a first character probability at the character position in the sentence to be corrected in the sentence pair at which word block replacement processing is performed; acquiring a second character probability at the same character position in the generated sentence in the sentence pair after the word block replacement; calculating a first difference between the second character probability and the first character probability; and determining the first character probability, the second character probability, and the first difference as the score feature of the sentence pair.
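A minimal sketch of the score feature follows. The description does not say where the character probabilities come from, so the `char_prob` callable (for instance a language model queried at one position) and the lookup table used in the demonstration are assumptions, as is the function name `score_feature`.

```python
from typing import Callable, Tuple


def score_feature(
    source: str,
    generated: str,
    position: int,
    char_prob: Callable[[str, int], float],  # hypothetical probability source
) -> Tuple[float, float, float]:
    """Return (first probability, second probability, second minus first)."""
    first = char_prob(source, position)      # probability of the original character
    second = char_prob(generated, position)  # probability of the replacement character
    return first, second, second - first


# Hypothetical probabilities for the demonstration only.
table = {("teh cat", 1): 0.25, ("the cat", 1): 0.75}
print(score_feature("teh cat", "the cat", 1, lambda s, i: table[(s, i)]))
```

A positive difference indicates that the replacement character is judged more likely in context than the original one.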
Optionally, the feature extraction module extracts the edit distance feature of a sentence pair in the following manner: calculating the Chinese character edit distance between the generated sentence and the sentence to be corrected in the sentence pair; converting the generated sentence and the sentence to be corrected in the sentence pair into pinyin sequences, and calculating the pinyin edit distance between the generated sentence and the sentence to be corrected based on the converted pinyin sequences; and determining the Chinese character edit distance and the pinyin edit distance corresponding to the sentence pair as the edit distance feature of the sentence pair.
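The two distances can be sketched with a standard Levenshtein routine applied once to characters and once to pinyin syllables. The description does not specify the distance algorithm or the pinyin conversion, so both are assumptions here; the tiny `PINYIN` table is a hypothetical stand-in for a real converter, and tones are ignored.

```python
from typing import Dict, List, Sequence, Tuple


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Classic edit distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


# Hypothetical toneless pinyin table covering only the demonstration characters.
PINYIN: Dict[str, str] = {"我": "wo", "再": "zai", "在": "zai", "家": "jia"}


def to_pinyin(sentence: str) -> List[str]:
    return [PINYIN.get(ch, ch) for ch in sentence]


def edit_distance_feature(source: str, generated: str) -> Tuple[int, int]:
    """Return (Chinese character edit distance, pinyin edit distance)."""
    char_distance = levenshtein(source, generated)
    pinyin_distance = levenshtein(to_pinyin(source), to_pinyin(generated))
    return char_distance, pinyin_distance


print(edit_distance_feature("我再家", "我在家"))
```

In this example the two sentences differ in one character but share the same toneless pinyin, which is the kind of homophone substitution the pinyin distance is able to expose.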
Optionally, the feature extraction module extracts the confusion score feature of a sentence pair in the following manner: obtaining a first confusion score of the sentence to be corrected in the sentence pair; obtaining a second confusion score of the generated sentence in the sentence pair; obtaining a second difference between the second confusion score and the first confusion score; and determining the first confusion score, the second confusion score, and the second difference corresponding to the sentence pair as the confusion score feature of the sentence pair.
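The "confusion score" here appears to correspond to what is usually called perplexity; that reading is an assumption, as is the character-level unigram model used below, which is chosen only to keep the sketch self-contained. The names `unigram_perplexity` and `confusion_feature` and the `floor` parameter are hypothetical.

```python
import math
from typing import Dict, Tuple


def unigram_perplexity(sentence: str, probs: Dict[str, float], floor: float = 1e-6) -> float:
    """Perplexity under a character unigram model; unseen characters get `floor`."""
    log_sum = sum(math.log(probs.get(ch, floor)) for ch in sentence)
    return math.exp(-log_sum / max(len(sentence), 1))


def confusion_feature(
    source: str, generated: str, probs: Dict[str, float]
) -> Tuple[float, float, float]:
    """Return (first score, second score, second minus first)."""
    first = unigram_perplexity(source, probs)
    second = unigram_perplexity(generated, probs)
    return first, second, second - first


# Hypothetical character probabilities for the demonstration only.
probs = {"a": 0.5, "b": 0.25}
print(confusion_feature("ab", "aa", probs))
```

Under this reading, a negative difference means the generated sentence is more fluent than the sentence to be corrected according to the model.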
Optionally, the feature extraction module obtains the word block number difference feature of a sentence pair in the following manner: acquiring a first word block number, namely the number of word blocks in the sentence to be corrected in the sentence pair; acquiring a second word block number, namely the number of word blocks in the generated sentence in the sentence pair; obtaining a third difference between the second word block number and the first word block number; and determining the third difference corresponding to the sentence pair as the word block number difference feature of the sentence pair.
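The word block count difference can be sketched once a segmenter is fixed. The description does not name one, so the greedy forward maximum matching segmenter below, the `max_len` parameter, and the function names are assumptions for illustration.

```python
from typing import List, Set


def segment(sentence: str, word_blocks: Set[str], max_len: int = 4) -> List[str]:
    """Greedy forward maximum matching; unknown characters become single blocks."""
    blocks: List[str] = []
    i = 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + n]
            if n == 1 or piece in word_blocks:
                blocks.append(piece)
                i += n
                break
    return blocks


def block_count_difference(source: str, generated: str, word_blocks: Set[str]) -> int:
    """Second word block number minus first word block number."""
    return len(segment(generated, word_blocks)) - len(segment(source, word_blocks))


vocab = {"the", "cat"}
print(segment("tehcat", vocab), segment("thecat", vocab))
print(block_count_difference("tehcat", "thecat", vocab))
```

In this toy case the misspelled input breaks into more blocks than the corrected output, so the difference is negative; whether that pattern holds in general depends on the segmenter and vocabulary.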
Optionally, the scoring module scores each generated sentence according to the sentence pair features of each sentence pair to obtain a score of each generated sentence in the following manner: obtaining a weight vector for the sentence pair features; and performing the following processing for each sentence pair: scoring the generated sentence in the sentence pair according to the sentence pair features of the sentence pair and the weight vector to obtain the score of the generated sentence in the sentence pair.
Optionally, the error correction module performs error correction on the statement to be corrected according to the score to obtain an error corrected statement by: selecting a generated sentence corresponding to the highest score as an alternative sentence; and determining the alternative statement as a statement after error correction corresponding to the statement to be error corrected.
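Scoring with a weight vector and selecting the highest score can be sketched as below. The description does not state how features and weights are combined; a plain dot product is assumed here, and the function names, feature values and weights are hypothetical.

```python
from typing import Dict, Sequence


def score_sentence(features: Sequence[float], weights: Sequence[float]) -> float:
    """Assumed linear scoring: dot product of sentence pair features and weights."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights))


def pick_best(candidates: Dict[str, Sequence[float]], weights: Sequence[float]) -> str:
    """Return the generated sentence whose sentence pair features score highest."""
    return max(candidates, key=lambda sent: score_sentence(candidates[sent], weights))


# Hypothetical features per generated sentence: [probability gain, edit distance].
candidates = {"teh cat": [0.0, 0.0], "the cat": [0.5, 1.0], "tie cat": [0.25, 2.0]}
weights = [1.0, -0.25]  # reward probability gain, penalise large edits
print(pick_best(candidates, weights))
```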
As shown in fig. 6, an embodiment of the present disclosure provides an electronic device including a processor 600 and a memory 601 storing program instructions. Optionally, the electronic device may also include a communication interface 602 and a bus 603. The processor 600, the communication interface 602, and the memory 601 may communicate with each other via the bus 603. The communication interface 602 may be used for information transfer. The processor 600 may call the program instructions in the memory 601 to perform the statement error correction method of the above-described embodiments.
Further, the program instructions in the memory 601 described above may be implemented in the form of software functional units and, when sold or used as a separate product, may be stored in a computer readable storage medium.
The memory 601 serves as a computer readable storage medium, and may be used to store a software program, a computer executable program, and program instructions/modules corresponding to the methods in the embodiments of the present disclosure. The processor 600 executes the functional applications and data processing by running the program instructions/modules stored in the memory 601, i.e. implements the method of sentence correction in the above-described embodiments.
The memory 601 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for functions; the storage data area may store data created according to the use of the terminal device, etc. In addition, the memory 601 may include a high-speed random access memory, and may also include a nonvolatile memory.
Optionally, the electronic device is a computer.
An embodiment of the disclosure provides a storage medium storing program instructions which, when run, perform the above statement error correction method.
The disclosed embodiments provide a computer program product comprising a computer program stored on a computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the above-described method for sentence correction.
The computer readable storage medium may be a transitory computer readable storage medium or a non-transitory computer readable storage medium.
Embodiments of the present disclosure may be embodied in a software product stored on a storage medium and including one or more instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of a method according to embodiments of the present disclosure. The aforementioned storage medium may be a non-transitory storage medium, including any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk; or it may be a transitory storage medium.
The above description and the drawings illustrate embodiments of the disclosure sufficiently to enable those skilled in the art to practice them. Other embodiments may involve structural, logical, electrical, process, and other changes. The embodiments represent only possible variations. Individual components and functions are optional unless explicitly required, and the sequence of operations may vary. Portions and features of some embodiments may be included in, or substituted for, those of others. Moreover, the terminology used in the present application is for the purpose of describing embodiments only and is not intended to limit the claims. As used in the description of the embodiments and the claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Similarly, the term "and/or" as used in this application is meant to encompass any and all possible combinations of one or more of the associated listed items. Furthermore, when used in this application, the terms "comprises," "comprising," and/or "includes," and variations thereof, mean that the stated features, integers, steps, operations, elements, and/or components are present, but that the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof is not precluded. Without further limitation, an element defined by the phrase "comprising one …" does not exclude the presence of other like elements in a process, method or apparatus comprising such elements. In this context, each embodiment may be described with emphasis on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. For the methods, products, etc. disclosed in the embodiments, if they correspond to the method sections disclosed in the embodiments, reference may be made to the description of the method sections for the relevant details.
Those of skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. The skilled artisan may use different methods for each particular application to achieve the described functionality, but such implementation should not be considered to be beyond the scope of the embodiments of the present disclosure. It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the embodiments disclosed herein, the disclosed methods and products (including but not limited to devices, apparatuses, etc.) may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units may be merely a division by logical function, and other divisions are possible in an actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interface, apparatus, or unit, and may be electrical, mechanical, or of another form. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to implement the present embodiment. In addition, the functional units in the embodiments of the present disclosure may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. In the description corresponding to the flowcharts and block diagrams in the figures, operations or steps corresponding to different blocks may also occur in different orders than that disclosed in the description, and sometimes no specific order exists between different operations or steps. For example, two consecutive operations or steps may actually be performed substantially in parallel, they may sometimes be performed in reverse order, which may be dependent on the functions involved. Each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Claims (12)

1. A method for statement correction, comprising:
performing word block replacement processing on the statement to be corrected to obtain at least one generated statement;
combining each generated sentence with the sentence to be corrected to form at least one sentence pair corresponding to the at least one generated sentence one by one;
extracting features of each sentence pair to obtain sentence pair features of each sentence pair;
scoring each generated sentence according to sentence pair characteristics of each sentence pair to obtain the score of each generated sentence;
and correcting the statement to be corrected according to the score to obtain the statement after correction.
2. The method of claim 1, wherein performing a word block replacement process on the statement to be corrected to obtain at least one generated statement comprises:
acquiring a first alternative word block, wherein the first alternative word block is formed by splicing a plurality of continuous characters in the sentence to be corrected;
replacing at least one character in the first alternative word block with a mixed character from a preset mixed character set to obtain at least one second alternative word block;
if the first alternative word block is not contained in a preset word block list, replacing a first character in the statement to be corrected with a mixed character from the mixed character set to obtain a generated statement of the statement to be corrected, wherein the first character is any character in the first alternative word block;
if the second alternative word block is contained in the preset word block list, replacing the first alternative word block in the statement to be corrected with the second alternative word block to obtain a generated statement of the statement to be corrected.
3. The method of claim 1, wherein performing feature extraction processing on each of the sentence pairs to obtain sentence pair features of each of the sentence pairs, comprises:
the following processing is performed for each of the sentence pairs:
extracting one or more of score features, edit distance features, confusion score features and word block quantity difference features of the sentence pairs;
and determining one or more of the score feature, the editing distance feature, the confusion degree score feature and the word block quantity difference feature of the sentence pair as the sentence pair feature of the sentence pair.
4. The method of claim 3, wherein extracting the score feature of the sentence pair comprises:
acquiring a first character probability at the character position in the sentence to be corrected in the sentence pair at which word block replacement processing is performed;
acquiring a second character probability at the character position in the generated sentence in the sentence pair after the word block replacement;
calculating a first difference between the second character probability and the first character probability;
and determining the first character probability, the second character probability and the first difference as the score feature of the sentence pair.
5. The method of claim 3, wherein extracting edit distance features of the sentence pairs comprises:
calculating the Chinese character editing distance between the generated sentence and the sentence to be corrected in the sentence pair;
converting the generated sentence and the sentence to be corrected in the sentence pair into pinyin sequences, and calculating the pinyin edit distance between the generated sentence and the sentence to be corrected in the sentence pair based on the converted pinyin sequences;
and determining the Chinese character editing distance and the pinyin editing distance corresponding to the sentence pair as the editing distance characteristic of the sentence pair.
6. The method of claim 3, wherein extracting the confusion-score feature for the sentence pair comprises:
acquiring a first confusion score of the sentence to be corrected in the sentence pair;
acquiring a second confusion score of the generated sentence in the sentence pair;
obtaining a second difference between the second confusion score and the first confusion score in the sentence pair;
and determining the first confusion score, the second confusion score and the second difference corresponding to the sentence pair as the confusion score feature of the sentence pair.
7. The method of claim 3, wherein obtaining a word block number difference feature for the sentence pair comprises:
acquiring the number of first word blocks of the sentence to be corrected in the sentence pair;
acquiring the number of second word blocks of the generated sentence in the sentence pair;
obtaining a third difference value between the number of second word blocks and the number of first word blocks in the sentence pair;
and determining the third difference value corresponding to the sentence pair as a word block quantity difference characteristic of the sentence pair.
8. The method according to any one of claims 1 to 7, wherein scoring each of the generated sentences according to sentence pair characteristics of each of the sentence pairs to obtain a score for each of the generated sentences comprises:
acquiring weight vectors of the sentence pair features; and for each of the sentence pairs, performing the following:
and scoring the generated sentences in the sentence pairs according to the sentence pair characteristics and the weight vectors of the sentence pairs to obtain the scores of the generated sentences in the sentence pairs.
9. The method according to any one of claims 1 to 7, wherein correcting the statement to be corrected according to the score to obtain a corrected statement comprises:
selecting a generated sentence corresponding to the highest score as an alternative sentence;
and determining the alternative statement as the error-corrected statement corresponding to the statement to be corrected.
10. An apparatus for sentence correction, comprising:
the replacing module is configured to perform word block replacing processing on the statement to be corrected to obtain at least one generated statement;
the sentence pair generating module is configured to combine each generating sentence with the sentence to be corrected respectively to form at least one sentence pair corresponding to the at least one generating sentence one by one;
the feature extraction module is configured to perform feature extraction processing on each sentence pair to obtain sentence pair features of each sentence pair;
the scoring module is configured to score each generated sentence according to sentence pair characteristics of each sentence pair, and obtain the score of each generated sentence;
and the error correction module is configured to correct the statement to be corrected according to the score to obtain an error-corrected statement.
11. An electronic device comprising a processor and a memory storing program instructions, wherein the processor is configured to perform the method of statement correction as claimed in any one of claims 1 to 9 when the program instructions are executed.
12. A storage medium storing program instructions which, when executed, perform the method of statement correction as claimed in any one of claims 1 to 9.
CN202210041159.2A 2022-01-14 2022-01-14 Statement error correction method and device, electronic equipment and storage medium Pending CN116484842A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210041159.2A CN116484842A (en) 2022-01-14 2022-01-14 Statement error correction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210041159.2A CN116484842A (en) 2022-01-14 2022-01-14 Statement error correction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116484842A true CN116484842A (en) 2023-07-25

Family

ID=87214239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210041159.2A Pending CN116484842A (en) 2022-01-14 2022-01-14 Statement error correction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116484842A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592468A (en) * 2024-01-19 2024-02-23 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN110489760B (en) Text automatic correction method and device based on deep neural network
CN106598939B (en) A kind of text error correction method and device, server, storage medium
US6879951B1 (en) Chinese word segmentation apparatus
CN109960728B (en) Method and system for identifying named entities of open domain conference information
JP2013117978A (en) Generating method for typing candidate for improvement in typing efficiency
CN116151132B (en) Intelligent code completion method, system and storage medium for programming learning scene
EP2447854A1 (en) Method and system of automatic diacritization of Arabic
CN112231451B (en) Reference word recovery method and device, conversation robot and storage medium
KR20230009564A (en) Learning data correction method and apparatus thereof using ensemble score
CN110532569B (en) Data collision method and system based on Chinese word segmentation
CN111401012A (en) Text error correction method, electronic device and computer readable storage medium
JP2007156545A (en) Symbol string conversion method, word translation method, its device, its program and recording medium
CN116484842A (en) Statement error correction method and device, electronic equipment and storage medium
CN114298010A (en) Text generation method integrating dual-language model and sentence detection
CN113918031A (en) System and method for Chinese punctuation recovery using sub-character information
CN115688703B (en) Text error correction method, storage medium and device in specific field
CN111680146A (en) Method and device for determining new words, electronic equipment and readable storage medium
JP3309174B2 (en) Character recognition method and device
JP7102710B2 (en) Information generation program, word extraction program, information processing device, information generation method and word extraction method
CN114462427A (en) Machine translation method and device based on term protection
CN114048733A (en) Training method of text error correction model, and text error correction method and device
CN111695350B (en) Word segmentation method and word segmentation device for text
CN114330375A (en) Term translation method and system based on fixed paradigm
KR100910275B1 (en) Method and apparatus for automatic extraction of transliteration pairs in dual language documents
JP5057916B2 (en) Named entity extraction apparatus, method, program, and recording medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination