CN116090441B - Chinese spelling error correction method integrating local semantic features and global semantic features - Google Patents
- Publication number: CN116090441B (application CN202211740208.8A)
- Authority
- CN
- China
- Prior art keywords
- word
- error correction
- sentence
- candidate
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F40/00—Handling natural language data; G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
- G06F40/216—Parsing using statistical methods
- G06F40/242—Dictionaries
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30—Semantic analysis
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a Chinese spelling error correction method integrating local semantic features and global semantic features, comprising the following steps: for a document, a sentence-dividing module produces a collection of sentences; for each sentence, a pipeline error correction model and an end-to-end error correction model produce correction suggestions; to prevent correct words from being erroneously corrected, the suggestions are screened by an error correction filtering module; finally, a model fusion module combines the outputs of the end-to-end and pipeline error correction models to produce the final corrected sentence and corrected document. The invention offers a wide error correction range and high error correction accuracy, among other advantages.
Description
Technical Field
The invention relates to the field of the Internet, and in particular to a Chinese spelling error correction method integrating local semantic features and global semantic features.
Background
Chinese spelling correction is an important technology for automatic sentence checking and correction in text proofreading; it aims to improve word correctness and reduce the cost of manual verification. In government affairs, media, law, education and other industries, manuscript writing occupies an important position, and traditional manual proofreading involves a huge workload, so an intelligent and accurate error correction system has broad application prospects. To handle the complex and varied semantics of Chinese text, a Chinese spelling error correction system that fuses local semantic features and global semantic features is designed.
Therefore, it is necessary to provide a new technical solution.
Disclosure of Invention
To solve the above technical problems in the prior art, the invention discloses a Chinese spelling error correction method integrating local semantic features and global semantic features; the specific technical scheme is as follows:
the invention provides a Chinese spelling error correction method integrating local semantic features and global semantic features, which comprises the following steps:
for a document, a sentence-dividing module produces a collection of sentences;
for each sentence, a pipeline error correction model and an end-to-end error correction model produce correction suggestions;
to prevent correct words from being erroneously corrected, the suggestions are screened by an error correction filtering module;
finally, a model fusion module combines the outputs of the end-to-end error correction model and the pipeline error correction model to produce the final corrected sentence and corrected document.
Further, the end-to-end error correction model uses a feature encoder based on the Transformer architecture to obtain a semantic vector representation of each character in the sentence, feeds the representation into a feedforward neural network that predicts over the vocabulary, and introduces constraint rules based on phonetic similarity and glyph similarity at the output end; for each position in the original sentence, the feedforward network picks the top-k candidate characters with the highest probability in the prediction vocabulary,
the candidate characters are traversed in descending order of probability:
if the original character is a punctuation mark, maintaining the original character;
if the predicted character is the original character or is not in the vocabulary, maintaining the original character;
if the pronunciation and glyph similarity between the predicted character and the original character are within the threshold, the predicted character is taken as the answer; otherwise the traversal continues. If the traversal finishes without finding a character that satisfies the conditions, the original character is kept.
Furthermore, the pipeline error correction model adopts a three-stage error correction method of error detection, candidate recall, and candidate ranking.
Further, the error detection combines a method based on a local semantic model with a method based on a global semantic model,
the method based on the local semantic model comprises the following steps:
S1. Mine a bi-gram dictionary from a large-scale domain corpus and count word frequencies. After the sentence to be predicted is segmented, a sequence of bi-gram words is obtained; if a bi-gram word is not in the dictionary and its frequency is smaller than a set threshold, the word is considered possibly wrong and is added to the candidate wrong words;
S2. Train a 5-gram language model on a large-scale domain corpus. For a given Chinese character string, if the sentence contains an error, the erroneous characters will appear as consecutive single characters after Chinese word segmentation. In the bi-gram model the probability of a character depends only on the character immediately before it, so the probability of the string is approximated by the product of a series of conditional probabilities:

P(c_1 c_2 ... c_L) ≈ ∏_{l=2..L} P(c_l | c_{l-1})

The probability of each term in the above equation may be calculated from its maximum likelihood estimate:

P(c_l | c_{l-1}) = N(c_{l-1} c_l) / N(c_{l-1})

where N(c_{l-1} c_l) and N(c_{l-1}) respectively denote the number of occurrences of the corresponding character strings in the given corpus,
if the probability of a bi-gram word is greater than the set threshold, the word is added to the candidate wrong words.
Further, the candidate recall includes pronunciation-similar recall and glyph-similar recall.
The pronunciation-similar recall builds a similar-pinyin dictionary from the bi-gram word library mined from the large-scale corpus and, for each error candidate word, is divided into whole-word recall and per-character recall:
whole-word recall looks up all pinyin readings of the word in a word-pinyin recall library, then finds all words with each reading and adds them to the candidate set;
per-character recall obtains all pinyin readings of each character, combines them into all possible word readings, finds all words for each reading in the pinyin library, and adds them to the candidate set.
The glyph-similar recall recalls similar characters for each character of the word from a similar-glyph library, combines the characters into similar words, computes the similarity between each similar word and the error candidate word, and adds a similar word to the candidate set if its similarity exceeds a set threshold.
Further, for the recalled candidate set of each wrong word, the candidate ranking obtains correction suggestion words through coarse ranking followed by fine ranking.
In the coarse ranking, an ngram-score is computed for each word using the ngram language model;
the candidates are sorted by ngram-score from high to low, the top-k candidate words are kept, and fine ranking is then performed;
in the fine ranking, the sentence score is computed from the perplexity (ppl), which evaluates the fluency of the sentence after a correction suggestion word is substituted: the smaller the ppl value, the more fluent the sentence, and the better the correction suggestion word.
The perplexity is computed as:

ppl(S) = P(w_1 w_2 ... w_N)^(-1/N) = ( ∏_{i=1..N} p(w_i | w_1 w_2 ... w_{i-1}) )^(-1/N)

where S is the current sentence, N is the sentence length, p(w_i) is the probability of the i-th word, and p(w_i | w_1 w_2 ... w_{i-1}) is the probability of the i-th word given the previous i-1 words,
and the ppl values of the candidate words are sorted from low to high, with different thresholds set according to sentence length; the sorted results are traversed, and if the current value is smaller than the threshold, the current word is added to the error correction suggestion set. Finally, the word ranked first (lowest ppl) in the suggestion set is taken as the final error correction suggestion word.
Further, the error correction filtering module includes the following steps:
1) Calculating the ngram score and the confusion degree using the deep language model and the statistical language model respectively;
2) Averaging the ngram scores and the confusion degrees obtained by the two models to obtain a mean ngram score and a mean confusion degree;
3) Taking the maximum value of the ngram score;
4) Taking the minimum value of the confusion degree;
5) Multiplying the minimum confusion degree by the length of the word, and calculating the confusion degree difference between the two models;
6) Calculating the score of the word: score = mean confusion degree × word length − (mean ngram score × confusion degree difference / maximum ngram score).
Further, the model fusion module selects the word with the smallest final score computed by the error correction filtering module as the output.
The invention has the following beneficial effects:
1. The Chinese spelling error correction method integrating local semantic features and global semantic features improves word correctness and reduces the cost of manual verification.
2. By obtaining correction suggestions from both a pipeline error correction model and an end-to-end error correction model, the method covers a wider range of errors.
3. By screening correction suggestions through the error correction filtering module, corrections become more accurate: the error correction accuracy reaches 94.53%, the recall rate 87.22%, and the false-correction rate 2.97%. Accuracy = number of correct predictions / total number of test samples; recall = number of erroneous samples detected / number of samples containing errors; false-correction rate = number of errors predicted in error-free samples / total number of error-free samples.
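The three evaluation metrics defined above can be sketched as a short computation. The sample data below is hypothetical and only illustrates the formulas, not the patent's reported test set.

```python
# Illustrative computation of the metrics defined above; the sample
# list of (has_error, predicted_error) pairs is made up for the demo.
def evaluate(results):
    """results: list of (has_error, predicted_error) booleans, one per sample."""
    total = len(results)
    correct = sum(1 for has, pred in results if has == pred)
    with_error = [(h, p) for h, p in results if h]
    without_error = [(h, p) for h, p in results if not h]
    accuracy = correct / total
    recall = sum(1 for _, p in with_error if p) / len(with_error)
    false_correction = sum(1 for _, p in without_error if p) / len(without_error)
    return accuracy, recall, false_correction

samples = [(True, True), (True, False), (False, False), (False, True), (True, True)]
print(evaluate(samples))
```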
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
To more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required for the embodiments are briefly described below. The drawings in the following description show only some embodiments of the invention; other drawings may be derived from them by a person skilled in the art without inventive effort.
Fig. 1 is a flow chart of a system provided in an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
In the description of the present invention, it should be understood that the directions or positional relationships indicated by the terms "upper", "lower", "top", "bottom", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communicated with the inside of two elements or the interaction relationship of the two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
The invention discloses a Chinese spelling error correction method integrating local semantic features and global semantic features, which refers to FIG. 1 and comprises the following steps:
for a document, a sentence-dividing module produces a collection of sentences;
for each sentence, a pipeline error correction model and an end-to-end error correction model produce correction suggestions;
to prevent correct words from being erroneously corrected, the suggestions are screened by an error correction filtering module;
finally, a model fusion module combines the outputs of the end-to-end error correction model and the pipeline error correction model to produce the final corrected sentence and corrected document.
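The four steps above can be sketched as follows. `split_sentences` is a plausible sentence-dividing rule, and the four callbacks (`pipeline_correct`, `end2end_correct`, `filter_suggestions`, `fuse`) are hypothetical stand-ins for the modules named in the text, not the patent's actual implementation.

```python
# Minimal sketch of the four-step correction flow, with the patent's
# modules represented as caller-supplied functions.
import re

def split_sentences(document: str):
    """Sentence-dividing module: split after Chinese/Western sentence enders."""
    return [s for s in re.split(r"(?<=[。！？!?.])", document) if s.strip()]

def correct_document(document, pipeline_correct, end2end_correct,
                     filter_suggestions, fuse):
    corrected = []
    for sentence in split_sentences(document):
        # Both models propose correction suggestions for the sentence.
        suggestions = pipeline_correct(sentence) + end2end_correct(sentence)
        # The filtering module screens out corrections of already-correct words.
        suggestions = filter_suggestions(sentence, suggestions)
        # The fusion module combines the outputs into the final sentence.
        corrected.append(fuse(sentence, suggestions))
    return "".join(corrected)
```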
The end-to-end error correction model uses a feature encoder based on the Transformer architecture to obtain a semantic vector representation of each character in a sentence, feeds the representation into a feedforward neural network that predicts over the vocabulary, and introduces constraint rules based on phonetic similarity and glyph similarity at the output end; for each position in the original sentence, the feedforward network picks the top-k candidate characters with the highest probability in the prediction vocabulary,
the candidate characters are traversed in descending order of probability:
if the original character is a punctuation mark, maintaining the original character;
if the predicted character is the original character or is not in the vocabulary, maintaining the original character;
if the pronunciation and glyph similarity between the predicted character and the original character are within the threshold, the predicted character is taken as the answer; otherwise the traversal continues. If the traversal finishes without finding a character that satisfies the conditions, the original character is kept.
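The traversal rules above can be sketched as follows, assuming hypothetical `pinyin_sim` and `glyph_sim` similarity functions, a vocabulary set, and an illustrative threshold value; none of these names come from the patent.

```python
# Sketch of the constraint rules applied to the top-k candidates
# at one position of the sentence. Candidates must arrive sorted
# by descending model probability.
import string

PUNCT = set("，。！？；：、" + string.punctuation)

def pick_character(original, topk_candidates, vocab,
                   pinyin_sim, glyph_sim, threshold=0.5):
    if original in PUNCT:                  # rule 1: punctuation is kept
        return original
    for cand in topk_candidates:           # traverse by descending probability
        if cand == original:               # rule 2: prediction equals original
            return original
        if cand not in vocab:              # rule 2: prediction not in vocabulary
            return original
        # rule 3: accept if both pronunciation and glyph are similar enough
        if pinyin_sim(original, cand) >= threshold and \
           glyph_sim(original, cand) >= threshold:
            return cand
    return original                        # no candidate satisfied the rules
```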
The pipeline error correction model adopts a three-stage error correction method of error detection, candidate recall, and candidate ranking.
The error detection combines a method based on a local semantic model with a method based on a global semantic model,
the method based on the local semantic model comprises the following steps:
S1. Mine a bi-gram dictionary from a large-scale domain corpus and count word frequencies. After the sentence to be predicted is segmented, a sequence of bi-gram words is obtained; if a bi-gram word is not in the dictionary and its frequency is smaller than a set threshold, the word is considered possibly wrong and is added to the candidate wrong words;
S2. Train a 5-gram language model on a large-scale domain corpus. For a given Chinese character string, if the sentence contains an error, the erroneous characters will appear as consecutive single characters after Chinese word segmentation. In the bi-gram model the probability of a character depends only on the character immediately before it, so the probability of the string is approximated by the product of a series of conditional probabilities:

P(c_1 c_2 ... c_L) ≈ ∏_{l=2..L} P(c_l | c_{l-1})

The probability of each term in the above equation may be calculated from its maximum likelihood estimate:

P(c_l | c_{l-1}) = N(c_{l-1} c_l) / N(c_{l-1})

where N(c_{l-1} c_l) and N(c_{l-1}) respectively denote the number of occurrences of the corresponding character strings in the given corpus.
If the probability of a bi-gram word is greater than the set threshold, the word is added to the candidate wrong words.
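Step S1's bi-gram frequency check can be sketched as follows. The corpus, segmentation, and threshold are toy stand-ins for the large-scale domain resources the text assumes.

```python
# Sketch of local-semantic error detection: flag words whose adjacent
# bigrams are absent or too rare in a mined bigram frequency dictionary.
from collections import Counter

def mine_bigrams(corpus_words):
    """Count adjacent word pairs over an already-segmented corpus."""
    return Counter(zip(corpus_words, corpus_words[1:]))

def detect_suspects(words, bigram_freq, min_freq=2):
    """Return words that participate in an unseen or low-frequency bigram."""
    suspects = set()
    for pair in zip(words, words[1:]):
        if bigram_freq.get(pair, 0) < min_freq:
            suspects.update(pair)
    return suspects
```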
The candidate recall includes pronunciation-similar recall and glyph-similar recall.
The pronunciation-similar recall builds a similar-pinyin dictionary from the bi-gram word library mined from the large-scale corpus and, for each error candidate word, is divided into whole-word recall and per-character recall:
whole-word recall looks up all pinyin readings of the word in a word-pinyin recall library, then finds all words with each reading and adds them to the candidate set;
per-character recall obtains all pinyin readings of each character, combines them into all possible word readings, finds all words for each reading in the pinyin library, and adds them to the candidate set.
The glyph-similar recall recalls similar characters for each character of the word from a similar-glyph library, combines the characters into similar words, computes the similarity between each similar word and the error candidate word, and adds a similar word to the candidate set if its similarity exceeds a set threshold.
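The per-character pinyin recall can be sketched as follows. The two dictionaries are tiny hand-built stand-ins for the similar-pinyin resources that the text says are mined from a large corpus.

```python
# Sketch of per-character pronunciation-similar recall: combine the
# pinyin readings of each character, then look up words per reading.
from itertools import product

# Hypothetical toy dictionaries (real ones are corpus-mined).
CHAR_PINYIN = {"平": ["ping"], "苹": ["ping"], "果": ["guo"], "裹": ["guo"]}
PINYIN_WORDS = {("ping", "guo"): ["苹果", "平果"]}

def recall_by_pinyin(word):
    """Cartesian product of per-character readings, then dictionary lookup."""
    readings = product(*(CHAR_PINYIN.get(ch, []) for ch in word))
    candidates = set()
    for reading in readings:
        candidates.update(PINYIN_WORDS.get(tuple(reading), []))
    candidates.discard(word)  # the word itself is not a correction
    return candidates
```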
For the recalled candidate set of each wrong word, the candidate ranking obtains correction suggestion words through coarse ranking followed by fine ranking.
In the coarse ranking, an ngram-score is computed for each word using the ngram language model;
the candidates are sorted by ngram-score from high to low, the top-k candidate words are kept, and fine ranking is then performed;
in the fine ranking, the sentence score is computed from the perplexity (ppl), which evaluates the fluency of the sentence after a correction suggestion word is substituted: the smaller the ppl value, the more fluent the sentence, and the better the correction suggestion word.
The perplexity is computed as:

ppl(S) = P(w_1 w_2 ... w_N)^(-1/N) = ( ∏_{i=1..N} p(w_i | w_1 w_2 ... w_{i-1}) )^(-1/N)

where S is the current sentence, N is the sentence length, p(w_i) is the probability of the i-th word, and p(w_i | w_1 w_2 ... w_{i-1}) is the probability of the i-th word given the previous i-1 words,
and the ppl values of the candidate words are sorted from low to high, with different thresholds set according to sentence length; the sorted results are traversed, and if the current value is smaller than the threshold, the current word is added to the error correction suggestion set. Finally, the word ranked first (lowest ppl) in the suggestion set is taken as the final error correction suggestion word.
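The fine-ranking step can be sketched as follows, with a `logprob` callback standing in for the language model; in practice the threshold would depend on sentence length as the text describes, and these names are illustrative.

```python
# Sketch of fine ranking: substitute each candidate at the error
# position, score the resulting sentence by perplexity, and keep the
# lowest-ppl candidate if it beats the threshold.
import math

def perplexity(words, logprob):
    """ppl(S) = exp(-(1/N) * sum_i log p(w_i | history))."""
    n = len(words)
    total = sum(logprob(words[:i], words[i]) for i in range(n))
    return math.exp(-total / n)

def fine_rank(sentence_words, position, candidates, logprob, threshold):
    scored = []
    for cand in candidates:
        trial = sentence_words[:position] + [cand] + sentence_words[position + 1:]
        scored.append((perplexity(trial, logprob), cand))
    scored.sort()  # lower perplexity = more fluent sentence
    best_ppl, best = scored[0]
    return best if best_ppl < threshold else None
```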
The error correction filtering module comprises the following steps:
1) Calculating the ngram score and the confusion degree using the deep language model and the statistical language model respectively;
2) Averaging the ngram scores and the confusion degrees obtained by the two models to obtain a mean ngram score and a mean confusion degree;
3) Taking the maximum value of the ngram score;
4) Taking the minimum value of the confusion degree;
5) Multiplying the minimum confusion degree by the length of the word, and calculating the confusion degree difference between the two models;
6) Calculating the score of the word: score = mean confusion degree × word length − (mean ngram score × confusion degree difference / maximum ngram score).
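Steps 1 to 6 can be sketched as follows. The step-6 formula here is one reading of the expression in the text, and the input scores are placeholders for outputs of the deep and statistical language models.

```python
# Sketch of the filtering score combining a deep LM and a statistical
# LM. Lower score = more plausible correction (an interpretation).
def word_score(word, deep_ngram, stat_ngram, deep_ppl, stat_ppl):
    mean_ngram = (deep_ngram + stat_ngram) / 2   # step 2: mean ngram score
    mean_ppl = (deep_ppl + stat_ppl) / 2         # step 2: mean confusion degree
    max_ngram = max(deep_ngram, stat_ngram)      # step 3: maximum ngram score
    ppl_diff = abs(deep_ppl - stat_ppl)          # step 5: confusion difference
    # step 6 (reconstructed): mean_ppl * length - mean_ngram * diff / max_ngram
    return mean_ppl * len(word) - mean_ngram * ppl_diff / max_ngram
```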
The model fusion module selects the word with the smallest final score computed by the error correction filtering module as the output.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Further, one skilled in the art may combine and combine the different embodiments or examples described in this specification.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications and alternatives to the above embodiments may be made by those skilled in the art within the scope of the invention.
Claims (3)
1. A Chinese spelling error correction method integrating local semantic features and global semantic features is characterized by comprising the following steps:
for a document, a sentence-dividing module produces a collection of sentences;
for each sentence, a pipeline error correction model and an end-to-end error correction model produce correction suggestions;
to prevent correct words from being erroneously corrected, the suggestions are screened by an error correction filtering module;
finally, a model fusion module combines the outputs of the end-to-end error correction model and the pipeline error correction model to produce the final corrected sentence and corrected document,
the end-to-end error correction model uses a feature encoder based on the Transformer architecture to obtain a semantic vector representation of each character in a sentence, feeds the representation into a feedforward neural network that predicts over the vocabulary, and introduces constraint rules based on phonetic similarity and glyph similarity at the output end; for each position in the original sentence, the feedforward network picks the top-k candidate characters with the highest probability in the prediction vocabulary,
the candidate characters are traversed in descending order of probability:
if the original character is a punctuation mark, maintaining the original character;
if the predicted character is the original character or is not in the vocabulary, maintaining the original character;
if the pronunciation and glyph similarity between the predicted character and the original character are within the threshold, the predicted character is taken as the answer; otherwise the traversal continues; if the traversal finishes without finding a character that satisfies the conditions, the original character is kept,
the pipeline error correction model adopts a three-stage error correction method of error detection, candidate recall, and candidate ranking,
the error detection combines a method based on a local semantic model with a method based on a global semantic model,
the method based on the local semantic model comprises the following steps:
S1. Mine a bi-gram dictionary from a large-scale domain corpus and count word frequencies. After the sentence to be predicted is segmented, a sequence of bi-gram words is obtained; if a bi-gram word is not in the dictionary and its frequency is smaller than a set threshold, the word is considered possibly wrong and is added to the candidate wrong words;
S2. Train a 5-gram language model on a large-scale domain corpus. For a given Chinese character string, if the sentence contains an error, the erroneous characters will appear as consecutive single characters after Chinese word segmentation. In the bi-gram model the probability of a character depends only on the character immediately before it, so the probability of the string is approximated by the product of a series of conditional probabilities:

P(c_1 c_2 ... c_L) ≈ ∏_{l=2..L} P(c_l | c_{l-1})

The probability of each term in the above equation may be calculated from its maximum likelihood estimate:

P(c_l | c_{l-1}) = N(c_{l-1} c_l) / N(c_{l-1})

where N(c_{l-1} c_l) and N(c_{l-1}) respectively denote the number of occurrences of the corresponding character strings in the given corpus,
if the probability of a bi-gram word is greater than the set threshold, the word is added to the candidate wrong words,
the candidate recall includes pronunciation-similar recall and glyph-similar recall,
the pronunciation-similar recall builds a similar-pinyin dictionary from the bi-gram word library mined from the large-scale corpus and, for each error candidate word, is divided into whole-word recall and per-character recall:
whole-word recall looks up all pinyin readings of the word in a word-pinyin recall library, then finds all words with each reading and adds them to the candidate set;
per-character recall obtains all pinyin readings of each character, combines them into all possible word readings, finds all words for each reading in the pinyin library, and adds them to the candidate set,
the glyph-similar recall recalls similar characters for each character of the error candidate word from a similar-glyph library, combines the characters into similar words, computes the similarity between each similar word and the error candidate word, and adds a similar word to the candidate set if its similarity exceeds a set threshold,
for the recalled candidate set of each wrong word, the candidate ranking obtains correction suggestion words through coarse ranking followed by fine ranking,
in the coarse ranking, an ngram-score is computed for each word using the ngram language model,
the candidates are sorted by ngram-score from high to low, the top-k candidate words are kept, and fine ranking is then performed;
in the fine ranking, the sentence score is computed from the perplexity (ppl), which evaluates the fluency of the sentence after a correction suggestion word is substituted: the smaller the ppl value, the more fluent the sentence, and the better the correction suggestion word; the perplexity is computed as:

ppl(S) = P(w_1 w_2 ... w_N)^(-1/N) = ( ∏_{i=1..N} p(w_i | w_1 w_2 ... w_{i-1}) )^(-1/N)

where S is the current sentence, N is the sentence length, p(w_i) is the probability of the i-th word, and p(w_i | w_1 w_2 ... w_{i-1}) is the probability of the i-th word given the previous i-1 words,
and the ppl values of the candidate words are sorted from low to high, with different thresholds set according to sentence length; the sorted results are traversed, and if the current value is smaller than the threshold, the current word is added to the error correction suggestion set; finally, the word ranked first (lowest ppl) in the suggestion set is taken as the final error correction suggestion word.
2. The Chinese spelling error correction method integrating local semantic features and global semantic features according to claim 1, wherein the error correction filtering module comprises the following steps:
1) Calculating the ngram score and the confusion degree using the deep language model and the statistical language model respectively;
2) Averaging the ngram scores and the confusion degrees obtained by the two models to obtain a mean ngram score and a mean confusion degree;
3) Taking the maximum value of the ngram score;
4) Taking the minimum value of the confusion degree;
5) Multiplying the minimum confusion degree by the length of the word, and calculating the confusion degree difference between the two models;
6) Calculating the score of the word: score = mean confusion degree × word length − (mean ngram score × confusion degree difference / maximum ngram score).
3. The Chinese spelling error correction method integrating local semantic features and global semantic features according to claim 2, wherein the model fusion module selects the word with the smallest final score computed by the error correction filtering module as the output.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211740208.8A CN116090441B (en) | 2022-12-30 | 2022-12-30 | Chinese spelling error correction method integrating local semantic features and global semantic features |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211740208.8A CN116090441B (en) | 2022-12-30 | 2022-12-30 | Chinese spelling error correction method integrating local semantic features and global semantic features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116090441A CN116090441A (en) | 2023-05-09 |
CN116090441B true CN116090441B (en) | 2023-10-20 |
Family
ID=86186355
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211740208.8A Active CN116090441B (en) | 2022-12-30 | 2022-12-30 | Chinese spelling error correction method integrating local semantic features and global semantic features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116090441B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116306600B (en) * | 2023-05-25 | 2023-08-11 | 山东齐鲁壹点传媒有限公司 | MacBert-based Chinese text error correction method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107729316A (en) * | 2017-10-12 | 2018-02-23 | 福建富士通信息软件有限公司 | The identification of wrong word and the method and device of error correction in the interactive question and answer text of Chinese |
CN110852087A (en) * | 2019-09-23 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Chinese error correction method and device, storage medium and electronic device |
CN111090986A (en) * | 2019-11-29 | 2020-05-01 | 福建亿榕信息技术有限公司 | Method for correcting errors of official document |
CN112149406A (en) * | 2020-09-25 | 2020-12-29 | 中国电子科技集团公司第十五研究所 | Chinese text error correction method and system |
CN113221542A (en) * | 2021-03-31 | 2021-08-06 | 国家计算机网络与信息安全管理中心 | Chinese text automatic proofreading method based on multi-granularity fusion and Bert screening |
CN113435186A (en) * | 2021-06-18 | 2021-09-24 | 上海熙瑾信息技术有限公司 | Chinese text error correction system, method, device and computer readable storage medium |
CN114444479A (en) * | 2022-04-11 | 2022-05-06 | 南京云问网络技术有限公司 | End-to-end Chinese speech text error correction method, device and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220309360A1 (en) * | 2021-03-25 | 2022-09-29 | Oracle International Corporation | Efficient and accurate regional explanation technique for nlp models |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110489760B (en) | Text automatic correction method and device based on deep neural network | |
CN111369996B (en) | Speech recognition text error correction method in specific field | |
CN112149406B (en) | Chinese text error correction method and system | |
CN102968989B (en) | Improvement method of Ngram model for voice recognition | |
Derouault et al. | Natural language modeling for phoneme-to-text transcription | |
JP4833476B2 (en) | Language input architecture that converts one text format to the other text format with modeless input | |
CN108959250A (en) | A kind of error correction method and its system based on language model and word feature | |
CN105957518A (en) | Mongolian large vocabulary continuous speech recognition method | |
CN110276069B (en) | Method, system and storage medium for automatically detecting Chinese braille error | |
CN116090441B (en) | Chinese spelling error correction method integrating local semantic features and global semantic features | |
CN109948144B (en) | Teacher utterance intelligent processing method based on classroom teaching situation | |
CN116306600B (en) | MacBert-based Chinese text error correction method | |
CN111985234B (en) | Voice text error correction method | |
CN112380841B (en) | Chinese spelling error correction method and device, computer equipment and storage medium | |
KR20230009564A (en) | Learning data correction method and apparatus thereof using ensemble score | |
CN115034218A (en) | Chinese grammar error diagnosis method based on multi-stage training and editing level voting | |
Roy et al. | Unsupervised context-sensitive bangla spelling correction with character n-gram | |
CN112149388B (en) | Method for recognizing vocabulary deformation in password and generating guessing rule | |
Göker et al. | Neural text normalization for turkish social media | |
CN111274826A (en) | Semantic information fusion-based low-frequency word translation method | |
CN112597771A (en) | Chinese text error correction method based on prefix tree combination | |
Namboodiri et al. | On using classical poetry structure for Indian language post-processing | |
CN111428475A (en) | Word segmentation word bank construction method, word segmentation method, device and storage medium | |
Sertsi et al. | Hybrid input-type recurrent neural network language modeling for end-to-end speech recognition | |
CN115688904B (en) | Translation model construction method based on noun translation prompt |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||