Summary of the Invention
Object of the invention: To overcome the deficiencies of the prior art, the present invention provides a method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation.
Technical solution: To solve the above technical problem, the present invention provides a method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation. The method performs automatic proofreading by means of fuzzy word segmentation and comprises the following steps:
1) Using double-array Trie structures built from a correct-word dictionary and a typo-word dictionary, perform exact segmentation of the Chinese sentence with the maximum matching method and build an exact-segmentation word graph; mark the results of the exact segmentation based on the typo-word dictionary, and add to the word graph the correct words corresponding to the typo words in the Chinese sentence that match the typo-word dictionary;
2) Apply a fuzzy matching method to the scattered strings in the exact segmentation result to obtain the similar words corresponding to each scattered string and their similarities, and add these similar words to the exact-segmentation word graph to form a fuzzy-segmentation word graph;
3) Based on a word bigram model combined with similarity, compute the shortest path through the fuzzy-segmentation word graph to obtain the final segmentation result, and mark the original strings corresponding to the fuzzy matching nodes in the segmentation result as the errors found, thereby achieving automatic proofreading of Chinese non-multi-character word errors.
Preferably, step 1) comprises the following steps:
Step 11) Build the double-array Trie structure DicTrie of the correct-word dictionary;
Step 12) Build the double-array Trie structure TypoDicTrie of the typo-word dictionary, whose entries are pairs (TypoWord, CorrectWord), where TypoWord is a typo word and CorrectWord is the correct word corresponding to that typo word;
Step 13) Based on the double-array Trie structure DicTrie of the correct-word dictionary, perform exact segmentation of the Chinese sentence with the maximum matching method, and add the resulting words to the word graph to build the exact-segmentation word graph;
Step 14) Based on the double-array Trie structure TypoDicTrie of the typo-word dictionary, perform exact segmentation of the Chinese sentence with the maximum matching method and mark the sentence: each typo word TypoWord from the typo-word dictionary that is found in the sentence is marked as a wrong word, together with its corresponding correct word CorrectWord; at the same time, the correct word CorrectWord corresponding to each typo word TypoWord in the sentence is added to the exact-segmentation word graph.
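Steps 11) and 13) can be sketched as follows. This is a minimal illustration, not the patented implementation: a plain nested-dictionary trie stands in for the double-array Trie, the names `Trie`, `longest_match`, and `exact_segment` are hypothetical, and the Chinese strings are illustrative examples that do not come from the text above.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class Trie:
    """Nested-dict trie; a simple stand-in for the double-array Trie (assumption)."""

    _END = "$end"  # multi-character marker, so it cannot collide with a single character

    def __init__(self, words: Iterable[str]) -> None:
        self.root: Dict = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[self._END] = True

    def longest_match(self, text: str, start: int) -> Optional[str]:
        """Return the longest dictionary word that begins at text[start], if any."""
        node, best = self.root, None
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if self._END in node:
                best = text[start:i + 1]
        return best


def exact_segment(sentence: str, trie: Trie) -> List[Tuple[int, int, str]]:
    """Forward maximum matching; unmatched characters become single-character nodes."""
    edges, pos = [], 0
    while pos < len(sentence):
        word = trie.longest_match(sentence, pos) or sentence[pos]
        edges.append((pos, pos + len(word), word))
        pos += len(word)
    return edges


# Illustrative dictionary and sentence (assumed, not from the source text).
dic_trie = Trie(["你", "为什么", "经常", "无故", "我", "的"])
edges = exact_segment("你为什么经常无原无故", dic_trie)
words = [word for _, _, word in edges]
```

Each returned tuple is one node of the exact-segmentation word graph, expressed as (start position, end position, word). Runs of single-character nodes correspond to the scattered strings that step 2) later examines.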
Preferably, step 2) comprises:
Traverse the characters of the Chinese sentence after the exact segmentation of step 1) and apply the fuzzy matching method at each character; compute the similarity between each fuzzily matched character string and its corresponding scattered string; check whether the similarity is not less than the threshold t_w; each fuzzily matched character string whose similarity is not less than the threshold is taken as a similar word of the corresponding scattered string and added to the exact-segmentation word graph as a fuzzy matching node, forming the fuzzy-segmentation word graph; this continues until all characters in the sentence have been traversed.
The similarity between a fuzzily matched character string W_2 and its corresponding scattered string W_1 is computed as:
[formula not reproduced in this text]
where Sim(W_1, W_2) is the similarity between the scattered string W_1 and the character string W_2; W_1 = c_1c_2…c_n and W_2 = d_1d_2…d_m; n and m are the numbers of characters in W_1 and W_2 respectively; Max() takes the maximum; and editdis(W_1, W_2) is the distance function of the two character strings:
[formula not reproduced in this text]
where sim(c_1, d_1) is the similarity between the Chinese characters c_1 and d_1, computed by the following formula:
[formula not reproduced in this text]
where sim(c_i, d_j) is the similarity between Chinese character c_i and Chinese character d_j, 1 ≤ i ≤ n, 1 ≤ j ≤ m; PSim(c_i, d_j) is the pinyin similarity between c_i and d_j; SSim(c_i, d_j) is the shape similarity between c_i and d_j; and α and β are the weights of the pinyin similarity and the shape similarity respectively, with α + β = 1.
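Because the formulas themselves do not appear in this text, the sketch below rests on explicitly assumed forms: sim(c, d) = α·PSim(c, d) + β·SSim(c, d); editdis as a Levenshtein distance whose substitution cost is 1 − sim(c, d); and Sim(W_1, W_2) = 1 − editdis(W_1, W_2) / Max(n, m). These forms are consistent with the symbol definitions above but are not confirmed by the source. The similarity tables hold toy values, and all function names are hypothetical.

```python
from typing import Callable

CharSim = Callable[[str, str], float]


def char_sim(c: str, d: str, psim: CharSim, ssim: CharSim,
             alpha: float = 0.5, beta: float = 0.5) -> float:
    """Assumed form: alpha * PSim + beta * SSim, with alpha + beta = 1."""
    return 1.0 if c == d else alpha * psim(c, d) + beta * ssim(c, d)


def edit_distance(w1: str, w2: str, sim: CharSim) -> float:
    """Levenshtein distance with substitution cost 1 - sim(c, d) (assumed form)."""
    n, m = len(w1), len(w2)
    prev = [float(j) for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [float(i)] + [0.0] * m
        for j in range(1, m + 1):
            curr[j] = min(
                prev[j] + 1.0,                                   # delete w1[i-1]
                curr[j - 1] + 1.0,                               # insert w2[j-1]
                prev[j - 1] + 1.0 - sim(w1[i - 1], w2[j - 1]),   # substitute
            )
        prev = curr
    return prev[m]


def string_sim(w1: str, w2: str, sim: CharSim) -> float:
    """Assumed normalisation: 1 - editdis(W1, W2) / max(n, m)."""
    longest = max(len(w1), len(w2))
    return 1.0 if longest == 0 else 1.0 - edit_distance(w1, w2, sim) / longest


# Toy similarity tables (illustrative values, not from the source).
PINYIN = {("原", "缘"): 1.0}
SHAPE = {("原", "缘"): 0.8}


def psim(c: str, d: str) -> float:
    return PINYIN.get((c, d), 0.0)


def ssim(c: str, d: str) -> float:
    return SHAPE.get((c, d), 0.0)


def sim(c: str, d: str) -> float:
    return char_sim(c, d, psim, ssim, alpha=0.5, beta=0.5)


score = string_sim("无原", "无缘", sim)
```

Under these assumptions a two-character string with one highly confusable character scores close to the threshold t_w, while an unrelated pair of characters contributes a full unit of distance.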
Preferably, the fuzzy matching method is carried out by single-character replacement, multi-character replacement, or missing-character replacement, where single-character replacement is replacement by a character of similar shape and/or by a character of similar sound.
Preferably, when the method is applied to text entered with a pinyin input method or another phonetic input method, the weight of the pinyin similarity is α = 1 and the weight of the shape similarity is β = 0.
Preferably, when the method is applied to the correction of OCR recognition results, the weight of the pinyin similarity is α = 0 and the weight of the shape similarity is β = 1.
Preferably, when the method is applied to text entered with both a pinyin input method and a shape-based input method, the weight of the pinyin similarity is α = 0.5 and the weight of the shape similarity is β = 0.5.
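The three weight settings above can be kept in a small lookup table. The scenario keys and the helper name `weights_for` are hypothetical; only the (α, β) values come from the text.

```python
from typing import Dict, Tuple

# (alpha, beta) = (pinyin-similarity weight, shape-similarity weight).
WEIGHT_PRESETS: Dict[str, Tuple[float, float]] = {
    "pinyin_input": (1.0, 0.0),            # pinyin or other phonetic input methods
    "ocr": (0.0, 1.0),                     # correction of OCR recognition results
    "pinyin_and_shape_input": (0.5, 0.5),  # mixed pinyin and shape-based input
}


def weights_for(scenario: str) -> Tuple[float, float]:
    """Return (alpha, beta) for a scenario and check that alpha + beta = 1."""
    alpha, beta = WEIGHT_PRESETS[scenario]
    if abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha and beta must sum to 1")
    return alpha, beta
```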
Preferably, step 3) comprises the following steps:
Step 31) From the fuzzy-segmentation word graph obtained by the exact segmentation of step 1) and the fuzzy matching of step 2), obtain the candidate paths; then, using the similar words corresponding to the scattered strings and their similarities obtained in step 2), compute the probability of each segmentation sequence with the bigram model:
[formula not reproduced in this text]
where G is one segmentation path in the word graph, G_k is the k-th word on the path, and s is the number of words on the path; γ(G_{k-1}, G') is the penalty applied during segmentation to an original string in the sentence, that is, to the scattered string corresponding to a fuzzy matching node: γ(G_{k-1}, G') = 1 when the current word comes from exact segmentation, and otherwise γ(G_{k-1}, G') = sim(G_{k-1}, G'), the similarity between the fuzzily matched original string G' in the sentence and the matched word G_{k-1}, also referred to as the similarity between the fuzzily matched character string G_{k-1} and its corresponding scattered string G';
Step 32) On the fuzzy-segmentation word graph obtained in step 31), solve for the shortest path with Dijkstra's algorithm to obtain the final segmentation result;
Step 33) For each fuzzy matching node on the shortest path, mark its corresponding original string as a word containing a wrong character, and take the similar word obtained by fuzzy matching as its corresponding correct word, thereby achieving automatic proofreading of Chinese non-multi-character word errors.
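Steps 31) and 32) can be sketched as a shortest-path search. Since the probability formula is not reproduced in this text, the sketch assumes that a path's probability is the product of bigram probabilities, each multiplied by the penalty γ of the word entered, and it turns that product into additive edge costs of −log(P·γ). Because a bigram depends on the previous word, the search state is (position, previous word). The function `best_path`, the toy bigram table, and the Chinese strings are hypothetical.

```python
import heapq
import math
from typing import Callable, Dict, List, Tuple

Edge = Tuple[int, str, float]  # (end position, word, gamma); gamma = 1.0 for exact nodes
State = Tuple[int, str]        # (position in sentence, previous word)


def best_path(graph: Dict[int, List[Edge]], length: int,
              bigram: Callable[[str, str], float]) -> List[str]:
    """Dijkstra over (position, previous word) states with cost -log(P * gamma).

    Requires 0 < P * gamma <= 1 so that every edge cost is non-negative.
    """
    start: State = (0, "<S>")
    dist: Dict[State, float] = {start: 0.0}
    back: Dict[State, State] = {}
    heap: List[Tuple[float, State]] = [(0.0, start)]
    while heap:
        cost, state = heapq.heappop(heap)
        if cost > dist.get(state, math.inf):
            continue  # stale queue entry
        pos, prev = state
        if pos == length:  # first goal state popped is the cheapest one
            words = []
            while state != start:
                words.append(state[1])
                state = back[state]
            return words[::-1]
        for end, word, gamma in graph.get(pos, []):
            new_cost = cost - math.log(bigram(prev, word) * gamma)
            nxt = (end, word)
            if new_cost < dist.get(nxt, math.inf):
                dist[nxt] = new_cost
                back[nxt] = state
                heapq.heappush(heap, (new_cost, nxt))
    return []


# Toy word graph for an illustrative four-character string (assumed values).
graph = {
    0: [(1, "无", 1.0), (2, "无缘", 0.95), (4, "无缘无故", 0.975)],
    1: [(2, "原", 1.0)],
    2: [(4, "无故", 1.0)],
}
TOY_BIGRAM = {("<S>", "无缘无故"): 0.2, ("<S>", "无"): 0.1, ("无", "原"): 0.001,
              ("原", "无故"): 0.001, ("<S>", "无缘"): 0.05, ("无缘", "无故"): 0.01}


def bigram(prev: str, word: str) -> float:
    return TOY_BIGRAM.get((prev, word), 1e-6)


path = best_path(graph, 4, bigram)
```

Minimising the sum of −log terms is equivalent to maximising the assumed product, which is why the most probable segmentation is the shortest path.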
Preferably, the threshold t_w is 0.95.
Beneficial effects: The present invention proposes a method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation. The method identifies and corrects non-multi-character word errors in Chinese text during word segmentation, and the double-array Trie allows fuzzy segmentation to be performed quickly. Experiments show that the method achieves a recall of 75.9% and a precision of 85% in identifying non-multi-character word errors, a correction rate of 62%, and a correction accuracy of 81.7%. The system responds quickly and its precision meets the needs of practical applications, so the method is effective, accurate, and practical.
Detailed Description of the Embodiments
The present invention is further described below with reference to the accompanying drawing and an embodiment.
The method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation provided by the present invention performs automatic proofreading by means of fuzzy word segmentation and comprises the following steps:
1) Using double-array Trie structures built from a correct-word dictionary and a typo-word dictionary, perform exact segmentation of the Chinese sentence with the maximum matching method and build an exact-segmentation word graph; mark the results of the exact segmentation based on the typo-word dictionary, and add to the word graph the correct words corresponding to the typo words in the Chinese sentence that match the typo-word dictionary. Specifically:
Exact segmentation is first performed with the correct-word dictionary and the typo-word dictionary to build the exact-segmentation word graph, where:
S: the sentence to be segmented; Dic1: the correct-word dictionary; Dic2: the typo-word dictionary; pos1: the lookup position for the correct-word dictionary; pos2: the lookup position for the typo-word dictionary.
Step 11) Build the double-array Trie structure DicTrie of the correct-word dictionary Dic1;
Step 12) Build the double-array Trie structure TypoDicTrie of the typo-word dictionary Dic2, whose entries are pairs (TypoWord, CorrectWord), where TypoWord is a typo word and CorrectWord is the correct word corresponding to that typo word; for example, a misspelled form of the idiom meaning "for no reason at all" paired with its correct form;
Step 13) Based on the double-array Trie structure DicTrie of the correct-word dictionary, perform exact segmentation of the Chinese sentence with the maximum matching method, and add the resulting words to the word graph to build the exact-segmentation word graph; as shown in Fig. 1, this embodiment draws exact segmentation results as solid boxes in the word graph.
In this embodiment: using the correct-word dictionary Dic1, perform a forward maximum search starting at position pos1 (initially 0). If a correct entry word1 is found, add it to the exact-segmentation word graph and update pos1 to the position after word1; otherwise advance pos1 to the character following the current position. Repeat the search until pos1 reaches the end of the sentence S.
Step 14) Based on the double-array Trie structure TypoDicTrie of the typo-word dictionary, perform exact segmentation of the Chinese sentence with the maximum matching method and mark the sentence: each typo word TypoWord from the typo-word dictionary that is found in the sentence is marked as a wrong word, together with its corresponding correct word CorrectWord; at the same time, the correct word CorrectWord corresponding to each typo word TypoWord in the sentence is added to the exact-segmentation word graph; as shown in Fig. 1, this embodiment draws these nodes as dashed boxes in the word graph.
In this embodiment: using the typo-word dictionary Dic2, perform a forward maximum search starting at position pos2 (initially 0). If a typo word TypoWord is found, add its corresponding correct entry CorrectWord to the exact-segmentation word graph, mark the typo word and its corresponding correct word in the sentence, and update pos2 to the position after TypoWord; otherwise advance pos2 to the character following the current position. Repeat the search until pos2 reaches the end of the sentence S.
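The typo-dictionary pass of step 14) can be sketched as the scan below. For brevity it uses a plain Python dictionary with a longest-first lookup in place of the TypoDicTrie structure; the function name `mark_typos` and the sample entry are hypothetical illustrations.

```python
from typing import Dict, List, Tuple

Mark = Tuple[int, int, str, str]  # (start, end, TypoWord, CorrectWord)


def mark_typos(sentence: str, typo_dict: Dict[str, str]) -> List[Mark]:
    """Scan the sentence with pos2, trying the longest typo entry first."""
    max_len = max(map(len, typo_dict), default=0)
    marks: List[Mark] = []
    pos2 = 0
    while pos2 < len(sentence):
        for size in range(min(max_len, len(sentence) - pos2), 0, -1):
            candidate = sentence[pos2:pos2 + size]
            if candidate in typo_dict:
                marks.append((pos2, pos2 + size, candidate, typo_dict[candidate]))
                pos2 += size  # jump to the position after TypoWord
                break
        else:
            pos2 += 1  # no typo entry starts here; move to the next character
    return marks


# Illustrative entry (assumed): a misspelled idiom and its correct form.
typo_dict = {"无原无故": "无缘无故"}
marks = mark_typos("你为什么经常无原无故", typo_dict)
```

Each mark records where the typo word sits in the sentence and which CorrectWord should be added to the exact-segmentation word graph as a dashed-box node.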
For example, take the sentence S = "why do you often deduct my work expense, no former without reason" (an English gloss of the Chinese sentence, in which "no former without reason" and "work expense" contain wrong characters).
After the exact segmentation of step 13), the result is as shown in Fig. 1: "you", "why", "frequent", "none", "original", "without reason", "button", "taking", "I", "", "work", and "expense" are the exact segmentation results, drawn as solid boxes in the word graph.
After the exact segmentation of step 14), the result is as shown in Fig. 1: because ("no former without reason", "gratuitous") is an entry in the typo-word dictionary, segmenting with it replaces "none", "original", "without reason" with "gratuitous", which is drawn as a dashed box in the word graph.
2) Apply a fuzzy matching method to the scattered strings in the exact segmentation result to obtain the similar words corresponding to each scattered string and their similarities, and add these similar words to the exact-segmentation word graph to form a fuzzy-segmentation word graph. Specifically:
Traverse the characters of the Chinese sentence after the exact segmentation of step 1) and apply the fuzzy matching method at each character. The fuzzy matching method is carried out by single-character replacement, multi-character replacement, or missing-character replacement, where single-character replacement is replacement by a character of similar shape and/or by a character of similar sound. Compute the similarity between each fuzzily matched character string and its corresponding scattered string with the Chinese string similarity formula; check whether the similarity is not less than the threshold t_w; each fuzzily matched character string whose similarity is not less than the threshold is taken as a similar word of the corresponding scattered string and added to the exact-segmentation word graph as a fuzzy matching node, forming the fuzzy-segmentation word graph; this continues until all characters in the sentence have been traversed.
The Chinese string similarity formula gives the similarity between a fuzzily matched character string W_2 and its corresponding scattered string W_1 as:
[formula not reproduced in this text]
where Sim(W_1, W_2) is the similarity between the scattered string W_1 and the character string W_2; W_1 = c_1c_2…c_n and W_2 = d_1d_2…d_m; n and m are the numbers of characters in W_1 and W_2 respectively; Max() takes the maximum; and editdis(W_1, W_2) is the distance function of the two character strings:
[formula not reproduced in this text]
where sim(c_1, d_1) is the similarity between the Chinese characters c_1 and d_1, computed by the following formula:
[formula not reproduced in this text]
where sim(c_i, d_j) is the similarity between Chinese character c_i and Chinese character d_j, 1 ≤ i ≤ n, 1 ≤ j ≤ m; PSim(c_i, d_j) is the pinyin similarity between c_i and d_j; SSim(c_i, d_j) is the shape similarity between c_i and d_j; and α and β are the weights of the pinyin similarity and the shape similarity respectively, with α + β = 1.
When the method is applied to text entered with a pinyin input method or another phonetic input method, the weight of the pinyin similarity is α = 1 and the weight of the shape similarity is β = 0.
When the method is applied to the correction of OCR recognition results, the weight of the pinyin similarity is α = 0 and the weight of the shape similarity is β = 1.
When the method is applied to text entered with both a pinyin input method and a shape-based input method, the weight of the pinyin similarity is α = 0.5 and the weight of the shape similarity is β = 0.5.
Specifically, in this embodiment, this is realized by the following steps:
Step 20) Set the starting match position of the Chinese sentence to nCurr = 0;
Step 21) Starting from the current position nCurr of the Chinese sentence, read in the current character and perform fuzzy matching on it; during fuzzy matching, the character at the current position may undergo single-character replacement (by a character of similar shape or similar sound), and multi-character or missing-character matching may also be used when computing similarity;
Step 22) Use the Chinese string similarity formula to compute the similarity of the two character strings, that is, the similarity between the fuzzily matched original string in the sentence and the matched word, also referred to as the similarity between the fuzzily matched character string and its corresponding scattered string. For example, in Fig. 1, the pinyin similarity and shape similarity computed for "original" in "no original" yield similar Chinese characters such as "edge"; the Chinese string similarity formula (1) is then used to compute the similarity between the Chinese string "no original" and the dictionary word "have no chance". In this embodiment the user's input methods are a pinyin input method and a shape-based input method, so α = β = 0.5 is set;
Step 23) If the similarity is less than the threshold t_w, set nCurr = nCurr + 1 and go to step 21); otherwise go to step 24). Because Chinese characters are highly confusable, the threshold t_w is 0.95 in this embodiment; it can of course be adjusted for the actual application, for example to 0.90, 0.92, or 0.98;
Step 24) The similarity is not less than the threshold t_w, so a similar word and its similarity are obtained as a triple (sFuzzyWord, next, sim), where sFuzzyWord is the matched word, next is the position of the next node to be read in for fuzzy matching (next = nCurr + 1), and sim is the similarity computed between sFuzzyWord and the original string running from the starting position nCurr to the position where the match ends. If next equals the length of the sentence, stop; otherwise update nCurr to the next position to be read, next, and jump back to step 21);
Step 25) Add the similar words whose fuzzy matching similarity is not less than the threshold t_w to the exact-segmentation word graph as fuzzy matching nodes, forming the fuzzy-segmentation word graph; as shown in Fig. 1, this embodiment draws these nodes as dashed boxes in the word graph.
In the example of Fig. 1, the scattered string "none", "original" finds the dictionary word "have no chance" through sound-similar fuzzy matching, and the scattered string "work", "expense" finds the dictionary words "telephone expenses" and "cost of living" through shape-similar and missing-character fuzzy matching; the nodes of these fuzzy matches are added to the word graph and drawn as dashed boxes.
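The loop of steps 20) to 25) can be sketched as follows, under several simplifying assumptions: only single-character replacement is covered (not multi-character or missing-character matching), every start position is enumerated rather than jumping ahead to `next` after a match, and the similarity function is passed in by the caller. The names `fuzzy_nodes`, `confusable`, and `toy_similarity`, the toy scoring rule, and the Chinese strings are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Node = Tuple[int, int, str, float]  # (nCurr, next, sFuzzyWord, sim)


def fuzzy_nodes(sentence: str, dictionary: Set[str], confusable: Dict[str, Set[str]],
                similarity: Callable[[str, str], float], tw: float = 0.95,
                max_len: int = 4) -> List[Node]:
    """Collect fuzzy matching nodes reachable by replacing one character of a span."""
    nodes: List[Node] = []
    for n_curr in range(len(sentence)):                        # steps 20-21
        last = min(len(sentence), n_curr + max_len)
        for nxt in range(n_curr + 1, last + 1):
            original = sentence[n_curr:nxt]
            for i, ch in enumerate(original):
                for repl in sorted(confusable.get(ch, ())):    # similar sound or shape
                    candidate = original[:i] + repl + original[i + 1:]
                    if candidate not in dictionary:
                        continue
                    sim = similarity(original, candidate)      # step 22
                    if sim >= tw:                              # steps 23-24
                        nodes.append((n_curr, nxt, candidate, sim))
    return nodes                                               # step 25: add to word graph


def toy_similarity(a: str, b: str) -> float:
    """Toy score (assumption): each differing character costs 0.08 / length."""
    diffs = sum(x != y for x, y in zip(a, b))
    return 1.0 - 0.08 * diffs / max(len(a), len(b))


nodes = fuzzy_nodes(
    "无原无故扣取我的活费",
    dictionary={"无缘", "无缘无故", "话费"},
    confusable={"原": {"缘"}, "活": {"话"}},
    similarity=toy_similarity,
)
```

Raising `tw` prunes weaker candidates before they reach the word graph, which is the role the threshold t_w plays in steps 23) and 24).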
3) Based on a word bigram model combined with similarity, compute the shortest path through the fuzzy-segmentation word graph to obtain the final segmentation result, and mark the original strings corresponding to the fuzzy matching nodes in the segmentation result as the errors found, thereby achieving automatic proofreading of Chinese non-multi-character word errors. Specifically:
Step 31) From the fuzzy-segmentation word graph obtained by the exact segmentation of step 1) and the fuzzy matching of step 2), obtain the candidate paths; then, using the similar words corresponding to the scattered strings and their similarities obtained in step 2), compute the probability of each segmentation sequence with the bigram model.
The present invention computes the probability of a segmentation with a word bigram model combined with similarity, applying a penalty to the results of fuzzy segmentation:
[formula not reproduced in this text]
where G is one segmentation path in the word graph, G_k is the k-th word on the path, and s is the number of words on the path; γ(G_{k-1}, G') is the penalty applied during segmentation to an original string in the sentence, that is, to the scattered string corresponding to a fuzzy matching node: γ(G_{k-1}, G') = 1 if the current word comes from exact segmentation, and otherwise γ(G_{k-1}, G') = sim(G_{k-1}, G'), the similarity between the fuzzily matched original string G' in the sentence and the matched word G_{k-1}, also referred to as the similarity between the fuzzily matched character string G_{k-1} and its corresponding scattered string G';
Step 32) On the fuzzy-segmentation word graph obtained in step 31), solve for the shortest path with Dijkstra's algorithm to obtain the final segmentation result;
Step 33) For each fuzzy matching node on the shortest path, mark its corresponding original string as a word containing a wrong character, and take the similar word obtained by fuzzy matching as its corresponding correct word, thereby achieving automatic proofreading of Chinese non-multi-character word errors.
In the example of this embodiment shown in Fig. 1, the shortest path is solved on the word graph generated by exact segmentation and fuzzy segmentation, using the bigram model combined with similarity. The path Path = "S", "you", "why", "frequent", "gratuitous", "button", "taking", "I", "", "telephone expenses" has the highest probability and is therefore the shortest path of the graph. On this path the dashed-box nodes "gratuitous" and "telephone expenses" are fuzzy matching nodes, so the original strings "no former without reason" and "work expense" in the original sentence contain wrong characters. Comparison with the fuzzily matched correct words "gratuitous" and "telephone expenses" shows that "original" and "work" are the wrong characters in the sentence, and that "no former without reason" and "work expense" are non-multi-character word errors.
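The final comparison between an original string and its fuzzily matched correct word can be sketched as a character-by-character diff. This minimal version handles only replacement errors of equal length; the function name `wrong_characters` and the Chinese strings are illustrative assumptions rather than text from the source.

```python
from typing import List, Tuple


def wrong_characters(original: str, corrected: str) -> List[Tuple[int, str, str]]:
    """List (index, wrong character, suggested character) for replacement errors."""
    if len(original) != len(corrected):
        raise ValueError("only same-length (replacement) errors are handled here")
    return [(i, o, c) for i, (o, c) in enumerate(zip(original, corrected)) if o != c]


# Original strings paired with the correct words found on the shortest path (assumed).
report = {
    orig: wrong_characters(orig, fix)
    for orig, fix in [("无原无故", "无缘无故"), ("活费", "话费")]
}
```

Insertion and deletion errors change the string length and would need an alignment step, for example one derived from the edit-distance computation, before the wrong characters can be located.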
4. Experiments
After repeated open tests, the experiment used a test corpus of 20,000 lines of sentences containing 664 non-multi-character word errors, comprising wrong-character replacement errors, character insertion errors, and character deletion errors. The results show that the method provided by the present invention identifies non-multi-character word errors with a recall of 75.9% and a precision of 85%, and achieves a correction rate of 62% with a correction accuracy of 81.7%. This precision exceeds that of the prior art and meets the needs of practical applications, giving the method high effectiveness and accuracy.
The above embodiment is merely a preferred embodiment of the present invention and does not limit the present invention. Any modification, equivalent substitution, or improvement made by those skilled in the art without departing from the technical idea of the present invention falls within the scope of protection of the present invention.