CN104991889B

CN104991889B - Automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation

Info

Publication number: CN104991889B
Application number: CN201510361877.8A
Authority: CN
Inventors: 刘亮亮; 吴健康
Original assignee: Jiangsu University of Science and Technology
Current assignee: China Southern Power Grid Internet Service Co ltd; Jingchuang United (Beijing) Intellectual Property Service Co.,Ltd.
Priority date: 2015-06-26
Filing date: 2015-06-26
Publication date: 2018-02-02
Anticipated expiration: 2035-06-26
Also published as: CN104991889A

Abstract

The invention discloses an automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation. The method first performs exact segmentation using a correct-word dictionary and a typo-word dictionary and generates a word graph. It then computes the similarity of Chinese strings with a fuzzy matching algorithm, fuzzy-matches the scattered strings left by exact segmentation, and adds the fuzzy matching results to the word graph to form a fuzzy word graph. Finally, it computes the shortest path of the fuzzy word graph using a word bigram model combined with similarity, thereby achieving automatic proofreading of Chinese non-multi-character word errors. The method responds quickly, its precision meets the needs of practical applications, and it is effective and accurate.

Description

Automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation

Technical field

The present invention relates to natural language processing in the field of artificial intelligence and computing, and in particular to the automatic proofreading of Chinese text.

Background art

With the rapid development of information processing technology and the Internet, traditional text work has been almost entirely taken over by computers. Electronic texts such as e-books, electronic newspapers, e-mail, office documents, blogs, and microblogs have become part of daily life, but errors in text are also increasingly common, which poses a great challenge for proofreading. Traditional manual proofreading is inefficient, labor-intensive, and slow, and clearly cannot meet the demand for text proofreading.

Automatic text proofreading is one of the main applications of natural language processing and is also a difficult problem in natural language understanding. As the technology has developed, automatic proofreading of English text has achieved very good results and has been commercialized. Compared with English, automatic proofreading of Chinese text faces the following problems:

1) Chinese text has no "non-word errors" of the kind found in English, that is, words absent from the dictionary that can be detected by dictionary lookup; every Chinese character in a Chinese text appears in the dictionary.

2) Chinese text proofreading must first perform Chinese word segmentation. If a word contains a wrong character, segmentation may split it into a scattered string of single characters. This is a non-multi-character word error, and it makes error detection in Chinese text difficult.

3) A scattered string of single characters in Chinese does not necessarily contain a wrong character, because single Chinese characters have a strong ability to stand as words on their own;

4) In addition to non-multi-character word errors, a word in Chinese is often miswritten as another word that is in the dictionary. This kind of error is called a real-word error, and it is also a difficulty in automatic proofreading of Chinese text;

To address the above problems, the present invention proposes and implements a method for automatic detection and automatic proofreading of Chinese non-multi-character word errors.

Summary of the invention

Object of the invention: To overcome the deficiencies of the prior art, the present invention provides an automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation.

Technical solution: To solve the above technical problem, the present invention provides an automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation. The method performs automatic proofreading by means of fuzzy word segmentation and comprises the following steps:

1) Using double-array Trie structures built from a correct-word dictionary and a typo-word dictionary, perform exact segmentation of the Chinese sentence with the maximum matching method and build an exact-segmentation word graph; mark the results of exact segmentation based on the typo-word dictionary, and add to the word graph the correct word corresponding to each typo word in the Chinese sentence that matches the typo-word dictionary;

2) Apply a fuzzy matching method to the scattered strings in the exact segmentation result to obtain the similar words corresponding to the scattered strings and their similarities, and add these similar words to the exact-segmentation word graph to form a fuzzy-segmentation word graph;

3) Based on a word bigram model combined with similarity, compute the shortest path of the fuzzy-segmentation word graph to obtain the final segmentation result, and mark the original strings corresponding to the fuzzy matching nodes in the segmentation result as detected errors, thereby achieving automatic proofreading of Chinese non-multi-character word errors.

Preferably, step 1) comprises the following steps:

Step 11) build the double-array Trie structure DicTrie of the correct-word dictionary;

Step 12) build the double-array Trie structure TypoDicTrie of the typo-word dictionary, whose entries are pairs (TypoWord, CorrectWord), where TypoWord is a typo word and CorrectWord is the correct word corresponding to that typo word;

Step 13) based on the double-array Trie structure DicTrie of the correct-word dictionary, perform exact segmentation of the Chinese sentence with the maximum matching method, and add the segmented words to the word graph to build the exact-segmentation word graph;

Step 14) based on the double-array Trie structure TypoDicTrie of the typo-word dictionary, perform exact segmentation of the Chinese sentence with the maximum matching method and mark the sentence: each typo word TypoWord from the typo-word dictionary that is found in the sentence is marked as an erroneous word, together with its corresponding correct word CorrectWord; at the same time, the correct word CorrectWord corresponding to each typo word TypoWord in the sentence is added to the exact-segmentation word graph.
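The dictionary lookups in steps 11) to 14) can be illustrated with a minimal sketch. It uses a plain dictionary-based trie rather than the double-array Trie named in the text (the double-array layout is a compact encoding of the same lookup behaviour), and the class name `Trie`, the method `longest_prefix`, the reserved key `"$end"`, and the sample entries are all assumptions made for illustration.

```python
class Trie:
    """Plain dict-based trie; a stand-in for the double-array Trie in the text."""

    def __init__(self):
        self.root = {}

    def insert(self, word, value=True):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        # "$end" is a reserved multi-character key, so it cannot clash
        # with the single-character keys used for child nodes.
        node["$end"] = value

    def longest_prefix(self, text, start=0):
        """Return (word, value) for the longest entry beginning at `start`, or None."""
        node = self.root
        best = None
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if "$end" in node:
                best = (text[start:i + 1], node["$end"])
        return best


# Hypothetical toy dictionaries: DicTrie holds correct words,
# TypoDicTrie maps each TypoWord to its CorrectWord.
dic_trie = Trie()
for word in ["经常", "无故", "无缘无故"]:
    dic_trie.insert(word)

typo_dic_trie = Trie()
typo_dic_trie.insert("无原无故", "无缘无故")
```

With these toy entries, `typo_dic_trie.longest_prefix("无原无故扣费")` returns the typo word together with its stored correct word, which is the pair that step 14) marks and adds to the word graph.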

Preferably, step 2) comprises:

Traverse the characters of the Chinese sentence after the exact segmentation of step 1) and apply the fuzzy matching method to each character; compute the similarity between each fuzzily matched string and its corresponding scattered string; determine whether the similarity is not less than the threshold t_w; take each fuzzily matched string whose similarity is not less than the threshold as a similar word of the corresponding scattered string, and add it to the exact-segmentation word graph as a fuzzy matching node to form the fuzzy-segmentation word graph; continue until all characters in the sentence have been traversed;

The similarity between the fuzzily matched string W₂ and its corresponding scattered string W₁ is computed as:

Where: Sim(W₁, W₂) is the similarity between scattered string W₁ and string W₂; scattered string W₁ = c₁c₂…c_n and string W₂ = d₁d₂…d_m, with n and m the numbers of characters in W₁ and W₂ respectively; max() takes the maximum; editdis(W₁, W₂) is the distance function of the two strings:

Where: sim(c₁, d₁) is the similarity between Chinese characters c₁ and d₁, computed by the following formula:

Where: sim(c_i, d_j) is the similarity between Chinese character c_i and Chinese character d_j, 1≤i≤n, 1≤j≤m; PSim(c_i, d_j) is the pinyin similarity between c_i and d_j; SSim(c_i, d_j) is the shape similarity between c_i and d_j; α and β are the weights of pinyin similarity and shape similarity respectively, with α+β=1.
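The formulas themselves are not reproduced in this text, so the following is only a sketch of one plausible reading of the definitions above. It assumes that the character similarity is the weighted sum α·PSim + β·SSim, that editdis is a weighted edit distance in which substituting c for d costs 1 − sim(c, d), and that Sim normalises by max(n, m). The function names, the lookup tables `psim` and `ssim`, and the sample values are hypothetical.

```python
def char_sim(c, d, psim, ssim, alpha=0.5, beta=0.5):
    """Assumed form: alpha * PSim(c, d) + beta * SSim(c, d), with alpha + beta = 1."""
    if c == d:
        return 1.0
    key = frozenset((c, d))
    return alpha * psim.get(key, 0.0) + beta * ssim.get(key, 0.0)


def edit_dis(w1, w2, sim):
    """Assumed weighted edit distance: substitution costs 1 - sim, insert/delete cost 1."""
    n, m = len(w1), len(w2)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)
    for j in range(1, m + 1):
        dist[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,                                   # delete from w1
                dist[i][j - 1] + 1.0,                                   # insert into w1
                dist[i - 1][j - 1] + 1.0 - sim(w1[i - 1], w2[j - 1]),   # substitute
            )
    return dist[n][m]


def string_sim(w1, w2, sim):
    """Assumed normalisation: 1 - editdis(W1, W2) / max(n, m)."""
    longest = max(len(w1), len(w2))
    return 1.0 if longest == 0 else 1.0 - edit_dis(w1, w2, sim) / longest


# Hypothetical similarity tables: one pair of same-sounding characters.
psim = {frozenset(("原", "缘")): 1.0}
ssim = {}

def pinyin_only(c, d):
    return char_sim(c, d, psim, ssim, alpha=1.0, beta=0.0)

score = string_sim("无原", "无缘", pinyin_only)
```

Passing the character similarity in as a function keeps the α and β weighting separate from the string-level computation, so the same `string_sim` can be reused with different weight settings.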

Preferably, the fuzzy matching method is carried out by single-character replacement, extra-character replacement, or missing-character replacement, where single-character replacement means replacement by a character of similar shape and/or replacement by a character of similar sound.

Preferably, for a Chinese non-multi-character word error automatic proofreading method in which the user's input method is a pinyin input method or another phonetic input method, the weight of pinyin similarity is α=1 and the weight of shape similarity is β=0.

Preferably, for a Chinese non-multi-character word error automatic proofreading method used to correct OCR recognition errors, the weight of pinyin similarity is α=0 and the weight of shape similarity is β=1.

Preferably, for a Chinese non-multi-character word error automatic proofreading method in which the user's input methods are a pinyin input method and a shape-based input method, the weight of pinyin similarity is α=0.5 and the weight of shape similarity is β=0.5.

Preferably, step 3) comprises the following steps:

Step 31) from the fuzzy-segmentation word graph obtained by the exact segmentation of step 1) and the fuzzy matching of step 2), multiple paths are obtained; using the similar words corresponding to the scattered strings and their similarities obtained in step 2), the probability of each segmentation sequence is computed with the bigram model:

$$G^{*} = \arg\max_{G} P(G) = \arg\max_{G}\; p(G_1) \prod_{k=2}^{s} p(G_k \mid G_{k-1}) \cdot \gamma(G_{k-1}, G') \qquad (4)$$

Where G is a segmentation path in the word graph, G_k is the k-th word in the path, and s is the number of words in the path; γ(G_{k-1}, G') is the penalty applied during segmentation when the original string in the sentence is a scattered string corresponding to a fuzzy matching node: γ(G_{k-1}, G')=1 when the current word comes from exact segmentation, and otherwise γ(G_{k-1}, G')=sim(G_{k-1}, G'), that is, the similarity between the fuzzily matched original string G' in the sentence and the matched word G_{k-1}, also called the similarity between the fuzzily matched string G_{k-1} and its corresponding scattered string G';

Step 32) on the fuzzy-segmentation word graph obtained in step 31), solve the shortest path using Dijkstra's algorithm for graphs, thereby obtaining the final segmentation result;

Step 33) for each fuzzy matching node on the shortest path, mark its corresponding original string as a word containing a wrong character, with the similar word obtained by fuzzy matching as its corresponding correct word, thereby achieving automatic proofreading of Chinese non-multi-character word errors.
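The text does not spell out how maximising the path probability in step 31) becomes the shortest-path problem of step 32). One common way to connect the two, offered here as an assumption rather than as the patent's stated derivation, is to take negative logarithms of formula (4) as given in claim 1, so that the product becomes a sum of edge costs:

```latex
\mathrm{cost}(G) = -\log p(G_1)
  + \sum_{k=2}^{s} \Bigl[ -\log p(G_k \mid G_{k-1}) - \log \gamma(G_{k-1}, G') \Bigr],
\qquad
G^{*} = \arg\min_{G}\, \mathrm{cost}(G)
```

Because every probability and every penalty γ lies in (0, 1], each term of the sum is non-negative, which is the condition Dijkstra's algorithm needs. A fuzzy matching node with similarity below 1 adds a positive cost of −log γ, so it is chosen only when the bigram model favours it strongly enough.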

Preferably, the threshold t_w is 0.95.

Beneficial effects: The present invention proposes an automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation. The method can effectively identify and proofread "non-multi-character word errors" in Chinese text during word segmentation, and its use of double-array Trie trees allows fuzzy segmentation to be carried out quickly. Experiments show that the method achieves a recall of 75.9%, a precision of 85%, a correction rate of 62%, and a correction accuracy of 81.7%. The system responds quickly, its precision meets the needs of practical applications, and it is effective, accurate, and highly practical.

Brief description of the drawings

Fig. 1 is an example of a fuzzy-segmentation word graph provided by the present invention.

Detailed description

The present invention is further described below with reference to the accompanying drawing and an embodiment.

The automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation provided by the present invention performs automatic proofreading by means of fuzzy word segmentation and comprises the following steps:

1) Using double-array Trie structures built from a correct-word dictionary and a typo-word dictionary, perform exact segmentation of the Chinese sentence with the maximum matching method and build an exact-segmentation word graph; mark the results of exact segmentation based on the typo-word dictionary, and add to the word graph the correct word corresponding to each typo word in the Chinese sentence that matches the typo-word dictionary. Specifically:

First, exact segmentation is performed using the correct-word dictionary and the typo-word dictionary to build the exact-segmentation word graph, where:

S: the sentence to be segmented; Dic1: the correct-word dictionary; Dic2: the typo-word dictionary; pos1: the lookup position for the correct-word dictionary; pos2: the lookup position for the typo-word dictionary.

Step 11) build the double-array Trie structure DicTrie of the correct-word dictionary Dic1;

Step 12) build the double-array Trie structure TypoDicTrie of the typo-word dictionary Dic2, whose entries are pairs (TypoWord, CorrectWord), where TypoWord is a typo word and CorrectWord is the correct word corresponding to that typo word; for example, ("no original without reason", "gratuitous");

Step 13) based on the double-array Trie structure DicTrie of the correct-word dictionary, perform exact segmentation of the Chinese sentence with the maximum matching method, and add the segmented words to the word graph to build the exact-segmentation word graph, as shown in Fig. 1; in this embodiment, exact segmentation results are drawn as solid boxes in the word graph;

In this embodiment: starting from position pos1 (initially 0), perform a forward maximum search using the correct-word dictionary Dic1. If a correct entry word1 is found, add it to the exact-segmentation word graph and update pos1 to the position after word1; otherwise advance pos1 to the character following the current position. Repeat the search until pos1 reaches the end of sentence S;

Step 14) based on the double-array Trie structure TypoDicTrie of the typo-word dictionary, perform exact segmentation of the Chinese sentence with the maximum matching method and mark the sentence: each typo word TypoWord from the typo-word dictionary that is found in the sentence is marked as an erroneous word, together with its corresponding correct word CorrectWord; at the same time, the correct word CorrectWord corresponding to each typo word TypoWord in the sentence is added to the exact-segmentation word graph, as shown in Fig. 1; in this embodiment, these nodes are drawn as dashed boxes in the word graph.

In this embodiment: starting from position pos2 (initially 0), perform a forward maximum search using the typo-word dictionary Dic2. If a typo word TypoWord is found, add its corresponding correct entry CorrectWord to the exact-segmentation word graph, mark the typo word and its corresponding correct word in the sentence, and update pos2 to the position after TypoWord; otherwise advance pos2 to the character following the current position. Repeat the search until pos2 reaches the end of sentence S.
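The two scans with pos1 and pos2 can be sketched as follows. This is a minimal illustration, not the patented implementation: it looks entries up in a plain Python set and dict instead of double-array Tries, and the node layout, the function names `forward_max_match` and `build_exact_graph`, and the sample dictionaries are assumptions.

```python
def forward_max_match(sentence, entries, max_len):
    """Yield (start, end, word) for the longest entry found at each scan position."""
    pos = 0
    while pos < len(sentence):
        match = None
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(sentence) - pos), 0, -1):
            candidate = sentence[pos:pos + length]
            if candidate in entries:
                match = candidate
                break
        if match:
            yield (pos, pos + len(match), match)
            pos += len(match)      # continue after the matched word
        else:
            pos += 1               # nothing found: move to the next character


def build_exact_graph(sentence, dic1, dic2):
    """Collect word-graph nodes from the correct-word and typo-word scans."""
    nodes = []
    for start, end, word in forward_max_match(sentence, dic1, max(map(len, dic1))):
        nodes.append({"start": start, "end": end, "word": word,
                      "kind": "exact", "original": word})
    for start, end, typo in forward_max_match(sentence, dic2, max(map(len, dic2))):
        # The node carries the CorrectWord; the TypoWord is kept as the marked original.
        nodes.append({"start": start, "end": end, "word": dic2[typo],
                      "kind": "typo", "original": typo})
    return nodes


# Hypothetical toy dictionaries for illustration only.
dic1 = {"经常", "无", "原", "无故"}
dic2 = {"无原无故": "无缘无故"}
graph = build_exact_graph("经常无原无故", dic1, dic2)
```

Keeping the original typo string on each node means a later step can report both what was written and what it should be replaced with.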

For example, let the sentence be S = "why you often take off my expense living without reason without original".

After the exact segmentation of step 13), the result is as shown in Fig. 1: "you", "why", "frequent", "no", "original", "without reason", "button", "taking", "I", "", "work", and "expense" are the exact segmentation results, drawn as solid boxes in the word graph;

After the exact segmentation of step 14), the result is as shown in Fig. 1: because ("no original without reason", "gratuitous") is an entry in the typo-word dictionary, segmenting with it replaces "no", "original", "without reason" with "gratuitous", which is drawn as a dashed box in the word graph.

2) Apply a fuzzy matching method to the scattered strings in the exact segmentation result to obtain the similar words corresponding to the scattered strings and their similarities, and add these similar words to the exact-segmentation word graph to form a fuzzy-segmentation word graph. Specifically:

Traverse the characters of the Chinese sentence after the exact segmentation of step 1) and apply the fuzzy matching method to each character. The fuzzy matching method is carried out by single-character replacement, extra-character replacement, or missing-character replacement, where single-character replacement means replacement by a character of similar shape and/or replacement by a character of similar sound. Compute the similarity between each fuzzily matched string and its corresponding scattered string with the Chinese string similarity formula; determine whether the similarity is not less than the threshold t_w; take each fuzzily matched string whose similarity is not less than the threshold as a similar word of the corresponding scattered string, and add it to the exact-segmentation word graph as a fuzzy matching node to form the fuzzy-segmentation word graph; continue until all characters in the sentence have been traversed. The Chinese string similarity formula computes the similarity between the fuzzily matched string W₂ and its corresponding scattered string W₁ as:

For a Chinese non-multi-character word error automatic proofreading method in which the user's input method is a pinyin input method or another phonetic input method, the weight of pinyin similarity is α=1 and the weight of shape similarity is β=0.

For a Chinese non-multi-character word error automatic proofreading method used to correct OCR recognition errors, the weight of pinyin similarity is α=0 and the weight of shape similarity is β=1.

For a Chinese non-multi-character word error automatic proofreading method in which the user's input methods are a pinyin input method and a shape-based input method, the weight of pinyin similarity is α=0.5 and the weight of shape similarity is β=0.5.

Specifically, in this embodiment, this is implemented by the following steps:

Step 20) set the starting match position of the Chinese sentence to nCurr=0;

Step 21) read the current character at the current position nCurr of the Chinese sentence and perform fuzzy matching on it;

During fuzzy matching, the character at the current position may be handled by single-character replacement (replacement by a character of similar shape or similar sound), or the similarity may be computed allowing for an extra or a missing character;

Step 22) compute the similarity of the two strings using the Chinese string similarity formula, that is, the similarity between the fuzzily matched original string in the sentence and the matched word, also called the similarity between the fuzzily matched string and its corresponding scattered string. For example, in Fig. 1:

For "no original", similar characters such as "edge" are obtained for "original" by computing pinyin similarity and shape similarity, and the Chinese string similarity formula (1) is then used to compute the similarity between the Chinese string "no original" and the dictionary word "have no chance". In this embodiment the user's input methods are a pinyin input method and a shape-based input method, so α=β=0.5;

Step 23) if the similarity is less than the threshold t_w, set nCurr=nCurr+1 and go to step 21); otherwise go to step 24). Because Chinese characters are highly confusable, the threshold t_w is 0.95 in this embodiment; it can of course be adjusted for the actual application, for example to 0.90, 0.92, or 0.98;

Step 24) the similarity is not less than the threshold t_w, which yields a group of similar words and similarities (sFuzzyWord, next, sim), where sFuzzyWord is the matched word, next is the next node position to be read for fuzzy matching (next=nCurr+1), and sim is the similarity, computed between sFuzzyWord and the original string running from the start position nCurr to the position where the match ends. If next equals the length of the sentence, terminate; otherwise update nCurr to the next position to be read, next, and return to step 21);

Step 25) add the similar words whose fuzzy matching similarity is not less than the threshold t_w to the exact-segmentation word graph as fuzzy matching nodes, forming the fuzzy-segmentation word graph; as shown in Fig. 1, in this embodiment these nodes are drawn as dashed boxes in the word graph.
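Steps 20) to 25) can be sketched as a scan over start positions. The sketch below is illustrative only: it compares each candidate substring against every dictionary word instead of walking a Trie, the similarity function is a deliberately crude stand-in for the Chinese string similarity formula, and the names `fuzzy_match`, `toy_similarity`, `CONFUSABLE`, and `max_len` are assumptions.

```python
CONFUSABLE = {frozenset(("原", "缘"))}  # hypothetical pair of confusable characters


def toy_similarity(a, b):
    """Stand-in similarity: 1.0 if a and b differ only by confusable characters, else 0.0."""
    if len(a) != len(b):
        return 0.0
    for x, y in zip(a, b):
        if x != y and frozenset((x, y)) not in CONFUSABLE:
            return 0.0
    return 1.0


def fuzzy_match(sentence, dictionary, similarity, t_w=0.95, max_len=4):
    """Return fuzzy matching nodes (start, end, word, sim) with sim >= t_w."""
    nodes = []
    n_curr = 0                                        # step 20: starting position
    while n_curr < len(sentence):                     # step 21: read at n_curr
        last = min(len(sentence), n_curr + max_len)
        for end in range(n_curr + 1, last + 1):
            original = sentence[n_curr:end]
            if original in dictionary:
                continue                              # already an exact word
            for word in dictionary:
                sim = similarity(original, word)      # step 22: string similarity
                if sim >= t_w:                        # steps 23 and 24: threshold test
                    nodes.append((n_curr, end, word, sim))  # step 25: fuzzy node
        n_curr += 1                                   # move to the next position
    return nodes


fuzzy_nodes = fuzzy_match("无原无故", {"无缘", "无故", "无缘无故"}, toy_similarity)
```

Each returned tuple records the span of the original string, so the node can later be traced back to the characters it would replace.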

In the example given in Fig. 1 of this embodiment, the scattered characters "no", "original" are matched through sound-similarity fuzzy matching to the dictionary word "have no chance", and the scattered characters "work", "expense" are matched through shape-similarity and missing-character fuzzy matching to the dictionary words "telephone expenses" and "cost of living". These fuzzy matching nodes are added to the word graph and drawn as dashed boxes.

3) Based on a word bigram model combined with similarity, compute the shortest path of the fuzzy-segmentation word graph to obtain the final segmentation result, and mark the original strings corresponding to the fuzzy matching nodes in the segmentation result as detected errors, thereby achieving automatic proofreading of Chinese non-multi-character word errors. Specifically:

The present invention computes the probability of a segmentation using a word bigram model combined with similarity, applying a penalty to the results of fuzzy segmentation:

$$G^{*} = \arg\max_{G} P(G) = \arg\max_{G}\; p(G_1) \prod_{k=2}^{s} p(G_k \mid G_{k-1}) \cdot \gamma(G_{k-1}, G') \qquad (4)$$

Where G is a segmentation path in the word graph, G_k is the k-th word in the path, and s is the number of words in the path; γ(G_{k-1}, G') is the penalty applied during segmentation when the original string in the sentence is a scattered string corresponding to a fuzzy matching node: if the current word comes from exact segmentation, γ(G_{k-1}, G')=1; otherwise γ(G_{k-1}, G')=sim(G_{k-1}, G'), that is, the similarity between the fuzzily matched original string G' in the sentence and the matched word G_{k-1}, also called the similarity between the fuzzily matched string G_{k-1} and its corresponding scattered string G';

In the example given in Fig. 1 of this embodiment, the shortest path of the word graph generated by exact segmentation and fuzzy segmentation is solved using the bigram model combined with similarity, giving the path Path = "S", "you", "frequent", "why", "gratuitous", "button", "taking", "I", "", "telephone expenses". This path has the highest probability and is therefore the shortest path of the graph. The dashed-box nodes "gratuitous" and "telephone expenses" on the path are fuzzy matching nodes, so the original strings "no original without reason" and "work expense" in the original sentence contain wrong characters. Comparing them with the fuzzily matched correct words "gratuitous" and "telephone expenses" shows that "original" and "work" are the wrong characters in the sentence, and that "no original without reason" and "work expense" are non-multi-character word errors.
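The path search can be sketched with Dijkstra's algorithm over word-graph nodes. This is a minimal illustration under stated assumptions: edge costs are taken as −log p(word | previous word) − log γ, a start symbol `"<S>"` stands for the path's initial node, unseen bigrams fall back to a small constant `floor`, and the node tuples, bigram values, and the function name `best_path` are all hypothetical.

```python
import heapq
import math


def best_path(nodes, length, bigram, floor=1e-6):
    """Return (words, cost) of the cheapest path covering positions [0, length).

    nodes:  list of (start, end, word, gamma); gamma is 1.0 for exact nodes and
            the similarity for fuzzy matching nodes.
    bigram: dict mapping (previous_word, word) to a probability in (0, 1].
    """
    by_start = {}
    for idx, node in enumerate(nodes):
        by_start.setdefault(node[0], []).append(idx)

    # Heap items: (cost, unique tie-breaker, node index or -1 for <S>, words so far).
    heap = [(0.0, 0, -1, ())]
    settled = set()
    counter = 1
    while heap:
        cost, _, idx, path = heapq.heappop(heap)
        if idx in settled:
            continue
        settled.add(idx)
        if idx == -1:
            prev_word, pos = "<S>", 0
        else:
            prev_word, pos = nodes[idx][2], nodes[idx][1]
            if pos == length:
                return list(path), cost       # first settled end node is optimal
        for nxt in by_start.get(pos, []):
            _, _, word, gamma = nodes[nxt]
            p = bigram.get((prev_word, word), floor)
            step = -math.log(p) - math.log(gamma)   # non-negative edge cost
            heapq.heappush(heap, (cost + step, counter, nxt, path + (word,)))
            counter += 1
    return None, math.inf


# Hypothetical graph for a four-character sentence: three exact nodes and one fuzzy node.
nodes = [(0, 1, "无", 1.0), (1, 2, "原", 1.0), (2, 4, "无故", 1.0),
         (0, 4, "无缘无故", 0.96)]
bigram = {("<S>", "无"): 0.01, ("无", "原"): 0.0001, ("原", "无故"): 0.0001,
          ("<S>", "无缘无故"): 0.01}
words, cost = best_path(nodes, 4, bigram)
```

With these toy numbers the single fuzzy node is cheaper than the chain of three exact nodes, so it lies on the returned path; per step 33), its original span would then be marked as containing a wrong character.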

4. Testing

After repeated open tests, the experiment used a test corpus of 20,000 lines of sentences containing 664 non-multi-character word errors, which include wrong-character substitution errors, extra-character insertion errors, and missing-character deletion errors. The results show that the non-multi-character word error detection provided by the present invention reaches a recall of 75.9% and a precision of 85%, with a correction rate of 62% and a correction accuracy of 81.7%. This precision exceeds the prior art, meets the needs of practical application, and shows high validity and accuracy.

The above embodiment is only a preferred embodiment of the present invention and does not limit the invention. Any modification, equivalent substitution, or improvement made by a person skilled in the art without departing from the technical concept of the present invention falls within the scope of protection of the present invention.

Claims

1. An automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation, characterized in that automatic proofreading is performed by means of fuzzy word segmentation, comprising the following steps:

1) using double-array Trie structures built from a correct-word dictionary and a typo-word dictionary, performing exact segmentation of the Chinese sentence with the maximum matching method and building an exact-segmentation word graph; marking the results of exact segmentation based on the typo-word dictionary, and adding to the word graph the correct word corresponding to each typo word in the Chinese sentence that matches the typo-word dictionary, comprising the following steps:

Step 13) based on the double-array Trie structure DicTrie of the correct-word dictionary, performing exact segmentation of the Chinese sentence with the maximum matching method, and adding the segmented words to the word graph to build the exact-segmentation word graph;

Step 14) based on the double-array Trie structure TypoDicTrie of the typo-word dictionary, performing exact segmentation of the Chinese sentence with the maximum matching method and marking the sentence: each typo word TypoWord from the typo-word dictionary that is found in the sentence is marked as an erroneous word, together with its corresponding correct word CorrectWord; at the same time, the correct word CorrectWord corresponding to each typo word TypoWord in the sentence is added to the exact-segmentation word graph;

2) applying a fuzzy matching method to the scattered strings in the exact segmentation result to obtain the similar words corresponding to the scattered strings and their similarities, and adding these similar words to the exact-segmentation word graph to form a fuzzy-segmentation word graph, specifically comprising:

traversing the characters of the Chinese sentence after the exact segmentation of step 1) and applying the fuzzy matching method to each character; computing the similarity between each fuzzily matched string and its corresponding scattered string; determining whether the similarity is not less than the threshold t_w; taking each fuzzily matched string whose similarity is not less than the threshold as a similar word of the corresponding scattered string, and adding it to the exact-segmentation word graph as a fuzzy matching node to form the fuzzy-segmentation word graph, until all characters in the sentence have been traversed;

Where: Sim(W₁, W₂) is the similarity between scattered string W₁ and string W₂; scattered string W₁ = c₁c₂…c_n and string W₂ = d₁d₂…d_m, with n and m the numbers of characters in W₁ and W₂ respectively; max() takes the maximum; editdis(W₁, W₂) is the distance function of the two strings:

Where: sim(c_i, d_j) is the similarity between Chinese character c_i and Chinese character d_j, 1≤i≤n, 1≤j≤m; PSim(c_i, d_j) is the pinyin similarity between c_i and d_j; SSim(c_i, d_j) is the shape similarity between c_i and d_j; α and β are the weights of pinyin similarity and shape similarity respectively, with α+β=1;

3) based on a word bigram model combined with similarity, computing the shortest path of the fuzzy-segmentation word graph to obtain the final segmentation result, and marking the original strings corresponding to the fuzzy matching nodes in the segmentation result as detected errors, thereby achieving automatic proofreading of Chinese non-multi-character word errors, comprising the following steps:

Step 31) from the fuzzy-segmentation word graph obtained by the exact segmentation of step 1) and the fuzzy matching of step 2), obtaining multiple paths, and, using the similar words corresponding to the scattered strings and their similarities obtained in step 2), computing the probability of each segmentation sequence with the bigram model:

$$G^{*} = \arg\max_{G} P(G) = \arg\max_{G}\; p(G_1) \prod_{k=2}^{s} p(G_k \mid G_{k-1}) \cdot \gamma(G_{k-1}, G') \qquad (4)$$

Where G is a segmentation path in the word graph, G_k is the k-th word in the path, and s is the number of words in the path; γ(G_{k-1}, G') is the penalty applied during segmentation when the original string in the sentence is a scattered string corresponding to a fuzzy matching node: γ(G_{k-1}, G')=1 when the current word comes from exact segmentation, and otherwise γ(G_{k-1}, G')=sim(G_{k-1}, G'), that is, the similarity between the fuzzily matched original string G' in the sentence and the matched word G_{k-1}, also called the similarity between the fuzzily matched string G_{k-1} and its corresponding scattered string G';

Step 32) on the fuzzy-segmentation word graph obtained in step 31), solving the shortest path using Dijkstra's algorithm for graphs, thereby obtaining the final segmentation result;

Step 33) for each fuzzy matching node on the shortest path, marking its corresponding original string as a word containing a wrong character, with the similar word obtained by fuzzy matching as its corresponding correct word, thereby achieving automatic proofreading of Chinese non-multi-character word errors.

2. The automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation according to claim 1, characterized in that: the fuzzy matching method is carried out by single-character replacement, extra-character replacement, or missing-character replacement, where single-character replacement means replacement by a character of similar shape and/or replacement by a character of similar sound.

3. The automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation according to claim 1, characterized in that: for a Chinese non-multi-character word error automatic proofreading method in which the user's input method is a pinyin input method or another phonetic input method, the weight of pinyin similarity is α=1 and the weight of shape similarity is β=0.

4. The automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation according to claim 1, characterized in that: for a Chinese non-multi-character word error automatic proofreading method used to correct OCR recognition errors, the weight of pinyin similarity is α=0 and the weight of shape similarity is β=1.

5. The automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation according to claim 1, characterized in that: for a Chinese non-multi-character word error automatic proofreading method in which the user's input methods are a pinyin input method and a shape-based input method, the weight of pinyin similarity is α=0.5 and the weight of shape similarity is β=0.5.

6. The automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation according to claim 1, characterized in that the threshold t_w is 0.95.