CN104991889B - An automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation - Google Patents

An automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation

Publication number
CN104991889B
CN104991889B CN201510361877.8A CN201510361877A CN104991889B CN 104991889 B CN104991889 B CN 104991889B CN 201510361877 A CN201510361877 A CN 201510361877A CN 104991889 B CN104991889 B CN 104991889B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510361877.8A
Other languages
Chinese (zh)
Other versions
CN104991889A (en)
Inventor
刘亮亮
吴健康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Southern Power Grid Internet Service Co ltd
Jingchuang United (Beijing) Intellectual Property Service Co.,Ltd.
Original Assignee
Jiangsu University of Science and Technology
Application filed by Jiangsu University of Science and Technology
Priority to CN201510361877.8A
Publication of CN104991889A
Application granted
Publication of CN104991889B

Landscapes

  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation. The method first performs exact segmentation using a correct-word dictionary and a typo-word dictionary and generates a word graph. It then computes the similarity of Chinese strings with a fuzzy matching algorithm, performs fuzzy matching on the scattered strings (runs of single characters) left by exact segmentation, and adds the fuzzy matching results to the word graph to form a fuzzy word graph. Finally, it computes the shortest path through the fuzzy word graph using a word bigram model combined with similarity, thereby automatically proofreading Chinese non-multi-character word errors. The method responds quickly, its precision meets the needs of practical applications, and it is effective and accurate.

Description

An automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation
Technical field
The present invention relates to natural language processing within the field of artificial intelligence and computing, and in particular to the automatic proofreading of Chinese text.
Background
With the rapid development of information processing technology and the internet, traditional text work has been almost entirely replaced by computers. Electronic texts such as e-books, electronic newspapers, e-mail, office documents, blogs and microblogs have become part of daily life, but errors in text are also increasingly common, which poses a great challenge for proofreading. Traditional manual proofreading is inefficient, labor-intensive and slow, and clearly cannot meet the demand for text proofreading.
Automatic text proofreading is one of the main applications of natural language processing and is also a hard problem in natural language understanding. As the technology has developed, automatic proofreading of English text has achieved very good results and has been commercialized. Compared with English, automatic proofreading of Chinese text faces the following problems:
1) Chinese text has no "non-word errors" of the English kind, that is, words absent from the dictionary that can be found by dictionary lookup; every Chinese character in a Chinese text appears in the dictionary.
2) Chinese proofreading must first perform word segmentation. If a word contains a wrong character, segmentation splits it into a scattered string of single characters. This is a non-multi-character word error, and it makes error detection in Chinese text difficult.
3) A scattered string of single characters in Chinese does not necessarily contain a wrong character, because individual Chinese characters very often form words on their own;
4) Besides non-multi-character word errors, a word in Chinese is often miswritten as another word that is in the dictionary. This kind of error is called a real-word error, and it is also a difficulty in automatic proofreading of Chinese text;
To address these problems, the present invention proposes and implements a method for the automatic detection and automatic proofreading of Chinese non-multi-character word errors.
Summary of the invention
Object of the invention: to overcome the deficiencies of the prior art, the present invention provides an automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation.
Technical solution: to solve the above technical problem, the present invention provides an automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation. The method performs automatic proofreading by means of fuzzy word segmentation and comprises the following steps:
1) Using double-array Trie structures built from a correct-word dictionary and a typo-word dictionary, perform exact segmentation of the Chinese sentence with the maximum matching method and build an exact-segmentation word graph; mark the results of exact segmentation based on the typo-word dictionary, and add to the word graph the correct word corresponding to each typo word matched in the Chinese sentence;
2) Apply the fuzzy matching method to the scattered strings in the exact segmentation result to obtain the similar words corresponding to each scattered string and their similarities; add these similar words to the exact-segmentation word graph to form a fuzzy-segmentation word graph;
3) Using a word bigram model combined with similarity, compute the shortest path through the fuzzy-segmentation word graph to obtain the final segmentation; mark the original string corresponding to each fuzzy matching node in the segmentation result as a detected error, thereby achieving automatic proofreading of Chinese non-multi-character word errors.
Preferably, step 1) comprises the following steps:
Step 11) build the double-array Trie structure DicTrie for the correct-word dictionary;
Step 12) build the double-array Trie structure TypoDicTrie for the typo-word dictionary, whose entries are pairs (TypoWord, CorrectWord), where TypoWord is a typo word and CorrectWord is the correct word corresponding to that typo word;
Step 13) based on the double-array Trie structure DicTrie of the correct-word dictionary, perform exact segmentation of the Chinese sentence using the maximum matching method, and add the resulting words to the word graph to build the exact-segmentation word graph;
Step 14) based on the double-array Trie structure TypoDicTrie of the typo-word dictionary, perform exact segmentation of the Chinese sentence using the maximum matching method and mark the sentence: each typo word TypoWord found in the sentence is marked as an erroneous word together with its corresponding correct word CorrectWord; at the same time, the CorrectWord corresponding to each TypoWord in the sentence is added to the exact-segmentation word graph.
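The forward maximum matching of step 13) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a nested-dictionary trie stands in for the double-array Trie, the names `Trie`, `longest_prefix` and `exact_segment` are hypothetical, and emitting an unmatched character as a one-character edge is an assumption made so that scattered strings remain visible in the word graph.

```python
class Trie:
    """Nested-dict trie; a simple stand-in for a double-array Trie."""

    END = "<end>"  # marker key; cannot clash with a single character

    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[self.END] = True

    def longest_prefix(self, text, start):
        """Length of the longest dictionary word starting at text[start] (0 if none)."""
        node, best = self.root, 0
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if self.END in node:
                best = i - start + 1
        return best


def exact_segment(sentence, trie):
    """Forward maximum matching; returns word-graph edges (start, end, word)."""
    edges, pos = [], 0
    while pos < len(sentence):
        # Assumption: a character with no dictionary match becomes a 1-char edge.
        length = trie.longest_prefix(sentence, pos) or 1
        edges.append((pos, pos + length, sentence[pos:pos + length]))
        pos += length
    return edges
```

Edges are keyed by character positions, so candidates added later by the typo dictionary or by fuzzy matching can share the same start and end positions as the exact words they compete with.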
Preferably, step 2) comprises:
Traverse the characters of the Chinese sentence after the exact segmentation of step 1), and perform fuzzy matching at each character using the fuzzy matching method; compute the similarity between each fuzzily matched string and its corresponding scattered string; check whether the similarity is not less than the threshold $t_w$; each fuzzily matched string whose similarity is not less than the threshold is taken as a similar word of the corresponding scattered string and added to the exact-segmentation word graph as a fuzzy matching node, forming the fuzzy-segmentation word graph; continue until all characters in the sentence have been traversed;
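The similarity between the fuzzily matched string $W_2$ and its corresponding scattered string $W_1$ is computed as:
$$\mathrm{Sim}(W_1, W_2) = 1 - \frac{\mathrm{editdis}(W_1, W_2)}{\max(m, n)} \qquad (1)$$
where $\mathrm{Sim}(W_1, W_2)$ is the similarity between the scattered string $W_1$ and the string $W_2$; $W_1 = c_1 c_2 \ldots c_n$ and $W_2 = d_1 d_2 \ldots d_m$, with $n$ and $m$ the numbers of characters in $W_1$ and $W_2$ respectively; $\max(\cdot)$ denotes the maximum; and $\mathrm{editdis}(W_1, W_2)$ is the distance function of the two strings:
$$\mathrm{editdis}(W_1, W_2) = \min \begin{cases} \mathrm{editdis}(c_2 \ldots c_n,\ d_1 \ldots d_m) + 1 \\ \mathrm{editdis}(c_1 \ldots c_n,\ d_2 \ldots d_m) + 1 \\ \mathrm{editdis}(c_2 \ldots c_n,\ d_2 \ldots d_m) + (1 - \mathrm{sim}(c_1, d_1)) \end{cases} \qquad (2)$$
where $\mathrm{sim}(c_1, d_1)$ is the similarity between the Chinese characters $c_1$ and $d_1$, computed by the following equation:
$$\mathrm{sim}(c_i, d_j) = \alpha \cdot \mathrm{PSim}(c_i, d_j) + \beta \cdot \mathrm{SSim}(c_i, d_j)$$
where $\mathrm{sim}(c_i, d_j)$ is the similarity between characters $c_i$ and $d_j$, $1 \le i \le n$, $1 \le j \le m$; $\mathrm{PSim}(c_i, d_j)$ is their pinyin similarity; $\mathrm{SSim}(c_i, d_j)$ is their shape similarity; and α and β are the weights of pinyin similarity and shape similarity respectively, with α + β = 1.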
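The string similarity described above can be sketched in code. This is a minimal sketch, not the patent's implementation. It assumes the distance recurrence takes the minimum over its three cases (as in a standard edit distance), that the distance to an empty string is the length of the other string, and that identical characters have similarity 1; none of these details is spelled out in the text. The function names and the lookup tables `psim` and `ssim` are hypothetical.

```python
from functools import lru_cache


def make_char_sim(psim, ssim, alpha=0.5, beta=0.5):
    """Build sim(c, d) = alpha * PSim(c, d) + beta * SSim(c, d) from lookup tables.

    psim and ssim map (c, d) pairs to scores in [0, 1]; missing pairs score 0.
    """
    def sim(c, d):
        if c == d:
            return 1.0  # assumption: a character is fully similar to itself
        return alpha * psim.get((c, d), 0.0) + beta * ssim.get((c, d), 0.0)
    return sim


def editdis(w1, w2, sim):
    """Weighted edit distance: insert/delete cost 1, substitution cost 1 - sim."""
    @lru_cache(maxsize=None)
    def go(i, j):
        if i == len(w1):              # assumed base case: insert the rest of w2
            return float(len(w2) - j)
        if j == len(w2):              # assumed base case: delete the rest of w1
            return float(len(w1) - i)
        return min(
            go(i + 1, j) + 1,
            go(i, j + 1) + 1,
            go(i + 1, j + 1) + (1 - sim(w1[i], w2[j])),
        )
    return go(0, 0)


def string_sim(w1, w2, sim):
    """Sim(W1, W2) = 1 - editdis(W1, W2) / max(m, n)."""
    return 1 - editdis(w1, w2, sim) / max(len(w1), len(w2))
```

With made-up scores PSim = 1.0 and SSim = 0.8 for one character pair and α = β = 0.5, the character similarity is 0.9, and two two-character strings that differ only in that pair score 1 - 0.1/2 = 0.95.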
Preferably, the fuzzy matching method operates by single-character replacement, extra-character replacement or missing-character replacement, where single-character replacement is replacement by a character of similar shape and/or a character of similar pronunciation.
Preferably, when the method is used to proofread Chinese text entered with a pinyin or phonetic input method, the weight of pinyin similarity is α = 1 and the weight of shape similarity is β = 0.
Preferably, when the method is used to correct OCR recognition output, the weight of pinyin similarity is α = 0 and the weight of shape similarity is β = 1.
Preferably, when the method is used to proofread Chinese text entered with both pinyin and shape-based input methods, the weight of pinyin similarity is α = 0.5 and the weight of shape similarity is β = 0.5.
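The three weight settings can be collected into a small lookup, shown here as an illustrative sketch. The scenario keys and the function name `char_similarity` are hypothetical; only the (α, β) values come from the text.

```python
# (alpha, beta) = (pinyin weight, shape weight) for each usage scenario.
WEIGHTS = {
    "pinyin_input": (1.0, 0.0),   # text typed with a pinyin/phonetic input method
    "ocr": (0.0, 1.0),            # text produced by OCR
    "mixed_input": (0.5, 0.5),    # pinyin and shape-based input methods both in use
}


def char_similarity(pinyin_sim, shape_sim, scenario):
    """Weighted character similarity: alpha * PSim + beta * SSim."""
    alpha, beta = WEIGHTS[scenario]
    assert abs(alpha + beta - 1.0) < 1e-9  # the text requires alpha + beta = 1
    return alpha * pinyin_sim + beta * shape_sim
```

For example, a pair with pinyin similarity 0.8 and shape similarity 0.4 scores 0.8 for pinyin input, 0.4 for OCR and 0.6 for mixed input, so the same pair can pass or fail a fixed threshold depending on how the text was produced.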
Preferably, step 3) comprises the following steps:
Step 31) from the fuzzy-segmentation word graph obtained after the exact segmentation of step 1) and the fuzzy matching of step 2), obtain multiple candidate paths; using the similar words corresponding to the scattered strings and their similarities obtained in step 2), compute the probability of each segmentation sequence with the bigram model:
where $G$ is a segmentation path in the word graph, $G_k$ is the k-th word in the path, and $s$ is the number of words in the path; $\gamma(G_{k-1}, G')$ is the penalty applied during segmentation when the original string is a scattered string corresponding to a fuzzy matching node: $\gamma(G_{k-1}, G') = 1$ when the current word comes from exact segmentation, otherwise $\gamma(G_{k-1}, G') = \mathrm{sim}(G_{k-1}, G')$, that is, the similarity between the fuzzily matched original string $G'$ in the sentence and the matched word $G_{k-1}$, also described as the similarity between the fuzzily matched string $G_{k-1}$ and its corresponding scattered string $G'$;
Step 32) on the fuzzy-segmentation word graph obtained in step 31), solve for the shortest path using Dijkstra's algorithm to obtain the final segmentation;
Step 33) for each fuzzy matching node on the shortest path, mark its original string as a word containing a wrong character and take the similar word obtained by fuzzy matching as the corresponding correct word, thereby achieving automatic proofreading of Chinese non-multi-character word errors.
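Steps 31) to 33) can be sketched as a shortest-path search. The probability formula itself is not reproduced in this text, so the edge cost below, the negative log of (bigram probability × γ), is an assumption chosen so that the most probable path is the shortest one. The function name `best_path`, the bigram table and the `floor` value for unseen bigrams are hypothetical.

```python
import heapq
import math


def best_path(n, edges, bigram, floor=1e-6):
    """Dijkstra over a word graph whose nodes are character positions 0..n.

    edges:  list of (start, end, word, gamma); gamma is 1.0 for exact words
            and the string similarity for fuzzy matching nodes.
    bigram: dict mapping (previous_word, word) to a probability in (0, 1].
    Returns the best path as a list of (word, gamma), or None if no path exists.
    """
    outgoing = {}
    for start, end, word, gamma in edges:
        outgoing.setdefault(start, []).append((end, word, gamma))

    # A state is (position, last word, gamma of last word); the bigram model
    # needs the previous word, so position alone is not enough.
    origin = (0, "<S>", 1.0)
    dist = {origin: 0.0}
    back = {}
    heap = [(0.0, origin)]
    while heap:
        d, state = heapq.heappop(heap)
        if d > dist.get(state, math.inf):
            continue  # stale heap entry
        pos, prev, _ = state
        if pos == n:
            path = []
            while state != origin:
                path.append((state[1], state[2]))
                state = back[state]
            return path[::-1]
        for end, word, gamma in outgoing.get(pos, []):
            # Probabilities and gamma are at most 1, so every cost is >= 0,
            # which is what Dijkstra's algorithm requires.
            cost = -math.log(bigram.get((prev, word), floor) * gamma)
            nxt = (end, word, gamma)
            if d + cost < dist.get(nxt, math.inf):
                dist[nxt] = d + cost
                back[nxt] = state
                heapq.heappush(heap, (d + cost, nxt))
    return None
```

Any word on the returned path with γ below 1 is a fuzzy matching node, so its span in the original sentence is what step 33) would flag as containing a wrong character.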
Preferably, the threshold $t_w$ is 0.95.
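As a rough check of what this threshold admits, consider two strings of equal length n that differ in a single character pair (c, d). The derivation below assumes that the distance recurrence takes the minimum over its three cases and that identical characters have similarity 1; these details are assumptions rather than statements from the text.

```latex
% One substituted character, all other characters identical:
\mathrm{editdis}(W_1, W_2) = 1 - \mathrm{sim}(c, d), \qquad
\mathrm{Sim}(W_1, W_2) = 1 - \frac{1 - \mathrm{sim}(c, d)}{n}

% Condition for passing the threshold t_w = 0.95:
\mathrm{Sim}(W_1, W_2) \ge 0.95 \iff \mathrm{sim}(c, d) \ge 1 - 0.05\,n

% n = 2: sim(c, d) >= 0.90        n = 4: sim(c, d) >= 0.80
```

Under these assumptions a two-character candidate needs a character similarity of at least 0.90 to be accepted, while a four-character candidate needs only 0.80, so longer words tolerate weaker character matches.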
Beneficial effects: the present invention proposes an automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation. The method can effectively identify and proofread non-multi-character word errors in Chinese text during word segmentation, and the use of double-array Trie structures allows fuzzy segmentation to be performed quickly. Experiments show that the method achieves a recall of 75.9% and a precision of 85% for error detection, with a correction rate of 62% and a correction accuracy of 81.7%. The system responds quickly, its precision meets the needs of practical applications, and it is effective, accurate and practical.
Brief description of the drawings
Fig. 1 is an example of a fuzzy-segmentation word graph provided by the present invention.
Detailed description
The present invention is further described below with reference to the accompanying drawing and an embodiment.
The automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation provided by the present invention performs automatic proofreading by means of fuzzy word segmentation and comprises the following steps:
1) Using double-array Trie structures built from a correct-word dictionary and a typo-word dictionary, perform exact segmentation of the Chinese sentence with the maximum matching method and build an exact-segmentation word graph; mark the results of exact segmentation based on the typo-word dictionary, and add to the word graph the correct word corresponding to each typo word matched in the Chinese sentence. Specifically:
Exact segmentation is first performed using the correct-word dictionary and the typo-word dictionary to build the exact-segmentation word graph, where:
S: the sentence to be segmented; Dic1: the correct-word dictionary; Dic2: the typo-word dictionary; pos1: the lookup position for the correct-word dictionary; pos2: the lookup position for the typo-word dictionary.
Step 11) build the double-array Trie structure DicTrie for the correct-word dictionary Dic1;
Step 12) build the double-array Trie structure TypoDicTrie for the typo-word dictionary Dic2, whose entries are pairs (TypoWord, CorrectWord), where TypoWord is a typo word and CorrectWord is the correct word corresponding to that typo word; for example ("no former without reason", "gratuitous");
Step 13) based on the double-array Trie structure DicTrie of the correct-word dictionary, perform exact segmentation of the Chinese sentence using the maximum matching method, and add the resulting words to the word graph to build the exact-segmentation word graph. As shown in Fig. 1, this embodiment draws exact segmentation results as solid boxes in the word graph;
In this embodiment: starting from position pos1 (initially 0), perform a forward maximum-match search using the correct-word dictionary Dic1. If a correct entry word1 is found, add it to the exact-segmentation word graph and move pos1 to the position after word1; otherwise move pos1 to the next character after the current position. Repeat the search until pos1 reaches the end of sentence S;
Step 14) based on the double-array Trie structure TypoDicTrie of the typo-word dictionary, perform exact segmentation of the Chinese sentence using the maximum matching method and mark the sentence: each typo word TypoWord found in the sentence is marked as an erroneous word together with its corresponding correct word CorrectWord; at the same time, the CorrectWord corresponding to each TypoWord in the sentence is added to the exact-segmentation word graph. As shown in Fig. 1, this embodiment draws these as dashed boxes in the word graph.
In this embodiment: starting from position pos2 (initially 0), perform a forward maximum-match search using the typo-word dictionary Dic2. If a typo word TypoWord is found, add its corresponding correct entry CorrectWord to the exact-segmentation word graph, mark the typo word and its corresponding correct word in the sentence, and move pos2 to the position after TypoWord; otherwise move pos2 to the next character after the current position. Repeat the search until pos2 reaches the end of sentence S.
For example, take the sentence S = "why you often take off my expense living without reason without original".
After the exact segmentation of step 13), the result is as shown in Fig. 1: "you", "why", "frequent", "None", "original", "without reason", "button", "taking", "I", " ", "work", "expense" are the exact segmentation results, drawn as solid boxes in the word graph;
After the exact segmentation of step 14), the result is as shown in Fig. 1: because ("no former without reason", "gratuitous") is an entry in the typo-word dictionary, segmenting with it replaces "None", "original", "without reason" with "gratuitous", which is drawn as a dashed box in the word graph.
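The typo-dictionary pass of step 14) can be sketched as a second forward maximum-match scan. This is an illustrative sketch only: a plain `dict` replaces TypoDicTrie, the function name `typo_pass` is hypothetical, and the Chinese strings are illustrative stand-ins, since the text gives its example only in translation.

```python
def typo_pass(sentence, typo_dict):
    """Forward maximum matching against a {TypoWord: CorrectWord} dictionary.

    Returns (start, end, typo_word, correct_word) for every typo word found;
    each hit is both an error mark on the sentence and a candidate edge
    (start, end, correct_word) for the word graph.
    """
    max_len = max(map(len, typo_dict), default=0)
    hits, pos = [], 0
    while pos < len(sentence):
        for length in range(min(max_len, len(sentence) - pos), 0, -1):
            candidate = sentence[pos:pos + length]
            if candidate in typo_dict:
                hits.append((pos, pos + length, candidate, typo_dict[candidate]))
                pos += length      # continue after the matched typo word
                break
        else:
            pos += 1               # nothing matched here: try the next character
    return hits
```

Because a hit carries the span of the typo word, the correct word can be drawn over exactly the positions occupied by the single-character exact nodes it replaces.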
2) Apply the fuzzy matching method to the scattered strings in the exact segmentation result to obtain the similar words corresponding to each scattered string and their similarities; add these similar words to the exact-segmentation word graph to form a fuzzy-segmentation word graph. Specifically:
Traverse the characters of the Chinese sentence after the exact segmentation of step 1), and perform fuzzy matching at each character using the fuzzy matching method. The fuzzy matching method operates by single-character replacement, extra-character replacement or missing-character replacement, where single-character replacement is replacement by a character of similar shape and/or a character of similar pronunciation. Compute the similarity between each fuzzily matched string and its corresponding scattered string using the Chinese string similarity formula, and check whether the similarity is not less than the threshold $t_w$. Each fuzzily matched string whose similarity is not less than the threshold is taken as a similar word of the corresponding scattered string and added to the exact-segmentation word graph as a fuzzy matching node, forming the fuzzy-segmentation word graph; continue until all characters in the sentence have been traversed. The Chinese string similarity formula gives the similarity between the fuzzily matched string $W_2$ and its corresponding scattered string $W_1$ as:
$$\mathrm{Sim}(W_1, W_2) = 1 - \frac{\mathrm{editdis}(W_1, W_2)}{\max(m, n)} \qquad (1)$$
where $\mathrm{Sim}(W_1, W_2)$ is the similarity between the scattered string $W_1$ and the string $W_2$; $W_1 = c_1 c_2 \ldots c_n$ and $W_2 = d_1 d_2 \ldots d_m$, with $n$ and $m$ the numbers of characters in $W_1$ and $W_2$ respectively; $\max(\cdot)$ denotes the maximum; and $\mathrm{editdis}(W_1, W_2)$ is the distance function of the two strings:
$$\mathrm{editdis}(W_1, W_2) = \min \begin{cases} \mathrm{editdis}(c_2 \ldots c_n,\ d_1 \ldots d_m) + 1 \\ \mathrm{editdis}(c_1 \ldots c_n,\ d_2 \ldots d_m) + 1 \\ \mathrm{editdis}(c_2 \ldots c_n,\ d_2 \ldots d_m) + (1 - \mathrm{sim}(c_1, d_1)) \end{cases} \qquad (2)$$
where $\mathrm{sim}(c_1, d_1)$ is the similarity between the Chinese characters $c_1$ and $d_1$, computed by the following equation:
$$\mathrm{sim}(c_i, d_j) = \alpha \cdot \mathrm{PSim}(c_i, d_j) + \beta \cdot \mathrm{SSim}(c_i, d_j)$$
where $\mathrm{sim}(c_i, d_j)$ is the similarity between characters $c_i$ and $d_j$, $1 \le i \le n$, $1 \le j \le m$; $\mathrm{PSim}(c_i, d_j)$ is their pinyin similarity; $\mathrm{SSim}(c_i, d_j)$ is their shape similarity; and α and β are the weights of pinyin similarity and shape similarity respectively, with α + β = 1.
When the method is used to proofread Chinese text entered with a pinyin or phonetic input method, the weight of pinyin similarity is α = 1 and the weight of shape similarity is β = 0.
When the method is used to correct OCR recognition output, the weight of pinyin similarity is α = 0 and the weight of shape similarity is β = 1.
When the method is used to proofread Chinese text entered with both pinyin and shape-based input methods, the weight of pinyin similarity is α = 0.5 and the weight of shape similarity is β = 0.5.
Specifically, in this embodiment this is implemented by the following steps:
Step 20) set the starting match position in the Chinese sentence to nCurr = 0;
Step 21) read the current character at position nCurr of the Chinese sentence and perform fuzzy matching on it;
During fuzzy matching, the character at the current position may be treated as a single-character replacement (replacement by a character of similar shape or similar pronunciation), or as an extra or missing character, when computing similarity;
Step 22) compute the similarity of the two strings using the Chinese string similarity formula, that is, the similarity between the fuzzily matched original string in the sentence and the matched word, also described as the similarity between the fuzzily matched string and its corresponding scattered string. For example, in Fig. 1:
For "no original", computing pinyin similarity and shape similarity for "original" yields similar characters such as "edge"; the Chinese string similarity formula (1) is then used to compute the similarity between the string "no original" and the dictionary word "having no chance". In this embodiment the user input methods are pinyin and shape-based, so α = β = 0.5;
Step 23) if the similarity is less than the threshold $t_w$, set nCurr = nCurr + 1 and go to step 21); otherwise go to step 24). Because Chinese characters are highly confusable, the threshold $t_w$ is 0.95 in this embodiment; it can of course be adjusted for the application, for example to 0.90, 0.92 or 0.98;
Step 24) when the similarity is not less than the threshold $t_w$, a triple (sFuzzyWord, next, sim) of similar word and similarity is obtained, where sFuzzyWord is the matched word, next is the position of the next node to be read for fuzzy matching (next = nCurr + 1), and sim is the similarity computed between sFuzzyWord and the original string spanning from the start position nCurr to the end of the match. If next equals the length of the sentence, stop; otherwise set nCurr to the next position to read, next, and return to step 21);
Step 25) add the similar words whose fuzzy matching similarity is not less than the threshold $t_w$ to the exact-segmentation word graph as fuzzy matching nodes, forming the fuzzy-segmentation word graph; as shown in Fig. 1, this embodiment draws them as dashed boxes in the word graph.
In the example of Fig. 1, the scattered string "None", "original" is matched by sound-similarity fuzzy matching to the dictionary word "has no chance", and the scattered string "work", "expense" is matched by shape-similarity and missing-character fuzzy matching to the dictionary words "telephone expenses" and "cost of living". These fuzzy matching nodes are added to the word graph and drawn as dashed boxes.
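The single-character replacement case of steps 21) to 25) can be sketched as follows. This is a minimal sketch under stated assumptions: `confusable` is a hypothetical table of similar-sounding or similar-looking characters, `sim` is any character similarity function, and extra-character and missing-character matching are left out. For one substituted character in a string of length n, the score 1 - (1 - sim)/n is what formula (1) gives when the distance takes the minimum over its cases and identical characters have similarity 1.

```python
def fuzzy_candidates(sentence, start, dictionary, confusable, sim,
                     max_len=4, threshold=0.95):
    """Single-character-substitution candidates for spans starting at `start`.

    Returns (sFuzzyWord, next, sim) triples: the matched dictionary word,
    the position just after the matched span, and the string similarity.
    """
    results = []
    for n in range(2, max_len + 1):
        original = sentence[start:start + n]
        if len(original) < n:
            break  # the span would run past the end of the sentence
        for i, ch in enumerate(original):
            for alt in confusable.get(ch, ()):
                candidate = original[:i] + alt + original[i + 1:]
                if candidate not in dictionary:
                    continue
                score = 1 - (1 - sim(ch, alt)) / n
                if score >= threshold:
                    results.append((candidate, start + n, score))
    return results
```

Each returned triple corresponds to one dashed-box node: the span from `start` to `next` in the sentence is the original string, and the matched word is the proposed correction.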
3) Using a word bigram model combined with similarity, compute the shortest path through the fuzzy-segmentation word graph to obtain the final segmentation; mark the original string corresponding to each fuzzy matching node in the segmentation result as a detected error, thereby achieving automatic proofreading of Chinese non-multi-character word errors. Specifically:
Step 31) from the fuzzy-segmentation word graph obtained after the exact segmentation of step 1) and the fuzzy matching of step 2), obtain multiple candidate paths; using the similar words corresponding to the scattered strings and their similarities obtained in step 2), compute the probability of each segmentation sequence with the bigram model:
The present invention computes the probability of a segmentation using a word bigram model combined with similarity, applying a penalty to fuzzily segmented results: where $G$ is a segmentation path in the word graph, $G_k$ is the k-th word in the path, and $s$ is the number of words in the path; $\gamma(G_{k-1}, G')$ is the penalty applied during segmentation when the original string is a scattered string corresponding to a fuzzy matching node. If the current word comes from exact segmentation, $\gamma(G_{k-1}, G') = 1$; otherwise $\gamma(G_{k-1}, G') = \mathrm{sim}(G_{k-1}, G')$, that is, the similarity between the fuzzily matched original string $G'$ in the sentence and the matched word $G_{k-1}$, also described as the similarity between the fuzzily matched string $G_{k-1}$ and its corresponding scattered string $G'$;
Step 32) on the fuzzy-segmentation word graph obtained in step 31), solve for the shortest path using Dijkstra's algorithm to obtain the final segmentation;
Step 33) for each fuzzy matching node on the shortest path, mark its original string as a word containing a wrong character and take the similar word obtained by fuzzy matching as the corresponding correct word, thereby achieving automatic proofreading of Chinese non-multi-character word errors.
In the example of Fig. 1, the shortest path of the word graph generated by exact and fuzzy segmentation is solved using the bigram model combined with similarity, giving Path = "S", "you", "frequent", "why", "gratuitous", "button", "taking", "I", " ", "telephone expenses". This path has the highest probability and is the shortest path of the graph. The dashed-box nodes "gratuitous" and "telephone expenses" on this path are fuzzy matching nodes, so the original strings "no former without reason" and "work takes" in the original sentence contain wrong characters. Comparing them with the fuzzily matched correct words "gratuitous" and "telephone expenses" shows that "original" and "work" are the wrong characters in the sentence, and that "no former without reason" and "expense living" are non-multi-character word errors.
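The final comparison, locating the wrong characters by setting the original string against the matched correct word, can be sketched with Python's standard `difflib`. The function name `diff_chars` is hypothetical, and the Chinese strings in the usage note are illustrative stand-ins because the text gives its example only in translation.

```python
import difflib


def diff_chars(original, corrected):
    """List the differences between an original string and its correction.

    Each item is (tag, original_part, corrected_part), where tag is one of
    'replace', 'delete' or 'insert' as reported by difflib.
    """
    matcher = difflib.SequenceMatcher(None, original, corrected)
    return [
        (tag, original[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Called on a four-character original and a correction that differ in the second character, for instance `diff_chars("无原无故", "无缘无故")`, this reports one `replace` item for that character; an original that is missing a character yields an `insert` item instead.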
4. Experiments
The method was evaluated in several open tests on a test corpus of 20,000 lines of sentences containing 664 non-multi-character word errors, which include wrong-character substitution errors, character insertion errors and character deletion errors. The results show that the method achieves a recall of 75.9% and a precision of 85% for identifying non-multi-character word errors, with a correction rate of 62% and a correction accuracy of 81.7%. This precision exceeds the prior art and meets the needs of practical applications, with high effectiveness and accuracy.
The above embodiment is merely a preferred embodiment of the present invention and does not limit it. Any modification, equivalent substitution or improvement made by those skilled in the art without departing from the technical idea of the invention falls within the scope of protection of the invention.

Claims (6)

1. An automatic proofreading method for non-multi-character word errors based on fuzzy word segmentation, characterized in that automatic proofreading is performed by means of fuzzy word segmentation and comprises the following steps:
1) Using double-array Trie structures built from a correct-word dictionary and a typo-word dictionary, perform exact segmentation of the Chinese sentence with the maximum matching method and build an exact-segmentation word graph; mark the results of exact segmentation based on the typo-word dictionary, and add to the word graph the correct word corresponding to each typo word matched in the Chinese sentence, comprising the following steps:
Step 11) build the double-array Trie structure DicTrie for the correct-word dictionary;
Step 12) build the double-array Trie structure TypoDicTrie for the typo-word dictionary, whose entries are pairs (TypoWord, CorrectWord), where TypoWord is a typo word and CorrectWord is the correct word corresponding to that typo word;
Step 13) based on the double-array Trie structure DicTrie of the correct-word dictionary, perform exact segmentation of the Chinese sentence using the maximum matching method, and add the resulting words to the word graph to build the exact-segmentation word graph;
Step 14) based on the double-array Trie structure TypoDicTrie of the typo-word dictionary, perform exact segmentation of the Chinese sentence using the maximum matching method and mark the sentence: each typo word TypoWord found in the sentence is marked as an erroneous word together with its corresponding correct word CorrectWord; at the same time, the CorrectWord corresponding to each TypoWord in the sentence is added to the exact-segmentation word graph;
2) Apply the fuzzy matching method to the scattered strings in the exact segmentation result to obtain the similar words corresponding to each scattered string and their similarities; add these similar words to the exact-segmentation word graph to form a fuzzy-segmentation word graph, specifically comprising:
Traverse the characters of the Chinese sentence after the exact segmentation of step 1), and perform fuzzy matching at each character using the fuzzy matching method; compute the similarity between each fuzzily matched string and its corresponding scattered string; check whether the similarity is not less than the threshold $t_w$; each fuzzily matched string whose similarity is not less than the threshold is taken as a similar word of the corresponding scattered string and added to the exact-segmentation word graph as a fuzzy matching node, forming the fuzzy-segmentation word graph; continue until all characters in the sentence have been traversed;
The wherein described character string W calculated in fuzzy matching2With corresponding scattered string W1Similarity be:
<mrow> <mi>S</mi> <mi>i</mi> <mi>m</mi> <mrow> <mo>(</mo> <msub> <mi>W</mi> <mn>1</mn> </msub> <mo>,</mo> <msub> <mi>W</mi> <mn>2</mn> </msub> <mo>)</mo> </mrow> <mo>=</mo> <mn>1</mn> <mo>-</mo> <mfrac> <mrow> <mi>e</mi> <mi>d</mi> <mi>i</mi> <mi>t</mi> <mi>d</mi> <mi>i</mi> <mi>s</mi> <mrow> <mo>(</mo> <msub> <mi>W</mi> <mn>1</mn> </msub> <mo>,</mo> <msub> <mi>W</mi> <mn>2</mn> </msub> <mo>)</mo> </mrow> </mrow> <mrow> <mi>m</mi> <mi>a</mi> <mi>x</mi> <mrow> <mo>(</mo> <mi>m</mi> <mo>,</mo> <mi>n</mi> <mo>)</mo> </mrow> </mrow> </mfrac> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>1</mn> <mo>)</mo> </mrow> <mo>;</mo> </mrow>
Wherein:Sim(W1, W2) it is to dissipate string W1With character string W2Similarity;Dissipate string W1=c1c2…cn, character string W2=d1d2… dm, n and m represent W respectively1And W2In number of characters;Max () represents maximizing;editdis(W1, W2) it is two character strings Distance function:
$$\mathrm{editdis}(W_1, W_2) = \max \begin{cases} \mathrm{editdis}(c_2 \ldots c_n,\; d_1 \ldots d_m) + 1 \\ \mathrm{editdis}(c_1 \ldots c_n,\; d_2 \ldots d_m) + 1 \\ \mathrm{editdis}(c_2 \ldots c_n,\; d_2 \ldots d_m) + \bigl(1 - \mathrm{sim}(c_1, d_1)\bigr) \end{cases} \tag{2}$$
where $\mathrm{sim}(c_1, d_1)$ is the similarity between the Chinese characters $c_1$ and $d_1$, computed by the following formula:
$$\mathrm{sim}(c_i, d_j) = \alpha \cdot \mathrm{PSim}(c_i, d_j) + \beta \cdot \mathrm{SSim}(c_i, d_j) \tag{3}$$
where $\mathrm{sim}(c_i, d_j)$ is the similarity between the Chinese characters $c_i$ and $d_j$, $1 \le i \le n$, $1 \le j \le m$; $\mathrm{PSim}(c_i, d_j)$ is the pinyin similarity of $c_i$ and $d_j$; $\mathrm{SSim}(c_i, d_j)$ is the shape similarity of $c_i$ and $d_j$; and $\alpha$ and $\beta$ are the weights of the pinyin similarity and the shape similarity respectively, with $\alpha + \beta = 1$;
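The similarity computation of formulas (1) and (2) can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the patented implementation: the character similarity is taken to be the weighted sum α·PSim + β·SSim suggested by the definition of the weights; the recurrence takes the minimum over the three cases, as in the conventional edit distance, although formula (2) as extracted shows a maximum; the base cases (distance to an empty string equals the remaining length) are not given in the text and are assumed; and the function names and the toy similarity values for the pair 按/安 are hypothetical.

```python
def make_lookup(table):
    """Build a symmetric character-similarity function from a toy table."""
    def lookup(c, d):
        if c == d:
            return 1.0
        return table.get(frozenset((c, d)), 0.0)
    return lookup

# Hypothetical similarity values for one confusable pair (not from the patent).
psim = make_lookup({frozenset(("按", "安")): 0.9})  # pinyin similarity
ssim = make_lookup({frozenset(("按", "安")): 0.6})  # shape similarity

def char_sim(c, d, alpha=1.0, beta=0.0):
    """Assumed weighted sum: alpha * PSim + beta * SSim, with alpha + beta = 1."""
    return alpha * psim(c, d) + beta * ssim(c, d)

def editdis(w1, w2, sim=char_sim):
    """Weighted edit distance; a substitution costs 1 - sim(c, d)."""
    n, m = len(w1), len(w2)
    # dist[i][j] is the distance between the suffixes w1[i:] and w2[j:].
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][m] = float(n - i)  # assumed base case: drop the rest of w1
    for j in range(m + 1):
        dist[n][j] = float(m - j)  # assumed base case: add the rest of w2
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            dist[i][j] = min(  # assumption: min, not the max shown in (2)
                dist[i + 1][j] + 1,
                dist[i][j + 1] + 1,
                dist[i + 1][j + 1] + (1 - sim(w1[i], w2[j])),
            )
    return dist[0][0]

def string_sim(w1, w2, sim=char_sim):
    """Formula (1): 1 - editdis(W1, W2) / max(m, n)."""
    longest = max(len(w1), len(w2))
    if longest == 0:
        return 1.0
    return 1 - editdis(w1, w2, sim) / longest
```

With these toy values and α = 1, β = 0, `string_sim("按装", "安装")` evaluates to about 0.95, since the only cost is the substitution 按 → 安 at 1 − 0.9 = 0.1 over a length of 2.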
3) Using a word bigram model that incorporates similarity, compute the shortest path through the fuzzy-segmentation word graph to obtain the final segmentation result, and mark the original strings corresponding to the fuzzy-matching nodes in that result as the errors found, thereby automatically proofreading Chinese non-multi-character-word errors. This comprises the following steps:
Step 31) From the fuzzy-segmentation word graph obtained by precisely segmenting the sentence in step 1) and fuzzily matching it in step 2), obtain the candidate paths; then, using the similar words corresponding to the scattered strings and their similarities obtained in step 2), compute the probability of each segmentation sequence with the bigram model:
$$\begin{aligned} G^{*} &= \arg\max_{G} P(G) \\ &= \arg\max_{G}\; p(G_1) \prod_{k=2}^{s} p(G_k \mid G_{k-1}) \cdot \gamma(G_{k-1}, G') \end{aligned} \tag{4}$$
where $G$ is a segmentation path in the word graph, $G_k$ is the $k$-th word on the path, and $s$ is the number of words on the path; $\gamma(G_{k-1}, G')$ is the penalty applied during segmentation when the original string in the sentence is a scattered string that corresponds to a fuzzy-matching node: when the current word comes from precise segmentation, $\gamma(G_{k-1}, G') = 1$; otherwise $\gamma(G_{k-1}, G') = \mathrm{sim}(G_{k-1}, G')$, that is, the similarity between the fuzzily matched original string $G'$ in the sentence and the matched word $G_{k-1}$, also described as the similarity between the fuzzily matched character string $G_{k-1}$ and the corresponding scattered string $G'$;
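A sketch of how formula (4) might score a single candidate path is given below. It works in log space to avoid underflow and reads the formula literally, so the penalty γ attaches to the preceding word $G_{k-1}$ inside the product; that reading, the probability floor for unseen words, the function name `path_log_prob`, and all toy probabilities are assumptions made for illustration only.

```python
import math

# Hypothetical toy probabilities (not taken from the patent).
UNIGRAM = {"安装": 0.01, "按": 0.001}
BIGRAM = {("安装", "软件"): 0.2}

def path_log_prob(path, unigram=UNIGRAM, bigram=BIGRAM, floor=1e-8):
    """Log of p(G_1) * prod_{k=2..s} p(G_k | G_{k-1}) * gamma(G_{k-1}, G').

    `path` is a list of (word, gamma) pairs, where gamma is 1.0 for a word
    from precise segmentation and the string similarity for a fuzzy match.
    Unseen unigrams and bigrams fall back to `floor` (an assumed smoothing).
    """
    first_word, _ = path[0]
    logp = math.log(unigram.get(first_word, floor))
    for k in range(1, len(path)):
        prev_word, prev_gamma = path[k - 1]
        word, _ = path[k]
        logp += math.log(bigram.get((prev_word, word), floor))
        logp += math.log(prev_gamma)  # penalty of the preceding word
    return logp

# Two candidate segmentations of the sentence "按装软件".
fuzzy_path = [("安装", 0.95), ("软件", 1.0)]             # 按装 matched to 安装
precise_path = [("按", 1.0), ("装", 1.0), ("软件", 1.0)]  # scattered characters
```

Under these toy numbers the fuzzy path scores higher than the path built from scattered single characters, which is the behaviour the penalty term is meant to allow while still discounting inexact matches.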
Step 32) On the fuzzy-segmentation word graph obtained in step 31), solve for the shortest path using Dijkstra's algorithm for graphs, thereby obtaining the final segmentation result;
Step 33) For each fuzzy-matching node on the shortest path, mark its corresponding original string as a word containing a wrong character, and take the similar word obtained by fuzzy matching as the corresponding correct word, thereby achieving automatic proofreading of Chinese non-multi-character-word errors.
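Steps 32) and 33) can be illustrated with the following sketch, which is one possible arrangement rather than the patent's own code. It assumes the word graph is represented as a list of word nodes covering spans of the sentence, that the cost of moving from node u to node v is −log(p(v | u) · γ(u)) so that the most probable path becomes the shortest path, and that unseen words get a small probability floor. The `Node` type, the function names, and the toy probabilities are hypothetical.

```python
import heapq
import math
from collections import namedtuple

# A word-graph node: the span [start, end) of the sentence, the dictionary
# word proposed for that span, and its penalty gamma (1.0 if precise).
Node = namedtuple("Node", "start end word gamma")

def shortest_path(sentence, nodes, unigram, bigram, floor=1e-8):
    """Dijkstra over word nodes; returns the best sequence of nodes."""
    by_start = {}
    for node in nodes:
        by_start.setdefault(node.start, []).append(node)

    heap, counter = [], 0  # entries: (cost, tie-breaker, node, path)
    for node in by_start.get(0, []):
        cost = -math.log(unigram.get(node.word, floor))
        heapq.heappush(heap, (cost, counter, node, (node,)))
        counter += 1

    done = set()
    while heap:
        cost, _, node, path = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node.end == len(sentence):  # first finished node is optimal
            return list(path)
        for nxt in by_start.get(node.end, []):
            if nxt not in done:
                p = bigram.get((node.word, nxt.word), floor) * node.gamma
                heapq.heappush(
                    heap, (cost - math.log(p), counter, nxt, path + (nxt,)))
                counter += 1
    return []

def mark_errors(sentence, path):
    """Pairs (original string, suggested word) for fuzzy-matching nodes."""
    return [(sentence[n.start:n.end], n.word)
            for n in path if sentence[n.start:n.end] != n.word]

# Toy fuzzy-segmentation word graph for the sentence "按装软件".
sentence = "按装软件"
nodes = [Node(0, 1, "按", 1.0), Node(1, 2, "装", 1.0),
         Node(0, 2, "安装", 0.95), Node(2, 4, "软件", 1.0)]
unigram = {"安装": 0.01, "按": 0.001}
bigram = {("安装", "软件"): 0.2, ("按", "装"): 1e-6, ("装", "软件"): 1e-4}
best = shortest_path(sentence, nodes, unigram, bigram)
```

Because every probability and every γ is at most 1, all edge costs are non-negative, which is the condition Dijkstra's algorithm requires. For the toy graph, `mark_errors(sentence, best)` reports 按装 as the erroneous string with 安装 as its suggested correction.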
2. The automatic proofreading method for non-multi-character-word errors based on fuzzy word segmentation according to claim 1, characterized in that: the fuzzy matching method operates by single-character substitution, multi-character substitution, or missing-character substitution, the single-character substitution being substitution based on similar-shaped characters and/or substitution based on similar-sounding characters.
3. The automatic proofreading method for non-multi-character-word errors based on fuzzy word segmentation according to claim 1, characterized in that: when the method is applied to text for which the user's input method is a pinyin input method or a phonetic input method, the pinyin-similarity weight is α = 1 and the shape-similarity weight is β = 0.
4. The automatic proofreading method for non-multi-character-word errors based on fuzzy word segmentation according to claim 1, characterized in that: when the method is applied to error correction of OCR recognition results, the pinyin-similarity weight is α = 0 and the shape-similarity weight is β = 1.
5. The automatic proofreading method for non-multi-character-word errors based on fuzzy word segmentation according to claim 1, characterized in that: when the method is applied to text for which the user's input methods are a pinyin input method and a shape-based input method, the pinyin-similarity weight is α = 0.5 and the shape-similarity weight is β = 0.5.
6. The automatic proofreading method for non-multi-character-word errors based on fuzzy word segmentation according to claim 1, characterized in that the threshold $t_w$ is 0.95.
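The weight settings of claims 3 to 5 and the threshold of claim 6 could be captured as a small configuration table, sketched below. The scenario keys, the function names, and the assumption that the character similarity is the weighted sum α·PSim + β·SSim are illustrative only; the numeric values of α, β, and $t_w$ are those stated in the claims.

```python
# (alpha, beta) = (pinyin-similarity weight, shape-similarity weight).
# The scenario names are hypothetical labels for claims 3, 4 and 5.
WEIGHT_PRESETS = {
    "pinyin_or_phonetic_input": (1.0, 0.0),  # claim 3
    "ocr_correction": (0.0, 1.0),            # claim 4
    "pinyin_and_shape_input": (0.5, 0.5),    # claim 5
}

T_W = 0.95  # similarity threshold from claim 6

def combined_similarity(psim_value, ssim_value, scenario):
    """Assumed weighted sum alpha * PSim + beta * SSim for one character pair."""
    alpha, beta = WEIGHT_PRESETS[scenario]
    return alpha * psim_value + beta * ssim_value

def select_similar_words(candidates, threshold=T_W):
    """Keep candidate words whose similarity is not less than the threshold.

    `candidates` maps a candidate word to its similarity with the scattered
    string; the survivors become fuzzy-matching nodes in the word graph.
    """
    return {word: s for word, s in candidates.items() if s >= threshold}
```

Keeping the presets in one table makes the dependence on the error source explicit: pinyin-typed text weights sound only, OCR output weights shape only, and mixed input splits the weight evenly.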
CN201510361877.8A 2015-06-26 2015-06-26 A kind of non-multi-character word error auto-collation based on fuzzy participle Active CN104991889B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510361877.8A CN104991889B (en) 2015-06-26 2015-06-26 A kind of non-multi-character word error auto-collation based on fuzzy participle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510361877.8A CN104991889B (en) 2015-06-26 2015-06-26 A kind of non-multi-character word error auto-collation based on fuzzy participle

Publications (2)

Publication Number Publication Date
CN104991889A CN104991889A (en) 2015-10-21
CN104991889B true CN104991889B (en) 2018-02-02

Family

ID=54303705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510361877.8A Active CN104991889B (en) 2015-06-26 2015-06-26 A kind of non-multi-character word error auto-collation based on fuzzy participle

Country Status (1)

Country Link
CN (1) CN104991889B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105573979B (en) * 2015-12-10 2018-05-22 江苏科技大学 A kind of wrongly written character word knowledge generation method that collection is obscured based on Chinese character
CN105512110B (en) * 2015-12-15 2018-04-06 江苏科技大学 A kind of wrongly written character word construction of knowledge base method based on fuzzy matching with statistics
CN106610953A (en) * 2016-09-30 2017-05-03 四川用联信息技术有限公司 Method for solving text similarity based on Gini index
CN106598939B (en) * 2016-10-21 2019-09-17 北京三快在线科技有限公司 A kind of text error correction method and device, server, storage medium
CN106527757A (en) * 2016-10-28 2017-03-22 上海智臻智能网络科技股份有限公司 Input error correction method and apparatus
CN106528532B (en) * 2016-11-07 2019-03-12 上海智臻智能网络科技股份有限公司 Text error correction method, device and terminal
CN106547741B (en) * 2016-11-21 2019-02-15 江苏科技大学 A kind of Chinese language text auto-collation based on collocation
CN108572998A (en) * 2017-03-14 2018-09-25 北京橙鑫数据科技有限公司 A kind of data search method and device for electronic card data
CN108766437B (en) * 2018-05-31 2020-06-23 平安科技(深圳)有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN108717412A (en) * 2018-06-12 2018-10-30 北京览群智数据科技有限责任公司 Chinese check and correction error correction method based on Chinese word segmentation and system
CN109657738B (en) * 2018-10-25 2024-04-30 平安科技(深圳)有限公司 Character recognition method, device, equipment and storage medium
CN109492202B (en) * 2018-11-12 2022-12-27 浙江大学山东工业技术研究院 Chinese error correction method based on pinyin coding and decoding model
CN109558596A (en) * 2018-12-14 2019-04-02 平安城市建设科技(深圳)有限公司 Recognition methods, device, terminal and computer readable storage medium
CN110020005B (en) * 2019-03-28 2021-03-26 云知声(上海)智能科技有限公司 Method for matching main complaints in medical records with symptoms in current medical history
CN111209748B (en) * 2019-12-16 2023-10-24 合肥讯飞数码科技有限公司 Error word recognition method, related device and readable storage medium
CN112765318A (en) * 2021-01-20 2021-05-07 阅尔基因技术(苏州)有限公司 Natural language processing method and system for infertility clinical phenotype information
CN113033193B (en) * 2021-01-20 2024-04-16 山谷网安科技股份有限公司 Mixed Chinese text word segmentation method based on C++ language
CN112954387B (en) * 2021-01-26 2023-04-28 广州欢网科技有限责任公司 Method, system and readable storage medium for updating and optimizing television program list
CN114490260B (en) * 2022-01-20 2024-08-27 中国平安人寿保险股份有限公司 System index generation method, device, proxy server and storage medium
CN114091436B (en) * 2022-01-21 2022-05-17 万商云集(成都)科技股份有限公司 Sensitive word detection method based on decision tree and variant recognition
CN114781371A (en) * 2022-04-07 2022-07-22 山东新一代信息产业技术研究院有限公司 Chinese word segmentation method based on statistics and dictionary

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1514387A (en) * 2002-12-31 2004-07-21 中国科学院计算技术研究所 Sound distinguishing method in speech sound inquiry
CN102393850A (en) * 2011-07-22 2012-03-28 镇江诺尼基智能技术有限公司 Chinese character pattern cognition similarity computing method

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
双数组Trie树算法优化及其应用研究;王思力 等;《中文信息学报》;20061231;第20卷(第5期);全文 *
基于N-最短路径方法的中文词语粗分模型;张华平 等;《中文信息学报》;20021231;第16卷(第5期);第1-7页 *
基于快速模糊词匹配算法的中文自动校对方法;张磊 等;《Proceedings of the 3rd World Congress on Intelligent Control and Automation》;20001231;第2739-2743页 *
基于规则与统计相结合的中文文本自动查错模型与算法;张仰森 等;《中文信息学报》;20061231;第20卷(第4期);全文 *
汉字种子混淆集的构建方法研究;施恒利 等;《计算机科学》;20140831;第41卷(第8期);全文 *
汉字种子混淆集的构建方法研究;施恒利;《中国优秀硕士学位论文全文数据库 信息科技辑》;20150430;第2015年卷(第4期);第16、27-28、35、38-39、41-43页、 *
领域问答系统中的文本错误自动发现方法;刘亮亮 等;《中文信息学报》;20130531;第27卷(第3期);全文 *

Also Published As

Publication number Publication date
CN104991889A (en) 2015-10-21

Similar Documents

Publication Publication Date Title
CN104991889B (en) A kind of non-multi-character word error auto-collation based on fuzzy participle
CN105045778B (en) A kind of Chinese homonym mistake auto-collation
CN112801010B (en) Visual rich document information extraction method for actual OCR scene
CN106777274B (en) A kind of Chinese tour field knowledge mapping construction method and system
US7680646B2 (en) Retrieval method for translation memories containing highly structured documents
CN102930031B (en) By the method and system extracting bilingual parallel text in webpage
CN106909611B (en) Hotel automatic matching method based on text information extraction
CN102693279B (en) Method, device and system for fast calculating comment similarity
CN105068998A (en) Translation method and translation device based on neural network model
CN105808530B (en) Interpretation method and device in a kind of statistical machine translation
WO2024131111A1 (en) Intelligent writing method and apparatus, device, and nonvolatile readable storage medium
CN105512110B (en) A kind of wrongly written character word construction of knowledge base method based on fuzzy matching with statistics
CN103473217B (en) The method and apparatus of extracting keywords from text
CN106547741B (en) A kind of Chinese language text auto-collation based on collocation
CN110866125A (en) Knowledge graph construction system based on bert algorithm model
CN106980620A (en) A kind of method and device matched to Chinese character string
CN113468339B (en) Label extraction method and system based on knowledge graph, electronic equipment and medium
CN105740235B (en) It is a kind of merge Vietnamese grammar property tree of phrases to dependency tree conversion method
CN111221976A (en) Knowledge graph construction method based on bert algorithm model
CN110705261B (en) Chinese text word segmentation method and system thereof
CN104572618A (en) Question-answering system semantic-based similarity analyzing method, system and application
CN108763218A (en) A kind of video display retrieval entity recognition method based on CRF
CN112115701B (en) News reading text readability evaluation method and system
CN103714053A (en) Japanese verb identification method for machine translation
CN107894977A (en) With reference to the Vietnamese part of speech labeling method of conversion of parts of speech part of speech disambiguation model and dictionary

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20151021

Assignee: JIANGSU KEDA HUIFENG SCIENCE AND TECHNOLOGY Co.,Ltd.

Assignor: JIANGSU University OF SCIENCE AND TECHNOLOGY

Contract record no.: X2020980007325

Denomination of invention: An automatic proofreading method for non multi word errors based on fuzzy segmentation

Granted publication date: 20180202

License type: Common License

Record date: 20201029

EE01 Entry into force of recordation of patent licensing contract
EC01 Cancellation of recordation of patent licensing contract

Assignee: JIANGSU KEDA HUIFENG SCIENCE AND TECHNOLOGY Co.,Ltd.

Assignor: JIANGSU University OF SCIENCE AND TECHNOLOGY

Contract record no.: X2020980007325

Date of cancellation: 20201223

EC01 Cancellation of recordation of patent licensing contract
TR01 Transfer of patent right

Effective date of registration: 20221222

Address after: Room 02A-084, Building C (Second Floor), No. 28, Xinxi Road, Haidian District, Beijing 100085

Patentee after: Jingchuang United (Beijing) Intellectual Property Service Co.,Ltd.

Address before: 212003, No. 2, Mengxi Road, Zhenjiang, Jiangsu

Patentee before: JIANGSU University OF SCIENCE AND TECHNOLOGY

Effective date of registration: 20221222

Address after: Room 606-609, Compound Office Complex Building, No. 757, Dongfeng East Road, Yuexiu District, Guangzhou, Guangdong Province, 510699

Patentee after: China Southern Power Grid Internet Service Co.,Ltd.

Address before: Room 02A-084, Building C (Second Floor), No. 28, Xinxi Road, Haidian District, Beijing 100085

Patentee before: Jingchuang United (Beijing) Intellectual Property Service Co.,Ltd.

TR01 Transfer of patent right