CN104991889A - Fuzzy word segmentation based non-multi-character word error automatic proofreading method - Google Patents


Info

Publication number
CN104991889A
Authority
CN
China
Prior art keywords
word
character
fuzzy
participle
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510361877.8A
Other languages
Chinese (zh)
Other versions
CN104991889B (en)
Inventor
刘亮亮
吴健康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Southern Power Grid Internet Service Co ltd
Jingchuang United (Beijing) Intellectual Property Service Co.,Ltd.
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology
Priority to CN201510361877.8A
Publication of CN104991889A
Application granted
Publication of CN104991889B
Active
Anticipated expiration

Landscapes

  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation. The method first performs exact segmentation against a correct-word dictionary and a typo-word dictionary to generate a word graph. It then uses a fuzzy matching algorithm to compute the similarity of Chinese word strings, applies fuzzy matching to the scattered single-character strings left by exact segmentation, and adds the fuzzy matching results to the word graph to form a fuzzy word graph. Finally, it computes the shortest path through the fuzzy word graph using a word bigram model combined with similarity, thereby automatically proofreading Chinese non-multi-character word errors. With this method the system responds quickly, its precision meets practical application demands, and its effectiveness and accuracy are high.

Description

A method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation
Technical field
The present invention relates to natural language processing within the field of artificial intelligence, and in particular to the automatic proofreading of Chinese text.
Background
With the rapid development of information processing technology and the internet, traditional text work has been almost entirely taken over by computers. Electronic texts such as e-books, electronic newspapers, e-mail, office documents, blogs and microblogs have become part of daily life, but errors in text are also increasingly common, which poses a major challenge for proofreading. Traditional manual proofreading is inefficient, labor-intensive and slow, and clearly cannot meet the demand for text proofreading.
Automatic text proofreading is one of the main applications of natural language processing and also a hard problem in natural language understanding. As the technology has developed, automatic proofreading of English text has achieved very good results and has been commercialized. Compared with English, automatic proofreading of Chinese text faces the following difficulties:
1) Chinese text has nothing comparable to the English "non-word error", that is, a word absent from the dictionary that can be detected by dictionary lookup; every Chinese character in a Chinese text appears in the dictionary.
2) Chinese text proofreading must begin with Chinese word segmentation. If a word contains a wrongly written character, segmentation splits it into a scattered string of single characters (a non-multi-character word error), which makes error detection in Chinese text difficult.
3) A scattered string of single characters in Chinese does not necessarily contain a wrongly written character, because single Chinese characters very readily stand as words on their own.
4) Besides non-multi-character word errors, a word in Chinese is often miswritten as another word that is itself in the dictionary. This is called a real-word error and is a further difficulty in automatic proofreading of Chinese text.
To address these problems, the present invention proposes and implements automatic detection and automatic correction of Chinese non-multi-character word errors.
Summary of the invention
Object of the invention: to overcome the deficiencies of the prior art, the invention provides a method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation.
Technical solution: to solve the above technical problem, the invention provides a method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation. The method performs automatic proofreading by means of fuzzy word segmentation and comprises the following steps:
1) Using double-array Trie structures built from a correct-word dictionary and a typo-word dictionary, perform exact segmentation of the Chinese sentence by maximum matching and build an exact-segmentation word graph; mark the results of exact segmentation against the typo-word dictionary, and at the same time add to the word graph the correct word corresponding to each typo word in the Chinese sentence that matches the typo-word dictionary;
2) Apply a fuzzy matching method to the scattered strings in the exact-segmentation result to obtain the similar words corresponding to the scattered strings and their similarities; add the obtained similar words to the exact-segmentation word graph to form a fuzzy-segmentation word graph;
3) Based on a word bigram model combined with similarity, compute the shortest path through the fuzzy-segmentation word graph to obtain the final segmentation result; mark the original strings corresponding to the fuzzy matching nodes in the segmentation result as the detected errors, thereby achieving automatic proofreading of Chinese non-multi-character word errors.
Preferably, step 1) comprises the following steps:
Step 11) Build the double-array Trie structure DicTrie of the correct-word dictionary;
Step 12) Build the double-array Trie structure TypoDicTrie: (TypoWord, CorrectWord) of the typo-word dictionary, where TypoWord is a typo word and CorrectWord is the correct word corresponding to it;
Step 13) Based on DicTrie, the double-array Trie structure of the correct-word dictionary, perform exact segmentation of the Chinese sentence by maximum matching, and add the segmented words to the word graph to build the exact-segmentation word graph;
Step 14) Based on TypoDicTrie, the double-array Trie structure of the typo-word dictionary, perform exact segmentation of the Chinese sentence by maximum matching and mark the sentence: each typo word TypoWord found in the sentence is marked as an erroneous word, together with its corresponding correct word CorrectWord; at the same time, the correct word CorrectWord corresponding to each typo word TypoWord in the sentence is added to the exact-segmentation word graph.
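The two dictionary structures of steps 11) and 12) can be illustrated with a minimal sketch. It uses a plain nested-dictionary trie as a stand-in for the double-array Trie named in the text; the class name `Trie`, the method `longest_match` and the Chinese dictionary entries are all assumptions made for illustration and do not come from the source.

```python
END = ""  # sentinel key; a real character is never the empty string


class Trie:
    """Nested-dict trie (a simplified stand-in for a double-array Trie)."""

    def __init__(self):
        self.root = {}

    def insert(self, word, value=True):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = value  # payload: True, or the correct word for a typo entry

    def longest_match(self, text, start):
        """Return (end, payload) of the longest entry starting at `start`, else None."""
        node, best = self.root, None
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if END in node:
                best = (i + 1, node[END])
        return best


# DicTrie: correct words (illustrative entries only)
dic_trie = Trie()
for w in ["经常", "无故", "无缘无故"]:
    dic_trie.insert(w)

# TypoDicTrie: (TypoWord, CorrectWord) pairs (illustrative entry only)
typo_trie = Trie()
typo_trie.insert("无原无故", "无缘无故")

exact_hit = dic_trie.longest_match("经常无原无故", 0)
typo_hit = typo_trie.longest_match("经常无原无故", 2)
```

Storing the correct word as the payload of the typo entry means a single lookup yields both the span to mark as erroneous and the word to add to the word graph.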
Preferably, step 2) comprises:
Traverse the characters of the Chinese sentence after the exact segmentation of step 1) and apply the fuzzy matching method to each character; compute the similarity between each fuzzy-matched character string and its corresponding scattered string; determine whether the similarity is not less than a threshold $t_w$; each fuzzy-matched character string whose similarity is not less than the threshold is taken as a similar word of its corresponding scattered string and added to the exact-segmentation word graph as a fuzzy matching node, forming the fuzzy-segmentation word graph, until all characters in the sentence have been traversed;
The similarity between the fuzzy-matched character string $W_2$ and its corresponding scattered string $W_1$ is computed as:
$$Sim(W_1, W_2) = 1 - \frac{editdis(W_1, W_2)}{\max(m, n)} \quad (1)$$
where the Chinese strings are $W_1 = c_1 c_2 \ldots c_n$ and $W_2 = d_1 d_2 \ldots d_m$, and $editdis(W_1, W_2)$ is the distance function of the two character strings:
$$editdis(W_1, W_2) = \max \begin{cases} editdis(c_2 \ldots c_n,\ d_1 \ldots d_m) + 1 \\ editdis(c_1 \ldots c_n,\ d_2 \ldots d_m) + 1 \\ editdis(c_2 \ldots c_n,\ d_1 \ldots d_m) + 1 \cdot (1 - sim(c_1, d_1)) \end{cases} \quad (2)$$
where $sim(c_i, d_i)$ is the similarity of the Chinese characters $c_i$ and $d_i$:
$$sim(c_i, d_i) = \alpha \cdot PSim(c_i, d_i) + \beta \cdot SSim(c_i, d_i)$$
where $PSim(c_i, d_i)$ is the pinyin similarity of the Chinese characters $c_i$ and $d_i$, $SSim(c_i, d_i)$ is their shape similarity, and $\alpha$ and $\beta$ are the weights of pinyin similarity and shape similarity respectively, with $\alpha + \beta = 1$.
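A string similarity of this kind can be sketched as a weighted edit distance. The sketch below assumes the conventional recurrence, which takes the minimum over deletion, insertion and substitution with substitution cost 1 - sim(c, d); this differs from the recurrence as printed in claim 3, which reads `max`. The tables `PSIM` and `SSIM`, their values and the function names are hypothetical.

```python
from functools import lru_cache

# Hypothetical per-character similarity tables (values invented for illustration).
PSIM = {("原", "缘"): 1.0}  # pinyin similarity
SSIM = {("原", "缘"): 0.8}  # shape similarity


def char_sim(c, d, alpha=0.5, beta=0.5):
    """sim(c, d) as a weighted sum of pinyin and shape similarity, alpha + beta = 1."""
    if c == d:
        return 1.0
    return alpha * PSIM.get((c, d), 0.0) + beta * SSIM.get((c, d), 0.0)


def string_sim(w1, w2, sim=char_sim):
    """Sim(W1, W2) = 1 - editdis(W1, W2) / max(m, n)."""

    @lru_cache(maxsize=None)
    def editdis(i, j):
        # Distance between the suffixes w1[i:] and w2[j:].
        if i == len(w1):
            return float(len(w2) - j)
        if j == len(w2):
            return float(len(w1) - i)
        return min(
            editdis(i + 1, j) + 1,                          # drop w1[i]
            editdis(i, j + 1) + 1,                          # drop w2[j]
            editdis(i + 1, j + 1) + 1 - sim(w1[i], w2[j]),  # substitute
        )

    return 1 - editdis(0, 0) / max(len(w1), len(w2))


score = string_sim("无原无故", "无缘无故")
```

With the invented table values, one substitution of cost 0.1 over four characters gives a score of about 0.975, which would clear a threshold of 0.95.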
Preferably, the fuzzy matching method is carried out by single-character replacement, multi-character replacement or missing-character replacement, where single-character replacement is based on similar shape and/or similar sound.
Preferably, when the method is used to proofread Chinese non-multi-character word errors in text entered with a pinyin input method or a phonetic input method, the pinyin-similarity weight is α = 1 and the shape-similarity weight is β = 0.
Preferably, when the method is used to proofread Chinese non-multi-character word errors for OCR recognition error correction, the pinyin-similarity weight is α = 0 and the shape-similarity weight is β = 1.
Preferably, when the method is used to proofread Chinese non-multi-character word errors in text entered with both a pinyin input method and a shape-based input method, the pinyin-similarity weight is α = 0.5 and the shape-similarity weight is β = 0.5.
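The three weight settings can be written as a small lookup, sketched below. It assumes the character similarity is the weighted sum of pinyin similarity and shape similarity implied by α + β = 1; the scenario keys and the function name `combined_sim` are hypothetical.

```python
# (alpha, beta) per application scenario, as listed in the text.
WEIGHTS = {
    "pinyin_input": (1.0, 0.0),  # pinyin or phonetic input method
    "ocr": (0.0, 1.0),           # OCR recognition error correction
    "mixed_input": (0.5, 0.5),   # pinyin and shape-based input methods
}


def combined_sim(psim, ssim, scenario):
    """Weighted character similarity for the given scenario."""
    alpha, beta = WEIGHTS[scenario]
    return alpha * psim + beta * ssim


ocr_score = combined_sim(0.9, 0.3, "ocr")
pinyin_score = combined_sim(0.9, 0.3, "pinyin_input")
mixed_score = combined_sim(0.9, 0.3, "mixed_input")
```

A character pair that sounds alike but looks different scores high under the pinyin setting and low under the OCR setting, which matches the kind of error each source tends to produce.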
Preferably, step 3) comprises the following steps:
Step 31) From the fuzzy-segmentation word graph obtained by the exact segmentation of step 1) and the fuzzy matching of step 2), obtain multiple paths; combining the similar words corresponding to the scattered strings and their similarities obtained in step 2), use a bigram model to compute the probability of each segmentation sequence:
where $W$ is a segmentation path in the word graph, $W_i$ is the i-th word in the path, and $n$ is the number of words in the path; $\alpha(W_{i-1}, W')$ is the penalty applied during segmentation when the original string in the sentence is a scattered string corresponding to a fuzzy matching node. When the current word comes from exact segmentation, $\alpha(W_{i-1}, W') = 1$; otherwise $\alpha(W_{i-1}, W') = sim(W_{i-1}, W')$, that is, the similarity between the original string $W'$ that was fuzzy-matched in the sentence and the matched word $W_{i-1}$, also referred to as the similarity between the fuzzy-matched character string $W_{i-1}$ and its corresponding scattered string $W'$;
Step 32) On the fuzzy-segmentation word graph of step 31), solve for the shortest path using Dijkstra's algorithm to obtain the final segmentation result;
Step 33) For each fuzzy matching node on the shortest path, mark its corresponding original string as a word containing a wrongly written character, and take the similar word obtained by fuzzy matching as its correct word, thereby achieving automatic proofreading of Chinese non-multi-character word errors.
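The shortest-path step can be sketched as Dijkstra's algorithm over states of the form (position, previous word). The probability formula itself is not reproduced in this text, so the sketch assumes each edge contributes a cost of -log(α · P(W_i | W_{i-1})), with α = 1 for exact edges and α equal to the fuzzy-match similarity otherwise. The function name, the edge layout, the `floor` smoothing value and every probability below are hypothetical.

```python
import heapq
import math


def best_path(n, edges, bigram, floor=1e-6):
    """Return (words, cost) of the cheapest path from position 0 to position n.

    edges: list of (start, end, word, sim); sim is 1.0 for exact segmentation.
    bigram: dict mapping (previous word, word) to a probability.
    """
    out = {}
    for s, e, w, sim in edges:
        out.setdefault(s, []).append((e, w, sim))
    start = (0, "<S>")
    dist, back = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, state = heapq.heappop(heap)
        if d > dist.get(state, math.inf):
            continue  # stale heap entry
        pos, prev = state
        if pos == n:
            words = []
            while state != start:
                words.append(state[1])
                state = back[state]
            return words[::-1], d
        for e, w, sim in out.get(pos, []):
            p = bigram.get((prev, w), floor)  # unseen bigrams get a small floor
            nd = d - math.log(sim * p)        # non-negative because sim * p <= 1
            nxt = (e, w)
            if nd < dist.get(nxt, math.inf):
                dist[nxt], back[nxt] = nd, state
                heapq.heappush(heap, (nd, nxt))
    return None, math.inf


# Word graph for a four-character span: three exact edges and one fuzzy edge.
edges = [
    (0, 1, "无", 1.0), (1, 2, "原", 1.0), (2, 4, "无故", 1.0),
    (0, 4, "无缘无故", 0.975),
]
bigram = {("<S>", "无"): 0.02, ("无", "原"): 0.001, ("原", "无故"): 0.001,
          ("<S>", "无缘无故"): 0.01}
words, cost = best_path(4, edges, bigram)
```

Because the cost of a word depends on the word before it, the previous word is carried in the search state rather than searching over positions alone.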
Preferably, the threshold $t_w$ is 0.95.
Beneficial effects: the invention proposes a method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation. The method effectively identifies and corrects non-multi-character word errors in Chinese text during segmentation, and its use of double-array Trie structures allows fuzzy segmentation to be performed quickly. Experiments show that the method achieves a recall of 75.9% and a precision of 85% in detecting non-multi-character word errors, a correction rate of 62%, and a correction accuracy of 81.7%. The system responds quickly, its precision meets practical application demands, its effectiveness and accuracy are high, and it is highly practical.
Brief description of the drawings
Fig. 1 is an example of the fuzzy-segmentation word graph provided by the invention.
Detailed description
The invention is further described below with reference to the drawing and an embodiment.
The method provided by the invention for automatically proofreading non-multi-character word errors based on fuzzy word segmentation performs automatic proofreading by means of fuzzy word segmentation and comprises the following steps:
1) Using double-array Trie structures built from a correct-word dictionary and a typo-word dictionary, perform exact segmentation of the Chinese sentence by maximum matching and build an exact-segmentation word graph; mark the results of exact segmentation against the typo-word dictionary, and at the same time add to the word graph the correct word corresponding to each typo word in the Chinese sentence that matches the typo-word dictionary. Specifically:
First, exact segmentation is performed with the correct-word dictionary and the typo-word dictionary to build the exact-segmentation word graph, where:
S: the sentence to be segmented; Dic1: the correct-word dictionary; Dic2: the typo-word dictionary; pos1: the lookup position for the correct-word dictionary; pos2: the lookup position for the typo-word dictionary.
Step 11) Build the double-array Trie structure DicTrie of the correct-word dictionary Dic1;
Step 12) Build the double-array Trie structure TypoDicTrie: (TypoWord, CorrectWord) of the typo-word dictionary Dic2, where TypoWord is a typo word and CorrectWord is the correct word corresponding to it; for example, (for no reason at all without ancient, gratuitous), given here in English gloss;
Step 13) Based on DicTrie, the double-array Trie structure of the correct-word dictionary, perform exact segmentation of the Chinese sentence by maximum matching, and add the segmented words to the word graph to build the exact-segmentation word graph, as shown in Fig. 1; in this embodiment, exact segmentation is drawn with solid boxes in the word graph;
In this embodiment: starting from position pos1 (initially 0), a forward maximum search is made in the correct-word dictionary Dic1. If a correct entry word1 is found, it is added to the exact-segmentation word graph and pos1 is updated to the position after word1; otherwise pos1 advances to the character after the current position. The search repeats until pos1 reaches the end of sentence S;
Step 14) Based on TypoDicTrie, the double-array Trie structure of the typo-word dictionary, perform exact segmentation of the Chinese sentence by maximum matching and mark the sentence: each typo word TypoWord found in the sentence is marked as an erroneous word, together with its corresponding correct word CorrectWord; at the same time, the correct word CorrectWord corresponding to each typo word TypoWord in the sentence is added to the exact-segmentation word graph. As shown in Fig. 1, these nodes are drawn with dashed boxes in the word graph in this embodiment.
In this embodiment: starting from position pos2 (initially 0), a forward maximum search is made in the typo-word dictionary Dic2. If a typo word TypoWord is found, its corresponding correct entry CorrectWord is added to the exact-segmentation word graph, the typo word in the sentence and its correct word are marked, and pos2 is updated to the position after TypoWord; otherwise pos2 advances to the character after the current position. The search repeats until pos2 reaches the end of sentence S.
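The two scanning loops over pos1 and pos2 can be sketched as follows. For brevity the sketch looks words up in a plain set and dict instead of a double-array Trie, and limits candidate words to `max_len` characters; the function names, the edge tuples and the Chinese entries are assumptions made for illustration.

```python
def forward_max_match(sentence, lexicon, max_len=4):
    """Forward maximum matching; returns (start, end, word) edges.

    A character that begins no dictionary word becomes a one-character edge,
    i.e. part of a scattered string.
    """
    edges, pos = [], 0
    while pos < len(sentence):
        for size in range(min(max_len, len(sentence) - pos), 0, -1):
            piece = sentence[pos:pos + size]
            if size == 1 or piece in lexicon:
                edges.append((pos, pos + size, piece))
                pos += size  # continue after the matched word
                break
    return edges


def typo_edges(sentence, typo_dict, max_len=4):
    """Scan with the typo dictionary; returns (start, end, correct, typo) edges."""
    edges, pos = [], 0
    while pos < len(sentence):
        for size in range(min(max_len, len(sentence) - pos), 0, -1):
            piece = sentence[pos:pos + size]
            if piece in typo_dict:
                edges.append((pos, pos + size, typo_dict[piece], piece))
                pos += size  # continue after the typo word
                break
        else:
            pos += 1  # nothing found here: move to the next character
    return edges


exact = forward_max_match("经常无原无故", {"经常", "无故"})
typos = typo_edges("经常无原无故", {"无原无故": "无缘无故"})
```

The exact pass leaves the miswritten word as scattered single characters, while the typo pass contributes one edge spanning the same characters and labelled with the correct word.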
For example, take the sentence S = "why frequent you are without former expense of living of taking off me without reason" (given here in English gloss).
After the exact segmentation of step 13), the result is as shown in Fig. 1: "you", "why", "often", "None", "former", "without reason", "button", "getting", "I", "", "work", "expense" are the results of exact segmentation and are drawn with solid boxes in the word graph;
After the exact segmentation of step 14), the result is as shown in Fig. 1: because (without without reason former, gratuitous) is an entry in the typo-word dictionary, segmentation with that dictionary replaces "None", "former", "without reason" with "gratuitous", which is drawn with a dashed box in the word graph.
2) Apply a fuzzy matching method to the scattered strings in the exact-segmentation result to obtain the similar words corresponding to the scattered strings and their similarities; add the obtained similar words to the exact-segmentation word graph to form a fuzzy-segmentation word graph. Specifically:
Traverse the characters of the Chinese sentence after the exact segmentation of step 1) and apply the fuzzy matching method to each character. The fuzzy matching method is carried out by single-character replacement, multi-character replacement or missing-character replacement, where single-character replacement is based on similar shape and/or similar sound. Use the Chinese string similarity formula to compute the similarity between each fuzzy-matched character string and its corresponding scattered string, and determine whether the similarity is not less than a threshold $t_w$. Each fuzzy-matched character string whose similarity is not less than the threshold is taken as a similar word of its corresponding scattered string and added to the exact-segmentation word graph as a fuzzy matching node, forming the fuzzy-segmentation word graph, until all characters in the sentence have been traversed. Under the Chinese string similarity formula, the similarity between the fuzzy-matched character string $W_2$ and its corresponding scattered string $W_1$ is:
$$Sim(W_1, W_2) = 1 - \frac{editdis(W_1, W_2)}{\max(m, n)} \quad (1)$$
where the Chinese strings are $W_1 = c_1 c_2 \ldots c_n$ and $W_2 = d_1 d_2 \ldots d_m$, and $editdis(W_1, W_2)$ is the distance function of the two character strings:
$$editdis(W_1, W_2) = \max \begin{cases} editdis(c_2 \ldots c_n,\ d_1 \ldots d_m) + 1 \\ editdis(c_1 \ldots c_n,\ d_2 \ldots d_m) + 1 \\ editdis(c_2 \ldots c_n,\ d_1 \ldots d_m) + 1 \cdot (1 - sim(c_1, d_1)) \end{cases} \quad (2)$$
where $sim(c_i, d_i)$ is the similarity of the Chinese characters $c_i$ and $d_i$:
$$sim(c_i, d_i) = \alpha \cdot PSim(c_i, d_i) + \beta \cdot SSim(c_i, d_i)$$
where $PSim(c_i, d_i)$ is the pinyin similarity of the Chinese characters $c_i$ and $d_i$, $SSim(c_i, d_i)$ is their shape similarity, and $\alpha$ and $\beta$ are the weights of pinyin similarity and shape similarity respectively, with $\alpha + \beta = 1$.
When the method is used to proofread Chinese non-multi-character word errors in text entered with a pinyin input method or a phonetic input method, the pinyin-similarity weight is α = 1 and the shape-similarity weight is β = 0.
When the method is used to proofread Chinese non-multi-character word errors for OCR recognition error correction, the pinyin-similarity weight is α = 0 and the shape-similarity weight is β = 1.
When the method is used to proofread Chinese non-multi-character word errors in text entered with both a pinyin input method and a shape-based input method, the pinyin-similarity weight is α = 0.5 and the shape-similarity weight is β = 0.5.
Specifically, in this embodiment this is implemented by the following steps:
Step 20) Initialize the starting match position of the given Chinese sentence, nCurr = 0;
Step 21) Read the current character at the current position nCurr of the Chinese sentence and apply fuzzy matching to it;
During fuzzy matching, the character at the current position may be handled by single-character replacement (replacing it with a character of similar shape or similar sound), or similarity may be computed for multi-character or missing-character matches;
Step 22) Use the Chinese string similarity formula to compute the similarity of the two character strings, that is, the similarity between the original string being fuzzy-matched in the sentence and the matched word, also referred to as the similarity between the fuzzy-matched character string and its corresponding scattered string. For example, in Fig. 1:
For "without former", similar Chinese characters such as "edge" are obtained for "former" by computing pinyin similarity and shape similarity; the Chinese string similarity formula (1) is then used to compute the similarity between the Chinese string "without former" and the dictionary word "have no chance".
In this embodiment the user's input methods are a pinyin input method and a shape-based input method, so α = β = 0.5;
Step 23) If the similarity is less than the threshold $t_w$, set nCurr = nCurr + 1 and go to step 21); otherwise go to step 24). Because Chinese characters are highly confusable, the threshold $t_w$ is 0.95 in this embodiment; it can of course be adjusted for the application, for example to 0.90, 0.92 or 0.98;
Step 24) When the similarity is not less than the threshold $t_w$, a group of similar words and similarities (sFuzzyWord, next, sim) is obtained, where sFuzzyWord is the matched word, next is the position of the next node to be read for fuzzy matching (next = nCurr + 1), and sim is the similarity, computed between sFuzzyWord and the original string running from the start position nCurr to the position where matching stopped. If next equals the length of the sentence, terminate; otherwise update nCurr to next and jump back to step 21);
Step 25) Add each similar word whose fuzzy-matching similarity is not less than the threshold $t_w$ to the exact-segmentation word graph as a fuzzy matching node, forming the fuzzy-segmentation word graph; as shown in Fig. 1, these nodes are drawn with dashed boxes in the word graph in this embodiment.
In the example of Fig. 1, the scattered string "None", "former" is matched to the dictionary word "have no chance" by sound-similar fuzzy matching, and the scattered string "work", "expense" is matched to the dictionary words "telephone expenses" and "cost of living" by shape-similar and missing-character fuzzy matching. These fuzzy matching nodes are added to the word graph and drawn with dashed boxes.
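The candidate generation of steps 20) to 25) can be sketched for the single-character replacement case only; multi-character and missing-character matching are omitted. The sketch tries every span of two to `max_len` characters rather than following the nCurr bookkeeping literally, and it scores a one-substitution match as 1 - (1 - s) / length, which is what formula (1) gives when exactly one character differs. The function name, the `similar` table and its value are hypothetical.

```python
def fuzzy_candidates(sentence, lexicon, similar, threshold=0.95, max_len=4):
    """Return fuzzy matching nodes (start, end, word, sim).

    similar: dict mapping a character to {similar character: similarity}.
    """
    nodes = []
    for start in range(len(sentence)):
        for end in range(start + 2, min(len(sentence), start + max_len) + 1):
            original = sentence[start:end]
            if original in lexicon:
                continue  # exact match already in the word graph
            for k, ch in enumerate(original):
                for cand, s in similar.get(ch, {}).items():
                    word = original[:k] + cand + original[k + 1:]
                    # One substitution of cost (1 - s) over len(word) characters.
                    sim = 1 - (1 - s) / len(word)
                    if word in lexicon and sim >= threshold:
                        nodes.append((start, end, word, sim))
    return nodes


lexicon = {"无缘", "无缘无故", "无故"}
similar = {"原": {"缘": 0.92}}  # invented similarity value
nodes = fuzzy_candidates("无原无故", lexicon, similar)
strict = fuzzy_candidates("无原无故", lexicon, similar, threshold=0.97)
```

Raising the threshold prunes the shorter candidate first, because a single substitution weighs more heavily on a two-character word than on a four-character one.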
3) Based on a word bigram model combined with similarity, compute the shortest path through the fuzzy-segmentation word graph to obtain the final segmentation result; mark the original strings corresponding to the fuzzy matching nodes in the segmentation result as the detected errors, thereby achieving automatic proofreading of Chinese non-multi-character word errors. Specifically:
Step 31) From the fuzzy-segmentation word graph obtained by the exact segmentation of step 1) and the fuzzy matching of step 2), obtain multiple paths; combining the similar words corresponding to the scattered strings and their similarities obtained in step 2), use a bigram model to compute the probability of each segmentation sequence:
The invention uses a word bigram model combined with similarity to compute the probability after segmentation, and applies a penalty to the results of fuzzy segmentation: where $W$ is a segmentation path in the word graph, $W_i$ is the i-th word in the path, and $n$ is the number of words in the path; $\alpha(W_{i-1}, W')$ is the penalty applied during segmentation when the original string in the sentence is a scattered string corresponding to a fuzzy matching node. If the current word comes from exact segmentation, $\alpha(W_{i-1}, W') = 1$; otherwise $\alpha(W_{i-1}, W') = sim(W_{i-1}, W')$, that is, the similarity between the original string $W'$ that was fuzzy-matched in the sentence and the matched word $W_{i-1}$, also referred to as the similarity between the fuzzy-matched character string $W_{i-1}$ and its corresponding scattered string $W'$;
Step 32) On the fuzzy-segmentation word graph of step 31), solve for the shortest path using Dijkstra's algorithm to obtain the final segmentation result;
Step 33) For each fuzzy matching node on the shortest path, mark its corresponding original string as a word containing a wrongly written character, and take the similar word obtained by fuzzy matching as its correct word, thereby achieving automatic proofreading of Chinese non-multi-character word errors.
In the example of this embodiment shown in Fig. 1, the shortest path is solved on the word graph produced by exact segmentation and fuzzy segmentation, using the bigram model combined with similarity. The path Path = {"S", "you", "often", "why", "gratuitous", "button", "getting", "I", "", "telephone expenses"} has the highest probability and is therefore the shortest path of the graph. The dashed-box nodes "gratuitous" and "telephone expenses" on this path are fuzzy matching nodes, so the original strings "without without reason former" and "expense of living" in the original sentence contain wrongly written characters. Comparing them with the correct words "gratuitous" and "telephone expenses" found by fuzzy matching shows that "former" and "work" are the wrongly written characters in the sentence, and that "without without reason former" and "expense of living" are non-multi-character word errors.
4. Tests
In repeated open tests, the experiments used a test corpus of 20,000 sentence lines containing 664 non-multi-character word errors, comprising character-substitution, character-insertion and character-deletion errors. The results show that the method provided by the invention detects non-multi-character word errors with a recall of 75.9% and a precision of 85%, and that the correction rate reaches 62% with a correction accuracy of 81.7%. This precision exceeds that of the prior art and meets the demands of practical application, showing high effectiveness and accuracy.
The above embodiments are merely preferred embodiments of the invention and do not limit it. Any modification, equivalent replacement or improvement made by those skilled in the art without departing from the technical idea of the invention falls within the scope of protection of the invention.
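As a rough reading aid for the reported figures, the short calculation below derives two quantities that the source does not itself report: the approximate number of errors detected, and the F1 score implied by the stated precision and recall. The function name `f1_score` is an assumption, and the derived values are arithmetic consequences of the reported percentages, not additional experimental results.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total_errors = 664              # errors in the test corpus, as reported
recall, precision = 0.759, 0.85 # detection recall and precision, as reported

detected = round(total_errors * recall)  # errors correctly flagged, approximately
f1 = f1_score(precision, recall)

print(detected, round(f1, 3))
```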

Claims (9)

1. A method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation, characterized in that automatic proofreading is performed by means of fuzzy word segmentation, comprising the following steps:
1) Using double-array Trie structures built from a correct-word dictionary and a typo-word dictionary, perform exact segmentation of the Chinese sentence by maximum matching and build an exact-segmentation word graph; mark the results of exact segmentation against the typo-word dictionary, and at the same time add to the word graph the correct word corresponding to each typo word in the Chinese sentence that matches the typo-word dictionary;
2) Apply a fuzzy matching method to the scattered strings in the exact-segmentation result to obtain the similar words corresponding to the scattered strings and their similarities; add the obtained similar words to the exact-segmentation word graph to form a fuzzy-segmentation word graph;
3) Based on a word bigram model combined with similarity, compute the shortest path through the fuzzy-segmentation word graph to obtain the final segmentation result; mark the original strings corresponding to the fuzzy matching nodes in the segmentation result as the detected errors, thereby achieving automatic proofreading of Chinese non-multi-character word errors.
2. The method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation according to claim 1, characterized in that step 1) comprises the following steps:
Step 11) Build the double-array Trie structure DicTrie of the correct-word dictionary;
Step 12) Build the double-array Trie structure TypoDicTrie: (TypoWord, CorrectWord) of the typo-word dictionary, where TypoWord is a typo word and CorrectWord is the correct word corresponding to it;
Step 13) Based on DicTrie, the double-array Trie structure of the correct-word dictionary, perform exact segmentation of the Chinese sentence by maximum matching, and add the segmented words to the word graph to build the exact-segmentation word graph;
Step 14) Based on TypoDicTrie, the double-array Trie structure of the typo-word dictionary, perform exact segmentation of the Chinese sentence by maximum matching and mark the sentence: each typo word TypoWord found in the sentence is marked as an erroneous word, together with its corresponding correct word CorrectWord; at the same time, the correct word CorrectWord corresponding to each typo word TypoWord in the sentence is added to the exact-segmentation word graph.
3. the non-multi-character word error auto-collation based on fuzzy participle according to claim 1, is characterized in that described step 2) comprising:
Traversal is by step 1) character in Chinese sentence accurately after participle, adopt Method of Fuzzy Matching to carry out fuzzy matching to each character; Character string in calculating fuzzy matching and the similarity of the loose string corresponding with it; Judge whether similarity is not less than threshold value t w, the similar word of the character string in the fuzzy matching of threshold value as the loose string corresponding with it is not less than to similarity, and it can be used as fuzzy matching node to join in accurate participle word figure to form fuzzy participle word figure, until the character in sentence has been traversed;
Character string W in wherein said calculating fuzzy matching 2and the loose string W corresponding with it 1similarity be:
S i m ( W 1 , W 2 ) = 1 - e d i t d i s ( W 1 W 2 ) m a x ( m , n ) - - - ( 1 ) ;
Wherein: Chinese string W 1=c 1c 2... c n, W 2=d 1d 2... d m, editdis (W 1, W 2) be the distance function of two character strings:
$$\mathrm{editdis}(W_1, W_2) = \min\left\{\begin{array}{l} \mathrm{editdis}(c_2 \ldots c_n,\; d_1 \ldots d_m) + 1 \\ \mathrm{editdis}(c_1 \ldots c_n,\; d_2 \ldots d_m) + 1 \\ \mathrm{editdis}(c_2 \ldots c_n,\; d_2 \ldots d_m) + \left(1 - \mathrm{sim}(c_1, d_1)\right) \end{array}\right. \tag{2}$$
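where $\mathrm{sim}(c_i, d_i)$ is the similarity between Chinese characters $c_i$ and $d_i$:
$$\mathrm{sim}(c_i, d_i) = \alpha \cdot \mathrm{PSim}(c_i, d_i) + \beta \cdot \mathrm{SSim}(c_i, d_i) \tag{3}$$
where $\mathrm{PSim}(c_i, d_i)$ is the pinyin similarity between Chinese characters $c_i$ and $d_i$, $\mathrm{SSim}(c_i, d_i)$ is the shape similarity between Chinese characters $c_i$ and $d_i$, $\alpha$ and $\beta$ are the weights of the pinyin similarity and the shape similarity respectively, and $\alpha + \beta = 1$.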
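The similarity computation of claim 3 can be sketched as below. The sketch rests on two assumptions: the recurrence for editdis is read as the usual minimum over deletion, insertion, and weighted substitution, and the character similarity is taken to be the weighted sum of pinyin and shape similarity that the weights α and β imply. The lookup tables and their values are hypothetical; a real system would derive PSim and SSim from pinyin and glyph data.

```python
def lookup(table, c, d):
    """Symmetric lookup in a {(char, char): similarity} table; unknown pairs score 0."""
    return table.get((c, d), table.get((d, c), 0.0))


def char_sim(c, d, pinyin_table, shape_table, alpha=0.5, beta=0.5):
    """Character similarity: alpha * PSim + beta * SSim, with identical characters scoring 1."""
    if c == d:
        return 1.0
    return alpha * lookup(pinyin_table, c, d) + beta * lookup(shape_table, c, d)


def editdis(w1, w2, sim):
    """Weighted edit distance; dp[i][j] is the distance between w1[i:] and w2[j:]."""
    n, m = len(w1), len(w2)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        dp[i][m] = float(n - i)  # delete the rest of w1
    for j in range(m):
        dp[n][j] = float(m - j)  # insert the rest of w2
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            dp[i][j] = min(
                dp[i + 1][j] + 1,                               # drop c_1
                dp[i][j + 1] + 1,                               # drop d_1
                dp[i + 1][j + 1] + (1 - sim(w1[i], w2[j])),     # substitute c_1 by d_1
            )
    return dp[0][0]


def string_sim(w1, w2, sim):
    """Sim(W1, W2) = 1 - editdis(W1, W2) / max(m, n)."""
    longest = max(len(w1), len(w2))
    return 1.0 if longest == 0 else 1.0 - editdis(w1, w2, sim) / longest


# Hypothetical similarity tables, for illustration only.
PINYIN = {("杨", "扬"): 1.0}
SHAPE = {("杨", "扬"): 0.8}


def sim(c, d):
    return char_sim(c, d, PINYIN, SHAPE)


score = string_sim("发杨", "发扬", sim)  # roughly 0.95 with the toy values above
```

The table-based dynamic program avoids the exponential blow-up of evaluating the recurrence directly, while computing the same value.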
4. The fuzzy word segmentation based non-multi-character word error automatic proofreading method according to claim 3, characterized in that the fuzzy matching method is performed by single-character replacement, multi-character replacement, or missing-character replacement, the single-character replacement being replacement by a similarly shaped character and/or replacement by a similar-sounding character.
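As a rough illustration of the single-character replacement in claim 4, the sketch below swaps one character at a time for a confusable character and keeps the results that are dictionary words. The confusion set, the dictionary, and the function name are hypothetical; multi-character and missing-character replacement are not covered.

```python
def single_char_candidates(span, dictionary, confusion):
    """Return dictionary words reachable from `span` by replacing exactly one character
    with a shape-similar or sound-similar character from the confusion set."""
    candidates = set()
    for i, ch in enumerate(span):
        for alt in confusion.get(ch, ()):
            candidate = span[:i] + alt + span[i + 1:]
            if candidate in dictionary:
                candidates.add(candidate)
    return candidates


# Hypothetical data, for illustration only.
dictionary = {"发扬", "发展"}
confusion = {"杨": {"扬", "场"}}
candidates = single_char_candidates("发杨", dictionary, confusion)
```

Each candidate would then be scored against the original span and kept only if its similarity reaches the threshold.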
5. The fuzzy word segmentation based non-multi-character word error automatic proofreading method according to claim 3, characterized in that, for automatic proofreading of Chinese non-multi-character word errors where the user's input method is a pinyin input method or a phonetic input method, the pinyin similarity weight α = 1 and the shape similarity weight β = 0.
6. The fuzzy word segmentation based non-multi-character word error automatic proofreading method according to claim 3, characterized in that, for automatic proofreading of Chinese non-multi-character word errors in OCR recognition error correction, the pinyin similarity weight α = 0 and the shape similarity weight β = 1.
7. The fuzzy word segmentation based non-multi-character word error automatic proofreading method according to claim 3, characterized in that, for automatic proofreading of Chinese non-multi-character word errors where the user's input methods are a pinyin input method and a character-shape input method, the pinyin similarity weight α = 0.5 and the shape similarity weight β = 0.5.
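The three weight settings of claims 5 to 7 can be captured as presets. The scenario names and the helper below are assumptions made for illustration; only the (α, β) pairs come from the claims.

```python
# (alpha, beta) = (pinyin weight, shape weight); keys are hypothetical scenario names.
WEIGHT_PRESETS = {
    "pinyin_input": (1.0, 0.0),            # claim 5: pinyin or phonetic input method
    "ocr": (0.0, 1.0),                     # claim 6: OCR recognition error correction
    "pinyin_and_shape_input": (0.5, 0.5),  # claim 7: pinyin and character-shape input
}


def weighted_char_similarity(psim_value, ssim_value, scenario):
    """Combine a pinyin similarity and a shape similarity using the preset for `scenario`."""
    alpha, beta = WEIGHT_PRESETS[scenario]
    return alpha * psim_value + beta * ssim_value
```

With a pair that sounds identical but looks only partly alike, the pinyin preset scores it as fully similar, while the OCR preset relies on the shape score alone.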
8. The fuzzy word segmentation based non-multi-character word error automatic proofreading method according to claim 3, characterized in that step 3) comprises the following steps:
Step 31) based on the fuzzy-segmentation word graph obtained after the exact segmentation of the sentence in step 1) and the fuzzy matching of the sentence in step 2), obtaining multiple paths; combining the similar words corresponding to the loose strings obtained in step 2) and their similarities, using a bigram model to compute the probability of each segmentation sequence:
$$W^* = \arg\max_W P(W) = \arg\max_W \; p(W_1) \prod_{i=2}^{n} p(W_i \mid W_{i-1}) \cdot \alpha(W_{i-1}, W') \tag{4}$$
where $W$ is a segmentation path in the word graph, $W_i$ is the $i$-th word in the path, and $n$ is the number of words in the path; $\alpha(W_{i-1}, W')$ is the penalty value applied during segmentation when the original string in the sentence is a loose string corresponding to a fuzzy matching node: $\alpha(W_{i-1}, W') = 1$ when the current word comes from exact segmentation; otherwise $\alpha(W_{i-1}, W') = \mathrm{sim}(W_{i-1}, W')$, i.e. the similarity between the original string $W'$ in the sentence and the fuzzy-matched word $W_{i-1}$, also referred to as the similarity between the fuzzy-matched character string $W_{i-1}$ and its corresponding loose string $W'$;
Step 32) on the fuzzy-segmentation word graph obtained in step 31), solving for the shortest path with Dijkstra's algorithm for graphs, thereby obtaining the final segmentation result;
Step 33) for each fuzzy matching node on the shortest path, marking its corresponding original string as a word containing a wrongly written character and taking the similar word obtained by fuzzy matching as its corresponding correct word, thereby achieving automatic proofreading of Chinese non-multi-character word errors.
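Steps 31) to 33) can be sketched as a shortest-path search. Taking the negative logarithm of each factor turns the product being maximised into a sum of non-negative edge costs, so Dijkstra's algorithm applies. The sketch makes several assumptions: search states are (position, previous word) pairs so that the bigram context is available, the penalty of each word is charged when that word is entered (a simplification of the indexing in the formula), and the graph layout, the `<s>` start symbol, the probability table, and the smoothing floor are all hypothetical.

```python
import heapq
import math


def best_path(graph, length, bigram_prob):
    """Dijkstra over states (position, previous word).

    graph: {start: [(end, word, original_span, penalty)]}, penalty = 1.0 for exact nodes
           and the string similarity for fuzzy matching nodes.
    Returns [(word, original_span, is_correction)] for the cheapest full path.
    """
    start_state = (0, "<s>")
    dist = {start_state: 0.0}
    back = {}
    heap = [(0.0, 0, "<s>")]
    goal = None
    while heap:
        cost, pos, prev = heapq.heappop(heap)
        if cost > dist.get((pos, prev), math.inf):
            continue  # stale heap entry
        if pos == length:
            goal = (pos, prev)
            break
        for end, word, span, penalty in graph.get(pos, ()):
            # Edge cost -log(p * penalty) is non-negative because p * penalty <= 1.
            new_cost = cost - math.log(bigram_prob(prev, word) * penalty)
            state = (end, word)
            if new_cost < dist.get(state, math.inf):
                dist[state] = new_cost
                back[state] = ((pos, prev), word, span)
                heapq.heappush(heap, (new_cost, end, word))
    if goal is None:
        return []
    path, state = [], goal
    while state != start_state:
        state, word, span = back[state]
        path.append((word, span, word != span))  # step 33: node differs from original text
    path.reverse()
    return path


# Hypothetical fuzzy word graph for "我们要发杨传统" and toy bigram probabilities.
fuzzy_graph = {
    0: [(2, "我们", "我们", 1.0)],
    2: [(3, "要", "要", 1.0)],
    3: [(4, "发", "发", 1.0), (5, "发扬", "发杨", 0.95)],
    4: [(5, "杨", "杨", 1.0)],
    5: [(7, "传统", "传统", 1.0)],
}
BIGRAMS = {("<s>", "我们"): 0.1, ("我们", "要"): 0.1, ("要", "发扬"): 0.05, ("发扬", "传统"): 0.2}


def bigram_prob(prev, word):
    return BIGRAMS.get((prev, word), 1e-4)  # crude floor instead of real smoothing


path = best_path(fuzzy_graph, 7, bigram_prob)
corrections = [(span, word) for word, span, is_correction in path if is_correction]
```

Here `corrections` pairs each original string judged erroneous with the word proposed in its place, which corresponds to the marking described in step 33).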
9. The fuzzy word segmentation based non-multi-character word error automatic proofreading method according to claim 3, characterized in that the threshold $t_w$ is 0.95.
CN201510361877.8A 2015-06-26 2015-06-26 Fuzzy word segmentation based non-multi-character word error automatic proofreading method Active CN104991889B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510361877.8A CN104991889B (en) 2015-06-26 2015-06-26 Fuzzy word segmentation based non-multi-character word error automatic proofreading method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510361877.8A CN104991889B (en) 2015-06-26 2015-06-26 Fuzzy word segmentation based non-multi-character word error automatic proofreading method

Publications (2)

Publication Number Publication Date
CN104991889A CN104991889A (en) 2015-10-21
CN104991889B CN104991889B (en) 2018-02-02

Family

ID=54303705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510361877.8A Active CN104991889B (en) 2015-06-26 2015-06-26 Fuzzy word segmentation based non-multi-character word error automatic proofreading method

Country Status (1)

Country Link
CN (1) CN104991889B (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512110A (en) * 2015-12-15 2016-04-20 江苏科技大学 Wrong word knowledge base construction method based on fuzzy matching and statistics
CN105573979A (en) * 2015-12-10 2016-05-11 江苏科技大学 Chinese character confusion set based wrong word knowledge generation method
CN106527757A (en) * 2016-10-28 2017-03-22 上海智臻智能网络科技股份有限公司 Input error correction method and apparatus
CN106528532A (en) * 2016-11-07 2017-03-22 上海智臻智能网络科技股份有限公司 Text error correction method and device and terminal
CN106547741A (en) * 2016-11-21 2017-03-29 江苏科技大学 A kind of Chinese language text auto-collation based on collocation
CN106598939A (en) * 2016-10-21 2017-04-26 北京三快在线科技有限公司 Method and device for text error correction, server and storage medium
CN106610953A (en) * 2016-09-30 2017-05-03 四川用联信息技术有限公司 Method for solving text similarity based on Gini index
CN108572998A (en) * 2017-03-14 2018-09-25 北京橙鑫数据科技有限公司 A kind of data search method and device for electronic card data
CN108717412A (en) * 2018-06-12 2018-10-30 北京览群智数据科技有限责任公司 Chinese check and correction error correction method based on Chinese word segmentation and system
CN108766437A (en) * 2018-05-31 2018-11-06 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN109492202A (en) * 2018-11-12 2019-03-19 浙江大学山东工业技术研究院 A kind of Chinese error correction of coding and decoded model based on phonetic
CN109558596A (en) * 2018-12-14 2019-04-02 平安城市建设科技(深圳)有限公司 Recognition methods, device, terminal and computer readable storage medium
CN109657738A (en) * 2018-10-25 2019-04-19 平安科技(深圳)有限公司 Character identifying method, device, equipment and storage medium
CN110020005A (en) * 2019-03-28 2019-07-16 云知声(上海)智能科技有限公司 Symptom matching process in main suit and present illness history in a kind of case history
CN111209748A (en) * 2019-12-16 2020-05-29 合肥讯飞数码科技有限公司 Wrong-recognized word recognition method, related equipment and readable storage medium
CN112765318A (en) * 2021-01-20 2021-05-07 阅尔基因技术(苏州)有限公司 Natural language processing method and system for infertility clinical phenotype information
CN112954387A (en) * 2021-01-26 2021-06-11 广州欢网科技有限责任公司 Method, system and readable storage medium for updating and optimizing television program list
CN113033193A (en) * 2021-01-20 2021-06-25 山谷网安科技股份有限公司 C + + language-based mixed Chinese text word segmentation method
CN114091436A (en) * 2022-01-21 2022-02-25 万商云集(成都)科技股份有限公司 Sensitive word detection method based on decision tree and variant recognition
CN114781371A (en) * 2022-04-07 2022-07-22 山东新一代信息产业技术研究院有限公司 Chinese word segmentation method based on statistics and dictionary

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1514387A (en) * 2002-12-31 2004-07-21 中国科学院计算技术研究所 Sound distinguishing method in speech sound inquiry
CN102393850A (en) * 2011-07-22 2012-03-28 镇江诺尼基智能技术有限公司 Chinese character pattern cognition similarity computing method

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
刘亮亮 等: "领域问答系统中的文本错误自动发现方法", 《中文信息学报》 *
张仰森 等: "基于规则与统计相结合的中文文本自动查错模型与算法", 《中文信息学报》 *
张华平 等: "基于N-最短路径方法的中文词语粗分模型", 《中文信息学报》 *
张磊 等: "基于快速模糊词匹配算法的中文自动校对方法", 《PROCEEDINGS OF THE 3RD WORLD CONGRESS ON INTELLIGENT CONTROL AND AUTOMATION》 *
施恒利 等: "汉字种子混淆集的构建方法研究", 《计算机科学》 *
施恒利: "汉字种子混淆集的构建方法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
王思力 等: "双数组Trie树算法优化及其应用研究", 《中文信息学报》 *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105573979B (en) * 2015-12-10 2018-05-22 江苏科技大学 A kind of wrongly written character word knowledge generation method that collection is obscured based on Chinese character
CN105573979A (en) * 2015-12-10 2016-05-11 江苏科技大学 Chinese character confusion set based wrong word knowledge generation method
CN105512110A (en) * 2015-12-15 2016-04-20 江苏科技大学 Wrong word knowledge base construction method based on fuzzy matching and statistics
CN105512110B (en) * 2015-12-15 2018-04-06 江苏科技大学 A kind of wrongly written character word construction of knowledge base method based on fuzzy matching with statistics
CN106610953A (en) * 2016-09-30 2017-05-03 四川用联信息技术有限公司 Method for solving text similarity based on Gini index
CN106598939B (en) * 2016-10-21 2019-09-17 北京三快在线科技有限公司 A kind of text error correction method and device, server, storage medium
CN106598939A (en) * 2016-10-21 2017-04-26 北京三快在线科技有限公司 Method and device for text error correction, server and storage medium
CN106527757A (en) * 2016-10-28 2017-03-22 上海智臻智能网络科技股份有限公司 Input error correction method and apparatus
CN106528532B (en) * 2016-11-07 2019-03-12 上海智臻智能网络科技股份有限公司 Text error correction method, device and terminal
CN106528532A (en) * 2016-11-07 2017-03-22 上海智臻智能网络科技股份有限公司 Text error correction method and device and terminal
CN106547741A (en) * 2016-11-21 2017-03-29 江苏科技大学 A kind of Chinese language text auto-collation based on collocation
CN108572998A (en) * 2017-03-14 2018-09-25 北京橙鑫数据科技有限公司 A kind of data search method and device for electronic card data
CN108766437A (en) * 2018-05-31 2018-11-06 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN108717412A (en) * 2018-06-12 2018-10-30 北京览群智数据科技有限责任公司 Chinese check and correction error correction method based on Chinese word segmentation and system
CN109657738A (en) * 2018-10-25 2019-04-19 平安科技(深圳)有限公司 Character identifying method, device, equipment and storage medium
CN109657738B (en) * 2018-10-25 2024-04-30 平安科技(深圳)有限公司 Character recognition method, device, equipment and storage medium
WO2020082562A1 (en) * 2018-10-25 2020-04-30 平安科技(深圳)有限公司 Symbol identification method, apparatus, device, and storage medium
CN109492202B (en) * 2018-11-12 2022-12-27 浙江大学山东工业技术研究院 Chinese error correction method based on pinyin coding and decoding model
CN109492202A (en) * 2018-11-12 2019-03-19 浙江大学山东工业技术研究院 A kind of Chinese error correction of coding and decoded model based on phonetic
CN109558596A (en) * 2018-12-14 2019-04-02 平安城市建设科技(深圳)有限公司 Recognition methods, device, terminal and computer readable storage medium
CN110020005B (en) * 2019-03-28 2021-03-26 云知声(上海)智能科技有限公司 Method for matching main complaints in medical records with symptoms in current medical history
CN110020005A (en) * 2019-03-28 2019-07-16 云知声(上海)智能科技有限公司 Symptom matching process in main suit and present illness history in a kind of case history
CN111209748A (en) * 2019-12-16 2020-05-29 合肥讯飞数码科技有限公司 Wrong-recognized word recognition method, related equipment and readable storage medium
CN111209748B (en) * 2019-12-16 2023-10-24 合肥讯飞数码科技有限公司 Error word recognition method, related device and readable storage medium
CN112765318A (en) * 2021-01-20 2021-05-07 阅尔基因技术(苏州)有限公司 Natural language processing method and system for infertility clinical phenotype information
CN113033193A (en) * 2021-01-20 2021-06-25 山谷网安科技股份有限公司 C + + language-based mixed Chinese text word segmentation method
CN113033193B (en) * 2021-01-20 2024-04-16 山谷网安科技股份有限公司 Mixed Chinese text word segmentation method based on C++ language
CN112954387A (en) * 2021-01-26 2021-06-11 广州欢网科技有限责任公司 Method, system and readable storage medium for updating and optimizing television program list
CN114091436A (en) * 2022-01-21 2022-02-25 万商云集(成都)科技股份有限公司 Sensitive word detection method based on decision tree and variant recognition
CN114781371A (en) * 2022-04-07 2022-07-22 山东新一代信息产业技术研究院有限公司 Chinese word segmentation method based on statistics and dictionary

Also Published As

Publication number Publication date
CN104991889B (en) 2018-02-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20151021

Assignee: JIANGSU KEDA HUIFENG SCIENCE AND TECHNOLOGY Co.,Ltd.

Assignor: JIANGSU University OF SCIENCE AND TECHNOLOGY

Contract record no.: X2020980007325

Denomination of invention: An automatic proofreading method for non multi word errors based on fuzzy segmentation

Granted publication date: 20180202

License type: Common License

Record date: 20201029

EC01 Cancellation of recordation of patent licensing contract

Assignee: JIANGSU KEDA HUIFENG SCIENCE AND TECHNOLOGY Co.,Ltd.

Assignor: JIANGSU University OF SCIENCE AND TECHNOLOGY

Contract record no.: X2020980007325

Date of cancellation: 20201223

TR01 Transfer of patent right

Effective date of registration: 20221222

Address after: Room 02A-084, Building C (Second Floor), No. 28, Xinxi Road, Haidian District, Beijing 100085

Patentee after: Jingchuang United (Beijing) Intellectual Property Service Co.,Ltd.

Address before: 212003, No. 2, Mengxi Road, Zhenjiang, Jiangsu

Patentee before: JIANGSU University OF SCIENCE AND TECHNOLOGY

Effective date of registration: 20221222

Address after: Room 606-609, Compound Office Complex Building, No. 757, Dongfeng East Road, Yuexiu District, Guangzhou, Guangdong Province, 510699

Patentee after: China Southern Power Grid Internet Service Co.,Ltd.

Address before: Room 02A-084, Building C (Second Floor), No. 28, Xinxi Road, Haidian District, Beijing 100085

Patentee before: Jingchuang United (Beijing) Intellectual Property Service Co.,Ltd.