CN104991889A

CN104991889A - Fuzzy word segmentation based non-multi-character word error automatic proofreading method

Info

Publication number: CN104991889A
Application number: CN201510361877.8A
Authority: CN
Inventors: 刘亮亮; 吴健康
Original assignee: Jiangsu University of Science and Technology
Current assignee: China Southern Power Grid Internet Service Co ltd; Jingchuang United (Beijing) Intellectual Property Service Co.,Ltd.
Priority date: 2015-06-26
Filing date: 2015-06-26
Publication date: 2015-10-21
Anticipated expiration: 2035-06-26
Also published as: CN104991889B

Abstract

The invention discloses a method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation. The method first performs exact segmentation against a correct-word dictionary and a typo-word dictionary to generate a word graph. It then computes the similarity of Chinese strings with a fuzzy matching algorithm, applies fuzzy matching to the scattered single-character strings left by exact segmentation, and adds the fuzzy matches to the word graph to form a fuzzy word graph. Finally, it computes the shortest path through the fuzzy word graph using a word bigram model combined with similarity, thereby proofreading Chinese non-multi-character word errors automatically. The method responds quickly, its precision meets the demands of practical applications, and it is effective and accurate.

Description

A method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation

Technical field

The present invention relates to natural language processing within the field of artificial intelligence, and in particular to the automatic proofreading of Chinese text.

Background technology

With the rapid development of information processing technology and the internet, traditional text work has been almost entirely taken over by computers. Electronic texts such as e-books, electronic newspapers, e-mail, office documents, blogs, and microblogs have become part of daily life, but the number of errors in text keeps growing, which poses a great challenge to proofreading. Traditional manual proofreading is inefficient, labor-intensive, and slow, and clearly cannot meet the demand for text proofreading.

Automatic text proofreading is one of the main applications of natural language processing and also a hard problem in natural language understanding. As the technology has developed, automatic proofreading of English text has achieved very good results and has been commercialized. Compared with English, automatic proofreading of Chinese text faces the following difficulties:

1) Chinese text has no counterpart to the English "non-word error", that is, a word that is absent from the dictionary and can therefore be found by dictionary lookup; every character in a Chinese text appears in the dictionary.

2) Proofreading Chinese text first requires Chinese word segmentation. If a word contains a wrong character, segmentation breaks it into a scattered string of single characters, which is a non-multi-character word error. This makes error detection in Chinese text difficult.

3) A scattered string of single characters in Chinese does not necessarily contain a wrong character, because single Chinese characters readily form words on their own;

4) Besides non-multi-character word errors, a word in Chinese is often miswritten as another word that is also in the dictionary. This kind of error is called a real-word error, and it is another difficulty in the automatic proofreading of Chinese text;

In view of the above problems, the present invention proposes and implements automatic detection and automatic correction of Chinese non-multi-character word errors.

Summary of the invention

Object of the invention: to overcome the deficiencies of the prior art, the invention provides a method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation.

Technical solution: to solve the above technical problem, the invention provides a method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation. The method performs automatic proofreading by means of fuzzy word segmentation and comprises the following steps:

1) Using double-array Trie structures built from a correct-word dictionary and a typo-word dictionary, segment the Chinese sentence exactly with the maximum matching method and build an exact-segmentation word graph; mark the results of exact segmentation against the typo-word dictionary, and at the same time add to the word graph the correct word corresponding to each typo word in the Chinese sentence that matches the typo-word dictionary;

2) Apply a fuzzy matching method to the scattered strings in the exact segmentation result to obtain the similar words corresponding to each scattered string together with their similarities, and add these similar words to the exact-segmentation word graph to form a fuzzy-segmentation word graph;

3) Using a word bigram model combined with similarity, compute the shortest path through the fuzzy-segmentation word graph to obtain the final segmentation result; the original strings corresponding to the fuzzy matching nodes in the segmentation result are marked as the detected errors, thereby achieving automatic proofreading of Chinese non-multi-character word errors.

Preferably, step 1) comprises the following steps:

Step 11) Build the double-array Trie structure DicTrie of the correct-word dictionary;

Step 12) Build the double-array Trie structure TypoDicTrie: (TypoWord, CorrectWord) of the typo-word dictionary, where TypoWord is a typo word and CorrectWord is the correct word corresponding to it;

Step 13) Based on the double-array Trie structure DicTrie of the correct-word dictionary, segment the Chinese sentence exactly with the maximum matching method, and add the resulting words to the word graph to build the exact-segmentation word graph;

Step 14) Based on the double-array Trie structure TypoDicTrie of the typo-word dictionary, segment the Chinese sentence exactly with the maximum matching method and mark the sentence: each typo word TypoWord from the typo-word dictionary that is found in the sentence is marked as an erroneous word, together with its corresponding correct word CorrectWord. At the same time, the correct word CorrectWord corresponding to each typo word TypoWord in the sentence is added to the exact-segmentation word graph.
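Steps 11) to 14) can be sketched as follows. This is a minimal illustration under several assumptions, not the patented implementation: a plain nested-dictionary trie stands in for the double-array Trie, characters that match no dictionary entry become single-character nodes, and the names `Trie`, `longest_match`, and `build_exact_graph` as well as the Chinese strings are hypothetical choices made for this sketch.

```python
from typing import Dict, List, Optional, Tuple


class Trie:
    """Plain nested-dict trie; a stand-in for the double-array Trie."""

    def __init__(self) -> None:
        self.children: Dict[str, "Trie"] = {}
        self.value: Optional[str] = None  # payload at a word end, else None

    def insert(self, word: str, value: str) -> None:
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.value = value

    def longest_match(self, text: str, start: int) -> Optional[Tuple[int, str]]:
        """Return (end, payload) of the longest entry starting at `start`."""
        node, best = self, None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.value is not None:
                best = (i + 1, node.value)
        return best


# A word-graph node: (start, end, word, kind)
Node = Tuple[int, int, str, str]


def build_exact_graph(sentence: str, dic_trie: Trie, typo_trie: Trie) -> List[Node]:
    graph: List[Node] = []
    # Step 13: forward maximum matching against the correct-word dictionary.
    pos1 = 0
    while pos1 < len(sentence):
        hit = dic_trie.longest_match(sentence, pos1)
        end = hit[0] if hit else pos1 + 1  # assumed fallback: single character
        graph.append((pos1, end, sentence[pos1:end], "exact"))
        pos1 = end
    # Step 14: forward maximum matching against the typo-word dictionary;
    # the node stores the correct word for the matched typo span.
    pos2 = 0
    while pos2 < len(sentence):
        hit = typo_trie.longest_match(sentence, pos2)
        if hit:
            end, correct = hit
            graph.append((pos2, end, correct, "typo"))
            pos2 = end
        else:
            pos2 += 1
    return graph


# Illustrative data only (not taken from the source).
dic_trie, typo_trie = Trie(), Trie()
for w in ["你", "为什么", "经常", "无故"]:
    dic_trie.insert(w, w)
typo_trie.insert("无原无故", "无缘无故")
graph = build_exact_graph("你为什么经常无原无故", dic_trie, typo_trie)
```

Each node keeps its character span, so later stages can connect a node ending at position k to every node starting at k.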

Preferably, step 2) comprises:

Traverse the characters of the Chinese sentence exactly segmented in step 1) and apply the fuzzy matching method at each character; compute the similarity between each fuzzy-matched string and its corresponding scattered string; check whether the similarity is not less than the threshold $t_w$; take each fuzzy-matched string whose similarity is not less than the threshold as a similar word of its corresponding scattered string, and add it to the exact-segmentation word graph as a fuzzy matching node to form the fuzzy-segmentation word graph; continue until all characters in the sentence have been traversed;

The similarity between the fuzzy-matched string $W_2$ and its corresponding scattered string $W_1$ is computed as:

$$Sim(W_1, W_2) = 1 - \frac{editdis(W_1, W_2)}{\max(m, n)} \qquad (1)$$

where the Chinese strings are $W_1 = c_1 c_2 \ldots c_n$ and $W_2 = d_1 d_2 \ldots d_m$, and $editdis(W_1, W_2)$ is the distance function of the two strings:

$$editdis(W_1, W_2) = \min \begin{cases} editdis(c_2 \ldots c_n,\ d_1 \ldots d_m) + 1 \\ editdis(c_1 \ldots c_n,\ d_2 \ldots d_m) + 1 \\ editdis(c_2 \ldots c_n,\ d_2 \ldots d_m) + \bigl(1 - sim(c_1, d_1)\bigr) \end{cases} \qquad (2)$$

where $sim(c_i, d_i)$ is the similarity of the Chinese characters $c_i$ and $d_i$:

$$sim(c_i, d_i) = \alpha \cdot PSim(c_i, d_i) + \beta \cdot SSim(c_i, d_i) \qquad (3)$$

where $PSim(c_i, d_i)$ is the pinyin similarity of the characters $c_i$ and $d_i$, $SSim(c_i, d_i)$ is their shape similarity, and $\alpha$ and $\beta$ are the weights of the pinyin similarity and the shape similarity respectively, with $\alpha + \beta = 1$.
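The string similarity of formulas (1) and (2) can be sketched as a weighted edit distance in which a substitution costs one minus the character similarity. The sketch below fills the table bottom-up over prefixes, which gives the same value as the recursive definition. The function names and the toy similarity table are assumptions made for illustration; the source does not say how the character similarities are stored.

```python
from typing import Callable, Dict, Tuple

CharSim = Callable[[str, str], float]


def edit_distance(w1: str, w2: str, sim: CharSim) -> float:
    """Weighted edit distance: insert/delete cost 1, substitute cost 1 - sim."""
    n, m = len(w1), len(w2)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)
    for j in range(1, m + 1):
        dist[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,  # drop a character of w1
                dist[i][j - 1] + 1.0,  # drop a character of w2
                dist[i - 1][j - 1] + (1.0 - sim(w1[i - 1], w2[j - 1])),
            )
    return dist[n][m]


def string_similarity(w1: str, w2: str, sim: CharSim) -> float:
    """Formula (1): 1 - editdis / max(m, n)."""
    longest = max(len(w1), len(w2))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(w1, w2, sim) / longest


# Toy character similarities (hypothetical values, for illustration only).
TOY_SIM: Dict[Tuple[str, str], float] = {("原", "缘"): 0.9}


def toy_char_sim(c: str, d: str) -> float:
    if c == d:
        return 1.0
    return TOY_SIM.get((c, d), TOY_SIM.get((d, c), 0.0))


score = string_similarity("无原", "无缘", toy_char_sim)
```

With these toy values the only cost is the substitution of one character with similarity 0.9, so the distance is 0.1 and the similarity of the two-character strings is about 0.95.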

Preferably, the fuzzy matching method operates by single-character replacement, multi-character replacement, or missing-character replacement, where single-character replacement is based on similar shape and/or similar pronunciation.

Preferably, when the method is used to proofread Chinese non-multi-character word errors in text entered with a pinyin input method or a phonetic input method, the pinyin-similarity weight is α = 1 and the shape-similarity weight is β = 0.

Preferably, when the method is used to correct Chinese non-multi-character word errors from OCR recognition, the pinyin-similarity weight is α = 0 and the shape-similarity weight is β = 1.

Preferably, when the method is used to proofread Chinese non-multi-character word errors in text entered with both a pinyin input method and a shape-based input method, the pinyin-similarity weight is α = 0.5 and the shape-similarity weight is β = 0.5.
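The three weight settings above can be captured as presets for a character-similarity function that combines pinyin similarity and shape similarity as a weighted sum with α + β = 1. The scenario keys, `make_char_sim`, and the toy `PSim`/`SSim` functions are hypothetical names and values used only for this sketch.

```python
from typing import Callable, Dict, Tuple

CharSim = Callable[[str, str], float]

# (alpha, beta) = (pinyin weight, shape weight) for each usage scenario.
WEIGHTS: Dict[str, Tuple[float, float]] = {
    "pinyin": (1.0, 0.0),        # pinyin or phonetic input
    "ocr": (0.0, 1.0),           # OCR error correction
    "pinyin+shape": (0.5, 0.5),  # mixed pinyin and shape-based input
}


def make_char_sim(scenario: str, psim: CharSim, ssim: CharSim) -> CharSim:
    """Build sim(c, d) = alpha * PSim(c, d) + beta * SSim(c, d)."""
    alpha, beta = WEIGHTS[scenario]

    def char_sim(c: str, d: str) -> float:
        return alpha * psim(c, d) + beta * ssim(c, d)

    return char_sim


# Toy similarity functions (hypothetical values, for illustration only).
def toy_psim(c: str, d: str) -> float:
    return 1.0 if c == d else 0.8


def toy_ssim(c: str, d: str) -> float:
    return 1.0 if c == d else 0.4


mixed = make_char_sim("pinyin+shape", toy_psim, toy_ssim)
ocr_only = make_char_sim("ocr", toy_psim, toy_ssim)
```

Keeping the weights in one table makes it straightforward to switch the proofreader between input scenarios without touching the similarity code.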

Preferably, step 3) comprises the following steps:

Step 31) The fuzzy-segmentation word graph obtained by exactly segmenting the sentence in step 1) and fuzzy matching it in step 2) contains multiple paths. Using the similar words and similarities obtained in step 2), compute the probability of each segmentation sequence with the bigram model:

$$W^{*} = \arg\max_{W} P(W) = \arg\max_{W}\ p(W_1) \prod_{i=2}^{n} p(W_i \mid W_{i-1}) \cdot \alpha(W_{i-1}, W') \qquad (4)$$

where $W$ is a segmentation path in the word graph, $W_i$ is the i-th word on the path, and $n$ is the number of words on the path; $\alpha(W_{i-1}, W')$ is the penalty applied during segmentation when the original string in the sentence is a scattered string corresponding to a fuzzy matching node. When the current word comes from exact segmentation, $\alpha(W_{i-1}, W') = 1$; otherwise $\alpha(W_{i-1}, W') = sim(W_{i-1}, W')$, that is, the similarity between the original string $W'$ in the sentence and the fuzzy-matched word $W_{i-1}$, in other words the similarity between the fuzzy-matched string $W_{i-1}$ and its corresponding scattered string $W'$;

Step 32) On the fuzzy-segmentation word graph obtained in step 31), solve for the shortest path with Dijkstra's algorithm to obtain the final segmentation result;

Step 33) For each fuzzy matching node on the shortest path, mark its original string as a word containing a wrong character and take the similar word obtained by fuzzy matching as the corresponding correct word, thereby achieving automatic proofreading of Chinese non-multi-character word errors.
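Steps 31) and 32) can be sketched as a shortest-path search in which each transition costs the negative logarithm of the bigram probability times the penalty. This is an illustration under stated assumptions: the source does not give the edge weights, the penalty is attached here to the word being entered, the first word is scored as p(W | `<S>`), unseen bigrams get a small floor probability, and all probabilities and strings are toy values.

```python
import heapq
import math
from typing import Dict, List, Tuple

# (start, end, word, penalty): penalty is 1.0 for exact nodes and the
# string similarity for fuzzy matching nodes.
Node = Tuple[int, int, str, float]


def best_path(nodes: List[Node], length: int,
              bigram: Dict[Tuple[str, str], float],
              floor: float = 1e-8) -> List[Node]:
    """Dijkstra over word nodes; edge cost = -log(p(v | u) * penalty(v))."""
    source: Node = (0, 0, "<S>", 1.0)
    by_start: Dict[int, List[Node]] = {}
    for node in nodes:
        by_start.setdefault(node[0], []).append(node)
    dist: Dict[Node, float] = {source: 0.0}
    prev: Dict[Node, Node] = {}
    heap: List[Tuple[float, Node]] = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        if u[1] == length and u != source:
            path: List[Node] = []
            while u != source:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in by_start.get(u[1], []):
            p = bigram.get((u[2], v[2]), floor) * v[3]
            nd = d - math.log(p)  # non-negative because 0 < p <= 1
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return []


# Toy graph: two single characters versus one fuzzy-matched word.
nodes = [
    (0, 1, "无", 1.0),
    (1, 2, "原", 1.0),
    (0, 2, "无缘", 0.96),  # fuzzy matching node with similarity 0.96
    (2, 4, "无故", 1.0),
]
bigram = {
    ("<S>", "无"): 0.01, ("无", "原"): 0.0001, ("原", "无故"): 0.001,
    ("<S>", "无缘"): 0.005, ("无缘", "无故"): 0.05,
}
path = best_path(nodes, 4, bigram)
words = [n[2] for n in path]
```

Because every probability and penalty lies in (0, 1], all edge costs are non-negative, which is the condition Dijkstra's algorithm requires. Any node on the returned path with a penalty below 1.0 is a fuzzy matching node, so its span marks a detected error as in step 33).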

Preferably, the threshold $t_w$ is 0.95.
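To see what this threshold implies, consider the simplest case of two strings of equal length $n$ that differ in a single character pair $(c, d)$, assuming character similarities lie in $[0, 1]$. The cheapest alignment is then one substitution, so formulas (1) and (2) give the following worked bound:

```latex
editdis(W_1, W_2) = 1 - sim(c, d)
\quad\Longrightarrow\quad
Sim(W_1, W_2) = 1 - \frac{1 - sim(c, d)}{n} \ge t_w
\iff
sim(c, d) \ge 1 - n\,(1 - t_w)

% With t_w = 0.95:
%   n = 2:  sim(c, d) >= 0.90
%   n = 3:  sim(c, d) >= 0.85
%   n = 4:  sim(c, d) >= 0.80
```

Shorter words therefore need a more similar replacement character to pass the same threshold.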

Beneficial effects: the invention proposes a method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation. The method effectively identifies and corrects non-multi-character word errors in Chinese text during word segmentation, and its use of double-array Trie structures allows fuzzy segmentation to run quickly. Experiments show that the method achieves a recall of 75.9% and a precision of 85% in detecting non-multi-character word errors, a correction rate of 62%, and a correction accuracy of 81.7%. The system responds quickly, its precision meets the demands of practical applications, and it is effective, accurate, and highly practical.

Brief description of the drawings

Fig. 1 is an example of a fuzzy-segmentation word graph provided by the invention.

Embodiment

The invention is further described below with reference to the drawing and an embodiment.

The invention provides a method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation. The method performs automatic proofreading by means of fuzzy word segmentation and comprises the following steps:

1) Using double-array Trie structures built from a correct-word dictionary and a typo-word dictionary, segment the Chinese sentence exactly with the maximum matching method and build an exact-segmentation word graph; mark the results of exact segmentation against the typo-word dictionary, and at the same time add to the word graph the correct word corresponding to each typo word in the Chinese sentence that matches the typo-word dictionary. Specifically:

First, exact segmentation is performed with the correct-word dictionary and the typo-word dictionary to build the exact-segmentation word graph, where:

S: the sentence to be segmented; Dic1: the correct-word dictionary; Dic2: the typo-word dictionary; pos1: the lookup position for the correct-word dictionary; pos2: the lookup position for the typo-word dictionary.

Step 11) Build the double-array Trie structure DicTrie of the correct-word dictionary Dic1;

Step 12) Build the double-array Trie structure TypoDicTrie: (TypoWord, CorrectWord) of the typo-word dictionary Dic2, where TypoWord is a typo word and CorrectWord is the correct word corresponding to it; for example, (the misspelled four-character idiom "without former, without reason", "gratuitous");

Step 13) Based on the double-array Trie structure DicTrie of the correct-word dictionary, segment the Chinese sentence exactly with the maximum matching method, and add the resulting words to the word graph to build the exact-segmentation word graph, as shown in Fig. 1. In this embodiment, exact-segmentation nodes are drawn as solid boxes in the word graph;

In this embodiment: search forward for the longest match in the correct-word dictionary Dic1 starting from position pos1 (initially 0). If a correct entry word1 is found, add it to the exact-segmentation word graph and move pos1 to the position just after word1; otherwise advance pos1 to the next character. Repeat the search until pos1 reaches the end of sentence S;

Step 14) Based on the double-array Trie structure TypoDicTrie of the typo-word dictionary, segment the Chinese sentence exactly with the maximum matching method and mark the sentence: each typo word TypoWord from the typo-word dictionary that is found in the sentence is marked as an erroneous word, together with its corresponding correct word CorrectWord. At the same time, the correct word CorrectWord corresponding to each typo word TypoWord in the sentence is added to the exact-segmentation word graph, as shown in Fig. 1. In this embodiment, these nodes are drawn as dashed boxes in the word graph.

In this embodiment: search forward for the longest match in the typo-word dictionary Dic2 starting from position pos2 (initially 0). If a typo word TypoWord is found, add its corresponding correct entry CorrectWord to the exact-segmentation word graph, mark the typo word in the sentence together with its correct word, and move pos2 to the position just after TypoWord; otherwise advance pos2 to the next character. Repeat the search until pos2 reaches the end of sentence S.

For example, let S be a sentence meaning "Why do you often deduct my telephone expenses for no reason?" in which the idiom for "for no reason" and the word for "telephone expenses" each contain one wrong character.

After the exact segmentation of step 13), the result is as shown in Fig. 1: "you", "why", "often", "without", "former", "without reason", "deduct", "take", "I", "(possessive particle)", "live", and "expense" are the exact segmentation results and are drawn as solid boxes in the word graph;

After the exact segmentation of step 14), the result is as shown in Fig. 1: because (the misspelled idiom "without former, without reason", "gratuitous") is an entry in the typo-word dictionary, segmentation with that dictionary replaces "without", "former", "without reason" with "gratuitous", which is drawn as a dashed box in the word graph.

2) Apply a fuzzy matching method to the scattered strings in the exact segmentation result to obtain the similar words corresponding to each scattered string together with their similarities, and add these similar words to the exact-segmentation word graph to form a fuzzy-segmentation word graph. Specifically:

Traverse the characters of the Chinese sentence exactly segmented in step 1) and apply the fuzzy matching method at each character. The fuzzy matching method operates by single-character replacement, multi-character replacement, or missing-character replacement, where single-character replacement is based on similar shape and/or similar pronunciation. Compute the similarity between each fuzzy-matched string and its corresponding scattered string with the Chinese string similarity formula; check whether the similarity is not less than the threshold $t_w$; take each fuzzy-matched string whose similarity is not less than the threshold as a similar word of its corresponding scattered string, and add it to the exact-segmentation word graph as a fuzzy matching node to form the fuzzy-segmentation word graph; continue until all characters in the sentence have been traversed. The similarity between the fuzzy-matched string $W_2$ and its corresponding scattered string $W_1$ is computed with the Chinese string similarity formula:

$$Sim(W_1, W_2) = 1 - \frac{editdis(W_1, W_2)}{\max(m, n)} \qquad (1)$$

where the Chinese strings are $W_1 = c_1 c_2 \ldots c_n$ and $W_2 = d_1 d_2 \ldots d_m$, and $editdis(W_1, W_2)$ is the distance function of the two strings:

$$editdis(W_1, W_2) = \min \begin{cases} editdis(c_2 \ldots c_n,\ d_1 \ldots d_m) + 1 \\ editdis(c_1 \ldots c_n,\ d_2 \ldots d_m) + 1 \\ editdis(c_2 \ldots c_n,\ d_2 \ldots d_m) + \bigl(1 - sim(c_1, d_1)\bigr) \end{cases} \qquad (2)$$

where $sim(c_i, d_i)$ is the similarity of the Chinese characters $c_i$ and $d_i$:

$$sim(c_i, d_i) = \alpha \cdot PSim(c_i, d_i) + \beta \cdot SSim(c_i, d_i) \qquad (3)$$

When the method is used to proofread Chinese non-multi-character word errors in text entered with a pinyin input method or a phonetic input method, the pinyin-similarity weight is α = 1 and the shape-similarity weight is β = 0.

When the method is used to correct Chinese non-multi-character word errors from OCR recognition, the pinyin-similarity weight is α = 0 and the shape-similarity weight is β = 1.

When the method is used to proofread Chinese non-multi-character word errors in text entered with both a pinyin input method and a shape-based input method, the pinyin-similarity weight is α = 0.5 and the shape-similarity weight is β = 0.5.

In this embodiment, this is carried out through the following steps:

Step 20) Set the initial matching position of the given Chinese sentence to nCurr = 0;

Step 21) Starting from the current position nCurr of the Chinese sentence, read in the current character and apply fuzzy matching to it;

During fuzzy matching, the character at the current position may be replaced by a single character (one that is similar in shape or in pronunciation), or similarity may be computed for multi-character or missing-character variants;

Step 22) Compute the similarity of the two strings with the Chinese string similarity formula, that is, the similarity between the original string in the sentence and the fuzzy-matched word, in other words the similarity between the fuzzy-matched string and its corresponding scattered string. For example, in Fig. 1:

For "without former", similar characters such as "edge" are obtained from the pinyin-similarity and shape-similarity computations for "former"; the Chinese string similarity formula (1) then gives the similarity between the Chinese string "without former" and the dictionary word "have no chance".

In this embodiment, the user's input methods are a pinyin input method and a shape-based input method, so α = β = 0.5;

Step 23) If the similarity is less than the threshold $t_w$, set nCurr = nCurr + 1 and go to step 21); otherwise go to step 24). Because Chinese characters are highly confusable, the threshold $t_w$ is 0.95 in this embodiment; it can of course be adjusted for the application, for example to 0.90, 0.92, or 0.98;

Step 24) When the similarity is not less than the threshold $t_w$, a group of similar words and similarities (sFuzzyWord, next, sim) is obtained, where sFuzzyWord is the matched word, next is the position of the next node to be read in for fuzzy matching (next = nCurr + 1), and sim is the similarity computed between sFuzzyWord and the original string running from the start position nCurr to the position where matching stops. If next equals the length of the sentence, terminate; otherwise update nCurr to next and return to step 21);

Step 25) Add each fuzzy-matched similar word whose similarity is not less than the threshold $t_w$ to the exact-segmentation word graph as a fuzzy matching node, forming the fuzzy-segmentation word graph, as shown in Fig. 1. In this embodiment, these nodes are drawn as dashed boxes in the word graph.
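Steps 20) to 25) can be sketched for the single-character replacement case as follows. This is a simplified illustration, not the embodiment itself: it scans every window of two to `max_len` characters instead of following the nCurr/next bookkeeping, it omits multi-character and missing-character matching, and it uses the shortcut that one substitution in an n-character string gives a similarity of 1 − (1 − sim)/n. The function name, the confusion table, and all strings and values are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A fuzzy matching node: (start, end, matched word, similarity)
FuzzyNode = Tuple[int, int, str, float]


def fuzzy_match(sentence: str,
                dictionary: Set[str],
                confusable: Dict[str, Dict[str, float]],
                t_w: float = 0.95,
                max_len: int = 4) -> List[FuzzyNode]:
    """Find dictionary words reachable by replacing one similar character."""
    nodes: List[FuzzyNode] = []
    for start in range(len(sentence)):
        last = min(len(sentence), start + max_len)
        for end in range(start + 2, last + 1):
            original = sentence[start:end]
            if original in dictionary:
                continue  # already an exact word, nothing to repair
            for k, ch in enumerate(original):
                for repl, char_sim in confusable.get(ch, {}).items():
                    candidate = original[:k] + repl + original[k + 1:]
                    if candidate not in dictionary:
                        continue
                    # One substitution: editdis = 1 - char_sim.
                    sim = 1.0 - (1.0 - char_sim) / len(original)
                    if sim >= t_w:
                        nodes.append((start, end, candidate, sim))
    return nodes


# Illustrative data only (not taken from the source).
dictionary = {"经常", "无缘", "无故"}
confusable = {"原": {"缘": 0.92}}  # assumed combined pinyin/shape similarity
fuzzy_nodes = fuzzy_match("经常无原无故", dictionary, confusable)
```

Each returned node carries its span and similarity, so it can be added to the word graph next to the exact-segmentation nodes and penalized later in the path search.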

In the example of this embodiment shown in Fig. 1, the scattered string "without", "former" is matched to the dictionary word "have no chance" by similar-pronunciation fuzzy matching, and the scattered string "live", "expense" is matched to the dictionary words "telephone expenses" and "cost of living" by similar-shape and missing-character fuzzy matching. These fuzzy matching nodes are added to the word graph and drawn as dashed boxes.

3) Using a word bigram model combined with similarity, compute the shortest path through the fuzzy-segmentation word graph to obtain the final segmentation result; the original strings corresponding to the fuzzy matching nodes in the segmentation result are marked as the detected errors, thereby achieving automatic proofreading of Chinese non-multi-character word errors. Specifically:

The invention uses a word bigram model combined with similarity to compute the probability of a segmentation and applies a penalty to fuzzy segmentation results:

$$W^{*} = \arg\max_{W} P(W) = \arg\max_{W}\ p(W_1) \prod_{i=2}^{n} p(W_i \mid W_{i-1}) \cdot \alpha(W_{i-1}, W') \qquad (4)$$

where $W$ is a segmentation path in the word graph, $W_i$ is the i-th word on the path, and $n$ is the number of words on the path; $\alpha(W_{i-1}, W')$ is the penalty applied during segmentation when the original string in the sentence is a scattered string corresponding to a fuzzy matching node. If the current word comes from exact segmentation, $\alpha(W_{i-1}, W') = 1$; otherwise $\alpha(W_{i-1}, W') = sim(W_{i-1}, W')$, that is, the similarity between the original string $W'$ in the sentence and the fuzzy-matched word $W_{i-1}$, in other words the similarity between the fuzzy-matched string $W_{i-1}$ and its corresponding scattered string $W'$.
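The source does not state how the path probability of formula (4) is turned into edge lengths for the shortest-path search. One common way, shown here as an assumption, is to take negative logarithms so that the product becomes a sum:

```latex
-\log P(W) = -\log p(W_1)
  - \sum_{i=2}^{n} \Bigl[ \log p(W_i \mid W_{i-1}) + \log \alpha(W_{i-1}, W') \Bigr]

% Every p(.) and alpha(.) lies in (0, 1], so each term
% -log p(.) and -log alpha(.) is non-negative, and
%   argmax_W P(W) = argmin_W ( -log P(W) ).
```

Under this reading, the most probable segmentation is the path with the smallest total weight, and the non-negative weights are what allow Dijkstra's algorithm to be used.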

In the example of this embodiment shown in Fig. 1, the shortest path of the word graph generated by exact segmentation and fuzzy segmentation is solved with the bigram model combined with similarity. The path Path = {"S", "you", "why", "often", "gratuitous", "deduct", "take", "I", "(possessive particle)", "telephone expenses"} has the highest probability and is therefore the shortest path of the graph. The dashed-box nodes "gratuitous" and "telephone expenses" on this path are fuzzy matching nodes, so the corresponding original strings in the sentence, the misspelled idiom "without former, without reason" and "live expense", contain wrong characters. Comparing them with the correct fuzzy-matched words "gratuitous" and "telephone expenses" shows that "former" and "live" are the wrong characters in the sentence, and that the misspelled idiom and "live expense" are non-multi-character word errors.

Test

The method was evaluated in repeated open tests on a test corpus of 20,000 sentence lines containing 664 non-multi-character word errors, which include wrong-character substitution errors, character insertion errors, and character deletion errors. The results show that the method achieves a recall of 75.9% and a precision of 85% in identifying non-multi-character word errors, a correction rate of 62%, and a correction accuracy of 81.7%. This precision exceeds that of the prior art and meets the demands of practical application, with high effectiveness and accuracy.

The above embodiment is only a preferred embodiment of the invention and does not limit it. Any modification, equivalent replacement, or improvement made by those skilled in the art without departing from the technical idea of the invention falls within its scope of protection.

Claims

1. A method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation, characterized in that automatic proofreading is performed by means of fuzzy word segmentation, comprising the following steps:

2. The method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation according to claim 1, characterized in that step 1) comprises the following steps:

3. The method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation according to claim 1, characterized in that step 2) comprises:

$$Sim(W_1, W_2) = 1 - \frac{editdis(W_1, W_2)}{\max(m, n)} \qquad (1)$$

where the Chinese strings are $W_1 = c_1 c_2 \ldots c_n$ and $W_2 = d_1 d_2 \ldots d_m$, and $editdis(W_1, W_2)$ is the distance function of the two strings:

$$editdis(W_1, W_2) = \min \begin{cases} editdis(c_2 \ldots c_n,\ d_1 \ldots d_m) + 1 \\ editdis(c_1 \ldots c_n,\ d_2 \ldots d_m) + 1 \\ editdis(c_2 \ldots c_n,\ d_2 \ldots d_m) + \bigl(1 - sim(c_1, d_1)\bigr) \end{cases} \qquad (2)$$

where $sim(c_i, d_i)$ is the similarity of the Chinese characters $c_i$ and $d_i$:

4. The method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation according to claim 3, characterized in that the fuzzy matching method operates by single-character replacement, multi-character replacement, or missing-character replacement, where single-character replacement is based on similar shape and/or similar pronunciation.

5. The method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation according to claim 3, characterized in that, when the method is used to proofread Chinese non-multi-character word errors in text entered with a pinyin input method or a phonetic input method, the pinyin-similarity weight is α = 1 and the shape-similarity weight is β = 0.

6. The method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation according to claim 3, characterized in that, when the method is used to correct Chinese non-multi-character word errors from OCR recognition, the pinyin-similarity weight is α = 0 and the shape-similarity weight is β = 1.

7. The method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation according to claim 3, characterized in that, when the method is used to proofread Chinese non-multi-character word errors in text entered with both a pinyin input method and a shape-based input method, the pinyin-similarity weight is α = 0.5 and the shape-similarity weight is β = 0.5.

8. The method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation according to claim 3, characterized in that step 3) comprises the following steps:

$$W^{*} = \arg\max_{W} P(W) = \arg\max_{W}\ p(W_1) \prod_{i=2}^{n} p(W_i \mid W_{i-1}) \cdot \alpha(W_{i-1}, W') \qquad (4)$$

where $W$ is a segmentation path in the word graph, $W_i$ is the i-th word on the path, and $n$ is the number of words on the path; $\alpha(W_{i-1}, W')$ is the penalty applied during segmentation when the original string in the sentence is a scattered string corresponding to a fuzzy matching node. When the current word comes from exact segmentation, $\alpha(W_{i-1}, W') = 1$; otherwise $\alpha(W_{i-1}, W') = sim(W_{i-1}, W')$, that is, the similarity between the original string $W'$ in the sentence and the fuzzy-matched word $W_{i-1}$, in other words the similarity between the fuzzy-matched string $W_{i-1}$ and its corresponding scattered string $W'$;

9. The method for automatically proofreading non-multi-character word errors based on fuzzy word segmentation according to claim 3, characterized in that the threshold $t_w$ is 0.95.