CN105045778B - Automatic proofreading method for Chinese homonym errors - Google Patents

Automatic proofreading method for Chinese homonym errors

Info

Publication number
CN105045778B
CN105045778B CN201510354692.4A
Authority
CN
China
Prior art keywords
homonym
word
chinese
adjacent
mistake
Prior art date
Legal status
Active
Application number
CN201510354692.4A
Other languages
Chinese (zh)
Other versions
CN105045778A (en)
Inventor
吴健康
严熙
刘亮亮
Current Assignee
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology
Priority to CN201510354692.4A
Publication of CN105045778A
Application granted
Publication of CN105045778B
Active legal status (current)
Anticipated expiration

Links

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses an automatic proofreading method for Chinese homonym errors. The method first generates a homonym confusion set for Chinese words; it then trains left-adjacent bigram, right-adjacent bigram and adjacent trigram models on a large-scale Web corpus and, using the homonym confusion set and a probability estimation algorithm, obtains local adjacent N-gram models. A weighted combination method is then used to compute, for each word in a sentence and for every homonym in its confusion set, the contextual support of that candidate within the sentence; from these scores the method decides whether a homonym error is present, marks the error and produces a list of correction suggestions, thereby realising automatic proofreading of Chinese homonyms. The homonym error proofreading method provided by the invention responds quickly, its precision meets the requirements of practical applications, and its validity and accuracy are high.

Description

Automatic proofreading method for Chinese homonym errors
Technical field
The present invention relates to natural language processing in the field of artificial intelligence and computing, and more particularly to the field of automatic proofreading of Chinese texts.
Background technology
With the rapid development of information technology and the Internet, traditional paper-based text work has been almost entirely replaced by computers. Electronic texts such as e-books, electronic newspapers, e-mails, office documents, blogs and microblogs have become part of people's daily life, but errors in these texts are also increasingly common, which poses a great challenge to proofreading. Traditional manual proofreading is inefficient, labour-intensive and time-consuming, and clearly cannot meet the demand for text proofreading.
Automatic text proofreading is one of the main applications of natural language processing and is also a problem of natural language understanding. Chinese is entered into computers through input methods; as more and more people use pinyin input methods, which can enter both single characters and whole words, more and more homonym errors appear in texts. Homonym errors belong to the category of real-word errors. Automatic proofreading of Chinese real-word errors faces the following problems:
1) A word containing a real-word error is still a valid word in the dictionary, which is the central difficulty of automatic proofreading of Chinese texts.
2) A real-word error disturbs the syntax and semantics of the whole sentence, so finding real-word errors requires much knowledge and many resources.
3) Data sparseness is a major obstacle to automatic proofreading of real-word errors.
4) Automatic proofreading of homonyms includes automatic error detection and automatic error correction: detection finds the homonym errors in a sentence, while correction proofreads those errors and provides correction suggestions. At present, many methods treat detection and correction as two separate stages.
To address the above problems, the present invention proposes and implements a method for automatic detection and automatic proofreading of Chinese homonym errors.
Content of the invention
Purpose of the invention: in order to overcome the deficiencies of the prior art, the present invention provides an automatic proofreading method for Chinese homonym errors, a method that integrates automatic error detection and automatic proofreading.
Technical scheme:
In order to solve the above technical problems, the present invention provides an automatic proofreading method for Chinese homonym errors. The method performs automatic proofreading of Chinese homonym errors based on a homonym confusion set and a weighted combination decision over local adjacent N-gram models, and comprises the following steps:
1) Using the pinyin of Chinese characters, build a homonym confusion set for Chinese words;
2) Build local adjacent N-gram models of the left bigram, right bigram and trigram; based on the homonym confusion set obtained in step 1), estimate the probabilities of these local adjacent N-gram models with a probability estimation algorithm, and train them on a large-scale corpus to obtain the local adjacent N-gram models;
3) Based on the local adjacent N-gram models obtained in step 2), use a weighted combination method to compute, for each word in a sentence and for each homonym in its confusion set, the contextual support of that candidate within the sentence; decide whether a homonym error exists, mark the homonym error, and provide a list of correction suggestions.
Preferably, step 1) includes: using a pinyin annotation table of Chinese characters and a Chinese dictionary, generate the homonym confusion set
CSet(Wi) = {Wi^1, Wi^2, ..., Wi^m},
where Wi is a Chinese word and each Wi^k is a homonym of Wi.
Preferably, the homonym confusion set in step 1) is built from two parts: an automatic identification part and a manual proofreading part.
The automatic identification part comprises the following steps:
Step 11) Read the Chinese dictionary, loading the Chinese words of the dictionary into a Chinese word structure;
Step 12) Read the pinyin annotations of the pinyin annotation table into a pinyin structure;
Step 13) Combine the Chinese words obtained in step 11) with the pinyin annotations obtained in step 12), convert the Chinese words of the dictionary into pinyin, and place them into a homonym word structure to generate the homonym dictionary structure, i.e. the homonym confusion set.
The manual proofreading part includes: manually proofreading the homonym confusion set obtained in step 13) and updating the homonym confusion set.
The structure of the homonym confusion set is
CSet(Wi) = {Wi^1, Wi^2, ..., Wi^m},
where Wi is a word and each Wi^k is a homonym of Wi.
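To make steps 11)-13) concrete, the following Python sketch groups dictionary words by their pinyin so that words sharing a pronunciation become each other's confusion set. It is a minimal sketch under assumed inputs: the word list stands in for the Chinese dictionary and the word_to_pinyin mapping stands in for the pinyin annotation table; it is not the patented implementation.

    from collections import defaultdict

    def build_confusion_sets(words, word_to_pinyin):
        # Group dictionary words by pinyin; words sharing a pronunciation
        # form each other's homonym confusion set.
        by_pinyin = defaultdict(set)
        for w in words:
            by_pinyin[word_to_pinyin[w]].add(w)
        cset = {}
        for w in words:
            homonyms = by_pinyin[word_to_pinyin[w]] - {w}
            if homonyms:
                cset[w] = sorted(homonyms)
        return cset

    # Illustrative entries only (not the actual dictionary or pinyin table):
    words = ["权利", "权力", "全力", "决议", "决意"]
    word_to_pinyin = {"权利": "quan li", "权力": "quan li", "全力": "quan li",
                      "决议": "jue yi", "决意": "jue yi"}
    print(build_confusion_sets(words, word_to_pinyin))
    # e.g. CSet(权利) = [全力, 权力], CSet(决议) = [决意]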
Preferably, step 2) comprises the following steps:
Step 21) Based on a large-scale Web corpus, build the local adjacent N-gram models of the left-adjacent bigram, the right-adjacent bigram and the adjacent trigram. Each sentence of the corpus is segmented into words; for example, segmenting a sentence L gives L = W1W2...Wi-1WiWi+1...Wn, and for a word Wi:
the left-adjacent bigram is LeftBiGram(Wi) = Wi-1Wi;
the right-adjacent bigram is RightBiGram(Wi) = WiWi+1;
the adjacent trigram is TriGram(Wi) = Wi-1WiWi+1.
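As an illustration of step 21), the sketch below extracts the left-adjacent bigram, right-adjacent bigram and adjacent trigram around one position of an already segmented sentence; how sentence boundaries are handled is an assumption, since the patent does not specify it.

    def local_ngrams(words, i):
        # Return the left-adjacent bigram, right-adjacent bigram and adjacent
        # trigram around position i of a segmented sentence (None at boundaries).
        left = (words[i - 1], words[i]) if i > 0 else None
        right = (words[i], words[i + 1]) if i + 1 < len(words) else None
        tri = (words[i - 1], words[i], words[i + 1]) if left and right else None
        return left, right, tri

    # Toy segmented sentence W1 ... W5:
    sentence = ["我们", "维护", "自己", "的", "权利"]
    print(local_ngrams(sentence, 2))
    # (('维护', '自己'), ('自己', '的'), ('维护', '自己', '的'))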
Step 22) Based on the large-scale Web corpus and the homonym confusion sets CSet(Wi), count, for every word in the confusion sets, its left-adjacent bigrams and their co-occurrence frequencies, its right-adjacent bigrams and their co-occurrence frequencies, and its adjacent trigrams and their co-occurrence frequencies, where each Wi^k is a homonym of Wi;
Step 23) Based on the homonym confusion sets CSet(Wi), carry out probability estimation for the local adjacent N-gram models of the left-adjacent bigram, the right-adjacent bigram and the adjacent trigram, so as to generate local adjacent N-gram models containing probability estimates; where
the probability estimate of the left-adjacent bigram is
Pleft(Wi|Wi-1) = Count(Wi-1Wi) / Σ_Wk Count(Wi-1Wk)    (1);
the probability estimate of the right-adjacent bigram is
Pright(Wi|Wi+1) = Count(WiWi+1) / Σ_Wk Count(WkWi+1)    (2);
the probability estimate of the adjacent trigram is
Ptri(Wi|Wi-1Wi+1) = Count(Wi-1WiWi+1) / Σ_Wk Count(Wi-1WkWi+1)    (3);
where the summation index Wk ranges over Wi together with all homonyms Wi^k in CSet(Wi); Count(Wi-1Wi) denotes the co-occurrence frequency of Wi-1Wi in the corpus, Count(WiWi+1) that of WiWi+1, Count(Wi-1WiWi+1) that of Wi-1WiWi+1, Count(Wi-1Wi^k) that of Wi-1Wi^k, Count(Wi^kWi+1) that of Wi^kWi+1, and Count(Wi-1Wi^kWi+1) that of Wi-1Wi^kWi+1.
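To illustrate steps 22) and 23), the following sketch counts the adjacent bigrams and trigrams of a segmented corpus and normalises each count over a candidate list (the word together with its confusion set), which is the reading of formulas (1)-(3) adopted above. The corpus handling and the absence of smoothing are simplifying assumptions, not the patent's training procedure.

    from collections import Counter

    def count_ngrams(corpus_sentences):
        # Count all adjacent bigrams and trigrams over a word-segmented corpus.
        bi, tri = Counter(), Counter()
        for words in corpus_sentences:
            for i in range(len(words) - 1):
                bi[(words[i], words[i + 1])] += 1
            for i in range(len(words) - 2):
                tri[(words[i], words[i + 1], words[i + 2])] += 1
        return bi, tri

    def p_left(w, prev, candidates, bi):
        # Formula (1): count of the left bigram, normalised over all candidates.
        denom = sum(bi[(prev, c)] for c in candidates)
        return bi[(prev, w)] / denom if denom else 0.0

    def p_right(w, nxt, candidates, bi):
        # Formula (2): count of the right bigram, normalised over all candidates.
        denom = sum(bi[(c, nxt)] for c in candidates)
        return bi[(w, nxt)] / denom if denom else 0.0

    def p_tri(w, prev, nxt, candidates, tri):
        # Formula (3): count of the trigram, normalised over all candidates.
        denom = sum(tri[(prev, c, nxt)] for c in candidates)
        return tri[(prev, w, nxt)] / denom if denom else 0.0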
Preferably, step 3) comprises the following steps:
Step 31) Segment the sentence S to be proofread into words, traverse each word Wi in the segmented sentence S, and check whether a homonym confusion set CSet(Wi) exists for it; for every word that has a homonym confusion set, carry out the processing of step 32), until every word in the sentence has been traversed;
Step 32) If Wi has a CSet(Wi), then, based on the local adjacent N-gram models obtained in step 2) and using the weighted combination method, compute the contextual support within the sentence of the word and of every homonym in its confusion set, decide whether a homonym error exists, mark the homonym error and provide a list of correction suggestions. Specifically:
Step 32-1) Compute the contextual support of the word Wi in sentence S with the combined scoring function Score:
Score(Wi) = α1*Pleft(Wi|Wi-1) + α2*Pright(Wi|Wi+1) + α3*Ptri(Wi|Wi-1Wi+1)    (4);
where α1+α2+α3 = 1, α1 > 0, α2 > 0, α3 > 0, and α1, α2, α3 denote the weights of the left-adjacent bigram, the right-adjacent bigram and the adjacent trigram respectively;
Step 32-2) Compute, with the same combined scoring function Score, the contextual support in the sentence of every homonym Wi^k in the confusion set of Wi:
Score(Wi^k) = α1*Pleft(Wi^k|Wi-1) + α2*Pright(Wi^k|Wi+1) + α3*Ptri(Wi^k|Wi-1Wi+1)    (5);
where Pleft(Wi^k|Wi-1), Pright(Wi^k|Wi+1) and Ptri(Wi^k|Wi-1Wi+1) are obtained from formulas (1)-(3) with Wi replaced by Wi^k;
Step 32-3) Sort Wi and every homonym of CSet(Wi) by their context support Score;
Step 32-4) If Score(Wi) = 0, mark Wi as erroneous and list the homonyms Wi^k with Score(Wi^k) > 0, in the order of the Score ranking, as the correction suggestion list; otherwise go to step 32-5);
Step 32-5) If Score(Wi) > 0 and some homonym satisfies β*Score(Wi^k) > Score(Wi), mark Wi as erroneous and list the corresponding homonyms Wi^k, in the order of the Score ranking, as the correction suggestion list; otherwise mark Wi as a correct word, where β is the probability that a word is mistyped as one of its homonyms.
Preferably, in the above steps 32-1) and 32-2), the weight of the left-adjacent bigram is α1 = 0.25, the weight of the right-adjacent bigram is α2 = 0.25, and the weight of the adjacent trigram is α3 = 0.5.
Preferably, in step 32-5) the probability β that a word is mistyped as one of its homonyms satisfies β ≤ 0.01.
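The scoring and decision procedure of steps 32-1) to 32-5) can be sketched as follows, using the preferred weights α1 = α2 = 0.25, α3 = 0.5 and β = 0.01 given above. The decision rule "flag Wi when β*Score(Wi^k) > Score(Wi) for some homonym" is this sketch's assumed reading of step 32-5); the probability inputs are taken as precomputed so the sketch stands on its own.

    ALPHA = (0.25, 0.25, 0.5)  # preferred weights: left bigram, right bigram, trigram
    BETA = 0.01                # preferred probability of mistyping a word as a homonym

    def score(p_left, p_right, p_tri, alpha=ALPHA):
        # Formulas (4)/(5): weighted combination of the three local probabilities.
        a1, a2, a3 = alpha
        return a1 * p_left + a2 * p_right + a3 * p_tri

    def judge(word, word_probs, homonym_probs, beta=BETA):
        # word_probs:    (Pleft, Pright, Ptri) of the word as written
        # homonym_probs: {homonym: (Pleft, Pright, Ptri)} over its confusion set
        s_w = score(*word_probs)
        ranked = sorted(((score(*p), h) for h, p in homonym_probs.items()),
                        reverse=True)
        if s_w == 0:
            # Step 32-4): no context support -> flag, suggest supported homonyms
            return "error", [h for s, h in ranked if s > 0]
        # Step 32-5), assumed reading: flag only when a homonym is at least
        # 1/beta times better supported than the written word
        suggestions = [h for s, h in ranked if beta * s > s_w]
        return ("error", suggestions) if suggestions else ("correct", [])

    # Hypothetical probabilities: the written word is weakly supported and one
    # homonym dominates, so the word is flagged and the homonym suggested.
    print(judge("权力", (0.0004, 0.0001, 0.0),
                {"权利": (0.06, 0.05, 0.04), "全力": (0.0, 0.0, 0.0)}))
    # -> ('error', ['权利'])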
Beneficial effects: the present invention proposes an automatic proofreading method for Chinese homonym errors. The method uses a homonym confusion set together with a weighted combination of the left-adjacent bigram, right-adjacent bigram and adjacent trigram to judge the homonyms in a sentence, recognises homonym errors, and provides correction suggestions for them, integrating automatic error detection and automatic proofreading. Experiments show that the homonym error proofreading method provided by the present invention reaches a recall of 81.2% and a precision of 75.6%; the system responds quickly, its precision meets the requirements of practical applications, and its validity and accuracy are high, so the method has good practicality.
Brief description of the drawings
Fig. 1 is a flow chart of the automatic proofreading of homonym errors.
Embodiment
The present invention is further described with reference to the accompanying drawings and examples.
The automatic proofreading method for Chinese homonym errors provided by the present invention performs automatic proofreading of Chinese homonym errors based on a homonym confusion set and a weighted combination decision over local adjacent N-gram models, and comprises the following steps:
1) Build the homonym confusion set: using the pinyin of Chinese characters, build a homonym confusion set for Chinese words.
As shown in Fig. 1, the homonym confusion set is generated using a pinyin annotation table of Chinese characters and a Chinese dictionary:
CSet(Wi) = {Wi^1, Wi^2, ..., Wi^m},
where Wi is a Chinese word and each Wi^k is a homonym of Wi.
In this embodiment the homonym confusion set is built from two parts: an automatic identification part and a manual proofreading part.
The automatic identification part comprises the following steps:
Step 11) Read the Chinese dictionary, loading the Chinese words of the dictionary into a Chinese word structure;
Step 12) Read the pinyin annotations of the pinyin annotation table into a pinyin structure;
Step 13) Combine the Chinese words obtained in step 11) with the pinyin annotations obtained in step 12), convert the Chinese words of the dictionary into pinyin, and place them into a homonym word structure to generate the homonym dictionary structure, i.e. the homonym confusion set.
The manual proofreading part includes: manually proofreading the homonym confusion set obtained in step 13) and updating the homonym confusion set.
2) Build the local adjacent N-gram models of the left bigram, right bigram and trigram; based on the homonym confusion set obtained in step 1), estimate the probabilities of these local adjacent N-gram models with a probability estimation algorithm, and train them on a large-scale corpus to obtain the local adjacent N-gram models. Specifically:
Step 21) Based on a large-scale Web corpus, build the local adjacent N-gram models of the left-adjacent bigram, the right-adjacent bigram and the adjacent trigram. Each sentence of the corpus is segmented into words; for example, segmenting a sentence L gives L = W1W2...Wi-1WiWi+1...Wn, and for a word Wi:
the left-adjacent bigram is LeftBiGram(Wi) = Wi-1Wi;
the right-adjacent bigram is RightBiGram(Wi) = WiWi+1;
the adjacent trigram is TriGram(Wi) = Wi-1WiWi+1.
Step 22) Based on the large-scale Web corpus and the homonym confusion sets CSet(Wi), count, for every word in the confusion sets, its left-adjacent bigrams and their co-occurrence frequencies, its right-adjacent bigrams and their co-occurrence frequencies, and its adjacent trigrams and their co-occurrence frequencies, where each Wi^k is a homonym of Wi;
Step 23) Based on the homonym confusion sets CSet(Wi), carry out probability estimation for the local adjacent N-gram models of the left-adjacent bigram, the right-adjacent bigram and the adjacent trigram, so as to generate local adjacent N-gram models containing probability estimates; where
the probability estimate of the left-adjacent bigram is
Pleft(Wi|Wi-1) = Count(Wi-1Wi) / Σ_Wk Count(Wi-1Wk)    (1);
the probability estimate of the right-adjacent bigram is
Pright(Wi|Wi+1) = Count(WiWi+1) / Σ_Wk Count(WkWi+1)    (2);
the probability estimate of the adjacent trigram is
Ptri(Wi|Wi-1Wi+1) = Count(Wi-1WiWi+1) / Σ_Wk Count(Wi-1WkWi+1)    (3);
where the summation index Wk ranges over Wi together with all homonyms Wi^k in CSet(Wi), and Count(·) denotes the co-occurrence frequency of the corresponding word sequence in the corpus.
3) Based on the local adjacent N-gram models obtained in step 2), use the weighted combination method to compute, for each word in the sentence and for each homonym in its confusion set, the contextual support of that candidate within the sentence; decide whether a homonym error exists, mark the homonym error and provide a list of correction suggestions. As shown in Fig. 1, the procedure is as follows:
Step 31) Segment the sentence S to be proofread into words, traverse each word Wi in the segmented sentence S, and check whether a homonym confusion set CSet(Wi) exists for it; for every word that has a homonym confusion set, carry out the processing of step 32), until every word in the sentence has been traversed;
Step 32) If Wi has a CSet(Wi), then, based on the local adjacent N-gram models obtained in step 2) and using the weighted combination method, compute the contextual support within the sentence of the word and of every homonym in its confusion set, decide whether a homonym error exists, mark the homonym error and provide a list of correction suggestions. Specifically:
Step 32-1) Compute the contextual support of the word Wi in sentence S with the combined scoring function Score:
Score(Wi) = α1*Pleft(Wi|Wi-1) + α2*Pright(Wi|Wi+1) + α3*Ptri(Wi|Wi-1Wi+1)    (4);
where α1+α2+α3 = 1, α1 > 0, α2 > 0, α3 > 0, and α1, α2, α3 denote the weights of the left-adjacent bigram, the right-adjacent bigram and the adjacent trigram respectively. In this embodiment α1 = α2 = 0.25 and α3 = 0.5; these weights can of course be adjusted appropriately according to actual needs.
Step 32-2) Compute, with the same combined scoring function Score, the contextual support in the sentence of every homonym Wi^k in the confusion set of Wi:
Score(Wi^k) = α1*Pleft(Wi^k|Wi-1) + α2*Pright(Wi^k|Wi+1) + α3*Ptri(Wi^k|Wi-1Wi+1)    (5);
where Pleft(Wi^k|Wi-1), Pright(Wi^k|Wi+1) and Ptri(Wi^k|Wi-1Wi+1) are obtained from formulas (1)-(3) with Wi replaced by Wi^k;
Step 32-3) Sort Wi and every homonym of CSet(Wi) by their context support Score;
Step 32-4) If Score(Wi) = 0, mark Wi as erroneous and list the homonyms Wi^k with Score(Wi^k) > 0, in the order of the Score ranking, as the correction suggestion list; otherwise go to step 32-5);
Step 32-5) If Score(Wi) > 0 and some homonym satisfies β*Score(Wi^k) > Score(Wi), mark Wi as erroneous and list the corresponding homonyms Wi^k, in the order of the Score ranking, as the correction suggestion list; otherwise mark Wi as a correct word. Here β is the probability that a word is mistyped as one of its homonyms; usually β ≤ 0.01, and in this embodiment β = 0.01.
Experiment:
The method went through several rounds of open testing. The experiment used a test corpus of 10,000 sentences, into which 600 homonym errors were manually inserted, and the parameters given in the embodiment were used as the experimental parameters. The experiment shows that the homonym error proofreading method provided by the present invention reaches a recall of 81.2% and a precision of 75.6%. This precision exceeds the prior art and meets the needs of practical applications, with good validity and accuracy.
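For reference, the recall and precision above are the usual detection metrics: recall is the fraction of the 600 seeded errors that were flagged, and precision is the fraction of flagged positions that were real errors. The snippet below only illustrates the computation; the counts are made up to reproduce the reported percentages and are not the actual experimental tallies.

    def precision_recall(true_positives, flagged, seeded_errors):
        # Standard detection metrics computed from raw counts.
        precision = true_positives / flagged if flagged else 0.0
        recall = true_positives / seeded_errors if seeded_errors else 0.0
        return precision, recall

    # Illustrative counts only, chosen to match the reported 75.6% / 81.2%:
    print(precision_recall(true_positives=487, flagged=644, seeded_errors=600))
    # -> (0.7562..., 0.8116...)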
The above is only a preferred embodiment of the present invention and does not limit the present invention. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the technical idea of the present invention fall within the scope of protection of the present invention.

Claims (5)

1. An automatic proofreading method for Chinese homonym errors, characterised in that it performs automatic proofreading of Chinese homonym errors based on a homonym confusion set and a weighted combination decision over local adjacent N-gram models, the method comprising the following steps:
1) using the pinyin of Chinese characters, building a homonym confusion set for Chinese words;
2) building local adjacent N-gram models of the left bigram, right bigram and trigram; based on the homonym confusion set obtained in step 1), estimating the probabilities of these local adjacent N-gram models with a probability estimation algorithm, and training them on a large-scale corpus to obtain the local adjacent N-gram models;
3) based on the local adjacent N-gram models obtained in step 2), using a weighted combination method to compute, for each word in a sentence and for each homonym in its confusion set, the contextual support of that candidate within the sentence; deciding whether a homonym error exists, marking the homonym error and providing a list of correction suggestions;
wherein said step 2) comprises the following steps:
Step 21) based on a large-scale Web corpus, building the local adjacent N-gram models of the left-adjacent bigram, the right-adjacent bigram and the adjacent trigram: each sentence of the corpus is segmented into words, e.g. segmenting a sentence L gives L = W1W2...Wi-1WiWi+1...Wn, and for a word Wi,
the left-adjacent bigram is LeftBiGram(Wi) = Wi-1Wi;
the right-adjacent bigram is RightBiGram(Wi) = WiWi+1;
the adjacent trigram is TriGram(Wi) = Wi-1WiWi+1;
Step 22) based on the large-scale Web corpus and the homonym confusion sets CSet(Wi), counting, for every word in the confusion sets, its left-adjacent bigrams and their co-occurrence frequencies, its right-adjacent bigrams and their co-occurrence frequencies, and its adjacent trigrams and their co-occurrence frequencies, where each Wi^k is a homonym of Wi;
Step 23) based on the homonym confusion sets CSet(Wi), carrying out probability estimation for the local adjacent N-gram models of the left-adjacent bigram, the right-adjacent bigram and the adjacent trigram, so as to generate local adjacent N-gram models containing probability estimates; wherein the probability estimate of the left-adjacent bigram is
Pleft(Wi|Wi-1) = Count(Wi-1Wi) / Σ_Wk Count(Wi-1Wk)    (1);
the probability estimate of the right-adjacent bigram is
Pright(Wi|Wi+1) = Count(WiWi+1) / Σ_Wk Count(WkWi+1)    (2);
the probability estimate of the adjacent trigram is
Ptri(Wi|Wi-1Wi+1) = Count(Wi-1WiWi+1) / Σ_Wk Count(Wi-1WkWi+1)    (3);
wherein the summation index Wk ranges over Wi together with all homonyms Wi^k in CSet(Wi); Count(Wi-1Wi) denotes the co-occurrence frequency of Wi-1Wi in the corpus, Count(WiWi+1) that of WiWi+1, Count(Wi-1WiWi+1) that of Wi-1WiWi+1, Count(Wi-1Wi^k) that of Wi-1Wi^k, Count(Wi^kWi+1) that of Wi^kWi+1, and Count(Wi-1Wi^kWi+1) that of Wi-1Wi^kWi+1;
wherein said step 3) comprises the following steps:
Step 31) segmenting the sentence S to be proofread into words, traversing each word Wi in the segmented sentence S, and checking whether a homonym confusion set CSet(Wi) exists for it; for every word that has a homonym confusion set, carrying out the processing of step 32), until every word in the sentence has been traversed;
Step 32) if Wi has a CSet(Wi), then, based on the local adjacent N-gram models obtained in step 2) and using the weighted combination method, computing the contextual support within the sentence of the word and of every homonym in its confusion set, deciding whether a homonym error exists, marking the homonym error and providing a list of correction suggestions, specifically including:
Step 32-1) computing the contextual support of the word Wi in sentence S with the combined scoring function Score:
Score(Wi) = α1*Pleft(Wi|Wi-1) + α2*Pright(Wi|Wi+1) + α3*Ptri(Wi|Wi-1Wi+1)    (4);
wherein α1+α2+α3 = 1, α1 > 0, α2 > 0, α3 > 0, and α1, α2, α3 respectively denote the weights of the left-adjacent bigram, the right-adjacent bigram and the adjacent trigram;
Step 32-2) computing, with the same combined scoring function Score, the contextual support in the sentence of every homonym Wi^k in the confusion set of Wi:
Score(Wi^k) = α1*Pleft(Wi^k|Wi-1) + α2*Pright(Wi^k|Wi+1) + α3*Ptri(Wi^k|Wi-1Wi+1)    (5);
wherein Pleft(Wi^k|Wi-1), Pright(Wi^k|Wi+1) and Ptri(Wi^k|Wi-1Wi+1) are obtained from formulas (1)-(3) with Wi replaced by Wi^k;
Step 32-3) sorting Wi and every homonym of CSet(Wi) by their context support Score;
Step 32-4) if Score(Wi) = 0, marking Wi as erroneous and listing the homonyms Wi^k with Score(Wi^k) > 0, in the order of the Score ranking, as the correction suggestion list; otherwise going to step 32-5);
Step 32-5) if Score(Wi) > 0 and some homonym satisfies β*Score(Wi^k) > Score(Wi), marking Wi as erroneous and listing the corresponding homonyms Wi^k, in the order of the Score ranking, as the correction suggestion list; otherwise marking Wi as a correct word, wherein β is the probability that a word is mistyped as one of its homonyms.
2. The automatic proofreading method for Chinese homonym errors according to claim 1, characterised in that step 1) includes: using a pinyin annotation table of Chinese characters and a Chinese dictionary, generating the homonym confusion set
CSet(Wi) = {Wi^1, Wi^2, ..., Wi^m},
wherein Wi is a Chinese word and each Wi^k is a homonym of Wi.
3. The automatic proofreading method for Chinese homonym errors according to claim 1, characterised in that the homonym confusion set in step 1) is built from two parts: an automatic identification part and a manual proofreading part;
wherein the automatic identification part comprises the following steps:
Step 11) reading the Chinese dictionary and loading the Chinese words of the dictionary into a Chinese word structure;
Step 12) reading the pinyin annotations of the pinyin annotation table into a pinyin structure;
Step 13) combining the Chinese words obtained in step 11) with the pinyin annotations obtained in step 12), converting the Chinese words of the dictionary into pinyin, and placing them into a homonym word structure to generate the homonym dictionary structure, i.e. the homonym confusion set;
wherein the manual proofreading part includes: manually proofreading the homonym confusion set obtained in step 13) and updating the homonym confusion set;
the structure of the homonym confusion set being
CSet(Wi) = {Wi^1, Wi^2, ..., Wi^m},
wherein Wi is a word and each Wi^k is a homonym of Wi.
4. The automatic proofreading method for Chinese homonym errors according to claim 1, characterised in that in steps 32-1) and 32-2) the weight of the left-adjacent bigram is α1 = 0.25, the weight of the right-adjacent bigram is α2 = 0.25, and the weight of the adjacent trigram is α3 = 0.5.
5. The automatic proofreading method for Chinese homonym errors according to claim 1, characterised in that in step 32-5) the probability β that a word is mistyped as one of its homonyms satisfies β ≤ 0.01.
CN201510354692.4A 2015-06-24 2015-06-24 Automatic proofreading method for Chinese homonym errors Active CN105045778B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510354692.4A CN105045778B (en) 2015-06-24 2015-06-24 Automatic proofreading method for Chinese homonym errors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510354692.4A CN105045778B (en) 2015-06-24 2015-06-24 A kind of Chinese homonym mistake auto-collation

Publications (2)

Publication Number Publication Date
CN105045778A CN105045778A (en) 2015-11-11
CN105045778B true CN105045778B (en) 2017-10-17

Family

ID=54452334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510354692.4A Active CN105045778B (en) 2015-06-24 2015-06-24 Automatic proofreading method for Chinese homonym errors

Country Status (1)

Country Link
CN (1) CN105045778B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105573979B (en) * 2015-12-10 2018-05-22 江苏科技大学 A kind of wrongly written character word knowledge generation method that collection is obscured based on Chinese character
CN106528616B (en) * 2016-09-30 2019-12-17 厦门快商通科技股份有限公司 Language error correction method and system in human-computer interaction process
CN109388252B (en) * 2017-08-14 2022-10-04 北京搜狗科技发展有限公司 Input method and device
CN110083819B (en) * 2018-01-26 2024-02-09 北京京东尚科信息技术有限公司 Spelling error correction method, device, medium and electronic equipment
CN108563634A (en) * 2018-03-29 2018-09-21 广州视源电子科技股份有限公司 Recognition methods, system, computer equipment and the storage medium of word misspelling
CN108519973A (en) * 2018-03-29 2018-09-11 广州视源电子科技股份有限公司 Detection method, system, computer equipment and the storage medium of word spelling
CN108563632A (en) * 2018-03-29 2018-09-21 广州视源电子科技股份有限公司 Modification method, system, computer equipment and the storage medium of word misspelling
CN108491392A (en) * 2018-03-29 2018-09-04 广州视源电子科技股份有限公司 Modification method, system, computer equipment and the storage medium of word misspelling
CN108845984B (en) * 2018-05-22 2022-04-22 广州视源电子科技股份有限公司 Wrongly written character detection method and device, computer readable storage medium and terminal equipment
CN108984515B (en) * 2018-05-22 2022-09-06 广州视源电子科技股份有限公司 Wrongly written character detection method and device, computer readable storage medium and terminal equipment
CN108874770B (en) * 2018-05-22 2022-04-22 广州视源电子科技股份有限公司 Wrongly written character detection method and device, computer readable storage medium and terminal equipment
CN108829665B (en) * 2018-05-22 2022-05-31 广州视源电子科技股份有限公司 Wrongly written character detection method and device, computer readable storage medium and terminal equipment
CN110600011B (en) * 2018-06-12 2022-04-01 中国移动通信有限公司研究院 Voice recognition method and device and computer readable storage medium
CN110619119B (en) * 2019-07-23 2022-06-10 平安科技(深圳)有限公司 Intelligent text editing method and device and computer readable storage medium
CN110717021B (en) * 2019-09-17 2023-08-29 平安科技(深圳)有限公司 Input text acquisition and related device in artificial intelligence interview
CN110851599B (en) * 2019-11-01 2023-04-28 中山大学 Automatic scoring method for Chinese composition and teaching assistance system
CN110991166B (en) * 2019-12-03 2021-07-30 中国标准化研究院 Chinese wrongly-written character recognition method and system based on pattern matching
CN111161739B (en) * 2019-12-28 2023-01-17 科大讯飞股份有限公司 Speech recognition method and related product
CN111312209A (en) * 2020-02-21 2020-06-19 北京声智科技有限公司 Text-to-speech conversion processing method and device and electronic equipment
CN111709228B (en) * 2020-06-22 2023-11-21 中国标准化研究院 Automatic identification method for word repetition errors
CN112668328A (en) * 2020-12-25 2021-04-16 广东南方新媒体科技有限公司 Media intelligent proofreading algorithm


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8364487B2 (en) * 2008-10-21 2013-01-29 Microsoft Corporation Speech recognition system with display information
CN102375807A (en) * 2010-08-27 2012-03-14 汉王科技股份有限公司 Method and device for proofing characters
CN104166462A (en) * 2013-05-17 2014-11-26 北京搜狗科技发展有限公司 Input method and system for characters

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Automatic recognition and proofreading of Chinese homonyms based on decision lists; 石敏 et al.; 《电子设计工程》 (Electronic Design Engineering); 2015-05-31; Vol. 23, No. 9; pp. 39-41 *
Research on construction methods of Chinese-character seed confusion sets; 施恒利 et al.; 《计算机科学》 (Computer Science); 2014-08-31; Vol. 41, No. 8; pp. 230-231 *

Also Published As

Publication number Publication date
CN105045778A (en) 2015-11-11

Similar Documents

Publication Publication Date Title
CN105045778B (en) Automatic proofreading method for Chinese homonym errors
CN104991889B (en) A kind of non-multi-character word error auto-collation based on fuzzy participle
Shaalan et al. Arabic word generation and modelling for spell checking.
CN103970765A (en) Error correcting model training method and device, and text correcting method and device
CN105824800B (en) A kind of true word mistake auto-collation of Chinese
Zhang et al. HANSpeller++: A unified framework for Chinese spelling correction
CN105512110B (en) A kind of wrongly written character word construction of knowledge base method based on fuzzy matching with statistics
KR101633556B1 (en) Apparatus for grammatical error correction and method using the same
Ljubešić et al. Predicting the level of text standardness in user-generated content
Richter et al. Korektor–a system for contextual spell-checking and diacritics completion
CN108280065B (en) Foreign text evaluation method and device
CN104346326A (en) Method and device for determining emotional characteristics of emotional texts
CN106528533A (en) Dynamic sentiment word and special adjunct word-based text sentiment analysis method
JP6626917B2 (en) Readability evaluation method and system based on English syllable calculation method
Schneider et al. Comparing rule-based and SMT-based spelling normalisation for English historical texts
CN106202037B (en) Vietnamese phrase tree constructing method based on chunking
CN112115701B (en) News reading text readability evaluation method and system
CN107894977A (en) With reference to the Vietnamese part of speech labeling method of conversion of parts of speech part of speech disambiguation model and dictionary
Geyken et al. On-the-fly Generation of Dictionary Articles for the DWDS Website
Schottmüller et al. Issues in translating verb-particle constructions from german to english
CN111027314A (en) Character attribute extraction method based on language fragment
US10755594B2 (en) Method and system for analyzing a piece of text
Ji et al. Analysis and repair of name tagger errors
Lu et al. Language model for Mongolian polyphone proofreading
CN104978311B (en) A kind of Vietnamese segmenting method based on condition random field

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20151111

Assignee: JIANGSU KEDA HUIFENG SCIENCE AND TECHNOLOGY Co.,Ltd.

Assignor: JIANGSU University OF SCIENCE AND TECHNOLOGY

Contract record no.: X2020980007325

Denomination of invention: An automatic correction method for Chinese homonym errors

Granted publication date: 20171017

License type: Common License

Record date: 20201029

EC01 Cancellation of recordation of patent licensing contract
EC01 Cancellation of recordation of patent licensing contract

Assignee: JIANGSU KEDA HUIFENG SCIENCE AND TECHNOLOGY Co.,Ltd.

Assignor: JIANGSU University OF SCIENCE AND TECHNOLOGY

Contract record no.: X2020980007325

Date of cancellation: 20201223