CN106598947A

CN106598947A - Bayesian word sense disambiguation method based on synonym expansion

Info

Publication number: CN106598947A
Application number: CN201611157518.1A
Authority: CN
Inventors: 杨陟卓; 张虎; 李茹; 陈千; 谭红叶
Original assignee: Shanxi University
Current assignee: Shanxi University
Priority date: 2016-12-15
Filing date: 2016-12-15
Publication date: 2017-04-26

Abstract

The invention belongs to the technical field of natural language processing and relates in particular to a Bayesian word sense disambiguation method based on synonym expansion. The method mainly addresses two problems of current word sense disambiguation methods: poor disambiguation performance, and the time and effort required to acquire disambiguation knowledge. The method comprises the following steps: (1) expanding the context of a training corpus using a Chinese thesaurus, generating a large number of pseudo training instances; (2) removing noise from the pseudo training instances using a word collocation corpus, yielding a pseudo training corpus; (3) training a Bayesian disambiguation model on the training corpus and the pseudo training corpus together; and (4) inputting a test corpus into the Bayesian disambiguation model and jointly determining the senses of ambiguous words using the disambiguation knowledge in both corpora.

Description

A Bayesian word sense disambiguation method based on synonym expansion

Technical field

The invention belongs to the technical field of natural language processing methods and relates in particular to a Bayesian word sense disambiguation method based on synonym expansion.

Technical background

Word sense disambiguation (WSD) is the task of determining the meaning of a polysemous word in a specific natural language context, and it is a key problem in natural language processing. When a machine processes natural language, ambiguity arises whenever an ambiguous word occurs in a specific context. In the current Internet era of "information explosion", lexical ambiguity has become even more serious. Polysemy is widespread in Chinese and in Western languages alike. Statistical studies show that in large-scale corpora, the frequency of ambiguous words in both Chinese and English text reaches about 40%. Such frequent ambiguous words seriously hinder the correct understanding and processing of natural language by machines, and this is one of the greatest difficulties the field faces. Progress in WSD would greatly promote areas of natural language processing such as language identification, syntactic analysis, information retrieval, machine translation and text processing.

At present, corpus-based WSD methods can be divided into supervised and unsupervised methods. Unsupervised methods need no training corpus, but their disambiguation performance is unsatisfactory and rarely reaches a practical level. Supervised methods perform markedly better, but they depend on large-scale, high-quality training corpora, which are time-consuming and laborious to build; this has seriously hindered the large-scale application of supervised WSD. To solve this problem, many researchers have begun to study methods for automatically generating annotated corpora. Such methods generally first use a dictionary and a large-scale unannotated corpus to generate labeled data automatically, and then train a disambiguation model with a supervised method to perform disambiguation.

Summary of the invention

The invention mainly addresses two problems of current word sense disambiguation methods, namely poor disambiguation performance and the time and effort required to acquire disambiguation knowledge, and provides a Bayesian word sense disambiguation method based on synonym expansion.

The technical scheme adopted by the invention to solve these problems is as follows:

A Bayesian word sense disambiguation method based on synonym expansion, comprising the following steps:

Step 1: expand the context of the training corpus using a Chinese thesaurus, generating a large number of pseudo training instances;

Step 2: remove the noise in the pseudo training instances using a word collocation corpus, generating a pseudo training corpus;

Step 3: train a Bayesian disambiguation model on the training corpus and the pseudo training corpus together;

Step 4: input the test corpus into the Bayesian disambiguation model and jointly decide the sense of the ambiguous word using the disambiguation knowledge in both corpora.

Further, step 1 proceeds as follows: first, a small-scale word sense disambiguation training corpus is built by manual annotation; then the context in the sentence containing the ambiguous word is expanded using the Chinese thesaurus; finally, the expanded synonyms, the ambiguous word, and the sense of the ambiguous word in that context are combined to generate a large number of pseudo training instances.

Step 2 proceeds as follows: the context of the ambiguous word is expanded using the Chinese thesaurus; for each expanded context word, its number of co-occurrences with the ambiguous word in the word collocation corpus is counted; and only context words with a certain number of co-occurrences are used to build the pseudo training corpus.

In step 3, the Bayesian disambiguation model is trained on the training corpus and the pseudo training corpus together, using the formula:

$$p(s_i \mid w_{-L} \ldots w_0 \ldots w_L) \propto p(s_i) \prod_{f_j \in F} p(f_j \mid s_i)$$

where $s_i$ denotes a sense of the ambiguous word, $w_{-L} \ldots w_L$ denote the words within a window of size $L$ around the ambiguous word $w_0$, $f_j$ denotes a contextual feature of the ambiguous word, $F$ denotes the set of contextual features, and $p(f_j \mid s_i)$ denotes the conditional probability of the feature given the sense, computed as:

$$p(f_j \mid s_i) = \frac{c(f_j, s_i)}{c(s_i)}$$

where $c(s_i)$ denotes the number of times sense $s_i$ occurs in the training corpus and $c(f_j, s_i)$ denotes the number of co-occurrences of feature $f_j$ and sense $s_i$ in the training corpus.

Step 4 proceeds as follows: the language fragments formed from the context words expanded with the Chinese thesaurus are treated as the pseudo training corpus; the knowledge in the training corpus and the pseudo training corpus is used jointly for word sense disambiguation; and the conditional probability of a feature given a sense is estimated by the following formula:

$$p(f_j \mid s_i) = \frac{c_t(f_j, s_i)}{c_t(s_i)} + \lambda \frac{c_p(f_j, s_i)}{c_p(s_i)}$$

where $c_t(f_j, s_i)$ denotes the number of co-occurrences of sense $s_i$ and feature $f_j$ in the training corpus, $c_t(s_i)$ denotes the number of occurrences of sense $s_i$ in the training corpus, $c_p(f_j, s_i)$ denotes the number of co-occurrences of the feature with the ambiguous word in the pseudo training corpus, $c_p(s_i)$ denotes the number of occurrences of sense $s_i$ in the pseudo training corpus, and $\lambda$ is set to 0.7.

With the above technical scheme, the invention uses a Chinese thesaurus to expand the context of ambiguous words in the training corpus. The language fragments formed by the expanded synonyms express a meaning similar to that of the original context, and they are used to generate the pseudo training corpus. Noise in the pseudo training corpus is then removed using the word collocation corpus, a Bayesian disambiguation model is trained on the training corpus and the pseudo training corpus, and finally the model decides the sense of the ambiguous word. Specifically:

1. The context of each training instance is expanded using the Chinese thesaurus to generate the pseudo training corpus. The invention expands the context of the ambiguous word within a certain window size using the Chinese thesaurus, providing more knowledge for word sense disambiguation. The language fragment formed by synonyms of the context words expresses a meaning similar to that of the original context fragment, and an ambiguous word occurring in two such similar context fragments expresses the same sense. Therefore, the context in the sentence containing the ambiguous word can be expanded, and the expanded context, the ambiguous word, and the sense of the ambiguous word in that context together form the pseudo training corpus. For example, take the ambiguous sentence "The work of the whole unit's team members has been completed." The ambiguous word "unit" has two senses in Modern Chinese, "personnel" and "machine". Judging from the surrounding context, the sense of "unit" here is "personnel". The synonym expansion of this sentence is illustrated in Fig. 1.

The words closest to the ambiguous word have the greatest influence on its sense. Therefore, the figure lists only the nearest context words, "whole", "team member" and "work", and only these context words undergo synonym expansion. As the figure shows, each synonym set lists only 4 synonyms of the corresponding context word. These synonyms can be recombined into multiple fragments containing the ambiguous word, such as "whole unit personnel tasks", "all unit group member responsibilities" and "all unit team member sole duties". In the fragments combined from synonyms, the sense of the ambiguous word "unit" is still "personnel". These synonyms, the ambiguous word "unit" and its sense "personnel" together form the pseudo training corpus.
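The recombination of synonyms into new context fragments can be sketched as a Cartesian product over per-word synonym lists. This is a minimal illustration under stated assumptions, not the implementation of the invention: the table `SYNONYMS`, the function `expand_context`, and the English glosses are hypothetical stand-ins for lookups in a Chinese thesaurus.

```python
from itertools import product

# Hypothetical synonym table (English glosses for illustration only);
# a real system would query a Chinese thesaurus instead.
SYNONYMS = {
    "whole": ["whole", "all"],
    "team member": ["team member", "personnel", "group member"],
    "work": ["work", "task", "responsibility"],
}


def expand_context(context_words, target, sense, synonyms):
    """Yield one (context, target, sense) pseudo instance per synonym combination."""
    # Words without an entry in the table are kept unchanged.
    options = [synonyms.get(word, [word]) for word in context_words]
    for combination in product(*options):
        # The sense label of the original sentence is carried over unchanged.
        yield list(combination), target, sense


pseudo_instances = list(
    expand_context(["whole", "team member", "work"], "unit", "personnel", SYNONYMS)
)
```

Every generated instance inherits the sense label of the original sentence, which is what lets a single annotated sentence yield many labeled fragments.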

2. Noise in the pseudo training corpus is removed using the word collocation corpus. Synonym expansion of context words runs into two problems. (1) Not every synonym of a context word is suitable for expansion and for building new training data. For example, the synonyms of "work" include "sole duty" and "bounden duty", but these two words are not suitable for the expanded corpus, because such word combinations are rarely used in everyday language. (2) In many cases the context words are themselves not monosemous and are also ambiguous. For example, the word "work" has two basic senses: ① "task" and ② "operate" or "do things". To solve these two problems, the invention constrains synonym expansion by word collocation. Specifically, only words with a certain number of co-occurrences with the ambiguous word are used to train the disambiguation model. Restricting by collocation count not only filters out collocations rarely used in everyday Chinese, but also largely resolves the noise words caused by the ambiguity of context words.

3. The Bayesian disambiguation model is trained on the training corpus and the pseudo training corpus together.

The invention uses a Bayesian model for disambiguation training, as shown in formula (1).

$$p(s_i \mid w_{-L} \ldots w_0 \ldots w_L) \propto p(s_i) \prod_{f_j \in F} p(f_j \mid s_i) \qquad (1)$$

In formula (1), $s_i$ denotes a sense of the ambiguous word, $w_{-L} \ldots w_L$ denote the words within a window of size $L$ around the ambiguous word $w_0$, $f_j$ denotes a contextual feature of the ambiguous word, $F$ denotes the set of contextual features, and $p(f_j \mid s_i)$ denotes the conditional probability of the feature given the sense, computed as shown in formula (2).

$$p(f_j \mid s_i) = \frac{c(f_j, s_i)}{c(s_i)} \qquad (2)$$

In formula (2), $c(s_i)$ denotes the number of times sense $s_i$ occurs in the training corpus, and $c(f_j, s_i)$ denotes the number of co-occurrences of feature $f_j$ and sense $s_i$ in the training corpus.
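Formulas (1) and (2) amount to a naive Bayes classifier estimated from raw counts. The sketch below shows one way to compute them; the function names and the toy instances are hypothetical, and estimating the prior p(s_i) as the relative frequency of each sense is an assumption, since the text does not state how the prior is obtained. No smoothing is applied, so a feature never seen with a sense drives that sense's score to zero.

```python
from collections import Counter, defaultdict


def train(instances):
    """Collect c(s) and c(f, s) from (features, sense) training instances."""
    sense_count = Counter()
    pair_count = defaultdict(Counter)
    for features, sense in instances:
        sense_count[sense] += 1
        for feature in set(features):  # count each feature once per instance
            pair_count[sense][feature] += 1
    return sense_count, pair_count


def cond_prob(feature, sense, sense_count, pair_count):
    """Formula (2): p(f | s) = c(f, s) / c(s)."""
    return pair_count[sense][feature] / sense_count[sense]


def score(features, sense, sense_count, pair_count):
    """Formula (1): p(s) times the product of p(f | s) over the context features."""
    # Assumption: the prior is the relative frequency of the sense.
    result = sense_count[sense] / sum(sense_count.values())
    for feature in features:
        result *= cond_prob(feature, sense, sense_count, pair_count)
    return result


# Hypothetical toy instances, for illustration only.
toy = [
    (["whole", "team member"], "personnel"),
    (["team member", "signal"], "personnel"),
    (["power", "fault"], "machine"),
]
sense_count, pair_count = train(toy)
```

The predicted sense is the one with the highest score, for example `max(sense_count, key=lambda s: score(features, s, sense_count, pair_count))`.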

4. The training corpus and the pseudo training corpus are used jointly to determine the sense of the ambiguous word.

The invention builds the pseudo training corpus from the language fragments formed by the synonyms expanded from the context words, and uses the training corpus and the pseudo training corpus jointly for word sense disambiguation. Compared with the manually annotated corpus, the pseudo training corpus still contains some noise and should therefore play a relatively smaller role. When the sense is decided with the Bayesian formula (1), both types of corpus jointly determine the sense, and the conditional probability of a feature given a sense is estimated with formula (3):

$$p(f_j \mid s_i) = \frac{c_t(f_j, s_i)}{c_t(s_i)} + \lambda \frac{c_p(f_j, s_i)}{c_p(s_i)} \qquad (3)$$

In formula (3), $c_t(f_j, s_i)$ denotes the number of co-occurrences of sense $s_i$ and feature $f_j$ in the training corpus, and $c_t(s_i)$ denotes the number of occurrences of sense $s_i$ in the training corpus. $c_p(f_j, s_i)$ denotes the number of co-occurrences of the feature with the ambiguous word in the pseudo training corpus, and $c_p(s_i)$ denotes the number of occurrences of sense $s_i$ in the pseudo training corpus. $\lambda$ adjusts the influence of the pseudo training corpus on the sense of the ambiguous word and is set to 0.7.
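Formula (3) can be written as a small function, shown here as a sketch rather than a definitive implementation. The function name `combined_cond_prob` and the guard against a zero sense count are assumptions added for illustration; the default weight 0.7 is the value given in the text.

```python
LAMBDA = 0.7  # weight of the pseudo training corpus, as stated in the text


def combined_cond_prob(c_t_fs, c_t_s, c_p_fs, c_p_s, lam=LAMBDA):
    """Formula (3): c_t(f,s)/c_t(s) + lambda * c_p(f,s)/c_p(s)."""
    # Assumption: a sense that never occurs in a corpus contributes 0 for that corpus.
    train_part = c_t_fs / c_t_s if c_t_s else 0.0
    pseudo_part = c_p_fs / c_p_s if c_p_s else 0.0
    return train_part + lam * pseudo_part


# Hypothetical counts: the feature co-occurs with the sense 2 times out of 10
# in the training corpus and 5 times out of 10 in the pseudo training corpus.
example = combined_cond_prob(2, 10, 5, 10)
```

Because the two ratios are added rather than averaged, the result can exceed 1 (up to 1 + λ). It therefore behaves as a ranking score for comparing senses rather than as a normalized probability.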

In summary, the invention is a practical Chinese word sense disambiguation technique. Without requiring manual annotation, it effectively alleviates the data sparseness problem faced by word sense disambiguation and improves disambiguation accuracy. The method has broad prospects and can greatly promote the development of natural language processing areas such as information retrieval, machine translation, language identification, syntactic analysis and text processing. Compared with traditional supervised word sense disambiguation methods, accuracy improves by 4.35 percentage points.

Description of the drawings

Fig. 1 illustrates the word sense disambiguation method based on synonym expansion;

Fig. 2 is an overall structural diagram of the invention.

Specific embodiment

Embodiment 1

Fig. 1 is a schematic diagram of the overall method of the invention, and a specific implementation is described below with an example. The sentence "All <word>unit</word> team members' work has been completed" serves as the training corpus, and the sentence "<word>Unit</word> personnel did not send out a distress signal" serves as the test corpus. The ambiguous word "unit" in the test corpus is to be disambiguated.
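As an illustration of how a tagged sentence of this kind might be turned into the context words within a window of size L, the following sketch assumes the sentence has already been segmented into tokens and that the ambiguous word is a single token wrapped in `<word>` tags. The function `context_window`, the token list, and the window size are hypothetical; Chinese text would first need a word segmenter, which is not shown.

```python
import re

# Matches a token that consists entirely of a <word>...</word> tagged target.
TAGGED = re.compile(r"<word>(.*?)</word>")


def context_window(tokens, size):
    """Return (target, context) for the first <word>-tagged token."""
    for index, token in enumerate(tokens):
        match = TAGGED.fullmatch(token)
        if match:
            left = tokens[max(0, index - size):index]
            right = tokens[index + 1:index + 1 + size]
            return match.group(1), left + right
    raise ValueError("no <word>-tagged token found")


# Hypothetical pre-segmented version of the training sentence (English glosses).
tokens = ["all", "<word>unit</word>", "team member", "work", "has", "been", "completed"]
target, context = context_window(tokens, size=2)
```

With a window of 2, the context consists of the single word to the left of the target and the two words to its right.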

The whole implementation process is as follows:

(1) The context of the training corpus is expanded using the Chinese thesaurus to generate the pseudo training corpus.

The context words of the ambiguous sentence are each expanded using the Chinese thesaurus, giving the synonym sets of the context, such as "all, every, entire, whole", "personnel, group member, team member, party member" and "task, responsibility, sole duty, bounden duty". These synonym sets, the ambiguous word "unit" and its sense "personnel" together form the pseudo training corpus.

(2) Noise in the pseudo training corpus is removed using the word collocation corpus.

The corpus produced in step (1) contains some noise, which the invention removes using the word collocation corpus. Only synonyms with a certain number of co-occurrences are added to the pseudo training corpus. The co-occurrence threshold is currently set to 25, and synonyms whose co-occurrence count exceeds 25 are used to build the pseudo training corpus. Consequently, the synonyms "party member", "sole duty" and "bounden duty" are filtered out and are not added to the pseudo training corpus.
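The threshold filter described in this step can be sketched as follows. The function `filter_synonyms` and the co-occurrence counts in `COLLOCATION_COUNTS` are hypothetical; real counts would come from the word collocation corpus. The threshold of 25 and the strict "more than" comparison follow the text.

```python
THRESHOLD = 25  # co-occurrence threshold stated in the text

# Hypothetical co-occurrence counts with the ambiguous word "unit";
# real values would be looked up in a word collocation corpus.
COLLOCATION_COUNTS = {
    "personnel": 120,
    "group member": 40,
    "team member": 65,
    "party member": 3,
    "task": 90,
    "responsibility": 31,
    "sole duty": 0,
    "bounden duty": 1,
}


def filter_synonyms(candidates, counts, threshold=THRESHOLD):
    """Keep candidates that co-occur with the ambiguous word more than threshold times."""
    # Candidates missing from the collocation corpus are treated as count 0.
    return [word for word in candidates if counts.get(word, 0) > threshold]


kept = filter_synonyms(
    ["personnel", "group member", "team member", "party member",
     "task", "responsibility", "sole duty", "bounden duty"],
    COLLOCATION_COUNTS,
)
```

With these illustrative counts, "party member", "sole duty" and "bounden duty" fall below the threshold and are dropped, mirroring the outcome described above.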

(3) The Bayesian disambiguation model is trained on the training corpus and the pseudo training corpus together.

According to formula (2), the co-occurrence counts of the senses ("personnel" and "machine") with the features are collected. The features $f_j$ include not only the context words in the training corpus ("all", "unit", "team member", "task") but also the context synonyms retained after noise removal ("every", "entire", "whole", "personnel", "group member", "responsibility"). The co-occurrence counts $c_t(f_j, s_i)$ and $c_p(f_j, s_i)$ in the training corpora are shown in Table 1:

Table 1. Co-occurrence counts of features and senses in the training corpus

In addition, the numbers of occurrences of each sense $s_i$ in the training corpus and the pseudo training corpus are $c_t(s_{\text{personnel}}) = 10$, $c_t(s_{\text{machine}}) = 8$ and $c_p(s_{\text{personnel}}) = 10$, $c_p(s_{\text{machine}}) = 8$.

(4) The test instance and its context synonyms are input into the supervised model, and the training corpus and the pseudo training corpus are used jointly to determine the sense of the ambiguous word.

For the contextual features in the test corpus, "personnel", "send out", "distress" and "signal", the co-occurrence counts of the senses "personnel" and "machine" with these features are looked up, as shown in Table 2.

Table 2. Co-occurrence counts for the test corpus

The value of $\lambda$ in formula (3) is estimated on the corpus and is set to 0.7. Formula (3) is used to compute the conditional probabilities $p(f_j \mid s_{\text{personnel}})$ and $p(f_j \mid s_{\text{machine}})$ of the features. In addition, the prior probabilities of the senses, $p(s_{\text{personnel}})$ and $p(s_{\text{machine}})$, are computed. The probability of each sense is then calculated according to formula (1). The sense with the highest probability is "personnel", so this sense is assigned as the final sense of the ambiguous word in the test instance, which completes the word sense disambiguation.
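The decision in this step can be traced end to end with a short sketch that combines formulas (1) and (3). The sense counts (10 and 8) and λ = 0.7 are taken from the embodiment; the feature co-occurrence counts in `C_T` and `C_P` are hypothetical placeholders and are not the values of Table 1 or Table 2. Estimating the prior from the training sense counts is also an assumption.

```python
LAMBDA = 0.7  # weight of the pseudo training corpus, as stated in the text

# Sense occurrence counts given in the embodiment.
C_T_SENSE = {"personnel": 10, "machine": 8}
C_P_SENSE = {"personnel": 10, "machine": 8}

# Hypothetical (feature, sense) co-occurrence counts; NOT the values of Table 1 or 2.
C_T = {("personnel", "personnel"): 4, ("send out", "personnel"): 1,
       ("distress", "personnel"): 1, ("signal", "personnel"): 2,
       ("send out", "machine"): 1, ("signal", "machine"): 1}
C_P = {("personnel", "personnel"): 6, ("send out", "personnel"): 1,
       ("distress", "personnel"): 1, ("signal", "personnel"): 1,
       ("send out", "machine"): 1, ("signal", "machine"): 1}


def cond_prob(feature, sense):
    """Formula (3): training estimate plus lambda-weighted pseudo-corpus estimate."""
    train_part = C_T.get((feature, sense), 0) / C_T_SENSE[sense]
    pseudo_part = C_P.get((feature, sense), 0) / C_P_SENSE[sense]
    return train_part + LAMBDA * pseudo_part


def score(features, sense):
    """Formula (1): prior times the product of the feature scores."""
    # Assumption: prior estimated from the training sense counts.
    result = C_T_SENSE[sense] / sum(C_T_SENSE.values())
    for feature in features:
        result *= cond_prob(feature, sense)
    return result


def disambiguate(features):
    """Return the sense with the highest score."""
    return max(C_T_SENSE, key=lambda sense: score(features, sense))


test_features = ["personnel", "send out", "distress", "signal"]
best_sense = disambiguate(test_features)
```

Under these placeholder counts the sense "machine" never co-occurs with the features "personnel" or "distress", so its product is zero and "personnel" is selected, in line with the conclusion of the embodiment.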

The foregoing is only a preferred embodiment of the invention and does not limit the invention. For those skilled in the art, the invention may be modified and varied in many ways. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the invention shall fall within the scope of protection claimed by the invention.

Claims

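1. A Bayesian word sense disambiguation method based on synonym expansion, characterized by comprising the following steps:

Step 1: expand the context of the training corpus using a Chinese thesaurus, generating a large number of pseudo training instances;

Step 2: remove the noise in the pseudo training instances using a word collocation corpus, generating a pseudo training corpus;

Step 3: train a Bayesian disambiguation model on the training corpus and the pseudo training corpus together;

Step 4: input the test corpus into the Bayesian disambiguation model and jointly decide the sense of the ambiguous word using the disambiguation knowledge in both corpora.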

2. The Bayesian word sense disambiguation method based on synonym expansion according to claim 1, characterized in that step 1 proceeds as follows: first, a small-scale word sense disambiguation training corpus is built by manual annotation; then the context in the sentence containing the ambiguous word is expanded using the Chinese thesaurus; finally, the expanded synonyms, the ambiguous word, and the sense of the ambiguous word in that context are combined to generate a large number of pseudo training instances.

3. The Bayesian word sense disambiguation method based on synonym expansion according to claim 1, characterized in that step 2 proceeds as follows: the context of the ambiguous word is expanded using the Chinese thesaurus; for each expanded context word, its number of co-occurrences with the ambiguous word in the word collocation corpus is counted; and only context words with a certain number of co-occurrences are used to build the pseudo training corpus.

4. The Bayesian word sense disambiguation method based on synonym expansion according to claim 1, characterized in that in step 3 the Bayesian disambiguation model is trained on the training corpus and the pseudo training corpus together, using the formula:

$$p(s_i \mid w_{-L} \ldots w_0 \ldots w_L) \propto p(s_i) \prod_{f_j \in F} p(f_j \mid s_i),$$

where $s_i$ denotes a sense of the ambiguous word, $w_{-L} \ldots w_L$ denote the words within a window of size $L$ around the ambiguous word $w_0$, $f_j$ denotes a contextual feature of the ambiguous word, $F$ denotes the set of contextual features, and $p(f_j \mid s_i)$ denotes the conditional probability of the feature given the sense, computed as:

$$p(f_j \mid s_i) = \frac{c(f_j, s_i)}{c(s_i)},$$

where $c(s_i)$ denotes the number of times sense $s_i$ occurs in the training corpus and $c(f_j, s_i)$ denotes the number of co-occurrences of feature $f_j$ and sense $s_i$ in the training corpus.

5. The Bayesian word sense disambiguation method based on synonym expansion according to claim 1, characterized in that step 4 proceeds as follows: the language fragments formed from the context words expanded with the Chinese thesaurus are treated as the pseudo training corpus; the knowledge in the training corpus and the pseudo training corpus is used jointly for word sense disambiguation; and the conditional probability of a feature given a sense is estimated by the following formula:

$$p(f_j \mid s_i) = \frac{c_t(f_j, s_i)}{c_t(s_i)} + \lambda \frac{c_p(f_j, s_i)}{c_p(s_i)}$$

where $c_t(f_j, s_i)$ denotes the number of co-occurrences of sense $s_i$ and feature $f_j$ in the training corpus, $c_t(s_i)$ denotes the number of occurrences of sense $s_i$ in the training corpus, $c_p(f_j, s_i)$ denotes the number of co-occurrences of the feature with the ambiguous word in the pseudo training corpus, $c_p(s_i)$ denotes the number of occurrences of sense $s_i$ in the pseudo training corpus, and $\lambda$ is set to 0.7.