CN109359303A - A word sense disambiguation method and system based on graph model - Google Patents

A word sense disambiguation method and system based on a graph model

Info

Publication number
CN109359303A
CN109359303A (application CN201811503355.7A)
Authority
CN
China
Prior art keywords
word
similarity
meaning
disambiguation
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811503355.7A
Other languages
Chinese (zh)
Other versions
CN109359303B (en)
Inventor
孟凡擎
燕孝飞
张强
陈文平
鹿文鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zaozhuang University
Original Assignee
Zaozhuang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zaozhuang University filed Critical Zaozhuang University
Priority to CN201811503355.7A
Publication of CN109359303A
Application granted
Publication of CN109359303B
Legal status: Active

Classifications

    • G06F40/205 Natural language analysis; parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/30 Semantic analysis
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a word sense disambiguation method and system based on a graph model, belonging to the technical field of natural language processing. The technical problem to be solved by the invention is how to combine multiple Chinese and English resources so that their advantages complement one another, fully mine the disambiguation knowledge contained in the resources, and improve word sense disambiguation performance. The technical solution adopted is as follows. 1. A word sense disambiguation method based on a graph model, comprising the following steps: S1, extracting context knowledge: part-of-speech tagging is performed on the ambiguous sentence and content words are extracted as context knowledge, content words being nouns, verbs, adjectives and adverbs; S2, similarity calculation: an English-based similarity calculation, a word-vector-based similarity calculation and a HowNet-based similarity calculation are performed separately; S3, constructing the disambiguation graph; S4, selecting the correct word sense. 2. A word sense disambiguation system based on a graph model, comprising a context knowledge extraction unit, a similarity calculation unit, a disambiguation graph construction unit and a correct word sense selection unit.

Description

A word sense disambiguation method and system based on a graph model
Technical field
The present invention relates to the technical field of natural language processing, and in particular to a word sense disambiguation method and system based on a graph model.
Background technique
Word sense disambiguation refers to determining the specific sense of an ambiguous word according to the specific context in which it occurs. It is a basic research problem in the field of natural language processing and directly affects upper-layer applications such as machine translation, information extraction, information retrieval, text classification and sentiment analysis. Polysemy is pervasive both in Chinese and in English and other Western languages.
Traditional graph-model approaches to Chinese word sense disambiguation mainly rely on one or more Chinese knowledge resources and are hampered by the scarcity of such resources, so their disambiguation performance is low. How to combine multiple Chinese and English resources so that their advantages complement one another, fully mine the disambiguation knowledge in these resources, and improve word sense disambiguation performance is therefore a technical problem in urgent need of a solution.
Patent document CN105893346A discloses a graph-model word sense disambiguation method based on dependency syntax trees. Its steps are: 1. preprocess the sentence and extract the content words to be disambiguated, mainly including normalization, tokenization and lemmatization; 2. perform dependency parsing on the sentence and construct its dependency syntax tree; 3. obtain the distance between words on the dependency syntax tree, i.e. the length of the shortest path; 4. build a disambiguation knowledge graph for the sense concepts of the words in the sentence according to a knowledge base; 5. compute a graph score for each sense node according to the lengths of the semantic association paths between sense nodes in the knowledge graph, the weights of the associated edges, and the distances between the path endpoints on the dependency syntax tree; 6. for each ambiguous word, select the sense with the highest graph score as the correct sense. However, that solution uses the semantic associations contained in BabelNet rather than the semantic knowledge in HowNet; it is suitable for English word sense disambiguation but not for Chinese, and it cannot solve the problem of combining multiple Chinese and English resources with complementary advantages to fully mine the disambiguation knowledge in the resources and improve word sense disambiguation performance.
Summary of the invention
The technical task of the invention is to provide a word sense disambiguation method and system based on a graph model, so as to solve the problem of how to combine multiple Chinese and English resources with complementary advantages, fully mine the disambiguation knowledge in these resources, and improve word sense disambiguation performance.
The technical task of the invention is achieved in the following manner. A word sense disambiguation method based on a graph model comprises the following steps:
S1, extracting context knowledge: perform part-of-speech tagging on the ambiguous sentence and extract content words as context knowledge; content words are nouns, verbs, adjectives and adverbs;
S2, similarity calculation: perform an English-based similarity calculation, a word-vector-based similarity calculation and a HowNet-based similarity calculation separately;
S3, building disambiguate figure: carrying out weight optimization to similarity using simulated annealing, obtain fused similar Degree, and then using word concept as vertex, the semantic relation between concept is side, and the weight on side is fused similarity, is constructed Disambiguate figure;
S4, selecting the correct word sense: score the candidate senses in the graph by graph scoring to obtain the score list of candidate senses, and select the highest-scoring sense as the correct sense.
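As an illustration of the overall flow, the following minimal Python sketch strings the four steps together; extract_content_words, compute_similarities, build_disambiguation_graph and score_graph are hypothetical stand-ins for S1-S4, not the patent's own code.

    # A sketch of the S1-S4 pipeline; all four helpers are hypothetical.
    def disambiguate(sentence, target_word):
        context = extract_content_words(sentence)          # S1: POS-tag, keep content words
        sims = compute_similarities(context)               # S2: English / word-vector / HowNet
        graph = build_disambiguation_graph(context, sims)  # S3: fused similarities as edge weights
        scores = score_graph(graph)                        # S4: dict mapping sense -> graph score
        candidates = [s for s in scores if s.word == target_word]
        return max(candidates, key=scores.get)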
Preferably, the specific steps of the similarity calculation in step S2 are as follows:
S201, English-based similarity calculation: annotate the context knowledge with HowNet sense information and perform sense mapping to obtain a set of English words; then apply the word similarity algorithm based on word vectors and a knowledge base to compute similarities between the resulting English words. Since HowNet is bilingual, the sense mapping here directly retrieves the English word information in HowNet;
S202, word-vector-based similarity calculation: the Sogou full-web news corpus (1.43 GB in total) is used; word vectors are trained on this corpus with Google's word2vec toolkit to obtain a word vector file; the vectors of two given words are retrieved from the file, and the cosine similarity between the vectors is taken as their similarity;
S203, HowNet-based similarity calculation: annotate the context knowledge with sense information using HowNet, in the form of word plus concept number, and compute the similarity between senses with the concept similarity toolkit provided by HowNet.
More preferably, the word similarity algorithm based on word vectors and a knowledge base in step S201 is as follows:
S20101, determine whether the given items are words or phrases:
1. if two English words are given, the similarity between the two words is obtained by computing the cosine similarity of their word vectors;
2. if a phrase is given, the word vectors of the words in the phrase are added to obtain a vector representation of the phrase, from which the phrase similarity is obtained, with the formula:
sim(p1, p2) = cos(Σ_{i=1..|p1|} v(w_i), Σ_{j=1..|p2|} v(w_j))
wherein |p1| and |p2| denote the numbers of words contained in phrases p1 and p2; w_i and w_j denote the i-th word of p1 and the j-th word of p2, respectively; v(·) denotes the word vector of a word;
S20102, iteratively search for the synsets related to the two English words until the number of iteration steps exceeds γ;
S20103, construct a synset graph from the two English words and the synsets related to them;
S20104, within the set distance range, compute in the graph the overlap of the synsets related to the two English words, with the formula:
sim_lap(w_i, w_j) = d · count(w_i, w_j) / (count(w_i) + count(w_j))
wherein count(w_i, w_j) denotes the number of synsets shared by the words w_i and w_j; count(w_i) and count(w_j) are the numbers of synsets of w_i and w_j respectively; d denotes the value of the set distance range;
S20105, compute the shortest path between w_i and w_j in the graph with Dijkstra's algorithm and obtain the similarity of w_i and w_j, with the formula:
sim_bn(w_i, w_j) = α · (1/δ^path) + (1 - α) · sim_lap(w_i, w_j)
wherein path is the length of the shortest path between w_i and w_j; δ adjusts the similarity value; sim_lap(w_i, w_j) denotes the overlap between w_i and w_j; the parameter α is a regulatory factor that balances the two similarity terms of the formula;
S20106, combine the similarity sim_vec obtained by the word-vector-based method in step S20101 and the similarity sim_bn obtained by the knowledge-base-based method in step S20105 by linear addition to obtain the final similarity, with the formula:
sim_final(w_i, w_j) = β · sim_vec + (1 - β) · sim_bn
wherein sim_bn and sim_vec denote the similarity obtained by the knowledge-base-based method and the similarity obtained by the word-vector-based method respectively; the parameter β is a regulatory factor that balances the two similarity results;
S20107, return the similarity sim_final.
Preferably, the specific steps of constructing the disambiguation graph in step S3 are as follows:
S301, weight optimization: automatically optimize the three similarity values from step S2 with the weight optimization algorithm based on simulated annealing to obtain the optimal weight parameters;
S302, similarity fusion: after weight optimization, the final fused similarity between senses is:
Sim(ws, ws') = α · sim_how + β · sim_en + γ · sim_vec
wherein ws and ws' denote two senses; sim_how denotes the result of the HowNet-based similarity calculation, with weight α; sim_en denotes the result of the word similarity calculation based on word vectors and a knowledge base, with weight β; sim_vec denotes the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
S303, constructing the disambiguation graph: in the disambiguation graph, senses are vertices and semantic relations between senses are edges; the three similarity values, integrated by the weight optimization algorithm based on simulated annealing, serve as the edge weights between senses.
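As a concrete illustration of step S303, here is a minimal sketch (not the patent's own code) that builds such a disambiguation graph with the networkx library; fused_sim is a hypothetical function returning α·sim_how + β·sim_en + γ·sim_vec for two senses.

    from itertools import combinations
    import networkx as nx

    def build_disambiguation_graph(senses, fused_sim):
        # Vertices are sense concepts; edge weights are fused similarities.
        g = nx.Graph()
        g.add_nodes_from(senses)
        for ws, ws2 in combinations(senses, 2):
            weight = fused_sim(ws, ws2)
            if weight > 0:  # keep only positive semantic relations
                g.add_edge(ws, ws2, weight=weight)
        return g

The resulting graph can then be scored with, for example, nx.pagerank(g, weight="weight"), matching the graph scoring of step S4.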
More preferably, the simulated annealing algorithm in step S301 performs parameter optimization according to the formula:
p = 1, if result(x_new) ≥ result(x_old); p = exp((result(x_new) - result(x_old)) / (δ · t)), otherwise
wherein result(x) denotes the objective function, namely the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly taken parameter; x_old denotes the original parameter;
The formula by which the simulated annealing algorithm performs parameter optimization covers the following two cases:
(a) if the objective function value of the new parameter x_new is not less than that of the original parameter x_old, the new parameter x_new is selected with probability p = 1;
(b) if the objective function value of the new parameter x_new is less than that of the original parameter x_old, the probability p = exp((result(x_new) - result(x_old)) / (δ · t)) serves as the basis for selecting x_new: a probability value is generated at random and compared with p:
1. if the randomly generated probability value is not greater than p, the new parameter x_new is selected;
2. if the randomly generated probability value is greater than p, the new parameter x_new is discarded.
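A minimal sketch of this acceptance rule, assuming the result values are disambiguation accuracies (the function name is illustrative):

    import math, random

    def accept_new_parameter(result_new, result_old, delta, t):
        # Case (a): an equal or better objective value is always kept (p = 1).
        if result_new >= result_old:
            return True
        # Case (b): keep a worse parameter with probability
        # p = exp((result_new - result_old) / (delta * t)),
        # comparing a randomly generated value against p.
        p = math.exp((result_new - result_old) / (delta * t))
        return random.random() <= p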
The sense in step S303 refers to a triple, denoted Word(No., Sword, Enword), wherein No. is the concept number, Sword is the first sememe, and Enword is the English word. No., Sword and Enword form an organic whole that describes one and the same sense concept: in HowNet a sense concept number uniquely identifies a sense, from which the first sememe in its concept definition can be obtained, and the sense can further be mapped to an English word.
Preferably, the specific steps of selecting the correct sense in step S4 are as follows:
S401, graph scoring: call the graph scoring method to score the importance of the sense concept vertices in the disambiguation graph; after graph scoring is completed, arrange the candidate sense concepts in descending order of score to form the candidate sense concept list;
S402, selecting the correct sense: the correct sense is selected from the disambiguation result, covering the following two cases:
1. if the disambiguation result contains only one sense concept, that sense concept is taken as the correct sense;
2. if the disambiguation result is a sense list composed of multiple sense concepts, the sense concept with the highest score is taken as the correct sense.
More preferably, the graph scoring in step S401 uses the PageRank algorithm. PageRank evaluates the nodes in the graph on the basis of a Markov chain model: the PageRank score of a node depends on the PageRank scores of all nodes linked to it. The PageRank score of a node v is calculated as:
PR(v) = (1 - α)/N + α · Σ_{u ∈ in(v)} PR(u) / |out(u)|
wherein 1 - α denotes the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| denotes the out-degree of node u; and in(v) is the set of all nodes linked to node v.
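A minimal power-iteration sketch of this scoring formula (illustrative names; it assumes every node has at least one outgoing link):

    def pagerank(out_links, alpha=0.85, iters=50):
        # out_links: dict mapping each node u to the list of nodes u links to.
        nodes = list(out_links)
        n = len(nodes)
        pr = {v: 1.0 / n for v in nodes}
        for _ in range(iters):
            pr = {v: (1 - alpha) / n
                     + alpha * sum(pr[u] / len(out_links[u])
                                   for u in nodes if v in out_links[u])
                  for v in nodes}
        return pr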
A word sense disambiguation system based on a graph model comprises:
a context knowledge extraction unit, which performs part-of-speech tagging on the ambiguous sentence and extracts content words as context knowledge, content words being nouns, verbs, adjectives and adverbs;
a similarity calculation unit, configured to perform the English-based similarity calculation, the word-vector-based similarity calculation and the HowNet-based similarity calculation separately;
a disambiguation graph construction unit, configured to optimize the weights of the similarities with the simulated annealing algorithm to obtain the fused similarity, and then build the disambiguation graph with word sense concepts as vertices, semantic relations between concepts as edges, and the fused similarity as edge weights;
a correct sense selection unit, configured to score the candidate senses in the graph by graph scoring, obtain the score list of candidate senses, and select the sense with the maximum score as the correct sense.
Preferably, the similarity calculation unit comprises:
an English similarity calculation unit, configured to annotate the context knowledge with HowNet sense information and perform sense mapping to obtain a set of English words, and then to apply the word similarity algorithm based on word vectors and a knowledge base to compute similarities between the resulting English words; since HowNet is bilingual, the sense mapping here directly retrieves the English word information in HowNet;
a word vector similarity calculation unit, configured to train word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, retrieve the vectors of two given words from the file, and take the cosine similarity between the vectors as their similarity; it should be noted that the more senses an ambiguous word has, the more likely the trained word vector file is to be biased towards one of its more common senses;
a HowNet similarity calculation unit, configured to annotate the context knowledge with sense information using HowNet, in the form of word plus concept number, and to compute the similarity between senses with the concept similarity toolkit provided by HowNet;
The disambiguation graph construction unit comprises:
a weight optimization unit, configured to automatically optimize, with the weight optimization algorithm based on simulated annealing, the three similarity values of the English-based similarity calculation, the word-vector-based similarity calculation and the HowNet-based similarity calculation, so as to obtain the optimal weight parameters; the simulated annealing algorithm performs parameter optimization according to the formula:
p = 1, if result(x_new) ≥ result(x_old); p = exp((result(x_new) - result(x_old)) / (δ · t)), otherwise
wherein result(x) denotes the objective function, namely the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly taken parameter; x_old denotes the original parameter;
The formula by which the simulated annealing algorithm performs parameter optimization covers the following two cases:
(a) if the objective function value of the new parameter x_new is not less than that of the original parameter x_old, the new parameter x_new is selected with probability p = 1;
(b) if the objective function value of the new parameter x_new is less than that of the original parameter x_old, the probability p = exp((result(x_new) - result(x_old)) / (δ · t)) serves as the basis for selecting x_new: a probability value is generated at random and compared with p:
1. if the randomly generated probability value is not greater than p, the new parameter x_new is selected;
2. if the randomly generated probability value is greater than p, the new parameter x_new is discarded.
a similarity fusion unit: after weight optimization, the final fused similarity between senses is:
Sim(ws, ws') = α · sim_how + β · sim_en + γ · sim_vec
wherein ws and ws' denote two senses; sim_how denotes the result of the HowNet-based similarity calculation, with weight α; sim_en denotes the result of the word similarity calculation based on word vectors and a knowledge base, with weight β; sim_vec denotes the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
a disambiguation graph building unit, configured to build the disambiguation graph with senses as vertices and semantic relations between senses as edges, the three similarity values, integrated by the weight optimization algorithm based on simulated annealing, serving as the edge weights between senses; wherein a sense refers to a triple, denoted Word(No., Sword, Enword), in which No. is the concept number, Sword is the first sememe and Enword is the English word; No., Sword and Enword form an organic whole that describes one and the same sense concept; in HowNet a sense concept number uniquely identifies a sense, from which the first sememe in its concept definition can be obtained, and the sense can further be mapped to an English word.
More preferably, the correct sense selection unit comprises:
a graph scoring unit, configured to call the graph scoring method to score the importance of the sense concept vertices in the disambiguation graph and, after graph scoring is completed, to arrange the candidate sense concepts in descending order of score to form the candidate sense concept list; the graph scoring uses the PageRank algorithm, which evaluates the nodes in the graph on the basis of a Markov chain model; the PageRank score of a node depends on the PageRank scores of all nodes linked to it; the PageRank score of a node v is calculated as:
PR(v) = (1 - α)/N + α · Σ_{u ∈ in(v)} PR(u) / |out(u)|
wherein 1 - α denotes the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| denotes the out-degree of node u; and in(v) is the set of all nodes linked to node v;
a correct sense selection subunit, configured to select the correct sense from the disambiguation result, covering the following two cases:
1. if the disambiguation result contains only one sense concept, that sense concept is taken as the correct sense;
2. if the disambiguation result is a sense list composed of multiple sense concepts, the sense concept with the highest score is taken as the correct sense.
The word sense disambiguation method and system based on a graph model of the invention have the following advantages:
(1) by combining multiple Chinese and English resources with complementary advantages, the invention fully mines the disambiguation knowledge in these resources and thereby helps to improve word sense disambiguation performance;
(2) the invention performs the English-based, word-vector-based and HowNet-based similarity calculations separately, ensuring that multiple knowledge resources can be effectively integrated and the disambiguation accuracy improved;
(3) the invention optimizes the similarity weights with the simulated annealing algorithm to obtain the fused similarity, and then builds the disambiguation graph with word sense concepts as vertices, semantic relations between concepts as edges, and the fused similarity as edge weights, guaranteeing that the similarity values of the multiple knowledge resources are optimized automatically;
(4) when performing the English similarity calculation, the invention annotates the context knowledge with HowNet sense information and performs sense mapping to obtain a set of English words, ensuring that Chinese and English knowledge resources can be aligned automatically;
(5) the invention scores the candidate senses in the graph by graph scoring, obtains the score list of candidate senses, and selects the sense with the maximum score as the correct sense, so that the correct sense of the target ambiguous word can be selected automatically.
Description of the drawings
The invention is further described below with reference to the drawings.
Fig. 1 is a flow diagram of the word sense disambiguation method based on a graph model;
Fig. 2 is a flow diagram of the similarity calculation;
Fig. 3 is a flow diagram of constructing the disambiguation graph;
Fig. 4 is a flow diagram of the correct sense selection;
Fig. 5 is a structural block diagram of the word sense disambiguation system based on a graph model;
Fig. 6 is the sense information diagram of the example word 'Chinese medicine';
Fig. 7 is the synset graph constructed in the word similarity algorithm based on word vectors and a knowledge base.
Specific embodiments
The word sense disambiguation method and system based on a graph model of the invention are described in detail below with reference to the drawings and specific embodiments.
Embodiment 1:
As shown in Fig. 1, the word sense disambiguation method based on a graph model of the invention comprises the following steps:
S1, extracting context knowledge: perform part-of-speech tagging on the ambiguous sentence and extract content words as context knowledge; content words are nouns, verbs, adjectives and adverbs;
Example: take the sentence (given here in English gloss) 'Centering on the implementation of the "guiding opinions" and in light of the actual work of traditional Chinese medicine, all regions should intensify their efforts and actively and steadily promote the reform of TCM medical institutions.', in which 'Chinese medicine' is the word to be disambiguated. Part-of-speech tagging uses the word segmentation system NLPIR-ICTCLAS of the Chinese Academy of Sciences. After tagging, the sentence becomes (English glosses of the Chinese tokens): around/v "/wkz guidance/vn opinion/n "/wky of/ude1 implement/vn implement/vn ,/wd combine/v traditional Chinese medicine/n work/vn of/ude1 reality/n ,/wd various regions/rzs want/v increase/v dynamics/n ,/wd actively/a and/cc safe/a of/ude2 promote/vi Chinese medicine/n medical treatment/n mechanism/n reform/vn ./wj. The content words are extracted and formatted to facilitate subsequent processing, giving 'Chinese medicine_n_25: around_v_0 guidance_vn_2 opinion_n_3 implement_v_6 implement_v_7 combine_v_9 traditional Chinese medicine_n_10 work_vn_11 reality_n_13 want_v_16 increase_v_17 dynamics_n_18 actively_a_20 safe_a_22 promote_v_24 Chinese medicine_n_25 medical treatment_n_26 mechanism_n_27 reform_vn_28', where the word before the colon is the word to be disambiguated and the number after each part-of-speech tag is the position of the word in the sentence.
S2, similarity calculation: perform an English-based similarity calculation, a word-vector-based similarity calculation and a HowNet-based similarity calculation separately;
As shown in Fig. 2, the specific steps of the similarity calculation are as follows:
English-based similarity calculation: annotate the context knowledge with HowNet sense information and perform sense mapping to obtain a set of English words; then apply the word similarity algorithm based on word vectors and a knowledge base to compute similarities between the resulting English words. Since HowNet is bilingual, the sense mapping here directly retrieves the English word information in HowNet. The main part of the code of the word similarity algorithm based on word vectors and a knowledge base is as follows:
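(The listing itself appears in the original only as an image; the following Python sketch is reconstructed from the row-by-row description below. The helpers phrase_vector, cosine, related_synsets, build_graph, count_common, count_synsets and dijkstra are hypothetical stand-ins, not the patent's own code.)

    def word_similarity(w1, w2, alpha, beta, gamma=10, d=2, delta=1.4):
        # Row 1: word-vector similarity; a phrase is represented by the
        # sum of the vectors of its words.
        sim_vec = cosine(phrase_vector(w1), phrase_vector(w2))
        # Rows 2-4: iteratively expand the synsets related to w1 and w2,
        # stopping once the iteration step exceeds gamma.
        synsets = related_synsets(w1, w2, max_steps=gamma)
        # Row 5: build a graph from w1, w2 and their associated synsets.
        g = build_graph(w1, w2, synsets)
        # Row 6: overlap of the related synsets within the set distance range d.
        sim_lap = d * count_common(w1, w2) / (count_synsets(w1) + count_synsets(w2))
        # Row 7: Dijkstra shortest path between w1 and w2 in the graph.
        path = dijkstra(g, w1, w2)
        sim_bn = alpha * (1.0 / delta ** path) + (1 - alpha) * sim_lap
        # Row 8: linear combination of the vector-based and knowledge-base parts.
        sim_final = beta * sim_vec + (1 - beta) * sim_bn
        # Row 9: return the final similarity.
        return sim_final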
In the word similarity algorithm based on word vectors and a knowledge base, row 1 takes two given English words and obtains the similarity between them by computing the cosine similarity of their two word vectors. If the given item is a phrase, no vector for it exists in the trained word vectors, so further processing is needed: the word vectors of the words in the phrase are added to obtain a vector representation of the phrase, from which the phrase similarity is obtained, with the formula:
sim(p1, p2) = cos(Σ_{i=1..|p1|} v(w_i), Σ_{j=1..|p2|} v(w_j))
wherein |p1| and |p2| denote the numbers of words contained in phrases p1 and p2; w_i and w_j denote the i-th word of p1 and the j-th word of p2, respectively; v(·) denotes the word vector of a word.
Rows 2-4 iteratively search for the synsets related to the words w1 and w2 until the iteration step count exceeds γ; since the computation cost of the graph is high when there are too many nodes, the maximum number of iteration steps γ is set to 10. Row 5 builds the graph from w1, w2 and the synsets associated with them. Row 6 computes, within a certain distance range in the graph, the overlap of the synsets related to w1 and w2, with the set distance being 2, by the formula:
sim_lap(w1, w2) = 2 · count(w1, w2) / (count(w1) + count(w2))
wherein count(w1, w2) denotes the number of synsets shared by the words w1 and w2; count(w1) and count(w2) are the numbers of synsets of w1 and w2 respectively.
Row 7 computes the shortest path between w1 and w2 in the graph with Dijkstra's algorithm and from it obtains the similarity of w1 and w2, with the formula:
sim_bn(w1, w2) = α · (1/δ^path) + (1 - α) · sim_lap(w1, w2)
wherein path is the length of the shortest path between w1 and w2; δ adjusts the similarity value and is set to 1.4; sim_lap(w1, w2) denotes the overlap between w1 and w2; the parameter α is a regulatory factor that balances the two similarity terms of the formula.
Row 8 combines the word-vector-based method above and the knowledge-base (BabelNet) method by linear addition to obtain the final similarity, with the formula:
sim_final(w1, w2) = β · sim_vec + (1 - β) · sim_bn
wherein sim_bn and sim_vec denote the similarities obtained by the knowledge-base-based method and the word-vector-based method respectively; the parameter β is a regulatory factor that adjusts the results of the two methods and is specifically set to 0.6.
Row 9 returns the similarity sim_final.
As for the word vectors used in the word similarity algorithm based on word vectors and a knowledge base, they are trained with the word2vec toolkit on the unannotated English Wikipedia corpus. Before training, the data are preprocessed and the file format is converted from Unicode to UTF-8. The training window is set to 5, the vector dimension is set to 200 by default, and the Skip-gram model is selected. After training, a word vector file is obtained in which each word is mapped to a 200-dimensional vector whose components are double-precision values.
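A minimal sketch of this training setup using the gensim implementation of word2vec (gensim is an assumption here; the patent itself only names Google's word2vec toolkit):

    from gensim.models import Word2Vec

    def train_vectors(sentences):
        # sentences: an iterable of tokenized documents from the corpus,
        # already converted to UTF-8 as described above.
        model = Word2Vec(
            sentences,
            vector_size=200,  # vector dimension
            window=5,         # training window
            sg=1,             # Skip-gram model
        )
        model.wv.save_word2vec_format("vectors.txt")  # one 200-dim vector per word
        return model

Given the resulting model, the cosine similarity of two words is model.wv.similarity(w1, w2).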
BabelNet is chosen as the knowledge base. BabelNet provides rich concepts and named entities, interlinked by a large number of semantic relations; the semantic relations here include synonymy, hypernymy/hyponymy, part-whole relations and so on. Given two words (concepts or named entities), their respective synsets can be obtained by means of the BabelNet API, together with the synsets linked to them by semantic relations. A synset is a set of synonyms with a unique identifier in BabelNet that denotes one specific sense. For example, the identifier 'bn:00021464n' denotes the synset 'computer, computing machine, computing device, data processor, electronic computer, information processing system', which expresses the specific sense 'computer'. The synset graph constructed in the word similarity algorithm based on word vectors and a knowledge base is shown in Fig. 7.
Example: the context knowledge is annotated with HowNet sense information, namely sense (concept) numbers, giving 'Chinese medicine_n_25: around_v_0:124932 around_v_0:124933 guidance_vn_2:155807 opinion_n_3:143264 opinion_n_3:143267 implement_v_6:047082 implement_v_7:081572 implement_v_7:081573 implement_v_7:081575 combine_v_9:064548 combine_v_9:064549 traditional Chinese medicine_n_10:157339 work_vn_11:044068 reality_n_13:109077 reality_n_13:109078 want_v_16:140522 want_v_16:140530 want_v_16:140532 want_v_16:140534 increase_v_17:059967 increase_v_17:059968 increase_v_17:059969 dynamics_n_18:076991 actively_a_20:057562 actively_a_20:057564 safe_a_22:126267 safe_a_22:126269 promote_v_24:122203 promote_v_24:122206 promote_v_24:122211 Chinese medicine_n_25:157332 Chinese medicine_n_25:157329 mechanism_n_27:057323 mechanism_n_27:057325 mechanism_n_27:057326 reform_vn_28:041189'.
After the sense mapping, we obtain 'Chinese medicine_n_25: around_v_0:124932|revolve round around_v_0:124933|centre on guidance_vn_2:155807|direct opinion_n_3:143264|complaint opinion_n_3:143267|idea implement_v_6:047082|carry out implement_v_7:081572|feel at ease implement_v_7:081573|ascertain implement_v_7:081575|fulfil combine_v_9:064548|be united in wedlock combine_v_9:064549|combination traditional Chinese medicine_n_10:157339|traditional Chinese medicine and drugs work_vn_11:044068|work reality_n_13:109077|reality reality_n_13:109078|practice want_v_16:140522|want to want_v_16:140530|ask want_v_16:140532|ask for want_v_16:140534|take increase_v_17:059967|widen increase_v_17:059968|enhance increase_v_17:059969|enlarge dynamics_n_18:076991|dynamics actively_a_20:057562|active actively_a_20:057564|positive safe_a_22:126267|safe safe_a_22:126269|reliable promote_v_24:122203|move forward promote_v_24:122206|advance promote_v_24:122211|push into Chinese medicine_n_25:157332|traditional_Chinese_medical_science Chinese medicine_n_25:157329|practitioner_of_Chinese_medicine mechanism_n_27:057323|institution mechanism_n_27:057325|internal structure of an organization mechanism_n_27:057326|mechanism reform_vn_28:041189|reform'.
English similarity is then computed between every two of the English words obtained above (each HowNet sense concept corresponds to one English word), giving 'Chinese medicine_n_25: around_v_0:124932|revolve round ~ guidance_vn_2:155807|direct = 0.292; around_v_0:124932|revolve round ~ opinion_n_3:143264|complaint = 0.3085; around_v_0:124932|revolve round ~ opinion_n_3:143267|idea = 0.3742; around_v_0:124932|revolve round ~ implement_v_6:047082|carry out = 0.4015; around_v_0:124932|revolve round ~ implement_v_7:081572|feel at ease = 0.3575; around_v_0:124932|revolve round ~ implement_v_7:081573|ascertain = 0.3215; around_v_0:124932|revolve round ~ implement_v_7:081575|fulfil = 0.3541; around_v_0:124932|revolve round ~ combine_v_9:064548|be united in wedlock = 0.3299; around_v_0:124932|revolve round ~ combine_v_9:064549|combination = 0.3487; around_v_0:124932|revolve round ~ traditional Chinese medicine_n_10:157339|traditional Chinese medicine and drugs = 0.3520; around_v_0:124932|revolve round ~ work_vn_11:044068|work = 0.3478; around_v_0:124932|revolve round ~ reality_n_13:109077|reality = 0.3664; around_v_0:124932|revolve round ~ reality_n_13:109078|practice = 0.3907; around_v_0:124932|revolve round ~ want_v_16:140522|want to = 0.3375; around_v_0:124932|revolve round ~ want_v_16:140530|ask = 0.3482'. Only part of the similarity results is shown here for reasons of space.
Word-vector-based similarity calculation: the Sogou full-web news corpus (1.43 GB in total) is used; word vectors are trained on this corpus with Google's word2vec toolkit to obtain a word vector file; the vectors of two given words are retrieved from the file, and the cosine similarity between the vectors is taken as their similarity.
It should be noted that the more senses an ambiguous word has, the more likely the trained word vector file is to be biased towards one of its more common senses. For this reason, HowNet is used to convert an ambiguous word into the senses it possesses, i.e. the first sememe in each of its concept definitions; as shown in Fig. 6, the ambiguous word 'Chinese medicine' is converted into 'people' and 'knowledge'.
Example: after the ambiguous words are processed with HowNet, we obtain 'Chinese medicine_n_25: around_v_0:124932|surround around_v_0:124933|surround guidance_vn_2:155807|order opinion_n_3:143264|Chinese language opinion_n_3:143267|thought implement_v_6:047082|implement implement_v_7:081572|feel at ease implement_v_7:081573|decide implement_v_7:081575|realize combine_v_9:064548|marriage combine_v_9:064549|merge traditional Chinese medicine_n_10:157339|knowledge_drug work_vn_11:044068|affair reality_n_13:109077|entity reality_n_13:109078|thing want_v_16:140522|expect want_v_16:140530|require want_v_16:140532|seek want_v_16:140534|spend increase_v_17:059967|change shape increase_v_17:059968|optimize increase_v_17:059969|expand dynamics_n_18:076991|intensity actively_a_20:057562|positive actively_a_20:057564|front safe_a_22:126267|work as safe_a_22:126269|firm promote_v_24:122203|advance promote_v_24:122206|mobilize promote_v_24:122211|push Chinese medicine_n_25:157332|knowledge Chinese medicine_n_25:157329|people mechanism_n_27:057323|mechanism mechanism_n_27:057325|part mechanism_n_27:057326|component reform_vn_28:041189|improve'.
The word-vector-based similarity is computed between every two of the resulting Chinese words (each corresponding to a specific HowNet sense concept), giving 'Chinese medicine_n_25: around_v_0:124932|surround ~ guidance_vn_2:155807|order = -0.0145; around_v_0:124932|surround ~ opinion_n_3:143264|Chinese language = -0.0264; around_v_0:124932|surround ~ opinion_n_3:143267|thought = -0.0366; around_v_0:124932|surround ~ implement_v_6:047082|implement = 0.2071; around_v_0:124932|surround ~ implement_v_7:081572|feel at ease = -0.0430; around_v_0:124932|surround ~ implement_v_7:081573|decide = 0.1502; around_v_0:124932|surround ~ implement_v_7:081575|realize = 0.2254; around_v_0:124932|surround ~ combine_v_9:064548|marriage = -0.0183; around_v_0:124932|surround ~ combine_v_9:064549|merge = 0.0745; around_v_0:124932|surround ~ traditional Chinese medicine_n_10:157339|knowledge_drug = 0.0866; around_v_0:124932|surround ~ work_vn_11:044068|affair = 0.1434; around_v_0:124932|surround ~ reality_n_13:109077|entity = 0.1503; around_v_0:124932|surround ~ reality_n_13:109078|thing = -0.0571; around_v_0:124932|surround ~ want_v_16:140522|expect = 0.1009; around_v_0:124932|surround ~ want_v_16:140530|require = 0.2090; around_v_0:124932|surround ~ want_v_16:140532|seek = 0.0496; around_v_0:124932|surround ~ want_v_16:140534|spend = 0.0176; around_v_0:124932|surround ~ increase_v_17:059967|change shape = 0.0000; around_v_0:124932|surround ~ increase_v_17:059968|optimize = 0.2410; around_v_0:124932|surround ~ increase_v_17:059969|expand = 0.1911; around_v_0:124932|surround ~ dynamics_n_18:076991|intensity = 0.0592; around_v_0:124932|surround ~ actively_a_20:057562|positive = 0.3089; around_v_0:124932|surround ~ actively_a_20:057564|front = 0.0554; around_v_0:124932|surround ~ safe_a_22:126267|work as = 0.0245; around_v_0:124932|surround ~ safe_a_22:126269|firm = 0.0490; around_v_0:124932|surround ~ promote_v_24:122203|advance = 0.1917; around_v_0:124932|surround ~ promote_v_24:122206|mobilize = 0.0277; around_v_0:124932|surround ~ promote_v_24:122211|push = 0.1740; around_v_0:124932|surround ~ Chinese medicine_n_25:157332|knowledge = 0.2205; around_v_0:124932|surround ~ Chinese medicine_n_25:157329|people = -0.0686; around_v_0:124932|surround ~ mechanism_n_27:057323|mechanism = 0.0945; around_v_0:124932|surround ~ mechanism_n_27:057325|part = 0.0582; around_v_0:124932|surround ~ mechanism_n_27:057326|component = 0.0582'. Only part of the similarity results is shown here for reasons of space.
HowNet-based similarity calculation: the context knowledge is annotated with sense information using HowNet, in the form of word plus concept number, and the similarity between senses is computed with the concept similarity toolkit provided by HowNet.
Example: the context knowledge is annotated with HowNet sense information, namely sense (concept) numbers, giving 'Chinese medicine_n_25: around_v_0:124932 around_v_0:124933 guidance_vn_2:155807 opinion_n_3:143264 opinion_n_3:143267 implement_v_6:047082 implement_v_7:081572 implement_v_7:081573 implement_v_7:081575 combine_v_9:064548 combine_v_9:064549 traditional Chinese medicine_n_10:157339 work_vn_11:044068 reality_n_13:109077 reality_n_13:109078 want_v_16:140522 want_v_16:140530 want_v_16:140532 want_v_16:140534 increase_v_17:059967 increase_v_17:059968 increase_v_17:059969 dynamics_n_18:076991 actively_a_20:057562 actively_a_20:057564 safe_a_22:126267 safe_a_22:126269 promote_v_24:122203 promote_v_24:122206 promote_v_24:122211 Chinese medicine_n_25:157332 Chinese medicine_n_25:157329 mechanism_n_27:057323 mechanism_n_27:057325 mechanism_n_27:057326 reform_vn_28:041189'.
The similarity between the senses is computed with the concept similarity toolkit provided by HowNet, giving 'Chinese medicine_n_25: around_v_0:124932 ~ guidance_vn_2:155807 = 0.015094; around_v_0:124932 ~ opinion_n_3:143264 = 0.000624; around_v_0:124932 ~ opinion_n_3:143267 = 0.010256; around_v_0:124932 ~ implement_v_6:047082 = 0.013793; around_v_0:124932 ~ implement_v_7:081572 = 0.010256; around_v_0:124932 ~ implement_v_7:081573 = 0.013793; around_v_0:124932 ~ implement_v_7:081575 = 0.013793; around_v_0:124932 ~ combine_v_9:064548 = 0.016667; around_v_0:124932 ~ combine_v_9:064549 = 0.018605; around_v_0:124932 ~ traditional Chinese medicine_n_10:157339 = 0.000624; around_v_0:124932 ~ work_vn_11:044065 = 0.000624; around_v_0:124932 ~ work_vn_11:044067 = 0.000624; around_v_0:124932 ~ work_vn_11:044068 = 0.015094; around_v_0:124932 ~ reality_n_13:109077 = 0.000624; around_v_0:124932 ~ reality_n_13:109078 = 0.000624; around_v_0:124932 ~ want_v_16:140522 = 0.010959; around_v_0:124932 ~ want_v_16:140530 = 0.015094; around_v_0:124932 ~ want_v_16:140532 = 0.018605; around_v_0:124932 ~ want_v_16:140534 = 0.015094; around_v_0:124932 ~ increase_v_17:059967 = 0.013793; around_v_0:124932 ~ increase_v_17:059968 = 0.015094; around_v_0:124932 ~ increase_v_17:059969 = 0.013793; around_v_0:124932 ~ dynamics_n_18:076991 = 0.000624; around_v_0:124932 ~ actively_a_20:057562 = 0.000624; around_v_0:124932 ~ actively_a_20:057564 = 0.000624; around_v_0:124932 ~ safe_a_22:126267 = 0.000624; around_v_0:124932 ~ safe_a_22:126269 = 0.000624'.
S3, constructing the disambiguation graph: optimize the weights of the similarities with the simulated annealing algorithm to obtain the fused similarity, and then build the disambiguation graph with word sense concepts as vertices, semantic relations between concepts as edges, and the fused similarity as edge weights. As shown in Fig. 3, the specific steps of constructing the disambiguation graph are as follows:
S301, weight optimization: automatically optimize the three similarity values from step S2 with the weight optimization algorithm based on simulated annealing to obtain the optimal weight parameters; the simulated annealing algorithm performs parameter optimization according to the formula:
p = 1, if result(x_new) ≥ result(x_old); p = exp((result(x_new) - result(x_old)) / (δ · t)), otherwise
wherein result(x) denotes the objective function, namely the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly taken parameter; x_old denotes the original parameter;
The formula by which the simulated annealing algorithm performs parameter optimization covers the following two cases:
(a) if the objective function value of the new parameter x_new is not less than that of the original parameter x_old, the new parameter x_new is selected with probability p = 1;
(b) if the objective function value of the new parameter x_new is less than that of the original parameter x_old, the probability p = exp((result(x_new) - result(x_old)) / (δ · t)) serves as the basis for selecting x_new: a probability value is generated at random and compared with p:
1. if the randomly generated probability value is not greater than p, the new parameter x_new is selected;
2. if the randomly generated probability value is greater than p, the new parameter x_new is discarded.
Part of the code of the weight optimization algorithm based on simulated annealing is shown below:
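(The listing itself appears in the original only as an image; the following Python sketch is reconstructed from the row-by-row description below. The objective get_eval_result and the neighbourhood step size are assumptions.)

    import math, random

    def optimize_weights(get_eval_result, y=1.0 / 3):
        t, t_min, delta, k = 100.0, 0.001, 0.98, 100       # row 1: initialization
        x = random.uniform(0, 1 - y)                       # rows 4-5: random x in [0, 1-y]
        z = 1 - x - y                                      #           and z = 1 - x - y
        while t > t_min:                                   # rows 2-3: temperature and
            for _ in range(k):                             #           iteration-step control
                result_old = get_eval_result(x, y, z)      # row 6: disambiguation accuracy
                x_new = min(max(x + random.uniform(-0.05, 0.05), 0.0), 1 - y)  # row 7
                result_new = get_eval_result(x_new, y, 1 - x_new - y)
                # Rows 8-18: decide whether x_new replaces x (acceptance rule above).
                if result_new >= result_old or \
                   random.random() <= math.exp((result_new - result_old) / (delta * t)):
                    x, z = x_new, 1 - x_new - y
            t *= delta                                     # row 20: cooling
        return x, y, z                                     # row 22: optimal combination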
In the weight optimization algorithm based on simulated annealing, row 1 is the initialization: the initial temperature t is set to 100, the temperature floor t_min to 0.001, the cooling rate delta to 0.98, and the maximum number of iteration steps k to 100. Rows 2-3 control the temperature and the iteration steps. Rows 4-5 assign a random double-precision value between 0 and 1-y to x and assign 1-x-y to z. In row 6, the function getEvalResult(x, y, z) is the objective function; its return value is the disambiguation accuracy obtained with the weight parameters x, y, z. Row 7 selects a new value in the neighbourhood of x and assigns it to x_new. Rows 8-18 decide whether x_new is retained to replace x, as specified by the parameter optimization formula of the simulated annealing algorithm. Row 20 decreases t by the cooling rate delta. Row 22 returns the optimal parameter combination x, y, z.
Here x, y, z denote the weight variables of the three similarity results. When the algorithm is executed for the first time, y is set to 1/3; after this run, the optimized values of x and y are obtained, and min(x, y) is fixed. The algorithm is then executed a second time, after which the other two weight parameters are determined.
S302, similarity fusion: after weight optimization, the final fused similarity between senses is:
Sim(ws, ws') = α · sim_how + β · sim_en + γ · sim_vec
wherein ws and ws' denote two senses; sim_how denotes the result of the HowNet-based similarity calculation, with weight α; sim_en denotes the result of the word similarity calculation based on word vectors and a knowledge base, with weight β; sim_vec denotes the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
Example: after weight optimization, the three similarity values are fused according to the final fusion formula, giving 'Chinese medicine_n_25: around_v_0:124932|revolve round|surround ~ guidance_vn_2:155807|direct|order = 0.015094 | 0.2929 | -0.0145; around_v_0:124932|revolve round|surround ~ opinion_n_3:143264|complaint|Chinese language = 0.000624 | 0.3085 | -0.0264; around_v_0:124932|revolve round|surround ~ opinion_n_3:143267|idea|thought = 0.010256 | 0.3742 | -0.0366; around_v_0:124932|revolve round|surround ~ implement_v_6:047082|carry out|implement = 0.013793 | 0.4015 | 0.2071; around_v_0:124932|revolve round|surround ~ implement_v_7:081572|feel at ease|feel at ease = 0.010256 | 0.3575 | -0.0430; around_v_0:124932|revolve round|surround ~ implement_v_7:081573|ascertain|decide = 0.013793 | 0.3215 | 0.1502; around_v_0:124932|revolve round|surround ~ implement_v_7:081575|fulfil|realize = 0.013793 | 0.3541 | 0.2254; around_v_0:124932|revolve round|surround ~ combine_v_9:064548|be united in wedlock|marriage = 0.016667 | 0.3299 | -0.0183; around_v_0:124932|revolve round|surround ~ combine_v_9:064549|combination|merge = 0.018605 | 0.3487 | 0.0745; around_v_0:124932|revolve round|surround ~ traditional Chinese medicine_n_10:157339|traditional Chinese medicine and drugs|knowledge_drug = 0.000624 | 0.3520 | 0.0866; around_v_0:124932|revolve round|surround ~ work_vn_11:044068|work|affair = 0.015094 | 0.3478 | 0.1434; around_v_0:124932|revolve round|surround ~ reality_n_13:109077|reality|entity = 0.000624 | 0.3664 | 0.1503; around_v_0:124932|revolve round|surround ~ reality_n_13:109078|practice|thing = 0.000624 | 0.3907 | -0.0571; around_v_0:124932|revolve round|surround ~ want_v_16:140522|want to|expect = 0.010959 | 0.3375 | 0.1009; around_v_0:124932|revolve round|surround ~ want_v_16:140530|ask|require = 0.015094 | 0.3482 | 0.2090; around_v_0:124932|revolve round|surround ~ want_v_16:140532|ask for|seek = 0.018605 | 0.3648 | 0.0496'. To show the process, the values are not fused further here; an entry such as '0.018605 | 0.3648 | 0.0496' lists the three similarity values, whose fusion is α·0.018605 + β·0.3648 + γ·0.0496.
S303, constructing the disambiguation graph: in the disambiguation graph, senses are vertices and semantic relations between senses are edges; the three similarity values, integrated by the weight optimization algorithm based on simulated annealing, serve as the edge weights between senses. Here a sense refers to a triple, denoted Word(No., Sword, Enword), in which No. is the concept number, Sword is the first sememe and Enword is the English word; No., Sword and Enword form an organic whole that describes one and the same sense concept; in HowNet a sense concept number uniquely identifies a sense, from which the first sememe in its concept definition can be obtained, and the sense can further be mapped to an English word.
A sense in this triple form enables the three similarity calculation methods above to be integrated into one whole. Taking 'Chinese medicine' as an example, 'Chinese medicine' has two senses, corresponding to two sense triples: 'Chinese medicine (157329, people, practitioner of Chinese medicine)' and 'Chinese medicine (157332, knowledge, traditional Chinese science)'. The edge weight between any two vertices of the disambiguation graph, i.e. the semantic similarity between the senses, can now be obtained by the final fused similarity calculation.
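A minimal sketch of this triple as a data structure; the field names follow the patent's Word(No., Sword, Enword) notation, and the example values are the two senses of 'Chinese medicine' given above.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Word:
        no: str      # concept number: uniquely identifies a sense in HowNet
        sword: str   # first sememe in the HowNet concept definition
        enword: str  # English word the sense maps to

    senses = [
        Word("157329", "people", "practitioner of Chinese medicine"),
        Word("157332", "knowledge", "traditional Chinese science"),
    ]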
S4, selecting the correct word sense: score the candidate senses in the graph by graph scoring to obtain the score list of candidate senses, and select the highest-scoring sense as the correct sense. As shown in Fig. 4, the specific steps of selecting the correct sense are as follows:
S401, graph scoring: call the graph scoring method to score the importance of the sense concept vertices in the disambiguation graph; after graph scoring is completed, arrange the candidate sense concepts in descending order of score to form the candidate sense concept list. The graph scoring uses the PageRank algorithm, which evaluates the nodes in the graph on the basis of a Markov chain model; the PageRank score of a node depends on the PageRank scores of all nodes linked to it. The PageRank score of a node v is calculated as:
PR(v) = (1 - α)/N + α · Σ_{u ∈ in(v)} PR(u) / |out(u)|
wherein 1 - α denotes the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| denotes the out-degree of node u; and in(v) is the set of all nodes linked to node v.
Example: after graph scoring, the candidate sense concept list is obtained:
Chinese medicine_n_25:157332 2.1213090873827947E58;
Chinese medicine_n_25:157329 1.8434688340823378E58.
S402, selecting the correct sense: the correct sense is selected from the disambiguation result, covering the following two cases:
1. if the disambiguation result contains only one sense concept, that sense concept is taken as the correct sense;
2. if the disambiguation result is a sense list composed of multiple sense concepts, the sense concept with the highest score is taken as the correct sense.
Example: the sense concept with the highest score, namely 'Chinese medicine_n_25:157332', is selected as the correct sense.
Embodiment 2:
As shown in Fig. 5, the word sense disambiguation system based on a graph model of the invention comprises:
Context Knowledge extraction unit carries out part-of-speech tagging to ambiguity sentences, extracts notional word as Context Knowledge, notional word refers to Noun, verb, adjective, adverbial word;
a similarity calculation unit, for performing the English-based similarity calculation, the word-vector-based similarity calculation and the HowNet-based similarity calculation respectively. The similarity calculation unit comprises:
an English similarity calculation unit, for annotating the context knowledge with HowNet word sense information and performing sense mapping to obtain a set of English words, and then computing the similarity of the resulting English words with the word similarity algorithm based on word vectors and a knowledge base; since HowNet is bilingual, the sense mapping here directly takes the English word information recorded in HowNet;
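As a rough illustration of combining a word-vector score with a knowledge-base score, as this unit does, the sketch below uses WordNet through NLTK as a stand-in knowledge base and a plain synonym-set overlap in place of the full shortest-path term of the claimed algorithm; all names are illustrative, and the WordNet corpus must be downloaded beforehand (nltk.download('wordnet')).

    import numpy as np
    from nltk.corpus import wordnet as wn

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def synset_overlap(w1, w2, d=2.0):
        """sim_lap = d * count(w1, w2) / (count(w1) + count(w2))."""
        s1, s2 = set(wn.synsets(w1)), set(wn.synsets(w2))
        if not s1 or not s2:
            return 0.0
        return d * len(s1 & s2) / (len(s1) + len(s2))

    def english_similarity(w1, w2, vectors, beta=0.5):
        """Linear combination of vector cosine and knowledge-base overlap."""
        sim_vec = cosine(vectors[w1], vectors[w2])
        sim_kb = synset_overlap(w1, w2)  # simplified: no shortest-path term
        return beta * sim_vec + (1 - beta) * sim_kb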
a word vector similarity calculation unit, for training word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, retrieving from that file the word vectors of two given words, and taking the cosine similarity between the vectors as the similarity of the two words. It should be noted that when an ambiguous word has many senses, the trained word vector is likely to be biased towards its more common senses; for this reason, HowNet is used to convert the ambiguous word into the first sememes of its senses, i.e. of each of its concept definitions. As shown in Fig. 6, the ambiguous word "Chinese medicine" is converted into "people" and "knowledge";
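A minimal sketch of this unit, assuming the word vector file is in the plain-text format that word2vec writes (one word followed by its components per line); the loader and function names are illustrative:

    import numpy as np

    def load_vectors(path):
        """Load a word2vec text-format file into a {word: vector} dict."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                if len(parts) < 3:        # skip the optional header line
                    continue
                vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
        return vectors

    def word_vector_similarity(w1, w2, vectors):
        """Cosine similarity between the vectors of two words."""
        u, v = vectors[w1], vectors[w2]
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))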
a HowNet similarity calculation unit, for annotating the context knowledge with word sense information by means of HowNet, in the form of word vocabulary plus concept number, and computing the similarity between the senses with the concept similarity toolkit provided by HowNet.
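The patent relies on HowNet's own concept similarity toolkit; a present-day approximation is the open-source OpenHowNet package, whose interface (assumed here, and subject to version differences) exposes a sememe-based word similarity call:

    import OpenHowNet

    # OpenHowNet.download()  # one-time download of the HowNet data, if needed
    hownet = OpenHowNet.HowNetDict(init_sim=True)

    # Sememe-based similarity between two Chinese words, standing in for
    # the concept similarity used by the HowNet similarity unit.
    score = hownet.calculate_word_similarity("中医", "医生")
    print(score)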
a disambiguation graph construction unit, for optimizing the weights of the similarities with the simulated annealing algorithm to obtain the fused similarity, and then constructing the disambiguation graph with word concepts as vertices, semantic relations between concepts as edges, and the fused similarities as edge weights. The disambiguation graph construction unit comprises:
a weight optimization unit, for automatically optimizing, with the weight optimization algorithm based on simulated annealing, the three similarity values produced by the English-based, word-vector-based and HowNet-based similarity calculations, to obtain the optimal weight parameters; the simulated annealing algorithm performs parameter optimization according to the acceptance probability formula:

p = 1,  if result(x_new) ≥ result(x_old);
p = exp((result(x_new) − result(x_old)) / (δ·t)),  otherwise
where result(x) denotes the objective function, i.e. the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly drawn parameters; and x_old denotes the original parameters;
the two cases expressed by this parameter optimization formula are as follows (a minimal search sketch follows the case analysis):
(a) if the objective function value of the new parameters x_new is not less than that of the original parameters x_old, the new parameters x_new are accepted with probability p = 1;
(b) if the objective function value of x_new is less than that of x_old, the probability p = exp((result(x_new) − result(x_old)) / (δ·t)) serves as the basis for accepting x_new: a probability value is generated at random and compared with p:
① if the randomly generated value is not greater than p, the new parameters x_new are accepted;
② if the randomly generated value is greater than p, the new parameters x_new are discarded;
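A minimal sketch of this weight search, directly implementing the acceptance rule above; the perturbation scheme and cooling schedule are illustrative assumptions, and result(x) stands for the disambiguation accuracy measured on a development set:

    import math
    import random

    def perturb(x):
        """Randomly nudge the weights and renormalize so they sum to 1."""
        raw = [max(1e-6, w + random.uniform(-0.05, 0.05)) for w in x]
        s = sum(raw)
        return tuple(w / s for w in raw)

    def anneal_weights(result, t=1.0, t_min=1e-3, delta=0.95):
        """Search (alpha, beta, gamma) by simulated annealing."""
        x_old = (1 / 3, 1 / 3, 1 / 3)             # start from equal weights
        while t > t_min:
            x_new = perturb(x_old)
            gain = result(x_new) - result(x_old)
            # Case (a): accept with p = 1; case (b): accept with
            # p = exp((result(x_new) - result(x_old)) / (delta * t)).
            if gain >= 0 or random.random() <= math.exp(gain / (delta * t)):
                x_old = x_new
            t *= delta                            # cooling schedule (assumed)
        return x_old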
a similarity fusion unit: after weight optimization, the finally fused similarity between two word senses is

sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec

where ws and ws' denote the two word senses; sim_how is the HowNet-based similarity result, with weight α; sim_en is the word similarity result based on word vectors and the knowledge base, with weight β; sim_vec is the word-vector-based similarity result, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
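The fusion itself is then a single weighted sum; a minimal sketch with the weight constraint checked explicitly (function name illustrative):

    def fused_similarity(sim_how, sim_en, sim_vec, weights):
        """sim(ws, ws') = alpha*sim_how + beta*sim_en + gamma*sim_vec."""
        alpha, beta, gamma = weights
        assert abs(alpha + beta + gamma - 1.0) < 1e-9 and min(weights) >= 0
        return alpha * sim_how + beta * sim_en + gamma * sim_vec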
a disambiguation graph building unit, for building the disambiguation graph with word senses as vertices and the semantic relations between senses as edges, using the simulated-annealing-based weight optimization to integrate the three similarity values into the edge weights between senses; here, as above, a word sense is the triple Word(No., Sword, Enword), where No. is the concept number, Sword the first sememe word and Enword the English word; the three together form an organic whole describing a single word sense concept: in HowNet a concept number uniquely identifies a word sense, the first sememe word can be obtained from its concept definition, and the sense can then be mapped to an English word.
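Putting the pieces together, the disambiguation graph can be materialized with the networkx library, with one vertex per candidate sense and the fused similarity of the previous unit as edge weight; nx.pagerank can then play the role of the graph scoring unit. A sketch under these assumptions (helper names illustrative):

    import itertools
    import networkx as nx

    def build_disambiguation_graph(candidate_senses, pairwise_sims):
        """candidate_senses: list of sense labels (one vertex per sense);
        pairwise_sims: dict mapping a sense pair to its fused similarity."""
        g = nx.Graph()
        g.add_nodes_from(candidate_senses)
        for s1, s2 in itertools.combinations(candidate_senses, 2):
            w = pairwise_sims.get((s1, s2), pairwise_sims.get((s2, s1), 0.0))
            if w > 0:
                g.add_edge(s1, s2, weight=w)  # edge weight = fused similarity
        return g

    # Weighted PageRank over the disambiguation graph (graph scoring):
    # scores = nx.pagerank(g, alpha=0.85, weight="weight")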
a word sense correct selection unit, for scoring the candidate senses in the graph by graph scoring, obtaining a score list of the candidate senses, and selecting the highest-scoring one as the correct sense. The word sense correct selection unit comprises:
a graph scoring unit, for invoking the graph scoring method to score the importance of the word sense concept vertices in the disambiguation graph and, after scoring is complete, arranging the candidate sense concepts by score in descending order to form the candidate sense concept list; graph scoring uses the PageRank algorithm, which evaluates the nodes of a graph on the basis of a Markov chain model, the PageRank score of a node depending on the PageRank scores of all nodes linked to it; the PageRank score of a node v is computed as:

PR(v) = (1 − α)/N + α · Σ_{u ∈ in(v)} PR(u) / |out(u)|
where 1 − α is the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linking to node v;
a correct sense selection unit, for selecting the correct sense from the disambiguation result, covering the following two cases:
① if the disambiguation result contains only one sense concept, that sole sense concept is taken as the correct sense;
② if the disambiguation result is a sense list made up of several sense concepts, the sense concept with the highest score is taken as the correct sense.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A word sense disambiguation method based on a graph model, characterized by comprising the following steps:
S1, extracting context knowledge: performing part-of-speech tagging on the ambiguous sentence and extracting the content words as context knowledge, the content words being nouns, verbs, adjectives and adverbs;
S2, similarity calculation: performing the English-based similarity calculation, the word-vector-based similarity calculation and the HowNet-based similarity calculation respectively;
S3, constructing the disambiguation graph: optimizing the weights of the similarities with the simulated annealing algorithm to obtain the fused similarity, and then constructing the disambiguation graph with word concepts as vertices, semantic relations between concepts as edges, and the fused similarities as edge weights;
S4, correct selection of the word sense: scoring the candidate senses in the graph by graph scoring, obtaining a score list of the candidate senses, and selecting the highest-scoring one as the correct sense.

2. The word sense disambiguation method based on a graph model according to claim 1, characterized in that the specific steps of the similarity calculation in step S2 are as follows:
S201, English-based similarity calculation: annotating the context knowledge with HowNet word sense information and performing sense mapping to obtain a set of English words, and then computing the similarity of the resulting English words with the word similarity algorithm based on word vectors and a knowledge base;
S202, word-vector-based similarity calculation: training word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, retrieving from that file the word vectors of two given words, and taking the cosine similarity between the vectors as the similarity of the two words;
S203, HowNet-based similarity calculation: annotating the context knowledge with word sense information by means of HowNet, in the form of word vocabulary plus concept number, and computing the similarity between the senses with the concept similarity toolkit provided by HowNet.

3. The word sense disambiguation method based on a graph model according to claim 2, characterized in that the word similarity algorithm based on word vectors and a knowledge base in step S201 is specifically as follows:
S20101, determining whether words or phrases are given:
① if two English words are given, the similarity between them is obtained by computing the cosine similarity of the two word vectors;
② if a given item is a phrase, the word vectors of the words in the phrase are added to obtain the vector representation of the phrase, and the phrase similarity is obtained as

sim(p1, p2) = cos((1/|p1|)·Σ_i w_i, (1/|p2|)·Σ_j w_j)

where |p1| and |p2| denote the numbers of words contained in phrases p1 and p2, and w_i and w_j denote the i-th word of p1 and the j-th word of p2, respectively;
S20102, iteratively searching the synonym sets related to the two English words until the number of iteration steps exceeds γ;
S20103, constructing a synonym set graph from the two English words and their related synonym sets;
S20104, within the set distance range in the graph, computing the degree of overlap of the synonym sets related to the two English words as

sim_lap(w_i, w_j) = d·count(w_i, w_j) / (count(w_i) + count(w_j))

where count(w_i, w_j) is the number of synonym sets shared by words w_i and w_j; count(w_i) and count(w_j) are the numbers of synonym sets of w_i and w_j respectively; and d is the value of the set distance range;
S20105, computing the shortest path between w_i and w_j in the graph with Dijkstra's algorithm, and obtaining the similarity of w_i and w_j as

sim_bn(w_i, w_j) = α·1/(δ·path) + (1 − α)·sim_lap(w_i, w_j)

where path is the shortest path between w_i and w_j; δ adjusts the value of the similarity; sim_lap(w_i, w_j) denotes the degree of overlap between w_i and w_j; and the parameter α is an adjustment factor balancing the two parts of the formula;
S20106, linearly combining the similarity sim_vec obtained by the word vector method in step S20101 and the similarity sim_bn obtained by the knowledge base method in step S20105 into the final similarity

sim_final(w_i, w_j) = β·sim_vec + (1 − β)·sim_bn

where sim_bn and sim_vec denote the similarities obtained by the knowledge base method and the word vector method respectively, and the parameter β is an adjustment factor balancing the two results;
S20107, returning the similarity sim_final.

4. The word sense disambiguation method based on a graph model according to claim 1, characterized in that the specific steps of constructing the disambiguation graph in step S3 are as follows:
S301, weight optimization: automatically optimizing the three similarity values of step S2 with the weight optimization algorithm based on simulated annealing to obtain the optimal weight parameters;
S302, similarity fusion: after weight optimization, the finally fused similarity between two word senses is

sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec

where ws and ws' denote the two word senses; sim_how is the HowNet-based similarity result, with weight α; sim_en is the word similarity result based on word vectors and the knowledge base, with weight β; sim_vec is the word-vector-based similarity result, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
S303, constructing the disambiguation graph: the disambiguation graph takes word senses as vertices and the semantic relations between senses as edges, and the simulated-annealing-based weight optimization integrates the three similarity values into the edge weights between senses.

5. The word sense disambiguation method based on a graph model according to claim 4, characterized in that the simulated annealing algorithm in step S301 performs parameter optimization according to the formula

p = 1,  if result(x_new) ≥ result(x_old);
p = exp((result(x_new) − result(x_old)) / (δ·t)),  otherwise

where result(x) denotes the objective function, i.e. the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly drawn parameters; and x_old denotes the original parameters;
the two cases expressed by this parameter optimization formula are:
(a) if the objective function value of the new parameters x_new is not less than that of the original parameters x_old, the new parameters x_new are accepted with probability p = 1;
(b) if the objective function value of x_new is less than that of x_old, the probability p = exp((result(x_new) − result(x_old)) / (δ·t)) serves as the basis for accepting x_new: a probability value is generated at random and compared with p:
① if the randomly generated value is not greater than p, the new parameters x_new are accepted;
② if the randomly generated value is greater than p, the new parameters x_new are discarded;
the word sense in step S303 is a triple, written Word(No., Sword, Enword), where No. is the concept number, Sword is the first sememe word and Enword is the English word; No., Sword and Enword form an organic whole describing a single word sense concept; in HowNet a concept number uniquely identifies a word sense, the first sememe word can be obtained from its concept definition, and the sense can then be mapped to an English word.

6. The word sense disambiguation method based on a graph model according to claim 1, characterized in that the specific steps of selecting the correct word sense in step S4 are as follows:
S401, graph scoring: invoking the graph scoring method to score the importance of the word sense concept vertices in the disambiguation graph and, after scoring is complete, arranging the candidate sense concepts by score in descending order to form the candidate sense concept list;
S402, selecting the correct sense: selecting the correct sense from the disambiguation result, covering the following two cases:
① if the disambiguation result contains only one sense concept, that sole sense concept is taken as the correct sense;
② if the disambiguation result is a sense list made up of several sense concepts, the sense concept with the highest score is taken as the correct sense.

7. The word sense disambiguation method based on a graph model according to claim 6, characterized in that the graph scoring in step S401 uses the PageRank algorithm, which evaluates the nodes of a graph on the basis of a Markov chain model, the PageRank score of a node depending on the PageRank scores of all nodes linked to it; the PageRank score of a node v is computed as

PR(v) = (1 − α)/N + α · Σ_{u ∈ in(v)} PR(u) / |out(u)|

where 1 − α is the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linking to node v.

8. A word sense disambiguation system based on a graph model, characterized in that the system comprises:
a context knowledge extraction unit, which performs part-of-speech tagging on the ambiguous sentence and extracts the content words as context knowledge, the content words being nouns, verbs, adjectives and adverbs;
a similarity calculation unit, for performing the English-based similarity calculation, the word-vector-based similarity calculation and the HowNet-based similarity calculation respectively;
a disambiguation graph construction unit, for optimizing the weights of the similarities with the simulated annealing algorithm to obtain the fused similarity, and then constructing the disambiguation graph with word concepts as vertices, semantic relations between concepts as edges, and the fused similarities as edge weights;
a word sense correct selection unit, for scoring the candidate senses in the graph by graph scoring, obtaining a score list of the candidate senses, and selecting the highest-scoring one as the correct sense.

9. The word sense disambiguation system based on a graph model according to claim 8, characterized in that the similarity calculation unit comprises:
an English similarity calculation unit, for annotating the context knowledge with HowNet word sense information and performing sense mapping to obtain a set of English words, and then computing the similarity of the resulting English words with the word similarity algorithm based on word vectors and a knowledge base;
a word vector similarity calculation unit, for training word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, retrieving from that file the word vectors of two given words, and taking the cosine similarity between the vectors as the similarity of the two words;
a HowNet similarity calculation unit, for annotating the context knowledge with word sense information by means of HowNet, in the form of word vocabulary plus concept number, and computing the similarity between the senses with the concept similarity toolkit provided by HowNet;
and the disambiguation graph construction unit comprises:
a weight optimization unit, for automatically optimizing, with the weight optimization algorithm based on simulated annealing, the three similarity values of the English-based, word-vector-based and HowNet-based similarity calculations to obtain the optimal weight parameters; the simulated annealing algorithm performs parameter optimization according to the formula

p = 1,  if result(x_new) ≥ result(x_old);
p = exp((result(x_new) − result(x_old)) / (δ·t)),  otherwise

where result(x) denotes the objective function, i.e. the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly drawn parameters; and x_old denotes the original parameters;
the two cases expressed by this parameter optimization formula are:
(a) if the objective function value of the new parameters x_new is not less than that of the original parameters x_old, the new parameters x_new are accepted with probability p = 1;
(b) if the objective function value of x_new is less than that of x_old, the probability p = exp((result(x_new) − result(x_old)) / (δ·t)) serves as the basis for accepting x_new: a probability value is generated at random and compared with p:
① if the randomly generated value is not greater than p, the new parameters x_new are accepted;
② if the randomly generated value is greater than p, the new parameters x_new are discarded;
a similarity fusion unit: after weight optimization, the finally fused similarity between two word senses is

sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec

where ws and ws' denote the two word senses; sim_how is the HowNet-based similarity result, with weight α; sim_en is the word similarity result based on word vectors and the knowledge base, with weight β; sim_vec is the word-vector-based similarity result, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
a disambiguation graph building unit, for building the disambiguation graph with word senses as vertices and the semantic relations between senses as edges, using the simulated-annealing-based weight optimization to integrate the three similarity values into the edge weights between senses; the word sense is a triple, written Word(No., Sword, Enword), where No. is the concept number, Sword is the first sememe word and Enword is the English word; No., Sword and Enword form an organic whole describing a single word sense concept; in HowNet a concept number uniquely identifies a word sense, the first sememe word can be obtained from its concept definition, and the sense can then be mapped to an English word.

10. The word sense disambiguation system based on a graph model according to claim 8 or 9, characterized in that the word sense correct selection unit comprises:
a graph scoring unit, for invoking the graph scoring method to score the importance of the word sense concept vertices in the disambiguation graph and, after scoring is complete, arranging the candidate sense concepts by score in descending order to form the candidate sense concept list; graph scoring uses the PageRank algorithm, which evaluates the nodes of a graph on the basis of a Markov chain model, the PageRank score of a node depending on the PageRank scores of all nodes linked to it; the PageRank score of a node v is computed as

PR(v) = (1 − α)/N + α · Σ_{u ∈ in(v)} PR(u) / |out(u)|

where 1 − α is the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linking to node v;
a correct sense selection unit, for selecting the correct sense from the disambiguation result, covering the following two cases:
① if the disambiguation result contains only one sense concept, that sole sense concept is taken as the correct sense;
② if the disambiguation result is a sense list made up of several sense concepts, the sense concept with the highest score is taken as the correct sense.
CN201811503355.7A 2018-12-10 2018-12-10 Word sense disambiguation method and system based on graph model Active CN109359303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811503355.7A CN109359303B (en) 2018-12-10 2018-12-10 Word sense disambiguation method and system based on graph model


Publications (2)

Publication Number Publication Date
CN109359303A 2019-02-19
CN109359303B CN109359303B (en) 2023-04-07

Family

ID=65332018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811503355.7A Active CN109359303B (en) 2018-12-10 2018-12-10 Word sense disambiguation method and system based on graph model

Country Status (1)

Country Link
CN (1) CN109359303B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002017128A1 (en) * 2000-08-24 2002-02-28 Science Applications International Corporation Word sense disambiguation
WO2014087506A1 (en) * 2012-12-05 2014-06-12 三菱電機株式会社 Word meaning estimation device, word meaning estimation method, and word meaning estimation program
WO2016050066A1 (en) * 2014-09-29 2016-04-07 华为技术有限公司 Method and device for parsing interrogative sentence in knowledge base
CN105760363A (en) * 2016-02-17 2016-07-13 腾讯科技(深圳)有限公司 Text file word sense disambiguation method and device
CN105893346A (en) * 2016-03-30 2016-08-24 齐鲁工业大学 Graph model word sense disambiguation method based on dependency syntax tree
WO2017217661A1 (en) * 2016-06-15 2017-12-21 울산대학교 산학협력단 Word sense embedding apparatus and method using lexical semantic network, and homograph discrimination apparatus and method using lexical semantic network and word embedding
CN106951684A (en) * 2017-02-28 2017-07-14 北京大学 A kind of method of entity disambiguation in medical conditions idagnostic logout
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
CN108959461A (en) * 2018-06-15 2018-12-07 东南大学 A kind of entity link method based on graph model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
鹿文鹏 (Lu Wenpeng): "Research on word sense disambiguation methods based on dependency and domain knowledge", China Master's Theses Full-text Database, Information Science and Technology series *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413989A (en) * 2019-06-19 2019-11-05 北京邮电大学 A Text Domain Determination Method and System Based on Domain Semantic Relationship Graph
CN110413989B (en) * 2019-06-19 2020-11-20 北京邮电大学 A text domain determination method and system based on domain semantic relation graph
CN110362691A (en) * 2019-07-19 2019-10-22 大连语智星科技有限公司 Syntax tree library construction system
CN110598209A (en) * 2019-08-21 2019-12-20 合肥工业大学 Method, system and storage medium for extracting keywords
CN110705295A (en) * 2019-09-11 2020-01-17 北京航空航天大学 Entity name disambiguation method based on keyword extraction
CN110705295B (en) * 2019-09-11 2021-08-24 北京航空航天大学 Entity name disambiguation method based on keyword extraction
CN110766072A (en) * 2019-10-22 2020-02-07 探智立方(北京)科技有限公司 Automatic generation method of computational graph evolution AI model based on structural similarity
CN111310475B (en) * 2020-02-04 2023-03-10 支付宝(杭州)信息技术有限公司 Training method and device of word sense disambiguation model
CN111310475A (en) * 2020-02-04 2020-06-19 支付宝(杭州)信息技术有限公司 Training method and device of word sense disambiguation model
CN111783418A (en) * 2020-06-09 2020-10-16 北京北大软件工程股份有限公司 Chinese meaning representation learning method and device
CN111783418B (en) * 2020-06-09 2024-04-05 北京北大软件工程股份有限公司 Chinese word meaning representation learning method and device
CN112256885A (en) * 2020-10-23 2021-01-22 上海恒生聚源数据服务有限公司 Label disambiguation method, device, equipment and computer readable storage medium
CN112256885B (en) * 2020-10-23 2023-10-27 上海恒生聚源数据服务有限公司 Label disambiguation method, device, equipment and computer readable storage medium
CN113158687B (en) * 2021-04-29 2021-12-28 新声科技(深圳)有限公司 Semantic disambiguation method and device, storage medium and electronic device
CN113158687A (en) * 2021-04-29 2021-07-23 新声科技(深圳)有限公司 Semantic disambiguation method and device, storage medium and electronic device
CN115114397A (en) * 2022-05-09 2022-09-27 泰康保险集团股份有限公司 Annuity information updating method, device, electronic device, storage medium, and program
CN115114397B (en) * 2022-05-09 2024-05-31 泰康保险集团股份有限公司 Annuity information updating method, annuity information updating device, electronic device, storage medium, and program
CN119477229A (en) * 2025-01-15 2025-02-18 浙商银行股份有限公司 A smart contract disambiguation method, device, equipment and storage medium
CN119477229B (en) * 2025-01-15 2025-06-20 浙商银行股份有限公司 A smart contract disambiguation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN109359303B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN109359303A (en) A word sense disambiguation method and system based on graph model
Qi et al. Openhownet: An open sememe-based lexical knowledge base
US9514098B1 (en) Iteratively learning coreference embeddings of noun phrases using feature representations that include distributed word representations of the noun phrases
CN109213995A (en) A kind of across language text similarity assessment technology based on the insertion of bilingual word
CN107305539A (en) A kind of text tendency analysis method based on Word2Vec network sentiment new word discoveries
CN104361127A (en) Multilanguage question and answer interface fast constituting method based on domain ontology and template logics
Ramisch et al. mwetoolkit: A framework for multiword expression identification.
Vlachos et al. A new corpus and imitation learning framework for context-dependent semantic parsing
CN109614620A (en) A Method and System for Word Sense Disambiguation Based on HowNet
Xie et al. Knowledge base question answering based on deep learning models
Piryani et al. Sentiment analysis in Nepali: Exploring machine learning and lexicon-based approaches
US20150161109A1 (en) Reordering words for machine translation
Nishihara et al. Word complexity estimation for Japanese lexical simplification
Kang Spoken language to sign language translation system based on HamNoSys
Houssein et al. Semantic protocol and resource description framework query language: a comprehensive review
Park et al. Frame-Semantic Web: a Case Study for Korean.
Kaffee et al. Multilingual knowledge graphs and low-resource languages: A review
CN108255818B (en) A compound machine translation method using segmentation technology
Harshawardhan et al. Phrase based English-Tamil translation system by concept labeling using translation memory
CN108280066B (en) Off-line translation method from Chinese to English
Marinova Evaluation of stacked embeddings for Bulgarian on the downstream tasks POS and NERC
Papadias et al. Educing knowledge from text: Semantic information extraction of spatial concepts and places
Huang et al. A simple, straightforward and effective model for joint bilingual terms detection and word alignment in SMT
Passban et al. Improving phrase-based SMT using cross-granularity embedding similarity
Le et al. Technical term similarity model for natural language based data retrieval in civil infrastructure projects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190219

Assignee: SHANDONG ZHENGKAI NEW MATERIALS CO.,LTD.

Assignor: ZAOZHUANG University

Contract record no.: X2024980014476

Denomination of invention: A method and system for word sense disambiguation based on graph model

Granted publication date: 20230407

License type: Common License

Record date: 20240912

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190219

Assignee: Shandong East Rail Power Technology Co.,Ltd.

Assignor: ZAOZHUANG University

Contract record no.: X2025980009984

Denomination of invention: A method and system for word sense disambiguation based on graph model

Granted publication date: 20230407

License type: Common License

Record date: 20250605

Application publication date: 20190219

Assignee: Shandong Chaoyue Garment Co.,Ltd.

Assignor: ZAOZHUANG University

Contract record no.: X2025980009974

Denomination of invention: A method and system for word sense disambiguation based on graph model

Granted publication date: 20230407

License type: Common License

Record date: 20250605

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190219

Assignee: ZAOZHUANG AOSEN MUSICAL INSTRUMENT CO.,LTD.

Assignor: ZAOZHUANG University

Contract record no.: X2025980010355

Denomination of invention: A method and system for word sense disambiguation based on graph model

Granted publication date: 20230407

License type: Common License

Record date: 20250612