CN109359303A - A word sense disambiguation method and system based on graph model - Google Patents

A word sense disambiguation method and system based on a graph model

Info

Publication number
CN109359303A
CN109359303A (application CN201811503355.7A)
Authority
CN
China
Prior art keywords
word
similarity
meaning
disambiguation
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811503355.7A
Other languages
Chinese (zh)
Other versions
CN109359303B (en)
Inventor
孟凡擎
燕孝飞
张强
陈文平
鹿文鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zaozhuang University
Original Assignee
Zaozhuang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zaozhuang University filed Critical Zaozhuang University
Priority to CN201811503355.7A
Publication of CN109359303A
Application granted
Publication of CN109359303B
Legal status: Active

Classifications

    • G06F40/205 Natural language analysis; parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/30 Semantic analysis
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a word sense disambiguation method and system based on a graph model, belonging to the technical field of natural language processing. The technical problem to be solved by the invention is how to combine multiple Chinese and English resources so that their advantages complement one another, fully mine the disambiguation knowledge contained in the resources, and improve word sense disambiguation performance. The technical solution adopted is as follows. 1. A word sense disambiguation method based on a graph model, comprising the following steps: S1, extracting context knowledge: part-of-speech tagging is performed on the ambiguous sentence and content words are extracted as context knowledge, content words being nouns, verbs, adjectives and adverbs; S2, similarity calculation: an English-based similarity calculation, a word-vector-based similarity calculation and a HowNet-based similarity calculation are performed separately; S3, constructing the disambiguation graph; S4, selecting the correct word sense. 2. A word sense disambiguation system based on a graph model, comprising a context knowledge extraction unit, a similarity calculation unit, a disambiguation graph construction unit and a correct word sense selection unit.

Description

A word sense disambiguation method and system based on a graph model
Technical field
The present invention relates to the technical field of natural language processing, and in particular to a word sense disambiguation method and system based on a graph model.
Background technique
Word sense disambiguation refers to determining the specific sense of an ambiguous word according to the specific context in which it occurs. It is a basic research problem in the field of natural language processing and directly affects upper-layer applications such as machine translation, information extraction, information retrieval, text classification and sentiment analysis. Polysemy is pervasive both in Chinese and in English and other Western languages.
Traditional graph-model approaches to Chinese word sense disambiguation mainly rely on one or more Chinese knowledge resources and are hampered by the scarcity of such resources, so their disambiguation performance is low. How to combine multiple Chinese and English resources so that their advantages complement one another, fully mine the disambiguation knowledge in these resources, and improve word sense disambiguation performance is therefore a technical problem in urgent need of a solution.
Patent document CN105893346A discloses a graph-model word sense disambiguation method based on dependency syntax trees. Its steps are: 1. preprocess the sentence and extract the content words to be disambiguated, mainly including normalization, tokenization and lemmatization; 2. perform dependency parsing on the sentence and construct its dependency syntax tree; 3. obtain the distance between words on the dependency syntax tree, i.e. the length of the shortest path; 4. build a disambiguation knowledge graph for the sense concepts of the words in the sentence according to a knowledge base; 5. compute a graph score for each sense node according to the lengths of the semantic association paths between sense nodes in the knowledge graph, the weights of the associated edges, and the distances between the path endpoints on the dependency syntax tree; 6. for each ambiguous word, select the sense with the highest graph score as the correct sense. However, that solution uses the semantic associations contained in BabelNet rather than the semantic knowledge in HowNet; it is suitable for English word sense disambiguation but not for Chinese, and it cannot solve the problem of combining multiple Chinese and English resources with complementary advantages to fully mine the disambiguation knowledge in the resources and improve word sense disambiguation performance.
Summary of the invention
The technical task of the invention is to provide a word sense disambiguation method and system based on a graph model, so as to solve the problem of how to combine multiple Chinese and English resources with complementary advantages, fully mine the disambiguation knowledge in these resources, and improve word sense disambiguation performance.
The technical task of the invention is achieved in the following manner. A word sense disambiguation method based on a graph model comprises the following steps:
S1, extracting context knowledge: perform part-of-speech tagging on the ambiguous sentence and extract content words as context knowledge; content words are nouns, verbs, adjectives and adverbs;
S2, similarity calculation: perform an English-based similarity calculation, a word-vector-based similarity calculation and a HowNet-based similarity calculation separately;
S3, building disambiguate figure: carrying out weight optimization to similarity using simulated annealing, obtain fused similar Degree, and then using word concept as vertex, the semantic relation between concept is side, and the weight on side is fused similarity, is constructed Disambiguate figure;
S4, selecting the correct word sense: score the candidate senses in the graph by graph scoring to obtain the score list of candidate senses, and select the highest-scoring sense as the correct sense.
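As an illustration of the overall flow, the following minimal Python sketch strings the four steps together; extract_content_words, compute_similarities, build_disambiguation_graph and score_graph are hypothetical stand-ins for S1-S4, not the patent's own code.

    # A sketch of the S1-S4 pipeline; all four helpers are hypothetical.
    def disambiguate(sentence, target_word):
        context = extract_content_words(sentence)          # S1: POS-tag, keep content words
        sims = compute_similarities(context)               # S2: English / word-vector / HowNet
        graph = build_disambiguation_graph(context, sims)  # S3: fused similarities as edge weights
        scores = score_graph(graph)                        # S4: dict mapping sense -> graph score
        candidates = [s for s in scores if s.word == target_word]
        return max(candidates, key=scores.get)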
Preferably, the specific steps of the similarity calculation in step S2 are as follows:
S201, English-based similarity calculation: annotate the context knowledge with HowNet sense information and perform sense mapping to obtain a set of English words; then apply the word similarity algorithm based on word vectors and a knowledge base to compute similarities between the resulting English words. Since HowNet is bilingual, the sense mapping here directly retrieves the English word information in HowNet;
S202, word-vector-based similarity calculation: the Sogou full-web news corpus (1.43 GB in total) is used; word vectors are trained on this corpus with Google's word2vec toolkit to obtain a word vector file; the vectors of two given words are retrieved from the file, and the cosine similarity between the vectors is taken as their similarity;
S203, HowNet-based similarity calculation: annotate the context knowledge with sense information using HowNet, in the form of word plus concept number, and compute the similarity between senses with the concept similarity toolkit provided by HowNet.
More preferably, the word similarity algorithm based on word vectors and a knowledge base in step S201 is as follows:
S20101, determine whether the given items are words or phrases:
1. if two English words are given, the similarity between the two words is obtained by computing the cosine similarity of their word vectors;
2. if a phrase is given, the word vectors of the words in the phrase are added to obtain a vector representation of the phrase, from which the phrase similarity is obtained, with the formula:
sim(p1, p2) = cos(Σ_{i=1..|p1|} v(w_i), Σ_{j=1..|p2|} v(w_j))
wherein |p1| and |p2| denote the numbers of words contained in phrases p1 and p2; w_i and w_j denote the i-th word of p1 and the j-th word of p2, respectively; v(·) denotes the word vector of a word;
S20102, iteratively search for the synsets related to the two English words until the number of iteration steps exceeds γ;
S20103, construct a synset graph from the two English words and the synsets related to them;
S20104, within the set distance range, compute in the graph the overlap of the synsets related to the two English words, with the formula:
sim_lap(w_i, w_j) = d · count(w_i, w_j) / (count(w_i) + count(w_j))
wherein count(w_i, w_j) denotes the number of synsets shared by the words w_i and w_j; count(w_i) and count(w_j) are the numbers of synsets of w_i and w_j respectively; d denotes the value of the set distance range;
S20105, compute the shortest path between w_i and w_j in the graph with Dijkstra's algorithm and obtain the similarity of w_i and w_j, with the formula:
sim_bn(w_i, w_j) = α · (1/δ^path) + (1 - α) · sim_lap(w_i, w_j)
wherein path is the length of the shortest path between w_i and w_j; δ adjusts the similarity value; sim_lap(w_i, w_j) denotes the overlap between w_i and w_j; the parameter α is a regulatory factor that balances the two similarity terms of the formula;
S20106, combine the similarity sim_vec obtained by the word-vector-based method in step S20101 and the similarity sim_bn obtained by the knowledge-base-based method in step S20105 by linear addition to obtain the final similarity, with the formula:
sim_final(w_i, w_j) = β · sim_vec + (1 - β) · sim_bn
wherein sim_bn and sim_vec denote the similarity obtained by the knowledge-base-based method and the similarity obtained by the word-vector-based method respectively; the parameter β is a regulatory factor that balances the two similarity results;
S20107, return the similarity sim_final.
Preferably, the specific steps of constructing the disambiguation graph in step S3 are as follows:
S301, weight optimization: automatically optimize the three similarity values from step S2 with the weight optimization algorithm based on simulated annealing to obtain the optimal weight parameters;
S302, similarity fusion: after weight optimization, the final fused similarity between senses is:
Sim(ws, ws') = α · sim_how + β · sim_en + γ · sim_vec
wherein ws and ws' denote two senses; sim_how denotes the result of the HowNet-based similarity calculation, with weight α; sim_en denotes the result of the word similarity calculation based on word vectors and a knowledge base, with weight β; sim_vec denotes the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
S303, constructing the disambiguation graph: in the disambiguation graph, senses are vertices and semantic relations between senses are edges; the three similarity values, integrated by the weight optimization algorithm based on simulated annealing, serve as the edge weights between senses.
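As a concrete illustration of step S303, here is a minimal sketch (not the patent's own code) that builds such a disambiguation graph with the networkx library; fused_sim is a hypothetical function returning α·sim_how + β·sim_en + γ·sim_vec for two senses.

    from itertools import combinations
    import networkx as nx

    def build_disambiguation_graph(senses, fused_sim):
        # Vertices are sense concepts; edge weights are fused similarities.
        g = nx.Graph()
        g.add_nodes_from(senses)
        for ws, ws2 in combinations(senses, 2):
            weight = fused_sim(ws, ws2)
            if weight > 0:  # keep only positive semantic relations
                g.add_edge(ws, ws2, weight=weight)
        return g

The resulting graph can then be scored with, for example, nx.pagerank(g, weight="weight"), matching the graph scoring of step S4.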
More preferably, the simulated annealing algorithm in step S301 performs parameter optimization according to the formula:
p = 1, if result(x_new) ≥ result(x_old); p = exp((result(x_new) - result(x_old)) / (δ · t)), otherwise
wherein result(x) denotes the objective function, namely the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly taken parameter; x_old denotes the original parameter;
The formula by which the simulated annealing algorithm performs parameter optimization covers the following two cases:
(a) if the objective function value of the new parameter x_new is not less than that of the original parameter x_old, the new parameter x_new is selected with probability p = 1;
(b) if the objective function value of the new parameter x_new is less than that of the original parameter x_old, the probability p = exp((result(x_new) - result(x_old)) / (δ · t)) serves as the basis for selecting x_new: a probability value is generated at random and compared with p:
1. if the randomly generated probability value is not greater than p, the new parameter x_new is selected;
2. if the randomly generated probability value is greater than p, the new parameter x_new is discarded.
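A minimal sketch of this acceptance rule, assuming the result values are disambiguation accuracies (the function name is illustrative):

    import math, random

    def accept_new_parameter(result_new, result_old, delta, t):
        # Case (a): an equal or better objective value is always kept (p = 1).
        if result_new >= result_old:
            return True
        # Case (b): keep a worse parameter with probability
        # p = exp((result_new - result_old) / (delta * t)),
        # comparing a randomly generated value against p.
        p = math.exp((result_new - result_old) / (delta * t))
        return random.random() <= p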
The sense in step S303 refers to a triple, denoted Word(No., Sword, Enword), wherein No. is the concept number, Sword is the first sememe, and Enword is the English word. No., Sword and Enword form an organic whole that describes one and the same sense concept: in HowNet a sense concept number uniquely identifies a sense, from which the first sememe in its concept definition can be obtained, and the sense can further be mapped to an English word.
Preferably, the specific steps of selecting the correct sense in step S4 are as follows:
S401, graph scoring: call the graph scoring method to score the importance of the sense concept vertices in the disambiguation graph; after graph scoring is completed, arrange the candidate sense concepts in descending order of score to form the candidate sense concept list;
S402, selecting the correct sense: the correct sense is selected from the disambiguation result, covering the following two cases:
1. if the disambiguation result contains only one sense concept, that sense concept is taken as the correct sense;
2. if the disambiguation result is a sense list composed of multiple sense concepts, the sense concept with the highest score is taken as the correct sense.
More preferably, the graph scoring in step S401 uses the PageRank algorithm. PageRank evaluates the nodes in the graph on the basis of a Markov chain model: the PageRank score of a node depends on the PageRank scores of all nodes linked to it. The PageRank score of a node v is calculated as:
PR(v) = (1 - α)/N + α · Σ_{u ∈ in(v)} PR(u) / |out(u)|
wherein 1 - α denotes the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| denotes the out-degree of node u; and in(v) is the set of all nodes linked to node v.
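A minimal power-iteration sketch of this scoring formula (illustrative names; it assumes every node has at least one outgoing link):

    def pagerank(out_links, alpha=0.85, iters=50):
        # out_links: dict mapping each node u to the list of nodes u links to.
        nodes = list(out_links)
        n = len(nodes)
        pr = {v: 1.0 / n for v in nodes}
        for _ in range(iters):
            pr = {v: (1 - alpha) / n
                     + alpha * sum(pr[u] / len(out_links[u])
                                   for u in nodes if v in out_links[u])
                  for v in nodes}
        return pr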
A word sense disambiguation system based on a graph model comprises:
a context knowledge extraction unit, which performs part-of-speech tagging on the ambiguous sentence and extracts content words as context knowledge, content words being nouns, verbs, adjectives and adverbs;
a similarity calculation unit, configured to perform the English-based similarity calculation, the word-vector-based similarity calculation and the HowNet-based similarity calculation separately;
a disambiguation graph construction unit, configured to optimize the weights of the similarities with the simulated annealing algorithm to obtain the fused similarity, and then build the disambiguation graph with word sense concepts as vertices, semantic relations between concepts as edges, and the fused similarity as edge weights;
a correct sense selection unit, configured to score the candidate senses in the graph by graph scoring, obtain the score list of candidate senses, and select the sense with the maximum score as the correct sense.
Preferably, the similarity calculation unit comprises:
an English similarity calculation unit, configured to annotate the context knowledge with HowNet sense information and perform sense mapping to obtain a set of English words, and then to apply the word similarity algorithm based on word vectors and a knowledge base to compute similarities between the resulting English words; since HowNet is bilingual, the sense mapping here directly retrieves the English word information in HowNet;
a word vector similarity calculation unit, configured to train word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, retrieve the vectors of two given words from the file, and take the cosine similarity between the vectors as their similarity; it should be noted that the more senses an ambiguous word has, the more likely the trained word vector file is to be biased towards one of its more common senses;
a HowNet similarity calculation unit, configured to annotate the context knowledge with sense information using HowNet, in the form of word plus concept number, and to compute the similarity between senses with the concept similarity toolkit provided by HowNet;
The disambiguation graph construction unit comprises:
a weight optimization unit, configured to automatically optimize, with the weight optimization algorithm based on simulated annealing, the three similarity values of the English-based similarity calculation, the word-vector-based similarity calculation and the HowNet-based similarity calculation, so as to obtain the optimal weight parameters; the simulated annealing algorithm performs parameter optimization according to the formula:
p = 1, if result(x_new) ≥ result(x_old); p = exp((result(x_new) - result(x_old)) / (δ · t)), otherwise
wherein result(x) denotes the objective function, namely the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly taken parameter; x_old denotes the original parameter;
The formula by which the simulated annealing algorithm performs parameter optimization covers the following two cases:
(a) if the objective function value of the new parameter x_new is not less than that of the original parameter x_old, the new parameter x_new is selected with probability p = 1;
(b) if the objective function value of the new parameter x_new is less than that of the original parameter x_old, the probability p = exp((result(x_new) - result(x_old)) / (δ · t)) serves as the basis for selecting x_new: a probability value is generated at random and compared with p:
1. if the randomly generated probability value is not greater than p, the new parameter x_new is selected;
2. if the randomly generated probability value is greater than p, the new parameter x_new is discarded.
a similarity fusion unit: after weight optimization, the final fused similarity between senses is:
Sim(ws, ws') = α · sim_how + β · sim_en + γ · sim_vec
wherein ws and ws' denote two senses; sim_how denotes the result of the HowNet-based similarity calculation, with weight α; sim_en denotes the result of the word similarity calculation based on word vectors and a knowledge base, with weight β; sim_vec denotes the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
a disambiguation graph building unit, configured to build the disambiguation graph with senses as vertices and semantic relations between senses as edges, the three similarity values, integrated by the weight optimization algorithm based on simulated annealing, serving as the edge weights between senses; wherein a sense refers to a triple, denoted Word(No., Sword, Enword), in which No. is the concept number, Sword is the first sememe and Enword is the English word; No., Sword and Enword form an organic whole that describes one and the same sense concept; in HowNet a sense concept number uniquely identifies a sense, from which the first sememe in its concept definition can be obtained, and the sense can further be mapped to an English word.
More preferably, the correct sense selection unit comprises:
a graph scoring unit, configured to call the graph scoring method to score the importance of the sense concept vertices in the disambiguation graph and, after graph scoring is completed, to arrange the candidate sense concepts in descending order of score to form the candidate sense concept list; the graph scoring uses the PageRank algorithm, which evaluates the nodes in the graph on the basis of a Markov chain model; the PageRank score of a node depends on the PageRank scores of all nodes linked to it; the PageRank score of a node v is calculated as:
PR(v) = (1 - α)/N + α · Σ_{u ∈ in(v)} PR(u) / |out(u)|
wherein 1 - α denotes the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| denotes the out-degree of node u; and in(v) is the set of all nodes linked to node v;
a correct sense selection subunit, configured to select the correct sense from the disambiguation result, covering the following two cases:
1. if the disambiguation result contains only one sense concept, that sense concept is taken as the correct sense;
2. if the disambiguation result is a sense list composed of multiple sense concepts, the sense concept with the highest score is taken as the correct sense.
The word sense disambiguation method and system based on a graph model of the invention have the following advantages:
(1) by combining multiple Chinese and English resources with complementary advantages, the invention fully mines the disambiguation knowledge in these resources and thereby helps to improve word sense disambiguation performance;
(2) the invention performs the English-based, word-vector-based and HowNet-based similarity calculations separately, ensuring that multiple knowledge resources can be effectively integrated and the disambiguation accuracy improved;
(3) the invention optimizes the similarity weights with the simulated annealing algorithm to obtain the fused similarity, and then builds the disambiguation graph with word sense concepts as vertices, semantic relations between concepts as edges, and the fused similarity as edge weights, guaranteeing that the similarity values of the multiple knowledge resources are optimized automatically;
(4) when performing the English similarity calculation, the invention annotates the context knowledge with HowNet sense information and performs sense mapping to obtain a set of English words, ensuring that Chinese and English knowledge resources can be aligned automatically;
(5) the invention scores the candidate senses in the graph by graph scoring, obtains the score list of candidate senses, and selects the sense with the maximum score as the correct sense, so that the correct sense of the target ambiguous word can be selected automatically.
Description of the drawings
The invention is further described below with reference to the drawings.
Fig. 1 is a flow diagram of the word sense disambiguation method based on a graph model;
Fig. 2 is a flow diagram of the similarity calculation;
Fig. 3 is a flow diagram of constructing the disambiguation graph;
Fig. 4 is a flow diagram of the correct sense selection;
Fig. 5 is a structural block diagram of the word sense disambiguation system based on a graph model;
Fig. 6 is the sense information diagram of the example word 'Chinese medicine';
Fig. 7 is the synset graph constructed in the word similarity algorithm based on word vectors and a knowledge base.
Specific embodiments
The word sense disambiguation method and system based on a graph model of the invention are described in detail below with reference to the drawings and specific embodiments.
Embodiment 1:
As shown in Fig. 1, the word sense disambiguation method based on a graph model of the invention comprises the following steps:
S1, extracting context knowledge: perform part-of-speech tagging on the ambiguous sentence and extract content words as context knowledge; content words are nouns, verbs, adjectives and adverbs;
Example: take the sentence (given here in English gloss) 'Centering on the implementation of the "guiding opinions" and in light of the actual work of traditional Chinese medicine, all regions should intensify their efforts and actively and steadily promote the reform of TCM medical institutions.', in which 'Chinese medicine' is the word to be disambiguated. Part-of-speech tagging uses the word segmentation system NLPIR-ICTCLAS of the Chinese Academy of Sciences. After tagging, the sentence becomes (English glosses of the Chinese tokens): around/v "/wkz guidance/vn opinion/n "/wky of/ude1 implement/vn implement/vn ,/wd combine/v traditional Chinese medicine/n work/vn of/ude1 reality/n ,/wd various regions/rzs want/v increase/v dynamics/n ,/wd actively/a and/cc safe/a of/ude2 promote/vi Chinese medicine/n medical treatment/n mechanism/n reform/vn ./wj. The content words are extracted and formatted to facilitate subsequent processing, giving 'Chinese medicine_n_25: around_v_0 guidance_vn_2 opinion_n_3 implement_v_6 implement_v_7 combine_v_9 traditional Chinese medicine_n_10 work_vn_11 reality_n_13 want_v_16 increase_v_17 dynamics_n_18 actively_a_20 safe_a_22 promote_v_24 Chinese medicine_n_25 medical treatment_n_26 mechanism_n_27 reform_vn_28', where the word before the colon is the word to be disambiguated and the number after each part-of-speech tag is the position of the word in the sentence.
S2, similarity calculation: perform an English-based similarity calculation, a word-vector-based similarity calculation and a HowNet-based similarity calculation separately;
As shown in Fig. 2, the specific steps of the similarity calculation are as follows:
English-based similarity calculation: annotate the context knowledge with HowNet sense information and perform sense mapping to obtain a set of English words; then apply the word similarity algorithm based on word vectors and a knowledge base to compute similarities between the resulting English words. Since HowNet is bilingual, the sense mapping here directly retrieves the English word information in HowNet. The main part of the code of the word similarity algorithm based on word vectors and a knowledge base is as follows:
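(The listing itself appears in the original only as an image; the following Python sketch is reconstructed from the row-by-row description below. The helpers phrase_vector, cosine, related_synsets, build_graph, count_common, count_synsets and dijkstra are hypothetical stand-ins, not the patent's own code.)

    def word_similarity(w1, w2, alpha, beta, gamma=10, d=2, delta=1.4):
        # Row 1: word-vector similarity; a phrase is represented by the
        # sum of the vectors of its words.
        sim_vec = cosine(phrase_vector(w1), phrase_vector(w2))
        # Rows 2-4: iteratively expand the synsets related to w1 and w2,
        # stopping once the iteration step exceeds gamma.
        synsets = related_synsets(w1, w2, max_steps=gamma)
        # Row 5: build a graph from w1, w2 and their associated synsets.
        g = build_graph(w1, w2, synsets)
        # Row 6: overlap of the related synsets within the set distance range d.
        sim_lap = d * count_common(w1, w2) / (count_synsets(w1) + count_synsets(w2))
        # Row 7: Dijkstra shortest path between w1 and w2 in the graph.
        path = dijkstra(g, w1, w2)
        sim_bn = alpha * (1.0 / delta ** path) + (1 - alpha) * sim_lap
        # Row 8: linear combination of the vector-based and knowledge-base parts.
        sim_final = beta * sim_vec + (1 - beta) * sim_bn
        # Row 9: return the final similarity.
        return sim_final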
In the word similarity algorithm based on word vectors and a knowledge base, row 1 takes two given English words and obtains the similarity between them by computing the cosine similarity of their two word vectors. If the given item is a phrase, no vector for it exists in the trained word vectors, so further processing is needed: the word vectors of the words in the phrase are added to obtain a vector representation of the phrase, from which the phrase similarity is obtained, with the formula:
sim(p1, p2) = cos(Σ_{i=1..|p1|} v(w_i), Σ_{j=1..|p2|} v(w_j))
wherein |p1| and |p2| denote the numbers of words contained in phrases p1 and p2; w_i and w_j denote the i-th word of p1 and the j-th word of p2, respectively; v(·) denotes the word vector of a word.
Rows 2-4 iteratively search for the synsets related to the words w1 and w2 until the iteration step count exceeds γ; since the computation cost of the graph is high when there are too many nodes, the maximum number of iteration steps γ is set to 10. Row 5 builds the graph from w1, w2 and the synsets associated with them. Row 6 computes, within a certain distance range in the graph, the overlap of the synsets related to w1 and w2, with the set distance being 2, by the formula:
sim_lap(w1, w2) = 2 · count(w1, w2) / (count(w1) + count(w2))
wherein count(w1, w2) denotes the number of synsets shared by the words w1 and w2; count(w1) and count(w2) are the numbers of synsets of w1 and w2 respectively.
Row 7 computes the shortest path between w1 and w2 in the graph with Dijkstra's algorithm and from it obtains the similarity of w1 and w2, with the formula:
sim_bn(w1, w2) = α · (1/δ^path) + (1 - α) · sim_lap(w1, w2)
wherein path is the length of the shortest path between w1 and w2; δ adjusts the similarity value and is set to 1.4; sim_lap(w1, w2) denotes the overlap between w1 and w2; the parameter α is a regulatory factor that balances the two similarity terms of the formula.
Row 8 combines the word-vector-based method above and the knowledge-base (BabelNet) method by linear addition to obtain the final similarity, with the formula:
sim_final(w1, w2) = β · sim_vec + (1 - β) · sim_bn
wherein sim_bn and sim_vec denote the similarities obtained by the knowledge-base-based method and the word-vector-based method respectively; the parameter β is a regulatory factor that adjusts the results of the two methods and is specifically set to 0.6.
Row 9 returns the similarity sim_final.
As for the word vectors used in the word similarity algorithm based on word vectors and a knowledge base, they are trained with the word2vec toolkit on the unannotated English Wikipedia corpus. Before training, the data are preprocessed and the file format is converted from Unicode to UTF-8. The training window is set to 5, the vector dimension is set to 200 by default, and the Skip-gram model is selected. After training, a word vector file is obtained in which each word is mapped to a 200-dimensional vector whose components are double-precision values.
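A minimal sketch of this training setup using the gensim implementation of word2vec (gensim is an assumption here; the patent itself only names Google's word2vec toolkit):

    from gensim.models import Word2Vec

    def train_vectors(sentences):
        # sentences: an iterable of tokenized documents from the corpus,
        # already converted to UTF-8 as described above.
        model = Word2Vec(
            sentences,
            vector_size=200,  # vector dimension
            window=5,         # training window
            sg=1,             # Skip-gram model
        )
        model.wv.save_word2vec_format("vectors.txt")  # one 200-dim vector per word
        return model

Given the resulting model, the cosine similarity of two words is model.wv.similarity(w1, w2).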
BabelNet is chosen as the knowledge base. BabelNet provides rich concepts and named entities, interlinked by a large number of semantic relations; the semantic relations here include synonymy, hypernymy/hyponymy, part-whole relations and so on. Given two words (concepts or named entities), their respective synsets can be obtained by means of the BabelNet API, together with the synsets linked to them by semantic relations. A synset is a set of synonyms with a unique identifier in BabelNet that denotes one specific sense. For example, the identifier 'bn:00021464n' denotes the synset 'computer, computing machine, computing device, data processor, electronic computer, information processing system', which expresses the specific sense 'computer'. The synset graph constructed in the word similarity algorithm based on word vectors and a knowledge base is shown in Fig. 7.
Example: the context knowledge is annotated with HowNet sense information, namely sense (concept) numbers, giving 'Chinese medicine_n_25: around_v_0:124932 around_v_0:124933 guidance_vn_2:155807 opinion_n_3:143264 opinion_n_3:143267 implement_v_6:047082 implement_v_7:081572 implement_v_7:081573 implement_v_7:081575 combine_v_9:064548 combine_v_9:064549 traditional Chinese medicine_n_10:157339 work_vn_11:044068 reality_n_13:109077 reality_n_13:109078 want_v_16:140522 want_v_16:140530 want_v_16:140532 want_v_16:140534 increase_v_17:059967 increase_v_17:059968 increase_v_17:059969 dynamics_n_18:076991 actively_a_20:057562 actively_a_20:057564 safe_a_22:126267 safe_a_22:126269 promote_v_24:122203 promote_v_24:122206 promote_v_24:122211 Chinese medicine_n_25:157332 Chinese medicine_n_25:157329 mechanism_n_27:057323 mechanism_n_27:057325 mechanism_n_27:057326 reform_vn_28:041189'.
After the sense mapping, we obtain 'Chinese medicine_n_25: around_v_0:124932|revolve round around_v_0:124933|centre on guidance_vn_2:155807|direct opinion_n_3:143264|complaint opinion_n_3:143267|idea implement_v_6:047082|carry out implement_v_7:081572|feel at ease implement_v_7:081573|ascertain implement_v_7:081575|fulfil combine_v_9:064548|be united in wedlock combine_v_9:064549|combination traditional Chinese medicine_n_10:157339|traditional Chinese medicine and drugs work_vn_11:044068|work reality_n_13:109077|reality reality_n_13:109078|practice want_v_16:140522|want to want_v_16:140530|ask want_v_16:140532|ask for want_v_16:140534|take increase_v_17:059967|widen increase_v_17:059968|enhance increase_v_17:059969|enlarge dynamics_n_18:076991|dynamics actively_a_20:057562|active actively_a_20:057564|positive safe_a_22:126267|safe safe_a_22:126269|reliable promote_v_24:122203|move forward promote_v_24:122206|advance promote_v_24:122211|push into Chinese medicine_n_25:157332|traditional_Chinese_medical_science Chinese medicine_n_25:157329|practitioner_of_Chinese_medicine mechanism_n_27:057323|institution mechanism_n_27:057325|internal structure of an organization mechanism_n_27:057326|mechanism reform_vn_28:041189|reform'.
English similarity is then computed between every two of the English words obtained above (each HowNet sense concept corresponds to one English word), giving 'Chinese medicine_n_25: around_v_0:124932|revolve round ~ guidance_vn_2:155807|direct = 0.292; around_v_0:124932|revolve round ~ opinion_n_3:143264|complaint = 0.3085; around_v_0:124932|revolve round ~ opinion_n_3:143267|idea = 0.3742; around_v_0:124932|revolve round ~ implement_v_6:047082|carry out = 0.4015; around_v_0:124932|revolve round ~ implement_v_7:081572|feel at ease = 0.3575; around_v_0:124932|revolve round ~ implement_v_7:081573|ascertain = 0.3215; around_v_0:124932|revolve round ~ implement_v_7:081575|fulfil = 0.3541; around_v_0:124932|revolve round ~ combine_v_9:064548|be united in wedlock = 0.3299; around_v_0:124932|revolve round ~ combine_v_9:064549|combination = 0.3487; around_v_0:124932|revolve round ~ traditional Chinese medicine_n_10:157339|traditional Chinese medicine and drugs = 0.3520; around_v_0:124932|revolve round ~ work_vn_11:044068|work = 0.3478; around_v_0:124932|revolve round ~ reality_n_13:109077|reality = 0.3664; around_v_0:124932|revolve round ~ reality_n_13:109078|practice = 0.3907; around_v_0:124932|revolve round ~ want_v_16:140522|want to = 0.3375; around_v_0:124932|revolve round ~ want_v_16:140530|ask = 0.3482'. Only part of the similarity results is shown here for reasons of space.
Word-vector-based similarity calculation: the Sogou full-web news corpus (1.43 GB in total) is used; word vectors are trained on this corpus with Google's word2vec toolkit to obtain a word vector file; the vectors of two given words are retrieved from the file, and the cosine similarity between the vectors is taken as their similarity.
It should be noted that the more senses an ambiguous word has, the more likely the trained word vector file is to be biased towards one of its more common senses. For this reason, HowNet is used to convert an ambiguous word into the senses it possesses, i.e. the first sememe in each of its concept definitions; as shown in Fig. 6, the ambiguous word 'Chinese medicine' is converted into 'people' and 'knowledge'.
Example: after the ambiguous words are processed with HowNet, we obtain 'Chinese medicine_n_25: around_v_0:124932|surround around_v_0:124933|surround guidance_vn_2:155807|order opinion_n_3:143264|Chinese language opinion_n_3:143267|thought implement_v_6:047082|implement implement_v_7:081572|feel at ease implement_v_7:081573|decide implement_v_7:081575|realize combine_v_9:064548|marriage combine_v_9:064549|merge traditional Chinese medicine_n_10:157339|knowledge_drug work_vn_11:044068|affair reality_n_13:109077|entity reality_n_13:109078|thing want_v_16:140522|expect want_v_16:140530|require want_v_16:140532|seek want_v_16:140534|spend increase_v_17:059967|change shape increase_v_17:059968|optimize increase_v_17:059969|expand dynamics_n_18:076991|intensity actively_a_20:057562|positive actively_a_20:057564|front safe_a_22:126267|work as safe_a_22:126269|firm promote_v_24:122203|advance promote_v_24:122206|mobilize promote_v_24:122211|push Chinese medicine_n_25:157332|knowledge Chinese medicine_n_25:157329|people mechanism_n_27:057323|mechanism mechanism_n_27:057325|part mechanism_n_27:057326|component reform_vn_28:041189|improve'.
The word-vector-based similarity is computed between every two of the resulting Chinese words (each corresponding to a specific HowNet sense concept), giving 'Chinese medicine_n_25: around_v_0:124932|surround ~ guidance_vn_2:155807|order = -0.0145; around_v_0:124932|surround ~ opinion_n_3:143264|Chinese language = -0.0264; around_v_0:124932|surround ~ opinion_n_3:143267|thought = -0.0366; around_v_0:124932|surround ~ implement_v_6:047082|implement = 0.2071; around_v_0:124932|surround ~ implement_v_7:081572|feel at ease = -0.0430; around_v_0:124932|surround ~ implement_v_7:081573|decide = 0.1502; around_v_0:124932|surround ~ implement_v_7:081575|realize = 0.2254; around_v_0:124932|surround ~ combine_v_9:064548|marriage = -0.0183; around_v_0:124932|surround ~ combine_v_9:064549|merge = 0.0745; around_v_0:124932|surround ~ traditional Chinese medicine_n_10:157339|knowledge_drug = 0.0866; around_v_0:124932|surround ~ work_vn_11:044068|affair = 0.1434; around_v_0:124932|surround ~ reality_n_13:109077|entity = 0.1503; around_v_0:124932|surround ~ reality_n_13:109078|thing = -0.0571; around_v_0:124932|surround ~ want_v_16:140522|expect = 0.1009; around_v_0:124932|surround ~ want_v_16:140530|require = 0.2090; around_v_0:124932|surround ~ want_v_16:140532|seek = 0.0496; around_v_0:124932|surround ~ want_v_16:140534|spend = 0.0176; around_v_0:124932|surround ~ increase_v_17:059967|change shape = 0.0000; around_v_0:124932|surround ~ increase_v_17:059968|optimize = 0.2410; around_v_0:124932|surround ~ increase_v_17:059969|expand = 0.1911; around_v_0:124932|surround ~ dynamics_n_18:076991|intensity = 0.0592; around_v_0:124932|surround ~ actively_a_20:057562|positive = 0.3089; around_v_0:124932|surround ~ actively_a_20:057564|front = 0.0554; around_v_0:124932|surround ~ safe_a_22:126267|work as = 0.0245; around_v_0:124932|surround ~ safe_a_22:126269|firm = 0.0490; around_v_0:124932|surround ~ promote_v_24:122203|advance = 0.1917; around_v_0:124932|surround ~ promote_v_24:122206|mobilize = 0.0277; around_v_0:124932|surround ~ promote_v_24:122211|push = 0.1740; around_v_0:124932|surround ~ Chinese medicine_n_25:157332|knowledge = 0.2205; around_v_0:124932|surround ~ Chinese medicine_n_25:157329|people = -0.0686; around_v_0:124932|surround ~ mechanism_n_27:057323|mechanism = 0.0945; around_v_0:124932|surround ~ mechanism_n_27:057325|part = 0.0582; around_v_0:124932|surround ~ mechanism_n_27:057326|component = 0.0582'. Only part of the similarity results is shown here for reasons of space.
HowNet-based similarity calculation: the context knowledge is annotated with sense information using HowNet, in the form of word plus concept number, and the similarity between senses is computed with the concept similarity toolkit provided by HowNet.
Example: the context knowledge is annotated with HowNet sense information, namely sense (concept) numbers, giving 'Chinese medicine_n_25: around_v_0:124932 around_v_0:124933 guidance_vn_2:155807 opinion_n_3:143264 opinion_n_3:143267 implement_v_6:047082 implement_v_7:081572 implement_v_7:081573 implement_v_7:081575 combine_v_9:064548 combine_v_9:064549 traditional Chinese medicine_n_10:157339 work_vn_11:044068 reality_n_13:109077 reality_n_13:109078 want_v_16:140522 want_v_16:140530 want_v_16:140532 want_v_16:140534 increase_v_17:059967 increase_v_17:059968 increase_v_17:059969 dynamics_n_18:076991 actively_a_20:057562 actively_a_20:057564 safe_a_22:126267 safe_a_22:126269 promote_v_24:122203 promote_v_24:122206 promote_v_24:122211 Chinese medicine_n_25:157332 Chinese medicine_n_25:157329 mechanism_n_27:057323 mechanism_n_27:057325 mechanism_n_27:057326 reform_vn_28:041189'.
The similarity between the senses is computed with the concept similarity toolkit provided by HowNet, giving 'Chinese medicine_n_25: around_v_0:124932 ~ guidance_vn_2:155807 = 0.015094; around_v_0:124932 ~ opinion_n_3:143264 = 0.000624; around_v_0:124932 ~ opinion_n_3:143267 = 0.010256; around_v_0:124932 ~ implement_v_6:047082 = 0.013793; around_v_0:124932 ~ implement_v_7:081572 = 0.010256; around_v_0:124932 ~ implement_v_7:081573 = 0.013793; around_v_0:124932 ~ implement_v_7:081575 = 0.013793; around_v_0:124932 ~ combine_v_9:064548 = 0.016667; around_v_0:124932 ~ combine_v_9:064549 = 0.018605; around_v_0:124932 ~ traditional Chinese medicine_n_10:157339 = 0.000624; around_v_0:124932 ~ work_vn_11:044065 = 0.000624; around_v_0:124932 ~ work_vn_11:044067 = 0.000624; around_v_0:124932 ~ work_vn_11:044068 = 0.015094; around_v_0:124932 ~ reality_n_13:109077 = 0.000624; around_v_0:124932 ~ reality_n_13:109078 = 0.000624; around_v_0:124932 ~ want_v_16:140522 = 0.010959; around_v_0:124932 ~ want_v_16:140530 = 0.015094; around_v_0:124932 ~ want_v_16:140532 = 0.018605; around_v_0:124932 ~ want_v_16:140534 = 0.015094; around_v_0:124932 ~ increase_v_17:059967 = 0.013793; around_v_0:124932 ~ increase_v_17:059968 = 0.015094; around_v_0:124932 ~ increase_v_17:059969 = 0.013793; around_v_0:124932 ~ dynamics_n_18:076991 = 0.000624; around_v_0:124932 ~ actively_a_20:057562 = 0.000624; around_v_0:124932 ~ actively_a_20:057564 = 0.000624; around_v_0:124932 ~ safe_a_22:126267 = 0.000624; around_v_0:124932 ~ safe_a_22:126269 = 0.000624'.
S3, constructing the disambiguation graph: optimize the weights of the similarities with the simulated annealing algorithm to obtain the fused similarity, and then build the disambiguation graph with word sense concepts as vertices, semantic relations between concepts as edges, and the fused similarity as edge weights. As shown in Fig. 3, the specific steps of constructing the disambiguation graph are as follows:
S301, weight optimization: automatically optimize the three similarity values from step S2 with the weight optimization algorithm based on simulated annealing to obtain the optimal weight parameters; the simulated annealing algorithm performs parameter optimization according to the formula:
p = 1, if result(x_new) ≥ result(x_old); p = exp((result(x_new) - result(x_old)) / (δ · t)), otherwise
wherein result(x) denotes the objective function, namely the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly taken parameter; x_old denotes the original parameter;
The formula by which the simulated annealing algorithm performs parameter optimization covers the following two cases:
(a) if the objective function value of the new parameter x_new is not less than that of the original parameter x_old, the new parameter x_new is selected with probability p = 1;
(b) if the objective function value of the new parameter x_new is less than that of the original parameter x_old, the probability p = exp((result(x_new) - result(x_old)) / (δ · t)) serves as the basis for selecting x_new: a probability value is generated at random and compared with p:
1. if the randomly generated probability value is not greater than p, the new parameter x_new is selected;
2. if the randomly generated probability value is greater than p, the new parameter x_new is discarded.
Part of the code of the weight optimization algorithm based on simulated annealing is shown below:
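(The listing itself appears in the original only as an image; the following Python sketch is reconstructed from the row-by-row description below. The objective get_eval_result and the neighbourhood step size are assumptions.)

    import math, random

    def optimize_weights(get_eval_result, y=1.0 / 3):
        t, t_min, delta, k = 100.0, 0.001, 0.98, 100       # row 1: initialization
        x = random.uniform(0, 1 - y)                       # rows 4-5: random x in [0, 1-y]
        z = 1 - x - y                                      #           and z = 1 - x - y
        while t > t_min:                                   # rows 2-3: temperature and
            for _ in range(k):                             #           iteration-step control
                result_old = get_eval_result(x, y, z)      # row 6: disambiguation accuracy
                x_new = min(max(x + random.uniform(-0.05, 0.05), 0.0), 1 - y)  # row 7
                result_new = get_eval_result(x_new, y, 1 - x_new - y)
                # Rows 8-18: decide whether x_new replaces x (acceptance rule above).
                if result_new >= result_old or \
                   random.random() <= math.exp((result_new - result_old) / (delta * t)):
                    x, z = x_new, 1 - x_new - y
            t *= delta                                     # row 20: cooling
        return x, y, z                                     # row 22: optimal combination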
In the weight optimization algorithm based on simulated annealing, row 1 is the initialization: the initial temperature t is set to 100, the temperature floor t_min to 0.001, the cooling rate delta to 0.98, and the maximum number of iteration steps k to 100. Rows 2-3 control the temperature and the iteration steps. Rows 4-5 assign a random double-precision value between 0 and 1-y to x and assign 1-x-y to z. In row 6, the function getEvalResult(x, y, z) is the objective function; its return value is the disambiguation accuracy obtained with the weight parameters x, y, z. Row 7 selects a new value in the neighbourhood of x and assigns it to x_new. Rows 8-18 decide whether x_new is retained to replace x, as specified by the parameter optimization formula of the simulated annealing algorithm. Row 20 decreases t by the cooling rate delta. Row 22 returns the optimal parameter combination x, y, z.
Here x, y, z denote the weight variables of the three similarity results. When the algorithm is executed for the first time, y is set to 1/3; after this run, the optimized values of x and y are obtained, and min(x, y) is fixed. The algorithm is then executed a second time, after which the other two weight parameters are determined.
S302, similarity fusion: after weight optimization, the final fused similarity between senses is:
Sim(ws, ws') = α · sim_how + β · sim_en + γ · sim_vec
wherein ws and ws' denote two senses; sim_how denotes the result of the HowNet-based similarity calculation, with weight α; sim_en denotes the result of the word similarity calculation based on word vectors and a knowledge base, with weight β; sim_vec denotes the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
Example: after weight optimization, the three similarity values are fused according to the final fusion formula, giving 'Chinese medicine_n_25: around_v_0:124932|revolve round|surround ~ guidance_vn_2:155807|direct|order = 0.015094 | 0.2929 | -0.0145; around_v_0:124932|revolve round|surround ~ opinion_n_3:143264|complaint|Chinese language = 0.000624 | 0.3085 | -0.0264; around_v_0:124932|revolve round|surround ~ opinion_n_3:143267|idea|thought = 0.010256 | 0.3742 | -0.0366; around_v_0:124932|revolve round|surround ~ implement_v_6:047082|carry out|implement = 0.013793 | 0.4015 | 0.2071; around_v_0:124932|revolve round|surround ~ implement_v_7:081572|feel at ease|feel at ease = 0.010256 | 0.3575 | -0.0430; around_v_0:124932|revolve round|surround ~ implement_v_7:081573|ascertain|decide = 0.013793 | 0.3215 | 0.1502; around_v_0:124932|revolve round|surround ~ implement_v_7:081575|fulfil|realize = 0.013793 | 0.3541 | 0.2254; around_v_0:124932|revolve round|surround ~ combine_v_9:064548|be united in wedlock|marriage = 0.016667 | 0.3299 | -0.0183; around_v_0:124932|revolve round|surround ~ combine_v_9:064549|combination|merge = 0.018605 | 0.3487 | 0.0745; around_v_0:124932|revolve round|surround ~ traditional Chinese medicine_n_10:157339|traditional Chinese medicine and drugs|knowledge_drug = 0.000624 | 0.3520 | 0.0866; around_v_0:124932|revolve round|surround ~ work_vn_11:044068|work|affair = 0.015094 | 0.3478 | 0.1434; around_v_0:124932|revolve round|surround ~ reality_n_13:109077|reality|entity = 0.000624 | 0.3664 | 0.1503; around_v_0:124932|revolve round|surround ~ reality_n_13:109078|practice|thing = 0.000624 | 0.3907 | -0.0571; around_v_0:124932|revolve round|surround ~ want_v_16:140522|want to|expect = 0.010959 | 0.3375 | 0.1009; around_v_0:124932|revolve round|surround ~ want_v_16:140530|ask|require = 0.015094 | 0.3482 | 0.2090; around_v_0:124932|revolve round|surround ~ want_v_16:140532|ask for|seek = 0.018605 | 0.3648 | 0.0496'. To show the process, the values are not fused further here; an entry such as '0.018605 | 0.3648 | 0.0496' lists the three similarity values, whose fusion is α·0.018605 + β·0.3648 + γ·0.0496.
S303, constructing the disambiguation graph: in the disambiguation graph, senses are vertices and semantic relations between senses are edges; the three similarity values, integrated by the weight optimization algorithm based on simulated annealing, serve as the edge weights between senses. Here a sense refers to a triple, denoted Word(No., Sword, Enword), in which No. is the concept number, Sword is the first sememe and Enword is the English word; No., Sword and Enword form an organic whole that describes one and the same sense concept; in HowNet a sense concept number uniquely identifies a sense, from which the first sememe in its concept definition can be obtained, and the sense can further be mapped to an English word.
A sense in this triple form enables the three similarity calculation methods above to be integrated into one whole. Taking 'Chinese medicine' as an example, 'Chinese medicine' has two senses, corresponding to two sense triples: 'Chinese medicine (157329, people, practitioner of Chinese medicine)' and 'Chinese medicine (157332, knowledge, traditional Chinese science)'. The edge weight between any two vertices of the disambiguation graph, i.e. the semantic similarity between the senses, can now be obtained by the final fused similarity calculation.
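A minimal sketch of this triple as a data structure; the field names follow the patent's Word(No., Sword, Enword) notation, and the example values are the two senses of 'Chinese medicine' given above.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Word:
        no: str      # concept number: uniquely identifies a sense in HowNet
        sword: str   # first sememe in the HowNet concept definition
        enword: str  # English word the sense maps to

    senses = [
        Word("157329", "people", "practitioner of Chinese medicine"),
        Word("157332", "knowledge", "traditional Chinese science"),
    ]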
S4, selecting the correct word sense: score the candidate senses in the graph by graph scoring to obtain the score list of candidate senses, and select the highest-scoring sense as the correct sense. As shown in Fig. 4, the specific steps of selecting the correct sense are as follows:
S401, graph scoring: call the graph scoring method to score the importance of the sense concept vertices in the disambiguation graph; after graph scoring is completed, arrange the candidate sense concepts in descending order of score to form the candidate sense concept list. The graph scoring uses the PageRank algorithm, which evaluates the nodes in the graph on the basis of a Markov chain model; the PageRank score of a node depends on the PageRank scores of all nodes linked to it. The PageRank score of a node v is calculated as:
PR(v) = (1 - α)/N + α · Σ_{u ∈ in(v)} PR(u) / |out(u)|
wherein 1 - α denotes the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| denotes the out-degree of node u; and in(v) is the set of all nodes linked to node v.
Example: after graph scoring, the candidate sense concept list is obtained:
Chinese medicine_n_25:157332 2.1213090873827947E58;
Chinese medicine_n_25:157329 1.8434688340823378E58.
S402, selecting the correct sense: the correct sense is selected from the disambiguation result, covering the following two cases:
1. if the disambiguation result contains only one sense concept, that sense concept is taken as the correct sense;
2. if the disambiguation result is a sense list composed of multiple sense concepts, the sense concept with the highest score is taken as the correct sense.
Example: the sense concept with the highest score, namely 'Chinese medicine_n_25:157332', is selected as the correct sense.
Embodiment 2:
As shown in Fig. 5, the word sense disambiguation system based on a graph model of the invention comprises:
Context Knowledge extraction unit carries out part-of-speech tagging to ambiguity sentences, extracts notional word as Context Knowledge, notional word refers to Noun, verb, adjective, adverbial word;
a similarity calculation unit, for performing the English-based similarity calculation, the word-vector-based similarity calculation and the HowNet-based similarity calculation respectively. The similarity calculation unit comprises:
an English similarity calculation unit, for annotating the context knowledge with HowNet word sense information and performing sense mapping to obtain a set of English words, and then computing the similarity of the resulting English words with the word similarity algorithm based on word vectors and a knowledge base; since HowNet is bilingual, the sense mapping here directly takes the English word information recorded in HowNet;
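As a rough illustration of combining a word-vector score with a knowledge-base score, as this unit does, the sketch below uses WordNet through NLTK as a stand-in knowledge base and a plain synonym-set overlap in place of the full shortest-path term of the claimed algorithm; all names are illustrative, and the WordNet corpus must be downloaded beforehand (nltk.download('wordnet')).

    import numpy as np
    from nltk.corpus import wordnet as wn

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def synset_overlap(w1, w2, d=2.0):
        """sim_lap = d * count(w1, w2) / (count(w1) + count(w2))."""
        s1, s2 = set(wn.synsets(w1)), set(wn.synsets(w2))
        if not s1 or not s2:
            return 0.0
        return d * len(s1 & s2) / (len(s1) + len(s2))

    def english_similarity(w1, w2, vectors, beta=0.5):
        """Linear combination of vector cosine and knowledge-base overlap."""
        sim_vec = cosine(vectors[w1], vectors[w2])
        sim_kb = synset_overlap(w1, w2)  # simplified: no shortest-path term
        return beta * sim_vec + (1 - beta) * sim_kb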
a word vector similarity calculation unit, for training word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, retrieving from that file the word vectors of two given words, and taking the cosine similarity between the vectors as the similarity of the two words. It should be noted that when an ambiguous word has many senses, the trained word vector is likely to be biased towards its more common senses; for this reason, HowNet is used to convert the ambiguous word into the first sememes of its senses, i.e. of each of its concept definitions. As shown in Fig. 6, the ambiguous word "Chinese medicine" is converted into "people" and "knowledge";
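A minimal sketch of this unit, assuming the word vector file is in the plain-text format that word2vec writes (one word followed by its components per line); the loader and function names are illustrative:

    import numpy as np

    def load_vectors(path):
        """Load a word2vec text-format file into a {word: vector} dict."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                if len(parts) < 3:        # skip the optional header line
                    continue
                vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
        return vectors

    def word_vector_similarity(w1, w2, vectors):
        """Cosine similarity between the vectors of two words."""
        u, v = vectors[w1], vectors[w2]
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))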
a HowNet similarity calculation unit, for annotating the context knowledge with word sense information by means of HowNet, in the form of word vocabulary plus concept number, and computing the similarity between the senses with the concept similarity toolkit provided by HowNet.
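The patent relies on HowNet's own concept similarity toolkit; a present-day approximation is the open-source OpenHowNet package, whose interface (assumed here, and subject to version differences) exposes a sememe-based word similarity call:

    import OpenHowNet

    # OpenHowNet.download()  # one-time download of the HowNet data, if needed
    hownet = OpenHowNet.HowNetDict(init_sim=True)

    # Sememe-based similarity between two Chinese words, standing in for
    # the concept similarity used by the HowNet similarity unit.
    score = hownet.calculate_word_similarity("中医", "医生")
    print(score)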
a disambiguation graph construction unit, for optimizing the weights of the similarities with the simulated annealing algorithm to obtain the fused similarity, and then constructing the disambiguation graph with word concepts as vertices, semantic relations between concepts as edges, and the fused similarities as edge weights. The disambiguation graph construction unit comprises:
a weight optimization unit, for automatically optimizing, with the weight optimization algorithm based on simulated annealing, the three similarity values produced by the English-based, word-vector-based and HowNet-based similarity calculations, to obtain the optimal weight parameters; the simulated annealing algorithm performs parameter optimization according to the acceptance probability formula:

p = 1,  if result(x_new) ≥ result(x_old);
p = exp((result(x_new) − result(x_old)) / (δ·t)),  otherwise
where result(x) denotes the objective function, i.e. the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly drawn parameters; and x_old denotes the original parameters;
the two cases expressed by this parameter optimization formula are as follows (a minimal search sketch follows the case analysis):
(a) if the objective function value of the new parameters x_new is not less than that of the original parameters x_old, the new parameters x_new are accepted with probability p = 1;
(b) if the objective function value of x_new is less than that of x_old, the probability p = exp((result(x_new) − result(x_old)) / (δ·t)) serves as the basis for accepting x_new: a probability value is generated at random and compared with p:
① if the randomly generated value is not greater than p, the new parameters x_new are accepted;
② if the randomly generated value is greater than p, the new parameters x_new are discarded;
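A minimal sketch of this weight search, directly implementing the acceptance rule above; the perturbation scheme and cooling schedule are illustrative assumptions, and result(x) stands for the disambiguation accuracy measured on a development set:

    import math
    import random

    def perturb(x):
        """Randomly nudge the weights and renormalize so they sum to 1."""
        raw = [max(1e-6, w + random.uniform(-0.05, 0.05)) for w in x]
        s = sum(raw)
        return tuple(w / s for w in raw)

    def anneal_weights(result, t=1.0, t_min=1e-3, delta=0.95):
        """Search (alpha, beta, gamma) by simulated annealing."""
        x_old = (1 / 3, 1 / 3, 1 / 3)             # start from equal weights
        while t > t_min:
            x_new = perturb(x_old)
            gain = result(x_new) - result(x_old)
            # Case (a): accept with p = 1; case (b): accept with
            # p = exp((result(x_new) - result(x_old)) / (delta * t)).
            if gain >= 0 or random.random() <= math.exp(gain / (delta * t)):
                x_old = x_new
            t *= delta                            # cooling schedule (assumed)
        return x_old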
a similarity fusion unit: after weight optimization, the finally fused similarity between two word senses is

sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec

where ws and ws' denote the two word senses; sim_how is the HowNet-based similarity result, with weight α; sim_en is the word similarity result based on word vectors and the knowledge base, with weight β; sim_vec is the word-vector-based similarity result, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
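The fusion itself is then a single weighted sum; a minimal sketch with the weight constraint checked explicitly (function name illustrative):

    def fused_similarity(sim_how, sim_en, sim_vec, weights):
        """sim(ws, ws') = alpha*sim_how + beta*sim_en + gamma*sim_vec."""
        alpha, beta, gamma = weights
        assert abs(alpha + beta + gamma - 1.0) < 1e-9 and min(weights) >= 0
        return alpha * sim_how + beta * sim_en + gamma * sim_vec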
a disambiguation graph building unit, for building the disambiguation graph with word senses as vertices and the semantic relations between senses as edges, using the simulated-annealing-based weight optimization to integrate the three similarity values into the edge weights between senses; here, as above, a word sense is the triple Word(No., Sword, Enword), where No. is the concept number, Sword the first sememe word and Enword the English word; the three together form an organic whole describing a single word sense concept: in HowNet a concept number uniquely identifies a word sense, the first sememe word can be obtained from its concept definition, and the sense can then be mapped to an English word.
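Putting the pieces together, the disambiguation graph can be materialized with the networkx library, with one vertex per candidate sense and the fused similarity of the previous unit as edge weight; nx.pagerank can then play the role of the graph scoring unit. A sketch under these assumptions (helper names illustrative):

    import itertools
    import networkx as nx

    def build_disambiguation_graph(candidate_senses, pairwise_sims):
        """candidate_senses: list of sense labels (one vertex per sense);
        pairwise_sims: dict mapping a sense pair to its fused similarity."""
        g = nx.Graph()
        g.add_nodes_from(candidate_senses)
        for s1, s2 in itertools.combinations(candidate_senses, 2):
            w = pairwise_sims.get((s1, s2), pairwise_sims.get((s2, s1), 0.0))
            if w > 0:
                g.add_edge(s1, s2, weight=w)  # edge weight = fused similarity
        return g

    # Weighted PageRank over the disambiguation graph (graph scoring):
    # scores = nx.pagerank(g, alpha=0.85, weight="weight")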
a word sense correct selection unit, for scoring the candidate senses in the graph by graph scoring, obtaining a score list of the candidate senses, and selecting the highest-scoring one as the correct sense. The word sense correct selection unit comprises:
a graph scoring unit, for invoking the graph scoring method to score the importance of the word sense concept vertices in the disambiguation graph and, after scoring is complete, arranging the candidate sense concepts by score in descending order to form the candidate sense concept list; graph scoring uses the PageRank algorithm, which evaluates the nodes of a graph on the basis of a Markov chain model, the PageRank score of a node depending on the PageRank scores of all nodes linked to it; the PageRank score of a node v is computed as:

PR(v) = (1 − α)/N + α · Σ_{u ∈ in(v)} PR(u) / |out(u)|
where 1 − α is the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linking to node v;
a correct sense selection unit, for selecting the correct sense from the disambiguation result, covering the following two cases:
① if the disambiguation result contains only one sense concept, that sole sense concept is taken as the correct sense;
② if the disambiguation result is a sense list made up of several sense concepts, the sense concept with the highest score is taken as the correct sense.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A word sense disambiguation method based on a graph model, characterized by comprising the following steps:
S1, extracting context knowledge: performing part-of-speech tagging on the ambiguous sentence and extracting the content words as context knowledge, the content words being nouns, verbs, adjectives and adverbs;
S2, similarity calculation: performing the English-based similarity calculation, the word-vector-based similarity calculation and the HowNet-based similarity calculation respectively;
S3, constructing the disambiguation graph: optimizing the weights of the similarities with the simulated annealing algorithm to obtain the fused similarity, and then constructing the disambiguation graph with word concepts as vertices, semantic relations between concepts as edges, and the fused similarities as edge weights;
S4, correct selection of the word sense: scoring the candidate senses in the graph by graph scoring, obtaining a score list of the candidate senses, and selecting the highest-scoring one as the correct sense.

2. The word sense disambiguation method based on a graph model according to claim 1, characterized in that the specific steps of the similarity calculation in step S2 are as follows:
S201, English-based similarity calculation: annotating the context knowledge with HowNet word sense information and performing sense mapping to obtain a set of English words, and then computing the similarity of the resulting English words with the word similarity algorithm based on word vectors and a knowledge base;
S202, word-vector-based similarity calculation: training word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, retrieving from that file the word vectors of two given words, and taking the cosine similarity between the vectors as the similarity of the two words;
S203, HowNet-based similarity calculation: annotating the context knowledge with word sense information by means of HowNet, in the form of word vocabulary plus concept number, and computing the similarity between the senses with the concept similarity toolkit provided by HowNet.

3. The word sense disambiguation method based on a graph model according to claim 2, characterized in that the word similarity algorithm based on word vectors and a knowledge base in step S201 is specifically as follows:
S20101, determining whether words or phrases are given:
① if two English words are given, the similarity between them is obtained by computing the cosine similarity of the two word vectors;
② if a given item is a phrase, the word vectors of the words in the phrase are added to obtain the vector representation of the phrase, and the phrase similarity is obtained as

sim(p1, p2) = cos((1/|p1|)·Σ_i w_i, (1/|p2|)·Σ_j w_j)

where |p1| and |p2| denote the numbers of words contained in phrases p1 and p2, and w_i and w_j denote the i-th word of p1 and the j-th word of p2, respectively;
S20102, iteratively searching the synonym sets related to the two English words until the number of iteration steps exceeds γ;
S20103, constructing a synonym set graph from the two English words and their related synonym sets;
S20104, within the set distance range in the graph, computing the degree of overlap of the synonym sets related to the two English words as

sim_lap(w_i, w_j) = d·count(w_i, w_j) / (count(w_i) + count(w_j))

where count(w_i, w_j) is the number of synonym sets shared by words w_i and w_j; count(w_i) and count(w_j) are the numbers of synonym sets of w_i and w_j respectively; and d is the value of the set distance range;
S20105, computing the shortest path between w_i and w_j in the graph with Dijkstra's algorithm, and obtaining the similarity of w_i and w_j as

sim_bn(w_i, w_j) = α·1/(δ·path) + (1 − α)·sim_lap(w_i, w_j)

where path is the shortest path between w_i and w_j; δ adjusts the value of the similarity; sim_lap(w_i, w_j) denotes the degree of overlap between w_i and w_j; and the parameter α is an adjustment factor balancing the two parts of the formula;
S20106, linearly combining the similarity sim_vec obtained by the word vector method in step S20101 and the similarity sim_bn obtained by the knowledge base method in step S20105 into the final similarity

sim_final(w_i, w_j) = β·sim_vec + (1 − β)·sim_bn

where sim_bn and sim_vec denote the similarities obtained by the knowledge base method and the word vector method respectively, and the parameter β is an adjustment factor balancing the two results;
S20107, returning the similarity sim_final.

4. The word sense disambiguation method based on a graph model according to claim 1, characterized in that the specific steps of constructing the disambiguation graph in step S3 are as follows:
S301, weight optimization: automatically optimizing the three similarity values of step S2 with the weight optimization algorithm based on simulated annealing to obtain the optimal weight parameters;
S302, similarity fusion: after weight optimization, the finally fused similarity between two word senses is

sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec

where ws and ws' denote the two word senses; sim_how is the HowNet-based similarity result, with weight α; sim_en is the word similarity result based on word vectors and the knowledge base, with weight β; sim_vec is the word-vector-based similarity result, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
S303, constructing the disambiguation graph: the disambiguation graph takes word senses as vertices and the semantic relations between senses as edges, and the simulated-annealing-based weight optimization integrates the three similarity values into the edge weights between senses.

5. The word sense disambiguation method based on a graph model according to claim 4, characterized in that the simulated annealing algorithm in step S301 performs parameter optimization according to the formula

p = 1,  if result(x_new) ≥ result(x_old);
p = exp((result(x_new) − result(x_old)) / (δ·t)),  otherwise

where result(x) denotes the objective function, i.e. the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly drawn parameters; and x_old denotes the original parameters;
the two cases expressed by this parameter optimization formula are:
(a) if the objective function value of the new parameters x_new is not less than that of the original parameters x_old, the new parameters x_new are accepted with probability p = 1;
(b) if the objective function value of x_new is less than that of x_old, the probability p = exp((result(x_new) − result(x_old)) / (δ·t)) serves as the basis for accepting x_new: a probability value is generated at random and compared with p:
① if the randomly generated value is not greater than p, the new parameters x_new are accepted;
② if the randomly generated value is greater than p, the new parameters x_new are discarded;
the word sense in step S303 is a triple, written Word(No., Sword, Enword), where No. is the concept number, Sword is the first sememe word and Enword is the English word; No., Sword and Enword form an organic whole describing a single word sense concept; in HowNet a concept number uniquely identifies a word sense, the first sememe word can be obtained from its concept definition, and the sense can then be mapped to an English word.

6. The word sense disambiguation method based on a graph model according to claim 1, characterized in that the specific steps of selecting the correct word sense in step S4 are as follows:
S401, graph scoring: invoking the graph scoring method to score the importance of the word sense concept vertices in the disambiguation graph and, after scoring is complete, arranging the candidate sense concepts by score in descending order to form the candidate sense concept list;
S402, selecting the correct sense: selecting the correct sense from the disambiguation result, covering the following two cases:
① if the disambiguation result contains only one sense concept, that sole sense concept is taken as the correct sense;
② if the disambiguation result is a sense list made up of several sense concepts, the sense concept with the highest score is taken as the correct sense.

7. The word sense disambiguation method based on a graph model according to claim 6, characterized in that the graph scoring in step S401 uses the PageRank algorithm, which evaluates the nodes of a graph on the basis of a Markov chain model, the PageRank score of a node depending on the PageRank scores of all nodes linked to it; the PageRank score of a node v is computed as

PR(v) = (1 − α)/N + α · Σ_{u ∈ in(v)} PR(u) / |out(u)|

where 1 − α is the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linking to node v.

8. A word sense disambiguation system based on a graph model, characterized in that the system comprises:
a context knowledge extraction unit, which performs part-of-speech tagging on the ambiguous sentence and extracts the content words as context knowledge, the content words being nouns, verbs, adjectives and adverbs;
a similarity calculation unit, for performing the English-based similarity calculation, the word-vector-based similarity calculation and the HowNet-based similarity calculation respectively;
a disambiguation graph construction unit, for optimizing the weights of the similarities with the simulated annealing algorithm to obtain the fused similarity, and then constructing the disambiguation graph with word concepts as vertices, semantic relations between concepts as edges, and the fused similarities as edge weights;
a word sense correct selection unit, for scoring the candidate senses in the graph by graph scoring, obtaining a score list of the candidate senses, and selecting the highest-scoring one as the correct sense.

9. The word sense disambiguation system based on a graph model according to claim 8, characterized in that the similarity calculation unit comprises:
an English similarity calculation unit, for annotating the context knowledge with HowNet word sense information and performing sense mapping to obtain a set of English words, and then computing the similarity of the resulting English words with the word similarity algorithm based on word vectors and a knowledge base;
a word vector similarity calculation unit, for training word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, retrieving from that file the word vectors of two given words, and taking the cosine similarity between the vectors as the similarity of the two words;
a HowNet similarity calculation unit, for annotating the context knowledge with word sense information by means of HowNet, in the form of word vocabulary plus concept number, and computing the similarity between the senses with the concept similarity toolkit provided by HowNet;
and the disambiguation graph construction unit comprises:
a weight optimization unit, for automatically optimizing, with the weight optimization algorithm based on simulated annealing, the three similarity values of the English-based, word-vector-based and HowNet-based similarity calculations to obtain the optimal weight parameters; the simulated annealing algorithm performs parameter optimization according to the formula

p = 1,  if result(x_new) ≥ result(x_old);
p = exp((result(x_new) − result(x_old)) / (δ·t)),  otherwise

where result(x) denotes the objective function, i.e. the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly drawn parameters; and x_old denotes the original parameters;
the two cases expressed by this parameter optimization formula are:
(a) if the objective function value of the new parameters x_new is not less than that of the original parameters x_old, the new parameters x_new are accepted with probability p = 1;
(b) if the objective function value of x_new is less than that of x_old, the probability p = exp((result(x_new) − result(x_old)) / (δ·t)) serves as the basis for accepting x_new: a probability value is generated at random and compared with p:
① if the randomly generated value is not greater than p, the new parameters x_new are accepted;
② if the randomly generated value is greater than p, the new parameters x_new are discarded;
a similarity fusion unit: after weight optimization, the finally fused similarity between two word senses is

sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec

where ws and ws' denote the two word senses; sim_how is the HowNet-based similarity result, with weight α; sim_en is the word similarity result based on word vectors and the knowledge base, with weight β; sim_vec is the word-vector-based similarity result, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
a disambiguation graph building unit, for building the disambiguation graph with word senses as vertices and the semantic relations between senses as edges, using the simulated-annealing-based weight optimization to integrate the three similarity values into the edge weights between senses; the word sense is a triple, written Word(No., Sword, Enword), where No. is the concept number, Sword is the first sememe word and Enword is the English word; No., Sword and Enword form an organic whole describing a single word sense concept; in HowNet a concept number uniquely identifies a word sense, the first sememe word can be obtained from its concept definition, and the sense can then be mapped to an English word.

10. The word sense disambiguation system based on a graph model according to claim 8 or 9, characterized in that the word sense correct selection unit comprises:
a graph scoring unit, for invoking the graph scoring method to score the importance of the word sense concept vertices in the disambiguation graph and, after scoring is complete, arranging the candidate sense concepts by score in descending order to form the candidate sense concept list; graph scoring uses the PageRank algorithm, which evaluates the nodes of a graph on the basis of a Markov chain model, the PageRank score of a node depending on the PageRank scores of all nodes linked to it; the PageRank score of a node v is computed as

PR(v) = (1 − α)/N + α · Σ_{u ∈ in(v)} PR(u) / |out(u)|

where 1 − α is the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linking to node v;
a correct sense selection unit, for selecting the correct sense from the disambiguation result, covering the following two cases:
① if the disambiguation result contains only one sense concept, that sole sense concept is taken as the correct sense;
② if the disambiguation result is a sense list made up of several sense concepts, the sense concept with the highest score is taken as the correct sense.
CN201811503355.7A 2018-12-10 2018-12-10 Word sense disambiguation method and system based on graph model Active CN109359303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811503355.7A CN109359303B (en) 2018-12-10 2018-12-10 Word sense disambiguation method and system based on graph model


Publications (2)

Publication Number Publication Date
CN109359303A 2019-02-19
CN109359303B CN109359303B (en) 2023-04-07

Family

ID=65332018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811503355.7A Active CN109359303B (en) 2018-12-10 2018-12-10 Word sense disambiguation method and system based on graph model

Country Status (1)

Country Link
CN (1) CN109359303B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002017128A1 (en) * 2000-08-24 2002-02-28 Science Applications International Corporation Word sense disambiguation
WO2014087506A1 (en) * 2012-12-05 2014-06-12 三菱電機株式会社 Word meaning estimation device, word meaning estimation method, and word meaning estimation program
WO2016050066A1 (en) * 2014-09-29 2016-04-07 华为技术有限公司 Method and device for parsing interrogative sentence in knowledge base
CN105760363A (en) * 2016-02-17 2016-07-13 腾讯科技(深圳)有限公司 Text file word sense disambiguation method and device
CN105893346A (en) * 2016-03-30 2016-08-24 齐鲁工业大学 Graph model word sense disambiguation method based on dependency syntax tree
WO2017217661A1 (en) * 2016-06-15 2017-12-21 울산대학교 산학협력단 Word sense embedding apparatus and method using lexical semantic network, and homograph discrimination apparatus and method using lexical semantic network and word embedding
CN106951684A (en) * 2017-02-28 2017-07-14 北京大学 A kind of method of entity disambiguation in medical conditions idagnostic logout
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
CN108959461A (en) * 2018-06-15 2018-12-07 东南大学 A kind of entity link method based on graph model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
鹿文鹏 (Lu Wenpeng): "Research on word sense disambiguation methods based on dependency and domain knowledge", China Master's Theses Full-text Database, Information Science and Technology series *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413989A (en) * 2019-06-19 2019-11-05 北京邮电大学 A Text Domain Determination Method and System Based on Domain Semantic Relationship Graph
CN110413989B (en) * 2019-06-19 2020-11-20 北京邮电大学 A text domain determination method and system based on domain semantic relation graph
CN110362691A (en) * 2019-07-19 2019-10-22 大连语智星科技有限公司 Syntax tree library construction system
CN110598209A (en) * 2019-08-21 2019-12-20 合肥工业大学 Method, system and storage medium for extracting keywords
CN110705295A (en) * 2019-09-11 2020-01-17 北京航空航天大学 Entity name disambiguation method based on keyword extraction
CN110705295B (en) * 2019-09-11 2021-08-24 北京航空航天大学 Entity name disambiguation method based on keyword extraction
CN110766072A (en) * 2019-10-22 2020-02-07 探智立方(北京)科技有限公司 Automatic generation method of computational graph evolution AI model based on structural similarity
CN111310475B (en) * 2020-02-04 2023-03-10 支付宝(杭州)信息技术有限公司 Training method and device of word sense disambiguation model
CN111310475A (en) * 2020-02-04 2020-06-19 支付宝(杭州)信息技术有限公司 Training method and device of word sense disambiguation model
CN111783418A (en) * 2020-06-09 2020-10-16 北京北大软件工程股份有限公司 Chinese meaning representation learning method and device
CN111783418B (en) * 2020-06-09 2024-04-05 北京北大软件工程股份有限公司 Chinese word meaning representation learning method and device
CN112256885A (en) * 2020-10-23 2021-01-22 上海恒生聚源数据服务有限公司 Label disambiguation method, device, equipment and computer readable storage medium
CN112256885B (en) * 2020-10-23 2023-10-27 上海恒生聚源数据服务有限公司 Label disambiguation method, device, equipment and computer readable storage medium
CN113158687B (en) * 2021-04-29 2021-12-28 新声科技(深圳)有限公司 Semantic disambiguation method and device, storage medium and electronic device
CN113158687A (en) * 2021-04-29 2021-07-23 新声科技(深圳)有限公司 Semantic disambiguation method and device, storage medium and electronic device
CN115114397A (en) * 2022-05-09 2022-09-27 泰康保险集团股份有限公司 Annuity information updating method, device, electronic device, storage medium, and program
CN115114397B (en) * 2022-05-09 2024-05-31 泰康保险集团股份有限公司 Annuity information updating method, annuity information updating device, electronic device, storage medium, and program
CN119477229A (en) * 2025-01-15 2025-02-18 浙商银行股份有限公司 A smart contract disambiguation method, device, equipment and storage medium
CN119477229B (en) * 2025-01-15 2025-06-20 浙商银行股份有限公司 A smart contract disambiguation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN109359303B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN109359303A (en) A word sense disambiguation method and system based on graph model
Qi et al. Openhownet: An open sememe-based lexical knowledge base
US9514098B1 (en) Iteratively learning coreference embeddings of noun phrases using feature representations that include distributed word representations of the noun phrases
CN109213995A (en) A kind of across language text similarity assessment technology based on the insertion of bilingual word
CN107305539A (en) A kind of text tendency analysis method based on Word2Vec network sentiment new word discoveries
CN104361127A (en) Multilanguage question and answer interface fast constituting method based on domain ontology and template logics
Ramisch et al. mwetoolkit: A framework for multiword expression identification.
Vlachos et al. A new corpus and imitation learning framework for context-dependent semantic parsing
CN109614620A (en) A Method and System for Word Sense Disambiguation Based on HowNet
Xie et al. Knowledge base question answering based on deep learning models
Piryani et al. Sentiment analysis in Nepali: Exploring machine learning and lexicon-based approaches
US20150161109A1 (en) Reordering words for machine translation
Nishihara et al. Word complexity estimation for Japanese lexical simplification
Kang Spoken language to sign language translation system based on HamNoSys
Houssein et al. Semantic protocol and resource description framework query language: a comprehensive review
Park et al. Frame-Semantic Web: a Case Study for Korean.
Kaffee et al. Multilingual knowledge graphs and low-resource languages: A review
CN108255818B (en) A compound machine translation method using segmentation technology
Harshawardhan et al. Phrase based English-Tamil translation system by concept labeling using translation memory
CN108280066B (en) Off-line translation method from Chinese to English
Marinova Evaluation of stacked embeddings for Bulgarian on the downstream tasks POS and NERC
Papadias et al. Educing knowledge from text: Semantic information extraction of spatial concepts and places
Huang et al. A simple, straightforward and effective model for joint bilingual terms detection and word alignment in SMT
Passban et al. Improving phrase-based SMT using cross-granularity embedding similarity
Le et al. Technical term similarity model for natural language based data retrieval in civil infrastructure projects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190219

Assignee: SHANDONG ZHENGKAI NEW MATERIALS CO.,LTD.

Assignor: ZAOZHUANG University

Contract record no.: X2024980014476

Denomination of invention: A method and system for word sense disambiguation based on graph model

Granted publication date: 20230407

License type: Common License

Record date: 20240912

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190219

Assignee: Shandong East Rail Power Technology Co.,Ltd.

Assignor: ZAOZHUANG University

Contract record no.: X2025980009984

Denomination of invention: A method and system for word sense disambiguation based on graph model

Granted publication date: 20230407

License type: Common License

Record date: 20250605

Application publication date: 20190219

Assignee: Shandong Chaoyue Garment Co.,Ltd.

Assignor: ZAOZHUANG University

Contract record no.: X2025980009974

Denomination of invention: A method and system for word sense disambiguation based on graph model

Granted publication date: 20230407

License type: Common License

Record date: 20250605

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190219

Assignee: ZAOZHUANG AOSEN MUSICAL INSTRUMENT CO.,LTD.

Assignor: ZAOZHUANG University

Contract record no.: X2025980010355

Denomination of invention: A method and system for word sense disambiguation based on graph model

Granted publication date: 20230407

License type: Common License

Record date: 20250612