CN109359303A - Word sense disambiguation method and system based on a graph model - Google Patents
Word sense disambiguation method and system based on a graph model
- Publication number
- CN109359303A CN109359303A CN201811503355.7A CN201811503355A CN109359303A CN 109359303 A CN109359303 A CN 109359303A CN 201811503355 A CN201811503355 A CN 201811503355A CN 109359303 A CN109359303 A CN 109359303A
- Authority
- CN
- China
- Prior art keywords
- word
- meaning
- similarity
- concept
- sim
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/30—Semantic analysis
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a word sense disambiguation method and system based on a graph model, belonging to the technical field of natural language processing. The technical problem to be solved is how to combine multiple Chinese and English resources so that their advantages complement each other, fully mine the disambiguation knowledge contained in those resources, and improve word sense disambiguation performance. The technical solution adopted is: 1. a word sense disambiguation method based on a graph model, comprising the following steps: S1, extracting context knowledge: perform part-of-speech tagging on the ambiguous sentence and extract the content words as context knowledge; the content words are the nouns, verbs, adjectives, and adverbs; S2, similarity computation: separately compute an English-based similarity, a word-vector-based similarity, and a HowNet-based similarity; S3, building the disambiguation graph; S4, selecting the correct sense. 2. A word sense disambiguation system based on a graph model, comprising a context knowledge extraction unit, a similarity computation unit, a disambiguation graph construction unit, and a correct sense selection unit.
Description
Technical field
The present invention relates to the technical field of natural language processing, and specifically to a word sense disambiguation method and system based on a graph model.
Background art
Word sense disambiguation is the task of determining the specific sense of an ambiguous word from the specific context in which it occurs. It is a basic research problem in natural language processing, and it directly affects upper-layer applications such as machine translation, information extraction, information retrieval, text classification, and sentiment analysis. Polysemy is pervasive both in Chinese and in English and other Western languages.
Traditional graph-model approaches to Chinese word sense disambiguation mainly exploit one or more Chinese knowledge resources and are hampered by the scarcity of such resources, so their disambiguation performance is low. How to combine multiple Chinese and English resources so that their advantages complement each other, fully mine the disambiguation knowledge in those resources, and thereby improve word sense disambiguation performance is therefore a technical problem urgently awaiting a solution.
Patent document CN105893346A discloses a graph-model word sense disambiguation method based on dependency syntax trees. Its steps are: 1. preprocess the sentence and extract the content words to be disambiguated, mainly including normalization, tokenization, and lemmatization; 2. perform dependency parsing on the sentence and build its dependency syntax tree; 3. obtain the distance between words on the dependency syntax tree, i.e., the length of the shortest path; 4. build a disambiguation knowledge graph for the sense concepts of the words in the sentence according to a knowledge base; 5. compute a graph score for each sense node from the lengths of the semantic association paths between sense nodes in the disambiguation knowledge graph, the weights of the associated edges, and the distances of the path endpoints on the dependency syntax tree; 6. for each ambiguous word, select the sense with the highest graph score as the correct sense. However, that solution exploits the semantic associations contained in BabelNet rather than the semantic knowledge in HowNet; it is suited to English word sense disambiguation but not to Chinese, and it cannot solve the problem of combining multiple Chinese and English resources so that their advantages complement each other, fully mining the disambiguation knowledge in those resources, and improving word sense disambiguation performance.
Summary of the invention
The technical task of the invention is to provide a word sense disambiguation method and system based on a graph model, so as to solve the problem of how to combine multiple Chinese and English resources so that their advantages complement each other, fully mine the disambiguation knowledge in those resources, and improve word sense disambiguation performance.
The technical task of the invention is achieved in the following manner. A word sense disambiguation method based on a graph model comprises the following steps:
S1, extracting context knowledge: perform part-of-speech tagging on the ambiguous sentence and extract the content words as context knowledge; the content words are the nouns, verbs, adjectives, and adverbs;
S2, similarity computation: separately compute an English-based similarity, a word-vector-based similarity, and a HowNet-based similarity;
S3, building the disambiguation graph: optimize the weights of the similarities with a simulated annealing algorithm to obtain a fused similarity, then construct the disambiguation graph with the sense concepts of the words as vertices, the semantic relations between concepts as edges, and the fused similarities as edge weights;
S4, selecting the correct sense: score the candidate senses in the graph with a graph scoring method, obtain a score list of the candidate senses, and select the highest-scoring sense as the correct sense.
Preferably, the similarity computation in step S2 comprises the following steps:
S201, English-based similarity: annotate the context knowledge with HowNet sense information and perform sense mapping to obtain a set of English words; then apply a word similarity algorithm based on word vectors and a knowledge base to compute the similarity between the resulting English words. Since HowNet is bilingual, the sense mapping here directly takes the English word information recorded in HowNet;
S202, word-vector-based similarity: the Sogou whole-network news corpus (1.43 GB in total) is used to train word vectors with Google's word2vec toolkit, producing a word vector file; the word vectors of two given words are looked up in this file, and the cosine similarity between the two vectors is taken as their similarity;
S203, HowNet-based similarity: annotate the context knowledge with sense information using HowNet, in the form of word plus concept number, and compute the similarity between senses with the concept similarity toolkit provided by HowNet.
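As an illustration of the word-vector-based similarity in step S202, the following is a minimal sketch; it assumes the trained vectors are loaded with gensim rather than the original word2vec toolkit, and the file name is hypothetical.

```python
import numpy as np
from gensim.models import KeyedVectors

# Word vectors trained on the Sogou news corpus (file name is hypothetical).
vectors = KeyedVectors.load_word2vec_format("sogou_news.vec", binary=False)

def cosine_similarity(w1: str, w2: str) -> float:
    """Step S202: cosine of the two word vectors as the similarity of the two words."""
    v1, v2 = vectors[w1], vectors[w2]
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```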
More preferably, the word similarity algorithm based on word vectors and a knowledge base in step S201 is as follows:
S20101, determine whether the given items are words or phrases:
(1) if two English words are given, their similarity is obtained by computing the cosine similarity of their two word vectors;
(2) if a given item is a phrase, the word vectors of the words in the phrase are summed to obtain a vector representation of the phrase, from which the phrase similarity is obtained:

sim_vec(p1, p2) = cos(Σ_{i=1..|p1|} v(w_i), Σ_{j=1..|p2|} v(w_j))

where |p1| and |p2| denote the numbers of words contained in phrases p1 and p2, v(w) denotes the vector of word w, and w_i and w_j denote the i-th word of p1 and the j-th word of p2, respectively;
S20102, iteratively search for the synsets related to the two English words until the number of iteration steps exceeds γ;
S20103, build a synset graph from the two English words and the synsets related to them;
S20104, within the set distance range, compute the overlap of the synsets related to the two English words in the graph:

sim_lap(wi, wj) = d * count(wi, wj) / (count(wi) + count(wj))

where count(wi, wj) is the number of synsets that words wi and wj have in common, count(wi) and count(wj) are the numbers of synsets that wi and wj each have, and d is the value of the set distance range;
S20105, compute the shortest path between wi and wj in the graph with Dijkstra's algorithm and obtain the similarity of wi and wj:

sim_bn(wi, wj) = α * (1 / δ^path) + (1 - α) * sim_lap(wi, wj)

where path is the shortest path between wi and wj, δ is a value used to scale the similarity, sim_lap(wi, wj) is the overlap between wi and wj, and the parameter α is a regulating factor that balances the two terms of the formula;
S20106, linearly combine the similarity sim_vec obtained by the word-vector method in step S20101 with the similarity sim_bn obtained by the knowledge-base method in step S20105 to obtain the final similarity:

sim_final(wi, wj) = β * sim_vec + (1 - β) * sim_bn

where sim_bn and sim_vec denote the similarities obtained by the knowledge-base method and the word-vector method, respectively, and the parameter β is a regulating factor that balances the two results;
S20107, return the similarity sim_final.
Preferably, building the disambiguation graph in step S3 comprises the following steps:
S301, weight optimization: automatically optimize the three similarity values of step S2 with a weight optimization algorithm based on simulated annealing to obtain the optimal weight parameters;
S302, similarity fusion: after weight optimization, the final fused similarity between two senses is

Sim(ws, ws') = α * sim_how + β * sim_en + γ * sim_vec

where ws and ws' denote two senses; sim_how is the HowNet-based similarity, with weight α; sim_en is the word similarity based on word vectors and a knowledge base, with weight β; sim_vec is the word-vector-based similarity, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
S303, building the disambiguation graph: the disambiguation graph takes the senses as vertices and the semantic relations between senses as edges; the three similarity values, integrated by the simulated-annealing weight optimization algorithm, serve as the edge weights between senses.
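A one-function sketch of the fusion in step S302, assuming the three similarities of step S2 have been computed and the weights come from the simulated annealing optimization of step S301.

```python
def fused_similarity(sim_how: float, sim_en: float, sim_vec: float,
                     alpha: float, beta: float, gamma: float) -> float:
    """Step S302: Sim(ws, ws') = alpha*sim_how + beta*sim_en + gamma*sim_vec."""
    assert min(alpha, beta, gamma) >= 0.0 and abs(alpha + beta + gamma - 1.0) < 1e-9
    return alpha * sim_how + beta * sim_en + gamma * sim_vec
```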
More preferably, the simulated annealing algorithm in step S301 performs parameter optimization according to the acceptance probability

p = 1, if result(x_new) >= result(x_old)
p = exp((result(x_new) - result(x_old)) / (δ * t)), if result(x_new) < result(x_old)

where result(x) is the objective function, namely the disambiguation accuracy; δ is the cooling rate; t is the current temperature; x_new is the newly sampled parameter; and x_old is the original parameter.
The formula covers the following two cases:
(a) if the objective value of the new parameter x_new is not less than that of the original parameter x_old, the new parameter x_new is accepted with probability p = 1;
(b) if the objective value of the new parameter x_new is less than that of the original parameter x_old, p = exp((result(x_new) - result(x_old)) / (δ * t)) serves as the basis for accepting x_new: a probability value is generated at random and compared with p:
(1) if the randomly generated value does not exceed p, the new parameter x_new is accepted;
(2) if the randomly generated value is greater than p, the new parameter x_new is discarded.
A sense in step S303 is a triple, written Word(No., Sword, Enword), where No. is the concept number, Sword is the first sememe word, and Enword is the English word. The three elements No., Sword, and Enword form an organic whole describing one and the same sense concept: a sense concept number uniquely identifies a sense in HowNet, from which the first sememe in its concept definition can be obtained, and the sense can in turn be mapped to an English word.
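A minimal sketch of the sense triple as a data structure; the field names follow the Word(No., Sword, Enword) notation above, and the example values are taken from the "Chinese medicine" example in the embodiments.

```python
from typing import NamedTuple

class Sense(NamedTuple):
    """Sense triple Word(No., Sword, Enword)."""
    no: str      # concept number, unique per sense in HowNet
    sword: str   # first sememe word in the concept definition
    enword: str  # English word the sense maps to

sense = Sense(no="157332", sword="knowledge", enword="traditional Chinese medical science")
```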
Preferably, selecting the correct sense in step S4 comprises the following steps:
S401, graph scoring: call a graph scoring method to score the importance of the sense concept vertices in the disambiguation graph; after scoring, the candidate sense concepts are arranged in descending order of score to form the candidate sense concept list;
S402, selecting the correct sense: the correct sense is selected from the disambiguation result in one of two cases:
(1) if the disambiguation result contains only one sense concept, that sense concept is taken as the correct sense;
(2) if the disambiguation result is a sense list consisting of several sense concepts, the sense concept with the highest score is taken as the correct sense.
More preferably, the graph scoring in step S401 uses the PageRank algorithm, which evaluates the nodes of the graph on the basis of a Markov chain model: the PageRank score of a node depends on the PageRank scores of all nodes linked to it. The PageRank score of a node v is computed as

PR(v) = (1 - α)/N + α * Σ_{u ∈ in(v)} PR(u)/|out(u)|

where 1 - α is the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing along the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linked to node v.
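A minimal power-iteration sketch of the PageRank scoring above; the adjacency mapping, damping value, and iteration count are illustrative.

```python
def pagerank(out_links: dict[str, list[str]], alpha: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """PR(v) = (1 - alpha)/N + alpha * sum(PR(u)/|out(u)| for u in in(v))."""
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - alpha) / n for v in nodes}
        for u, targets in out_links.items():
            if targets:  # u spreads its score over its out-links
                share = alpha * pr[u] / len(targets)
                for v in targets:
                    new[v] += share
        pr = new
    return pr
```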
A word sense disambiguation system based on a graph model comprises:
a context knowledge extraction unit, which performs part-of-speech tagging on the ambiguous sentence and extracts the content words as context knowledge; the content words are the nouns, verbs, adjectives, and adverbs;
a similarity computation unit, which separately computes the English-based similarity, the word-vector-based similarity, and the HowNet-based similarity;
a disambiguation graph construction unit, which optimizes the weights of the similarities with the simulated annealing algorithm to obtain the fused similarity and then builds the disambiguation graph with the sense concepts of the words as vertices, the semantic relations between concepts as edges, and the fused similarities as edge weights;
a correct sense selection unit, which scores the candidate senses in the graph with a graph scoring method, obtains the score list of the candidate senses, and selects the sense with the highest score as the correct sense.
Preferably, the similarity computation unit comprises:
an English similarity computation unit, which annotates the context knowledge with HowNet sense information, performs sense mapping to obtain a set of English words, and then applies the word similarity algorithm based on word vectors and a knowledge base to the resulting English words; since HowNet is bilingual, the sense mapping here directly takes the English word information in HowNet;
a word-vector similarity computation unit, which trains word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, looks up the word vectors of two given words in the file, and takes the cosine similarity between the two vectors as their similarity; it should be noted that the more senses an ambiguous word has, the more the trained word vector file is likely to lean towards the more common senses of that word;
a HowNet similarity computation unit, which annotates the context knowledge with sense information using HowNet, in the form of word plus concept number, and computes the similarity between senses with the concept similarity toolkit provided by HowNet.
The disambiguation graph construction unit comprises:
a weight optimization unit, which applies the weight optimization algorithm based on simulated annealing to the three similarity values (the English-based similarity, the word-vector-based similarity, and the HowNet-based similarity) to obtain the optimal weight parameters automatically; the simulated annealing algorithm performs parameter optimization according to the acceptance probability

p = 1, if result(x_new) >= result(x_old)
p = exp((result(x_new) - result(x_old)) / (δ * t)), if result(x_new) < result(x_old)

where result(x) is the objective function, namely the disambiguation accuracy; δ is the cooling rate; t is the current temperature; x_new is the newly sampled parameter; and x_old is the original parameter;
the formula covers the following two cases:
(a) if the objective value of the new parameter x_new is not less than that of the original parameter x_old, the new parameter x_new is accepted with probability p = 1;
(b) if the objective value of the new parameter x_new is less than that of the original parameter x_old, p = exp((result(x_new) - result(x_old)) / (δ * t)) serves as the basis for accepting x_new: a probability value is generated at random and compared with p:
(1) if the randomly generated value does not exceed p, the new parameter x_new is accepted;
(2) if the randomly generated value is greater than p, the new parameter x_new is discarded;
a similarity fusion unit: after weight optimization, the final fused similarity between two senses is

Sim(ws, ws') = α * sim_how + β * sim_en + γ * sim_vec

where ws and ws' denote two senses; sim_how is the HowNet-based similarity, with weight α; sim_en is the word similarity based on word vectors and a knowledge base, with weight β; sim_vec is the word-vector-based similarity, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
a disambiguation graph building unit, which builds the disambiguation graph with the senses as vertices and the semantic relations between senses as edges, using the three similarity values integrated by the simulated-annealing weight optimization algorithm as the edge weights between senses; here a sense is a triple, written Word(No., Sword, Enword), where No. is the concept number, Sword is the first sememe word, and Enword is the English word; the three elements form an organic whole describing one and the same sense concept: a sense concept number uniquely identifies a sense in HowNet, from which the first sememe in its concept definition can be obtained, and the sense can in turn be mapped to an English word.
More preferably, the correct sense selection unit comprises:
a graph scoring unit, which calls the graph scoring method to score the importance of the sense concept vertices in the disambiguation graph; after scoring, the candidate sense concepts are arranged in descending order of score to form the candidate sense concept list; the graph scoring uses the PageRank algorithm, which evaluates the nodes of the graph on the basis of a Markov chain model: the PageRank score of a node depends on the PageRank scores of all nodes linked to it, and the score of a node v is computed as

PR(v) = (1 - α)/N + α * Σ_{u ∈ in(v)} PR(u)/|out(u)|

where 1 - α is the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing along the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linked to node v;
a correct sense selection subunit, which selects the correct sense from the disambiguation result in one of two cases:
(1) if the disambiguation result contains only one sense concept, that sense concept is taken as the correct sense;
(2) if the disambiguation result is a sense list consisting of several sense concepts, the sense concept with the highest score is taken as the correct sense.
The word sense disambiguation method and system based on a graph model of the invention have the following advantages:
(1) by combining multiple Chinese and English resources whose advantages complement each other, the invention fully mines the disambiguation knowledge in those resources and helps improve word sense disambiguation performance;
(2) the invention separately computes the English-based similarity, the word-vector-based similarity, and the HowNet-based similarity, ensuring that multiple knowledge resources are effectively integrated and the disambiguation accuracy is improved;
(3) the invention optimizes the similarity weights with the simulated annealing algorithm to obtain the fused similarity, and then builds the disambiguation graph with the sense concepts of the words as vertices, the semantic relations between concepts as edges, and the fused similarities as edge weights, guaranteeing that the similarity values of multiple knowledge resources are optimized automatically;
(4) when computing the English similarity, the invention annotates the context knowledge with HowNet sense information and performs sense mapping to obtain the English word set, ensuring that the Chinese and English knowledge resources are aligned automatically;
(5) the invention scores the candidate senses in the graph with the graph scoring method, obtains the score list of the candidate senses, and selects the highest-scoring sense as the correct sense, so the correct sense of the target ambiguous word can be selected automatically.
Description of the drawings
The invention is further described below with reference to the accompanying drawings.
Figure 1 is a flow diagram of the word sense disambiguation method based on a graph model;
Figure 2 is a flow diagram of the similarity computation;
Figure 3 is a flow diagram of building the disambiguation graph;
Figure 4 is a flow diagram of selecting the correct sense;
Figure 5 is a structural block diagram of the word sense disambiguation system based on a graph model;
Figure 6 is a diagram of the sense information of the example word "Chinese medicine";
Figure 7 is a diagram of the synset graph built in the word similarity algorithm based on word vectors and a knowledge base.
Detailed description of the embodiments
The word sense disambiguation method and system based on a graph model of the invention are described in detail below with reference to the drawings and specific embodiments.
Embodiment 1:
As shown in Figure 1, the word sense disambiguation method based on a graph model of the invention comprises the following steps:
S1, extracting context knowledge: perform part-of-speech tagging on the ambiguous sentence and extract the content words as context knowledge; the content words are the nouns, verbs, adjectives, and adverbs.
Example: consider the sentence "Around the implementation of the 'guiding opinions', and in view of the actual situation of traditional Chinese medicine work, all regions shall intensify their efforts and actively and steadily advance the reform of traditional Chinese medicine institutions.", in which "Chinese medicine" is the word to be disambiguated. Part-of-speech tagging uses the Chinese Academy of Sciences word segmentation system NLPIR-ICTCLAS, which assigns each token a tag such as /n, /v, /vn, /a, /d. The content words are then extracted and formatted, for ease of later processing, as "Chinese medicine_n_25: around_v_0 guidance_vn_2 opinion_n_3 implement_v_6 implement_v_7 combine_v_9 traditional Chinese medicine_n_10 work_vn_11 reality_n_13 want_v_16 intensify_v_17 effort_n_18 actively_a_20 steadily_a_22 advance_v_24 Chinese medicine_n_25 medical_n_26 institution_n_27 reform_vn_28", where the word before the colon is the word to be disambiguated and the number after each part-of-speech tag is the position of the word in the sentence.
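A minimal sketch of step S1 under the assumption that the jieba part-of-speech tagger stands in for the NLPIR-ICTCLAS system used in the embodiment; the tag prefixes n, v, a, d cover the nouns, verbs, adjectives, and adverbs named above.

```python
import jieba.posseg as pseg

CONTENT_PREFIXES = ("n", "v", "a", "d")  # nouns, verbs, adjectives, adverbs

def extract_context_knowledge(sentence: str) -> list[str]:
    """Step S1: POS-tag the sentence and keep the content words with their positions."""
    return [f"{pair.word}_{pair.flag}_{position}"
            for position, pair in enumerate(pseg.cut(sentence))
            if pair.flag.startswith(CONTENT_PREFIXES)]
```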
S2, similarity computation: separately compute the English-based similarity, the word-vector-based similarity, and the HowNet-based similarity.
As shown in Figure 2, the similarity computation proceeds as follows.
English-based similarity: annotate the context knowledge with HowNet sense information and perform sense mapping to obtain a set of English words; then apply the word similarity algorithm based on word vectors and a knowledge base to the resulting English words. Since HowNet is bilingual, the sense mapping here directly takes the English word information in HowNet. The main part of the word similarity algorithm based on word vectors and a knowledge base is described row by row below.
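A minimal reconstruction of rows 6-9 of that algorithm from the row-by-row description that follows; the precomputed inputs (the cosine similarity of row 1 and the synset counts of rows 2-6) and the parameter defaults are assumptions, and networkx supplies the Dijkstra shortest path of row 7.

```python
import networkx as nx

def word_similarity(graph: nx.Graph, w1: str, w2: str, sim_vec: float,
                    count_common: int, count_w1: int, count_w2: int,
                    d: float = 2.0, alpha: float = 0.5,
                    delta: float = 1.4, beta: float = 0.6) -> float:
    sim_lap = d * count_common / (count_w1 + count_w2)   # row 6: synset overlap
    path = nx.dijkstra_path_length(graph, w1, w2)        # row 7: shortest path
    sim_bn = alpha * (1.0 / delta ** path) + (1.0 - alpha) * sim_lap
    return beta * sim_vec + (1.0 - beta) * sim_bn        # rows 8-9: combine and return
```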
In the word similarity algorithm based on word vectors and a knowledge base, row 1 takes two given English words and obtains their similarity by computing the cosine similarity of their word vectors; if a given item is a phrase, there is no ready-made vector for it among the trained word vectors, so the phrase is processed further: the word vectors of the words in the phrase are summed to obtain a vector representation of the phrase, from which the phrase similarity is obtained:

sim_vec(p1, p2) = cos(Σ_{i=1..|p1|} v(w_i), Σ_{j=1..|p2|} v(w_j))

where |p1| and |p2| denote the numbers of words contained in phrases p1 and p2, and w_i and w_j denote the i-th word of p1 and the j-th word of p2.
Rows 2-4 iteratively search for the synsets related to the words w1 and w2 until the number of iteration steps exceeds γ; since the cost of the graph computation is high when there are too many nodes, the maximum number of iteration steps γ is set to 10. Row 5 builds the graph from w1, w2, and the synsets associated with them. Row 6 computes, within a certain distance range in the graph, the overlap of the synsets related to w1 and w2, with the distance range set to 2:

sim_lap(w1, w2) = 2 * count(w1, w2) / (count(w1) + count(w2))

where count(w1, w2) is the number of synsets that w1 and w2 have in common, and count(w1) and count(w2) are the numbers of synsets that w1 and w2 each have.
Row 7 computes the shortest path between w1 and w2 in the graph with Dijkstra's algorithm and from it obtains the similarity of w1 and w2:

sim_bn(w1, w2) = α * (1 / δ^path) + (1 - α) * sim_lap(w1, w2)

where path is the shortest path between w1 and w2; δ is a value used to scale the similarity and is set to 1.4; sim_lap(w1, w2) is the overlap between w1 and w2; and the parameter α is a regulating factor that balances the two terms of the formula.
Row 8 linearly combines the word-vector method above with the knowledge-base (BabelNet) method to obtain the final similarity:

sim_final(w1, w2) = β * sim_vec + (1 - β) * sim_bn

where sim_bn and sim_vec denote the similarities obtained by the knowledge-base method and the word-vector method, respectively; the parameter β is a regulating factor balancing the two results and is set to 0.6.
Row 9 returns the similarity sim_final.
The word-vector processing in the word similarity algorithm based on word vectors and a knowledge base uses the word2vec toolkit to train word vectors on the unannotated English Wikipedia corpus. Before training, the data are preprocessed and the file format is converted from Unicode to UTF-8. The training window is set to 5, the default vector dimension to 200, and the Skip-gram model is selected. After training, a word vector file is obtained in which each word is mapped to a 200-dimensional vector whose components are double-precision values.
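A minimal training sketch with the stated settings (window 5, dimension 200, Skip-gram); it uses the gensim implementation of word2vec rather than the original Google toolkit, and the corpus path is hypothetical.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Each line of the preprocessed UTF-8 corpus is one tokenized sentence.
model = Word2Vec(LineSentence("wiki_en_utf8.txt"),
                 vector_size=200,  # 200-dimensional vectors
                 window=5,         # training window of 5
                 sg=1)             # Skip-gram model
model.wv.save_word2vec_format("wiki_en.vec")  # the word vector file
```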
The knowledge base chosen is BabelNet. BabelNet provides rich concepts and named entities, interlinked by a large number of semantic relations; the semantic relations here include synonymy, hypernymy/hyponymy, and part-whole relations. Given two words (concepts or named entities), their respective synsets can be obtained through the BabelNet API, together with the synsets linked to them by semantic relations. A synset is a set of synonyms with a unique identifier in BabelNet that denotes one specific sense. For example, the identifier "bn:00021464n" denotes the synset "computer, computing machine, computing device, data processor, electronic computer, information processing system", which expresses the specific sense "computer". The synset graph built in the word similarity algorithm based on word vectors and a knowledge base is shown in Figure 7.
Example: the context knowledge is annotated with HowNet sense information, namely with sense numbers, giving "Chinese medicine_n_25: around_v_0:124932 around_v_0:124933 guidance_vn_2:155807 opinion_n_3:143264 opinion_n_3:143267 implement_v_6:047082 implement_v_7:081572 implement_v_7:081573 implement_v_7:081575 combine_v_9:064548 combine_v_9:064549 traditional Chinese medicine_n_10:157339 work_vn_11:044068 reality_n_13:109077 reality_n_13:109078 want_v_16:140522 want_v_16:140530 want_v_16:140532 want_v_16:140534 intensify_v_17:059967 intensify_v_17:059968 intensify_v_17:059969 effort_n_18:076991 actively_a_20:057562 actively_a_20:057564 steadily_a_22:126267 steadily_a_22:126269 advance_v_24:122203 advance_v_24:122206 advance_v_24:122211 Chinese medicine_n_25:157332 Chinese medicine_n_25:157329 institution_n_27:057323 institution_n_27:057325 institution_n_27:057326 reform_vn_28:041189".
After sense mapping, this becomes "Chinese medicine_n_25: around_v_0:124932|revolve round around_v_0:124933|centre on guidance_vn_2:155807|direct opinion_n_3:143264|complaint opinion_n_3:143267|idea implement_v_6:047082|carry out implement_v_7:081572|feel at ease implement_v_7:081573|ascertain implement_v_7:081575|fulfil combine_v_9:064548|be united in wedlock combine_v_9:064549|combination traditional Chinese medicine_n_10:157339|traditional Chinese medicine and drugs work_vn_11:044068|work reality_n_13:109077|reality reality_n_13:109078|practice want_v_16:140522|want to want_v_16:140530|ask want_v_16:140532|ask for want_v_16:140534|take intensify_v_17:059967|widen intensify_v_17:059968|enhance intensify_v_17:059969|enlarge effort_n_18:076991|dynamics actively_a_20:057562|active actively_a_20:057564|positive steadily_a_22:126267|safe steadily_a_22:126269|reliable advance_v_24:122203|move forward advance_v_24:122206|advance advance_v_24:122211|push into Chinese medicine_n_25:157332|traditional_Chinese_medical_science Chinese medicine_n_25:157329|practitioner_of_Chinese_medicine institution_n_27:057323|institution institution_n_27:057325|internal structure of an organization institution_n_27:057326|mechanism reform_vn_28:041189|reform".
English similarity is then computed between every pair of the resulting English words (each HowNet sense concept corresponds to one English word), giving "Chinese medicine_n_25: around_v_0:124932|revolve round and guidance_vn_2:155807|direct is 0.292, around_v_0:124932|revolve round and opinion_n_3:143264|complaint is 0.3085, around_v_0:124932|revolve round and opinion_n_3:143267|idea is 0.3742, around_v_0:124932|revolve round and implement_v_6:047082|carry out is 0.4015, around_v_0:124932|revolve round and implement_v_7:081572|feel at ease is 0.3575, around_v_0:124932|revolve round and implement_v_7:081573|ascertain is 0.3215, around_v_0:124932|revolve round and implement_v_7:081575|fulfil is 0.3541, around_v_0:124932|revolve round and combine_v_9:064548|be united in wedlock is 0.3299, around_v_0:124932|revolve round and combine_v_9:064549|combination is 0.3487, around_v_0:124932|revolve round and traditional Chinese medicine_n_10:157339|traditional Chinese medicine and drugs is 0.3520, around_v_0:124932|revolve round and work_vn_11:044068|work is 0.3478, around_v_0:124932|revolve round and reality_n_13:109077|reality is 0.3664, around_v_0:124932|revolve round and reality_n_13:109078|practice is 0.3907, around_v_0:124932|revolve round and want_v_16:140522|want to is 0.3375, around_v_0:124932|revolve round and want_v_16:140530|ask is 0.3482". Only part of the similarity results is shown here for reasons of space.
Word-vector-based similarity: the Sogou whole-network news corpus (1.43 GB in total) is used to train word vectors with Google's word2vec toolkit, producing a word vector file; the word vectors of two given words are looked up in this file, and the cosine similarity between the two vectors is taken as their similarity.
It should be noted that the more senses an ambiguous word has, the more the trained word vector file is likely to lean towards the more common senses of that word. For this reason, HowNet is used to convert an ambiguous word into the senses it possesses, that is, into the first sememe in each of its concept definitions; as shown in Figure 6, the ambiguous word "Chinese medicine" is converted into "people" and "knowledge".
Example: after the ambiguous words are processed with HowNet, we obtain "Chinese medicine_n_25: around_v_0:124932|surround around_v_0:124933|surround guidance_vn_2:155807|order opinion_n_3:143264|text opinion_n_3:143267|thought implement_v_6:047082|implement implement_v_7:081572|feel at ease implement_v_7:081573|decide implement_v_7:081575|realize combine_v_9:064548|marry combine_v_9:064549|merge traditional Chinese medicine_n_10:157339|knowledge_drug work_vn_11:044068|affair reality_n_13:109077|entity reality_n_13:109078|thing want_v_16:140522|expect want_v_16:140530|demand want_v_16:140532|seek want_v_16:140534|spend intensify_v_17:059967|deform intensify_v_17:059968|optimize intensify_v_17:059969|enlarge effort_n_18:076991|intensity actively_a_20:057562|active actively_a_20:057564|positive steadily_a_22:126267|fit steadily_a_22:126269|firm advance_v_24:122203|advance advance_v_24:122206|launch advance_v_24:122211|push Chinese medicine_n_25:157332|knowledge Chinese medicine_n_25:157329|people institution_n_27:057323|mechanism institution_n_27:057325|part institution_n_27:057326|component reform_vn_28:041189|improve".
Word-vector-based similarity is computed between every pair of the resulting Chinese words (each corresponding to a specific HowNet sense concept), giving "Chinese medicine_n_25: around_v_0:124932|surround and guidance_vn_2:155807|order is -0.0145, around_v_0:124932|surround and opinion_n_3:143264|text is -0.0264, around_v_0:124932|surround and opinion_n_3:143267|thought is -0.0366, around_v_0:124932|surround and implement_v_6:047082|implement is 0.2071, around_v_0:124932|surround and implement_v_7:081572|feel at ease is -0.0430, around_v_0:124932|surround and implement_v_7:081573|decide is 0.1502, around_v_0:124932|surround and implement_v_7:081575|realize is 0.2254, around_v_0:124932|surround and combine_v_9:064548|marry is -0.0183, around_v_0:124932|surround and combine_v_9:064549|merge is 0.0745, around_v_0:124932|surround and traditional Chinese medicine_n_10:157339|knowledge_drug is 0.0866, around_v_0:124932|surround and work_vn_11:044068|affair is 0.1434, around_v_0:124932|surround and reality_n_13:109077|entity is 0.1503, around_v_0:124932|surround and reality_n_13:109078|thing is -0.0571, around_v_0:124932|surround and want_v_16:140522|expect is 0.1009, around_v_0:124932|surround and want_v_16:140530|demand is 0.2090, around_v_0:124932|surround and want_v_16:140532|seek is 0.0496, around_v_0:124932|surround and want_v_16:140534|spend is 0.0176, around_v_0:124932|surround and intensify_v_17:059967|deform is 0.0000, around_v_0:124932|surround and intensify_v_17:059968|optimize is 0.2410, around_v_0:124932|surround and intensify_v_17:059969|enlarge is 0.1911, around_v_0:124932|surround and effort_n_18:076991|intensity is 0.0592, around_v_0:124932|surround and actively_a_20:057562|active is 0.3089, around_v_0:124932|surround and actively_a_20:057564|positive is 0.0554, around_v_0:124932|surround and steadily_a_22:126267|fit is 0.0245, around_v_0:124932|surround and steadily_a_22:126269|firm is 0.0490, around_v_0:124932|surround and advance_v_24:122203|advance is 0.1917, around_v_0:124932|surround and advance_v_24:122206|launch is 0.0277, around_v_0:124932|surround and advance_v_24:122211|push is 0.1740, around_v_0:124932|surround and Chinese medicine_n_25:157332|knowledge is 0.2205, around_v_0:124932|surround and Chinese medicine_n_25:157329|people is -0.0686, around_v_0:124932|surround and institution_n_27:057323|mechanism is 0.0945, around_v_0:124932|surround and institution_n_27:057325|part is 0.0582, around_v_0:124932|surround and institution_n_27:057326|component is 0.0582". Only part of the similarity results is shown here for reasons of space.
HowNet-based similarity: annotate the context knowledge with sense information using HowNet, in the form of word plus concept number, and compute the similarity between senses with the concept similarity toolkit provided by HowNet.
Example: the context knowledge is annotated with HowNet sense information, namely with sense numbers, exactly as in the English-based similarity example above.
The similarity between the senses is then computed with the concept similarity toolkit provided by HowNet, giving "Chinese medicine_n_25: around_v_0:124932 and guidance_vn_2:155807 is 0.015094, around_v_0:124932 and opinion_n_3:143264 is 0.000624, around_v_0:124932 and opinion_n_3:143267 is 0.010256, around_v_0:124932 and implement_v_6:047082 is 0.013793, around_v_0:124932 and implement_v_7:081572 is 0.010256, around_v_0:124932 and implement_v_7:081573 is 0.013793, around_v_0:124932 and implement_v_7:081575 is 0.013793, around_v_0:124932 and combine_v_9:064548 is 0.016667, around_v_0:124932 and combine_v_9:064549 is 0.018605, around_v_0:124932 and traditional Chinese medicine_n_10:157339 is 0.000624, around_v_0:124932 and work_vn_11:044065 is 0.000624, around_v_0:124932 and work_vn_11:044067 is 0.000624, around_v_0:124932 and work_vn_11:044068 is 0.015094, around_v_0:124932 and reality_n_13:109077 is 0.000624, around_v_0:124932 and reality_n_13:109078 is 0.000624, around_v_0:124932 and want_v_16:140522 is 0.010959, around_v_0:124932 and want_v_16:140530 is 0.015094, around_v_0:124932 and want_v_16:140532 is 0.018605, around_v_0:124932 and want_v_16:140534 is 0.015094, around_v_0:124932 and intensify_v_17:059967 is 0.013793, around_v_0:124932 and intensify_v_17:059968 is 0.015094, around_v_0:124932 and intensify_v_17:059969 is 0.013793, around_v_0:124932 and effort_n_18:076991 is 0.000624, around_v_0:124932 and actively_a_20:057562 is 0.000624, around_v_0:124932 and actively_a_20:057564 is 0.000624, around_v_0:124932 and steadily_a_22:126267 is 0.000624, around_v_0:124932 and steadily_a_22:126269 is 0.000624".
S3, building the disambiguation graph: optimize the weights of the similarities with the simulated annealing algorithm to obtain the fused similarity, then construct the disambiguation graph with the sense concepts of the words as vertices, the semantic relations between concepts as edges, and the fused similarities as edge weights. As shown in Figure 3, building the disambiguation graph proceeds as follows.
S301, weight optimization: automatically optimize the three similarity values of step S2 with the weight optimization algorithm based on simulated annealing to obtain the optimal weight parameters; the simulated annealing algorithm performs parameter optimization according to the acceptance probability

p = 1, if result(x_new) >= result(x_old)
p = exp((result(x_new) - result(x_old)) / (δ * t)), if result(x_new) < result(x_old)

where result(x) is the objective function, namely the disambiguation accuracy; δ is the cooling rate; t is the current temperature; x_new is the newly sampled parameter; and x_old is the original parameter.
The formula covers the following two cases:
(a) if the objective value of the new parameter x_new is not less than that of the original parameter x_old, the new parameter x_new is accepted with probability p = 1;
(b) if the objective value of the new parameter x_new is less than that of the original parameter x_old, p = exp((result(x_new) - result(x_old)) / (δ * t)) serves as the basis for accepting x_new: a probability value is generated at random and compared with p:
(1) if the randomly generated value does not exceed p, the new parameter x_new is accepted;
(2) if the randomly generated value is greater than p, the new parameter x_new is discarded.
The weight optimization algorithm based on simulated annealing is described row by row below.
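A minimal reconstruction of that algorithm from the row-by-row description that follows; the objective function get_eval_result (the patent's getEvalResult) is a placeholder, the neighbourhood step is illustrative, and comments mark the rows referenced in the text.

```python
import math
import random

def get_eval_result(x: float, y: float, z: float) -> float:
    """Placeholder for getEvalResult: disambiguation accuracy with weights x, y, z."""
    raise NotImplementedError

def optimize_weights(y: float = 1.0 / 3.0) -> tuple[float, float, float]:
    t, t_min, delta = 100.0, 0.001, 0.98    # row 1: initial temperature, floor, cooling rate
    k = 100                                  # row 1: maximum iteration steps
    x = random.uniform(0.0, 1.0 - y)         # rows 4-5: pick x, z is the remainder
    z = 1.0 - x - y
    while t > t_min:                         # rows 2-3: temperature and step control
        for _ in range(k):
            x_new = min(max(x + random.uniform(-0.05, 0.05), 0.0), 1.0 - y)  # row 7: neighbour of x
            old = get_eval_result(x, y, z)   # row 6: objective = disambiguation accuracy
            new = get_eval_result(x_new, y, 1.0 - x_new - y)
            # rows 8-18: accept x_new outright if not worse, else with probability exp(...)
            if new >= old or random.random() <= math.exp((new - old) / (delta * t)):
                x, z = x_new, 1.0 - x_new - y
        t *= delta                           # row 20: cool down at rate delta
    return x, y, z                           # row 22: optimal parameter combination
```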
In the weight optimization algorithm based on simulated annealing, row 1 is the initialization: the initial temperature t is set to 100, the temperature floor t_min to 0.001, the cooling rate delta to 0.98, and the maximum number of iteration steps k to 100. Rows 2-3 control the temperature and the iteration steps. Rows 4-5 randomly select a double-precision value between 0 and 1-y, assign it to x, and assign 1-x-y to z. In row 6, the function getEvalResult(x, y, z) is the objective function; it returns the disambiguation accuracy obtained with the weight parameters x, y, z. Row 7 selects a new value in the neighbourhood of x and assigns it to x_new. Rows 8-18 decide whether x_new replaces x, as specified by the parameter optimization formula of the simulated annealing algorithm. Row 20 lowers t by the cooling rate delta. Row 22 returns the optimal parameter combination x, y, z.
Here x, y, z are the weight variables of the three similarity results. When the algorithm is executed for the first time, y is set to 1/3; this run yields the weight optimization parameters for x and y, min(x, y) is then fixed, and the algorithm is executed a second time, after which the other two weight parameters are determined.
S302, similarity fusion: after weight optimization, the final fused similarity between two senses is

Sim(ws, ws') = α * sim_how + β * sim_en + γ * sim_vec

where ws and ws' denote two senses; sim_how is the HowNet-based similarity, with weight α; sim_en is the word similarity based on word vectors and a knowledge base, with weight β; sim_vec is the word-vector-based similarity, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0.
Example: after weight optimization, the three similarity values are fused according to the fused similarity formula above: "Chinese medicine_n_25: around_v_0:124932|revolve round|surround and guidance_vn_2:155807|direct|order is 0.015094|0.2929|-0.0145, around_v_0:124932|revolve round|surround and opinion_n_3:143264|complaint|text is 0.000624|0.3085|-0.0264, around_v_0:124932|revolve round|surround and opinion_n_3:143267|idea|thought is 0.010256|0.3742|-0.0366, around_v_0:124932|revolve round|surround and implement_v_6:047082|carry out|implement is 0.013793|0.4015|0.2071, around_v_0:124932|revolve round|surround and implement_v_7:081572|feel at ease|feel at ease is 0.010256|0.3575|-0.0430, around_v_0:124932|revolve round|surround and implement_v_7:081573|ascertain|decide is 0.013793|0.3215|0.1502, around_v_0:124932|revolve round|surround and implement_v_7:081575|fulfil|realize is 0.013793|0.3541|0.2254, around_v_0:124932|revolve round|surround and combine_v_9:064548|be united in wedlock|marry is 0.016667|0.3299|-0.0183, around_v_0:124932|revolve round|surround and combine_v_9:064549|combination|merge is 0.018605|0.3487|0.0745, around_v_0:124932|revolve round|surround and traditional Chinese medicine_n_10:157339|traditional Chinese medicine and drugs|knowledge_drug is 0.000624|0.3520|0.0866, around_v_0:124932|revolve round|surround and work_vn_11:044068|work|affair is 0.015094|0.3478|0.1434, around_v_0:124932|revolve round|surround and reality_n_13:109077|reality|entity is 0.000624|0.3664|0.1503, around_v_0:124932|revolve round|surround and reality_n_13:109078|practice|thing is 0.000624|0.3907|-0.0571, around_v_0:124932|revolve round|surround and want_v_16:140522|want to|expect is 0.010959|0.3375|0.1009, around_v_0:124932|revolve round|surround and want_v_16:140530|ask|demand is 0.015094|0.3482|0.2090, around_v_0:124932|revolve round|surround and want_v_16:140532|ask for|seek is 0.018605|0.3648|0.0496". To show the process, the values are not fused further here: an entry such as "0.018605|0.3648|0.0496" lists the three similarity values, whose fusion is α * 0.018605 + β * 0.3648 + γ * 0.0496.
S303, building the disambiguation graph: the disambiguation graph takes the senses as vertices and the semantic relations between senses as edges; the three similarity values, integrated by the simulated-annealing weight optimization algorithm, serve as the edge weights between senses. Here a sense is a triple, written Word(No., Sword, Enword), where No. is the concept number, Sword is the first sememe word, and Enword is the English word; the three elements form an organic whole describing one and the same sense concept: a sense concept number uniquely identifies a sense in HowNet, from which the first sememe in its concept definition can be obtained, and the sense can in turn be mapped to an English word.
This triple form of the sense allows the three similarity computation methods above to be integrated into a whole. Taking "Chinese medicine" as an example, the word has two senses, corresponding to two sense triples: "Chinese medicine (157329, people, practitioner of Chinese medicine)" and "Chinese medicine (157332, knowledge, traditional Chinese medical science)". The edge weight between any two vertices of the disambiguation graph, that is, the semantic similarity between the two senses, can then be obtained from the fused similarity computation.
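A minimal sketch of step S303 with networkx, assuming the fused pairwise similarities have already been computed; the two example sense identifiers follow the examples above, the similarity values are illustrative, and clipping negative similarities to zero for the random walk is an assumption.

```python
import networkx as nx

def build_disambiguation_graph(fused: dict[tuple[str, str], float]) -> nx.Graph:
    """Step S303: senses are vertices, fused similarities are edge weights."""
    graph = nx.Graph()
    for (sense_a, sense_b), similarity in fused.items():
        graph.add_edge(sense_a, sense_b, weight=max(similarity, 0.0))  # clip negatives
    return graph

fused = {("Chinese medicine_n_25:157332", "around_v_0:124932"): 0.2205,
         ("Chinese medicine_n_25:157329", "around_v_0:124932"): 0.0686}
graph = build_disambiguation_graph(fused)
scores = nx.pagerank(graph, alpha=0.85, weight="weight")  # step S401: graph scoring
```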
S4, selecting the correct sense: score the candidate senses in the graph with the graph scoring method, obtain the score list of the candidate senses, and select the highest-scoring sense as the correct sense. As shown in Figure 4, selecting the correct sense proceeds as follows.
S401, graph scoring: call the graph scoring method to score the importance of the sense concept vertices in the disambiguation graph; after scoring, the candidate sense concepts are arranged in descending order of score to form the candidate sense concept list. The graph scoring uses the PageRank algorithm, which evaluates the nodes of the graph on the basis of a Markov chain model: the PageRank score of a node depends on the PageRank scores of all nodes linked to it, and the score of a node v is computed as

PR(v) = (1 - α)/N + α * Σ_{u ∈ in(v)} PR(u)/|out(u)|

where 1 - α is the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing along the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linked to node v.
Example: after graph scoring, the candidate sense-concept list is obtained:
Chinese medicine_n_25:157332 2.1213090873827947E58;
Chinese medicine_n_25:157329 1.8434688340823378E58.
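A minimal power-iteration sketch of the PageRank formula above (the node encoding, damping value 0.85, and iteration count are illustrative assumptions; the patent does not fix α):

```python
def pagerank(in_links, out_degree, alpha=0.85, iters=50):
    """in_links: node -> list of nodes linking to it; out_degree: node -> out-degree."""
    nodes = list(out_degree)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}          # uniform starting distribution
    for _ in range(iters):
        # PR(v) = (1 - alpha)/N + alpha * sum over u in in(v) of PR(u)/|out(u)|
        pr = {v: (1 - alpha) / n
                 + alpha * sum(pr[u] / out_degree[u] for u in in_links.get(v, []))
              for v in nodes}
    return pr
```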
S402, selecting the correct sense: the correct sense is selected from the disambiguation result, covering the following two situations:
1. if the disambiguation result contains only one sense concept, that sense concept is taken as the correct sense;
2. if the disambiguation result is a sense list composed of multiple sense concepts, the highest-scoring sense concept is taken as the correct sense.
Example: the highest-scoring sense concept, namely "Chinese medicine_n_25:157332", is selected as the correct sense.
Embodiment 2:
As shown in Fig. 5, the word sense disambiguation system based on a graph model of the present invention includes:
a context knowledge extraction unit, which performs part-of-speech tagging on the ambiguous sentence and extracts content words as context knowledge, the content words being nouns, verbs, adjectives, and adverbs;
a similarity calculation unit, for separately performing the English-based similarity calculation, the word-vector-based similarity calculation, and the HowNet-based similarity calculation. The similarity calculation unit includes:
an English similarity calculation unit, for annotating the context knowledge with HowNet word sense information and performing sense-mapping processing to obtain an English word set, then applying the word similarity algorithm based on word vectors and a knowledge base to compute similarities over the resulting English words; since HowNet is bilingual, the sense mapping here directly retrieves the English word information from HowNet;
a word-vector similarity calculation unit, for training word vectors on the corpus with Google's word2vec toolkit to obtain a word-vector file, retrieving the word vectors of two given words from that file, and taking the cosine similarity between the vectors as their similarity. It should be noted that the more senses an ambiguous word has, the more likely its trained word vector is to be biased toward the more common senses; for this reason, HowNet is used to convert the ambiguous word into the first sememe of each of its sense concept definitions. As shown in Fig. 6, the ambiguous word "Chinese medicine" is converted to "people" and "knowledge".
a HowNet similarity calculation unit, for annotating the context knowledge with word sense information using HowNet, in the form of word vocabulary plus concept number, and computing the similarity between senses with the concept-similarity toolkit provided with HowNet.
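The "word vocabulary plus concept number" form appears throughout as strings like "Chinese medicine_n_25:157332". A small parser for that form (the field interpretation as word, part of speech, position, and concept number follows the examples in this text and is otherwise an assumption):

```python
import re

def parse_sense_tag(tag: str):
    """Parse annotations like 'Chinese medicine_n_25:157332' into
    (word, pos, position, concept_no)."""
    m = re.match(r"(.+)_([a-z]+)_(\d+):(\d+)$", tag)
    if not m:
        raise ValueError(f"unrecognized sense tag: {tag}")
    word, pos, idx, no = m.groups()
    return word, pos, int(idx), int(no)

print(parse_sense_tag("Chinese medicine_n_25:157332"))
# -> ('Chinese medicine', 'n', 25, 157332)
```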
a disambiguation graph construction unit, for performing weight optimization on the similarities with simulated annealing to obtain the fused similarity, and then building the disambiguation graph with word concepts as vertices, semantic relations between concepts as edges, and the fused similarity as edge weights. The disambiguation graph construction unit includes:
a weight optimization unit, for automatically optimizing the three similarity values (from the English-based similarity calculation, the word-vector-based similarity calculation, and the HowNet-based similarity calculation) with the weight-optimization algorithm based on simulated annealing, to obtain the optimal weight parameters. Simulated annealing performs the parameter optimization with the acceptance probability:

p = 1, if result(x_new) ≥ result(x_old);
p = exp((result(x_new) − result(x_old))/(δ·t)), if result(x_new) < result(x_old),

where result(x) is the objective function, namely the disambiguation accuracy; δ is the cooling rate; t is the current temperature; x_new is the newly drawn parameter; and x_old is the original parameter.
The formula covers the following two situations:
(a) if the objective value of the new parameter x_new is not less than that of the original parameter x_old, the new parameter x_new is selected with probability p = 1;
(b) if the objective value of the new parameter x_new is less than that of the original parameter x_old, then p = exp((result(x_new) − result(x_old))/(δ·t)) serves as the basis for selecting x_new: a probability value is generated at random and compared with p:
1. if the randomly generated value is not greater than p, the new parameter x_new is selected;
2. if the randomly generated value is greater than p, the new parameter x_new is discarded
(a minimal sketch of this acceptance rule follows this list);
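A minimal sketch of the annealing loop under these rules; the objective `accuracy` is a stand-in for the disambiguation accuracy on training data, and the neighbor move, starting temperature, and step count are illustrative assumptions:

```python
import math
import random

def anneal_weights(accuracy, t=1.0, t_min=1e-3, delta=0.95, steps=100):
    """Search (alpha, beta, gamma) with alpha+beta+gamma = 1, maximizing accuracy()."""
    x_old = (1/3, 1/3, 1/3)
    while t > t_min:
        for _ in range(steps):
            # neighbor move: perturb each weight, then renormalize onto the simplex
            raw = [max(1e-6, w + random.uniform(-0.1, 0.1)) for w in x_old]
            x_new = tuple(w / sum(raw) for w in raw)
            diff = accuracy(x_new) - accuracy(x_old)
            # acceptance rule: p = 1 if no worse, else exp(diff / (delta * t))
            p = 1.0 if diff >= 0 else math.exp(diff / (delta * t))
            if random.random() <= p:      # random value not greater than p: accept
                x_old = x_new
        t *= delta                        # cooling
    return x_old
```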
a similarity fusion unit: after weight optimization, the finally fused similarity between two senses is:

Sim(ws, ws′) = α·sim_how + β·sim_en + γ·sim_vec

where ws and ws′ denote two senses; sim_how is the result of the HowNet-based similarity calculation, with weight α; sim_en is the word similarity result based on word vectors and the knowledge base, with weight β; sim_vec is the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
a disambiguation-graph building unit, for building the disambiguation graph with senses as vertices and semantic relations between senses as edges, using the weight-optimization algorithm based on simulated annealing to integrate the three similarity values into the edge weights between senses; here a sense is a triple, expressed as Word(No., Sword, Enword), where No. is the concept number, Sword is the first sememe, and Enword is the English word; the three together form an organic whole describing one sense concept; a sense concept number uniquely identifies one sense in HowNet, from whose concept definition the first sememe can be obtained, and the sense can then be mapped to an English word.
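A sketch of the graph building and scoring with networkx (the library choice, the toy sense pair, and the weights are assumptions; the fused weight 0.1287 reuses the hypothetical worked example earlier, and `nx.pagerank` implements the scoring formula above):

```python
import networkx as nx

# vertices are sense tags; edge weights are the fused similarities Sim(ws, ws')
G = nx.Graph()
G.add_edge("Chinese medicine_n_25:157332", "around_v_0:124932", weight=0.1287)
G.add_edge("Chinese medicine_n_25:157329", "around_v_0:124932", weight=0.0712)

scores = nx.pagerank(G, alpha=0.85, weight="weight")
best = max(scores, key=scores.get)   # highest-scoring candidate sense
```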
a correct-sense selection unit, for scoring the candidate senses in the graph via graph scoring, obtaining the score list of candidate senses, and selecting the highest-scoring one as the correct sense. The correct-sense selection unit includes:
a graph scoring unit, for invoking the graph scoring method to score the importance of the sense-concept vertices in the disambiguation graph; after scoring, the candidate sense concepts are sorted by score in descending order to form the candidate sense-concept list; graph scoring uses the PageRank algorithm, which evaluates the nodes of the graph based on a Markov chain model, the PageRank score of a node depending on the PageRank scores of all nodes linked to it; the PageRank score of a node v is:

PR(v) = (1 − α)/N + α · Σ_{u ∈ in(v)} PR(u)/|out(u)|

where 1 − α is the probability, during the random walk, of jumping out of the current Markov chain and randomly selecting a node; α is the probability of continuing along the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linking to node v;
a correct-sense selecting unit, for selecting the correct sense from the disambiguation result, covering the following two situations (see the selection helper sketched below):
1. if the disambiguation result contains only one sense concept, that sense concept is taken as the correct sense;
2. if the disambiguation result is a sense list composed of multiple sense concepts, the highest-scoring sense concept is taken as the correct sense.
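A minimal helper for this two-case rule (the function name and score values are illustrative):

```python
def select_sense(scored: dict) -> str:
    """scored: candidate sense tag -> graph score; returns the chosen sense."""
    if len(scored) == 1:                  # case 1: a single sense concept
        return next(iter(scored))
    return max(scored, key=scored.get)    # case 2: highest-scoring concept wins

print(select_sense({"Chinese medicine_n_25:157332": 2.12e58,
                    "Chinese medicine_n_25:157329": 1.84e58}))
# -> Chinese medicine_n_25:157332
```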
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some or all of the technical features; and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A word sense disambiguation method based on a graph model, characterized by comprising the following steps:
S1, extracting context knowledge: performing part-of-speech tagging on the ambiguous sentence and extracting content words as context knowledge, the content words being nouns, verbs, adjectives, and adverbs;
S2, similarity calculation: separately performing the English-based similarity calculation, the word-vector-based similarity calculation, and the HowNet-based similarity calculation;
S3, building the disambiguation graph: performing weight optimization on the similarities with simulated annealing to obtain the fused similarity, then building the disambiguation graph with word concepts as vertices, semantic relations between concepts as edges, and the fused similarity as edge weights;
S4, selecting the correct sense: scoring the candidate senses in the graph via graph scoring, obtaining the score list of candidate senses, and selecting the highest-scoring one as the correct sense.
2. The word sense disambiguation method based on a graph model according to claim 1, characterized in that the specific steps of the similarity calculation in step S2 are as follows:
S201, English-based similarity calculation: annotating the context knowledge with HowNet word sense information and performing sense-mapping processing to obtain an English word set; then applying the word similarity algorithm based on word vectors and a knowledge base to compute similarities over the resulting English words;
S202, word-vector-based similarity calculation: training word vectors on the corpus with Google's word2vec toolkit to obtain a word-vector file, retrieving the word vectors of two given words from that file, and taking the cosine similarity between the vectors as their similarity;
S203, HowNet-based similarity calculation: annotating the context knowledge with word sense information using HowNet, in the form of word vocabulary plus concept number, and computing the similarity between senses with the concept-similarity toolkit provided with HowNet.
3. The word sense disambiguation method based on a graph model according to claim 2, characterized in that the word similarity algorithm based on word vectors and a knowledge base in step S201 is specifically as follows:
S20101, judging whether the given items are words or phrases:
1. if two English words are given, their similarity is obtained by computing the cosine similarity of the two word vectors;
2. if a given item is a phrase, the word vectors of the words in the phrase are added to obtain the vector representation of the phrase, and the phrase similarity is obtained as:

sim_vec(p_1, p_2) = cos((1/|p_1|) · Σ_{i=1}^{|p_1|} w_i, (1/|p_2|) · Σ_{j=1}^{|p_2|} w_j)

where |p_1| and |p_2| are the numbers of words contained in phrases p_1 and p_2, and w_i and w_j denote the i-th word of p_1 and the j-th word of p_2 respectively;
S20102, iteratively searching for the synsets related to the two English words, until the number of iteration steps exceeds γ;
S20103, constructing a synset graph from the two English words and their related synsets;
S20104, within the set distance range, computing in the graph the overlap of the synsets related to the two English words:

sim_lap(w_i, w_j) = d · count(w_i, w_j)/(count(w_i) + count(w_j))

where count(w_i, w_j) is the number of synsets that words w_i and w_j have in common; count(w_i) and count(w_j) are the numbers of synsets each of w_i and w_j has; and d is the value of the set distance range;
S20105, computing the shortest path between w_i and w_j in the graph with Dijkstra's algorithm, and obtaining the similarity of w_i and w_j as:

sim_bn(w_i, w_j) = α · 1/(δ·path) + (1 − α) · sim_lap(w_i, w_j)

where path is the shortest path between w_i and w_j; δ is a value for adjusting the similarity; sim_lap(w_i, w_j) is the overlap between w_i and w_j; and parameter α is a regulatory factor balancing the two parts of the formula;
S20106, linearly combining the similarity sim_vec obtained by the word-vector approach in step S20101 and the similarity sim_bn obtained by the knowledge-base approach in step S20105 into the final similarity:

sim_final(w_i, w_j) = β · sim_vec + (1 − β) · sim_bn

where sim_bn and sim_vec are the similarities obtained by the knowledge-base approach and the word-vector approach respectively, and parameter β is a regulatory factor balancing the two results;
S20107, returning the similarity sim_final (a sketch of this procedure is given after this claim).
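A compact, illustrative sketch of steps S20101 to S20107 under stated assumptions: word vectors come from any embedding lookup `emb`, the synset graph is prebuilt as a networkx graph, `synsets` maps each word to the set of its related synsets, and all constants (d, δ, α, β) are placeholders rather than the patent's values:

```python
import numpy as np
import networkx as nx

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def phrase_vec(words, emb):
    """S20101 case 2: average the word vectors of the phrase's words."""
    return np.mean([emb[w] for w in words], axis=0)

def sim_lap(wi, wj, synsets, d=2):
    """S20104: synset overlap within the set distance range d."""
    common = len(synsets[wi] & synsets[wj])
    return d * common / (len(synsets[wi]) + len(synsets[wj]))

def sim_bn(wi, wj, graph, synsets, alpha=0.5, delta=1.0):
    """S20105: Dijkstra shortest path combined with the overlap."""
    path = nx.dijkstra_path_length(graph, wi, wj)
    return alpha * 1.0 / (delta * path) + (1 - alpha) * sim_lap(wi, wj, synsets)

def sim_final(wi, wj, emb, graph, synsets, beta=0.6):
    """S20106/S20107: linear combination of the two similarities."""
    return beta * cos(emb[wi], emb[wj]) + (1 - beta) * sim_bn(wi, wj, graph, synsets)
```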
4. The word sense disambiguation method based on a graph model according to claim 1, characterized in that the specific steps of building the disambiguation graph in step S3 are as follows:
S301, weight optimization: the weight-optimization algorithm based on simulated annealing automatically optimizes the three similarity values from step S2 to obtain the optimal weight parameters;
S302, similarity fusion: after weight optimization, the finally fused similarity between two senses is:

Sim(ws, ws′) = α·sim_how + β·sim_en + γ·sim_vec

where ws and ws′ denote two senses; sim_how is the result of the HowNet-based similarity calculation, with weight α; sim_en is the word similarity result based on word vectors and the knowledge base, with weight β; sim_vec is the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
S303, building the disambiguation graph: the disambiguation graph takes senses as vertices and semantic relations between senses as edges, and the weight-optimization algorithm based on simulated annealing integrates the three similarity values into the edge weights between senses.
5. The word sense disambiguation method based on a graph model according to claim 4, characterized in that in step S301 simulated annealing performs the parameter optimization with the acceptance probability:

p = 1, if result(x_new) ≥ result(x_old);
p = exp((result(x_new) − result(x_old))/(δ·t)), if result(x_new) < result(x_old),

where result(x) is the objective function, namely the disambiguation accuracy; δ is the cooling rate; t is the current temperature; x_new is the newly drawn parameter; and x_old is the original parameter;
the formula covers the following two situations:
(a) if the objective value of the new parameter x_new is not less than that of the original parameter x_old, the new parameter x_new is selected with probability p = 1;
(b) if the objective value of the new parameter x_new is less than that of the original parameter x_old, then p = exp((result(x_new) − result(x_old))/(δ·t)) serves as the basis for selecting x_new: a probability value is generated at random and compared with p:
1. if the randomly generated value is not greater than p, the new parameter x_new is selected;
2. if the randomly generated value is greater than p, the new parameter x_new is discarded;
the sense in step S303 is a triple, expressed as Word(No., Sword, Enword), where No. is the concept number, Sword is the first sememe, and Enword is the English word; the three together form an organic whole describing one sense concept; a sense concept number uniquely identifies one sense in HowNet, from whose concept definition the first sememe can be obtained, and the sense can then be mapped to an English word.
6. The word sense disambiguation method based on a graph model according to claim 1, characterized in that the specific steps of selecting the correct sense in step S4 are as follows:
S401, graph scoring: invoking the graph scoring method to score the importance of the sense-concept vertices in the disambiguation graph; after scoring, sorting the candidate sense concepts by score in descending order to form the candidate sense-concept list;
S402, selecting the correct sense: the correct sense is selected from the disambiguation result, covering the following two situations:
1. if the disambiguation result contains only one sense concept, that sense concept is taken as the correct sense;
2. if the disambiguation result is a sense list composed of multiple sense concepts, the highest-scoring sense concept is taken as the correct sense.
7. The word sense disambiguation method based on a graph model according to claim 6, characterized in that graph scoring in step S401 uses the PageRank algorithm, which evaluates the nodes of the graph based on a Markov chain model, the PageRank score of a node depending on the PageRank scores of all nodes linked to it; the PageRank score of a node v is:

PR(v) = (1 − α)/N + α · Σ_{u ∈ in(v)} PR(u)/|out(u)|

where 1 − α is the probability, during the random walk, of jumping out of the current Markov chain and randomly selecting a node; α is the probability of continuing along the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linking to node v.
8. A word sense disambiguation system based on a graph model, characterized in that the system includes:
a context knowledge extraction unit, which performs part-of-speech tagging on the ambiguous sentence and extracts content words as context knowledge, the content words being nouns, verbs, adjectives, and adverbs;
a similarity calculation unit, for separately performing the English-based similarity calculation, the word-vector-based similarity calculation, and the HowNet-based similarity calculation;
a disambiguation graph construction unit, for performing weight optimization on the similarities with simulated annealing to obtain the fused similarity, and then building the disambiguation graph with word concepts as vertices, semantic relations between concepts as edges, and the fused similarity as edge weights;
a correct-sense selection unit, for scoring the candidate senses in the graph via graph scoring, obtaining the score list of candidate senses, and selecting the highest-scoring one as the correct sense.
9. The word sense disambiguation system based on a graph model according to claim 8, characterized in that the similarity calculation unit includes:
an English similarity calculation unit, for annotating the context knowledge with HowNet word sense information and performing sense-mapping processing to obtain an English word set, then applying the word similarity algorithm based on word vectors and a knowledge base to compute similarities over the resulting English words;
a word-vector similarity calculation unit, for training word vectors on the corpus with Google's word2vec toolkit to obtain a word-vector file, retrieving the word vectors of two given words from that file, and taking the cosine similarity between the vectors as their similarity;
a HowNet similarity calculation unit, for annotating the context knowledge with word sense information using HowNet, in the form of word vocabulary plus concept number, and computing the similarity between senses with the concept-similarity toolkit provided with HowNet;
the disambiguation graph construction unit includes:
a weight optimization unit, for automatically optimizing the three similarity values of the English-based, word-vector-based, and HowNet-based similarity calculations with the weight-optimization algorithm based on simulated annealing, to obtain the optimal weight parameters; simulated annealing performs the parameter optimization with the acceptance probability:

p = 1, if result(x_new) ≥ result(x_old);
p = exp((result(x_new) − result(x_old))/(δ·t)), if result(x_new) < result(x_old),

where result(x) is the objective function, namely the disambiguation accuracy; δ is the cooling rate; t is the current temperature; x_new is the newly drawn parameter; and x_old is the original parameter;
the formula covers the following two situations:
(a) if the objective value of the new parameter x_new is not less than that of the original parameter x_old, the new parameter x_new is selected with probability p = 1;
(b) if the objective value of the new parameter x_new is less than that of the original parameter x_old, then p = exp((result(x_new) − result(x_old))/(δ·t)) serves as the basis for selecting x_new: a probability value is generated at random and compared with p:
1. if the randomly generated value is not greater than p, the new parameter x_new is selected;
2. if the randomly generated value is greater than p, the new parameter x_new is discarded;
a similarity fusion unit: after weight optimization, the finally fused similarity between two senses is:

Sim(ws, ws′) = α·sim_how + β·sim_en + γ·sim_vec

where ws and ws′ denote two senses; sim_how is the result of the HowNet-based similarity calculation, with weight α; sim_en is the word similarity result based on word vectors and the knowledge base, with weight β; sim_vec is the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
a disambiguation-graph building unit, for building the disambiguation graph with senses as vertices and semantic relations between senses as edges, using the weight-optimization algorithm based on simulated annealing to integrate the three similarity values into the edge weights between senses; here a sense is a triple, expressed as Word(No., Sword, Enword), where No. is the concept number, Sword is the first sememe, and Enword is the English word; the three together form an organic whole describing one sense concept; a sense concept number uniquely identifies one sense in HowNet, from whose concept definition the first sememe can be obtained, and the sense can then be mapped to an English word.
10. The word sense disambiguation system based on a graph model according to claim 8 or 9, characterized in that the correct-sense selection unit includes:
a graph scoring unit, for invoking the graph scoring method to score the importance of the sense-concept vertices in the disambiguation graph; after scoring, the candidate sense concepts are sorted by score in descending order to form the candidate sense-concept list; graph scoring uses the PageRank algorithm, which evaluates the nodes of the graph based on a Markov chain model, the PageRank score of a node depending on the PageRank scores of all nodes linked to it; the PageRank score of a node v is:

PR(v) = (1 − α)/N + α · Σ_{u ∈ in(v)} PR(u)/|out(u)|

where 1 − α is the probability, during the random walk, of jumping out of the current Markov chain and randomly selecting a node; α is the probability of continuing along the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linking to node v;
a correct-sense selecting unit, for selecting the correct sense from the disambiguation result, covering the following two situations:
1. if the disambiguation result contains only one sense concept, that sense concept is taken as the correct sense;
2. if the disambiguation result is a sense list composed of multiple sense concepts, the highest-scoring sense concept is taken as the correct sense.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811503355.7A CN109359303B (en) | 2018-12-10 | 2018-12-10 | Word sense disambiguation method and system based on graph model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109359303A true CN109359303A (en) | 2019-02-19 |
CN109359303B CN109359303B (en) | 2023-04-07 |
Family
ID=65332018
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811503355.7A Active CN109359303B (en) | 2018-12-10 | 2018-12-10 | Word sense disambiguation method and system based on graph model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109359303B (en) |
2018-12-10: application CN201811503355.7A filed (CN); granted as CN109359303B, status Active.
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2002017128A1 (en) * | 2000-08-24 | 2002-02-28 | Science Applications International Corporation | Word sense disambiguation |
WO2014087506A1 (en) * | 2012-12-05 | 2014-06-12 | 三菱電機株式会社 | Word meaning estimation device, word meaning estimation method, and word meaning estimation program |
WO2016050066A1 (en) * | 2014-09-29 | 2016-04-07 | 华为技术有限公司 | Method and device for parsing interrogative sentence in knowledge base |
CN105760363A (en) * | 2016-02-17 | 2016-07-13 | 腾讯科技(深圳)有限公司 | Text file word sense disambiguation method and device |
CN105893346A (en) * | 2016-03-30 | 2016-08-24 | 齐鲁工业大学 | Graph model word sense disambiguation method based on dependency syntax tree |
WO2017217661A1 (en) * | 2016-06-15 | 2017-12-21 | 울산대학교 산학협력단 | Word sense embedding apparatus and method using lexical semantic network, and homograph discrimination apparatus and method using lexical semantic network and word embedding |
CN106951684A (en) * | 2017-02-28 | 2017-07-14 | 北京大学 | A kind of method of entity disambiguation in medical conditions idagnostic logout |
CN107861939A (en) * | 2017-09-30 | 2018-03-30 | 昆明理工大学 | A kind of domain entities disambiguation method for merging term vector and topic model |
CN108959461A (en) * | 2018-06-15 | 2018-12-07 | 东南大学 | A kind of entity link method based on graph model |
Non-Patent Citations (1)
Title |
---|
LU WENPENG: "Research on Word Sense Disambiguation Methods Based on Dependency and Domain Knowledge", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110413989A (en) * | 2019-06-19 | 2019-11-05 | 北京邮电大学 | A kind of text field based on domain semantics relational graph determines method and system |
CN110413989B (en) * | 2019-06-19 | 2020-11-20 | 北京邮电大学 | Text field determination method and system based on field semantic relation graph |
CN110362691A (en) * | 2019-07-19 | 2019-10-22 | 大连语智星科技有限公司 | A kind of tree bank building system |
CN110598209A (en) * | 2019-08-21 | 2019-12-20 | 合肥工业大学 | Method, system and storage medium for extracting keywords |
CN110705295A (en) * | 2019-09-11 | 2020-01-17 | 北京航空航天大学 | Entity name disambiguation method based on keyword extraction |
CN110705295B (en) * | 2019-09-11 | 2021-08-24 | 北京航空航天大学 | Entity name disambiguation method based on keyword extraction |
CN110766072A (en) * | 2019-10-22 | 2020-02-07 | 探智立方(北京)科技有限公司 | Automatic generation method of computational graph evolution AI model based on structural similarity |
CN111310475B (en) * | 2020-02-04 | 2023-03-10 | 支付宝(杭州)信息技术有限公司 | Training method and device of word sense disambiguation model |
CN111310475A (en) * | 2020-02-04 | 2020-06-19 | 支付宝(杭州)信息技术有限公司 | Training method and device of word sense disambiguation model |
CN111783418A (en) * | 2020-06-09 | 2020-10-16 | 北京北大软件工程股份有限公司 | Chinese meaning representation learning method and device |
CN111783418B (en) * | 2020-06-09 | 2024-04-05 | 北京北大软件工程股份有限公司 | Chinese word meaning representation learning method and device |
CN112256885A (en) * | 2020-10-23 | 2021-01-22 | 上海恒生聚源数据服务有限公司 | Label disambiguation method, device, equipment and computer readable storage medium |
CN112256885B (en) * | 2020-10-23 | 2023-10-27 | 上海恒生聚源数据服务有限公司 | Label disambiguation method, device, equipment and computer readable storage medium |
CN113158687B (en) * | 2021-04-29 | 2021-12-28 | 新声科技(深圳)有限公司 | Semantic disambiguation method and device, storage medium and electronic device |
CN113158687A (en) * | 2021-04-29 | 2021-07-23 | 新声科技(深圳)有限公司 | Semantic disambiguation method and device, storage medium and electronic device |
CN115114397A (en) * | 2022-05-09 | 2022-09-27 | 泰康保险集团股份有限公司 | Annuity information updating method, device, electronic device, storage medium, and program |
CN115114397B (en) * | 2022-05-09 | 2024-05-31 | 泰康保险集团股份有限公司 | Annuity information updating method, annuity information updating device, electronic device, storage medium, and program |
Also Published As
Publication number | Publication date |
---|---|
CN109359303B (en) | 2023-04-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
EE01 | Entry into force of recordation of patent licensing contract | ||
EE01 | Entry into force of recordation of patent licensing contract |
Application publication date: 20190219 Assignee: SHANDONG ZHENGKAI NEW MATERIALS CO.,LTD. Assignor: ZAOZHUANG University Contract record no.: X2024980014476 Denomination of invention: A method and system for word sense disambiguation based on graph model Granted publication date: 20230407 License type: Common License Record date: 20240912 |