
CN105718443A - Adjective word sense disambiguation method based on dependency vocabulary association degree

Info

Publication number: CN105718443A
Application number: CN201610048601.9A
Authority: CN
Inventors: 鹿文鹏
Original assignee: Qilu University of Technology
Current assignee: Qilu University of Technology
Priority date: 2016-01-26
Filing date: 2016-01-26
Publication date: 2016-06-29

Abstract

The invention relates to an adjective word sense disambiguation method based on dependency vocabulary association degree and belongs to the technical field of natural language processing. The method comprises the following steps: first, according to a semantic dictionary, the synonyms, near-synonyms and antonyms of each word sense of a target ambiguous adjective are collected, and a relevant word set is built for each word sense; second, the sentence containing the target ambiguous word is dependency-parsed, the adjectival-modifier and adverbial-modifier dependency tuples containing the target ambiguous word are collected, and the corresponding dependency co-occurrence words are extracted; third, a large-scale corpus is dependency-parsed, its dependency co-occurrence word pairs are collected, and a dependency co-occurrence word pair database DB is built; fourth, according to DB, the dependency vocabulary association degree of each word sense of the target ambiguous word is calculated; fifth, the word sense with the largest overall dependency vocabulary association degree is judged to be the correct word sense. Compared with the prior art, the method selects dependency co-occurrence words accurately and avoids interference from noise words; the dependency co-occurrence word pair database is built automatically, with no manual assistance; and the adjective word sense disambiguation effect is improved.

Description

Adjective word sense disambiguation method based on dependency vocabulary association degree

Technical field

The present invention relates to an adjective word sense disambiguation method, in particular to an adjective word sense disambiguation method based on dependency vocabulary association degree, and belongs to the technical field of natural language processing.

Background art

Polysemy is ubiquitous in natural language. Word sense disambiguation means automatically determining the sense of a polysemous word from the context in which it occurs. It is a foundational research topic in natural language processing and directly affects machine translation, information retrieval, information extraction, sentiment analysis, public opinion monitoring and similar applications.

Word sense disambiguation methods can be divided into supervised methods, unsupervised methods and knowledge-based methods. Supervised methods judge the sense with a word sense classifier; unsupervised methods mainly classify senses by clustering the context words of the ambiguous word; knowledge-based methods judge the sense of the ambiguous word from its context with the help of a knowledge base. Supervised methods need a large amount of sense-annotated corpus to train the classifier, which seriously restricts their range of application; unsupervised methods are essentially word sense discrimination methods and cannot really be applied to large-scale word sense disambiguation tasks; knowledge-based methods rely on large knowledge bases, whose quality directly affects their disambiguation ability. Among these, knowledge-based methods are currently the only ones that can really be applied to large-scale word sense disambiguation tasks.

Knowledge-based methods combine the context of the ambiguous word with a knowledge base to judge its sense. Existing methods generally select the context with a sliding window, which inevitably introduces unrelated noise words; the knowledge bases they use are usually built manually, which is costly and hard to extend; and they often do not distinguish the part of speech of the ambiguous word, so they fail to exploit the characteristics specific to ambiguous words of different parts of speech.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art, chiefly to solve the problem of adjective word sense disambiguation, by proposing an adjective word sense disambiguation method based on dependency vocabulary association degree.

The object of the invention is achieved through the following technical solution.

An adjective word sense disambiguation method based on dependency vocabulary association degree, whose specific operation steps are as follows.

Step 1: according to a semantic dictionary, collect the synonyms, near-synonyms and antonyms of each word sense s_i of the target ambiguous adjective w_t, and build the relevant word set W_si of the corresponding word sense. Specifically:

Step 1.1: according to WordNet, take the synonym set of word sense concept s_i.

Step 1.2: according to WordNet, take the near-synonym set of word sense concept s_i.

Step 1.3: according to WordNet, take the antonym set of word sense concept s_i.

Step 1.4: take the union of the synonym set, near-synonym set and antonym set obtained in steps 1.1 to 1.3 to build the relevant word set W_si of the corresponding word sense.
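Steps 1.1 to 1.4 amount to one set union per word sense. The following is a minimal sketch in Python, assuming the three relation sets have already been looked up in WordNet and are held as plain dictionaries. The names `build_relevant_sets`, `SYNONYMS`, `NEAR_SYNONYMS` and `ANTONYMS` are illustrative and not part of the method; the entries are those listed for the adjective developed in the embodiment.

```python
# Hypothetical in-memory lookup tables; a real system would query WordNet.
SYNONYMS = {
    "developed#a#1": set(),
    "developed#a#2": {"highly-developed"},
    "developed#a#3": set(),
}
NEAR_SYNONYMS = {
    "developed#a#1": {"formed", "formulated", "mature", "matured"},
    "developed#a#2": {"industrial"},
    "developed#a#3": {"improved"},
}
ANTONYMS = {
    "developed#a#1": {"undeveloped"},
    "developed#a#2": set(),
    "developed#a#3": set(),
}


def build_relevant_sets(senses, synonyms, near_synonyms, antonyms):
    """Return {sense: W_si}, the union of the three relation sets (step 1.4)."""
    return {
        s: synonyms.get(s, set()) | near_synonyms.get(s, set()) | antonyms.get(s, set())
        for s in senses
    }


relevant = build_relevant_sets(list(SYNONYMS), SYNONYMS, NEAR_SYNONYMS, ANTONYMS)
print(sorted(relevant["developed#a#1"]))
```

Using sets rather than lists means a word that appears under more than one relation is stored once.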

Step 2: perform dependency parsing on the sentence containing the target ambiguous word, collect the adjectival-modifier (amod) and adverbial-modifier (advmod) dependency tuples containing the target ambiguous word, and extract the corresponding dependency co-occurrence words w_amod and w_advmod. Specifically:

Step 2.1: use a dependency parsing tool to parse the sentence containing the target ambiguous word and obtain its set of dependency tuples.

Step 2.2: from the set of dependency tuples obtained in step 2.1, extract the adjectival-modifier and adverbial-modifier dependency tuples containing the target ambiguous word.

Step 2.3: from the dependency tuples obtained in step 2.2, extract the dependency co-occurrence content words w_amod and w_advmod of the ambiguous word.

Step 3: perform dependency parsing on a large-scale corpus, collect the dependency co-occurrence word pairs in it, and build the dependency co-occurrence word pair database DB. Specifically:

Step 3.1: use a dependency parsing tool to parse the large-scale text corpus and obtain its set of dependency tuples DSet.

Step 3.2: discard the dependency relation type of each dependency tuple in DSet, count the dependency co-occurrence word pairs, and build the dependency co-occurrence word pair database DB.
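Step 3.2 can be sketched as a frequency count over (governor, dependent) pairs. The sketch below is an illustration only: the function name `build_db`, the in-memory `Counter` storage and the toy tuples are assumptions, and the source does not say how DB is stored. The two marginal counters are kept because formula (1) needs freq(w₁, *) and freq(*, w₂).

```python
from collections import Counter


def build_db(dset):
    """Drop the relation label of each (relation, governor, dependent) tuple
    and count how often each (governor, dependent) pair occurs."""
    pair_freq = Counter()       # freq(w1, w2)
    governor_freq = Counter()   # freq(w1, *)
    dependent_freq = Counter()  # freq(*, w2)
    for _relation, governor, dependent in dset:
        pair_freq[(governor, dependent)] += 1
        governor_freq[governor] += 1
        dependent_freq[dependent] += 1
    total = sum(pair_freq.values())  # N, the number of tuples in DSet
    return pair_freq, governor_freq, dependent_freq, total


# Toy DSet, invented for illustration only.
dset = [
    ("amod", "people", "sick"),
    ("amod", "people", "sick"),
    ("dep", "people", "sick"),  # same pair under another relation type
    ("amod", "country", "industrial"),
    ("advmod", "sick", "mentally"),
]
pair_freq, governor_freq, dependent_freq, total = build_db(dset)
print(pair_freq[("people", "sick")], total)
```

Because relation types are discarded, the total of all pair frequencies equals the number of tuples in DSet.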

Step 4: according to DB, calculate the dependency vocabulary association degree of each word sense of the target ambiguous word. Specifically:

Step 4.1: for each related word w_si in the relevant word set W_si of word sense s_i, use formula (1) to calculate its dependency vocabulary association degree with w_amod and with w_advmod, i.e. relatedness(w_amod, w_si) and relatedness(w_si, w_advmod).

$$\mathrm{relatedness}(w_1,w_2)=\mathrm{LLR}(w_1,w_2)=2\,[\mathrm{LogL}(p_1,a,a+b)+\mathrm{LogL}(p_2,c,c+d)-\mathrm{LogL}(p,a,a+b)-\mathrm{LogL}(p,c,c+d)] \tag{1}$$

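where

$$\mathrm{LogL}(p,k,n)=k\log p+(n-k)\log(1-p),\qquad p_1=\frac{a}{a+b},\qquad p_2=\frac{c}{c+d},\qquad p=\frac{a+c}{a+b+c+d};$$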

a = freq(w₁, w₂) is the number of dependency tuples whose governor is w₁ and whose dependent is w₂;

b = freq(w₁, *) - a is the number of dependency tuples whose governor is w₁ but whose dependent is not w₂;

c = freq(*, w₂) - a is the number of dependency tuples whose dependent is w₂ but whose governor is not w₁;

d = N - a - b - c is the number of dependency tuples whose governor is not w₁ and whose dependent is not w₂;

N is the total number of dependency tuples in the corpus.
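Formula (1) can be sketched in Python as follows. The sketch assumes the usual binomial log-likelihood LogL(p, k, n) = k·log p + (n − k)·log(1 − p) with p₁ = a/(a+b), p₂ = c/(c+d) and p = (a+c)/N, which is the standard form of the log-likelihood ratio. It also assumes that a pair never observed in DB (a = 0) scores 0, which matches the zeros reported in the embodiment but is not stated as a rule in the text. The function names are illustrative.

```python
import math


def log_l(p, k, n):
    """Binomial log-likelihood k*log(p) + (n-k)*log(1-p); 0*log(0) counts as 0."""
    result = 0.0
    if k > 0:
        result += k * math.log(p)
    if n - k > 0:
        result += (n - k) * math.log(1.0 - p)
    return result


def llr(a, b, c, d):
    """Formula (1) from the four cells of the governor/dependent contingency table."""
    # Assumption: unseen pairs (a == 0) and an empty second row score 0.
    if a == 0 or c + d == 0:
        return 0.0
    p1 = a / (a + b)
    p2 = c / (c + d)
    p = (a + c) / (a + b + c + d)
    return 2.0 * (log_l(p1, a, a + b) + log_l(p2, c, c + d)
                  - log_l(p, a, a + b) - log_l(p, c, c + d))


def relatedness(pair_count, governor_count, dependent_count, n):
    """Derive a, b, c, d from freq(w1, w2), freq(w1, *), freq(*, w2) and N."""
    a = pair_count
    b = governor_count - a
    c = dependent_count - a
    d = n - a - b - c
    return llr(a, b, c, d)


# Invented counts: the pair occurs far more often than chance would predict.
print(relatedness(pair_count=50, governor_count=100, dependent_count=100, n=10000))
```

When the pair occurs exactly as often as independence predicts (p₁ = p₂ = p), the four terms cancel and the score is 0; the score grows as the observed co-occurrence departs from that baseline.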

Step 4.2: use formula (2) to calculate the overall dependency vocabulary association degree between word sense s_i and the dependency co-occurrence words w_amod and w_advmod.

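$$\mathrm{relatedness}(s_i)=\mathrm{relatedness}(w_{amod},W_{s_i})+\mathrm{relatedness}(W_{s_i},w_{advmod}) \tag{2}$$

where

$$\mathrm{relatedness}(w_{amod},W_{s_i})=\max_{w_{s_i}\in W_{s_i}}\mathrm{relatedness}(w_{amod},w_{s_i}),\qquad \mathrm{relatedness}(W_{s_i},w_{advmod})=\max_{w_{s_i}\in W_{s_i}}\mathrm{relatedness}(w_{s_i},w_{advmod});$$

W_si is the relevant word set of word sense s_i obtained in step 1.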

Step 5: judge the word sense with the largest overall dependency vocabulary association degree to be the correct word sense. Specifically:

Compare the overall dependency vocabulary association degrees of the word senses obtained in step 4.2, and judge the word sense with the largest value to be the correct word sense of the ambiguous word.

The above steps determine the word sense of the ambiguous adjective and complete the word sense disambiguation task.
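Steps 4.2 and 5 can be sketched together: each sense is scored by its best governor-side pair plus its best dependent-side pair, and the highest-scoring sense is returned. In this sketch the pairwise scorer is passed in as a function, so any implementation of formula (1) can be plugged in; here it is a dictionary lookup filled with the pairwise values quoted in the embodiment for developed. The names `overall_relatedness`, `disambiguate`, `PAIR_SCORES` and `RELEVANT` are illustrative.

```python
def overall_relatedness(pair_score, related_words, w_amod, w_advmod):
    """Formula (2): best governor-side score plus best dependent-side score."""
    amod_part = max((pair_score(w_amod, w) for w in related_words), default=0.0)
    advmod_part = max((pair_score(w, w_advmod) for w in related_words), default=0.0)
    return amod_part + advmod_part


def disambiguate(pair_score, relevant_sets, w_amod, w_advmod):
    """Step 5: return the sense with the largest overall score, plus all scores."""
    scores = {
        sense: overall_relatedness(pair_score, words, w_amod, w_advmod)
        for sense, words in relevant_sets.items()
    }
    return max(scores, key=scores.get), scores


# Pairwise scores quoted in the embodiment for "developed"; absent pairs score 0.
PAIR_SCORES = {
    ("country", "undeveloped"): 22.751748,
    ("mature", "most"): 7.076829,
    ("undeveloped", "most"): 1.862240,
    ("country", "industrial"): 611.842281,
    ("industrial", "most"): 16.894161,
}
RELEVANT = {
    "developed#a#1": ["formed", "formulated", "mature", "matured", "undeveloped"],
    "developed#a#2": ["highly-developed", "industrial"],
    "developed#a#3": ["improved"],
}

best, scores = disambiguate(
    lambda w1, w2: PAIR_SCORES.get((w1, w2), 0.0), RELEVANT, "country", "most"
)
print(best, scores)
```

Note that the two maxima may come from different related words, as for developed#a#1, where the governor-side maximum comes from undeveloped and the dependent-side maximum from mature.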

Beneficial effects

The adjective word sense disambiguation method based on dependency vocabulary association degree proposed by the invention uses dependency parsing to obtain the dependency co-occurrence words of an adjective, and calculates the dependency vocabulary association degree of each word sense from an automatically built dependency co-occurrence word pair database, thereby judging the correct word sense of the adjective. Compared with traditional word sense disambiguation methods, the proposed method selects dependency co-occurrence words more accurately by exploiting the characteristics of adjectives, which effectively avoids interference from unrelated noise words; it builds the dependency co-occurrence word pair database automatically, without any manual assistance, so the database is easy to extend. The proposed method can improve the effect of adjective word sense disambiguation.

Detailed description of the invention

The specific embodiment of the invention is described in further detail below with reference to an example.

For the sentence "The large number of mentally ill people tend to commit suicide in most developed countries.", the ambiguous adjectives ill and developed are disambiguated.

According to the WordNet 3.0 dictionary, the word senses of the ambiguous adjectives ill and developed are as shown in Table 1 and Table 2.

Table 1. Word senses of the adjective ill

Sense number	Gloss
ill#a#1	ill, sick -- (affected by an impairment of normal physical or mental function; "ill from the monotony of his suffering")
ill#a#2	ill -- (resulting in suffering or adversity; "ill effects"; "it's an ill wind that blows no good")
ill#a#3	ill -- (distressing; "ill manners"; "of ill repute")
ill#a#4	ill -- (indicating hostility or enmity; "you certainly did me an ill turn"; "ill feelings"; "ill will")
ill#a#5	ill, inauspicious, ominous -- (presaging ill fortune; "ill omens"; "ill predictions"; "a by-election at a time highly unpropitious for the government")

Here, #a indicates that the part of speech is adjective, and #1 to #5 are the sense numbers.
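The sense keys used in the tables can be split mechanically. A small sketch, with the function name `parse_sense_key` chosen here for illustration:

```python
def parse_sense_key(key):
    """Split a key such as 'ill#a#1' into (lemma, part of speech, sense number)."""
    lemma, pos, number = key.rsplit("#", 2)
    return lemma, pos, int(number)


print(parse_sense_key("ill#a#1"))
```

Splitting from the right keeps the lemma intact even if it were to contain a # character itself.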

Table 2. Word senses of the adjective developed

Sense number	Gloss
developed#a#1	developed -- (being changed over time so as to be e.g. stronger or more complete or more useful; "they have very small limbs with only two fully developed toes on each")
developed#a#2	developed, highly-developed -- ((used of societies) having high industrial development; "developed countries")
developed#a#3	developed -- ((of real estate) made more useful and profitable as by building or laying out roads; "condominiums were built on the developed site")

Here, #a indicates that the part of speech is adjective, and #1 to #3 are the sense numbers.

In this example, the synonyms, near-synonyms and antonyms of each word sense of ill and developed, obtained from WordNet, are shown in Table 3 and Table 4.

In this example, the relevant word sets of each word sense of ill and developed are shown in Table 5 and Table 6.

Table 3. Related words of each word sense of the adjective ill

Sense number	Synonyms	Near-synonyms	Antonyms
ill#a#1	sick	afflicted stricken aguish ailing indisposed peaked poorly sickly unwell seedy airsick carsick seasick autistic bedfast bedridden bedrid sick-abed bilious liverish livery bronchitic consumptive convalescent recovering delirious hallucinating diabetic dizzy giddy woozy vertiginous dyspeptic faint light swooning light-headed lightheaded feverish feverous funny gouty green milk-sick nauseated nauseous queasy sickish palsied paralytic paralyzed paraplegic rickety rachitic scrofulous sneezy spastic tubercular tuberculous unhealed upset	well
ill#a#2	-	harmful	-
ill#a#3	-	bad	-
ill#a#4	-	hostile	-
ill#a#5	inauspicious ominous	unpropitious	-

Table 4. Related words of each word sense of the adjective developed

Sense number	Synonyms	Near-synonyms	Antonyms
developed#a#1	-	formed formulated mature matured	undeveloped
developed#a#2	highly-developed	industrial	-
developed#a#3	-	improved	-

Table 5. Relevant word set of each word sense of the adjective ill

Sense number	Relevant word set
ill#a#1	sick afflicted stricken aguish ailing indisposed peaked poorly sickly unwell seedy airsick carsick seasick autistic bedfast bedridden bedrid sick-abed bilious liverish livery bronchitic consumptive convalescent recovering delirious hallucinating diabetic dizzy giddy woozy vertiginous dyspeptic faint light swooning light-headed lightheaded feverish feverous funny gouty green milk-sick nauseated nauseous queasy sickish palsied paralytic paralyzed paraplegic rickety rachitic scrofulous sneezy spastic tubercular tuberculous unhealed upset well
ill#a#2	harmful
ill#a#3	bad
ill#a#4	hostile
ill#a#5	inauspicious ominous unpropitious

Table 6. Relevant word set of each word sense of the adjective developed

Sense number	Relevant word set
developed#a#1	formed formulated mature matured undeveloped
developed#a#2	highly-developed industrial
developed#a#3	improved

In this example, the sentence is parsed with the Stanford Parser provided by Stanford University, using the englishPCFG.ser.gz language model, and WordNet 3.0 is used for lemmatization. The resulting set of dependency tuples of the sentence is: det(number-3, the-1), amod(number-3, large-2), nsubj(tend-8, number-3), xsubj(commit-10, number-3), advmod(ill-6, mentally-5), amod(people-7, ill-6), prep_of(number-3, people-7), aux(commit-10, to-9), xcomp(tend-8, commit-10), dobj(commit-10, suicide-11), advmod(developed-14, most-13), amod(country-15, developed-14), prep_in(suicide-11, country-15).

In this example, for the ambiguous word ill, the tuples amod(people-7, ill-6) and advmod(ill-6, mentally-5) are extracted; for the ambiguous word developed, the tuples amod(country-15, developed-14) and advmod(developed-14, most-13) are extracted.

In this example, for the ambiguous word ill, w_amod is people and w_advmod is mentally; for the ambiguous word developed, w_amod is country and w_advmod is most.
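The extraction in steps 2.2 and 2.3 can be sketched on the tuples of the example sentence. The sketch parses the textual tuple notation with a regular expression and then picks the governor of the amod tuple and the dependent of the advmod tuple. The names `parse_tuples` and `cooccurrence_words` are illustrative, and matching the target by word form alone (ignoring the token index) is a simplifying assumption that would need refining for sentences in which the target word occurs more than once.

```python
import re

# Dependency tuples of the example sentence, in the notation used above.
PARSE = (
    "det(number-3, the-1), amod(number-3, large-2), nsubj(tend-8, number-3), "
    "xsubj(commit-10, number-3), advmod(ill-6, mentally-5), amod(people-7, ill-6), "
    "prep_of(number-3, people-7), aux(commit-10, to-9), xcomp(tend-8, commit-10), "
    "dobj(commit-10, suicide-11), advmod(developed-14, most-13), "
    "amod(country-15, developed-14), prep_in(suicide-11, country-15)"
)

# relation(governor-index, dependent-index)
TUPLE_RE = re.compile(
    r"(\w+)\s*\(\s*([^\s,()]+)-(\d+)\s*,\s*([^\s,()]+)-(\d+)\s*\)"
)


def parse_tuples(text):
    """Return a list of (relation, governor, dependent) triples."""
    return [(rel, gov, dep) for rel, gov, _gi, dep, _di in TUPLE_RE.findall(text)]


def cooccurrence_words(tuples, target):
    """Steps 2.2 and 2.3: pick w_amod and w_advmod for the target adjective."""
    w_amod = w_advmod = None
    for rel, gov, dep in tuples:
        if rel == "amod" and dep == target:
            w_amod = gov      # the word that the adjective modifies
        elif rel == "advmod" and gov == target:
            w_advmod = dep    # the adverb that modifies the adjective
    return w_amod, w_advmod


tuples = parse_tuples(PARSE)
print(cooccurrence_words(tuples, "ill"))
print(cooccurrence_words(tuples, "developed"))
```

Either word may be missing for a given sentence, in which case the function returns None in that position.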

In this example, the dependency parsing tool is the Stanford Parser provided by Stanford University, using the englishPCFG.ser.gz language model, with WordNet 3.0 used for lemmatization. The large-scale text corpus is the Reuters Corpus provided by Reuters. The Stanford Parser parses the text of the Reuters Corpus sentence by sentence, and the resulting dependency tuples are collected and stored in the dependency tuple set DSet. In this example, the final DSet contains 93850841 dependency tuples in total.

In this example, the dependency relation type of each dependency tuple in DSet is discarded and only the governor and the dependent are retained; the co-occurrence frequency of each dependency co-occurrence word pair formed by a governor and a dependent is counted, and the dependency co-occurrence word pair database DB is built.

In this example, the final DB contains 9269109 dependency co-occurrence word pairs, whose co-occurrence frequencies sum to 93850841.

In formula (1), N is the total number of dependency tuples in the corpus, which in this example is 93850841.

In this example, for the ambiguous word ill, w_amod is people and w_advmod is mentally; formula (1) is used to calculate the dependency vocabulary association degree of the related words of each of its word senses.

For ill#a#1, the dependency vocabulary association degrees of the related words sick, sickly, light, funny and green with people are 414.633560, 2.797437, 10.267433, 10.214535 and 3.727571 respectively; those of its other related words are 0.

For ill#a#1, the dependency vocabulary association degree of the related word sick with mentally is 36.692474; those of its other related words are 0.

For ill#a#2, the dependency vocabulary association degrees of the related word harmful with people and with mentally are both 0.

For ill#a#3, the dependency vocabulary association degrees of the related word bad with people and with mentally are 0.703737 and 0 respectively.

For ill#a#4, the dependency vocabulary association degrees of the related word hostile with people and with mentally are 0.609087 and 0 respectively.

For ill#a#5, the dependency vocabulary association degrees of the related words inauspicious, ominous and unpropitious with people and with mentally are all 0.

For the ambiguous word developed, w_amod is country and w_advmod is most; formula (1) is used to calculate the dependency vocabulary association degree of the related words of each of its word senses.

For developed#a#1, the dependency vocabulary association degrees of the related words formed, formulated, mature, matured and undeveloped with country are 0, 0, 0, 0 and 22.751748 respectively; with most they are 0, 0, 7.076829, 0 and 1.862240 respectively.

For developed#a#2, the dependency vocabulary association degrees of the related words highly-developed and industrial with country are 0 and 611.842281 respectively; with most they are 0 and 16.894161 respectively.

For developed#a#3, the dependency vocabulary association degree of the related word improved with country is 0; with most it is also 0.

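$$\mathrm{relatedness}(s_i)=\mathrm{relatedness}(w_{amod},W_{s_i})+\mathrm{relatedness}(W_{s_i},w_{advmod}) \tag{2}$$

where

$$\mathrm{relatedness}(w_{amod},W_{s_i})=\max_{w_{s_i}\in W_{s_i}}\mathrm{relatedness}(w_{amod},w_{s_i}),\qquad \mathrm{relatedness}(W_{s_i},w_{advmod})=\max_{w_{s_i}\in W_{s_i}}\mathrm{relatedness}(w_{s_i},w_{advmod}).$$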

In this example, for the ambiguous word ill, relatedness(ill#a#1) = relatedness("people", W_ill#a#1) + relatedness(W_ill#a#1, "mentally") = max(414.633560, 2.797437, 10.267433, 10.214535, 3.727571, 0, 0, …, 0) + max(36.692474, 0, 0, …, 0) = 414.633560 + 36.692474 = 451.326034.

Similarly, relatedness(ill#a#2) = 0; relatedness(ill#a#3) = 0.703737; relatedness(ill#a#4) = 0.609087; relatedness(ill#a#5) = 0.

For the ambiguous word developed, relatedness(developed#a#1) = relatedness("country", W_developed#a#1) + relatedness(W_developed#a#1, "most") = max(0, 0, 0, 0, 22.751748) + max(0, 0, 7.076829, 0, 1.862240) = 22.751748 + 7.076829 = 29.828577.

Similarly, relatedness(developed#a#2) = 628.736442; relatedness(developed#a#3) = 0.

In this example, for the ambiguous word ill, step 4.2 gives overall dependency vocabulary association degrees of 451.326034, 0, 0.703737, 0.609087 and 0 for ill#a#1, ill#a#2, ill#a#3, ill#a#4 and ill#a#5 respectively. The value for ill#a#1 is the largest, so ill#a#1 is judged to be the correct word sense of the ambiguous word ill.

For the ambiguous word developed, step 4.2 gives overall dependency vocabulary association degrees of 29.828577, 628.736442 and 0 for developed#a#1, developed#a#2 and developed#a#3 respectively. The value for developed#a#2 is the largest, so developed#a#2 is judged to be the correct word sense of the ambiguous word developed.

As described above, the invention provides an adjective word sense disambiguation method based on dependency vocabulary association degree. The user inputs a sentence and indicates the target ambiguous adjective, and the system automatically judges the word sense of the target adjective.

The specific description above explains the object, technical solution and beneficial effects of the invention in detail. It should be understood that the foregoing is only a specific embodiment of the invention and is not intended to limit its scope of protection; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. An adjective word sense disambiguation method based on dependency vocabulary association degree, characterized in that its specific operation steps are:

Step 1: according to a semantic dictionary, collect the synonyms, near-synonyms and antonyms of each word sense s_i of the target ambiguous adjective w_t, and build the relevant word set W_si of the corresponding word sense; specifically:

Step 1.1: according to WordNet, take the synonym set of word sense concept s_i;

Step 1.2: according to WordNet, take the near-synonym set of word sense concept s_i;

Step 1.3: according to WordNet, take the antonym set of word sense concept s_i;

Step 1.4: take the union of the synonym set, near-synonym set and antonym set obtained in steps 1.1 to 1.3 to build the relevant word set W_si of the corresponding word sense;

Step 2: perform dependency parsing on the sentence containing the target ambiguous word, collect the adjectival-modifier and adverbial-modifier dependency tuples containing the target ambiguous word, and extract the corresponding dependency co-occurrence words w_amod and w_advmod; specifically:

Step 2.1: use a dependency parsing tool to parse the sentence containing the target ambiguous word and obtain its set of dependency tuples;

Step 2.2: from the set of dependency tuples obtained in step 2.1, extract the adjectival-modifier and adverbial-modifier dependency tuples containing the target ambiguous word;

Step 2.3: from the dependency tuples obtained in step 2.2, extract the dependency co-occurrence content words w_amod and w_advmod of the ambiguous word;

Step 3: perform dependency parsing on a large-scale corpus, collect the dependency co-occurrence word pairs in it, and build the dependency co-occurrence word pair database DB; specifically:

Step 3.1: use a dependency parsing tool to parse the large-scale text corpus and obtain its set of dependency tuples DSet;

Step 3.2: discard the dependency relation type of each dependency tuple in DSet, count the dependency co-occurrence word pairs, and build the dependency co-occurrence word pair database DB;

Step 4: according to DB, calculate the dependency vocabulary association degree of each word sense of the target ambiguous word; specifically:

Step 4.1: for each related word w_si in the relevant word set W_si of word sense s_i, use formula (1) to calculate its dependency vocabulary association degree with w_amod and with w_advmod, i.e. relatedness(w_amod, w_si) and relatedness(w_si, w_advmod);

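$$\mathrm{relatedness}(w_1,w_2)=\mathrm{LLR}(w_1,w_2)=2\,[\mathrm{LogL}(p_1,a,a+b)+\mathrm{LogL}(p_2,c,c+d)-\mathrm{LogL}(p,a,a+b)-\mathrm{LogL}(p,c,c+d)] \tag{1}$$

where

$$\mathrm{LogL}(p,k,n)=k\log p+(n-k)\log(1-p),\qquad p_1=\frac{a}{a+b},\qquad p_2=\frac{c}{c+d},\qquad p=\frac{a+c}{a+b+c+d};$$

a = freq(w₁, w₂) is the number of dependency tuples whose governor is w₁ and whose dependent is w₂;

b = freq(w₁, *) - a is the number of dependency tuples whose governor is w₁ but whose dependent is not w₂;

c = freq(*, w₂) - a is the number of dependency tuples whose dependent is w₂ but whose governor is not w₁;

d = N - a - b - c is the number of dependency tuples whose governor is not w₁ and whose dependent is not w₂;

N is the total number of dependency tuples in the corpus;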

Step 4.2: use formula (2) to calculate the overall dependency vocabulary association degree between word sense s_i and the dependency co-occurrence words w_amod and w_advmod;

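$$\mathrm{relatedness}(s_i)=\mathrm{relatedness}(w_{amod},W_{s_i})+\mathrm{relatedness}(W_{s_i},w_{advmod}) \tag{2}$$

where

$$\mathrm{relatedness}(w_{amod},W_{s_i})=\max_{w_{s_i}\in W_{s_i}}\mathrm{relatedness}(w_{amod},w_{s_i}),\qquad \mathrm{relatedness}(W_{s_i},w_{advmod})=\max_{w_{s_i}\in W_{s_i}}\mathrm{relatedness}(w_{s_i},w_{advmod});$$

W_si is the relevant word set of word sense s_i obtained in step 1;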

Step 5: judge the word sense with the largest overall dependency vocabulary association degree to be the correct word sense; specifically:

Compare the overall dependency vocabulary association degrees of the word senses obtained in step 4.2, and judge the word sense with the largest value to be the correct word sense of the ambiguous word.