CN105677639A - English word sense disambiguation method based on phrase structure syntax tree

Info

Publication number: CN105677639A
Application number: CN201610011045.8A
Authority: CN
Inventors: 鹿文鹏; 成金勇; 张维玉
Original assignee: Qilu University of Technology
Current assignee: Qilu University of Technology
Priority date: 2016-01-10
Filing date: 2016-01-10
Publication date: 2016-06-15

Abstract

The invention relates to an English word sense disambiguation method based on a phrase structure syntax tree and belongs to the field of natural language processing. The method comprises the following steps: 1, phrase structure syntactic analysis is performed on a sentence to generate its phrase structure syntax tree; 2, word sense related words are screened on the basis of the phrase structure syntax tree; 3, a word sense disambiguation model is constructed, and the correct word sense is determined by evaluating the closeness between each word sense of the ambiguous word and the word sense related words; 4, the parameters of the word sense disambiguation model of step 3 are optimized with a genetic algorithm on a word sense tagged corpus; 5, steps 1 and 2 are repeated for the word to be disambiguated, and its correct word sense is determined with the optimized word sense disambiguation model obtained in step 4. Because the phrase structure syntax tree is used both to screen the word sense related words and to assign them disambiguation weights, interference from noise words is reduced, word sense relatedness is computed more accurately, and the accuracy of English word sense disambiguation is improved.

Description

English word meaning disambiguation method based on phrase structure syntax tree

Technical Field

The invention relates to an English word meaning disambiguation method, in particular to an English word meaning disambiguation method based on a phrase structure syntax tree, and belongs to the technical field of natural language processing.

Background

Word sense disambiguation refers to determining the correct word sense of an ambiguous word according to the context in which the word is located. The word sense is a basic unit constituting the meaning of a sentence and is a precondition for understanding a sentence. Word sense disambiguation belongs to basic tasks in the field of natural language processing, and has wide application requirements in the fields of machine translation, information retrieval, text classification, question and answer systems and the like.

The word sense of an ambiguous word is determined by its context. Whether the context words related to its word sense can be selected accurately directly affects the performance of a word sense disambiguation system. Existing word sense disambiguation methods generally select context words with a sliding window, that is, they take the ambiguous word as the center and select the words within a certain distance to its left and right. This approach considers only the linear distance between words in the sentence and ignores their grammatical and semantic relations. As a result, it cannot filter out nearby noise words and easily misses related words that are far away.

The word sense of an ambiguous word is typically determined by comparing how close each of its word senses is to the context words related to its word sense. Whether this closeness can be calculated accurately has a decisive influence on the performance of a word sense disambiguation system. Related words at different distances influence the sense of the ambiguous word to different degrees and therefore need to be given appropriate disambiguation weights. Existing methods generally treat all such context words as equally weighted, which cannot reflect the difference between words at different distances and makes it difficult to evaluate the closeness accurately.

In view of the above problems, the present application provides an English word sense disambiguation method based on a phrase structure syntax tree, which uses the phrase structure syntax tree to screen word sense related words and assign them disambiguation weights, and determines the correct word sense according to the closeness between each word sense and the context words related to it.

Disclosure of Invention

The invention aims to overcome the shortcomings of existing word sense disambiguation techniques. It mainly addresses the screening and weighting of context word sense related words and the calculation of word sense relatedness, and provides a novel English word sense disambiguation method based on a phrase structure syntax tree.

The purpose of the invention is realized by the following technical scheme.

An English word meaning disambiguation method based on a phrase structure syntax tree comprises the following specific operation steps.

Step one, performing phrase structure syntactic analysis on a sentence to generate its phrase structure syntax tree; the details are as follows.

Step 1.1: the sentence to be processed is denoted by the symbol S.

Step 1.2: preprocess the sentence S, mainly by removing garbled characters and special symbols and by tokenizing the English words, to obtain the preprocessed sentence S'.

Step 1.3: perform phrase structure syntactic analysis on the sentence S' with a phrase structure syntactic parser to generate the phrase structure syntax tree T.

Step 1.4: lemmatize the words in the phrase structure syntax tree T.
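The preprocessing of step 1.2 can be sketched as follows. This is only an illustration under stated assumptions: the function name `preprocess`, the set of punctuation that is kept, and the input sentence are hypothetical, and parsing and lemmatization (steps 1.3 and 1.4) are not covered because they rely on external tools.

```python
import re
import unicodedata


def preprocess(sentence):
    """Remove garbled characters and special symbols, then tokenize (step 1.2)."""
    # Drop control, unassigned and private-use characters, but keep whitespace.
    cleaned = "".join(
        ch for ch in sentence
        if ch.isspace() or unicodedata.category(ch)[0] != "C"
    )
    # Replace any symbol outside letters, digits, whitespace and basic punctuation.
    cleaned = re.sub(r"[^A-Za-z0-9\s.,;:!?'\-]", " ", cleaned)
    # Split into word tokens (allowing inner apostrophes or hyphens) and punctuation tokens.
    return re.findall(r"[A-Za-z0-9]+(?:['\-][A-Za-z0-9]+)*|[.,;:!?]", cleaned)


tokens = preprocess("The coach\tis standing on the bus.\x00 @#")
print(tokens)
```

The token list would then be handed to a phrase structure parser; which characters count as "special symbols" is a design choice that depends on the corpus.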

Step two, calculating the hierarchical distance and the path distance between the ambiguous word and other words in the sentence based on the phrase structure syntax tree, and screening out words related to the word meaning; the details are as follows.

Step 2.1: let w_t denote the ambiguous word to be disambiguated, let w denote any other word in the sentence, and let W denote the set of all content words in the sentence other than w_t.

Step 2.2: from the phrase structure syntax tree T, compute the hierarchical distance d_l between the ambiguous word w_t and each other word w, and store d_l with w in W.

Step 2.3: from the phrase structure syntax tree T, compute the path distance d_p between the ambiguous word w_t and each other word w, and store d_p with w in W.

Step 2.4: specify a hierarchical distance parameter d_layer and a path distance parameter d_path, select from W the words whose d_l is not greater than d_layer and whose d_p is not greater than d_path, and construct from them the word sense related word set R of the ambiguous word.

Step three, constructing a word sense disambiguation model, and judging correct word senses by evaluating the closeness degree of each word sense of ambiguous words and words related to the word senses; the details are as follows.

Step 3.1: for each word w in the word sense related word set R, calculate its disambiguation weight from its hierarchical distance d_l and its path distance d_p by formula (1).

$$\mathrm{weight}(w) = \frac{\alpha^{d_l}}{1 + \beta \log_{10} d_p} \qquad (1)$$

where α and β are the adjustment parameters for the hierarchical distance d_l and the path distance d_p, respectively.

Step 3.2: for each word sense s_i of the ambiguous word w_t, calculate its closeness to the word sense related word set R by formula (2).

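$$\mathrm{relatedness}(s_i) = \sum_{w_j \in R} \mathrm{weight}(w_j) \cdot \frac{\mathrm{wnss}(s_i, w_j)}{\sum_{s_k \in \mathrm{sense}(w_t)} \mathrm{wnss}(s_k, w_j)} \qquad (2)$$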

where s_i denotes the i-th word sense of the ambiguous word w_t; sense(w_t) denotes the set of all word senses of w_t, with s_i ∈ sense(w_t); w_j denotes the j-th word sense related word; R denotes the set of all word sense related words of w_t, with w_j ∈ R; weight(w_j) denotes the disambiguation weight of w_j calculated by formula (1); and wnss(s_i, w_j) denotes the word sense relatedness between the word sense s_i and the word sense related word w_j.

Step 3.3: based on the closeness of each word sense s_i to the word sense related word set R obtained in step 3.2, select the word sense with the highest closeness as the correct word sense of the ambiguous word.

Step four, using a word sense tagged corpus, optimizing the parameters of the word sense disambiguation model of step three with a genetic algorithm to obtain an optimized word sense disambiguation model; the details are as follows.

Step 4.1: select an appropriate word sense tagged corpus Corpus.

Step 4.2: collect each ambiguous word in Corpus together with the sentence in which it occurs and its correct word sense label, and construct the training data set C_train of the word sense disambiguation model.

Step 4.3: take the hierarchical distance parameter d_layer and the path distance parameter d_path of step 2.4 and the adjustment parameters α and β of step 3.1 as the input vector of the genetic algorithm, take formula (3) as the objective function of the genetic algorithm, and perform optimization training on C_train to obtain the optimal values of d_layer, d_path, α and β.

(3)

Wherein precision is disambiguation accuracy, and the value of precision is the ratio of the number of correctly disambiguated ambiguous words to the total number of ambiguous words.

Step 4.4: substitute the d_layer and d_path obtained in step 4.3 into step 2.4, and substitute α and β into formula (1), to complete the parameter optimization of the word sense disambiguation model.
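The parameter search of step four can be illustrated with a small, self-contained genetic algorithm. Everything in this sketch is an assumption made for illustration: the training instances in `TRAIN` are invented, the parameter bounds in `BOUNDS` are arbitrary, the objective is simply taken to be maximizing precision (the exact form of formula (3) is not available in this text), and the weight expression inside `predict` is a form inferred from the weights reported in the detailed description rather than a quoted formula. It is a plain-Python stand-in, not the solver used in the embodiment.

```python
import math
import random

# Hypothetical training data: (gold sense, [(d_l, d_p, {sense: normalized wnss}), ...])
TRAIN = [
    ("s1", [(1, 4, {"s1": 0.8, "s2": 0.2}), (2, 9, {"s1": 0.1, "s2": 0.9})]),
    ("s2", [(1, 3, {"s1": 0.3, "s2": 0.7}), (3, 11, {"s1": 0.6, "s2": 0.4})]),
]
# Assumed search ranges for d_layer, d_path, alpha, beta
BOUNDS = [(1, 5), (1, 15), (0.0, 1.0), (0.0, 3.0)]


def predict(params, context):
    """Screen context words (step 2.4), weight them, and return the best sense."""
    d_layer, d_path, alpha, beta = params
    scores = {}
    for d_l, d_p, shares in context:
        if d_l > d_layer or d_p > d_path:
            continue  # word is not a word sense related word
        weight = alpha ** d_l / (1.0 + beta * math.log10(d_p))  # assumed form
        for sense, share in shares.items():
            scores[sense] = scores.get(sense, 0.0) + weight * share
    return max(scores, key=scores.get) if scores else None


def precision(params):
    """Share of training instances whose predicted sense equals the gold sense."""
    hits = sum(predict(params, ctx) == gold for gold, ctx in TRAIN)
    return hits / len(TRAIN)


def random_gene(i, rng):
    lo, hi = BOUNDS[i]
    return rng.randint(lo, hi) if isinstance(lo, int) else rng.uniform(lo, hi)


def genetic_search(generations=30, size=20, rate=0.3, seed=0):
    rng = random.Random(seed)
    population = [[random_gene(i, rng) for i in range(4)] for _ in range(size)]
    for _ in range(generations):
        population.sort(key=precision, reverse=True)
        elite = population[: size // 2]  # keep the better half unchanged
        children = []
        while len(elite) + len(children) < size:
            a, b = rng.choice(elite), rng.choice(elite)
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]  # uniform crossover
            child = [random_gene(i, rng) if rng.random() < rate else g
                     for i, g in enumerate(child)]  # random-reset mutation
            children.append(child)
        population = elite + children
    best = max(population, key=precision)
    return best, precision(best)


best_params, best_precision = genetic_search()
print(best_params, best_precision)
```

Because the elite half is carried over unchanged, the best precision in the population never decreases from one generation to the next; a real run would evaluate `precision` on C_train instead of the toy data.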

Step five, repeating the step one and the step two for the word to be disambiguated, and judging the correct word sense of the ambiguous word by using the optimized word sense disambiguation model obtained in the step four; the details are as follows.

Step 5.1: following step one, generate the phrase structure syntax tree T of the sentence containing the word w_t to be disambiguated.

Step 5.2: following step two, obtain the hierarchical distance and the path distance between w_t and the other words in the sentence, screen the word sense related words using the d_layer and d_path obtained in step four, and construct the word sense related word set R.

Step 5.3: following step 3.1, calculate the disambiguation weight of each word in the word sense related word set R using the α and β obtained in step four.

Step 5.4: following step 3.2, calculate the closeness between each word sense s_i of the ambiguous word w_t and the word sense related word set R.

Step 5.5: following step 3.3, determine the correct word sense of the ambiguous word w_t.

Through the operations of the steps, the word senses of the English ambiguous words can be judged, and the word sense disambiguation task is completed.

Advantageous effects

The invention provides an English word meaning disambiguation method based on a phrase structure syntax tree, which uses the phrase structure syntax tree as a screening basis of context word meaning related words of ambiguous words; giving disambiguation weight to the word meaning related words according to the hierarchical distance and the path distance of the word meaning related words and the ambiguous words on a phrase structure syntax tree; and judging the correct word senses according to the association degree of each word sense of the ambiguous words and the words related to the context word sense. Compared with the existing English word meaning disambiguation method, the English word meaning disambiguation method based on the phrase structure syntax tree can more accurately screen the words related to the context word meaning, endow the words related to the context word meaning with proper disambiguation weight, and more accurately calculate the closeness degree of the ambiguous word meaning and the words related to the context word meaning. The method can effectively avoid the problems of inaccurate screening and weighting of the word meaning related words in the traditional method, improve the calculation precision of the word meaning correlation degree and improve the accuracy of English word meaning disambiguation.

Drawings

FIG. 1 shows the phrase structure syntax tree of the example sentence in an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to specific examples.

Take the sentence "The coach teaching football is standing on the bus." as an example, and perform disambiguation on the ambiguous word coach in it.

According to the WordNet 3.0 dictionary, the word senses of the ambiguous word coach are as shown in Table 1.

TABLE 1 Word senses of coach#n

Sense number	Sense description
coach#n#1	coach, manager, handler -- ((sports) someone in charge of training an athlete or a team)
coach#n#2	coach, private instructor, tutor -- (a person who gives private instruction (as in singing, acting, etc.))
coach#n#3	passenger car, coach, carriage -- (a railcar where passengers ride)
coach#n#4	coach, four-in-hand, coach-and-four -- (a carriage pulled by four horses with one driver)
coach#n#5	bus, autobus, coach, charabanc, double-decker, jitney, motorbus, motorcoach, omnibus, passenger vehicle -- (a vehicle carrying many passengers; used for public transport; "he always rode the bus to work")

Here #n denotes a noun, and #1, #2, #3, #4 and #5 are the word sense numbers in WordNet 3.0.

Step 1.1: the sentence to be processed is denoted by the symbol S; in this example S is "The coach teaching football is standing on the bus.".

Step 1.2: preprocess the sentence S, mainly by removing garbled characters and special symbols and by tokenizing the English words, to obtain the preprocessed sentence S'; in this example S' is "The coach teaching football is standing on the bus .".

Step 1.3: perform phrase structure syntactic analysis on the sentence S' with a phrase structure syntactic parser to generate the phrase structure syntax tree T. In this example, the Stanford Parser provided by Stanford University is used with the englishPCFG model, and the generated phrase structure syntax tree is shown in FIG. 1.

Step 1.4: lemmatize the words in the phrase structure syntax tree T. In this example, lemmatization is performed with WordNet 3.0 and the MorphAdorner toolkit provided by Northwestern University (USA), and the words in FIG. 1 are reduced to: the, coach, teach, football, be, stand, on, the, bus.

Step 2.1: let w_t denote the ambiguous word to be disambiguated, here coach; let w denote any other word in the sentence; and let W denote the set of all content words in the sentence other than w_t, namely {teach#n, football#n, stand#v, bus#n} (where #n denotes a noun and #v denotes a verb).

Step 2.2: from the phrase structure syntax tree T, compute the hierarchical distance d_l between the ambiguous word coach and each other word w, and store d_l with w in W. If the common parent node of coach and w in T is f, the hierarchical distance is the length of the path from coach to f minus 1. In this example, as can be seen from FIG. 1, the hierarchical distances between coach and teach, football, stand and bus are 1, 1, 2 and 2, respectively.

Step 2.3: from the phrase structure syntax tree T, compute the path distance d_p between the ambiguous word coach and each other word w, and store d_p with w in W. In this example, as can be seen from FIG. 1, the path distances between coach and teach, football, stand and bus are 4, 4, 7 and 9, respectively.

Step 2.4: specify a hierarchical distance parameter d_layer and a path distance parameter d_path, select from W the words w whose d_l is not greater than d_layer and whose d_p is not greater than d_path, and construct from them the word sense related word set R of the ambiguous word. In this example, d_layer and d_path are set to 2 and 9, respectively, so the word sense related word set of coach is {teach#n, football#n, stand#v, bus#n}.
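The distance computation and screening of steps 2.2 to 2.4 can be sketched in a few lines. The bracketed parse `PARSE` below is an assumption: FIG. 1 is not reproduced in this text, so a tree over the lemmatized words was chosen whose leaf-to-leaf distances agree with the values reported above. The function names are likewise hypothetical.

```python
import re

# Assumed bracketed parse (lemmatized leaves); chosen to match the reported distances.
PARSE = ("(ROOT (S (NP (DT the) (NN coach) (NN teach) (NN football)) "
         "(VP (VBZ be) (VP (VBG stand) (PP (IN on) (NP (DT the) (NN bus)))))))")


def leaf_paths(bracketed):
    """Map each leaf word to the list of node ids from the root down to that leaf."""
    paths, stack, next_id, expect_label = {}, [], 0, False
    for tok in re.findall(r"\(|\)|[^\s()]+", bracketed):
        if tok == "(":
            expect_label = True
        elif tok == ")":
            stack.pop()
        elif expect_label:            # constituent or part-of-speech label
            stack.append(next_id)
            next_id += 1
            expect_label = False
        else:                         # leaf word; keep the first occurrence only
            paths.setdefault(tok, stack + [next_id])
            next_id += 1
    return paths


def tree_distances(paths, target, other):
    """Return (hierarchical distance d_l, path distance d_p) between two leaves."""
    a, b = paths[target], paths[other]
    common = 0
    while common < min(len(a), len(b)) and a[common] == b[common]:
        common += 1
    up = len(a) - common              # edges from target up to the common parent f
    down = len(b) - common            # edges from f down to the other word
    return up - 1, up + down


def screen(paths, target, candidates, d_layer, d_path):
    """Keep candidates with d_l <= d_layer and d_p <= d_path (step 2.4)."""
    kept = []
    for word in candidates:
        d_l, d_p = tree_distances(paths, target, word)
        if d_l <= d_layer and d_p <= d_path:
            kept.append(word)
    return kept


paths = leaf_paths(PARSE)
content_words = ["teach", "football", "stand", "bus"]
for word in content_words:
    print(word, tree_distances(paths, "coach", word))
print(screen(paths, "coach", content_words, d_layer=2, d_path=9))
```

With this assumed tree the sketch yields hierarchical distances 1, 1, 2, 2 and path distances 4, 4, 7, 9, and all four content words pass the thresholds 2 and 9; tightening `d_path` to 8 would drop bus.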

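Step 3.1: for each word w in the word sense related word set R, calculate its disambiguation weight from its hierarchical distance d_l and its path distance d_p by formula (1).

$$\mathrm{weight}(w) = \frac{\alpha^{d_l}}{1 + \beta \log_{10} d_p} \qquad (1)$$

In this example, α and β are set to 1 and 0, respectively, which assigns a weight of 1 to every word sense related word.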

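Step 3.2: for each word sense s_i of the ambiguous word w_t, calculate its closeness to the word sense related word set R by formula (2).

$$\mathrm{relatedness}(s_i) = \sum_{w_j \in R} \mathrm{weight}(w_j) \cdot \frac{\mathrm{wnss}(s_i, w_j)}{\sum_{s_k \in \mathrm{sense}(w_t)} \mathrm{wnss}(s_k, w_j)} \qquad (2)$$

In this example, the word sense related word set of the ambiguous word coach#n is R = {teach#n, football#n, stand#v, bus#n}. First, the word sense relatedness, i.e. the wnss value, between each word sense of coach#n and each related word is calculated. The wnss value can be obtained with any of several similarity or relatedness tools; here the WordNet::Similarity toolkit written by Ted Pedersen is used, and the resulting relatedness values are shown in Table 2.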

TABLE 2 Word sense relatedness between each word sense of coach#n and the related words

Word sense	teach#n	football#n	stand#v	bus#n
coach#n#1	0.0274664653923546	0.474638267730824	0.0794203349688148	0.0953982038879483
coach#n#2	0.0411270396042137	0.0636370034284592	0.125973809222455	0.105985587733038
coach#n#3	0.0441240510549878	0.109828009114997	0.118997168597431	0.165005388203732
coach#n#4	0.0395030928811857	0.118434570601007	0.116094035457169	0.31888473124512
coach#n#5	0.0563124527152087	0.113685514457318	0.113552132406334	0.999999999999987

The relatedness values in Table 2 were calculated using the WordNet::Similarity::vector_pairs measure.

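To simplify the later calculation, the sum of the wnss values over all word senses of coach#n is first computed for each related word w_j. For teach#n:

$\sum_{s_k} \mathrm{wnss}(s_k, \text{teach\#n})$ = wnss(coach#n#1,teach#n)+wnss(coach#n#2,teach#n)+wnss(coach#n#3,teach#n)+wnss(coach#n#4,teach#n)+wnss(coach#n#5,teach#n) = 0.0274664653923546+0.0411270396042137+0.0441240510549878+0.0395030928811857+0.0563124527152087 = 0.20853310164795047.

Similarly,

$\sum_{s_k} \mathrm{wnss}(s_k, \text{football\#n})$ = 0.8802233653326053;

$\sum_{s_k} \mathrm{wnss}(s_k, \text{stand\#v})$ = 0.5540374806522037;

$\sum_{s_k} \mathrm{wnss}(s_k, \text{bus\#n})$ = 1.6852739110698254.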

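For the word sense coach#n#1, formula (2) with all weights equal to 1 gives relatedness(coach#n#1) = 0.0274664653923546/0.20853310164795047 + 0.474638267730824/0.8802233653326053 + 0.0794203349688148/0.5540374806522037 + 0.0953982038879483/1.6852739110698254 = 0.13171273613300974+0.5392247995501404+0.14334830718550395+0.05660694280100067 = 0.8708927856696547.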

Similarly, for other word senses, from equation (2), it can be derived

relatedness(coach#n#2)=0.5597805034534482；

relatedness(coach#n#3)=0.6490573694718037；

relatedness(coach#n#4)=0.7227439715647753；

relatedness(coach#n#5)=1.197525369840318。

Step 3.3: select the word sense with the highest closeness as the correct word sense of the ambiguous word. In this example, the relatedness values calculated in step 3.2 are compared, and coach#n#5, which has the highest value, is selected as the word sense of the ambiguous word. (In fact coach#n#5 is the wrong word sense here; the subsequent steps correct this decision by optimizing the model parameters.)
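The arithmetic of step 3.2 can be checked with a short sketch. The data are the wnss values of Table 2; the function name `relatedness` and the list-based layout are hypothetical, and the normalization by the per-word sum follows the worked calculation above.

```python
# wnss values from Table 2; columns follow WORDS, rows are the senses of coach#n.
WORDS = ["teach#n", "football#n", "stand#v", "bus#n"]
WNSS = {
    "coach#n#1": [0.0274664653923546, 0.474638267730824, 0.0794203349688148, 0.0953982038879483],
    "coach#n#2": [0.0411270396042137, 0.0636370034284592, 0.125973809222455, 0.105985587733038],
    "coach#n#3": [0.0441240510549878, 0.109828009114997, 0.118997168597431, 0.165005388203732],
    "coach#n#4": [0.0395030928811857, 0.118434570601007, 0.116094035457169, 0.31888473124512],
    "coach#n#5": [0.0563124527152087, 0.113685514457318, 0.113552132406334, 0.999999999999987],
}


def relatedness(wnss, weights):
    """Weighted sum, over related words, of each sense's share of that word's total wnss."""
    n = len(weights)
    # Per-word normalization term: sum of wnss over all senses of the ambiguous word.
    totals = [sum(row[j] for row in wnss.values()) for j in range(n)]
    return {
        sense: sum(weights[j] * row[j] / totals[j] for j in range(n))
        for sense, row in wnss.items()
    }


# alpha = 1 and beta = 0 correspond to a weight of 1 for every related word.
scores = relatedness(WNSS, weights=[1.0, 1.0, 1.0, 1.0])
for sense, value in scores.items():
    print(sense, value)
print(max(scores, key=scores.get))  # sense with the highest relatedness
```

With unit weights the highest score goes to coach#n#5, which is the decision reported above; passing a different `weights` list gives the weighted variant of the same computation.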

Step 4.1: select an appropriate word sense tagged corpus Corpus. In practice, any word sense annotated corpus may be used. In this example, part of the annotated Reuters BNC corpus provided by Diana McCarthy and Rob Koeling is selected.

Step 4.2: collect each ambiguous word in Corpus together with the sentence in which it occurs and its correct word sense label, and construct the training data set C_train of the word sense disambiguation model. In this example, the Reuters BNC data selected in step 4.1 can be used directly as the training data set. For other annotated corpora, a training data set can be constructed with only simple text processing and conversion.

Step 4.3: take the hierarchical distance parameter d_layer and the path distance parameter d_path of step 2.4 and the adjustment parameters α and β of step 3.1 as the input vector of the genetic algorithm, take formula (3) as the objective function of the genetic algorithm, and perform optimization training on C_train to obtain the optimal values of d_layer, d_path, α and β.

(3)

In this example, the optimal parameters are obtained with the Genetic Algorithm solver of the Optimization Tool provided by Matlab, using the Matlab default settings. After training, the four parameters d_layer, d_path, α and β are optimized to 3, 10, 0.5 and 1.2, respectively.

In this example, words having a hierarchical distance of not more than 3 and a path distance of not more than 10 are to be used as word sense related words of the ambiguous word. Equation (1) will be rewritten as equation (4):

$$\mathrm{weight}(w) = \frac{0.5^{d_l}}{1 + 1.2 \log_{10} d_p} \qquad (4)$$

wherein, the alpha and beta in the formula (1) are respectively optimized to be 0.5 and 1.2.

In this embodiment, the sentence "The coach teaching football is standing on the bus." is again taken as an example, and the ambiguous word coach in it is disambiguated.

Step 5.1: following step one, generate the phrase structure syntax tree T of the sentence containing the word w_t to be disambiguated. In this example, the phrase structure syntax tree is shown in FIG. 1.

Step 5.2: following step two, obtain the hierarchical distance and the path distance between w_t and the other words in the sentence, screen the word sense related words using the d_layer and d_path obtained in step four, and construct the word sense related word set R. In this example, according to the phrase structure syntax tree in FIG. 1, the hierarchical distances between coach and teach, football, stand and bus are 1, 1, 2 and 2, respectively, and the path distances are 4, 4, 7 and 9, respectively. The d_layer and d_path optimized in step four are 3 and 10, respectively, so the hierarchical distances and path distances of teach, football, stand and bus all satisfy the conditions, and the constructed word sense related word set is R = {teach#n, football#n, stand#v, bus#n}.

Step 5.3: following step 3.1, calculate the disambiguation weight of each word in the word sense related word set R using the optimal parameters obtained in step four. In this example, according to formula (4), the disambiguation weights of teach#n, football#n, stand#v and bus#n are 0.2902804823653377, 0.2902804823653377, 0.12412383171664482 and 0.11654517159405858, respectively.
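The four weights can be reproduced with the following sketch. The closed form used in `disambiguation_weight`, namely α raised to the power d_l divided by 1 + β·log10(d_p), is an assumption inferred from the reported values and from the fact that α = 1, β = 0 yields a weight of 1; the function and variable names are hypothetical.

```python
import math


def disambiguation_weight(d_l, d_p, alpha, beta):
    """Assumed weight form: alpha**d_l / (1 + beta * log10(d_p))."""
    return alpha ** d_l / (1.0 + beta * math.log10(d_p))


# (hierarchical distance, path distance) of each related word to coach, from step 5.2
DISTANCES = {
    "teach#n": (1, 4),
    "football#n": (1, 4),
    "stand#v": (2, 7),
    "bus#n": (2, 9),
}

weights = {
    word: disambiguation_weight(d_l, d_p, alpha=0.5, beta=1.2)
    for word, (d_l, d_p) in DISTANCES.items()
}
for word, value in weights.items():
    print(word, value)
```

Under this form, words higher up the tree relative to coach are discounted geometrically by α, while long tree paths are discounted only logarithmically through β, so bus receives a much smaller weight than teach or football.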

Step 5.4: following step 3.2, calculate the closeness between each word sense s_i of the ambiguous word w_t and the word sense related word set R. In this example, for the word sense coach#n#1, formula (2) gives relatedness(coach#n#1) = 0.2902804823653377×0.13171273613300974 + 0.2902804823653377×0.5392247995501404 + 0.12412383171664482×0.14334830718550395 + 0.11654517159405858×0.05660694280100067 = 0.03823363657834851+0.1565264349167673+0.0177929411579594+0.006597265862157682 = 0.2191502785152329.

In the same way, the method can obtain,

relatedness(coach#n#2)=0.11378754409746956；

relatedness(coach#n#3)=0.13571081450099737；

relatedness(coach#n#4)=0.1421077906515997；

relatedness(coach#n#5)=0.21047354027607934。

Step 5.5: following step 3.3, determine the correct word sense of the ambiguous word w_t. In this example, the relatedness values of the word senses of coach obtained in step 5.4 are compared, and coach#n#1, which has the highest value, is selected as the correct word sense.

Claims

1. An English word meaning disambiguation method based on phrase structure syntax tree is characterized in that: the specific operation steps are as follows:

step one, generating a phrase structure syntactic tree by analyzing the phrase structure syntactic of a sentence; the method specifically comprises the following steps:

step 1.1: the sentence to be processed is represented by symbol S;

step 1.2: preprocessing the sentence S, mainly by removing garbled characters and special symbols and by tokenizing the English words, to obtain a preprocessed sentence S';

step 1.3: performing phrase structure syntactic analysis on the sentence S' with a phrase structure syntactic parser to generate a phrase structure syntax tree T;

step 1.4: lemmatizing the words in the phrase structure syntax tree T;

step two, calculating the hierarchical distance and the path distance between the ambiguous word and other words in the sentence based on the phrase structure syntax tree, and screening out words related to the word meaning; the method specifically comprises the following steps:

step 2.1: letting w_t denote the ambiguous word to be disambiguated, letting w denote any other word in the sentence, and letting W denote the set of all content words in the sentence other than w_t;

step 2.2: computing, from the phrase structure syntax tree T, the hierarchical distance d_l between the ambiguous word w_t and each other word w, and storing d_l with w in W;

step 2.3: computing, from the phrase structure syntax tree T, the path distance d_p between the ambiguous word w_t and each other word w, and storing d_p with w in W;

step 2.4: specifying a hierarchical distance parameter d_layer and a path distance parameter d_path, selecting from W the words whose d_l is not greater than d_layer and whose d_p is not greater than d_path, and constructing from them the word sense related word set R of the ambiguous word;

step three, constructing a word sense disambiguation model, and judging correct word senses by evaluating the closeness degree of each word sense of ambiguous words and words related to the word senses; the method specifically comprises the following steps:

step 3.1: for each word w in the word sense related word set R, calculating its disambiguation weight from its hierarchical distance d_l and its path distance d_p by formula (1);

$$\mathrm{weight}(w) = \frac{\alpha^{d_l}}{1 + \beta \log_{10} d_p} \qquad (1)$$

wherein α and β are the adjustment parameters for the hierarchical distance d_l and the path distance d_p, respectively;

step 3.2: for each word sense s_i of the ambiguous word w_t, calculating its closeness to the word sense related word set R by formula (2);

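$$\mathrm{relatedness}(s_i) = \sum_{w_j \in R} \mathrm{weight}(w_j) \cdot \frac{\mathrm{wnss}(s_i, w_j)}{\sum_{s_k \in \mathrm{sense}(w_t)} \mathrm{wnss}(s_k, w_j)} \qquad (2)$$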

wherein s_i denotes the i-th word sense of the ambiguous word w_t; sense(w_t) denotes the set of all word senses of w_t, with s_i ∈ sense(w_t); w_j denotes the j-th word sense related word; R denotes the set of all word sense related words of w_t, with w_j ∈ R; weight(w_j) denotes the disambiguation weight of w_j calculated by formula (1); and wnss(s_i, w_j) denotes the word sense relatedness between the word sense s_i and the word sense related word w_j;

step 3.3: based on the closeness of each word sense s_i to the word sense related word set R obtained in step 3.2, selecting the word sense with the highest closeness as the correct word sense of the ambiguous word;

step four, using a word sense tagged corpus, optimizing the parameters of the word sense disambiguation model of step three with a genetic algorithm to obtain an optimized word sense disambiguation model; the method specifically comprises the following steps:

step 4.1: selecting an appropriate word sense tagged corpus Corpus;

step 4.2: collecting each ambiguous word in Corpus together with the sentence in which it occurs and its correct word sense label, and constructing the training data set C_train of the word sense disambiguation model;

step 4.3: taking the hierarchical distance parameter d_layer and the path distance parameter d_path of step 2.4 and the adjustment parameters α and β of step 3.1 as the input vector of the genetic algorithm, taking formula (3) as the objective function of the genetic algorithm, and performing optimization training on C_train to obtain the optimal values of d_layer, d_path, α and β;

(3)

wherein precision is disambiguation accuracy, and the value of precision is the ratio of the number of correctly disambiguated ambiguous words to the total number of the ambiguous words;

step 4.4: substituting the d_layer and d_path obtained in step 4.3 into step 2.4, and substituting α and β into formula (1), to complete the parameter optimization of the word sense disambiguation model;

step five, repeating the step one and the step two for the word to be disambiguated, and judging the correct word sense of the ambiguous word by using the optimized word sense disambiguation model obtained in the step four; the method specifically comprises the following steps:

step 5.1: following step one, generating the phrase structure syntax tree T of the sentence containing the word w_t to be disambiguated;

step 5.2: following step two, obtaining the hierarchical distance and the path distance between w_t and the other words in the sentence, screening the word sense related words using the d_layer and d_path obtained in step four, and constructing the word sense related word set R;

step 5.3: following step 3.1, calculating the disambiguation weight of each word in the word sense related word set R using the α and β obtained in step four;

step 5.4: following step 3.2, calculating the closeness between each word sense s_i of the ambiguous word w_t and the word sense related word set R;

step 5.5: following step 3.3, determining the correct word sense of the ambiguous word w_t.