CN105677639A - English word sense disambiguation method based on phrase structure syntax tree - Google Patents

Info

Publication number
CN105677639A
Authority
CN
China
Prior art keywords
word
sense
words
ambiguous
disambiguation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610011045.8A
Other languages
Chinese (zh)
Inventor
鹿文鹏
成金勇
张维玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN201610011045.8A priority Critical patent/CN105677639A/en
Publication of CN105677639A publication Critical patent/CN105677639A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an English word sense disambiguation method based on a phrase structure syntax tree, and belongs to the field of natural language processing. The method comprises the following steps: 1, phrase structure syntactic analysis is performed on a sentence to generate its phrase structure syntax tree; 2, word-sense related words are screened on the basis of the phrase structure syntax tree; 3, a word sense disambiguation model is constructed, and the correct word sense is determined by evaluating the closeness between each word sense of the ambiguous word and the word-sense related words; 4, the parameters of the word sense disambiguation model of step 3 are optimized with a genetic algorithm on a word-sense tagged corpus; 5, steps 1 and 2 are repeated for the word to be disambiguated, and the correct word sense of the ambiguous word is determined with the optimized word sense disambiguation model obtained in step 4. By using the phrase structure syntax tree to screen the word-sense related words and to assign them disambiguation weights, the method reduces the interference of noise words, improves the accuracy of the word-sense relatedness computation, and thereby improves the accuracy of English word sense disambiguation.

Description

English word sense disambiguation method based on phrase structure syntax tree
Technical Field
The invention relates to an English word sense disambiguation method, in particular to an English word sense disambiguation method based on a phrase structure syntax tree, and belongs to the technical field of natural language processing.
Background
Word sense disambiguation refers to determining the correct word sense of an ambiguous word according to the context in which the word occurs. Word senses are the basic units that make up the meaning of a sentence, and resolving them is a precondition for understanding the sentence. Word sense disambiguation is a basic task in natural language processing and is widely needed in machine translation, information retrieval, text classification, question answering systems and similar fields.
The word sense of an ambiguous word is determined by its context. Whether the context words related to the word sense can be selected accurately directly affects the performance of a word sense disambiguation system. Existing word sense disambiguation methods generally select context words with a sliding window, that is, they take the ambiguous word as the center and select the words within a certain distance to its left and right. This approach considers only the linear distance between words in the sentence and ignores their grammatical and semantic relations. As a result, it cannot filter out noise words that are close to the ambiguous word, and it easily misses related words that are far from it.
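As a rough illustration of the sliding-window selection described above, the following Python sketch picks up to k words on each side of the ambiguous word. The helper name `window_context` and the window size are illustrative assumptions; the token list is the lemmatized example sentence given in the detailed description.

```python
def window_context(tokens, index, k):
    """Return up to k tokens on each side of tokens[index] (hypothetical helper)."""
    left = tokens[max(0, index - k):index]
    right = tokens[index + 1:index + 1 + k]
    return left + right


# Lemmatized words of the example sentence used in the detailed description.
tokens = ["the", "coach", "teach", "football", "be", "stand", "on", "the", "bus"]
context = window_context(tokens, tokens.index("coach"), 2)
print(context)
```

With a window of 2, the function word "the" is selected while "bus" is left out, which is the kind of short-distance noise and long-distance omission the paragraph describes.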
The word sense of an ambiguous word is typically determined by comparing how close each candidate word sense is to the word-sense related words in the context. Whether this closeness can be computed accurately has a decisive influence on the performance of a word sense disambiguation system. Related words at different distances influence the sense of the ambiguous word to different degrees, so they need to be given appropriate disambiguation weights. Existing word sense disambiguation methods generally treat all word-sense related context words as equally weighted, which cannot reflect the differences between words at different distances and makes it difficult to evaluate accurately the closeness between a word sense and the word-sense related context words.
In view of the above problems, the present application provides an English word sense disambiguation method based on a phrase structure syntax tree, which makes full use of the phrase structure syntax tree to screen the word-sense related words and assign them disambiguation weights, and determines the correct word sense according to the closeness between each word sense and the word-sense related context words.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing word sense disambiguation techniques. It mainly addresses the screening and weighting of word-sense related context words and the calculation of word-sense relatedness, and provides a novel English word sense disambiguation method based on a phrase structure syntax tree.
The purpose of the invention is achieved by the following technical scheme.
An English word sense disambiguation method based on a phrase structure syntax tree comprises the following specific operation steps.
Step one: perform phrase structure syntactic analysis on the sentence to generate a phrase structure syntax tree. The details are as follows.
Step 1.1: Denote the sentence to be processed by S.
Step 1.2: Preprocess the sentence S, which mainly includes removing garbled characters and special symbols and performing English tokenization, to obtain the preprocessed sentence S'.
Step 1.3: Perform phrase structure syntactic analysis on the sentence S' with a phrase structure parser to generate the phrase structure syntax tree T.
Step 1.4: Lemmatize the words in the phrase structure syntax tree T.
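To make the output of step one concrete, the sketch below reads a Penn-Treebank-style bracketed string, the format phrase structure parsers commonly emit, into nested `(label, children)` tuples. The helper names `parse_bracketed` and `leaves` and the sample string are illustrative assumptions; the string is not the tree of Fig. 1, and the sketch does not call any real parser.

```python
def parse_bracketed(text):
    """Parse a bracketed tree string into (label, children); leaf words are plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "each constituent must start with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return the words of the tree in sentence order."""
    _, children = tree
    words = []
    for child in children:
        words.extend([child] if isinstance(child, str) else leaves(child))
    return words


# Hypothetical parser output for a short sentence.
tree = parse_bracketed("(S (NP (DT the) (NN coach)) (VP (VBZ stands)))")
print(tree[0], leaves(tree))
```

Lemmatization (step 1.4) would then replace each leaf with its base form, for example "stands" with "stand".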
Step two: based on the phrase structure syntax tree, compute the hierarchical distance and the path distance between the ambiguous word and the other words in the sentence, and screen out the word-sense related words. The details are as follows.
Step 2.1: Denote the ambiguous word to be disambiguated by w_t, denote any other word in the sentence by w, and denote by W the set of all content words in the sentence other than w_t.
Step 2.2: Using the phrase structure syntax tree T, compute the hierarchical distance d_l between the ambiguous word w_t and each other word w; attach d_l to w and store it in W.
Step 2.3: Using the phrase structure syntax tree T, compute the path distance d_p between the ambiguous word w_t and each other word w; attach d_p to w and store it in W.
Step 2.4: Specify a hierarchical distance parameter d_layer and a path distance parameter d_path; select from W the words whose d_l is not greater than d_layer and whose d_p is not greater than d_path, and use them to construct the word-sense related word set R of the ambiguous word.
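Step two can be sketched in Python as follows. Several points are assumptions made only for illustration: a word is identified with its part-of-speech node; the hierarchical distance is the number of edges from the ambiguous word up to the nearest common ancestor minus 1 (the definition given in the embodiment); the path distance is counted in edges, which may differ from the counting convention behind Fig. 1; and content words are approximated by tags starting with N, V, J or R, excluding "be". The tree `TOY_TREE` is hypothetical and is not the tree of Fig. 1.

```python
def word_paths(tree, prefix=()):
    """Yield (word, tag, path); path holds the child indices from the root to the POS node."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield children[0], label, prefix
        return
    for i, child in enumerate(children):
        yield from word_paths(child, prefix + (i,))


def tree_distances(path_t, path_w):
    """Return (hierarchical distance, path distance) between two POS nodes."""
    common = 0
    for a, b in zip(path_t, path_w):
        if a != b:
            break
        common += 1
    up = len(path_t) - common    # edges from the ambiguous word up to the common ancestor
    down = len(path_w) - common  # edges from the common ancestor down to the other word
    return up - 1, up + down


def related_words(tree, target, d_layer, d_path, stop=("be",)):
    """Build the related-word set R as {word: (d_l, d_p)}."""
    entries = list(word_paths(tree))
    path_t = next(p for w, _, p in entries if w == target)
    selected = {}
    for word, tag, path in entries:
        if word == target or word in stop or tag[0] not in "NVJR":
            continue  # keep content words only (assumed filter)
        d_l, d_p = tree_distances(path_t, path)
        if d_l <= d_layer and d_p <= d_path:
            selected[word] = (d_l, d_p)
    return selected


# Hypothetical lemmatized tree, written by hand for this sketch.
TOY_TREE = ("S", [
    ("NP", [("NP", [("DT", ["the"]), ("NN", ["coach"])]),
            ("NP", [("NN", ["teach"]), ("NN", ["football"])])]),
    ("VP", [("VBP", ["be"]),
            ("VP", [("VBG", ["stand"]),
                    ("PP", [("IN", ["on"]),
                            ("NP", [("DT", ["the"]), ("NN", ["bus"])])])])]),
])

R = related_words(TOY_TREE, "coach", d_layer=2, d_path=6)
print(R)
```

Raising `d_path` to 8 in this toy tree would also admit "bus", which shows how the two thresholds control the size of R.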
Step three: construct a word sense disambiguation model, and determine the correct word sense by evaluating the closeness between each word sense of the ambiguous word and the word-sense related words. The details are as follows.
Step 3.1: For each word w in the word-sense related word set R, compute its disambiguation weight from its hierarchical distance d_l and path distance d_p by formula (1).
$$\mathrm{weight}(w) = \frac{\alpha^{d_l}}{1 + \beta \log_{10} d_p} \tag{1}$$
where α and β are the adjustment parameters of the hierarchical distance d_l and the path distance d_p, respectively.
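A minimal sketch of the weight computation, assuming the weight takes the form α^(d_l) / (1 + β·log10(d_p)). This form is an assumption chosen because it reproduces the weights listed in the embodiment (for example 0.2902804823653377 for d_l = 1, d_p = 4 with α = 0.5 and β = 1.2) and gives a weight of 1 when α = 1 and β = 0. The function name `disambiguation_weight` is illustrative.

```python
import math


def disambiguation_weight(d_l, d_p, alpha, beta):
    """Weight of a related word from its hierarchical distance d_l and path distance d_p."""
    return alpha ** d_l / (1.0 + beta * math.log10(d_p))


# Distances (d_l, d_p) of teach, football, stand and bus as stated in the embodiment.
distances = {"teach": (1, 4), "football": (1, 4), "stand": (2, 7), "bus": (2, 9)}
weights = {w: disambiguation_weight(d_l, d_p, 0.5, 1.2) for w, (d_l, d_p) in distances.items()}
for word, value in weights.items():
    print(word, value)
```

Under this form, α below 1 discounts words that sit further up the tree from the ambiguous word, and β controls how quickly the weight decays with path distance.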
Step 3.2: For each word sense s_i of the ambiguous word w_t, compute its closeness to the word-sense related word set R by formula (2).
$$\mathrm{relatedness}(s_i) = \sum_{w_j \in R} \mathrm{weight}(w_j) \cdot \frac{\mathrm{wnss}(s_i, w_j)}{\sum_{s_k \in \mathrm{sense}(w_t)} \mathrm{wnss}(s_k, w_j)} \tag{2}$$
where s_i denotes the i-th word sense of the ambiguous word w_t; sense(w_t) denotes the set of all word senses of w_t, with s_i ∈ sense(w_t); w_j denotes the j-th word-sense related word; R denotes the set of all word-sense related words of w_t, with w_j ∈ R; weight(w_j) denotes the disambiguation weight of w_j computed by formula (1); and wnss(s_i, w_j) denotes the word-sense relatedness between the word sense s_i and the word-sense related word w_j. The denominator sums wnss over all word senses s_k of w_t, so that the contribution of each related word is normalized across the senses.
Step 3.3: According to the closeness between each word sense s_i and the word-sense related word set R obtained in step 3.2, select the word sense with the highest closeness as the correct word sense of the ambiguous word.
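Steps 3.2 and 3.3 can be sketched as below, assuming (as the worked values in the embodiment indicate) that each related word's wnss score is normalized by its sum over all senses before being weighted and added. The sense names, words, scores and weights in the toy data are hypothetical.

```python
def relatedness(sense, wnss, weights):
    """Closeness of one sense to the related-word set: weighted, sense-normalized wnss."""
    total = 0.0
    for word, weight in weights.items():
        norm = sum(scores[word] for scores in wnss.values())  # sum over all senses
        total += weight * wnss[sense][word] / norm
    return total


def best_sense(wnss, weights):
    """Step 3.3: pick the sense with the highest closeness."""
    return max(wnss, key=lambda sense: relatedness(sense, wnss, weights))


# Hypothetical relatedness scores wnss[sense][word] and disambiguation weights.
toy_wnss = {
    "s1": {"a": 0.2, "b": 0.6},
    "s2": {"a": 0.8, "b": 0.4},
}
toy_weights = {"a": 1.0, "b": 0.5}

print(relatedness("s1", toy_wnss, toy_weights))
print(relatedness("s2", toy_wnss, toy_weights))
print(best_sense(toy_wnss, toy_weights))
```

Here sense s2 wins because the heavily weighted word "a" is far more related to it, even though word "b" favours s1.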
Step four: using a word-sense tagged corpus, optimize the parameters of the word sense disambiguation model of step three with a genetic algorithm to obtain the optimized word sense disambiguation model. The details are as follows.
Step 4.1: Select a suitable word-sense tagged corpus, denoted Corpus.
Step 4.2: Collect each ambiguous word in Corpus together with the sentence in which it occurs and its correct word sense label, and construct the training data set C_train of the word sense disambiguation model.
Step 4.3: Take the hierarchical distance parameter d_layer and the path distance parameter d_path of step 2.4 and the adjustment parameters α and β of step 3.1 as the input vector of the genetic algorithm, take formula (3) as the objective function of the genetic algorithm, and perform optimization training on C_train to obtain the optimal values of d_layer, d_path, α and β.
(3)
where precision is the disambiguation accuracy, that is, the ratio of the number of correctly disambiguated ambiguous words to the total number of ambiguous words.
Step 4.4: Substitute the d_layer and d_path obtained in step 4.3 into step 2.4, and substitute α and β into formula (1), to complete the parameter optimization of the word sense disambiguation model.
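The parameter search of step four can be illustrated with a very small genetic algorithm. Everything in this sketch is an assumption: the operators (truncation selection, uniform crossover, Gaussian mutation, elitism), the bounds, and above all `toy_precision`, a stand-in surrogate whose peak is placed at the values reported in the embodiment purely so the example has a known optimum. A real objective would run the disambiguation model on C_train and return its precision, and the description itself relies on an off-the-shelf genetic algorithm rather than this code.

```python
import random


def toy_precision(params):
    """Stand-in for precision on C_train (hypothetical surrogate, not real data)."""
    d_layer, d_path, alpha, beta = params
    penalty = ((round(d_layer) - 3) ** 2 + (round(d_path) - 10) ** 2
               + (alpha - 0.5) ** 2 + (beta - 1.2) ** 2)
    return 1.0 / (1.0 + penalty)


def genetic_search(fitness, bounds, pop_size=20, generations=30, seed=0):
    """Maximize fitness over box-bounded real vectors with a minimal genetic algorithm."""
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]           # truncation selection
        children = [ranked[0]]                      # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            i = rng.randrange(len(child))           # mutate one gene
            lo, hi = bounds[i]
            child[i] = min(hi, max(lo, child[i] + rng.gauss(0, 0.1 * (hi - lo))))
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)
    return best


# Assumed search ranges for d_layer, d_path, alpha and beta.
BOUNDS = [(1, 5), (1, 15), (0.0, 1.0), (0.0, 2.0)]
best_params = genetic_search(toy_precision, BOUNDS)
print(best_params, toy_precision(best_params))
```

The integer parameters d_layer and d_path are searched as real numbers and rounded inside the objective, which is one simple way to mix integer and real-valued genes.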
Step five: repeat step one and step two for the word to be disambiguated, and determine the correct word sense of the ambiguous word with the optimized word sense disambiguation model obtained in step four. The details are as follows.
Step 5.1: Following step one, generate the phrase structure syntax tree T of the sentence containing the word w_t to be disambiguated.
Step 5.2: Following step two, obtain the hierarchical distance and the path distance between the word w_t to be disambiguated and the other words in the sentence, and screen the word-sense related words according to the d_layer and d_path obtained in step four to construct the word-sense related word set R.
Step 5.3: Following step 3.1, compute the disambiguation weight of each word-sense related word in R using the α and β obtained in step four.
Step 5.4: Following step 3.2, determine the closeness between each word sense s_i of the ambiguous word w_t and the word-sense related word set R.
Step 5.5: Following step 3.3, determine the correct word sense of the ambiguous word w_t.
Through the above steps, the word sense of an English ambiguous word can be determined, and the word sense disambiguation task is completed.
Advantageous effects
The invention provides an English word sense disambiguation method based on a phrase structure syntax tree. It uses the phrase structure syntax tree as the basis for screening the word-sense related context words of an ambiguous word; it assigns each word-sense related word a disambiguation weight according to its hierarchical distance and path distance from the ambiguous word on the phrase structure syntax tree; and it determines the correct word sense according to the closeness between each word sense of the ambiguous word and the word-sense related context words. Compared with existing English word sense disambiguation methods, the method screens word-sense related context words more accurately, gives them appropriate disambiguation weights, and computes more accurately the closeness between the senses of the ambiguous word and the word-sense related context words. It thereby avoids the inaccurate screening and weighting of word-sense related words found in traditional methods, improves the accuracy of the word-sense relatedness computation, and improves the accuracy of English word sense disambiguation.
Drawings
FIG. 1 is a phrase structure syntax tree for sentences in an implementation of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
Take as an example a sentence containing the ambiguous word coach, whose lemmatized words (see step 1.4) are the, coach, teach, football, be, stand, on, the, bus, and disambiguate the ambiguous word coach in it.
According to the WordNet 3.0 dictionary, the word senses of the ambiguous word coach are as shown in Table 1.
TABLE 1 Word senses of coach#n
Word sense number   Description of word sense
coach#n#1           coach, manager, handler -- ((sports) someone in charge of training an athlete or a team)
coach#n#2           coach, private instructor, tutor -- (a person who gives private instruction (as in singing, acting, etc.))
coach#n#3           passenger car, coach, carriage -- (a railcar where passengers ride)
coach#n#4           coach, four-in-hand, coach-and-four -- (a carriage pulled by four horses with one driver)
coach#n#5           bus, autobus, coach, charabanc, double-decker, jitney, motorbus, motorcoach, omnibus, passenger vehicle -- (a vehicle carrying many passengers; used for public transport; "he always rode the bus to work")
Here #n denotes a noun, and #1, #2, #3, #4 and #5 are the word sense numbers in WordNet 3.0.
Step one: perform phrase structure syntactic analysis on the sentence to generate a phrase structure syntax tree. The details are as follows.
Step 1.1: Denote the sentence to be processed by S. In this example, S is the example sentence containing the ambiguous word coach.
Step 1.2: Preprocess the sentence S, which mainly includes removing garbled characters and special symbols and performing English tokenization, to obtain the preprocessed sentence S'.
Step 1.3: Perform phrase structure syntactic analysis on the sentence S' with a phrase structure parser to generate the phrase structure syntax tree T. In this example, the Stanford Parser provided by Stanford University is used with the englishPCFG model, and the generated phrase structure syntax tree is shown in Fig. 1.
Step 1.4: Lemmatize the words in the phrase structure syntax tree T. In this example, lemmatization is performed with WordNet 3.0 and the MorphAdorner toolkit provided by Northwestern University (USA), and the words in Fig. 1 are reduced to: the, coach, teach, football, be, stand, on, the, bus.
Step two: based on the phrase structure syntax tree, compute the hierarchical distance and the path distance between the ambiguous word and the other words in the sentence, and screen out the word-sense related words. The details are as follows.
Step 2.1: Denote the ambiguous word to be disambiguated, coach, by w_t; denote any other word in the sentence by w; and denote by W the set of all content words in the sentence other than w_t, namely W = {teach#n, football#n, stand#v, bus#n} (where #n denotes a noun and #v denotes a verb).
Step 2.2: Using the phrase structure syntax tree T, compute the hierarchical distance d_l between the ambiguous word coach and each other word w; attach d_l to w and store it in W. If the nearest common ancestor node of coach and w in T is f, the hierarchical distance is the path length between coach and f minus 1. In this example, as can be seen from Fig. 1, the hierarchical distances between coach and teach, football, stand and bus are 1, 1, 2 and 2, respectively.
Step 2.3: Using the phrase structure syntax tree T, compute the path distance d_p between the ambiguous word coach and each other word w; attach d_p to w and store it in W. In this example, as can be seen from Fig. 1, the path distances between coach and teach, football, stand and bus are 4, 4, 7 and 9, respectively.
Step 2.4: Specify a hierarchical distance parameter d_layer and a path distance parameter d_path; select from W the words w whose d_l is not greater than d_layer and whose d_p is not greater than d_path, and use them to construct the word-sense related word set R of the ambiguous word. In this example, with d_layer and d_path set to 2 and 9, respectively, the word-sense related word set of coach is R = {teach#n, football#n, stand#v, bus#n}.
Step three: construct a word sense disambiguation model, and determine the correct word sense by evaluating the closeness between each word sense of the ambiguous word and the word-sense related words. The details are as follows.
Step 3.1: For each word w in the word-sense related word set R, compute its disambiguation weight from its hierarchical distance d_l and path distance d_p by formula (1).
$$\mathrm{weight}(w) = \frac{\alpha^{d_l}}{1 + \beta \log_{10} d_p} \tag{1}$$
where α and β are the adjustment parameters of the hierarchical distance d_l and the path distance d_p, respectively.
In this example, α and β are set to 1 and 0, respectively, which amounts to giving every word-sense related word a weight of 1.
Step 3.2: For each word sense s_i of the ambiguous word w_t, compute its closeness to the word-sense related word set R by formula (2).
$$\mathrm{relatedness}(s_i) = \sum_{w_j \in R} \mathrm{weight}(w_j) \cdot \frac{\mathrm{wnss}(s_i, w_j)}{\sum_{s_k \in \mathrm{sense}(w_t)} \mathrm{wnss}(s_k, w_j)} \tag{2}$$
where s_i denotes the i-th word sense of the ambiguous word w_t; sense(w_t) denotes the set of all word senses of w_t, with s_i ∈ sense(w_t); w_j denotes the j-th word-sense related word; R denotes the set of all word-sense related words of w_t, with w_j ∈ R; weight(w_j) denotes the disambiguation weight of w_j computed by formula (1); and wnss(s_i, w_j) denotes the word-sense relatedness between the word sense s_i and the word-sense related word w_j.
In this example, the word-sense related word set of the ambiguous word coach#n is R = {teach#n, football#n, stand#v, bus#n}. First, the word-sense relatedness, that is, the wnss value, between each word sense of coach#n and each related word must be computed. wnss can be computed with any of several similarity or relatedness tools; here the WordNet::Similarity toolkit written by Ted Pedersen is used, and the resulting relatedness values are shown in Table 2.
TABLE 2 Word-sense relatedness between each word sense of coach#n and the related words
            teach#n              football#n           stand#v              bus#n
coach#n#1   0.0274664653923546   0.474638267730824    0.0794203349688148   0.0953982038879483
coach#n#2   0.0411270396042137   0.0636370034284592   0.125973809222455    0.105985587733038
coach#n#3   0.0441240510549878   0.109828009114997    0.118997168597431    0.165005388203732
coach#n#4   0.0395030928811857   0.118434570601007    0.116094035457169    0.31888473124512
coach#n#5   0.0563124527152087   0.113685514457318    0.113552132406334    0.999999999999987
The relatedness values in Table 2 were computed with the WordNet::Similarity::vector_pairs measure.
To simplify the later calculation, first compute for each related word w_j the sum of its wnss values over all word senses of coach#n, written sum(w_j). For teach#n:
sum(teach#n) = wnss(coach#n#1, teach#n) + wnss(coach#n#2, teach#n) + wnss(coach#n#3, teach#n) + wnss(coach#n#4, teach#n) + wnss(coach#n#5, teach#n) = 0.0274664653923546 + 0.0411270396042137 + 0.0441240510549878 + 0.0395030928811857 + 0.0563124527152087 = 0.20853310164795047.
In the same way:
sum(football#n) = 0.8802233653326053;
sum(stand#v) = 0.5540374806522037;
sum(bus#n) = 1.6852739110698254.
For the word sense coach#n#1, with all weights equal to 1, formula (2) gives
relatedness(coach#n#1) = wnss(coach#n#1, teach#n)/sum(teach#n) + wnss(coach#n#1, football#n)/sum(football#n) + wnss(coach#n#1, stand#v)/sum(stand#v) + wnss(coach#n#1, bus#n)/sum(bus#n) = 0.13171273613300974 + 0.5392247995501404 + 0.14334830718550395 + 0.05660694280100067 = 0.8708927856696547.
Similarly, for the other word senses, formula (2) gives:
relatedness(coach#n#2) = 0.5597805034534482;
relatedness(coach#n#3) = 0.6490573694718037;
relatedness(coach#n#4) = 0.7227439715647753;
relatedness(coach#n#5) = 1.197525369840318.
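The arithmetic of this step can be checked with a short Python script that takes the values of Table 2, normalizes each related word's column by its sum over the five senses, and adds the normalized values with all weights equal to 1. The variable names are illustrative; the numbers are those of Table 2.

```python
WORDS = ["teach#n", "football#n", "stand#v", "bus#n"]
TABLE2 = {
    "coach#n#1": [0.0274664653923546, 0.474638267730824, 0.0794203349688148, 0.0953982038879483],
    "coach#n#2": [0.0411270396042137, 0.0636370034284592, 0.125973809222455, 0.105985587733038],
    "coach#n#3": [0.0441240510549878, 0.109828009114997, 0.118997168597431, 0.165005388203732],
    "coach#n#4": [0.0395030928811857, 0.118434570601007, 0.116094035457169, 0.31888473124512],
    "coach#n#5": [0.0563124527152087, 0.113685514457318, 0.113552132406334, 0.999999999999987],
}

# Sum of wnss over all senses, one value per related word (column sums of Table 2).
column_sums = [sum(row[j] for row in TABLE2.values()) for j in range(len(WORDS))]

# Closeness of each sense with uniform weights: sum of the sense-normalized values.
uniform_relatedness = {
    sense: sum(value / total for value, total in zip(row, column_sums))
    for sense, row in TABLE2.items()
}
best = max(uniform_relatedness, key=uniform_relatedness.get)

for sense, value in uniform_relatedness.items():
    print(sense, value)
print("selected:", best)
```

Because every column is normalized to sum to 1, the five closeness values always add up to the number of related words, here 4.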
Step 3.3: According to the closeness between each word sense s_i and the word-sense related word set R obtained in step 3.2, select the word sense with the highest closeness as the correct word sense of the ambiguous word.
In this example, the closeness values (relatedness values) computed in step 3.2 are compared, and coach#n#5, which has the highest value, is selected as the word sense of the ambiguous word. (In fact coach#n#5 is the wrong word sense; the subsequent steps correct this decision by optimizing the model parameters.)
Step four: using a word-sense tagged corpus, optimize the parameters of the word sense disambiguation model of step three with a genetic algorithm to obtain the optimized word sense disambiguation model. The details are as follows.
Step 4.1: Select a suitable word-sense tagged corpus, denoted Corpus. In practice, any word-sense tagged corpus may be used. In this example, part of the tagged Reuters BNC corpus provided by Diana McCarthy and Rob Koeling is selected.
Step 4.2: Collect each ambiguous word in Corpus together with the sentence in which it occurs and its correct word sense label, and construct the training data set C_train of the word sense disambiguation model. In this example, the Reuters BNC corpus selected in step 4.1 can be used directly as the training data set. For other tagged corpora, a training data set can be constructed with only simple text processing and conversion.
Step 4.3: Take the hierarchical distance parameter d_layer and the path distance parameter d_path of step 2.4 and the adjustment parameters α and β of step 3.1 as the input vector of the genetic algorithm, take formula (3) as the objective function of the genetic algorithm, and perform optimization training on C_train to obtain the optimal values of d_layer, d_path, α and β.
(3)
where precision is the disambiguation accuracy, that is, the ratio of the number of correctly disambiguated ambiguous words to the total number of ambiguous words.
In this example, the optimal parameters are obtained with the Genetic Algorithm solver of the Optimization Tool provided with MATLAB, using the MATLAB default settings. After training, the four parameters d_layer, d_path, α and β are optimized to 3, 10, 0.5 and 1.2, respectively.
Step 4.4: Substitute the d_layer and d_path obtained in step 4.3 into step 2.4, and substitute α and β into formula (1), to complete the parameter optimization of the word sense disambiguation model.
In this example, words whose hierarchical distance is not greater than 3 and whose path distance is not greater than 10 are taken as the word-sense related words of the ambiguous word. Formula (1) is rewritten as formula (4):
$$\mathrm{weight}(w) = \frac{0.5^{d_l}}{1 + 1.2 \log_{10} d_p} \tag{4}$$
that is, formula (1) with α and β set to their optimized values 0.5 and 1.2.
Step five: repeat step one and step two for the word to be disambiguated, and determine the correct word sense of the ambiguous word with the optimized word sense disambiguation model obtained in step four. The details are as follows.
In this embodiment, the same example sentence is used, and the ambiguous word coach in it is disambiguated.
Step 5.1: Following step one, generate the phrase structure syntax tree T of the sentence containing the word w_t to be disambiguated. In this example, the phrase structure syntax tree is shown in Fig. 1.
Step 5.2: Following step two, obtain the hierarchical distance and the path distance between the word w_t to be disambiguated and the other words in the sentence, and screen the word-sense related words according to the d_layer and d_path obtained in step four to construct the word-sense related word set R. In this example, from the phrase structure syntax tree in Fig. 1, the hierarchical distances between coach and teach, football, stand and bus are 1, 1, 2 and 2, respectively, and the path distances are 4, 4, 7 and 9, respectively. The d_layer and d_path optimized in step four are 3 and 10, so the hierarchical and path distances of teach, football, stand and bus all satisfy the conditions, and the constructed word-sense related word set is R = {teach#n, football#n, stand#v, bus#n}.
Step 5.3: Following step 3.1, compute the disambiguation weight of each word-sense related word in R using the optimal parameters obtained in step four. In this example, according to formula (4), the disambiguation weights of teach#n, football#n, stand#v and bus#n are 0.2902804823653377, 0.2902804823653377, 0.12412383171664482 and 0.11654517159405858, respectively.
Step 5.4: Following step 3.2, determine the closeness between each word sense s_i of the ambiguous word w_t and the word-sense related word set R. In this example, for the word sense coach#n#1, formula (2) gives
relatedness(coach#n#1) = weight(teach#n) × wnss(coach#n#1, teach#n)/sum(teach#n) + weight(football#n) × wnss(coach#n#1, football#n)/sum(football#n) + weight(stand#v) × wnss(coach#n#1, stand#v)/sum(stand#v) + weight(bus#n) × wnss(coach#n#1, bus#n)/sum(bus#n)
= 0.03823363657834851 + 0.1565264349167673 + 0.0177929411579594 + 0.006597265862157682 = 0.2191502785152329,
where sum(w_j) denotes the sum of the wnss values of w_j over all word senses of coach#n.
In the same way:
relatedness(coach#n#2) = 0.11378754409746956;
relatedness(coach#n#3) = 0.13571081450099737;
relatedness(coach#n#4) = 0.1421077906515997;
relatedness(coach#n#5) = 0.21047354027607934.
Step 5.5: Following step 3.3, determine the correct word sense of the ambiguous word w_t. In this example, the closeness values (relatedness values) of the word senses of coach obtained in step 5.4 are compared, and coach#n#1, which has the highest value, is selected as the correct word sense.
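The effect of the optimized weights can be checked directly from the numbers stated in the embodiment. The sketch below multiplies the sense-normalized relatedness values of coach#n#1 (the four summands listed in step 3.2) by the optimized disambiguation weights of step 5.3, and then compares the closeness values reported in step 5.4. Variable names are illustrative.

```python
# Optimized disambiguation weights of teach#n, football#n, stand#v, bus#n (step 5.3).
weights = [0.2902804823653377, 0.2902804823653377,
           0.12412383171664482, 0.11654517159405858]

# Sense-normalized relatedness of coach#n#1 to the same four words (step 3.2).
shares_sense1 = [0.13171273613300974, 0.5392247995501404,
                 0.14334830718550395, 0.05660694280100067]

terms = [w * s for w, s in zip(weights, shares_sense1)]
relatedness_sense1 = sum(terms)

# Closeness values reported in step 5.4.
reported = {
    "coach#n#1": 0.2191502785152329,
    "coach#n#2": 0.11378754409746956,
    "coach#n#3": 0.13571081450099737,
    "coach#n#4": 0.1421077906515997,
    "coach#n#5": 0.21047354027607934,
}
selected = max(reported, key=reported.get)

print(terms)
print(relatedness_sense1)
print("selected:", selected)
```

The down-weighting of bus#n, whose path distance from coach is the largest, is what lets coach#n#1 overtake coach#n#5 after optimization.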
Through the above steps, the word sense of an English ambiguous word can be determined, and the word sense disambiguation task is completed.

Claims (1)

1. An English word sense disambiguation method based on a phrase structure syntax tree, characterized in that the specific operation steps are as follows:
step one: performing phrase structure syntactic analysis on the sentence to generate a phrase structure syntax tree; specifically:
step 1.1: denoting the sentence to be processed by S;
step 1.2: preprocessing the sentence S, which mainly includes removing garbled characters and special symbols and performing English tokenization, to obtain the preprocessed sentence S';
step 1.3: performing phrase structure syntactic analysis on the sentence S' with a phrase structure parser to generate the phrase structure syntax tree T;
step 1.4: lemmatizing the words in the phrase structure syntax tree T;
step two: based on the phrase structure syntax tree, computing the hierarchical distance and the path distance between the ambiguous word and the other words in the sentence, and screening out the word-sense related words; specifically:
step 2.1: denoting the ambiguous word to be disambiguated by w_t, denoting any other word in the sentence by w, and denoting by W the set of all content words in the sentence other than w_t;
step 2.2: using the phrase structure syntax tree T, computing the hierarchical distance d_l between the ambiguous word w_t and each other word w, attaching d_l to w and storing it in W;
step 2.3: using the phrase structure syntax tree T, computing the path distance d_p between the ambiguous word w_t and each other word w, attaching d_p to w and storing it in W;
step 2.4: specifying a hierarchical distance parameter d_layer and a path distance parameter d_path, selecting from W the words whose d_l is not greater than d_layer and whose d_p is not greater than d_path, and using them to construct the word-sense related word set R of the ambiguous word;
step three: constructing a word sense disambiguation model, and determining the correct word sense by evaluating the closeness between each word sense of the ambiguous word and the word-sense related words; specifically:
step 3.1: for each word w in the word-sense related word set R, computing its disambiguation weight from its hierarchical distance d_l and path distance d_p by formula (1);
$$\mathrm{weight}(w) = \frac{\alpha^{d_l}}{1 + \beta \log_{10} d_p} \tag{1}$$
wherein α and β are the adjustment parameters of the hierarchical distance d_l and the path distance d_p, respectively;
step 3.2: for each word sense s_i of the ambiguous word w_t, computing its closeness to the word-sense related word set R by formula (2);
$$\mathrm{relatedness}(s_i) = \sum_{w_j \in R} \mathrm{weight}(w_j) \cdot \frac{\mathrm{wnss}(s_i, w_j)}{\sum_{s_k \in \mathrm{sense}(w_t)} \mathrm{wnss}(s_k, w_j)} \tag{2}$$
wherein s_i denotes the i-th word sense of the ambiguous word w_t; sense(w_t) denotes the set of all word senses of w_t, with s_i ∈ sense(w_t); w_j denotes the j-th word-sense related word; R denotes the set of all word-sense related words of w_t, with w_j ∈ R; weight(w_j) denotes the disambiguation weight of w_j computed by formula (1); and wnss(s_i, w_j) denotes the word-sense relatedness between the word sense s_i and the word-sense related word w_j;
step 3.3: according to the closeness between each word sense s_i and the word-sense related word set R obtained in step 3.2, selecting the word sense with the highest closeness as the correct word sense of the ambiguous word;
step four: using a word-sense tagged corpus, optimizing the parameters of the word sense disambiguation model of step three with a genetic algorithm to obtain the optimized word sense disambiguation model; specifically:
step 4.1: selecting a suitable word-sense tagged corpus, denoted Corpus;
step 4.2: collecting each ambiguous word in Corpus together with the sentence in which it occurs and its correct word sense label, and constructing the training data set C_train of the word sense disambiguation model;
step 4.3: taking the hierarchical distance parameter d_layer and the path distance parameter d_path of step 2.4 and the adjustment parameters α and β of step 3.1 as the input vector of the genetic algorithm, taking formula (3) as the objective function of the genetic algorithm, and performing optimization training on C_train to obtain the optimal values of d_layer, d_path, α and β;
(3)
wherein precision is the disambiguation accuracy, i.e., the ratio of the number of correctly disambiguated ambiguous words to the total number of ambiguous words;
step 4.4: substituting the d_layer and d_path obtained in step 4.3 into step 2.4, and substituting α and β into formula (1), thereby completing the parameter optimization of the word sense disambiguation model;
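The parameter search of step four can be sketched with a small genetic algorithm. This is an illustrative sketch only: the source does not specify the selection, crossover or mutation operators, so truncation selection, uniform crossover, resampling mutation and elitism are assumptions, as are the search ranges in `BOUNDS`. The fitness is assumed to be the precision defined above, and `toy_disambiguate` is a hypothetical stand-in for the step-three model.

```python
import random


def precision(params, train, disambiguate):
    """Ratio of correctly disambiguated instances to all instances."""
    correct = sum(1 for inst, gold in train if disambiguate(inst, params) == gold)
    return correct / len(train)


# Search ranges are illustrative assumptions, not values from the source.
BOUNDS = [(1, 5), (1, 10), (0.0, 2.0), (0.0, 2.0)]  # d_layer, d_path, alpha, beta


def random_individual(rng):
    (l_lo, l_hi), (p_lo, p_hi), (a_lo, a_hi), (b_lo, b_hi) = BOUNDS
    return (rng.randint(l_lo, l_hi), rng.randint(p_lo, p_hi),
            rng.uniform(a_lo, a_hi), rng.uniform(b_lo, b_hi))


def crossover(a, b, rng):
    """Uniform crossover: each gene comes from either parent."""
    return tuple(x if rng.random() < 0.5 else y for x, y in zip(a, b))


def mutate(ind, rng, rate=0.2):
    """Resample each gene within its bounds with probability `rate`."""
    fresh = random_individual(rng)
    return tuple(f if rng.random() < rate else g for g, f in zip(ind, fresh))


def optimize(fitness, pop_size=20, generations=30, seed=0):
    """Generational GA with elitism; returns (best_params, best_fitness)."""
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - 1)
        ]
        pop = [ranked[0]] + children  # carry the best individual over unchanged
    best = max(pop, key=fitness)
    return best, fitness(best)


def toy_disambiguate(instance, params):
    """Toy model: predicts "A" when the thresholds admit the instance's distances."""
    d_layer, d_path, _alpha, _beta = params
    d_l, d_p = instance
    return "A" if d_l <= d_layer and d_p <= d_path else "B"


C_train = [((1, 2), "A"), ((2, 4), "A"), ((4, 9), "B"), ((5, 3), "B")]
best_params, best_precision = optimize(lambda p: precision(p, C_train, toy_disambiguate))
print(best_params, best_precision)
```

Because d_layer and d_path are integers while α and β are real-valued, the operators here copy whole genes rather than blending them, which keeps the integer genes valid.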
step five: repeating step one and step two for the word to be disambiguated, and determining its correct word sense using the optimized word sense disambiguation model obtained in step four; specifically comprising the following steps:
step 5.1: according to step one, generating the phrase structure syntax tree T of the sentence containing the word to be disambiguated w_t;
step 5.2: according to step two, obtaining the hierarchical distance and the path distance between the word to be disambiguated w_t and the other words in the sentence, screening word-sense related words according to the d_layer and d_path parameters obtained in step four, and constructing the word-sense related word set R;
step 5.3: according to step 3.1, calculating the disambiguation weight of each word-sense related word in the set R using the α and β parameters obtained in step four;
step 5.4: according to step 3.2, determining the closeness of each word sense s_i of the ambiguous word w_t to the word-sense related word set R;
step 5.5: according to step 3.3, determining the correct word sense of the ambiguous word w_t;
through the above steps, the word senses of English ambiguous words can be determined, and the word sense disambiguation task is completed.
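The inference flow of steps 5.2 to 5.5 can be sketched as a single function that reuses the optimized parameters. This is a hedged sketch: the `Params` type, the `disambiguate` signature and the sample values are hypothetical, the distances are assumed to have been computed from the syntax tree beforehand, formula (1) is supplied by the caller as `weight`, and the closeness is assumed to be a weighted sum of relatedness values.

```python
from typing import Callable, Dict, NamedTuple, Sequence, Tuple


class Params(NamedTuple):
    """Optimized parameters from step four (the values used below are placeholders)."""
    d_layer: int
    d_path: int
    alpha: float
    beta: float


def disambiguate(
    senses: Sequence[str],
    distances: Dict[str, Tuple[int, int]],  # context word -> (d_l, d_p)
    params: Params,
    weight: Callable[[int, int, float, float], float],  # formula (1), caller-supplied
    wnss: Callable[[str, str], float],  # sense relatedness, caller-supplied
) -> str:
    # Step 5.2: screen word-sense related words with the optimized thresholds.
    related = {
        w: d for w, d in distances.items()
        if d[0] <= params.d_layer and d[1] <= params.d_path
    }
    # Step 5.3: disambiguation weight of every word-sense related word.
    weights = {
        w: weight(d_l, d_p, params.alpha, params.beta)
        for w, (d_l, d_p) in related.items()
    }
    # Steps 5.4 and 5.5: score each sense (assumed weighted sum) and keep the best.
    return max(senses, key=lambda s: sum(weights[w] * wnss(s, w) for w in weights))


# Toy inputs; every number here is made up for illustration.
relatedness = {("bank#shore", "river"): 0.8, ("bank#finance", "river"): 0.1,
               ("bank#finance", "money"): 0.9, ("bank#shore", "money"): 0.1}
result = disambiguate(
    senses=["bank#finance", "bank#shore"],
    distances={"river": (1, 2), "money": (3, 6)},
    params=Params(d_layer=2, d_path=4, alpha=1.0, beta=1.0),
    weight=lambda d_l, d_p, a, b: 1.0 / (1.0 + a * d_l + b * d_p),
    wnss=lambda s, w: relatedness.get((s, w), 0.0),
)
print(result)
```

If no context word passes the thresholds, every sense scores zero and `max` returns the first listed sense, so the order of `senses` acts as the tie-breaker in this sketch.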
CN201610011045.8A 2016-01-10 2016-01-10 English word sense disambiguation method based on phrase structure syntax tree Pending CN105677639A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610011045.8A CN105677639A (en) 2016-01-10 2016-01-10 English word sense disambiguation method based on phrase structure syntax tree

Publications (1)

Publication Number Publication Date
CN105677639A (en) 2016-06-15

Family

ID=56299412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610011045.8A Pending CN105677639A (en) 2016-01-10 2016-01-10 English word sense disambiguation method based on phrase structure syntax tree

Country Status (1)

Country Link
CN (1) CN105677639A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510221A (en) * 2009-02-17 2009-08-19 北京大学 Enquiry statement analytical method and system for information retrieval

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
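HEYAN HUANG et al.: "Knowledge-based Word Sense Disambiguation with Feature Words Based on Dependency Relation and Syntax Tree", 《INTERNATIONAL JOURNAL OF ADVANCEMENTS IN COMPUTING TECHNOLOGY》 *
郎倩雨 et al.: "Application of an Electric Power English Corpus in Electric Power Studies", 《学理论》 *
鹿文鹏: "Research on Word Sense Disambiguation Methods Based on Dependency and Domain Knowledge", 《China Doctoral Dissertations Full-text Database, Information Science and Technology》 *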

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126501A (en) * 2016-06-29 2016-11-16 齐鲁工业大学 A kind of noun Word sense disambiguation method based on interdependent constraint and knowledge and device
CN106126501B (en) * 2016-06-29 2019-02-19 齐鲁工业大学 A kind of noun Word sense disambiguation method and device based on interdependent constraint and knowledge
CN108804529A (en) * 2018-05-02 2018-11-13 深圳智能思创科技有限公司 A kind of question answering system implementation method based on Web
CN110008310A (en) * 2019-04-04 2019-07-12 北京神州泰岳软件股份有限公司 A kind of content search method and device
CN110333990A (en) * 2019-05-29 2019-10-15 阿里巴巴集团控股有限公司 Data processing method and device
CN110333990B (en) * 2019-05-29 2023-06-27 创新先进技术有限公司 Data processing method and device
CN111079429A (en) * 2019-10-15 2020-04-28 平安科技(深圳)有限公司 Entity disambiguation method and device based on intention recognition model and computer equipment
CN111079429B (en) * 2019-10-15 2022-03-18 平安科技(深圳)有限公司 Entity disambiguation method and device based on intention recognition model and computer equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160615