CN107844473B - Word sense disambiguation method based on context similarity calculation - Google Patents

Word sense disambiguation method based on context similarity calculation

Info

Publication number
CN107844473B
CN107844473B
Authority
CN
China
Prior art keywords
speech
word
disambiguated
context
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710876243.5A
Other languages
Chinese (zh)
Other versions
CN107844473A (en)
Inventor
周俏丽
孟禹光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Aerospace University
Original Assignee
Shenyang Aerospace University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Aerospace University filed Critical Shenyang Aerospace University
Priority to CN201710876243.5A priority Critical patent/CN107844473B/en
Publication of CN107844473A publication Critical patent/CN107844473A/en
Application granted granted Critical
Publication of CN107844473B publication Critical patent/CN107844473B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention relates to a word sense disambiguation method based on context similarity calculation, comprising the following steps: processing the training corpus and training a model on the part-of-speech-tagged version of ukWaC; screening parts of speech, keeping only the real words: nouns, adjectives, adverbs and verbs; training a bidirectional LSTM model on the part-of-speech-screened corpus; inputting the example sentences of the word to be disambiguated into the bidirectional LSTM model to obtain their context vectors; inputting the context of the word to be disambiguated into the bidirectional LSTM model to obtain its context vector; computing the cosine similarity between the context vector of the word to be disambiguated and the context vectors of the example sentences, and selecting the sense of the word to be disambiguated with a k-nearest-neighbor method according to the similarity results. The invention models word senses more accurately: by joining each word directly to its part of speech with an underscore, it obtains word vectors that distinguish the different parts of speech of the same word, and it improves disambiguation accuracy by 0.5% over the baseline experiment.

Description

Word sense disambiguation method based on context similarity calculation
Technical Field
The invention relates to natural language translation technology, and in particular to a word sense disambiguation method based on context similarity calculation.
Background
Word sense disambiguation (WSD) is a long-standing problem with wide application. Current approaches fall into three categories: supervised, unsupervised and knowledge-based. Although published supervised word sense disambiguation systems perform well when given large-scale, sense-specific training corpora, the lack of large-scale annotated corpora is a major problem. This problem can be alleviated to some extent with pre-trained word vectors: word vectors trained in advance on a large-scale corpus contain rich semantic and syntactic information, so training a supervised system with them improves performance. To infer the sense of a word in a sentence, both the target word and its context must be represented explicitly. The context is defined as the part of the sentence that remains after the word to be disambiguated is removed. To compute context similarity, the context must also be represented as a vector.
In previous disambiguation work, the context was often represented simply by summing or weighted-averaging the word vectors in a window around the target word. A context representation built this way carries very limited information, because the target word is inherently tied to its sentence as a whole. To infer word senses in a sentence, both the target word vector and the context vector need to carry information about the entire sentence. A common drawback of many current disambiguation systems is that they ignore word order. The LSTM (Long Short-Term Memory network), and especially the bidirectional LSTM, overcomes this defect: it can model all the words around the target word while taking word order into account. However, a bidirectional LSTM that models the different parts of speech of a word as a single point is not very accurate, since the same word has different meanings when its parts of speech differ. A context2vec model trained without part-of-speech information treats words with different parts of speech as the same word, so a word that should have multiple senses is represented by only one vector in the semantic space.
Disclosure of Invention
To address the poor disambiguation accuracy caused by prior-art word sense disambiguation that models the different parts of speech of a word as a single point, the invention aims to provide a word sense disambiguation method based on context similarity calculation that distinguishes the different parts of speech of the same word and improves disambiguation accuracy.
To solve the above technical problem, the invention adopts the following technical scheme:
the invention relates to a word sense disambiguation method based on context similarity calculation, which comprises the following steps:
1) processing training corpora, and training a model by using a part-of-speech tagging version of ukWaC;
2) screening parts of speech, and only keeping real words including nouns, adjectives, adverbs and verbs;
3) training a bidirectional LSTM model by using the corpus with the screened part of speech;
4) inputting example sentences of words to be disambiguated into a bidirectional LSTM model to obtain context vectors;
5) inputting the context of the word to be disambiguated into a bidirectional LSTM model to obtain a context vector of the word to be disambiguated;
6) calculating cosine similarity of the context vector of the word to be disambiguated and the context vector of the example sentence, and further selecting the semantics of the word to be disambiguated by using a k-nearest neighbor method according to the obtained similarity result.
In step 1), training the model with the part-of-speech-tagged version of ukWaC means: the parts of speech are tagged automatically by TreeTagger, each part of speech is joined to its word, and the part-of-speech information is thereby added to the model during training.
In step 3), the bidirectional LSTM model is trained on the part-of-speech-screened corpus, using only sentences of at most 64 words.
In step 4), the example sentences of the word to be disambiguated are input into the bidirectional LSTM model, and their context vectors are compared by similarity with the context vector of the word to be disambiguated, so as to select the example sentence closest to the word to be disambiguated.
In step 1), three different part-of-speech tagging schemes were originally used to train models: fine-grained part-of-speech tags, coarse-grained part-of-speech tags, and part-of-speech tags on real words only. The real-words-only scheme maps the original part-of-speech tags, no longer considers the remaining parts of speech, joins each word to its part of speech in the form word_pos as a new word, and then trains the model.
The real-words-only tagging scheme maps the original part-of-speech tags by selecting the 4 real-word part-of-speech classes, mapping the fine-grained tags of these 4 classes to the corresponding class, and removing the remaining part-of-speech features from the corpus.
The invention has the following beneficial effects and advantages:
1. The method first analyzes the influence of words of different parts of speech on the semantics and selects the parts of speech with the greatest influence for tagging. This models the semantics better: each word is joined directly to its part of speech with an underscore, treated as one word, and input into the context2vec model for training, and the resulting word vectors distinguish the different parts of speech of the same word well.
2. Disambiguating with a model trained on corpora tagged by this method improves disambiguation accuracy by 0.5% over the baseline experiment.
Drawings
FIG. 1 illustrates how points in the semantic space change after parts of speech are added in the present invention;
FIG. 2 is a diagram of the context2vec model with parts of speech added in the present invention.
Detailed Description
The invention is further elucidated with reference to the accompanying drawings.
The invention relates to a word sense disambiguation method based on context similarity calculation, comprising the following steps:
1) processing the training corpus, training a model on the part-of-speech-tagged version of ukWaC;
2) screening parts of speech, keeping only the real words: nouns, adjectives, adverbs and verbs;
3) training a bidirectional LSTM model on the part-of-speech-screened corpus;
4) inputting the example sentences of the word to be disambiguated into the bidirectional LSTM model to obtain their context vectors;
5) inputting the context of the word to be disambiguated into the bidirectional LSTM model to obtain its context vector;
6) computing the cosine similarity between the context vector of the word to be disambiguated and the context vectors of the example sentences, and selecting the sense of the word to be disambiguated with a k-nearest-neighbor method according to the similarity results.
Part of speech is very important semantic and syntactic information. Because the ukWaC corpus used to train the context vector model provides a part-of-speech-tagged version, the invention joins each word directly to its part of speech with an underscore, treats the result as one word, and inputs it into the context2vec model (as shown in FIG. 2) for training, so that the resulting word vectors distinguish the different parts of speech of the same word.
As shown in FIG. 1, before parts of speech are added, the word "plane" has 3 forms: verb base form, singular noun and adjective. Treating these 3 different semantic-syntactic roles as the same point is clearly unreasonable. After parts of speech are added, each part of speech can be modeled separately, and the adjective, singular-noun and verb-base-form readings of "plane" are redistributed in the space to points of their own senses. In this way, semantic and syntactic information is captured better.
In step 1), comparing the disambiguation results of models trained with 3 different part-of-speech tagging schemes, namely fine-grained tags, coarse-grained tags, and tags on real words only, showed that the model trained with the real-words-only scheme disambiguates best. The original tag set subdivides each broad part-of-speech category into several tags. The invention selects only the 4 real-word classes, maps the fine-grained tags of the original tag set to the corresponding class, removes the remaining part-of-speech features from the corpus, then joins each word to its part of speech in the form word_pos as a new word, and trains the model.
In this embodiment, a part-of-speech-tagged version of ukWaC containing 2 billion words is used to train the model. The parts of speech are tagged automatically by TreeTagger, and each part of speech is joined to its word with "_"; for example, "apple" tagged as a noun is written as apple_nn and constitutes a new word. In this way, the part-of-speech information is added to the model during training, and the context vector of the word to be disambiguated obtained from the model includes part-of-speech information.
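The preprocessing just described can be sketched as follows; this is a minimal illustration in Python, and the helper name and the (word, tag) input format are assumptions standing in for the actual TreeTagger pipeline:

    # A minimal sketch of the word_pos preprocessing described above.
    # The (word, tag) input format is an assumption standing in for
    # TreeTagger output; the tagger itself is not invoked here.
    def combine_word_and_pos(tagged_sentence):
        """Join each word to its part-of-speech tag with '_',
        so that e.g. ('apple', 'NN') becomes the new token 'apple_nn'."""
        return [f"{word}_{tag.lower()}" for word, tag in tagged_sentence]

    print(combine_word_and_pos([("apple", "NN"), ("is", "VBZ")]))
    # ['apple_nn', 'is_vbz']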
In step 2), only the real words are kept when screening parts of speech: nouns, adjectives, adverbs and verbs. This step reduces the influence on model quality of words that contribute little to the semantics and whose part-of-speech tags are error-prone.
As the word sense disambiguation data set, the invention adopts the 2004 Senseval-3 lexical sample dataset, which contains 7860 training samples and 3944 test samples. This data set was used for parameter tuning and for testing disambiguation accuracy.
First, the training corpus is part-of-speech tagged; in the tagged corpus only the tags below are kept, and words with other tags are kept in their original form.
TABLE 1 Part-of-speech tags of the real words
    Part of speech   Original tags                  Mapped tag
    Noun             NN, NNS, NP, NPS               NN
    Adjective        JJ, JJR, JJS                   JJ
    Adverb           RB, RBR, RBS                   RB
    Verb             VV, VVD, VVG, VVN, VVP, VVZ    VV
For example: it is the half a late pitch, with bicycle-type handlebars and a squirting lever at the rear, while the which you step on to activate It.
In the corpus originally used to train the bidirectional LSTM, the invention first turns the sentence into its fully tagged form:
It_pp is_vbz quite_rb a_dt hefty_jj spade_nn ,_, with_in bicycle_nn -_: type_nn handlebars_nns and_cc a_dt sprung_vvn lever_nn at_in the_dt rear_nn ,_, which_wdt you_pp step_vvp on_in to_to activate_vv it_pp ._sent
Only the part-of-speech tags appearing in Table 1 are retained; other words keep their original form. After filtering, the above sentence becomes:
It is quite_rb a hefty_jj spade_nn, with bicycle_nn-type_nn handlebars_nn and a sprung_vv lever_nn at the rear_nn, which you step_vv on to activate_vv it.
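This screening step can be sketched as follows; the TAG_MAP dictionary follows Table 1, while the helper name and the (word, tag) input format are assumptions:

    # A minimal sketch of the real-word screening of Table 1.
    # Fine-grained tags of the 4 real-word classes are mapped to their
    # coarse class; all other words keep their original, untagged form.
    TAG_MAP = {
        "NN": "nn", "NNS": "nn", "NP": "nn", "NPS": "nn",   # nouns
        "JJ": "jj", "JJR": "jj", "JJS": "jj",               # adjectives
        "RB": "rb", "RBR": "rb", "RBS": "rb",               # adverbs
        "VV": "vv", "VVD": "vv", "VVG": "vv",               # verbs
        "VVN": "vv", "VVP": "vv", "VVZ": "vv",
    }

    def screen_real_words(tagged_sentence):
        """Keep word_pos units for real words only; other words stay bare."""
        out = []
        for word, tag in tagged_sentence:
            coarse = TAG_MAP.get(tag)
            out.append(f"{word}_{coarse}" if coarse else word)
        return " ".join(out)

    print(screen_real_words([("quite", "RB"), ("a", "DT"), ("hefty", "JJ")]))
    # quite_rb a hefty_jj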
In step 3), the bidirectional LSTM model is trained on the part-of-speech-screened corpus. To speed up training and to facilitate comparison with the baseline experiment, sentences longer than 64 words are not used, which reduces the corpus size by 10%.
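As a sketch, the length filter itself is a one-liner; counting tokens by whitespace is an assumption, since the patent only speaks of sentence length in words:

    # A minimal sketch: drop training sentences longer than 64 tokens.
    MAX_LEN = 64
    corpus_sentences = ["It is quite_rb a hefty_jj spade_nn ..."]  # stand-in data
    train_sentences = [s for s in corpus_sentences if len(s.split()) <= MAX_LEN]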
In step 4), the example sentences of the word to be disambiguated are input into the bidirectional LSTM model to obtain context vectors. These are used to compute similarity with the context vector of the word to be disambiguated, so as to select the example sentence closest to the context of the word to be disambiguated; the sense corresponding to that example sentence is the true sense of the word to be disambiguated.
The sentence containing the word to be disambiguated and the example sentences containing it are processed in the same way as above. For example, take a sentence whose tagged form ends "... to activate_vv the alarm_nn", where activate is the word to be disambiguated. The token activate_vv in the sentence is replaced with "[ ]", and the sentence is input into the previously trained bidirectional LSTM model to obtain its context vector v0.
The word to be disambiguated has example sentences S1, ..., Sn. The word to be disambiguated in each example sentence is replaced with "[ ]", and each sentence is input into the bidirectional LSTM model, yielding n context vectors v1, ..., vn. The cosine similarity between v0 and each of v1, ..., vn is computed, giving n values; the 5 largest are taken, and the corresponding sentences are assumed to be Sx1, Sx2, Sx3, Sx4, Sx5. Each sentence corresponds to a sense, and the sense that occurs most often among these 5 sentences is the sense of the word to be disambiguated.
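The similarity computation and sense selection of steps 5) and 6) can be sketched as follows; context_vector is a hypothetical wrapper around the trained bidirectional LSTM and is not reproduced here, and the remaining names are illustrative:

    # A minimal sketch of the cosine-similarity / k-nearest-neighbor
    # sense selection. `context_vector` is a hypothetical wrapper around
    # the trained bidirectional LSTM: it takes a preprocessed sentence in
    # which the target word has been replaced by "[ ]" and returns the
    # sentence's context vector.
    from collections import Counter
    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def disambiguate(target_sentence, examples, context_vector, k=5):
        """examples: (example_sentence, sense) pairs, each sentence already
        preprocessed with the word to be disambiguated replaced by '[ ]'."""
        v0 = context_vector(target_sentence)
        scored = [(cosine(v0, context_vector(s)), sense) for s, sense in examples]
        top_k = sorted(scored, key=lambda item: item[0], reverse=True)[:k]
        # Majority vote over the senses of the k most similar example sentences.
        return Counter(sense for _, sense in top_k).most_common(1)[0][0]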
To find the most effective way of introducing part-of-speech features, the invention trains models with 3 different part-of-speech tagging schemes: fine-grained part-of-speech tags, coarse-grained part-of-speech tags, and part-of-speech tags on real words only.
TABLE 2 Comparison of context-based target word prediction and disambiguation results
Comparing the results in Table 2, although the disambiguation accuracy of number 1 is higher than that of number 2, the target words predicted from its contexts fit neither the semantics nor even the grammar (after parts of speech are added, the part of speech is predicted together with the word; for comparability these are not listed here). After parts of speech are added, number 2 predicts the target words well, which shows that adding parts of speech lets the model play its role better. However, although context-based word prediction works better after parts of speech are added, the accuracy on the word sense disambiguation task does not reach the level achieved before they were added; the reason is probably that the TreeTagger tag set used in this embodiment contains too many part-of-speech tags.
Comparing the results in Table 2 further, number 3, which adds the coarse-grained part-of-speech tags, does not reach the disambiguation performance of number 1, but its context-predicted words fit the semantics and grammar better. Comparing numbers 2 and 3, both the disambiguation accuracy and the context-predicted words are very close. This is because both tagging schemes also tag the function words. Inspecting the corpus shows, for example, that some occurrences of "that" which should be tagged DT are tagged IN, and that "upon", which should be tagged IN, is tagged RP. Although the function words are few in kind, each occurs very frequently in the training corpus, and they generally form the frame of a sentence; if they are tagged incorrectly, they exert a larger influence on the semantic space.
The results in Table 2 also show that number 4 is better than the models trained with the previous 3 schemes, both in predicting the target word and in disambiguation. Comparing the two indicators leads to the conclusion that training the context2vec model on the corpus tagged with the real-words-only scheme improves model performance.
The above results were all obtained with a single k value and are not necessarily representative. To further demonstrate their representativeness, the models of numbers 4 and 1 were compared using different k values (1 to 10); the disambiguation result is essentially improved under every k value, which accords with the conclusion drawn above.
This example is also compared with other systems disambiguating on SE-3, with the results shown in Table 3. The second-best result, S-2, was achieved by Rothe and Schütze (2015), and the previous best result, S-1, by Ando (2006); the present method improves on the best result by 1.2% with a much simpler procedure. S-3 is the result of the context2vec model without parts of speech, using the k-nearest-neighbor method with k = 1. Ours-1 is the result obtained in this example with k = 5 using the context2vec model without parts of speech, and Ours-2 adds the part-of-speech features to Ours-1 with the same k value. It can be seen that disambiguation accuracy improves by 0.5% after the part-of-speech features are added.
In this section, the invention tags the corpus and trains models with 3 different part-of-speech tagging schemes. The first two do not reach the accuracy achieved before parts of speech were added; the analysis attributes this to the function words, which occur very frequently in the training corpus and are the main components of the sentence frame, so that their part-of-speech tagging errors strongly affect the modeling, while the senses expressed by their different parts of speech are often the same. The experimental results show that the real-words-only tagging scheme achieves a better effect than the model without parts of speech, and the results on the same test set show that the accuracy of the patented method improves by nearly 2% over the best previously published result.
TABLE 3 Results of different systems on the SE-3 test set
TABLE 4 Improvement on different systems after adding parts of speech

Claims (3)

1. A word sense disambiguation method based on context similarity calculation, characterized by comprising the following steps:
1) processing the training corpus, training a model on the part-of-speech-tagged version of ukWaC;
2) screening parts of speech, keeping only the real words: nouns, adjectives, adverbs and verbs;
3) training a bidirectional LSTM model on the part-of-speech-screened corpus;
4) inputting the example sentences of the word to be disambiguated into the bidirectional LSTM model to obtain their context vectors;
5) inputting the context of the word to be disambiguated into the bidirectional LSTM model to obtain its context vector;
6) computing the cosine similarity between the context vector of the word to be disambiguated and the context vectors of the example sentences, and selecting the sense of the word to be disambiguated with a k-nearest-neighbor method according to the similarity results;
in step 1), the model is trained with a part-of-speech tagging scheme that uses real words only: the original part-of-speech tags are mapped, the remaining parts of speech are not considered, each word is joined to its part of speech in the form word_pos and treated as a new word, and the model is then trained;
the real-words-only tagging scheme maps the original part-of-speech tags by selecting the 4 real-word part-of-speech classes, mapping the fine-grained tags of these 4 classes to the corresponding class, and removing the remaining part-of-speech features from the corpus;
in step 1), training the model with the part-of-speech-tagged version of ukWaC means: the parts of speech are tagged automatically by TreeTagger, each part of speech is joined to its word, and the part-of-speech information is thereby added to the model during training; the adverb RB, the comparative adverb RBR and the superlative adverb RBS are mapped to the adverb RB; the plural common noun NNS, the singular common noun NN, the singular proper noun NP and the plural proper noun NPS are mapped to the singular common noun NN; the superlative adjective JJS, the comparative adjective JJR and the adjective JJ are mapped to the adjective JJ; and the verb base form VV, the verb past tense VVD, the verb gerund or present participle VVG, the verb past participle VVN, the verb present tense other than third person singular VVP, and the verb present tense third person singular VVZ are mapped to the verb base form VV.
2. The word sense disambiguation method based on context similarity calculation of claim 1, characterized in that in step 3) the bidirectional LSTM model is trained on the part-of-speech-screened corpus, using sentences whose length is at most 64 words.
3. The word sense disambiguation method based on context similarity calculation of claim 1, characterized in that in step 4) the example sentences of the word to be disambiguated are input into the bidirectional LSTM model and their similarity with the context vector of the word to be disambiguated is calculated, so as to select the example sentence closest to the context of the word to be disambiguated.
CN201710876243.5A 2017-09-25 2017-09-25 Word sense disambiguation method based on context similarity calculation Active CN107844473B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710876243.5A CN107844473B (en) 2017-09-25 2017-09-25 Word sense disambiguation method based on context similarity calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710876243.5A CN107844473B (en) 2017-09-25 2017-09-25 Word sense disambiguation method based on context similarity calculation

Publications (2)

Publication Number Publication Date
CN107844473A CN107844473A (en) 2018-03-27
CN107844473B true CN107844473B (en) 2020-12-18

Family

ID=61661705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710876243.5A Active CN107844473B (en) 2017-09-25 2017-09-25 Word sense disambiguation method based on context similarity calculation

Country Status (1)

Country Link
CN (1) CN107844473B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107622311A (en) * 2017-10-09 2018-01-23 深圳市唯特视科技有限公司 A kind of robot learning by imitation method based on contextual translation
CN109697292B (en) * 2018-12-17 2023-04-21 北京百度网讯科技有限公司 Machine translation method, device, electronic equipment and medium
CN111444676A (en) * 2018-12-28 2020-07-24 北京深知无限人工智能研究院有限公司 Part-of-speech tagging method, device, equipment and storage medium
CN109753569A (en) * 2018-12-29 2019-05-14 上海智臻智能网络科技股份有限公司 A kind of method and device of polysemant discovery
CN110705295B (en) * 2019-09-11 2021-08-24 北京航空航天大学 Entity name disambiguation method based on keyword extraction
CN111259655B (en) * 2019-11-07 2023-07-18 上海大学 Logistics intelligent customer service problem similarity calculation method based on semantics
CN111310475B (en) * 2020-02-04 2023-03-10 支付宝(杭州)信息技术有限公司 Training method and device of word sense disambiguation model
CN112949319B (en) * 2021-03-12 2023-01-06 江南大学 Method, device, processor and storage medium for marking ambiguous words in text

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105243129A (en) * 2015-09-30 2016-01-13 清华大学深圳研究生院 Commodity property characteristic word clustering method
CN106844350A (en) * 2017-02-15 2017-06-13 广州索答信息科技有限公司 A kind of computational methods of short text semantic similarity

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7197457B2 (en) * 2003-04-30 2007-03-27 Robert Bosch Gmbh Method for statistical language modeling in speech recognition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105243129A (en) * 2015-09-30 2016-01-13 清华大学深圳研究生院 Commodity property characteristic word clustering method
CN106844350A (en) * 2017-02-15 2017-06-13 广州索答信息科技有限公司 A kind of computational methods of short text semantic similarity

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
context2vec: Learning Generic Context Embedding with Bidirectional LSTM; Oren Melamud et al.; Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning; 2016-01-31; pp. 2-12 *
Word sense disambiguation research: resources, methods and evaluation; Wu Yunfang; Contemporary Linguistics; 2009-02-28; vol. 11, no. 2; pp. 113-122 *

Also Published As

Publication number Publication date
CN107844473A (en) 2018-03-27

Similar Documents

Publication Publication Date Title
CN107844473B (en) Word sense disambiguation method based on context similarity calculation
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
CN104794169B (en) A kind of subject terminology extraction method and system based on sequence labelling model
CN103678684B (en) A kind of Chinese word cutting method based on navigation information retrieval
CN107818085B (en) Answer selection method and system for reading understanding of reading robot
WO2019080863A1 (en) Text sentiment classification method, storage medium and computer
Adler et al. An unsupervised morpheme-based HMM for Hebrew morphological disambiguation
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN107220232A (en) Keyword extracting method and device, equipment and computer-readable recording medium based on artificial intelligence
CN110222178A (en) Text sentiment classification method, device, electronic equipment and readable storage medium storing program for executing
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN106503192A (en) Name entity recognition method and device based on artificial intelligence
CN109388743B (en) Language model determining method and device
CN106599032A (en) Text event extraction method in combination of sparse coding and structural perceptron
CN106570180A (en) Artificial intelligence based voice searching method and device
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
CN111475622A (en) Text classification method, device, terminal and storage medium
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN111666758A (en) Chinese word segmentation method, training device and computer readable storage medium
CN109086340A (en) Evaluation object recognition methods based on semantic feature
CN109033320A (en) A kind of bilingual news Aggreagation method and system
CN109062904A (en) Logical predicate extracting method and device
CN110633467A (en) Semantic relation extraction method based on improved feature fusion
CN115017903A (en) Method and system for extracting key phrases by combining document hierarchical structure with global local information
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant