CN110717015B - Neural network-based polysemous word recognition method - Google Patents

Info

Publication number: CN110717015B (granted publication of application CN110717015A)
Application number: CN201910956103.8A
Authority: CN (China)
Inventors: 姚念民, 郭顺
Assignee (original and current): Dalian University of Technology
Legal status: Active (granted)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3335Syntactic pre-processing, e.g. stopword elimination, stemming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374Thesaurus

Abstract

The invention provides a neural network-based polysemous word recognition method, belonging to the fields of data mining and natural language processing. The method uses the contextual semantics of words in a text to identify polysemous words and to generate polysemous word representations, and comprises five steps: 1) preprocessing the corpus; 2) pre-training word representations; 3) extracting contexts; 4) identifying polysemous words; 5) selecting a polysemous word representation. The invention makes full use of the properties of word vectors and automatically identifies polysemous words through differences in their contextual semantics. For a specific task, the invention further provides a method for selecting a polysemous word representation according to the context of the polysemous word, which improves both the quality of the text representation and the accuracy of the task. In addition, the method is simple to implement and widely applicable.

Description

Neural network-based polysemous word recognition method
Technical Field
The invention belongs to the fields of data mining and natural language processing, and particularly relates to a neural network-based polysemous word recognition method that can be applied to many natural language processing tasks such as text classification and sentiment analysis.
Background
Word representation is a fundamental and important task in data mining and natural language processing. In recent years, neural network-based approaches to learning distributed representations of words have attracted much attention. Among them, the well-known word2vec model stands out for its efficiency and ease of use. Word2vec trains each target word from its context and maps words with similar meanings to nearby points in vector space. By generating high-quality word representations, this model has been successful in many tasks, such as language modeling, text understanding, and machine translation.
Polysemous word recognition is a popular research problem in natural language processing. Polysemous words are words with two or more meanings; most are common words closely tied to everyday life, and most are verbs and adjectives. Because of their multiple meanings, polysemous words are expressive in puns, metaphors, metonymy, and other rhetorical uses. The task of polysemous word recognition is to let a computer automatically recognize the polysemous words present in a given paragraph or sentence and give each of them a more accurate word representation. Identifying polysemous words is significant: it improves the quality of word and segment representations, allows the emotion expressed by a sentence to be mined more accurately, and raises the accuracy of natural language processing tasks.
At present, research on polysemous word recognition is limited. Existing methods blindly train several word representations for every word in the text and do not achieve automatic recognition. Moreover, this approach not only consumes much training time but also occupies substantial storage resources.
Disclosure of Invention
The invention aims to provide a neural network-based polysemous word recognition method, which can automatically identify polysemous words in a text according to context semantics and generate word representations closer to the context semantics for each polysemous word, thereby obtaining high-quality text representations and improving the accuracy of natural language processing tasks.
The technical scheme of the invention is as follows: a neural network-based polysemous word recognition method comprises the following steps:
first, pre-processing corpus
1.1) selecting a corpus in a natural language processing task, and deleting special characters and non-recognizable characters in the text.
Second, pre-training word representation
2.1) Pre-train word vectors on the preprocessed corpus with a word vector training tool. Various models can be selected, such as the CBOW model in word2vec, doc2vecC, and improved models based on them. FIG. 1 shows schematics of the CBOW and doc2vecC models.
2.2) after the pre-training is finished, storing a word-word vector mapping table.
Third, extracting context
3.1) define a new context window and rescan the whole corpus to extract the context of each word in different sentences.
3.2) counting words in the contexts, deleting repeated words, and generating a context dictionary corresponding to each word. Each line of the dictionary records a set of words that appear in the context of a word.
3.3) mapping each context dictionary in the step 3.2) with a corresponding word to construct a word-context dictionary mapping table.
Fourthly, recognizing the ambiguous word
4.1) Load the word-context dictionary mapping table obtained in step 3.3) and perform k-means clustering (k ≥ 2) separately on the context corresponding to each word in the table. Before the clustering operation, the words in the context must be converted into their word-vector forms according to the word-word vector mapping table obtained in step 2.2). After clustering, we obtain the category each word of the context dictionary belongs to, and the center vector of each category.
4.2) Evaluate the clustering result of each word's context in the mapping table with a cluster evaluation algorithm, such as the silhouette coefficient or the Calinski-Harabasz (CH) index. The evaluation algorithm takes the word representations participating in the clustering and the category each word belongs to as input, and outputs an evaluation value. If the evaluation result for a word's context is greater than a predefined threshold, the word is judged to be polysemous.
4.3) Output the polysemous words and use the center vectors of the categories obtained in step 4.1) as word representations of their different word senses.
Fifth, selection of ambiguous word representation
The above steps complete the recognition of polysemous words and yield a word representation for each word sense of every polysemous word. In a specific task, using the polysemous word representation that matches the current semantics improves the quality of the text representation and the accuracy of the task. The operation steps for selecting a polysemous word representation are as follows:
5.1) Rescan the words in the corpus; once a target word appears in the polysemous word list, a word representation matching the current contextual semantics must be selected for it.
5.2) Obtain the context of the polysemous word with the context window.
5.3) Obtain the word vectors of the context words from the word-word vector mapping table of step 2.2) and compute their arithmetic mean as the context vector.
5.4) Compute the distance between the context vector and each of the word's sense representations. The distance can be measured in various ways, such as Euclidean distance or cosine distance.
5.5) Finally, select the sense vector closest to the context vector as the word representation of the polysemous word in the current context. FIG. 2 shows the technical scheme of the invention.
The invention has the beneficial effects that: the excellent characteristics of the word vector are fully utilized, the polysemous words are identified through the context semantic difference of the words, and the automatic identification is really realized. Meanwhile, in a specific task, the invention also provides a method for selecting the polysemous word representation through the context of the polysemous word, so that the text representation quality is improved, and the task accuracy is also improved. In addition, the implementation process of the invention is simple and convenient, and has good applicability.
Drawings
FIG. 1 is a schematic representation of the CBOW model and the doc2vecC model. Wherein, (a) represents a model architecture of CBOW; (b) a model architecture of doc2vecC is represented.
Fig. 2 is a technical scheme diagram of ambiguous word recognition.
Detailed Description
The specific embodiments described are merely illustrative of implementations of the invention and do not limit the scope of the invention. Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. As shown in fig. 2, the overall implementation flow includes five steps, and the following is a detailed description for each step:
first, pre-processing corpus
1.1) Select a corpus for the natural language processing task, denoted D = {D_1, ..., D_n}. The corpus contains n paragraphs, and D_i represents the i-th paragraph in corpus D. After deleting the special characters and non-recognizable characters in each paragraph D_i, word sequences of different lengths are obtained, denoted D_i = {w_1, w_2, ..., w_{|D_i|}}.
Second, pre-training word representation
2.1) Pre-train word vectors on the preprocessed corpus with a word vector training tool. Various models can be selected, such as the CBOW model in word2vec, doc2vecC, and improved models based on them. Here the CBOW model in word2vec is taken as an example for detailed explanation.
The CBOW model contains three network layers: an input layer, a hidden layer, and an output layer. At the input layer, the model defines a local context window of size c: the window takes the c words before and after the target word, so the context window contains 2c words in total. The model trains on each paragraph D_i in turn and takes the word representations in the context window as input. At the hidden layer, the model sums the word vectors of the input context window:

h_t = Σ_{-c ≤ j ≤ c, j ≠ 0} v(w_{t+j})

Finally, the output layer predicts, for the target word, a probability value against each word of the vocabulary. The training process repeatedly predicts the target words from the words in their context windows within the current paragraph, maximizing the objective function L of the model:

L = Σ_{i=1}^{n} Σ_{t=1}^{|D_i|} log p(w_t^{(i)} | h_t)    (1)

where w_t^{(i)} denotes the t-th word of paragraph D_i in the corpus, p(w_t^{(i)} | h_t) is the softmax probability of w_t^{(i)} given h_t, computed over the vocabulary, and V denotes the dictionary of the current corpus.
Before training begins, the parameters of the model need to be set. The dimension of the word representations can be set between 100 and 1000, and the size of the context window c between 2 and 10. Because the proposed method must identify polysemous words from context semantics, precise context information needs to be preserved in the text; therefore, all low-frequency words are retained. Other parameters use default values.
2.2) After pre-training finishes, obtain a word-word vector mapping table, denoted M = {w_1: v(w_1), ..., w_n: v(w_n)}, where v(w) denotes the pre-trained vector of word w.
Third, extracting context
3.1) Define a new context window whose size cannot be larger than the one set in step 2.1). For example, if the context window defined for the CBOW model has size 5, then the new context window set in this step should have a size between 1 and 5. Rescan the entire corpus with the new context window and extract the context of each word of the word-word vector mapping table in the different sentences, generating a set T = {T(w_1), ..., T(w_n)}, where T(w_t) is the collection of contexts of word w_t; T_m(w_1) and T_k(w_n) denote, respectively, the m-th context of word w_1 and the k-th context of word w_n. The number of contexts of different words in the set need not be the same.
3.2) Count the words in these contexts, delete repeated words, and generate a context dictionary for each target word, denoted C(w_t), i.e. the set of all words appearing in the contexts of w_t.
3.3) Construct a mapping table from words to their context dictionaries, denoted W = {w_1: C(w_1), ..., w_n: C(w_n)}. Each row of this mapping table records a word of the dictionary together with the set of all words that have appeared in its context.
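The scan of steps 3.1)-3.3) can be sketched in plain Python (the window size and toy sentences are illustrative assumptions):

```python
def build_context_dictionaries(sentences, window=2):
    """For each word, collect the set of distinct words appearing within
    `window` positions of it in any sentence (steps 3.2 and 3.3)."""
    table = {}
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            neighbours = sent[lo:i] + sent[i + 1:hi]
            table.setdefault(word, set()).update(neighbours)
    return table  # word -> context dictionary, one row per word

sentences = [
    ["deposit", "money", "in", "the", "bank"],
    ["sit", "on", "the", "river", "bank"],
]
table = build_context_dictionaries(sentences, window=2)
# table["bank"] now holds every distinct word seen near "bank".
```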
Fourthly, recognizing the ambiguous word
4.1) Load the mapping table obtained in step 3.3) and perform k-means clustering separately on the context dictionary C(w) of each word in the table. Here k is the number of clusters, and different clusters represent different contextual semantics. Take the first row of the mapping table, w_1: C(w_1), as an example. Before the clustering operation, the words in C(w_1) are converted into their word-vector forms according to the word-word vector mapping table obtained in step 2.2). Equation (2) is the objective function of the k-means algorithm:

E = Σ_{i=1}^{k} Σ_{x ∈ C_i(w_1)} || x - μ_i ||²    (2)

where C_i(w_1) ⊆ C(w_1) denotes the set of context words of w_1 assigned to cluster i, x denotes the vector representation of a word in that cluster, and μ_i denotes the center vector of cluster i. After clustering, we obtain for w_1 the sets of context words per cluster, C_1(w_1), ..., C_k(w_1), and the corresponding center vectors μ_1(w_1), ..., μ_k(w_1). After the same operation has been applied to every word in the mapping table, we obtain the set C of all clustered contexts and the set U of all center vectors.
In a specific implementation, the number of clusters is set in the range 2-5. Since most polysemous words in a corpus carry essentially 2 word senses and more senses are rare, the context cluster number k is usually set to 2 for all words.
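As an illustration of step 4.1), here is a minimal pure-Python k-means sketch on toy 2-D points standing in for context word vectors (the points, the deterministic initialization, and k = 2 are illustrative assumptions; a real implementation would cluster the pre-trained word vectors, typically with a library such as scikit-learn):

```python
def kmeans(points, k, iters=20):
    """Minimise equation (2): the sum of squared distances of each point
    to its cluster center. Deterministic spread initialization is used
    here for reproducibility; k-means++ is the usual choice in practice."""
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest center wins
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # update step: move each center to the mean of its cluster
        centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters, centers

# Two well-separated groups standing in for two context semantics of one word.
points = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.1, 5.0), (5.2, 5.1)]
clusters, centers = kmeans(points, k=2)
```

The returned `centers` play the role of the sense center vectors μ_1, ..., μ_k used in the later steps.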
4.2) Evaluate the context clustering result of each word in the mapping table with a cluster evaluation algorithm, such as the silhouette coefficient or the Calinski-Harabasz (CH) index. Taking the silhouette coefficient as an example, the context clustering result of a word w_t is evaluated by the following steps:
4.2.1) Load the set C obtained in step 4.1) and take out the context clustering result of word w_t, i.e. C_1(w_t), ..., C_k(w_t).
4.2.2) Compute the silhouette coefficient of each word in the set. First, take a single word x in the set and compute the average distance a(x) from x to the other words of the same cluster, called the intra-cluster dissimilarity of x; the smaller a(x) is, the more x should be clustered into this class. Then compute the average distance b(x) from x to the words of the other clusters, called the dissimilarity of x to the other clusters; the larger b(x) is, the less x belongs to the other clusters. From a(x) and b(x), the silhouette coefficient of x is defined as:

s(x) = (b(x) - a(x)) / max(a(x), b(x))    (3)

4.2.3) Compute the overall silhouette coefficient of the context set of w_t:

S(w_t) = (1/d) Σ_{x} s(x)    (4)

where d denotes the total number of words in the context set of w_t. The silhouette coefficient ranges over [-1, 1], and higher values indicate better clustering. Evaluating the context clustering results of all words in the set C yields a result set S = {S(w_1), ..., S(w_n)}.
4.2.4) Define a threshold α for polysemous-word discrimination and compare each value in the set S with α; if S(w_t) > α, the word w_t is judged to be a polysemous word.
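The silhouette evaluation of step 4.2) and the threshold test of step 4.2.4) can be sketched as follows (pure Python; the toy clusters and the threshold value are illustrative assumptions):

```python
def euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def silhouette(clusters):
    """Mean of s(x) = (b(x) - a(x)) / max(a(x), b(x)) over all points,
    where a(x) is the average distance to the same cluster and b(x)
    the average distance to the other clusters."""
    scores = []
    for i, cl in enumerate(clusters):
        others = [p for j, c in enumerate(clusters) if j != i for p in c]
        for x in cl:
            same = [p for p in cl if p is not x]
            a = sum(euclid(x, p) for p in same) / len(same)
            b = sum(euclid(x, p) for p in others) / len(others)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated context clusters -> high silhouette -> polysemous.
clusters = [[(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)],
            [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]]
alpha = 0.5  # hypothetical discrimination threshold
is_polysemous = silhouette(clusters) > alpha
```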
4.3) Output the polysemous words, and use the center vectors μ_1, ..., μ_k obtained for each polysemous word in step 4.1) as the word representations of its different word senses.
Fifth, selection of ambiguous word representation
The above steps complete the recognition of polysemous words and yield a word representation for each word sense of every polysemous word. In a specific task, using the polysemous word representation that matches the current semantics improves the quality of the text representation and the accuracy of the task. The operation steps for selecting a polysemous word representation are presented in detail below:
5.1) Rescan the words in the corpus; once a target word appears in the polysemous word list, a word representation matching the current contextual semantics must be selected for it. Let the currently processed sentence be D_t, and let w_t ∈ D_t denote a polysemous word in it.
5.2) Use the context window defined in step 3.1) to obtain the context of the polysemous word w_t, denoted context(w_t).
5.3) Obtain the word vectors of the context words from the word-word vector mapping table of step 2.2) and compute their arithmetic mean as the context vector:

v_ctx(w_t) = (1/|context(w_t)|) Σ_{w ∈ context(w_t)} v(w)    (5)
5.4) Compute the distances between v_ctx(w_t) and the word representations μ_1, ..., μ_k of the different senses of w_t. The distance measurement can take various forms, such as the Euclidean distance of equation (6) or the cosine distance of equation (7):

d_E(v_ctx, μ_i) = || v_ctx - μ_i ||    (6)

d_cos(v_ctx, μ_i) = 1 - (v_ctx · μ_i) / (|| v_ctx || · || μ_i ||)    (7)

5.5) Finally, select from μ_1, ..., μ_k the sense vector closest to the context vector v_ctx(w_t) as the word representation of the polysemous word in the current context.
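Steps 5.3)-5.5), averaging the context word vectors and choosing the nearest sense center under the Euclidean distance of equation (6), can be sketched as follows (all vectors are hypothetical toy values):

```python
def mean_vector(vectors):
    """Context vector: arithmetic mean of the context word vectors (step 5.3)."""
    n = len(vectors)
    return tuple(sum(c) / n for c in zip(*vectors))

def nearest_sense(context_vectors, sense_centers):
    """Return the index of the sense center closest to the context vector,
    using squared Euclidean distance (monotone in equation (6))."""
    ctx = mean_vector(context_vectors)
    return min(range(len(sense_centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(ctx, sense_centers[i])))

# Two hypothetical sense centers for one polysemous word, e.g. "bank".
senses = [(0.0, 0.0), (5.0, 5.0)]
# Word vectors of the words in the current context window.
context = [(4.0, 4.5), (5.5, 5.0), (4.8, 5.2)]
chosen = nearest_sense(context, senses)  # index of the matching sense
```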
The foregoing is a detailed description of the invention in connection with specific preferred embodiments, and the practice of the invention is not limited to these embodiments. Those skilled in the art may make several simple deductions or substitutions without departing from the spirit of the invention, and these should be construed as falling within the protection scope of the present invention.

Claims (7)

1. A polysemous word recognition method based on a neural network is characterized by comprising the following steps:
first, pre-processing corpus
1.1) selecting a corpus in a natural language processing task, and deleting special characters and non-recognizable characters in a text;
second, pre-training word representation
2.1) pre-training word vectors for the preprocessed corpus by using a word vector training tool; the word vector training tool comprises word2vec, doc2vecC, and improved models based on word2vec and doc2vecC;
2.2) after the pre-training is finished, storing a word-word vector mapping table;
third, extracting context
3.1) defining a new context window, rescanning the whole corpus and extracting the context of each word in different sentences;
3.2) counting words in the context corresponding to each word, deleting repeated words, and generating a context dictionary corresponding to each word; each line of the dictionary records a set of words that appear in the context of a word;
3.3) mapping each context dictionary in the step 3.2) with corresponding words to construct a word-context dictionary mapping table;
fourthly, recognizing the ambiguous word
4.1) loading the word-context dictionary mapping table obtained in the step 3.3), and respectively carrying out k-means clustering on the context corresponding to each word in the mapping table, wherein k is more than or equal to 2; before clustering operation, words in the context need to be converted into corresponding word vector forms according to the word-word vector mapping table obtained in the step 2.2); after clustering operation, obtaining the category to which each word in the context dictionary belongs and the central vector of each category;
4.2) evaluating the clustering result of the context of each word in the mapping table by using a clustering evaluation algorithm; the clustering evaluation algorithm needs to take word representation participating in clustering and the category of the word as input and output an evaluation value; when the evaluation result of the context of a word is larger than a predefined threshold value, judging the word as a polysemous word;
4.3) outputting a polysemous word, and using the central vector of each category obtained in the step 4.1) of the polysemous word as word representation of different word senses;
fifth, selection of ambiguous word representation
5.1) rescanning the words in the corpus, and once a target word appears in the polysemous word list, selecting for the polysemous word a word representation that conforms to the current contextual semantics;
5.2) obtaining the context of the ambiguous word by using a context window;
5.3) obtaining word vectors of the words in the context from the word-word vector mapping table in the step 2.2), and calculating the arithmetic mean of the word vectors as the context vector;
5.4) calculating the distance between the context vector of the word and the word representation of different word senses of the word respectively;
5.5) finally selecting the polysemous word vector closest to the context vector as the word representation of the polysemous word in the current context.
2. The method according to claim 1, wherein the corpus of step 1.1) is an arbitrary corpus related to text representation.
3. The neural network-based polysemous word recognition method according to claim 1 or 2, wherein the new context window of step 3.1) serves the same purpose as the context window in word2vec, namely defining the range from which context is extracted; and the size of the new context window defined in step 3.1) cannot be larger than the window size defined when pre-training the word representations in step 2.1).
4. The neural network-based polysemous word recognition method according to claim 1 or 2, wherein the cluster evaluation algorithm of step 4.2) comprises the silhouette coefficient and the Calinski-Harabasz (CH) index.
5. The neural network-based polysemous word recognition method according to claim 3, wherein the cluster evaluation algorithm of step 4.2) comprises the silhouette coefficient and the Calinski-Harabasz (CH) index.
6. The neural network-based polysemous word recognition method according to claim 1, 2 or 5, wherein the context window of step 5.2) is consistent with the context window defined in step 3.1); and step 5.4) adopts the Euclidean distance or the cosine distance as the distance measurement.
7. The neural network-based polysemous word recognition method according to claim 3, wherein the context window of step 5.2) is consistent with the context window defined in step 3.1); and step 5.4) adopts the Euclidean distance or the cosine distance as the distance measurement.
CN201910956103.8A 2019-10-10 2019-10-10 Neural network-based polysemous word recognition method Active CN110717015B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910956103.8A CN110717015B (en) 2019-10-10 2019-10-10 Neural network-based polysemous word recognition method


Publications (2)

Publication Number Publication Date
CN110717015A CN110717015A (en) 2020-01-21
CN110717015B true CN110717015B (en) 2021-03-26

Family

ID=69212371


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255345B (en) * 2021-06-10 2021-10-15 腾讯科技(深圳)有限公司 Semantic recognition method, related device and equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9213687B2 (en) * 2009-03-23 2015-12-15 Lawrence Au Compassion, variety and cohesion for methods of text analytics, writing, search, user interfaces
CN106909537B (en) * 2017-02-07 2020-04-07 中山大学 One-word polysemous analysis method based on topic model and vector space
CN107861939B (en) * 2017-09-30 2021-05-14 昆明理工大学 Domain entity disambiguation method fusing word vector and topic model
CN109033307B (en) * 2018-07-17 2021-08-31 华北水利水电大学 CRP clustering-based word multi-prototype vector representation and word sense disambiguation method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant