CN110705274B

CN110705274B - Fusion type word meaning embedding method based on real-time learning

Info

Publication number: CN110705274B
Application number: CN201910839702.1A
Authority: CN
Inventors: 桂盛霖; 方丹
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2019-09-06
Filing date: 2019-09-06
Publication date: 2023-03-24
Anticipated expiration: 2039-09-06
Also published as: CN110705274A

Abstract

The invention discloses a real-time learning-based fusion type word meaning embedding method, and belongs to the technical field of automatic generation of word vectors. The method comprises the steps of obtaining word sense vectors of words to be subjected to word sense embedding processing currently based on a neural network language model set by the neural network language model and based on projection output of the neural network language model; the input layer of the network structure of the neural network language model is used for acquiring a corresponding vector of a current word k in a preset word vector matrix V; the projection layer is used for judging the current word k, if the current word k is a univocal word, performing identity projection, and outputting a corresponding vector of the k in a preset word vector matrix V as the projection layer; if the word meaning vector is a multi-meaning word, the corresponding word meaning vector is obtained through a word meaning recognition algorithm based on real-time learning, and the projection layer outputs the obtained word meaning vector. The invention realizes the calculation and generation of the word meaning vector of the polysemous word by using a real-time learning method, and improves the quality of the generated vector on the premise of ensuring the calculation efficiency of the word meaning vector.

Description

Fusion type word meaning embedding method based on real-time learning

Technical Field

The invention belongs to the technical field of automatic generation of word vectors, and particularly relates to a fusion type word meaning embedding method based on real-time learning.

Background

In Natural Language Processing (NLP) related tasks, since machines cannot directly understand and analyze human languages, it is usually necessary to model natural languages and then provide them as input to a computer. Word vector (Word Representation) is a product of converting words in human language into abstract Representation, and the Word vectors commonly used at present have two types:

One-Hot replication: generating this type of word vector first requires counting all words in the corpus to generate a vocabulary N and a unique number for each word. For a word, the length of the corresponding generated word vector is | N |, the corresponding position of the word number in the word vector is 1, and the rest positions are 0. The problem of such vector representation is that it takes up a lot of space, resulting in high subsequent computation, and the word vector cannot characterize the relationship between words.

Distributed Representation generation this generation of word vectors overcomes the disadvantages of One-Hot Representation. Distributed replication represents words as dense vectors. The generation of the vector is usually a byproduct of some language model training, words in a corpus are mapped to a word vector space through training of a corpus, and the relationship between the vectors is word semantics and expression of lexical relationships. The similarity of word semantics can thus be represented by the degree of approximation of the word vector values.

The current processing for generating word vectors can be divided into the following according to the granularity of the language unit corresponding to the word vector:

(1) Word embedding: words in natural language are represented as vector data that a computer can process.

(2) Word sense embedding: specific semantics possessed by words in a natural language are expressed as vector data that can be processed by a computer.

Word sense embedding is one of the main drawbacks for word embedding class models: the problem that the word meaning of a polysemous word cannot be accurately expressed is that a word vector generation model which is more sensitive to the word meaning is gradually formed. The word meaning embedding class model can generate a plurality of word vectors corresponding to a plurality of semantemes of the multi-meaning words according to the semantic conditions of the multi-meaning words in the corpus, and the embedding model can be used for describing the words more accurately in semantic level. Currently, there are two main types of word sense embedding class models: a two-stage type and a fused type. The two-stage type means that the processes of word sense recognition and word vector generation are serially separated. The fusion model completes the word meaning recognition in the process of generating the word vector.

Schutze first proposed in 1998 to perform context grouping identification, to perform clustering by using the method of calculating the maximum expectation to identify word senses, and then to perform generation of word sense vectors. The subsequent two-stage models are substantially similar in concept and are typically different and optimized in terms of word sense recognition algorithms or text modeling. In 2010, reisinger and Moone expressed the context as a unary grammar, and the MovFV clustering method was used to complete word sense recognition. The Sense2vec tool uses part-of-speech information to achieve word Sense separation, but has the disadvantage of not considering that the word senses of multiple word senses of a partially ambiguous word may be the same. The fusion model utilizes the commonality that both word sense recognition and word vector generation are used for calculating the text context, and combines the two processes to reduce the calculation consumption. Neelakantan expands on a Word2vec model to prepare a fixed number of vectors for each polysemous Word, and selects a proper vector to update in the training process, so that the defect that the number of the senses of different polysemous words is often different, and the limitation is high. Yang Liu et al have optimized the word vector and generated the defect that only utilizes local information, propose TWE model, have added the theme information modeling information in the course.

Related research and tools in China are few at present, and the LDA (Linear discrete Analysis) model is used for modeling the subject and carrying out semantic annotation on the ambiguous word. The Sullocen obtains an semantic vector by utilizing a Chinese knowledge base HowNet to further learn the word vector. The Liquajia uses K-Means clustering to construct a two-stage model in the word sense identification stage, and the method has the defects that the K-Means algorithm needs to give the number of central clusters in advance, namely the number of generated word senses needs to be determined in advance, and the expandability is not good enough.

The existing word vector tool technology mainly comprises two categories of word level embedding and word meaning level embedding.

Among them, the word level embedding has the disadvantages that: 1) Word vectors generated by word training with multiple word senses are more biased to have more semantics in the corpus, and less semantics in the corpus are weakened; 2) Calculating contents which are irrelevant to semantics and appear in a result with higher similarity with a certain polysemous word; 3) Original triangle inequality of a word vector space is destroyed, so that the quality of the word vector space is reduced; the word sense level embedding model can be divided into a two-stage type and a fusion type, wherein the two-stage type model has the defects that the similarity between the word sense identification process and the vector generation process is neglected, the two processes are completed in series, and the efficiency is low. The fusion model cannot use the clustering algorithm with better effect, such as K-Means and DBSCAN, and the effect is usually inferior to that of the two-stage model.

Disclosure of Invention

The invention aims to: aiming at the existing problems, the method realizes the calculation and the generation of the word meaning vector of the polysemous word by using a real-time learning method, and improves the quality of the generated vector on the premise of ensuring the calculation efficiency of the word meaning vector.

The invention discloses a fusion type word meaning embedding method based on real-time learning, which comprises the following steps:

step 1: setting a neural network language model;

the network structure of the neural network language model comprises an input layer and a projection layer;

the input layer is used for acquiring a corresponding vector V (k) of a current word k in a preset word vector matrix V;

the projection layer is used for judging the current word k, if the current word k is a univocal word, the projection layer performs an identity projection, and the projection layer outputs X (k) = V (k); if the word is a polysemous word, acquiring a corresponding word sense vector C (k, h) through a word sense recognition algorithm getCenter (k, h) based on real-time learning, and outputting X (k) = C (k, h) by a projection layer, wherein h represents an environment vector of the current word k;

step 2: performing neural network learning training on the neural network language model constructed in the step 1 based on a preset training sample set; when the preset training requirements are met (if the maximum training round number is reached, the output loss change rate meets the precision requirement and the like), stopping and storing the trained neural network language model;

and step 3: and inputting the word to be subjected to the word sense embedding processing into the trained neural network language model, and outputting the word sense vector of the word to be subjected to the word sense embedding processing based on the projection of the word.

The real-time learning-based word sense recognition algorithm getCenter (k, h) comprises the following specific processing procedures:

judging whether a cluster center corresponding to the polysemous word k exists in a set O representing a cluster center set, if no corresponding cluster center exists, generating a new cluster center for the polysemous word k, and adding an environment vector h into the new cluster center; during training, the initial value of the set O is an empty set, and the training is finished continuously to obtain the initial value.

If the clustering center corresponding to the polysemous word k exists, respectively calculating the distance between the environment vector h and each corresponding clustering center, marking the minimum value of the distance between the environment vector h and each corresponding clustering center as min (L), and if the min (L) is smaller than a minimum distance threshold value delta, generating a new clustering center corresponding to the polysemous word k and adding the environment vector h into the newly generated clustering center; otherwise, the environment vector h is merged into the clustering center corresponding to min (L) to obtain a new clustering center

Wherein O is _ki Representing the clustering center corresponding to min (L); />

Word sense vector C corresponding to clustering center based on environment vector h _ki Obtaining a word sense vector C (k, h);

further, in order to accelerate the training process of the neural network language model, during training, an output layer is added to the neural network language model, a Huffman tree structure is adopted, words in a preset dictionary D are used as leaf nodes of the Huffman tree, and non-leaf nodes in the Huffman tree represent parameters of the neural network and are used for outputting the probability of the occurrence of the word g to be predicted under the output X (k) of the projection layer.

In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that: the invention combines the word meaning recognition process and the word vector generation process based on real-time learning, further ensures the accuracy of the word meaning recognition process by means of the real-time learning type clustering algorithm under the condition of ensuring the calculation efficiency, and finally improves the quality of the word meaning vector.

Drawings

FIG. 1 is a schematic diagram of a neural network structure employed in the present invention in an exemplary embodiment;

FIG. 2 is a diagram illustrating a clustering process according to an embodiment;

FIG. 3 is a schematic diagram illustrating an implementation process of the fusion word sense embedding method based on real-time learning according to an embodiment of the present invention;

fig. 4 is a schematic diagram of a training process in the embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.

The invention discloses a fusion type word sense embedding method based on real-time learning, which combines a word sense identification process based on real-time learning and a word vector generation process to generate a plurality of word sense vectors corresponding to word senses of multi-sense words. The method solves the problem that the ambiguous words in the traditional word vector only correspond to a single vector; the recognition and generation processes are combined, so that the calculation efficiency is high; the real-time learning type clustering algorithm is utilized, and the accuracy of word meaning identification is ensured.

Word vectors are typically a byproduct of the generation of training language models. Word vectors obtained using different language models will also vary. In the specific embodiment, a neural network language model is adopted, namely a three-layer neural network is built, the occurrence probability of the environmental word g is predicted and calculated by the current word k, a word sense recognition algorithm based on real-time learning is added in the word vector generation process, the calculation and generation of the word sense vector of the polysemous word are realized, and the overall structure of the adopted neural network is shown in fig. 1.

The network structure comprises an input layer, a projection layer and an output layer, and the calculation performed by each layer is as follows:

(1) Input layer (Input Laer): in the model initialization stage, a word vector matrix V with the size of m | D | is prestored according to the set length m of the word vector, wherein V | D | ₁ ,v ₂ \8230denotesthe matrix V element, D represents the sample library except for the set of words that occur less frequently than the lowest frequency set by the user. In an input layer, acquiring a corresponding vector V (k) of a current word k in a word vector matrix V;

referring to fig. 1, an input sample = (k, H (k) g), where k denotes a current word, H (k) denotes a set of ambient words of the word k under a window W, and g denotes a word to be predicted.

(2) Projection Layer (Projection Layer): judging a current word k, if the current word k is a univocal word (monoseme), performing equal projection, and outputting X (k) = V (k) on a projection layer; if the word is a polysemy word, acquiring a word sense vector C (k, h) corresponding to the word sense vector by calling a word sense recognition algorithm getCenter (k, h) based on real-time learning, wherein h represents an environment vector (namely an environment word) of a current word k, and outputting X (k) = C (k, h);

(3) Output Layer (ouput Layer): for outputting the probability of the occurrence of the word g to be predicted at the projection layer output X (k). In the specific embodiment, the Hierachy Softmax algorithm is adopted to output the probability of the word g to be predicted appearing under the output X (k) of the projection layer, and the output layer adopts a Huffman tree structure. And taking the vocabulary in the vocabulary set D as leaf nodes of the Huffman tree, wherein the non-leaf nodes in the tree represent parameters of the neural network.

In each training iteration, the parameters to be updated in the network include a word vector V (k), a word sense vector C (k) and output layer non-leaf node parameters. Finally, the word vector obtained by training is composed of a word vector matrix V and a word sense vector set C, the word vector corresponding to the word with the meaning is in the word vector matrix V, and the word sense word vector of the polysemous word is in the word sense vector set C.

Wherein, the word sense recognition algorithm getCenter (k, h) based on real-time learning is as follows:

in the process of word sense identification, a definition set C represents a polysemous word sense vector set, a set O represents a cluster center set (), when a current word k is a polysemous word, each corresponding cluster center in O is marked as O _ki Each cluster center is correspondingly provided with a sense vector C in C _ki 。

The getCenter (k, h) function updates the corresponding clustering center through the ambiguous word k and the environment vector h thereof, and returns the clustering center O of the ambiguous word k in the environment vector (condition) h _ki Corresponding word sense vector C _ki And is denoted as word sense vector C (k, h).

Referring to FIG. 2, the specific process for getcenter (k, h) is:

1) Inputting: an ambiguous word k, an environment vector h, a minimum distance threshold δ;

2) Judging whether a cluster center corresponding to the polysemous word k exists in the set O or not, if not, generating a new distance center (cluster) O for the cluster O _k1 And h is added to the newly generated cluster O _k1 Middle, turn 4); cluster O _k1 The cluster center of is O _k1 ；

If a plurality of clustering centers corresponding to the ambiguous word k exist: o is _k1 ,O _k2 ,…,O _kn Then h and the corresponding cluster center O are calculated respectively _ki Distance L (O) _ki H) to obtain a set of distances L = { L (O) = _k1 ,h),L(O _k2 ,h),…,L(O _kn H), wherein the calculation distance formula is shown as formula (1).

Wherein i =1,2, \8230;, n. n represents the current number of cluster centers,

represents a cluster center O _ki Of elements, i.e. cluster core O _ki Is an n' -dimensional vector, O _ki The element of (b) is its corresponding vector element; h is _j The jth vector element representing the context vector.

3) Finding the minimum distance min (L) and the corresponding cluster center O _ki (ii) a If min (L)<δ, then generating a new cluster O for it _k(n+1) So as to obtain a new cluster center set O corresponding to the polysemous word k _k Comprises the following steps: o is _k ＝{O _k1 ,O _k2 ,…,O _kn ,O _k(n+1) H is added to cluster O _k(n+1) Performing the following steps; otherwise incorporate h into O _ki Updating a cluster center, wherein an updating formula of the cluster center is shown in a formula (2);

4) And (3) outputting: h cluster center O _ki Corresponding word sense vector C _ki 。

The Hierachy Softmax algorithm is an optimization algorithm for improving the training efficiency of the neural network. The Huffman tree constructed in the output layer provides a structural basis for the implementation of the algorithm. First, the related concepts in the Huffman tree are defined:

1) The path from the root node of the Huffman tree to the leaf node k is called p ^k ；

2) Path p ^k The number of nodes contained is l ^k ；

3) The nodes on the path are represented as: p is a radical of ^k ₁ ,p ^k ₂ ,…,p ^k _lk Wherein p is ^k ₁ Representing a root node, and lk represents the number of nodes;

4) Huffman coding of nodes on a path is represented as: d is a radical of ^k ₂ ,d ^k ₃ ,…,d ^k _lk E to {0,1}, and the root node does not correspond to the code;

5) The parameter vector for a node on a path is represented as: theta ^k ₁ ,θ ^k ₂ ,…,θ ^k _lk ；θ ^k _j A parameter vector representing the jth (j =1,2, \ 8230;, lk) node on the path.

The core idea of the Hierachy Softmax algorithm is as follows: for a word k in the vocabulary set D, there must be and only one path p from the root node to the node k in the Huffman tree ^k On the path is l ^k A node, l ^k -1 branch. And regarding each binary branch as a classification process, defining the node with the Huffman code of 0 as a positive class, and regarding the node with the Huffman code of 1 as a negative class. Because the nodes are provided with parameter vectors, in each two-classification process, the probability of the nodes is divided into positive classes and negative classes, namely the probability of the nodes is divided into the formula (3) and the formula (4):

wherein, σ (-) represents a distribution function, θ represents a vector formed by parameter vectors of the first j-1 nodes, and e represents a natural base number.

The overall probability is obtained by multiplying all the branch probabilities, as shown in equation (5):

substituting the formula (5) into a maximum likelihood function, calculating the gradient by adopting random gradient rise, and in the network, updating the corresponding vector in the matrix V and the related path in the Huffman tree

And (4) parameters. When the current word k is a polysemous word, the corresponding word sense vector C (k, h) also needs to be updated.

Referring to fig. 3 and 4, the implementation steps of the present invention include the following steps:

step 1: preprocessing text data and initializing a model.

In this embodiment, the text data is subjected to word segmentation processing and then used as training data T.

Then, firstly, an empty model object is established, and parameters such as a sample window size W, a lowest word frequency F, a generated word vector length and a training round are set.

Initializing a built-in dictionary of the model after the model is built: and generating a dictionary D according to the words appearing in the training data T and the frequency of the words, discarding the words with the word frequency lower than that of F in the dictionary D, and sequencing the D according to the word frequency. And then generating a built-in word vector matrix V according to the dictionary D, and generating a Huffman tree structure so as to accelerate the training process by utilizing a Hierachy Softmax algorithm, wherein leaf nodes of the Huffman tree are words in the dictionary D, and non-leaf nodes are used as network parameters. And finally, carrying out random initialization on the word vector matrix V and the network parameter values.

After the initialization is finished, the training process for T is started. Single sample T in T _i Can be represented as t _i = (k, H (k), g) wherein k is the current word, H: (c), (d) and (d)k) Representing k as the set of ambient words under the window W, and g representing the word to be predicted.

Step 2: forward propagation and real-time learning.

In the training process, for a single sample T in T _i = (k, H (k), g) calculated layer by layer as three layer neural network structure: firstly, in an input layer, obtaining a corresponding vector V (k) of a current word k in a word vector matrix V, then, in a projection layer, judging k, if k is a polysemous word, a corresponding environment vector h thereof can be represented as formula (6):

after the environment vector is calculated, a word sense recognition algorithm based on real-time learning is called to obtain a sense item vector C (k, H) = getCenter (k, H) corresponding to k under H (k), wherein C represents a word sense word vector set. Let projection layer output X (k) = C (k, h); if the current word is a univocal word, the projection layer outputs X (k) = V (k); and finally, calculating the probability of the word g under X (k) according to a Hierachy Softmax algorithm in an output layer.

And step 3: and updating the network parameters.

After one-time sample forward propagation is completed, the gradient is calculated by adopting random gradient rise and parameters in the network are updated, and in the network, corresponding vectors in the word vector matrix V which needs to be updated and related paths in the Huffman tree

And (3) parameters, when the current word k is a multi-meaning word, the corresponding word sense vector C (k, h) needs to be updated. Is in the model->

See formula (7) for updating V corresponding to the vector, see formula (8) for updating V corresponding to the vector, and see formula (9) for updating C (k), where τ represents the learning rate set by the model. Obtaining final word vector matrixes V and C through repeated iterative training and parameter updating of the sample set, wherein word vectors corresponding to the univocal words are stored in V, and word senses corresponding to the polysemous words are stored in CA word vector.

And 4, step 4: and generating a word sense vector.

In order to improve the quality of generating the sense vectors, after the training is finished, the clusters with less sample number in the clusters formed by clustering and the corresponding sense vectors are deleted, namely the related sense vectors C of the polysemous words k in the sense vector set C _ki Checking its corresponding cluster center (cluster center) O _ki If cluster center O is clustered _ki If the number of corresponding clusters is less than the threshold value m, deleting the cluster and the word sense vector C _ki Reducing the word vector generation error caused by clustering to obtain the final word sense vector set C, C _ki Representing the word sense vector corresponding to the ambiguous word k.

After the word sense vector model is established, the word sense representation of the polysemous words is more accurate due to the word vector model compared with the traditional word vector model, and the algorithm effect of related application based on word vectors can be further improved, for example, the method can be applied to the related fields of natural language processing such as word similarity calculation, word sense disambiguation, text classification and the like. The specific processing procedure for realizing the above three applications based on the word sense vector is as follows:

1) Calculating the similarity between words based on the word sense vector:

in the word sense vector model, each word is represented as one to more word sense vectors according to its number of senses. When the similarity between two words is calculated, the similarity can be converted into cosine similarity between vectors for measurement. Assuming that the vectors are a and b, the cosine similarity calculation formula therebetween is shown in formula (10):

where n represents the dimensions of vectors a and b, a _i 、b _i Representing the vector elements of a and b, respectively.

Assuming that the words to be compared are c and d, because the polysemous words have a plurality of word sense vectors, when the similarity between words is calculated, the cosine similarity between each word sense vector in the word c and each word sense vector in the word d needs to be calculated, and finally the maximum cosine similarity among the output results is taken as the similarity between the word c and the word d.

2) Word sense disambiguation based on word sense vectors:

when disambiguating the ambiguous word, the word sense vector model can be used to determine, first, the context C = { C } of the ambiguous word to be disambiguated is obtained ₁ ,c ₂ ,…,c _n In which c is _i Representing words in a context window; subsequently, c is read _i Corresponding word sense vector e in the word sense vector model _i And summing and averaging the two vectors to obtain a context environment vector h, namely a formula (11):

/>

and finally, calculating the cosine similarity between the context environment vector h and the word sense vector of the polysemous word to be disambiguated, and outputting the word sense corresponding to the word sense vector with the maximum cosine similarity to finish the disambiguation.

3) Text classification based on word sense vectors:

because the ambiguous word has only unique semantics in a specific context, if the corresponding semantics of the ambiguous word in the context are defined, the accuracy of text classification can be effectively improved.

Firstly, carrying out (11) disambiguation based on word sense vectors on the polysemous words in the text to be classified to obtain a word sense vector corresponding to each polysemous word; then, reading a corresponding vector of the word sense of the word in the text in the word sense vector model and the word sense vector of the polysemous word obtained in the previous step, and accumulating to obtain a vector w, wherein the vector w is a text vector; and finally, establishing a proper classifier, and training the classifier by using the text vector set to obtain a trained classification model.

While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims

1. The fusion type word sense embedding method based on real-time learning is characterized by comprising the following steps of:

step 1: setting a neural network language model;

the projection layer is used for judging the current word k, if the current word k is a univocal word, the projection layer performs identity projection, and the projection layer outputs X (k) = V (k); if the word meaning is a polysemous word, acquiring a corresponding word meaning vector C (k, h) through a getCenter (k, h) based on a word meaning recognition algorithm learned in real time, and outputting X (k) = C (k, h) by a projection layer, wherein h represents an environment vector of the current word k;

step 2: performing neural network learning training on the neural network language model constructed in the step 1 based on a preset training sample set; when the preset training requirement is met, stopping and storing the trained neural network language model;

and 3, step 3: inputting the word to be subjected to word sense embedding processing into a trained neural network language model, and outputting the word sense vector of the word to be subjected to word sense embedding processing based on the projection of the word;

judging whether a cluster center corresponding to the polysemous word k exists in a set O representing a cluster center set, if no corresponding cluster center exists, generating a new cluster center for the polysemous word k, and adding an environment vector h into the new cluster center;

if the clustering centers corresponding to the polysemous words k exist, the distances between the environment vectors h and the corresponding clustering centers are respectively calculated, the minimum value of the distances between the environment vectors h and the corresponding clustering centers is searched and is recorded as min (L), if the min (L) is smaller than a minimum distance threshold value delta, a new clustering center corresponding to the polysemous words k is generated, and the environment vectors h are added into the newly generated clustering centers; otherwise, the environment vector h is merged into the clustering center corresponding to min (L) to obtain a new clustering center

Wherein O is _ki Representing the clustering center corresponding to min (L);

word sense vector C corresponding to clustering center based on environment vector h _ki The sense vector C (k, h) is obtained.

2. The method as claimed in claim 1, wherein in the step 2, during training, an output layer is added to the neural network language model, a Huffman tree structure is adopted, words in a preset dictionary D are used as leaf nodes of the Huffman tree, and non-leaf nodes in the Huffman tree represent parameters of the neural network and are used for outputting the probability of the occurrence of the word g to be predicted under the output X (k) of the projection layer.