CN110705274A - Fusion type word meaning embedding method based on real-time learning - Google Patents

Fusion type word meaning embedding method based on real-time learning

Info

Publication number
CN110705274A
CN110705274A (application CN201910839702.1A)
Authority
CN
China
Prior art keywords
word
vector
sense
neural network
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910839702.1A
Other languages
Chinese (zh)
Other versions
CN110705274B (en)
Inventor
桂盛霖
方丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910839702.1A priority Critical patent/CN110705274B/en
Publication of CN110705274A publication Critical patent/CN110705274A/en
Application granted granted Critical
Publication of CN110705274B publication Critical patent/CN110705274B/en
Expired - Fee Related

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a fusion type word sense embedding method based on real-time learning, belonging to the technical field of automatic word vector generation. The method obtains the word sense vector of the word currently to be processed from the projection output of a neural network language model set up for the purpose. In the network structure of the model, the input layer acquires the vector corresponding to the current word k in a preset word vector matrix V; the projection layer judges the current word k: if it is a monosemous word, an identity projection is performed and the projection layer outputs the vector corresponding to k in the preset word vector matrix V; if it is a polysemous word, the corresponding word sense vector is obtained through a word sense recognition algorithm based on real-time learning, and the projection layer outputs the obtained word sense vector. The invention realizes the computation and generation of the word sense vectors of polysemous words with a real-time learning method, and improves the quality of the generated vectors on the premise of ensuring the computational efficiency of word sense vector calculation.

Description

Fusion type word meaning embedding method based on real-time learning
Technical Field
The invention belongs to the technical field of automatic generation of word vectors, and particularly relates to a fusion type word meaning embedding method based on real-time learning.
Background
In Natural Language Processing (NLP) tasks, machines cannot directly understand and analyze human language, so natural language usually has to be modeled before being provided as input to a computer. A word vector (Word Representation) is the product of converting the words of human language into such an abstract representation. Two types of word vector are in common use:
One-Hot Representation: generating this type of word vector first requires counting all words in the corpus to build a vocabulary N and assign each word a unique number. For a word, the generated word vector has length |N|, with a 1 at the position given by the word's number and 0 at all other positions. The problems with this representation are that it occupies a great deal of space, making subsequent computation expensive, and that the word vectors cannot characterize the relationships between words.
Distributed Representation: this way of generating word vectors overcomes the disadvantages of One-Hot Representation by representing words as dense vectors. The vectors are usually a by-product of training some language model: through training on a corpus, its words are mapped into a word vector space in which the relationships between vectors express word semantics and lexical relationships. The semantic similarity of words can therefore be represented by the closeness of their word vector values.
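To make the contrast concrete, the following sketch (illustrative only; the toy vocabulary and the dense values are invented for the example, and in practice dense vectors are a by-product of language-model training) shows that one-hot vectors encode no similarity while dense vectors can:

import numpy as np

# Toy vocabulary N; a One-Hot vector has length |N|.
vocab = ["king", "queen", "apple"]

def one_hot(word):
    # 1 at the word's index, 0 elsewhere: no relationship between words is encoded.
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Distributed Representation: dense, low-dimensional vectors (values invented here).
dense = {"king":  np.array([0.8, 0.3]),
         "queen": np.array([0.7, 0.4]),
         "apple": np.array([-0.6, 0.9])}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(one_hot("king"), one_hot("queen")))  # 0.0: distinct one-hot vectors are always orthogonal
print(cos(dense["king"], dense["queen"]))      # ~0.99: semantically close words lie close together
print(cos(dense["king"], dense["apple"]))      # negative: unrelated words lie apart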
Current approaches to generating word vectors can be divided as follows, according to the granularity of the language unit a vector corresponds to:
(1) word embedding: words in natural language are represented as vector data that a computer can process.
(2) Word sense embedding: specific semantics possessed by words in a natural language are expressed as vector data that can be processed by a computer.
Word sense embedding addresses one of the main drawbacks of word-embedding models, namely that the senses of a polysemous word cannot be accurately expressed, and has gradually given rise to word vector generation models that are more sensitive to word sense. A word sense embedding model can generate multiple word vectors corresponding to the multiple senses that a polysemous word exhibits in the corpus, so such models can describe words more accurately at the semantic level. Currently there are two main types of word sense embedding model: two-stage and fused. In a two-stage model, word sense recognition and word vector generation are separated and carried out serially; a fused model completes word sense recognition during the generation of word vectors.
Schütze first proposed context-group discrimination in 1998, identifying word senses by clustering with the expectation-maximization method and then generating the word sense vectors. Subsequent two-stage models are broadly similar in concept and typically differ in, and optimize, the word sense recognition algorithm or the text modeling. In 2010, Reisinger and Mooney represented the context as unigrams and used movMF clustering to complete word sense recognition. The Sense2vec tool uses part-of-speech information to achieve word sense separation, but has the disadvantage of not considering that several senses of a polysemous word may share the same part of speech. Fused models exploit the commonality that word sense recognition and word vector generation both compute over the textual context, and combine the two processes to reduce computation. Neelakantan extended the Word2vec model by preparing a fixed number of vectors for each polysemous word and selecting an appropriate vector to update during training; since the number of senses usually differs from one polysemous word to another, this approach is rather restrictive. Yang Liu et al. then addressed the defect that word vector generation utilized only local information, proposing the TWE model, which adds topic information to the modeling process.
Related research and tools in China are currently few. One study used the LDA (Latent Dirichlet Allocation) model to model topics and semantically annotate polysemous words. Another obtained sense vectors by using the Chinese knowledge base HowNet to further learn word vectors. A further study used K-Means clustering to build the word sense recognition stage of a two-stage model; its drawbacks are that the K-Means algorithm requires the number of cluster centers to be given in advance, i.e. the number of senses to be generated must be determined beforehand, so the approach does not scale well.
Existing word vector tools mainly fall into two types: word-level embedding and word-sense-level embedding.
Word-level embedding has the following disadvantages: 1) the word vector trained for a word with multiple senses is biased towards the senses that occur more often in the corpus, while the rarer senses are weakened; 2) content that is semantically unrelated to a polysemous word appears among the results computed as most similar to it; 3) the triangle inequality of the original word vector space is violated, reducing the quality of the space. Word-sense-level embedding models divide into the two-stage type and the fused type. The two-stage type neglects the similarity between the word sense recognition process and the vector generation process and completes the two serially, which is inefficient; the fused type cannot use the more effective clustering algorithms such as K-Means and DBSCAN, so its results are usually inferior to those of two-stage models.
Disclosure of Invention
The purpose of the invention is, in view of the existing problems, to realize the computation and generation of the word sense vectors of polysemous words using a real-time learning method, and to improve the quality of the generated vectors on the premise of ensuring the computational efficiency of word sense vector calculation.
The invention discloses a fusion type word meaning embedding method based on real-time learning, which comprises the following steps:
step 1: setting a neural network language model;
the network structure of the neural network language model comprises an input layer and a projection layer;
the input layer is used for acquiring a corresponding vector V (k) of a current word k in a preset word vector matrix V;
the projection layer is used for judging the current word k, if the current word k is a univocal word, performing identity projection, and outputting X (k) to V (k); if the word is a polysemous word, acquiring a corresponding word sense vector C (k, h) through a word sense recognition algorithm getCenter (k, h) based on real-time learning, and outputting X (k) ═ C (k, h) by a projection layer, wherein h represents an environment vector of the current word k;
step 2: performing neural network training on the neural network language model constructed in step 1 based on a preset training sample set; when the preset training requirement is met (e.g. the maximum number of training epochs is reached, or the rate of change of the output loss meets the precision requirement), stopping and saving the trained neural network language model;
step 3: inputting the word to be subjected to word sense embedding into the trained neural network language model, and outputting the word sense vector of that word based on its projection.
The real-time learning-based word sense recognition algorithm getCenter (k, h) comprises the following specific processing procedures:
judging whether a cluster center corresponding to the polysemous word k exists in the set O representing the set of cluster centers; if no corresponding cluster center exists, generating a new cluster center for the polysemous word k and adding the environment vector h to the new cluster center; during training, the set O is initialized as an empty set and is populated continuously as training proceeds.
If cluster centers corresponding to the polysemous word k exist, respectively calculating the distance between the environment vector h and each corresponding cluster center and recording the minimum of these distances as min(L). If min(L) is greater than the minimum distance threshold δ, generating a new cluster center for the polysemous word k and adding the environment vector h to the newly generated cluster center; otherwise, merging the environment vector h into the cluster center corresponding to min(L) to obtain an updated cluster center, where O_ki denotes the cluster center corresponding to min(L);
word sense vector C corresponding to clustering center based on environment vector hkiObtaining a word sense vector C (k, h);
Further, in order to accelerate the training process of the neural network language model, an output layer is added to the model during training. It adopts a Huffman tree structure: the words of a preset dictionary D serve as the leaf nodes of the Huffman tree, while the non-leaf nodes of the tree represent parameters of the neural network and are used for outputting the probability that the word g to be predicted appears given the projection layer output X(k).
In summary, owing to the adoption of the above technical scheme, the invention has the following beneficial effects: it combines the word sense recognition process and the word vector generation process on the basis of real-time learning, uses the real-time-learning clustering algorithm to further ensure the accuracy of word sense recognition while preserving computational efficiency, and thereby improves the quality of the word sense vectors.
Drawings
FIG. 1 is a schematic diagram of a neural network structure employed in the present invention according to an embodiment;
FIG. 2 is a diagram illustrating a clustering process according to an embodiment;
FIG. 3 is a schematic diagram illustrating an implementation process of the fusion word sense embedding method based on real-time learning according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the training process in the embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
The invention discloses a fusion type word sense embedding method based on real-time learning, which combines a word sense identification process based on real-time learning and a word vector generation process to generate a plurality of word sense vectors corresponding to word senses of multi-sense words. The method solves the problem that the ambiguous words in the traditional word vector only correspond to a single vector; the recognition and generation processes are combined, so that the calculation efficiency is high; the real-time learning type clustering algorithm is utilized, and the accuracy of word meaning identification is ensured.
Word vectors are typically a by-product of training a language model, and the word vectors obtained with different language models vary. In this embodiment a neural network language model is adopted: a three-layer neural network is built that predicts the occurrence probability of the environment word g from the current word k, and a word sense recognition algorithm based on real-time learning is added into the word vector generation process to realize the computation and generation of the word sense vectors of polysemous words. The overall structure of the neural network is shown in fig. 1.
The network structure comprises an input layer, a projection layer and an output layer, and the calculation performed by each layer is as follows:
(1) Input Layer: in the model initialization stage, a word vector matrix V of size m × |D| is pre-stored according to the set word vector length m, where v_1, v_2, … denote the elements of the matrix V and D denotes the dictionary built from the sample corpus, excluding words that occur less frequently than the minimum frequency set by the user. The input layer acquires the vector V(k) corresponding to the current word k in the word vector matrix V;
referring to fig. 1, an input sample is (k, h (k)) g, where k denotes a current word, h (k) represents a set of environmental words of the word k under a window W, and g represents a word to be predicted.
(2) Projection Layer: the current word k is judged. If k is a monosemous word, an identity projection is performed and the projection layer outputs X(k) = V(k); if k is a polysemous word, the word sense vector C(k, h) corresponding to k is acquired by calling the real-time-learning-based word sense recognition algorithm getCenter(k, h), where h denotes the environment vector of the current word k computed from its environment words, and the projection layer outputs X(k) = C(k, h);
(3) Output Layer: used for outputting the probability that the word g to be predicted appears given the projection layer output X(k). In this embodiment the Hierarchical Softmax algorithm is adopted to output this probability, and the output layer adopts a Huffman tree structure: the words of the vocabulary set D serve as the leaf nodes of the Huffman tree, while the non-leaf nodes of the tree represent parameters of the neural network.
In each training iteration, the parameters to be updated in the network include the word vector V(k), the word sense vector C(k, h) and the non-leaf-node parameters of the output layer. The word representation finally obtained by training consists of the word vector matrix V and the word sense vector set C: the vectors of monosemous words are stored in the word vector matrix V, and the sense vectors of polysemous words are stored in the word sense vector set C.
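The layer-by-layer computation just described can be summarized in the following sketch (an illustration, not the patented implementation: get_center and hs_probability stand in for the getCenter(k, h) algorithm and the Hierarchical Softmax output described below, and all names are assumptions):

import numpy as np

def forward(k, env_words, V, polysemous, get_center, hs_probability, g):
    """One forward pass for a sample (k, H(k), g).

    V              -- word vector lookup (the word vector matrix)
    polysemous     -- the set of words judged polysemous
    get_center     -- real-time-learning word sense recognition, getCenter(k, h)
    hs_probability -- Hierarchical Softmax output: p(g | X(k))
    """
    v_k = V[k]                                    # input layer: fetch V(k)
    if k not in polysemous:
        x_k = v_k                                 # identity projection: X(k) = V(k)
    else:
        # environment vector h: average of the environment words' vectors
        h = np.mean([V[c] for c in env_words], axis=0)
        x_k = get_center(k, h)                    # sense vector: X(k) = C(k, h)
    return hs_probability(g, x_k)                 # output layer: p(g | X(k))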
Wherein, the word sense recognition algorithm getCenter (k, h) based on real-time learning is as follows:
in the process of word sense identification, a definition set C represents a polysemous word sense vector set, a set O represents a cluster center set (), when a current word k is a polysemous word, each corresponding cluster center in O is marked as OkiEach cluster center is correspondingly provided with a sense vector C in Cki
The getCenter(k, h) function updates the corresponding cluster center using the polysemous word k and its environment vector h, and returns the sense vector C_ki corresponding to the cluster center O_ki into which the polysemous word k falls under the environment vector h; this is recorded as the word sense vector C(k, h).
Referring to fig. 2, the specific process of getCenter (k, h) is as follows:
1) Input: a polysemous word k, an environment vector h, and a minimum distance threshold δ;
2) Judge whether a cluster center corresponding to the polysemous word k exists in the set O. If not, generate a new cluster O_k1 for it, add h to the newly generated cluster O_k1, and go to 4); the cluster center of cluster O_k1 is likewise denoted O_k1.
If cluster centers O_k1, O_k2, …, O_kn corresponding to the polysemous word k exist, compute the distance L(O_ki, h) between h and each cluster center O_ki, obtaining the distance set L = {L(O_k1, h), L(O_k2, h), …, L(O_kn, h)}, where the distance is computed by formula (1):
$$L(O_{ki}, h) = \sqrt{\sum_{j=1}^{n'} \left(O_{ki}^{j} - h_{j}\right)^{2}} \qquad (1)$$

where i = 1, 2, …, n, and n denotes the current number of cluster centers; the cluster center O_ki is an n′-dimensional vector whose j-th element is O_ki^j; h_j denotes the j-th vector element of the environment vector h.
3) Find the minimum distance min(L) and the corresponding cluster center O_ki. If min(L) > δ, generate a new cluster O_k(n+1), so that the new set of cluster centers corresponding to the polysemous word k is O_k = {O_k1, O_k2, …, O_kn, O_k(n+1)}, and add h to the cluster O_k(n+1); otherwise merge h into O_ki and update the cluster center, the update formula of the cluster center being formula (2):
$$O_{ki} = \frac{N_{ki} \cdot O_{ki} + h}{N_{ki} + 1} \qquad (2)$$

where N_ki denotes the number of environment vectors already merged into the cluster O_ki.
4) Output: the sense vector C_ki corresponding to the cluster center O_ki of h.
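A minimal runnable sketch of getCenter(k, h), under the assumptions made above (Euclidean distance for formula (1), incremental-mean center update for formula (2), random initialization of new sense vectors; the class and field names are illustrative):

import numpy as np

class SenseClusters:
    def __init__(self, delta, dim):
        self.delta = delta   # minimum distance threshold δ
        self.centers = {}    # O: word -> list of cluster centers O_ki
        self.counts = {}     # N_ki: environment vectors merged into each center
        self.senses = {}     # C: word -> list of sense vectors C_ki
        self.dim = dim       # word vector length m

    def _new_cluster(self, k, h):
        # create a new cluster whose center is h, with a fresh sense vector
        # (random initialization of the sense vector is an assumption)
        self.centers.setdefault(k, []).append(h.copy())
        self.counts.setdefault(k, []).append(1)
        self.senses.setdefault(k, []).append(np.random.uniform(-0.5, 0.5, self.dim))
        return self.senses[k][-1]

    def get_center(self, k, h):
        if k not in self.centers:            # step 2): no center for k yet
            return self._new_cluster(k, h)
        # distances L(O_ki, h) to every existing center, formula (1)
        dists = [np.linalg.norm(c - h) for c in self.centers[k]]
        i = int(np.argmin(dists))            # min(L) and its center O_ki
        if dists[i] > self.delta:            # farther than δ from all centers
            return self._new_cluster(k, h)   # step 3): open cluster O_k(n+1)
        n = self.counts[k][i]                # otherwise merge h, formula (2)
        self.centers[k][i] = (n * self.centers[k][i] + h) / (n + 1)
        self.counts[k][i] = n + 1
        return self.senses[k][i]             # step 4): output C_ki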
The Hierarchical Softmax algorithm is an optimization algorithm for improving the training efficiency of the neural network; the Huffman tree constructed in the output layer provides the structural basis for its implementation. First, the relevant concepts of the Huffman tree are defined:
1) the path from the root node of the Huffman tree to the leaf node k is denoted p_k;
2) the number of nodes contained in the path p_k is l_k;
3) the nodes on the path are denoted p_k^1, p_k^2, …, p_k^{l_k}, where p_k^1 is the root node;
4) the Huffman codes of the nodes on the path are denoted d_k^2, d_k^3, …, d_k^{l_k} ∈ {0, 1}; the root node carries no code;
5) the parameter vectors of the nodes on the path are denoted θ_k^1, θ_k^2, …, θ_k^{l_k}, where θ_k^j is the parameter vector of the j-th node on the path (j = 1, 2, …, l_k).
The core idea of the Hierarchical Softmax algorithm is as follows: for a word k in the vocabulary set D, there is one and only one path p_k in the Huffman tree from the root node to the node k, and this path contains l_k nodes and l_k − 1 branches. Each binary branch is regarded as a binary classification: nodes with Huffman code 0 are defined as the positive class and nodes with Huffman code 1 as the negative class. Since the nodes carry parameter vectors, in each binary classification the probabilities of the positive class and the negative class are given by formula (3) and formula (4):
$$p(d_k^{j} = 0 \mid X, \theta_k^{j-1}) = \sigma\left(X^{\mathrm{T}} \theta_k^{j-1}\right) = \frac{1}{1 + e^{-X^{\mathrm{T}} \theta_k^{j-1}}} \qquad (3)$$

$$p(d_k^{j} = 1 \mid X, \theta_k^{j-1}) = 1 - \sigma\left(X^{\mathrm{T}} \theta_k^{j-1}\right) \qquad (4)$$
wherein, σ (-) represents a distribution function, θ represents a vector formed by parameter vectors of the first j-1 nodes, and e represents a natural base number.
The overall probability is obtained by multiplying all the branch probabilities, as shown in formula (5):
$$p(g \mid X(k)) = \prod_{j=2}^{l_g} p\left(d_g^{j} \mid X(k), \theta_g^{j-1}\right) \qquad (5)$$
Substituting formula (5) into the maximum likelihood function and computing the gradient by stochastic gradient ascent, the network updates the corresponding vector in the matrix V and the parameters θ_g^{j−1} on the related path of the Huffman tree. When the current word k is a polysemous word, the corresponding word sense vector C(k, h) also needs to be updated.
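Formulas (3)-(5) amount to the following computation along the Huffman path (a sketch under the assumption that each non-leaf node on the path stores one parameter vector θ and that the path for g has been precomputed; the function and argument names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hs_probability(x, path_thetas, codes):
    """p(g | X(k)) as in formula (5).

    path_thetas -- parameter vectors θ_g^{j-1} of the non-leaf nodes on the path to g
    codes       -- Huffman codes d_g^j ∈ {0, 1} of the nodes after the root
    """
    p = 1.0
    for theta, d in zip(path_thetas, codes):
        s = sigmoid(x @ theta)
        # code 0 is the positive class (formula (3)); code 1 the negative class (formula (4))
        p *= s if d == 0 else 1.0 - s
    return p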
Referring to fig. 3 and 4, the implementation steps of the present invention include the following steps:
step 1: preprocessing text data and initializing a model.
In this embodiment, the text data is subjected to word segmentation processing and then used as training data T.
Then an empty model object is created, and the sample window size W, the minimum word frequency F, the length of the generated word vectors and the number of training epochs are set.
After the model is created, its built-in dictionary is initialized: a dictionary D is generated from the words appearing in the training data T and their frequencies, words whose frequency is lower than F are discarded from D, and D is sorted by word frequency. A built-in word vector matrix V is then generated according to the dictionary D, and a Huffman tree structure is generated so that training can be accelerated with the Hierarchical Softmax algorithm; the leaf nodes of the Huffman tree are the words of the dictionary D, and the non-leaf nodes serve as network parameters. Finally, the word vector matrix V and the network parameter values are randomly initialized.
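Step 1 might be realized as in the following sketch (an assumption-laden illustration: the Counter-based dictionary and the heapq-based Huffman construction are one straightforward way to implement the initialization described above):

import heapq
from collections import Counter

def build_dictionary(tokenized_corpus, min_freq):
    # count word frequencies over T, drop words rarer than F, sort by frequency
    freq = Counter(w for sentence in tokenized_corpus for w in sentence)
    kept = [(w, f) for w, f in freq.items() if f >= min_freq]
    return sorted(kept, key=lambda wf: -wf[1])        # dictionary D

def build_huffman(dictionary):
    # leaves are the words of D; each merge creates a non-leaf node, which in the
    # model carries a neural-network parameter vector θ (not materialized here)
    heap = [(f, i, (w,)) for i, (w, f) in enumerate(dictionary)]
    heapq.heapify(heap)
    uid = len(heap)                                   # unique tiebreaker for heapq
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)             # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, uid, (left, right)))
        uid += 1
    return heap[0][2] if heap else ()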
After initialization, training on T begins. A single sample t_i in T can be represented as t_i = (k, H(k), g), where k is the current word, H(k) denotes the set of environment words of k under the window W, and g denotes the word to be predicted.
Step 2: forward propagation and real-time learning.
In the training process, each single sample t_i = (k, H(k), g) in T is computed layer by layer in the three-layer neural network structure. First, the vector V(k) corresponding to the current word k is obtained from the word vector matrix V at the input layer; then k is judged at the projection layer, and if k is a polysemous word, the corresponding environment vector h is computed as in formula (6):
$$h = \frac{1}{|H(k)|} \sum_{c \in H(k)} V(c) \qquad (6)$$
After the environment vector is computed, the real-time-learning-based word sense recognition algorithm is called to obtain the sense vector of k under H(k), C(k, h) = getCenter(k, h), where C denotes the word sense vector set, and the projection layer output is set to X(k) = C(k, h); if the current word is a monosemous word, the projection layer outputs X(k) = V(k). Finally, the probability of the word g given X(k) is computed at the output layer according to the Hierarchical Softmax algorithm.
And step 3: and updating the network parameters.
After the forward propagation of one sample is completed, the gradient is computed by stochastic gradient ascent and the parameters in the network are updated: the corresponding vector in the word vector matrix V and the parameters θ_g^{j−1} on the related path of the Huffman tree; when the current word k is a polysemous word, the corresponding word sense vector C(k, h) also needs to be updated.
In the model, the update of the path parameters θ_g^{j−1} is given by formula (7), the update of the corresponding vector in V by formula (8), and the update of C(k, h) by formula (9), where τ denotes the learning rate set for the model:

$$\theta_g^{j-1} \leftarrow \theta_g^{j-1} + \tau \left(1 - d_g^{j} - \sigma\left(X(k)^{\mathrm{T}} \theta_g^{j-1}\right)\right) X(k) \qquad (7)$$

$$V(k) \leftarrow V(k) + \tau \sum_{j=2}^{l_g} \left(1 - d_g^{j} - \sigma\left(X(k)^{\mathrm{T}} \theta_g^{j-1}\right)\right) \theta_g^{j-1} \qquad (8)$$

$$C(k, h) \leftarrow C(k, h) + \tau \sum_{j=2}^{l_g} \left(1 - d_g^{j} - \sigma\left(X(k)^{\mathrm{T}} \theta_g^{j-1}\right)\right) \theta_g^{j-1} \qquad (9)$$

Through repeated iterative training and parameter updating over the sample set, the final word vector matrix V and word sense vector set C are obtained; V stores the word vectors of monosemous words, and C stores the word sense vectors of polysemous words.
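Under the gradient form of formulas (7)-(9), one update step might be sketched as follows (illustrative names; path_thetas is updated in place and the returned accumulator is applied by the caller):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, path_thetas, codes, tau):
    # x is the projection output X(k); codes are the Huffman codes d_g^j
    e = np.zeros_like(x)
    for j, (theta, d) in enumerate(zip(path_thetas, codes)):
        grad = 1.0 - d - sigmoid(x @ theta)      # common factor in formulas (7)-(9)
        e += grad * theta                        # accumulate the update for X(k)
        path_thetas[j] = theta + tau * grad * x  # formula (7): update θ_g^{j-1}
    return e   # caller applies V[k] += tau * e (formula (8)),
               # and C(k, h) += tau * e if k is polysemous (formula (9))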
step 4: Generating the word sense vectors.
In order to improve the quality of the generated sense vectors, after training is finished the clusters containing few samples, together with their corresponding sense vectors, are deleted: for each sense vector C_ki of a polysemous word k in the sense vector set C, its corresponding cluster center O_ki is checked, and if the number of samples in the cluster of O_ki is less than the threshold m, the cluster and the sense vector C_ki are deleted. This reduces the word-vector generation error introduced by clustering and yields the final word sense vector set C, in which C_ki denotes a word sense vector corresponding to the polysemous word k.
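Continuing the SenseClusters sketch assumed earlier, this pruning step could look like:

def prune(clusters, min_samples):
    # delete sense vectors whose cluster absorbed fewer than the threshold
    # number of environment vectors, keeping only well-supported senses in C
    for k in list(clusters.centers):
        keep = [i for i, n in enumerate(clusters.counts[k]) if n >= min_samples]
        clusters.centers[k] = [clusters.centers[k][i] for i in keep]
        clusters.counts[k] = [clusters.counts[k][i] for i in keep]
        clusters.senses[k] = [clusters.senses[k][i] for i in keep]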
After the word sense vector model is established, it represents the senses of polysemous words more accurately than a traditional word vector model, which can further improve the performance of algorithms in word-vector-based applications; for example, it can be applied in natural language processing fields such as word similarity calculation, word sense disambiguation and text classification. The specific processing for realizing these three applications based on the word sense vectors is as follows:
1) calculating the similarity between words based on the word sense vector:
In the word sense vector model, each word is represented as one or more word sense vectors according to its number of senses. The similarity between two words can thus be converted into, and measured by, the cosine similarity between vectors. For two vectors a and b, the cosine similarity between them is computed by formula (10):
$$\cos(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}} \sqrt{\sum_{i=1}^{n} b_i^{2}}} \qquad (10)$$
where n denotes the dimension of the vectors a and b, and a_i and b_i denote the vector elements of a and b respectively.
Assuming the words to be compared are c and d: because a polysemous word has several word sense vectors, the cosine similarity between each word sense vector of word c and each word sense vector of word d is computed, and the maximum cosine similarity among the results is finally output as the similarity between the words c and d.
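This reduces to a maximum over pairwise cosine similarities, as in the following sketch (each word is assumed to be given as the list of its sense vectors, a monosemous word contributing a single vector):

import numpy as np

def cosine(a, b):
    # formula (10)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def word_similarity(senses_c, senses_d):
    # similarity between words c and d: the maximum cosine similarity
    # over all pairs of their sense vectors
    return max(cosine(a, b) for a in senses_c for b in senses_d)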
2) Word sense disambiguation based on word sense vectors:
When disambiguating a polysemous word, the word sense vector model may be used to decide its sense. First, the context C = {c_1, c_2, …, c_n} of the polysemous word to be disambiguated is obtained, where c_i denotes a word in the context window; then the vector e_i corresponding to each c_i is read from the word sense vector model, and these vectors are summed and averaged to obtain the context environment vector h, as in formula (11):
$$h = \frac{1}{n} \sum_{i=1}^{n} e_i \qquad (11)$$
Finally, the cosine similarity between the context environment vector h and each word sense vector of the polysemous word to be disambiguated is computed, and the sense corresponding to the word sense vector with the maximum cosine similarity is output, completing the disambiguation.
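A sketch of this procedure (the lookup conventions are assumptions: context_vectors are the vectors e_i of the context-window words, sense_vectors the candidate sense vectors of the polysemous word):

import numpy as np

def disambiguate(context_vectors, sense_vectors):
    h = np.mean(context_vectors, axis=0)   # context environment vector, formula (11)
    sims = [float(h @ s / (np.linalg.norm(h) * np.linalg.norm(s)))
            for s in sense_vectors]        # cosine similarity to each sense
    return int(np.argmax(sims))            # index of the winning sense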
3) Text classification based on word sense vectors:
Because a polysemous word has only one sense in a specific context, determining the sense the polysemous word carries in its context can effectively improve the accuracy of text classification.
Firstly, carrying out (11) disambiguation based on word sense vectors on the polysemous words in the text to be classified to obtain a word sense vector corresponding to each polysemous word; then, reading a corresponding vector of the word sense of the word in the text in the word sense vector model and the word sense vector of the polysemous word obtained in the previous step, and accumulating to obtain a vector w, wherein the vector w is a text vector; and finally, establishing a proper classifier, and training the classifier by using the text vector set to obtain a trained classification model.
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims (3)

1. The fusion type word sense embedding method based on real-time learning is characterized by comprising the following steps of:
step 1: setting a neural network language model;
the network structure of the neural network language model comprises an input layer and a projection layer;
the input layer is used for acquiring a corresponding vector V (k) of a current word k in a preset word vector matrix V;
the projection layer is used for judging the current word k: if the current word k is a monosemous word, an identity projection is performed and the projection layer outputs X(k) = V(k); if it is a polysemous word, the corresponding word sense vector C(k, h) is acquired through the real-time-learning-based word sense recognition algorithm getCenter(k, h) and the projection layer outputs X(k) = C(k, h), where h denotes the environment vector of the current word k;
step 2: performing neural network learning training on the neural network language model constructed in the step 1 based on a preset training sample set; when the preset training requirement is met, stopping and storing the trained neural network language model;
step 3: inputting the word to be subjected to the word sense embedding processing into the trained neural network language model, and outputting the word sense vector of that word based on its projection.
2. The method of claim 1, wherein the specific processing of the real-time-learning-based word sense recognition algorithm getCenter(k, h) is:
judging whether a cluster center corresponding to the polysemous word k exists in a set O representing a cluster center set, if no corresponding cluster center exists, generating a new cluster center for the polysemous word k, and adding an environment vector h into the new cluster center;
if cluster centers corresponding to the polysemous word k exist, respectively calculating the distance between the environment vector h and each corresponding cluster center and recording the minimum of these distances as min(L); if min(L) is greater than the minimum distance threshold δ, generating a new cluster center for the polysemous word k and adding the environment vector h to the newly generated cluster center; otherwise, merging the environment vector h into the cluster center corresponding to min(L) to obtain an updated cluster center, where O_ki denotes the cluster center corresponding to min(L);
the word sense vector C_ki corresponding to the cluster center of the environment vector h is output as the word sense vector C(k, h).
3. The method as claimed in claim 1, wherein in step 2, during training, an output layer is added to the neural network language model; it adopts a Huffman tree structure, with the words of a preset dictionary D as the leaf nodes of the Huffman tree, while the non-leaf nodes of the Huffman tree represent parameters of the neural network and are used for outputting the probability that the word g to be predicted appears given the projection layer output X(k).
CN201910839702.1A 2019-09-06 2019-09-06 Fusion type word meaning embedding method based on real-time learning Expired - Fee Related CN110705274B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910839702.1A CN110705274B (en) 2019-09-06 2019-09-06 Fusion type word meaning embedding method based on real-time learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910839702.1A CN110705274B (en) 2019-09-06 2019-09-06 Fusion type word meaning embedding method based on real-time learning

Publications (2)

Publication Number Publication Date
CN110705274A true CN110705274A (en) 2020-01-17
CN110705274B CN110705274B (en) 2023-03-24

Family

ID=69194339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910839702.1A Expired - Fee Related CN110705274B (en) 2019-09-06 2019-09-06 Fusion type word meaning embedding method based on real-time learning

Country Status (1)

Country Link
CN (1) CN110705274B (en)


Patent Citations (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102339322A (en) * 2011-11-10 2012-02-01 武汉大学 Word meaning extracting method based on search interactive information and user search intention
US20140129276A1 (en) * 2012-11-07 2014-05-08 Sirion Labs Method and system for supplier management
CN104462058A (en) * 2014-10-24 2015-03-25 腾讯科技(深圳)有限公司 Character string identification method and device
CN105989125A (en) * 2015-02-16 2016-10-05 苏宁云商集团股份有限公司 Searching method and system for carrying out label identification on resultless word
CN105760363A (en) * 2016-02-17 2016-07-13 腾讯科技(深圳)有限公司 Text file word sense disambiguation method and device
US20180068371A1 (en) * 2016-09-08 2018-03-08 Adobe Systems Incorporated Learning Vector-Space Representations of Items for Recommendations using Word Embedding Models
CN106484685A (en) * 2016-10-21 2017-03-08 长沙市麓智信息科技有限公司 Patent real-time learning system and its learning method
CN108304411A (en) * 2017-01-13 2018-07-20 中国移动通信集团辽宁有限公司 The method for recognizing semantics and device of geographical location sentence
CN106897950A (en) * 2017-01-16 2017-06-27 北京师范大学 One kind is based on word cognitive state Model suitability learning system and method
US20180253638A1 (en) * 2017-03-02 2018-09-06 Accenture Global Solutions Limited Artificial Intelligence Digital Agent
CN106782560A (en) * 2017-03-06 2017-05-31 海信集团有限公司 Determine the method and device of target identification text
CN107291693A (en) * 2017-06-15 2017-10-24 广州赫炎大数据科技有限公司 A kind of semantic computation method for improving term vector model
WO2019032307A1 (en) * 2017-08-07 2019-02-14 Standard Cognition, Corp. Predicting inventory events using foreground/background processing
US20190156274A1 (en) * 2017-08-07 2019-05-23 Standard Cognition, Corp Machine learning-based subject tracking
CN107656963A (en) * 2017-08-11 2018-02-02 百度在线网络技术(北京)有限公司 Vehicle owner identification method and device, computer equipment and computer-readable recording medium
US20190065576A1 (en) * 2017-08-23 2019-02-28 Rsvp Technologies Inc. Single-entity-single-relation question answering systems, and methods
CN109726386A (en) * 2017-10-30 2019-05-07 中国移动通信有限公司研究院 A kind of term vector model generating method, device and computer readable storage medium
CN108268449A (en) * 2018-02-10 2018-07-10 北京工业大学 A kind of text semantic label abstracting method based on lexical item cluster
CN108399163A (en) * 2018-03-21 2018-08-14 北京理工大学 Bluebeard compound polymerize the text similarity measure with word combination semantic feature
CN108733647A (en) * 2018-04-13 2018-11-02 中山大学 A kind of term vector generation method based on Gaussian Profile
CN108549637A (en) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 Method for recognizing semantics, device based on phonetic and interactive system
CN108874772A (en) * 2018-05-25 2018-11-23 太原理工大学 A kind of polysemant term vector disambiguation method
CN109033307A (en) * 2018-07-17 2018-12-18 华北水利水电大学 Word polyarch vector based on CRP cluster indicates and Word sense disambiguation method
CN109213995A (en) * 2018-08-02 2019-01-15 哈尔滨工程大学 A kind of across language text similarity assessment technology based on the insertion of bilingual word
CN109271635A (en) * 2018-09-18 2019-01-25 中山大学 A kind of term vector improved method of insertion outside dictinary information
CN109859554A (en) * 2019-03-29 2019-06-07 上海乂学教育科技有限公司 Adaptive english vocabulary learning classification pushes away topic device and computer learning system
CN109960811A (en) * 2019-03-29 2019-07-02 联想(北京)有限公司 A kind of data processing method, device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
金保华 (Jin Baohua) et al., "Research on sentiment classification of short texts based on word sense disambiguation", Modern Computer (Professional Edition) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783418A (en) * 2020-06-09 2020-10-16 北京北大软件工程股份有限公司 Chinese meaning representation learning method and device
CN111783418B (en) * 2020-06-09 2024-04-05 北京北大软件工程股份有限公司 Chinese word meaning representation learning method and device
CN112989051A (en) * 2021-04-13 2021-06-18 北京世纪好未来教育科技有限公司 Text classification method, device, equipment and computer readable storage medium
CN112989051B (en) * 2021-04-13 2021-09-10 北京世纪好未来教育科技有限公司 Text classification method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN110705274B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN109241524B (en) Semantic analysis method and device, computer-readable storage medium and electronic equipment
CN108009148B (en) Text emotion classification representation method based on deep learning
CN111027595B (en) Double-stage semantic word vector generation method
CN110263325B (en) Chinese word segmentation system
CN110705296A (en) Chinese natural language processing tool system based on machine learning and deep learning
CN111241294A (en) Graph convolution network relation extraction method based on dependency analysis and key words
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN107273352B (en) Word embedding learning model based on Zolu function and training method
CN110597961A (en) Text category labeling method and device, electronic equipment and storage medium
Zhang et al. Hotel reviews sentiment analysis based on word vector clustering
CN112818118A (en) Reverse translation-based Chinese humor classification model
CN112836051A (en) Online self-learning court electronic file text classification method
JP2016170636A (en) Connection relationship estimation device, method, and program
CN110705274B (en) Fusion type word meaning embedding method based on real-time learning
CN114564563A (en) End-to-end entity relationship joint extraction method and system based on relationship decomposition
CN113609849A (en) Mongolian multi-mode fine-grained emotion analysis method fused with priori knowledge model
CN112287656A (en) Text comparison method, device, equipment and storage medium
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
CN117217277A (en) Pre-training method, device, equipment, storage medium and product of language model
CN115687609A (en) Zero sample relation extraction method based on Prompt multi-template fusion
CN115329075A (en) Text classification method based on distributed machine learning
CN108536838A (en) Very big unrelated multivariate logistic regression model based on Spark is to text sentiment classification method
CN108875024B (en) Text classification method and system, readable storage medium and electronic equipment
CN110309252B (en) Natural language processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20230324