CN110705274B - Fused word-sense embedding method based on real-time learning

Fused word-sense embedding method based on real-time learning

Info

Publication number
CN110705274B
Authority
CN
China
Prior art keywords
word
vector
neural network
sense
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910839702.1A
Other languages
Chinese (zh)
Other versions
CN110705274A (en)
Inventor
桂盛霖
方丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by University of Electronic Science and Technology of China
Priority to CN201910839702.1A
Publication of CN110705274A
Application granted
Publication of CN110705274B
Legal status: Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification


Abstract

The invention discloses a fused word-sense embedding method based on real-time learning, belonging to the technical field of automatic word-vector generation. The method obtains the sense vector of the word currently undergoing sense embedding from the projection output of a preset neural network language model. The input layer of the network structure obtains the vector corresponding to the current word k in a preset word vector matrix V. The projection layer judges the current word k: if k is a monosemous word, it performs an identity projection and outputs the vector corresponding to k in the preset word vector matrix V; if k is a polysemous word, the corresponding sense vector is obtained through a real-time-learning-based sense recognition algorithm and the projection layer outputs that sense vector. The invention computes and generates the sense vectors of polysemous words with a real-time learning method, improving the quality of the generated vectors while preserving the efficiency of sense-vector computation.

Description

Fused word-sense embedding method based on real-time learning
Technical Field
The invention belongs to the technical field of automatic word-vector generation, and particularly relates to a fused word-sense embedding method based on real-time learning.
Background
In Natural Language Processing (NLP) tasks, since machines cannot directly understand and analyze human language, natural language is usually modeled first and then provided to the computer as input. A word vector (word representation) is the product of converting a word of human language into an abstract representation; two types are in common use:
One-Hot Representation: generating this type of word vector first requires counting all words in the corpus to build a vocabulary N and assigning each word a unique number. The word vector generated for a word has length |N|; the position corresponding to the word's number is 1, and all other positions are 0. The problems with this representation are that it occupies a lot of space, leading to high subsequent computation cost, and that the word vector cannot characterize relationships between words.
Distributed Representation: this type of word vector overcomes the disadvantages of One-Hot Representation by representing words as dense vectors. Such vectors usually arise as a byproduct of training some language model: the words of a corpus are mapped into a word-vector space through training, and the relationships between the vectors express word semantics and lexical relationships. The semantic similarity of words can therefore be represented by the closeness of their word-vector values.
Current word-vector generation can be divided as follows, according to the granularity of the language unit a vector corresponds to:
(1) Word embedding: words in natural language are represented as vector data that a computer can process.
(2) Word sense embedding: specific semantics possessed by words in a natural language are expressed as vector data that can be processed by a computer.
Word-sense embedding addresses one of the main drawbacks of word-embedding models: their inability to accurately express the senses of polysemous words; word-vector generation models that are more sensitive to word sense gradually took shape. A sense-embedding model generates multiple word vectors corresponding to the several senses a polysemous word exhibits in the corpus, describing words more accurately at the semantic level. There are currently two main types of sense-embedding models: two-stage and fused. In a two-stage model, sense recognition and word-vector generation are performed serially as separate processes; a fused model completes sense recognition during word-vector generation.
Schütze first proposed context-group discrimination in 1998, clustering contexts with an expectation-maximization method to identify word senses and then generating sense vectors. Subsequent two-stage models are broadly similar in concept, typically differing in, or optimizing, the sense-recognition algorithm or the text modeling. In 2010, Reisinger and Mooney represented contexts as unigrams and completed sense recognition with movMF (mixture of von Mises-Fisher) clustering. The Sense2vec tool uses part-of-speech information to separate senses, but fails to consider that the vectors a polysemous word is split into by part of speech may actually correspond to the same sense. Fused models exploit the commonality that both sense recognition and word-vector generation compute over the textual context, combining the two processes to reduce computation. Neelakantan extended the Word2vec model by preparing a fixed number of vectors for each polysemous word and selecting the appropriate vector to update during training; its drawback is that different polysemous words often have different numbers of senses, so the restriction is severe. Liu et al. addressed the defect that word vectors are generated from local information only, proposing the TWE model, which adds topic information to the modeling process.
Related research and tools in China are currently few. One approach models topics with LDA (Latent Dirichlet Allocation) and semantically annotates polysemous words. Another obtains sense vectors by further learning word vectors with the Chinese knowledge base HowNet. A third builds a two-stage model using K-Means clustering in the sense-identification stage; its drawback is that K-Means requires the number of cluster centers, that is, the number of senses to generate, to be given in advance, so extensibility is poor.
Existing word-vector tools fall into two main categories: word-level embedding and sense-level embedding.
Word-level embedding has the following disadvantages: 1) the vector trained for a word with multiple senses is biased toward the senses that occur more often in the corpus, while the rarer senses are diluted; 2) semantically unrelated content appears among the results computed as most similar to a polysemous word; 3) the triangle inequality that the word-vector space should satisfy is violated, lowering the quality of the space. Sense-level embedding models divide into two-stage and fused types. Two-stage models neglect the similarity between the sense-identification and vector-generation processes and complete the two serially, which is inefficient. Fused models cannot use the more effective clustering algorithms such as K-Means and DBSCAN, so their results are usually inferior to those of two-stage models.
Disclosure of Invention
Object of the invention: in view of the above problems, the invention computes and generates the sense vectors of polysemous words with a real-time learning method, improving the quality of the generated vectors while preserving the efficiency of sense-vector computation.
The invention discloses a fused word-sense embedding method based on real-time learning, comprising the following steps:
step 1: setting a neural network language model;
the network structure of the neural network language model comprises an input layer and a projection layer;
the input layer is used for acquiring a corresponding vector V (k) of a current word k in a preset word vector matrix V;
the projection layer is used for judging the current word k, if the current word k is a univocal word, the projection layer performs an identity projection, and the projection layer outputs X (k) = V (k); if the word is a polysemous word, acquiring a corresponding word sense vector C (k, h) through a word sense recognition algorithm getCenter (k, h) based on real-time learning, and outputting X (k) = C (k, h) by a projection layer, wherein h represents an environment vector of the current word k;
step 2: performing neural network training on the neural network language model constructed in step 1 based on a preset training sample set; when the preset training requirement is met (for example, the maximum number of training epochs is reached, or the rate of change of the output loss meets the precision requirement), stopping and saving the trained neural network language model;
step 3: inputting the word to be sense-embedded into the trained neural network language model, and outputting the sense vector of that word based on the projection output of the model.
The real-time learning-based word sense recognition algorithm getCenter (k, h) comprises the following specific processing procedures:
judging whether cluster centers corresponding to the polysemous word k exist in the set O representing the cluster-center set; if no corresponding cluster center exists, generating a new cluster center for the polysemous word k and adding the environment vector h to it; during training, the set O is initialized as the empty set and is populated incrementally as training proceeds.
If cluster centers corresponding to the polysemous word k exist, respectively calculating the distance between the environment vector h and each corresponding cluster center, and recording the minimum of these distances as min(L); if min(L) is smaller than a minimum distance threshold δ, generating a new cluster center corresponding to the polysemous word k and adding the environment vector h into the newly generated cluster center; otherwise, merging the environment vector h into the cluster center corresponding to min(L) to obtain the new cluster center

\[ O_{ki} \leftarrow \frac{\lvert O_{ki} \rvert \cdot O_{ki} + h}{\lvert O_{ki} \rvert + 1} \]

where O_ki denotes the cluster center corresponding to min(L) and |O_ki| the number of environment vectors already merged into it;

the sense vector C_ki corresponding to the cluster center of the environment vector h is obtained as the sense vector C(k, h);
further, in order to accelerate the training process of the neural network language model, an output layer is added to the neural network language model during training; it adopts a Huffman tree structure, the words of a preset dictionary D serve as the leaf nodes of the Huffman tree, and the non-leaf nodes of the Huffman tree represent parameters of the neural network and are used for outputting the probability that the word to be predicted g occurs given the projection-layer output X(k).
In summary, owing to the adopted technical scheme, the invention has the following beneficial effect: it fuses the sense-recognition process with the word-vector generation process on the basis of real-time learning, uses a real-time-learning clustering algorithm to further ensure the accuracy of sense recognition while preserving computational efficiency, and thereby improves the quality of the sense vectors.
Drawings
FIG. 1 is a schematic diagram of a neural network structure employed in the present invention in an exemplary embodiment;
FIG. 2 is a diagram illustrating a clustering process according to an embodiment;
FIG. 3 is a schematic diagram illustrating the implementation process of the fused word-sense embedding method based on real-time learning according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a training process in the embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
The invention discloses a fused word-sense embedding method based on real-time learning, which combines a real-time-learning-based sense-identification process with the word-vector generation process to generate one sense vector for each sense of a polysemous word. The method solves the problem that a polysemous word corresponds to only a single vector in traditional word vectors; combining the recognition and generation processes yields high computational efficiency; and the real-time-learning clustering algorithm ensures the accuracy of sense identification.
Word vectors are typically a byproduct of training a language model, and the vectors obtained differ with the language model used. The specific embodiment adopts a neural network language model: a three-layer neural network is built to predict the probability of the environment word g given the current word k, and a real-time-learning-based sense recognition algorithm is added to the word-vector generation process, realizing the computation and generation of the sense vectors of polysemous words. The overall structure of the network is shown in FIG. 1.
The network structure comprises an input layer, a projection layer and an output layer, and the calculation performed by each layer is as follows:
(1) Input Layer: in the model initialization stage, a word vector matrix V of size m × |D| is pre-stored according to the set word-vector length m, where v_1, v_2, … denote the elements (columns) of V, and D denotes the dictionary obtained from the sample corpus after removing the words whose frequency is below the user-set minimum. The input layer obtains the vector V(k) corresponding to the current word k in the word vector matrix V;
Referring to FIG. 1, an input sample is t = (k, H(k), g), where k denotes the current word, H(k) denotes the set of environment words of k within the window W, and g denotes the word to be predicted.
(2) Projection Layer: the current word k is judged. If k is a monosemous word, an identity projection is performed and the projection layer outputs X(k) = V(k); if k is a polysemous word, the corresponding sense vector C(k, h) is obtained by calling the real-time-learning-based sense recognition algorithm getCenter(k, h), where h denotes the environment vector (computed from the environment words) of the current word k, and the projection layer outputs X(k) = C(k, h);
(3) Output Layer: outputs the probability that the word to be predicted g occurs given the projection-layer output X(k). The specific embodiment adopts the Hierarchical Softmax algorithm to output this probability, and the output layer adopts a Huffman tree structure: the words of the dictionary D serve as leaf nodes of the Huffman tree, and the non-leaf nodes of the tree represent parameters of the neural network.
In each training iteration, the parameters to be updated in the network include the word vector V(k), the sense vector C(k, h) and the non-leaf-node parameters of the output layer. The word vectors finally obtained by training consist of the word vector matrix V and the sense-vector set C: the vectors of monosemous words are in the word vector matrix V, and the sense vectors of polysemous words are in the sense-vector set C.
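For concreteness, the trainable state just described (word vector matrix V, sense-vector set C, cluster-center set O) can be held in a small container like the one sketched below. This is an illustrative sketch only, not the patent's reference implementation; the class name, attribute layout and random initialization scheme are assumptions.

```python
import numpy as np

class SenseEmbeddingModel:
    """Illustrative container for the trainable state described above:
    word vectors V, sense vectors C, and cluster centers O. The Huffman
    node parameters theta are kept separately (see the later sketches)."""

    def __init__(self, vocab, dim=100, seed=0):
        rng = np.random.default_rng(seed)
        self.dim = dim
        self.index = {w: i for i, w in enumerate(vocab)}
        # The text stores V as an m x |D| matrix (one column per word);
        # a row-per-word layout is used here for convenience.
        self.V = (rng.random((len(vocab), dim)) - 0.5) / dim
        # C[word] -> list of sense vectors; O[word] -> list of
        # [cluster_center, sample_count] pairs, filled during training.
        self.C = {}
        self.O = {}

    def vector(self, word):
        """Input-layer lookup: the vector V(k) of the current word k."""
        return self.V[self.index[word]]
```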
The real-time-learning-based sense recognition algorithm getCenter(k, h) is as follows:
In the sense-identification process, define the set C as the set of sense vectors of polysemous words and the set O as the set of cluster centers. When the current word k is a polysemous word, each of its cluster centers in O is denoted O_ki, and each cluster center corresponds to a sense vector C_ki in C.

The getCenter(k, h) function updates the matching cluster center from the polysemous word k and its environment vector h, and returns the sense vector C_ki corresponding to the cluster center O_ki of the polysemous word k under the environment (condition) h, denoted the sense vector C(k, h).
Referring to FIG. 2, the specific procedure of getCenter(k, h) is:
1) Input: a polysemous word k, an environment vector h, and a minimum distance threshold δ;
2) Judge whether cluster centers corresponding to the polysemous word k exist in the set O. If not, generate a new cluster O_k1 for k, add h to the newly generated cluster O_k1, and turn to step 4); the center of cluster O_k1 is O_k1.
If cluster centers O_k1, O_k2, …, O_kn corresponding to the polysemous word k already exist, the distance L(O_ki, h) between h and each corresponding center O_ki is computed, giving the distance set L = {L(O_k1, h), L(O_k2, h), …, L(O_kn, h)}, where the distance is computed by formula (1):

\[ L(O_{ki}, h) = \sqrt{ \sum_{j=1}^{n'} \left( O_{ki,j} - h_j \right)^{2} } \qquad (1) \]

where i = 1, 2, …, n and n denotes the current number of cluster centers; the cluster center O_ki is an n'-dimensional vector whose j-th element is O_ki,j, and h_j denotes the j-th vector element of the environment vector.
3) Find the minimum distance min(L) and the corresponding cluster center O_ki. If min(L) < δ, generate a new cluster O_k(n+1), so that the new set of cluster centers corresponding to the polysemous word k becomes O_k = {O_k1, O_k2, …, O_kn, O_k(n+1)}, and add h to cluster O_k(n+1); otherwise merge h into O_ki and update the cluster center by formula (2):

\[ O_{ki} \leftarrow \frac{\lvert O_{ki} \rvert \cdot O_{ki} + h}{\lvert O_{ki} \rvert + 1} \qquad (2) \]

where |O_ki| denotes the number of environment vectors already assigned to the cluster;
4) Output: the sense vector C_ki corresponding to the cluster center O_ki of h.
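The following is a minimal Python sketch of getCenter(k, h) under the reconstructions above (Euclidean distance in formula (1), incremental-mean update in formula (2)); the branch on min(L) follows the text as written. The container layout, the sample counts and the zero initialization of new sense vectors are assumptions for illustration.

```python
import numpy as np

def get_center(k, h, O, C, delta, dim):
    """Real-time-learning sense recognition getCenter(k, h): update the
    cluster centers of polysemous word k with environment vector h and
    return the sense vector tied to the resulting cluster.

    O: dict word -> list of [center (ndarray), sample_count]
    C: dict word -> list of sense vectors, index-aligned with O[word]
    """
    if k not in O or not O[k]:
        # no center yet: open cluster O_k1 around h
        O[k] = [[h.copy(), 1]]
        C.setdefault(k, []).append(np.zeros(dim))  # init scheme is an assumption
        return C[k][-1]

    # formula (1): distance from h to every existing center
    dists = [np.linalg.norm(center - h) for center, _ in O[k]]
    i = int(np.argmin(dists))

    if dists[i] < delta:
        # branch as stated in the text: open a new cluster O_k(n+1)
        O[k].append([h.copy(), 1])
        C[k].append(np.zeros(dim))
        return C[k][-1]

    # formula (2): merge h into the nearest center as an incremental mean
    center, n = O[k][i]
    O[k][i] = [(center * n + h) / (n + 1), n + 1]
    return C[k][i]
```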
The Hierarchical Softmax algorithm is an optimization that improves the training efficiency of the neural network. The Huffman tree constructed in the output layer provides the structural basis for implementing it. First, define the related concepts of the Huffman tree:
1) The path from the root node of the Huffman tree to the leaf node k is denoted p_k;
2) the number of nodes contained in path p_k is l_k;
3) the nodes on the path are denoted p_k^1, p_k^2, …, p_k^{l_k}, where p_k^1 is the root node;
4) the Huffman codes of the nodes on the path are denoted d_k^2, d_k^3, …, d_k^{l_k} ∈ {0, 1}; the root node carries no code;
5) the parameter vectors of the nodes on the path are denoted θ_k^1, θ_k^2, …, θ_k^{l_k}, where θ_k^j is the parameter vector of the j-th node on the path (j = 1, 2, …, l_k).
The core idea of Hierarchical Softmax is: for a word k in the dictionary D, there is exactly one path p_k from the root node to the leaf node k in the Huffman tree, containing l_k nodes and l_k − 1 branches. Each binary branch is regarded as a classification: a node whose Huffman code is 0 is defined as the positive class and a node whose code is 1 as the negative class. Since every branching node carries a parameter vector, each binary classification assigns the probabilities of the positive and negative classes by formulas (3) and (4):

\[ P\left( d_k^j = 0 \mid X(k), \theta_k^{j-1} \right) = \sigma\left( X(k)^{\top} \theta_k^{j-1} \right) \qquad (3) \]

\[ P\left( d_k^j = 1 \mid X(k), \theta_k^{j-1} \right) = 1 - \sigma\left( X(k)^{\top} \theta_k^{j-1} \right) \qquad (4) \]

where σ(x) = 1 / (1 + e^{-x}) is the sigmoid function, θ_k^{j-1} denotes the parameter vector of the (j−1)-th node on the path, and e denotes the natural base.

The overall probability is obtained by multiplying the probabilities of all branches, as shown in formula (5):

\[ P\left( g \mid X(k) \right) = \prod_{j=2}^{l_g} P\left( d_g^j \mid X(k), \theta_g^{j-1} \right) \qquad (5) \]
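As a runnable sketch of formulas (3) through (5), assuming each leaf word stores the list of its Huffman code bits d (root excluded) and the parameter vectors θ of the branching nodes above them; the function names are illustrative:

```python
import numpy as np

def sigmoid(z):
    """sigma(x) = 1 / (1 + e^(-x)) from formulas (3) and (4)."""
    return 1.0 / (1.0 + np.exp(-z))

def huffman_probability(x, code, thetas):
    """Formula (5): p(g | X(k)) as the product of the binary decisions
    along g's Huffman path. code[j] is the bit d (0 = positive class);
    thetas[j] is the parameter vector of the branching node above it."""
    p = 1.0
    for d, theta in zip(code, thetas):
        s = sigmoid(x @ theta)
        p *= s if d == 0 else 1.0 - s   # formulas (3) / (4)
    return p
```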
Formula (5) is substituted into the maximum-likelihood function, and the gradient is computed by stochastic gradient ascent; in the network, the vector corresponding to the sample in the matrix V and the parameter vectors θ_g^{j-1} on the relevant path of the Huffman tree are updated. When the current word k is a polysemous word, the corresponding sense vector C(k, h) also needs to be updated.
Referring to FIG. 3 and FIG. 4, the implementation of the invention comprises the following steps:
step 1: preprocessing text data and initializing a model.
In this embodiment, the text data is segmented into words and then used as the training data T.
Then an empty model object is created, and parameters such as the sample window size W, the minimum word frequency F, the length of the generated word vectors and the number of training epochs are set.
After the model is created, its built-in dictionary is initialized: a dictionary D is generated from the words appearing in the training data T and their frequencies, the words whose frequency is below F are discarded, and D is sorted by frequency. A built-in word vector matrix V is then generated from the dictionary D, and a Huffman tree is built so that training can be accelerated with the Hierarchical Softmax algorithm; the leaf nodes of the Huffman tree are the words of the dictionary D, and the non-leaf nodes serve as network parameters. Finally, the word vector matrix V and the network parameter values are randomly initialized.
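A minimal sketch of this initialization, assuming Python with heapq: build the frequency-sorted dictionary D and a Huffman tree whose leaves are the words of D. Node ids, tie-breaking and the returned (code, path) layout are illustrative choices, not the patent's specification.

```python
import heapq
from collections import Counter

def build_dictionary(tokens, min_freq):
    """Dictionary D: words at or above the minimum frequency F, sorted by
    descending frequency, plus their counts."""
    counts = Counter(tokens)
    kept = {w: c for w, c in counts.items() if c >= min_freq}
    return sorted(kept, key=kept.get, reverse=True), kept

def build_huffman(freqs):
    """Return {word: (code, path)}: code is the list of Huffman bits d of
    the word's leaf, path the ids of the internal nodes (theta carriers)
    from the root down to the leaf's parent."""
    heap = [(f, i, w) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id, parent, bit = len(heap), {}, {}
    while len(heap) > 1:
        f1, i1, _ = heapq.heappop(heap)
        f2, i2, _ = heapq.heappop(heap)
        parent[i1], bit[i1] = next_id, 0
        parent[i2], bit[i2] = next_id, 1
        heapq.heappush(heap, (f1 + f2, next_id, None))
        next_id += 1
    codes = {}
    for i, w in enumerate(freqs):
        code, path, node = [], [], i
        while node in parent:
            code.append(bit[node])
            path.append(parent[node])
            node = parent[node]
        codes[w] = (code[::-1], path[::-1])
    return codes
```

For example, vocab, counts = build_dictionary(tokens, F) followed by codes = build_huffman({w: counts[w] for w in vocab}) yields the per-word codes and parameter-node paths used by the output layer.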
After initialization, the training process on T starts. A single sample t_i in T can be represented as t_i = (k, H(k), g), where k is the current word, H(k) denotes the set of environment words of k within the window W, and g denotes the word to be predicted.
Step 2: forward propagation and real-time learning.
In the training process, for a single sample t_i = (k, H(k), g) in T, computation proceeds layer by layer through the three-layer network structure. First, the input layer obtains the vector V(k) corresponding to the current word k in the word vector matrix V. Then the projection layer judges k: if k is a polysemous word, its corresponding environment vector h is computed as formula (6):

\[ h = \frac{1}{\lvert H(k) \rvert} \sum_{c \in H(k)} V(c) \qquad (6) \]

After the environment vector is computed, the real-time-learning-based sense recognition algorithm is called to obtain the sense vector of k under H(k), C(k, h) = getCenter(k, h), where C denotes the sense-vector set, and the projection layer output is set to X(k) = C(k, h); if the current word is a monosemous word, the projection layer outputs X(k) = V(k). Finally, the output layer computes the probability of the word g given X(k) with the Hierarchical Softmax algorithm.
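Putting step 2 together, a sketch of the forward computation for one sample t_i = (k, H(k), g), reusing the helpers sketched earlier (the model container, get_center, huffman_probability and the (code, path) table from build_huffman); the set of polysemous words and the Theta table are assumed inputs.

```python
import numpy as np

def forward(model, k, context_words, g, codes, Theta, polysemous, delta):
    """One forward pass for sample (k, H(k), g): input-layer lookup,
    projection-layer branch, output-layer probability of g."""
    # formula (6): environment vector h = mean of the context word vectors
    h = np.mean([model.vector(c) for c in context_words], axis=0)
    if k in polysemous:
        x = get_center(k, h, model.O, model.C, delta, model.dim)  # X(k) = C(k, h)
    else:
        x = model.vector(k)                                       # X(k) = V(k)
    code, path = codes[g]
    thetas = [Theta[n] for n in path]
    return x, h, huffman_probability(x, code, thetas)
```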
And step 3: and updating the network parameters.
After the forward propagation of one sample completes, the gradient is computed by stochastic gradient ascent and the parameters in the network are updated: the corresponding vector in the word vector matrix V and the parameter vectors θ_g^{j-1} on the relevant path of the Huffman tree; when the current word k is a polysemous word, the corresponding sense vector C(k, h) is updated in its place. Writing q_j = σ(X(k)^⊤ θ_g^{j-1}) and letting τ denote the learning rate set for the model, the updates of the path parameters, of the word vector and of the sense vector are given by formulas (7), (8) and (9):

\[ \theta_g^{j-1} \leftarrow \theta_g^{j-1} + \tau \left( 1 - d_g^{j} - q_j \right) X(k) \qquad (7) \]

\[ V(k) \leftarrow V(k) + \tau \sum_{j=2}^{l_g} \left( 1 - d_g^{j} - q_j \right) \theta_g^{j-1} \qquad (8) \]

\[ C(k, h) \leftarrow C(k, h) + \tau \sum_{j=2}^{l_g} \left( 1 - d_g^{j} - q_j \right) \theta_g^{j-1} \qquad (9) \]

Through repeated iterative training and parameter updates over the sample set, the final word vector matrix V and sense-vector set C are obtained: the word vectors corresponding to monosemous words are stored in V, and the sense vectors corresponding to polysemous words are stored in C.
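Under the standard log-likelihood for this Hierarchical Softmax objective, updates (7) through (9) can be sketched as below; the gradient expressions are the usual ones for this objective and are stated as an assumption, since they are reconstructed rather than copied from the original formula images.

```python
import numpy as np

def sgd_update(x, code, path, Theta, tau):
    """One stochastic-gradient-ascent step along g's Huffman path.
    Theta[n] is updated in place (formula (7)); the returned vector e is
    the accumulated gradient to add to X(k)."""
    e = np.zeros_like(x)
    for d, n in zip(code, path):
        theta = Theta[n]
        q = 1.0 / (1.0 + np.exp(-(x @ theta)))   # sigma(X(k)^T theta)
        step = tau * (1 - d - q)
        e += step * theta                         # contribution to (8) / (9)
        Theta[n] = theta + step * x               # formula (7)
    return e
```

For a monosemous word the returned e is added to V(k) (formula (8)); for a polysemous word it is added to the selected sense vector C(k, h) (formula (9)).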
And 4, step 4: and generating a word sense vector.
In order to improve the quality of the generated sense vectors, after training finishes, the clusters formed during clustering that contain few samples, together with their sense vectors, are deleted: for every sense vector C_ki of a polysemous word k in the sense-vector set C, check its corresponding cluster center O_ki; if the number of samples assigned to O_ki is below a preset threshold, delete the cluster and the sense vector C_ki. This reduces the word-vector generation errors introduced by clustering and yields the final sense-vector set C, where C_ki denotes a sense vector corresponding to the polysemous word k.
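A sketch of this pruning step, assuming the per-cluster sample counts maintained by get_center above; the threshold is passed in explicitly:

```python
def prune_senses(O, C, min_count):
    """Post-training cleanup: drop clusters (and the index-aligned sense
    vectors) whose sample count is below the preset threshold."""
    for k in list(O):
        kept = [(oc, c) for oc, c in zip(O[k], C[k]) if oc[1] >= min_count]
        O[k] = [oc for oc, _ in kept]
        C[k] = [c for _, c in kept]
```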
Once the sense-vector model is built, it represents the senses of polysemous words more accurately than traditional word-vector models, which can further improve the algorithms of word-vector-based applications; for example, it can be applied to natural-language-processing tasks such as word-similarity computation, word-sense disambiguation and text classification. The specific procedures for realizing these three applications on the basis of sense vectors are as follows:
1) Calculating the similarity between words based on the word sense vector:
In the sense-vector model, each word is represented as one or more sense vectors according to its number of senses. The similarity between two words can therefore be measured by the cosine similarity of their vectors. For vectors a and b, the cosine similarity is computed by formula (10):
\[ \cos(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}} \qquad (10) \]

where n denotes the dimension of vectors a and b, and a_i, b_i denote the vector elements of a and b respectively.
Suppose the words to be compared are c and d. Because a polysemous word has several sense vectors, computing the similarity between the words requires the cosine similarity between every sense vector of word c and every sense vector of word d; the maximum among these results is output as the similarity between word c and word d.
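Formula (10) and the max-over-sense-pairs rule can be sketched as follows; the function names are illustrative:

```python
import numpy as np

def cosine(a, b):
    """Formula (10): cosine similarity of vectors a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def word_similarity(senses_c, senses_d):
    """Similarity of words c and d: maximum cosine similarity over every
    pair of their sense vectors (a monosemous word contributes one vector)."""
    return max(cosine(a, b) for a in senses_c for b in senses_d)
```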
2) Word sense disambiguation based on word sense vectors:
A polysemous word can be disambiguated with the sense-vector model. First, obtain the context C = {c_1, c_2, …, c_n} of the polysemous word to be disambiguated, where c_i denotes a word in the context window; then read the sense vector e_i of each c_i in the sense-vector model, and sum and average these vectors to obtain the context environment vector h, formula (11):

\[ h = \frac{1}{n} \sum_{i=1}^{n} e_i \qquad (11) \]
Finally, compute the cosine similarity between the context environment vector h and each sense vector of the polysemous word to be disambiguated, and output the sense corresponding to the sense vector with the maximum cosine similarity, completing the disambiguation.
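A sketch of this disambiguation routine, reusing cosine from the similarity sketch above; it returns the index of the winning sense vector, with the vector lookups left to the caller:

```python
import numpy as np

def disambiguate(context_vectors, sense_vectors):
    """Formula (11): average the context word vectors into h, then pick
    the sense vector most cosine-similar to h; returns its index."""
    h = np.mean(context_vectors, axis=0)
    return int(np.argmax([cosine(h, s) for s in sense_vectors]))
```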
3) Text classification based on word sense vectors:
Because a polysemous word has only one sense in a specific context, determining the sense it takes in that context can effectively improve the accuracy of text classification.
First, perform the sense-vector-based disambiguation of 2) above on the polysemous words in the text to be classified, obtaining the sense vector corresponding to each polysemous word. Then read the vectors of the remaining words of the text from the sense-vector model and accumulate them, together with the sense vectors of the polysemous words obtained in the previous step, into a vector w, the text vector. Finally, build a suitable classifier and train it with the set of text vectors to obtain the trained classification model.
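And a sketch of building the text vector w for the classifier, reusing the model container and disambiguate from the earlier sketches; the window size and the fallback behavior for out-of-dictionary words are assumptions:

```python
import numpy as np

def text_vector(tokens, model, polysemous, window=5):
    """Accumulate the text vector w: the disambiguated sense vector for
    each polysemous word, the ordinary word vector otherwise."""
    w = np.zeros(model.dim)
    for i, tok in enumerate(tokens):
        if tok in polysemous and model.C.get(tok):
            ctx = [t for t in tokens[max(0, i - window):i + window + 1]
                   if t != tok and t in model.index]
            if ctx:
                j = disambiguate([model.vector(t) for t in ctx], model.C[tok])
                w += model.C[tok][j]
                continue
        if tok in model.index:
            w += model.vector(tok)
    return w
```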
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims (2)

1. A fused word-sense embedding method based on real-time learning, characterized by comprising the following steps:
step 1: setting a neural network language model;
the network structure of the neural network language model comprises an input layer and a projection layer;
the input layer is used for acquiring a corresponding vector V (k) of a current word k in a preset word vector matrix V;
the projection layer is used for judging the current word k: if the current word k is a monosemous word, the projection layer performs an identity projection and outputs X(k) = V(k); if the current word k is a polysemous word, the corresponding sense vector C(k, h) is obtained through the real-time-learning-based sense recognition algorithm getCenter(k, h) and the projection layer outputs X(k) = C(k, h), where h represents the environment vector of the current word k;
step 2: performing neural network learning training on the neural network language model constructed in the step 1 based on a preset training sample set; when the preset training requirement is met, stopping and storing the trained neural network language model;
step 3: inputting the word to be sense-embedded into the trained neural network language model, and outputting the sense vector of that word based on the projection output of the model;
the real-time learning-based word sense recognition algorithm getCenter (k, h) comprises the following specific processing procedures:
judging whether a cluster center corresponding to the polysemous word k exists in a set O representing a cluster center set, if no corresponding cluster center exists, generating a new cluster center for the polysemous word k, and adding an environment vector h into the new cluster center;
if cluster centers corresponding to the polysemous word k exist, respectively calculating the distance between the environment vector h and each corresponding cluster center, and finding the minimum of these distances, recorded as min(L); if min(L) is smaller than a minimum distance threshold δ, generating a new cluster center corresponding to the polysemous word k and adding the environment vector h into the newly generated cluster center; otherwise, merging the environment vector h into the cluster center corresponding to min(L) to obtain the new cluster center

\[ O_{ki} \leftarrow \frac{\lvert O_{ki} \rvert \cdot O_{ki} + h}{\lvert O_{ki} \rvert + 1} \]

where O_ki denotes the cluster center corresponding to min(L) and |O_ki| the number of environment vectors already merged into it;

the sense vector C_ki corresponding to the cluster center of the environment vector h is obtained as the sense vector C(k, h).
2. The method according to claim 1, characterized in that in step 2, during training, an output layer is added to the neural network language model; it adopts a Huffman tree structure, the words of a preset dictionary D serve as the leaf nodes of the Huffman tree, and the non-leaf nodes of the Huffman tree represent parameters of the neural network and are used for outputting the probability that the word to be predicted g occurs given the projection-layer output X(k).
CN201910839702.1A 2019-09-06 2019-09-06 Fused word-sense embedding method based on real-time learning Expired - Fee Related CN110705274B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910839702.1A | 2019-09-06 | 2019-09-06 | Fused word-sense embedding method based on real-time learning (CN110705274B)

Publications (2)

Publication Number | Publication Date
CN110705274A (en) | 2020-01-17
CN110705274B (en) | 2023-03-24

Family

ID=69194339

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201910839702.1A | Fused word-sense embedding method based on real-time learning (CN110705274B, Expired - Fee Related) | 2019-09-06 | 2019-09-06

Country Status (1)

Country Link
CN (1) CN110705274B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783418B (en) * 2020-06-09 2024-04-05 北京北大软件工程股份有限公司 Chinese word meaning representation learning method and device
CN112989051B (en) * 2021-04-13 2021-09-10 北京世纪好未来教育科技有限公司 Text classification method, device, equipment and computer readable storage medium

Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102339322A (en) * 2011-11-10 2012-02-01 武汉大学 Word meaning extracting method based on search interactive information and user search intention
CN104462058A (en) * 2014-10-24 2015-03-25 腾讯科技(深圳)有限公司 Character string identification method and device
CN105760363A (en) * 2016-02-17 2016-07-13 腾讯科技(深圳)有限公司 Text file word sense disambiguation method and device
CN105989125A (en) * 2015-02-16 2016-10-05 苏宁云商集团股份有限公司 Searching method and system for carrying out label identification on resultless word
CN106484685A (en) * 2016-10-21 2017-03-08 长沙市麓智信息科技有限公司 Patent real-time learning system and its learning method
CN106782560A (en) * 2017-03-06 2017-05-31 海信集团有限公司 Determine the method and device of target identification text
CN106897950A (en) * 2017-01-16 2017-06-27 北京师范大学 One kind is based on word cognitive state Model suitability learning system and method
CN107291693A (en) * 2017-06-15 2017-10-24 广州赫炎大数据科技有限公司 A kind of semantic computation method for improving term vector model
CN107656963A (en) * 2017-08-11 2018-02-02 百度在线网络技术(北京)有限公司 Vehicle owner identification method and device, computer equipment and computer-readable recording medium
CN108268449A (en) * 2018-02-10 2018-07-10 北京工业大学 A kind of text semantic label abstracting method based on lexical item cluster
CN108304411A (en) * 2017-01-13 2018-07-20 中国移动通信集团辽宁有限公司 The method for recognizing semantics and device of geographical location sentence
CN108399163A (en) * 2018-03-21 2018-08-14 北京理工大学 Bluebeard compound polymerize the text similarity measure with word combination semantic feature
CN108549637A (en) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 Method for recognizing semantics, device based on phonetic and interactive system
CN108733647A (en) * 2018-04-13 2018-11-02 中山大学 A kind of term vector generation method based on Gaussian Profile
CN108874772A (en) * 2018-05-25 2018-11-23 太原理工大学 A kind of polysemant term vector disambiguation method
CN109033307A (en) * 2018-07-17 2018-12-18 华北水利水电大学 Word polyarch vector based on CRP cluster indicates and Word sense disambiguation method
CN109213995A (en) * 2018-08-02 2019-01-15 哈尔滨工程大学 A kind of across language text similarity assessment technology based on the insertion of bilingual word
CN109271635A (en) * 2018-09-18 2019-01-25 中山大学 A kind of term vector improved method of insertion outside dictinary information
WO2019032307A1 (en) * 2017-08-07 2019-02-14 Standard Cognition, Corp. Predicting inventory events using foreground/background processing
CN109726386A (en) * 2017-10-30 2019-05-07 中国移动通信有限公司研究院 A kind of term vector model generating method, device and computer readable storage medium
CN109859554A (en) * 2019-03-29 2019-06-07 上海乂学教育科技有限公司 Adaptive english vocabulary learning classification pushes away topic device and computer learning system
CN109960811A (en) * 2019-03-29 2019-07-02 联想(北京)有限公司 A kind of data processing method, device and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140129276A1 (en) * 2012-11-07 2014-05-08 Sirion Labs Method and system for supplier management
US10515400B2 (en) * 2016-09-08 2019-12-24 Adobe Inc. Learning vector-space representations of items for recommendations using word embedding models
US20180253638A1 (en) * 2017-03-02 2018-09-06 Accenture Global Solutions Limited Artificial Intelligence Digital Agent
US10474988B2 (en) * 2017-08-07 2019-11-12 Standard Cognition, Corp. Predicting inventory events using foreground/background processing
US10678816B2 (en) * 2017-08-23 2020-06-09 Rsvp Technologies Inc. Single-entity-single-relation question answering systems, and methods


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于词义消歧的短文本情感分类方法研究";金保华 等;《现代计算机(专业版)》;20180715;第38-41页 *

Also Published As

Publication number Publication date
CN110705274A (en) 2020-01-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20230324