CN106909537A - A polysemy analysis method based on a topic model and vector space - Google Patents
A polysemy analysis method based on a topic model and vector space
- Publication number
- CN106909537A CN106909537A CN201710067919.6A CN201710067919A CN106909537A CN 106909537 A CN106909537 A CN 106909537A CN 201710067919 A CN201710067919 A CN 201710067919A CN 106909537 A CN106909537 A CN 106909537A
- Authority
- CN
- China
- Prior art date: 2017-02-07
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/216 — Handling natural language data; natural language analysis; parsing; parsing using statistical methods
- G06F16/313 — Information retrieval of unstructured textual data; indexing; selection or weighting of terms for indexing
- G06F40/30 — Handling natural language data; semantic analysis
Abstract
The invention provides a polysemy analysis method based on a topic model and vector space, comprising: S1, taking formula (1) as the objective function, establish the polysemy topic model; S2, read the data of the whole document collection D; S3, initialize the topic-word distribution φ; S4, topic sampling; S5, topic vector updating; S6, word vector training; S7, execute S4 to S6 in a loop several times, so as to perform several iterations; S8, output and store the resulting word vectors and topic vectors; S9, judge whether a word is polysemous. The method can train higher-quality word vectors and topic vectors, so that they admit a more reasonable interpretation in the research and analysis of polysemy, and the performance of the topic model is also clearly superior to that of the original LDA model. Through cross-learning among the topic model, the word vectors and the topic vectors, the three mutually improve one another, and the method can be effectively applied to tasks such as similarity evaluation, document classification and topic relevance.
Description
Technical Field
The invention relates to the field of natural language processing, and in particular to a polysemy analysis method based on a topic model and a vector space.
Background
With the vigorous development of artificial intelligence technology, natural language processing has emerged as an innovative mode of language research that combines computer science, linguistics and mathematics into a single intelligent science, and it is widely applied in machine translation, question-answering systems, information retrieval, document processing and the like. Most words have more than one meaning, a phenomenon known as polysemy, so representing each word by a single word vector cannot resolve the ambiguity. To address this problem, context information or topic vectors have been used to assist the study of word ambiguity, but such work isolates the topic model, the word vectors and the topic vectors from one another, simply using existing results as prior knowledge to assist in training the model.
The topic model is used for mining the hidden topic information of a document collection; each topic represents a related concept, embodied as a series of related words, and is realized as a topic-word distribution. The word vector model maps each word to a low-dimensional real-valued space by using the context information in the text, and the vectors contain information such as syntax and semantics, so that the similarity of word vectors can be measured by Euclidean distance or the cosine of the included angle. The topic vector directly maps a topic into the vector space, approximately representing the semantic center of the topic.
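As a small illustration of measuring word-vector similarity with Euclidean distance or the cosine of the included angle, the following Python sketch compares two hypothetical embeddings; the vectors, their dimensionality and the "bank" example are illustrative assumptions, not trained values.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

# hypothetical 3-dimensional embeddings of "bank" in two different senses
v_bank_finance = np.array([0.8, 0.1, 0.3])
v_bank_river = np.array([0.1, 0.9, 0.2])
print(cosine_similarity(v_bank_finance, v_bank_river))  # low cosine: dissimilar
print(euclidean_distance(v_bank_finance, v_bank_river))
```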
The topic model, word vectors and topic vectors can all be used for document representation and are mainly applied to tasks such as document clustering and document classification. Each of the three has its own strengths in text mining, and research has shown that combining the global information of a topic model with the local information of word vectors helps improve on the original models. Such research nevertheless has serious limitations: most of it treats the three in isolation, or trains one or two of them independently and then uses the training result to improve the remaining one; or it directly uses the training result from a larger training set as external knowledge to assist model training on other, smaller data sets.
Disclosure of Invention
Aiming at the problems in the prior art, the invention models a text document collection and draws on the respective advantages of the topic model, word vectors and topic vectors to provide a polysemy analysis method based on a topic model and vector space, so as to better mine the hidden topic information of the document collection.
In order to achieve the purpose, the invention adopts the following technical scheme:
A polysemy analysis method based on a topic model and a vector space comprises the following steps:
S1, taking formula (1) as the objective function, establish the polysemy topic model:
where $D$ is the collection of text documents, $M$ is the number of documents in the collection, $N_m$ is the number of words in the $m$-th document, $c$ is the size of the context window, $w_{m,n}$ denotes the $n$-th word of the $m$-th document, $K$ denotes the number of topics, $t_k$ denotes the $k$-th topic vector, $\varphi$ denotes the topic-word distribution of the topic model, and $z_{m,n}$ denotes the topic number of $w_{m,n}$;
S2, read the data of the whole document collection $D$;
S3, topic-word distribution $\varphi$ initialization: first, use the GibbsLDA algorithm to perform topic sampling on each word in the text document collection $D$; then make an initialization estimate of the topic-word distribution $\varphi$ of the topic model;
S4, topic sampling: for each word $w_{m,n}$ in each document, calculate the probability that the word belongs to each topic, and sample the corresponding topic number $z_{m,n}\in[1,K]$ by the cumulative-distribution method;
S5, topic vector updating: for each topic vector $t_k$, $k\in[1,K]$, recompute its vector representation according to formula (5):

$$t_k = \frac{\sum_{m=1}^{M}\sum_{n=1}^{N_m}\mathbb{1}(z_{m,n}=k)\,v_{w_{m,n}}}{\sum_{w=1}^{W} n_{k,w}} \qquad (5)$$

where $\mathbb{1}(x)$ is the indicator function, equal to 1 when $x$ is true and 0 otherwise, $v_{w_{m,n}}$ denotes the word vector corresponding to $w_{m,n}$, $W$ denotes the vocabulary size of the document collection, and $n_{k,w}$ denotes the number of times word $w$ is assigned to topic $k$;
S6, word vector training: construct a Huffman tree whose leaf nodes are the words $w$ of the vocabulary and whose non-leaf nodes serve as auxiliary vectors $u$, and solve the objective function shown in formula (1) by stochastic gradient descent;
s7, circularly executing S4-S6 for a plurality of times to perform a plurality of iterations;
s8, outputting and storing the obtained word vector and the obtained theme vector;
s9, judging whether the word is ambiguous: splicing the word vector and the subject vector of the word to be analyzed to form a new vector representing the whole context environment, then calculating the cosine value of the new vector, and when the cosine value is smaller than a set threshold value, determining that the word has a word ambiguity phenomenon; otherwise, the word is determined not to have a word ambiguity phenomenon.
Further, in S3, the update rule used in the topic sampling process is as shown in formula (2):

$$p(z_{m,n}=k \mid \mathbf{z}_{-(m,n)}, \mathbf{w}) \propto \left(n_{m,k}^{-(m,n)} + \alpha\right)\cdot\frac{n_{k,w_{m,n}}^{-(m,n)} + \beta}{n_k^{-(m,n)} + W\beta} \qquad (2)$$

where $-(m,n)$ denotes excluding the current word from the counts, $W$ denotes the number of words in the text document collection $D$, $n_{m,k}$ denotes the number of words in the $m$-th document belonging to topic $k$, $z_{m,n}$ denotes the topic number assigned to $w_{m,n}$, $n_{k,w_{m,n}}$ denotes the number of times $w_{m,n}$ is assigned to topic $k$, $n_k$ denotes the number of all words assigned to topic $k$, and $\alpha$ is the symmetric Dirichlet hyperparameter;
the formula used for the initialization estimate of the topic-word distribution $\varphi$ of the topic model is formula (3):

$$\hat{\varphi}_{k,w} = \frac{n_{k,w} + \beta}{n_k + W\beta} \qquad (3)$$

where $\hat{\varphi}$ denotes the topic-word distribution of the initialization estimate and $\beta$ denotes the symmetric Dirichlet hyperparameter.
Further, in S4, the probability that the word belongs to each topic is calculated according to equation (4):
Further, in S6, the method specifically comprises the following steps:
S601, update the topic-word distribution $\varphi$: compute the gradient of each component of $\varphi$ according to formula (6); for each component, define the probability constraints $\varphi_{k,w}\ge 0$ and $\sum_{w=1}^{W}\varphi_{k,w}=1$;
where $L(w_{m,n+j})$ denotes the length of the path from the root node of the Huffman tree to the leaf node $w_{m,n+j}$ (the number of nodes, including the root node and the leaf node), $d_i$ denotes the Huffman code of the step $i\to i+1$ on the path, $\sigma(x)=1/(1+e^{-x})$, and $u_i$ denotes the $i$-th non-leaf node on the path;
S602, update the word vector $w$: calculate the gradient of each word according to formula (7) and update it using the auxiliary vectors;
S603, update the auxiliary vectors $u$ of the non-leaf nodes of the Huffman tree: calculate the non-leaf node vectors $u$ on the Huffman tree paths according to formula (8), so that they can influence the training quality of the word vectors $w$.
Further, in S9, the set threshold value is 0.6.
The polysemy analysis method based on a topic model and vector space provided by the invention can train higher-quality word vectors and topic vectors, so that they admit a more reasonable interpretation in the research and analysis of polysemy, and the performance of the topic model is clearly superior to that of the original LDA model. Through cross-learning among the topic model, the word vectors and the topic vectors, the three mutually improve one another, and the method can be effectively applied to tasks such as similarity evaluation, document classification and topic relevance.
Drawings
Fig. 1 is a schematic flowchart of the polysemy analysis method based on a topic model and vector space according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
In order to make full use of the intrinsic characteristics of the topic model, word vectors and topic vectors, to account for the pervasiveness of polysemy in text data, and to better mine the latent topic information of a document collection while training higher-quality word vectors and topic vectors, the invention provides a polysemy analysis method based on a topic model and vector space.
Specifically, the present invention makes the following reasonable assumptions according to the basic rules of natural language processing:
1. In the topic-word distribution $\varphi$ of the topic model, a series of words with higher probability can be used to represent a specific concept; the numerical meaning is the probability that a certain word appears under the topic, and the quality of the mined topics can be evaluated through topic relevance.
2. Each word in a text can be mapped into a low-dimensional real-valued vector space, i.e., a word vector, which contains information such as the syntax and semantics of the word; the differences between word vectors can be evaluated by mathematical means such as Euclidean distance or cosine.
3. The topic vectors and the topic-word distribution $\varphi$ of the topic model are not completely isolated from each other: a topic vector can be viewed as a mapping of the semantic center of the probability distribution into the word vector space, and it is closely associated with the word vectors.
Based on the above assumptions, the invention provides a polysemy analysis method based on a topic model and vector space; as shown in Fig. 1, the method comprises the following steps:
S1, taking formula (1) as the objective function, establish the polysemy topic model:
where $D$ is the collection of text documents, $M$ is the number of documents in the collection, $N_m$ is the number of words in the $m$-th document, $c$ is the size of the context window, $w_{m,n}$ denotes the $n$-th word of the $m$-th document, $K$ denotes the number of topics, $t_k$ denotes the $k$-th topic vector, $\varphi$ denotes the topic-word distribution of the topic model, and $z_{m,n}$ denotes the topic number of $w_{m,n}$;
S2, read the data of the whole text document collection $D$;
S3, topic-word distribution $\varphi$ initialization: first, use the GibbsLDA algorithm to perform topic sampling on each word in the text document collection $D$; then make an initialization estimate of the topic-word distribution $\varphi$ of the topic model;
The update rule used in the topic sampling process is shown in formula (2):

$$p(z_{m,n}=k \mid \mathbf{z}_{-(m,n)}, \mathbf{w}) \propto \left(n_{m,k}^{-(m,n)} + \alpha\right)\cdot\frac{n_{k,w_{m,n}}^{-(m,n)} + \beta}{n_k^{-(m,n)} + W\beta} \qquad (2)$$

where $-(m,n)$ denotes excluding the current word from the counts, $W$ denotes the number of words in the text document collection $D$, $n_{m,k}$ denotes the number of words in the $m$-th document belonging to topic $k$, $z_{m,n}$ denotes the topic number assigned to $w_{m,n}$, $n_{k,w_{m,n}}$ denotes the number of times $w_{m,n}$ is assigned to topic $k$, $n_k$ denotes the number of all words assigned to topic $k$, and $\alpha$ is the symmetric Dirichlet hyperparameter;
The formula used for the initialization estimate of the topic-word distribution $\varphi$ of the topic model is formula (3):

$$\hat{\varphi}_{k,w} = \frac{n_{k,w} + \beta}{n_k + W\beta} \qquad (3)$$

where $\hat{\varphi}$ denotes the topic-word distribution of the initialization estimate and $\beta$ denotes the symmetric Dirichlet hyperparameter;
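A minimal Python sketch of the S3 initialization under the count conventions defined around formulas (2) and (3) may look as follows; the corpus format (documents as lists of integer word ids), the hyperparameter defaults and the random initial assignment are illustrative assumptions, not the patent's exact implementation.

```python
import numpy as np

def init_topic_word(docs, K, W, alpha=0.1, beta=0.01, rng=None):
    """S3: random topic assignment followed by the initialization estimate of
    the topic-word distribution (formula (3))."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_mk = np.zeros((len(docs), K))   # words of document m assigned to topic k
    n_kw = np.zeros((K, W))           # word w assigned to topic k
    n_k = np.zeros(K)                 # all words assigned to topic k
    z = []                            # topic assignment z_{m,n} of every word
    for m, doc in enumerate(docs):
        z_m = []
        for w in doc:
            k = int(rng.integers(K))  # random initial topic for this occurrence
            z_m.append(k)
            n_mk[m, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
        z.append(z_m)
    # initialization estimate: phi_{k,w} = (n_{k,w} + beta) / (n_k + W * beta)
    phi = (n_kw + beta) / (n_k[:, None] + W * beta)
    return z, n_mk, n_kw, n_k, phi
```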
S4, topic sampling: for each word $w_{m,n}$ in each document, calculate the probability that the word belongs to each topic according to formula (4), and sample the corresponding topic number $z_{m,n}\in[1,K]$ by the cumulative-distribution method:

$$p(z_{m,n}=k \mid \mathbf{z}_{-(m,n)}, \mathbf{w}) \propto (n_{m,k} + \alpha)\,\varphi_{k,w_{m,n}} \qquad (4)$$
In the topic model, the latent variable $z$ is an indispensable intermediate bridge in the Gibbs sampling solution process, and it directly influences the topic-word distribution $\varphi$ that the topic model ultimately needs to obtain, as well as the document-topic distribution $\theta$. Unlike the original Gibbs update rule, the invention adopts formula (4) as the Gibbs update rule, which is characterized by using the topic-word distribution $\varphi$ directly in the update rule. The beneficial effect is that the statistical and practical meaning of the distribution $\varphi$ can be fully exploited, the calculation is faster, and the method is better suited to applications on large-scale data sets.
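The cumulative-distribution sampling of S4 can be sketched as below, using the reconstructed form of formula (4) in which $\varphi$ enters the update rule directly; the function name and signature are illustrative assumptions.

```python
import numpy as np

def sample_topic(n_mk_row, phi_col, alpha, rng):
    """S4: draw a topic number in [0, K) by inverting the cumulative
    distribution of the unnormalized topic probabilities."""
    p = (n_mk_row + alpha) * phi_col   # reconstructed formula (4), see lead-in
    cdf = np.cumsum(p)
    u = rng.random() * cdf[-1]         # uniform draw scaled to the total mass
    return int(np.searchsorted(cdf, u))
```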
S5, topic vector updating: for each topic vector $t_k$, $k\in[1,K]$, recompute its vector representation according to formula (5):

$$t_k = \frac{\sum_{m=1}^{M}\sum_{n=1}^{N_m}\mathbb{1}(z_{m,n}=k)\,v_{w_{m,n}}}{\sum_{w=1}^{W} n_{k,w}} \qquad (5)$$

where $\mathbb{1}(x)$ is the indicator function, equal to 1 when $x$ is true and 0 otherwise, $v_{w_{m,n}}$ denotes the word vector corresponding to $w_{m,n}$, $W$ denotes the vocabulary size of the text document collection, and $n_{k,w}$ denotes the number of times word $w$ is assigned to topic $k$. That a word may belong to different topics is itself a subtle manifestation of polysemy. Furthermore, the calculation and updating of the topic vectors can be carried out simultaneously without mutual interference.
The primary purpose of the topic vector is to represent the latent topic information of the document collection in vector space, rather than as a multinomial distribution like $\varphi$, so that a topic acquires more spatial-geometric meaning and combines more closely with the word vectors. Unlike the TWE model, which trains topic vectors in a Skip-Gram-like manner, the invention directly updates the vector representation of each topic by formula (5). This update is characterized by the fact that the vector representation of a topic is related only to the words under that topic; its advantages are that the calculation is simple, easy to understand, fast and efficient, and that taking the mean brings the vector close to the geometric center of the word vectors, which by assumption 2 can be regarded approximately as the semantic center of the topic concept.
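A sketch of the S5 update as the mean over the word vectors of the occurrences assigned to each topic, i.e. the geometric-center reading of formula (5); the data layout follows the assumptions of the earlier sketches.

```python
import numpy as np

def update_topic_vectors(docs, z, word_vecs, K):
    """S5: recompute each topic vector t_k as the mean of the word vectors of
    all word occurrences currently assigned to topic k (formula (5))."""
    sums = np.zeros((K, word_vecs.shape[1]))
    counts = np.zeros(K)
    for m, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[m][n]               # current topic assignment of w_{m,n}
            sums[k] += word_vecs[w]   # accumulate the word vector v_{w_{m,n}}
            counts[k] += 1
    counts[counts == 0] = 1           # guard against topics with no words
    return sums / counts[:, None]     # the mean approximates the semantic center
```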
S6, word vector training: construct a Huffman tree whose leaf nodes are the words $w$ of the vocabulary and whose non-leaf nodes serve as auxiliary vectors $u$, and solve the objective function shown in formula (1) by stochastic gradient descent; specifically, S6 includes the following steps:
S601, update the topic-word distribution $\varphi$: compute the gradient of each component of $\varphi$ according to formula (6); note that since the distribution has a probabilistic meaning, each component must satisfy the constraints $\varphi_{k,w}\ge 0$ and $\sum_{w=1}^{W}\varphi_{k,w}=1$;
where $L(w_{m,n+j})$ denotes the length of the path from the root node of the Huffman tree to the leaf node $w_{m,n+j}$ (the number of nodes, including the root node and the leaf node), $d_i$ denotes the Huffman code of the step $i\to i+1$ on the path, $\sigma(x)=1/(1+e^{-x})$, and $u_i$ denotes the $i$-th non-leaf node on the path;
S602, update the word vector $w$: calculate the gradient of each word according to formula (7) and update it using the auxiliary vectors;
S603, update the auxiliary vectors $u$ of the non-leaf nodes of the Huffman tree. This step mainly calculates the non-leaf node vectors $u$ on the Huffman tree paths according to formula (8), so that they influence the training quality of the leaf nodes (i.e., the word vectors $w$);
The computational complexity of the softmax function is linear in the vocabulary size $W$, which is unfavorable for training on large-scale data sets. The invention therefore follows the approximate calculation method of Skip-Gram and adopts the idea of hierarchical softmax to construct a Huffman tree, whose leaf nodes are the words $w$ of the vocabulary and whose non-leaf nodes serve as auxiliary vectors $u$. In the word vector training stage, the invention solves the objective function shown in formula (1) by stochastic gradient descent; the characteristic is that the topic-word distribution $\varphi$ shown in formula (6) directly uses the topic vectors $t_k$ and the non-leaf node vectors $u$ of the Huffman tree. The beneficial effect is that the topic-word distribution $\varphi$ continuously absorbs the information of the topic vectors $t_k$ in its iterative updates and exchanges information with the word vectors through the auxiliary vectors $u$, so that the update of $\varphi$ directly or indirectly exploits the spatial characteristics of the topic vectors and word vectors.
Further, the calculation of the update gradients of the node vectors in the Huffman tree is shown in formulas (7) and (8), respectively; the characteristic is that the update of the non-leaf vectors $u$ directly uses the topic vectors and the topic distribution $\varphi$, while the update of the leaf nodes $w$ directly uses the non-leaf vectors. The advantage is that the topic vectors and the topic distribution $\varphi$ permeate the non-leaf nodes (i.e., the branches) of the whole Huffman tree, so that the leaf nodes are influenced at a deeper level by both leaf and branch nodes, achieving a mutually reinforcing effect.
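Since formulas (6)-(8) are not reproduced in this text, the following sketch shows only the general shape of one hierarchical-softmax gradient step along a Huffman path; the way the topic vector is combined with the word vector (here simply added into the hidden representation) is an assumption, not the patent's exact equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_update(v_w, t_k, path_nodes, path_codes, U, lr=0.025):
    """One stochastic-gradient step for a (word, context word) pair along the
    context word's Huffman path. U holds the non-leaf auxiliary vectors and is
    updated in place; the returned vector is the accumulated gradient the
    caller adds to the word (and, under this sketch's assumption, topic)
    representation."""
    h = v_w + t_k                      # ASSUMED combination of word and topic vector
    e = np.zeros_like(h)
    for i, d in zip(path_nodes, path_codes):
        f = sigmoid(np.dot(U[i], h))   # branch probability at this non-leaf node
        g = lr * (1.0 - d - f)         # standard hierarchical-softmax gradient scale
        e += g * U[i]                  # gradient flowing back to h (formula (7) analogue)
        U[i] += g * h                  # auxiliary vector update (formula (8) analogue)
    return e
```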
S7, execute S4 to S6 in a loop several times to perform several iterations; the main purpose of the iterations is to further improve the topic model, the word vectors and the topic vectors through cross-learning.
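Tying the steps together, a driver loop for S2-S8 might look as follows, reusing the illustrative helpers from the earlier sketches (init_topic_word, sample_topic, update_topic_vectors, hs_update); the schedule, dimensions and output file names are assumptions.

```python
import numpy as np

def train(docs, K, W, dim=100, iters=10, alpha=0.1, beta=0.01):
    rng = np.random.default_rng(0)
    z, n_mk, n_kw, n_k, phi = init_topic_word(docs, K, W, alpha, beta, rng)  # S3
    word_vecs = (rng.random((W, dim)) - 0.5) / dim    # small random word vectors
    topic_vecs = np.zeros((K, dim))
    for _ in range(iters):                            # S7: cross-learning iterations
        for m, doc in enumerate(docs):                # S4: resample every z_{m,n}
            for n, w in enumerate(doc):
                n_mk[m, z[m][n]] -= 1                 # remove the current assignment
                z[m][n] = sample_topic(n_mk[m], phi[:, w], alpha, rng)
                n_mk[m, z[m][n]] += 1
        topic_vecs = update_topic_vectors(docs, z, word_vecs, K)   # S5
        # S6 would run the Huffman-tree stochastic-gradient pass here (see
        # hs_update above) and refresh phi from its gradient step.
    np.save("word_vectors.npy", word_vecs)            # S8: persist the results
    np.save("topic_vectors.npy", topic_vecs)
    return word_vecs, topic_vecs
```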
S8, output and store the resulting word vectors and topic vectors;
S9, judging whether a word is polysemous: concatenate the word vector and the topic vector of the word to be analyzed to form a new vector representing the whole context; then calculate the cosine value between the new vectors obtained in different contexts; when the cosine value is smaller than the set threshold (e.g., 0.60), the word exhibits polysemy; otherwise, the meaning of the word is consistent across the given contexts and the word exhibits no polysemy.
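A sketch of the S9 decision: the same word's concatenated word-plus-topic representation is built in two different contexts and the two representations are compared by cosine against the 0.6 threshold from the text; the two-context framing and names are illustrative assumptions.

```python
import numpy as np

def is_polysemous(v_w, t_context1, t_context2, threshold=0.6):
    """S9: concatenate the word vector with the topic vector of each context,
    then compare the two context representations by cosine."""
    a = np.concatenate([v_w, t_context1])
    b = np.concatenate([v_w, t_context2])
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return cos < threshold             # below the threshold: the senses diverge
```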
The above-described embodiments express only several embodiments of the present invention, and their description is comparatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and all of these fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (5)
1. A polysemy analysis method based on a topic model and a vector space, characterized by comprising the following steps:
S1, taking formula (1) as the objective function, establish the polysemy topic model:
where $D$ is the collection of text documents, $M$ is the number of documents in the collection, $N_m$ is the number of words in the $m$-th document, $c$ is the size of the context window, $w_{m,n}$ denotes the $n$-th word of the $m$-th document, $K$ denotes the number of topics, $t_k$ denotes the $k$-th topic vector, $\varphi$ denotes the topic-word distribution of the topic model, and $z_{m,n}$ denotes the topic number of $w_{m,n}$;
S2, read the data of the whole document collection $D$;
S3, topic-word distribution $\varphi$ initialization: first, use the GibbsLDA algorithm to perform topic sampling on each word in the text document collection $D$; then make an initialization estimate of the topic-word distribution $\varphi$ of the topic model;
S4, topic sampling: for each word $w_{m,n}$ in each document, calculate the probability that the word belongs to each topic, and sample the corresponding topic number $z_{m,n}\in[1,K]$ by the cumulative-distribution method;
S5, topic vector updating: for each topic vector $t_k$, $k\in[1,K]$, recompute its vector representation according to formula (5):

$$t_k = \frac{\sum_{m=1}^{M}\sum_{n=1}^{N_m}\mathbb{1}(z_{m,n}=k)\,v_{w_{m,n}}}{\sum_{w=1}^{W} n_{k,w}} \qquad (5)$$

where $\mathbb{1}(x)$ is the indicator function, equal to 1 when $x$ is true and 0 otherwise, $v_{w_{m,n}}$ denotes the word vector corresponding to $w_{m,n}$, $W$ denotes the vocabulary size of the document collection, and $n_{k,w}$ denotes the number of times word $w$ is assigned to topic $k$;
S6, word vector training: construct a Huffman tree whose leaf nodes are the words $w$ of the vocabulary and whose non-leaf nodes serve as auxiliary vectors $u$, and solve the objective function shown in formula (1) by stochastic gradient descent;
S7, execute S4 to S6 in a loop several times, so as to perform several iterations;
S8, output and store the resulting word vectors and topic vectors;
S9, judging whether a word is polysemous: concatenate the word vector and the topic vector of the word to be analyzed to form a new vector representing the whole context, then calculate the cosine value between the new vectors; when the cosine value is smaller than a set threshold, the word is judged to exhibit polysemy; otherwise, the word is judged not to exhibit polysemy.
2. The analysis method according to claim 1, characterized in that in S3, the update rule used in the topic sampling process is as shown in formula (2):

$$p(z_{m,n}=k \mid \mathbf{z}_{-(m,n)}, \mathbf{w}) \propto \left(n_{m,k}^{-(m,n)} + \alpha\right)\cdot\frac{n_{k,w_{m,n}}^{-(m,n)} + \beta}{n_k^{-(m,n)} + W\beta} \qquad (2)$$

where $-(m,n)$ denotes excluding the current word from the counts, $W$ denotes the number of words in the text document collection $D$, $n_{m,k}$ denotes the number of words in the $m$-th document belonging to topic $k$, $z_{m,n}$ denotes the topic number assigned to $w_{m,n}$, $n_{k,w_{m,n}}$ denotes the number of times $w_{m,n}$ is assigned to topic $k$, $n_k$ denotes the number of all words assigned to topic $k$, and $\alpha$ is the symmetric Dirichlet hyperparameter;
the formula used for the initialization estimate of the topic-word distribution $\varphi$ of the topic model is formula (3):

$$\hat{\varphi}_{k,w} = \frac{n_{k,w} + \beta}{n_k + W\beta} \qquad (3)$$

where $\hat{\varphi}$ denotes the topic-word distribution of the initialization estimate and $\beta$ denotes the symmetric Dirichlet hyperparameter.
3. The analysis method according to claim 1, characterized in that in S4, the probability that the word belongs to each topic is calculated according to formula (4):

$$p(z_{m,n}=k \mid \mathbf{z}_{-(m,n)}, \mathbf{w}) \propto (n_{m,k} + \alpha)\,\varphi_{k,w_{m,n}} \qquad (4)$$
4. The analysis method according to claim 1, characterized in that S6 specifically comprises the following steps:
S601, update the topic-word distribution $\varphi$: compute the gradient of each component of $\varphi$ according to formula (6); for each component, define the probability constraints $\varphi_{k,w}\ge 0$ and $\sum_{w=1}^{W}\varphi_{k,w}=1$;
where $L(w_{m,n+j})$ denotes the length of the path from the root node of the Huffman tree to the leaf node $w_{m,n+j}$ (the number of nodes, including the root node and the leaf node), $d_i$ denotes the Huffman code of the step $i\to i+1$ on the path, $\sigma(x)=1/(1+e^{-x})$, and $u_i$ denotes the $i$-th non-leaf node on the path;
S602, update the word vector $w$: calculate the gradient of each word according to formula (7) and update it using the auxiliary vectors;
S603, update the auxiliary vectors $u$ of the non-leaf nodes of the Huffman tree: calculate the non-leaf node vectors $u$ on the Huffman tree paths according to formula (8), so that they can influence the training quality of the word vectors $w$.
5. the analysis method according to claim 1, wherein in S9, the set threshold value is 0.6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710067919.6A CN106909537B (en) | 2017-02-07 | 2017-02-07 | One-word polysemous analysis method based on topic model and vector space |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106909537A true CN106909537A (en) | 2017-06-30 |
CN106909537B CN106909537B (en) | 2020-04-07 |
Family
ID=59208107
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710067919.6A Active CN106909537B (en) | 2017-02-07 | 2017-02-07 | One-word polysemous analysis method based on topic model and vector space |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106909537B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030217047A1 (en) * | 1999-03-23 | 2003-11-20 | Insightful Corporation | Inverse inference engine for high performance web search |
US20020022956A1 (en) * | 2000-05-25 | 2002-02-21 | Igor Ukrainczyk | System and method for automatically classifying text |
JP5754018B2 (en) * | 2011-07-11 | 2015-07-22 | 日本電気株式会社 | Polysemy extraction system, polysemy extraction method, and program |
CN103207899A (en) * | 2013-03-19 | 2013-07-17 | 新浪网技术(中国)有限公司 | Method and system for recommending text files |
CN103970730A (en) * | 2014-04-29 | 2014-08-06 | 河海大学 | Method for extracting multiple subject terms from single Chinese text |
Non-Patent Citations (2)
Title |
---|
Devorah E. Klein et al.: "The Representation of Polysemous Words", Journal of Memory and Language |
Zeng Qi et al.: "A Polysemous Word Vector Computation Method" (一种多义词词向量计算方法), Journal of Chinese Computer Systems (小型微型计算机系统) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108629009A (en) * | 2018-05-04 | 2018-10-09 | 南京信息工程大学 | Topic relativity modeling method based on FrankCopula functions |
CN108984526A (en) * | 2018-07-10 | 2018-12-11 | 北京理工大学 | A kind of document subject matter vector abstracting method based on deep learning |
CN108920467A (en) * | 2018-08-01 | 2018-11-30 | 北京三快在线科技有限公司 | Polysemant lexical study method and device, search result display methods |
CN109376226A (en) * | 2018-11-08 | 2019-02-22 | 合肥工业大学 | Complain disaggregated model, construction method, system, classification method and the system of text |
CN110134786A (en) * | 2019-05-14 | 2019-08-16 | 南京大学 | A kind of short text classification method based on theme term vector and convolutional neural networks |
CN110705304A (en) * | 2019-08-09 | 2020-01-17 | 华南师范大学 | Attribute word extraction method |
CN110705304B (en) * | 2019-08-09 | 2020-11-06 | 华南师范大学 | Attribute word extraction method |
CN110717015A (en) * | 2019-10-10 | 2020-01-21 | 大连理工大学 | Neural network-based polysemous word recognition method |
CN112052334A (en) * | 2020-09-02 | 2020-12-08 | 广州极天信息技术股份有限公司 | Text paraphrasing method, text paraphrasing device and storage medium |
CN112052334B (en) * | 2020-09-02 | 2024-04-05 | 广州极天信息技术股份有限公司 | Text interpretation method, device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106909537B (en) | 2020-04-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |