CN106909537B - One-word polysemous analysis method based on topic model and vector space - Google Patents

One-word polysemous analysis method based on topic model and vector space

Info

Publication number
CN106909537B
Authority
CN
China
Prior art keywords
word
topic
vector
representing
document
Prior art date
Legal status
Active
Application number
CN201710067919.6A
Other languages
Chinese (zh)
Other versions
CN106909537A (en)
Inventor
罗嘉文
卓汉逵
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201710067919.6A
Publication of CN106909537A
Application granted
Publication of CN106909537B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a word-polysemy analysis method based on a topic model and a vector space, which comprises the following steps: S1, establishing a word-polysemy topic model with formula (1) as the objective function; S2, reading the data of the whole document set D; S3, initializing the topic-word distribution φ; S4, topic sampling; S5, updating the topic vectors; S6, training the word vectors; S7, executing S4-S6 cyclically for several iterations; S8, outputting and storing the obtained word vectors and topic vectors; S9, judging whether a word is polysemous. The word-polysemy analysis method based on the topic model and the vector space can train higher-quality word vectors and topic vectors, so that polysemy can be analyzed and explained more reasonably, and the performance of the topic model is clearly superior to that of the original LDA model. Through cross-learning of the topic model, the word vectors and the topic vectors, the method improves similarity measurement and can be effectively applied to tasks such as similarity evaluation, document classification and topic relevance.

Description

One-word polysemous analysis method based on topic model and vector space
Technical Field
The invention relates to the field of natural language processing, in particular to a word polysemous analysis method based on a topic model and a vector space.
Background
With the vigorous development of artificial intelligence technology, natural language processing has become an innovative mode of language research that combines computer science, linguistics and mathematics, and it is widely applied to machine translation, question-answering systems, information retrieval, document processing and other areas. Since most words carry more than one meaning, that is, the phenomenon of word polysemy exists, representing each word by a single word vector cannot resolve this ambiguity. To solve the problem, context information or topic vectors have been used to assist the study of word polysemy, but such studies isolate the topic model, the word vectors and the topic vectors from one another and simply use existing results as prior knowledge to assist model training.
The topic model mines the latent topic information of a document set: each topic represents a related concept, embodied as a series of related words, and is realized as a topic-word distribution. The word vector model maps each word to a low-dimensional real-valued space by using context information in the text; the resulting vectors contain syntactic and semantic information, so the similarity of word vectors can be measured by Euclidean distance or by the cosine of the angle between them. The topic vector maps a topic directly into the vector space and approximately represents the semantic center of that topic.
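For example, a minimal sketch in Python (with made-up 4-dimensional vectors, for illustration only) of how the closeness of two word vectors can be measured by Euclidean distance or by the cosine of their angle:

```python
import numpy as np

# Hypothetical 4-dimensional word vectors; real word vectors are learned from text.
v_bank  = np.array([0.9, 0.1, 0.0, 0.2])
v_money = np.array([0.8, 0.2, 0.1, 0.1])

euclidean = np.linalg.norm(v_bank - v_money)   # smaller value = more similar
cosine = np.dot(v_bank, v_money) / (np.linalg.norm(v_bank) * np.linalg.norm(v_money))
print(euclidean, cosine)                        # cosine close to 1 indicates similar meaning
```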
The topic model, word vectors and topic vectors can all be used for document representation and are mainly applied to tasks such as document clustering and document classification. Each of the three has its own strengths in text mining, and research has shown that combining the global information of the topic model with the local information of word vectors helps improve the original models. However, this research has significant limitations: most of it treats the three components in isolation, or trains one or two of them separately and then uses the training result to improve the remaining one, or directly uses the training result of a larger training set as external knowledge to assist model training on other small data sets.
Disclosure of Invention
Aiming at the problems in the prior art, the invention models a text document set and draws on the advantages of the topic model, word vectors and topic vectors to provide a word-polysemy analysis method based on a topic model and a vector space, so as to better mine the latent topic information of the document set.
In order to achieve the purpose, the invention adopts the following technical scheme:
a word-polysemous analysis method based on a topic model and a vector space comprises the following steps:
S1, taking formula (1) as the objective function, establish the word-polysemy topic model:
[formula (1) appears as an image in the original document]
where D is the text document set, M is the number of documents in the set, N_m is the number of words in the m-th document, c is the size of the context window, w_{m,n} denotes the n-th word of the m-th document, K denotes the number of topics, t_k denotes the k-th topic vector, φ denotes the topic-word distribution of the topic model, and z_{m,n} denotes the topic number assigned to w_{m,n};
S2, reading the data of the whole text document set D;
S3, topic-word distribution φ initialization: first, the GibbsLDA algorithm is used to perform topic sampling on each word in the text document set D; then an initial estimate of the topic-word distribution φ of the topic model is computed;
S4, topic sampling: for each word w_{m,n} in each document, calculate the probability that the word belongs to each topic, and sample the corresponding topic number z_{m,n} ∈ [1, K] by the cumulative-distribution method;
S5, topic vector updating: for each topic vector t_k, k ∈ [1, K], recalculate its vector representation according to formula (5):
t_k = ( Σ_{m=1}^{M} Σ_{n=1}^{N_m} I(z_{m,n} = k) · v_{w_{m,n}} ) / ( Σ_{w=1}^{W} n_{k,w} )        (5)
where I(x) is the indicator function, equal to 1 when x is true and 0 otherwise, v_{w_{m,n}} is the word vector corresponding to w_{m,n}, W is the vocabulary size of the document set, and n_{k,w} is the number of times word w is assigned to topic k;
S6, training word vectors: construct a Huffman tree whose leaf nodes are the words w of the vocabulary and whose non-leaf nodes serve as auxiliary vectors u, and solve the objective function shown in formula (1) by stochastic gradient descent;
s7, circularly executing S4-S6 for a plurality of times to perform a plurality of iterations;
S8, outputting and storing the obtained word vectors and topic vectors;
S9, judging whether a word is polysemous: concatenate the word vector and the topic vector of the word to be analyzed to form a new vector that represents the whole context; then calculate the cosine value between the new vectors obtained from different contexts of the word; when the cosine value is smaller than a set threshold, the word is judged to exhibit polysemy; otherwise, the word is judged not to exhibit polysemy.
Further, in S3, the update rule used in the topic sampling process is as shown in equation (2):
p(z_{m,n} = k | z_{-(m,n)}, w) ∝ (n_{k,w_{m,n}}^{-(m,n)} + β) / (n_k^{-(m,n)} + W·β) · (n_{m,k}^{-(m,n)} + α)        (2)
where -(m,n) denotes that the current word is removed from the statistics, W denotes the vocabulary size of the text document set D, n_{m,k} denotes the number of words in the m-th document that belong to topic k, z_{m,n} denotes the topic number assigned to w_{m,n}, n_{k,w_{m,n}}^{-(m,n)} denotes the number of times w_{m,n} is assigned to topic k, n_k denotes the number of all words assigned to topic k, and α is the symmetric Dirichlet hyperparameter;
The formula used for the initialization estimate of the topic-word distribution φ of the topic model is formula (3):
φ_{k,w} = (n_{k,w} + β) / (n_k + W·β)        (3)
where φ_{k,w} here denotes the initialization estimate of the topic-word distribution, n_k denotes the number of all words assigned to topic k, and β denotes the symmetric Dirichlet hyperparameter.
Further, in S4, the probability that the word belongs to each topic is calculated according to equation (4):
p(z_{m,n} = k | z_{-(m,n)}, w) ∝ φ_{k,w_{m,n}} · (n_{m,k}^{-(m,n)} + α) / ( Σ_{k'=1}^{K} n_{m,k'}^{-(m,n)} + K·α )        (4)
further, in S6, the method specifically includes the following steps:
S601, updating the topic-word distribution φ: calculate the gradient of each component of φ according to formula (6); for each component, define its constraint as Σ_{w=1}^{W} φ_{k,w} = 1;
[formula (6) appears as an image in the original document]
where L(w_{m,n+j}) denotes the length of the path from the root node of the Huffman tree to the leaf node w_{m,n+j} (the number of nodes, including the root and the leaf node); formula (6) further uses the Huffman coding of the step i → i+1 on that path, the sigmoid function σ(x) = 1/(1 + e^{-x}), and the i-th non-leaf node vector on the path;
s602, updating a word vector w: calculating the gradient of each word according to the formula (7) and updating by using the auxiliary vector;
[formula (7) appears as an image in the original document]
s603, updating the auxiliary vector u of the non-leaf node of the Huffman tree: calculating a non-leaf node vector u on a Huffman tree path according to the formula (8) so that the non-leaf node vector u can influence the training quality of a word vector w;
[formula (8) appears as an image in the original document]
further, in S9, the set threshold value is 0.6.
The word-polysemy analysis method based on the topic model and the vector space can train higher-quality word vectors and topic vectors, so that polysemy can be analyzed and explained more reasonably, and the performance of the topic model is clearly superior to that of the original LDA model. Through cross-learning of the topic model, the word vectors and the topic vectors, the method improves similarity measurement and can be effectively applied to tasks such as similarity evaluation, document classification and topic relevance.
Drawings
Fig. 1 is a schematic flowchart of a word-polysemous analysis method based on a topic model and a vector space according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
In order to fully utilize the intrinsic characteristics of the topic model, word vectors and topic vectors, to account for the pervasiveness of word polysemy in text data, and to better mine the latent topic information of a document set while training higher-quality word vectors and topic vectors, the invention provides a word-polysemy analysis method based on a topic model and a vector space.
Specifically, the present invention makes the following reasonable assumptions according to the basic rules of natural language processing:
1. In the topic-word distribution φ of the topic model, a series of words with higher probability can represent a specific concept; the numerical value is the probability that a certain word appears under that topic, and the quality of the mined topics can be evaluated through topic relevance.
2. Each word in the text can be mapped into a low-dimensional real-valued vector space, i.e. a word vector, which contains syntactic and semantic information about the word, and the differences between word vectors can be evaluated by mathematical means such as Euclidean distance or cosine.
3. The topic vectors and the topic-word distribution φ of the topic model are not completely isolated from each other: a topic vector can be viewed as the semantic center of the probability distribution mapped into the word vector space, and is closely associated with the word vectors.
Based on the above assumptions, the present invention provides a word-polysemy analysis method based on a topic model and a vector space; as shown in fig. 1, the method includes the following steps:
S1, taking formula (1) as the objective function, establish the word-polysemy topic model:
[formula (1) appears as an image in the original document]
where D is the text document set, M is the number of documents in the set, N_m is the number of words in the m-th document, c is the size of the context window, w_{m,n} denotes the n-th word of the m-th document, K denotes the number of topics, t_k denotes the k-th topic vector, φ denotes the topic-word distribution of the topic model, and z_{m,n} denotes the topic number assigned to w_{m,n};
S2, reading the data of the whole text document set D;
S3, topic-word distribution φ initialization: first, the GibbsLDA algorithm is used to perform topic sampling on each word in the text document set D; then an initial estimate of the topic-word distribution φ of the topic model is computed;
The update rule used in the topic sampling process is shown in formula (2):
p(z_{m,n} = k | z_{-(m,n)}, w) ∝ (n_{k,w_{m,n}}^{-(m,n)} + β) / (n_k^{-(m,n)} + W·β) · (n_{m,k}^{-(m,n)} + α)        (2)
where -(m,n) denotes that the current word is removed from the statistics, W denotes the vocabulary size of the text document set D, n_{m,k} denotes the number of words in the m-th document that belong to topic k, z_{m,n} denotes the topic number assigned to w_{m,n}, n_{k,w_{m,n}}^{-(m,n)} denotes the number of times w_{m,n} is assigned to topic k, n_k denotes the number of all words assigned to topic k, and α is the symmetric Dirichlet hyperparameter;
The formula used for the initialization estimate of the topic-word distribution φ of the topic model is formula (3):
φ_{k,w} = (n_{k,w} + β) / (n_k + W·β)        (3)
where φ_{k,w} here denotes the initialization estimate of the topic-word distribution, n_k denotes the number of all words assigned to topic k, and β denotes the symmetric Dirichlet hyperparameter;
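A minimal Python sketch of this initialization step (S3) is given below, assuming the standard GibbsLDA update and the estimate of formula (3); the function name, arguments and hyperparameter defaults are illustrative only.

```python
import numpy as np

def gibbs_lda_init(docs, K, W, alpha=0.1, beta=0.01, sweeps=20, rng=None):
    """S3 sketch: collapsed Gibbs sampling over the corpus, followed by an initial
    estimate of the topic-word distribution phi (formula (3)).
    `docs` is a list of documents, each a list of word ids in [0, W)."""
    rng = rng or np.random.default_rng(0)
    n_mk = np.zeros((len(docs), K))            # words in document m assigned to topic k
    n_kw = np.zeros((K, W))                    # times word w assigned to topic k
    n_k = np.zeros(K)                          # total words assigned to topic k
    z = [rng.integers(0, K, size=len(d)) for d in docs]   # random initial assignments
    for m, d in enumerate(docs):
        for n, w in enumerate(d):
            k = z[m][n]
            n_mk[m, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(sweeps):
        for m, d in enumerate(docs):
            for n, w in enumerate(d):
                k_old = z[m][n]
                n_mk[m, k_old] -= 1; n_kw[k_old, w] -= 1; n_k[k_old] -= 1
                # standard GibbsLDA update (assumed form of formula (2))
                p = (n_kw[:, w] + beta) / (n_k + W * beta) * (n_mk[m] + alpha)
                k_new = rng.choice(K, p=p / p.sum())
                z[m][n] = k_new
                n_mk[m, k_new] += 1; n_kw[k_new, w] += 1; n_k[k_new] += 1
    phi = (n_kw + beta) / (n_k[:, None] + W * beta)        # formula (3)
    return phi, z, n_mk, n_kw, n_k
```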
S4, topic sampling: for each word w_{m,n} in each document, calculate the probability that the word belongs to each topic according to formula (4), and then sample the corresponding topic number z_{m,n} ∈ [1, K] by the cumulative-distribution method;
p(z_{m,n} = k | z_{-(m,n)}, w) ∝ φ_{k,w_{m,n}} · (n_{m,k}^{-(m,n)} + α) / ( Σ_{k'=1}^{K} n_{m,k'}^{-(m,n)} + K·α )        (4)
In the topic model, the latent variable z is an indispensable intermediate bridge in the Gibbs sampling procedure and directly influences the quality of the topic-word distribution φ and the document-topic distribution θ that the topic model ultimately needs to obtain. In contrast to the original Gibbs update rule, the present invention adopts formula (4) as the Gibbs update rule, which is characterized in that the topic-word distribution φ is used directly in the update. The beneficial effect is that the statistical and practical meaning of the distribution φ can be fully exploited, the computation speed is increased, and the method is better suited to large-scale data sets.
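A sketch of this sampling step for a single word occurrence, assuming the per-topic weight of formula (4) is proportional to φ[k, w] times the smoothed document-topic count, could look like the following; the function and argument names are illustrative.

```python
import numpy as np

def sample_topic(w, m, phi, n_mk, alpha, rng):
    """One application of S4: unnormalized probability of each topic for word w in
    document m (assumed form of formula (4)), followed by inverse-CDF sampling."""
    p = phi[:, w] * (n_mk[m] + alpha)     # per-topic weight, shape (K,)
    cdf = np.cumsum(p)                    # cumulative distribution over topics
    u = rng.random() * cdf[-1]            # uniform draw on [0, total mass)
    return int(np.searchsorted(cdf, u))   # smallest k with cdf[k] >= u
```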
S5, topic vector updating: for each topic vector t_k, k ∈ [1, K], recalculate its vector representation according to formula (5):
t_k = ( Σ_{m=1}^{M} Σ_{n=1}^{N_m} I(z_{m,n} = k) · v_{w_{m,n}} ) / ( Σ_{w=1}^{W} n_{k,w} )        (5)
where I(x) is the indicator function, equal to 1 when x is true and 0 otherwise, v_{w_{m,n}} is the word vector corresponding to w_{m,n}, W is the vocabulary size of the text document set, and n_{k,w} is the number of times word w is assigned to topic k. This is a subtle manifestation of word polysemy: one word may belong to different topics. Furthermore, the calculation and updating of the topic vectors can be carried out simultaneously without interfering with each other.
The primary purpose of the topic vector is to represent the latent topic information of a document set in vector space, rather than as a multinomial distribution such as φ, so that topics gain spatial-geometric meaning and combine more closely with the word vectors. Unlike the TWE model, which trains topic vectors with a Skip-Gram-like approach, the present invention uses formula (5) to directly update the vector representation corresponding to each topic; it is characterized in that the vector representation of each topic is related only to the words under that topic. The advantages are that the calculation of the topic vector is simple, easy to understand, fast and efficient; by using a mean calculation the vector lies close to the geometric center of the word vectors and, according to assumption 2, can be approximately regarded as the semantic center of the topic concept.
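A short sketch of this update, reading formula (5) as the mean of the word vectors of the word occurrences currently assigned to each topic, could be written as follows; names and the data layout are assumptions for illustration.

```python
import numpy as np

def update_topic_vectors(docs, z, word_vecs, K):
    """S5 sketch: each topic vector t_k is recomputed as the mean of the word
    vectors of all word occurrences currently assigned to topic k."""
    dim = word_vecs.shape[1]
    t = np.zeros((K, dim))
    count = np.zeros(K)
    for m, d in enumerate(docs):
        for n, w in enumerate(d):
            k = z[m][n]
            t[k] += word_vecs[w]
            count[k] += 1
    nonzero = count > 0
    t[nonzero] /= count[nonzero, None]    # mean over the tokens of each topic
    return t
```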
S6, training word vectors: construct a Huffman tree whose leaf nodes are the words w of the vocabulary and whose non-leaf nodes serve as auxiliary vectors u, and solve the objective function shown in formula (1) by stochastic gradient descent; specifically, S6 includes the following steps:
S601, updating the topic-word distribution φ: calculate the gradient of each component of φ according to formula (6); note that, since the distribution has a probabilistic meaning, each component is subject to the constraint Σ_{w=1}^{W} φ_{k,w} = 1;
[formula (6) appears as an image in the original document]
where L(w_{m,n+j}) denotes the length of the path from the root node of the Huffman tree to the leaf node w_{m,n+j} (the number of nodes, including the root and the leaf node); formula (6) further uses the Huffman coding of the step i → i+1 on that path, the sigmoid function σ(x) = 1/(1 + e^{-x}), and the i-th non-leaf node vector on the path;
s602, updating a word vector w: calculating the gradient of each word according to the formula (7) and updating by using the auxiliary vector;
[formula (7) appears as an image in the original document]
s603, updating the auxiliary vector u of the non-leaf node of the Huffman tree. The step is mainly to calculate a non-leaf node vector u on the path of the Huffman tree according to the formula (8) so as to influence the training quality of leaf nodes (namely word vectors w);
[formula (8) appears as an image in the original document]
The computational complexity of the softmax function is linear in the vocabulary size W, which is unfavorable for training on large-scale data sets. The invention therefore follows the approximate calculation method of Skip-Gram and adopts the idea of hierarchical softmax: a Huffman tree is constructed in which the leaf nodes are the words w of the vocabulary and the non-leaf nodes serve as auxiliary vectors u. In the word vector training stage, the invention uses stochastic gradient descent to solve the objective function shown in formula (1); as shown in formula (6), the gradient of the topic-word distribution φ directly uses the topic vector t_k and the non-leaf node vectors u of the Huffman tree. The beneficial effect is that the topic-word distribution φ continuously absorbs information from the topic vectors t_k during the iterative updates and exchanges information with the word vectors through the auxiliary vectors u, so that the update of φ directly or indirectly exploits the spatial characteristics of the topic vectors and the word vectors.
Further, the update gradients of the node vectors in the Huffman tree are given by formula (7) and formula (8) respectively; they are characterized in that the update of the non-leaf vectors u directly uses the topic vectors and the topic-word distribution φ, while the update of the leaf nodes w directly uses the non-leaf vectors. The advantage is that the topic vectors and the topic-word distribution φ permeate the non-leaf nodes (i.e. the branches) of the whole Huffman tree, so that the leaf nodes of the Huffman tree are influenced at a deeper level by both the leaf and the branch nodes, achieving a mutually reinforcing effect.
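The following Python sketch illustrates the two building blocks named here: constructing a Huffman tree over the vocabulary and performing one hierarchical-softmax stochastic gradient step along a leaf's path. The patent's exact gradients (formulas (6) to (8)) are shown only as images in the original, so the update below uses the standard word2vec hierarchical-softmax form; adding the topic vector to the center word's representation, and all function and variable names, are assumptions made for illustration.

```python
import heapq
import itertools
import numpy as np

def build_huffman(freqs):
    """Huffman tree over the vocabulary. For each word id, returns the list of
    inner-node ids on its root-to-leaf path and the binary code of each step."""
    counter = itertools.count()
    inner_id = itertools.count()
    heap = [(f, next(counter), ("leaf", w)) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        node = ("inner", next(inner_id), left, right)
        heapq.heappush(heap, (f1 + f2, next(counter), node))
    paths, codes = {}, {}

    def walk(node, path, code):
        if node[0] == "leaf":
            paths[node[1]], codes[node[1]] = path, code
        else:
            _, nid, left, right = node
            walk(left, path + [nid], code + [0])
            walk(right, path + [nid], code + [1])

    walk(heap[0][2], [], [])
    return paths, codes

def hs_sgd_step(center_vec, topic_vec, context_word, paths, codes, U, lr=0.025):
    """One hierarchical-softmax SGD step predicting `context_word` from the center
    word. U holds the auxiliary vectors of the non-leaf nodes. Adding the topic
    vector to the center representation is only an assumed way of letting t_k
    enter the objective; the patent's formulas (6)-(8) are not reproduced here."""
    h = center_vec + topic_vec                        # combined input representation
    grad_h = np.zeros_like(h)
    for nid, d in zip(paths[context_word], codes[context_word]):
        f = 1.0 / (1.0 + np.exp(-np.dot(h, U[nid])))  # sigma(h . u) for this node
        g = (1.0 - d - f) * lr                        # gradient scale at this node
        grad_h += g * U[nid]                          # accumulate gradient for the input side
        U[nid] += g * h                               # update auxiliary vector (cf. (8))
    center_vec += grad_h                              # update the word vector (cf. (7))
```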
S7, circularly executing S4-S6 for a plurality of times to perform a plurality of iterations; the iteration is mainly performed to further improve the topic model, the word vector and the topic vector by cross learning.
S8, outputting and storing the obtained word vectors and topic vectors;
S9, judging whether a word is polysemous: concatenate the word vector and the topic vector of the word to be analyzed to form a new vector representing the whole context; then compute the cosine value between such new vectors obtained from different contexts of the word. When the cosine value is smaller than the set threshold (e.g. 0.60), the word exhibits polysemy; otherwise, the meaning of the word is consistent across the given contexts and the word exhibits no polysemy.
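A minimal sketch of this judgment, with hypothetical inputs (the word's vector and the topic vectors sampled for it in two different contexts), could be:

```python
import numpy as np

def is_polysemous(word_vec, topic_vec_ctx1, topic_vec_ctx2, threshold=0.6):
    """S9 sketch: concatenate the word vector with each context's topic vector,
    compare the two context representations by cosine, and flag polysemy when
    the cosine falls below the threshold (0.6 in the embodiment)."""
    a = np.concatenate([word_vec, topic_vec_ctx1])
    b = np.concatenate([word_vec, topic_vec_ctx2])
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos < threshold
```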
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (5)

1. A word-polysemous analysis method based on a topic model and a vector space is characterized by comprising the following steps:
S1, taking formula (1) as the objective function, establish the word-polysemy topic model:
[formula (1) appears as an image in the original document]
where D is the text document set, M is the number of documents in the set, N_m is the number of words in the m-th document, c is the size of the context window, w_{m,n} denotes the n-th word of the m-th document, K denotes the number of topics, t_k denotes the k-th topic vector, φ denotes the topic-word distribution of the topic model, and z_{m,n} denotes the topic number assigned to w_{m,n};
S2, reading the data of the whole text document set D;
S3, topic-word distribution φ initialization: first, the GibbsLDA algorithm is used to perform topic sampling on each word in the text document set D; then an initial estimate of the topic-word distribution φ of the topic model is computed;
S4, topic sampling: for each word w_{m,n} in each document, calculate the probability that the word belongs to each topic, and then sample the corresponding topic number z_{m,n} ∈ [1, K] by the cumulative-distribution method;
S5, topic vector updating: for each topic vector t_k, k ∈ [1, K], recalculate its vector representation according to formula (5):
t_k = ( Σ_{m=1}^{M} Σ_{n=1}^{N_m} I(z_{m,n} = k) · v_{w_{m,n}} ) / ( Σ_{w=1}^{W} n_{k,w} )        (5)
where I(x) is the indicator function, equal to 1 when x is true and 0 otherwise, v_{w_{m,n}} is the word vector corresponding to w_{m,n}, W is the vocabulary size of the document set, and n_{k,w} is the number of times word w is assigned to topic k;
S6, training word vectors: construct a Huffman tree whose leaf nodes are the words w of the vocabulary and whose non-leaf nodes serve as auxiliary vectors u, and solve the objective function shown in formula (1) by stochastic gradient descent;
s7, circularly executing S4-S6 for a plurality of times to perform a plurality of iterations;
S8, outputting and storing the obtained word vectors and topic vectors;
S9, judging whether a word is polysemous: concatenate the word vector and the topic vector of the word to be analyzed to form a new vector that represents the whole context; then calculate the cosine value between the new vectors obtained from different contexts of the word; when the cosine value is smaller than a set threshold, the word is judged to exhibit polysemy; otherwise, the word is judged not to exhibit polysemy.
2. The analysis method according to claim 1, wherein in S3, the update rule used in the topic sampling process is as shown in equation (2):
p(z_{m,n} = k | z_{-(m,n)}, w) ∝ (n_{k,w_{m,n}}^{-(m,n)} + β) / (n_k^{-(m,n)} + W·β) · (n_{m,k}^{-(m,n)} + α)        (2)
wherein -(m,n) means that the n-th word of the m-th document is removed from the statistics, W represents the vocabulary size of the text document set D, w denotes the vector of all the words in the text document set D, n_{m,k} represents the number of words belonging to topic k in the m-th document, z_{m,n} denotes the topic number assigned to w_{m,n}, z_{-(m,n)} represents the vector of all topic numbers of the document set D after the n-th word of the m-th document is removed, n_{k,w_{m,n}}^{-(m,n)} denotes the number of times w_{m,n} is assigned to topic k, n_k represents the number of all words assigned to topic k, and α is the symmetric Dirichlet hyperparameter;
The formula used for the initialization estimate of the topic-word distribution φ of the topic model is formula (3):
φ_{k,w} = (n_{k,w} + β) / (n_k + W·β)        (3)
where φ_{k,w} here denotes the initialization estimate of the topic-word distribution, n_k denotes the number of all words assigned to topic k, and β represents the symmetric Dirichlet hyperparameter.
3. The analysis method according to claim 1, wherein in S4, the probability that the word belongs to each topic is calculated according to equation (4):
p(z_{m,n} = k | z_{-(m,n)}, w) ∝ φ_{k,w_{m,n}} · (n_{m,k}^{-(m,n)} + α) / ( Σ_{k'=1}^{K} n_{m,k'}^{-(m,n)} + K·α )        (4)
where -(m,n) denotes that the n-th word of the m-th document is removed from the statistics, w denotes the vector of all the words in the text document set D, z_{-(m,n)} denotes the vector of all topic numbers of the document set D after the n-th word of the m-th document is removed, n_{m,k}^{-(m,n)} denotes the number of words belonging to topic k in the m-th document after the n-th word is removed, n_{m,k'}^{-(m,n)} denotes the number of words belonging to topic k' in the m-th document after the n-th word is removed, K denotes the number of topics, φ denotes the topic-word distribution of the topic model, and α is the symmetric Dirichlet hyperparameter.
4. The analysis method according to claim 1, wherein in S6, the method specifically comprises the steps of:
S601, updating the topic-word distribution φ: calculate the gradient of each component of φ according to formula (6); for each component, define its constraint as Σ_{w=1}^{W} φ_{k,w} = 1;
[formula (6) appears as an image in the original document]
where L(w_{m,n+j}) denotes the length of the path from the root node to the leaf node w_{m,n+j} in the Huffman tree, including the root node and the leaf node; formula (6) further uses the Huffman coding of the step i → i+1 on that path, the sigmoid function σ(x) = 1/(1 + e^{-x}), and the i-th non-leaf node vector on the path;
S602, updating the word vectors: calculate the gradient of each word according to formula (7) and update it using the auxiliary vectors;
[formula (7) appears as an image in the original document]
S603, updating the auxiliary vectors u of the non-leaf nodes of the Huffman tree: calculate the auxiliary vectors u of the non-leaf nodes of the Huffman tree according to formula (8) so that they influence the training quality of the word vectors;
[formula (8) appears as an image in the original document]
5. the analysis method according to claim 1, wherein in S9, the set threshold value is 0.6.
CN201710067919.6A 2017-02-07 2017-02-07 One-word polysemous analysis method based on topic model and vector space Active CN106909537B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710067919.6A CN106909537B (en) 2017-02-07 2017-02-07 One-word polysemous analysis method based on topic model and vector space

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710067919.6A CN106909537B (en) 2017-02-07 2017-02-07 One-word polysemous analysis method based on topic model and vector space

Publications (2)

Publication Number Publication Date
CN106909537A CN106909537A (en) 2017-06-30
CN106909537B true CN106909537B (en) 2020-04-07

Family

ID=59208107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710067919.6A Active CN106909537B (en) 2017-02-07 2017-02-07 One-word polysemous analysis method based on topic model and vector space

Country Status (1)

Country Link
CN (1) CN106909537B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108629009A (en) * 2018-05-04 2018-10-09 南京信息工程大学 Topic relativity modeling method based on FrankCopula functions
CN108984526B (en) * 2018-07-10 2021-05-07 北京理工大学 Document theme vector extraction method based on deep learning
CN108920467B (en) * 2018-08-01 2021-04-27 北京三快在线科技有限公司 Method and device for learning word meaning of polysemous word and search result display method
CN109376226A (en) * 2018-11-08 2019-02-22 合肥工业大学 Complain disaggregated model, construction method, system, classification method and the system of text
CN110134786B (en) * 2019-05-14 2021-09-10 南京大学 Short text classification method based on subject word vector and convolutional neural network
CN110705304B (en) * 2019-08-09 2020-11-06 华南师范大学 Attribute word extraction method
CN110717015B (en) * 2019-10-10 2021-03-26 大连理工大学 Neural network-based polysemous word recognition method
CN112052334B (en) * 2020-09-02 2024-04-05 广州极天信息技术股份有限公司 Text interpretation method, device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207899A (en) * 2013-03-19 2013-07-17 新浪网技术(中国)有限公司 Method and system for recommending text files
CN103970730A (en) * 2014-04-29 2014-08-06 河海大学 Method for extracting multiple subject terms from single Chinese text
JP5754018B2 (en) * 2011-07-11 2015-07-22 日本電気株式会社 Polysemy extraction system, polysemy extraction method, and program

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6510406B1 (en) * 1999-03-23 2003-01-21 Mathsoft, Inc. Inverse inference engine for high performance web search
WO2001090921A2 (en) * 2000-05-25 2001-11-29 Kanisa, Inc. System and method for automatically classifying text

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5754018B2 (en) * 2011-07-11 2015-07-22 日本電気株式会社 Polysemy extraction system, polysemy extraction method, and program
CN103207899A (en) * 2013-03-19 2013-07-17 新浪网技术(中国)有限公司 Method and system for recommending text files
CN103970730A (en) * 2014-04-29 2014-08-06 河海大学 Method for extracting multiple subject terms from single Chinese text

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
The Representation of Polysemous Words; Devorah E. Klein et al.; Journal of Memory and Language; 2001-12-31; pp. 259-282 *
一种多义词词向量计算方法 (A word-vector computation method for polysemous words); 曾琦 (Zeng Qi) et al.; 小型微型计算机系统 (Journal of Chinese Computer Systems); 2016-07-31; Vol. 37, No. 7; pp. 1417-1421 *

Also Published As

Publication number Publication date
CN106909537A (en) 2017-06-30

Similar Documents

Publication Publication Date Title
CN106909537B (en) One-word polysemous analysis method based on topic model and vector space
CN113792818B (en) Intention classification method and device, electronic equipment and computer readable storage medium
CN112214995B (en) Hierarchical multitasking term embedded learning for synonym prediction
CN111444726B (en) Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
CN108733742B (en) Global normalized reader system and method
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN110377903B (en) Sentence-level entity and relation combined extraction method
CN107085581B (en) Short text classification method and device
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN108628823A (en) In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
CN104657350A (en) Hash learning method for short text integrated with implicit semantic features
CN110688489B (en) Knowledge graph deduction method and device based on interactive attention and storage medium
CN111881677A (en) Address matching algorithm based on deep learning model
CN111881256B (en) Text entity relation extraction method and device and computer readable storage medium equipment
CN114358007A (en) Multi-label identification method and device, electronic equipment and storage medium
CN113051399B (en) Small sample fine-grained entity classification method based on relational graph convolutional network
WO2023004528A1 (en) Distributed system-based parallel named entity recognition method and apparatus
CN104699797A (en) Webpage data structured analytic method and device
Sun et al. Probabilistic Chinese word segmentation with non-local information and stochastic training
CN114742069A (en) Code similarity detection method and device
CN108875024B (en) Text classification method and system, readable storage medium and electronic equipment
CN114358020A (en) Disease part identification method and device, electronic device and storage medium
CN116720519B (en) Seedling medicine named entity identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant