CN108415901A - A short-text topic model based on word vectors and contextual information - Google Patents
A short-text topic model based on word vectors and contextual information
- Publication number: CN108415901A (application number CN201810124600.7A)
- Authority
- CN
- China
- Prior art keywords
- word
- topic
- document
- semantic
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The invention discloses a short-text topic model based on word vectors and contextual information. Semantic relations between words are extracted from word vectors, and this explicitly acquired semantic knowledge compensates for the lack of word co-occurrence in short-text data. The semantic relations between words are then further filtered with the training-set data so that they better fit the training data. A background topic is added to the generative process, and noise words in documents are modeled by this background topic. During inference the model is solved by Gibbs sampling, and a sampling strategy based on the Generalized Pólya Urn model is used to raise the probability, under related topics, of words with strong semantic relatedness, which greatly improves the semantic coherence of the words under each topic. A series of experiments shows that the proposed method can substantially improve the semantic coherence of topics and provides a new method for short-text topic modeling.
Description
Technical field
The invention belongs to the field of natural language processing and relates to a short-text topic model based on word vectors and contextual information.
Background art
With the development of social networks, short texts have become one of the main channels through which information spreads on the internet. Short-text data contain abundant information, and extracting topic information from them is very valuable. Probabilistic topic models are an effective way to extract topic information from document collections. A topic model is an unsupervised learning method: its input is document data and its output is the topic information contained in that data. Each topic can be regarded as a distribution over words, and the words with high probability under a topic reflect its semantic features; for example, if words such as "education", "university" and "student" have high probability under a topic, that topic is about education. The effectiveness of topic models depends largely on word co-occurrence information: the more often two words appear in the same document, the more likely they are to belong to the same topic. Classical topic models such as LDA and PLSA achieve good results on large-scale data.
Because word co-occurrence is sparse in short-text data, traditional topic models cannot effectively extract high-quality topics from short texts, and the semantic coherence of the resulting topics is low. To extract high-quality topics from short-text data, it is therefore necessary to make full use of external knowledge and of the features of the training data itself to obtain the semantic information of words, compensate for the lack of word co-occurrence information, and further apply this semantic information during modeling to improve the semantic coherence of the topics.
Summary of the invention
Building on existing research, the present invention proposes a short-text topic model based on word vectors and contextual information. It uses the semantic information of words to compensate for the effect of insufficient word co-occurrence and raises the probability that semantically related words appear under the same topic. At the same time, a background topic is introduced into the model to capture noise words, which can further improve the semantic coherence of the words under each topic.
Technical scheme of the present invention:
A short-text topic model based on word vectors and contextual information; the steps are as follows:
(1) Semantic-feature extraction stage
First, word vectors are trained on a large-scale corpus, the semantic similarity between any two words in the training set is computed from the word vectors, and a set of semantically related words is then obtained for each word in the training set.
(2) Semantic-information filtering stage
Because the word vectors are trained on large-scale text data, the semantic relatedness between words is not necessarily suited to the training data, so it needs to be further filtered according to the information in the training data.
(3) Generative-process modeling stage
With reference to the DMM model, the generative process of the model is defined. Each short document is assumed to have only one topic, and each word in a document is generated either by that topic or by a background topic. Each word in a document is associated with a binary indicator variable: when the variable is 0, the word comes from a normal topic; when it is 1, the word comes from the background topic, i.e. it is a background word.
(4) Model-parameter solving stage
According to the generative process, the hidden variables of the model are sampled by Gibbs sampling, and the model parameters can then be obtained by maximum a posteriori estimation. The sampling strategy of the Generalized Pólya Urn model (General Polya Urn model) is used to increase the counts of semantically related words under the same topic; after maximum a posteriori estimation on the samples, the probabilities of semantically related words under each topic increase, so the semantic coherence of the topics improves.
The beneficial effect of the present invention is a short-text topic model based on word vectors and contextual information that effectively uses word vectors and contextual information to obtain the semantic relatedness between words, compensating for the lack of word co-occurrence in short-text data. During model inference, the probabilities of words with strong semantic relatedness under related topics are increased together, which improves the semantic coherence of the topics to a certain extent. Background-topic information added to the generative process effectively captures the noise words in documents and can further improve topic coherence. Compared with recently proposed short-text topic models, the present model improves both efficiency and effectiveness, is robust across different data sets, and provides a new framework for short-text topic modeling.
Description of the drawings
Fig. 1 is the probabilistic graphical model representation of the method of the present invention.
Fig. 2 shows the F1 value of document classification using the topics extracted by the present invention on the Amazon data set as document features.
Fig. 3 shows the accuracy of document classification using the topics extracted by the present invention on the Amazon data set as document features.
Fig. 4 shows the F1 value of document classification using the topics extracted by the present invention on the web query data set as document features.
Fig. 5 shows the accuracy of document classification using the topics extracted by the present invention on the web query data set as document features.
Detailed description of the embodiments
Specific embodiments of the present invention are described below to further illustrate its starting point and the corresponding technical solution.
The present invention is a short-text topic model based on word vectors and contextual information whose main purpose is to automatically extract high-quality topic information from short-text data. The method can be divided into the following four steps:
(1) Obtaining the semantic similarity between words:
First, word vectors are trained on a Wikipedia corpus using Google's open-source tool word2vec. For English training data, the vectors must be trained on the English Wikipedia corpus; for Chinese training data, Chinese word vectors must be trained on the Chinese Wikipedia corpus. English training data are used as the example here; the training data are the Amazon review data set (Amazon Reviews) and the web query data set (Web Snippet). We therefore use the word2vec tool to train vectors for English words on English Wikipedia data, with the vector dimension set to 300.
The training data for the model must be preprocessed for the subsequent operations. First, the nltk natural-language-processing library for Python is used to split the text into sentences, and each sentence is then tokenized; for English text, the space is the separator between words. Next, stop words are filtered out, then words that appear in fewer than 5 documents are removed, and documents shorter than 3 words are discarded. After this processing, the word list V of the training data is obtained.
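The preprocessing pipeline can be sketched as below. The patent uses nltk for sentence splitting and tokenization; a minimal stdlib approximation is used here, and the toy stop-word list and the helper name `preprocess` are assumptions for illustration:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}  # toy list; the patent uses a full stop-word list

def preprocess(docs, min_doc_freq=5, min_doc_len=3):
    # Lowercase and tokenize on letter runs (whitespace-separated English words),
    # dropping stop words.
    tokenized = [[w for w in re.findall(r"[a-z']+", d.lower()) if w not in STOPWORDS]
                 for d in docs]
    # Document frequency of each word.
    df = Counter(w for doc in tokenized for w in set(doc))
    # Remove words appearing in fewer than min_doc_freq documents.
    tokenized = [[w for w in doc if df[w] >= min_doc_freq] for doc in tokenized]
    # Discard documents shorter than min_doc_len words.
    tokenized = [doc for doc in tokenized if len(doc) >= min_doc_len]
    # The surviving vocabulary is the word list V.
    vocab = sorted({w for doc in tokenized for w in doc})
    return tokenized, vocab
```

The thresholds default to the values given in the text (5 documents, 3 words) but are parameters so they can be tuned per data set.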
For words w_i and w_j with corresponding word vectors v_i and v_j, the semantic similarity between the words is defined as the cosine similarity between the vectors:
SR(w_i, w_j) = cos(v_i, v_j) = (v_i · v_j) / (||v_i|| ||v_j||)
For a word w in the word list V, its set of semantically related words S(w) is defined as S(w) = {w_o | SR(w, w_o) > ε}, where the value of ε depends on the data set and ε ∈ [0, 1].
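The similarity and the related-word set S(w) can be sketched directly from the definitions above. This is a minimal pure-Python sketch; the function names and the dictionary-of-lists representation of the vectors are assumptions:

```python
import math

def cos_sim(v1, v2):
    # SR(wi, wj) = (vi . vj) / (||vi|| ||vj||)
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)

def related_words(w, vectors, eps):
    # S(w) = {wo | SR(w, wo) > eps}, with eps in [0, 1].
    return {wo for wo, v in vectors.items()
            if wo != w and cos_sim(vectors[w], v) > eps}
```

With real 300-dimensional word2vec vectors one would typically vectorize this with numpy, but the logic is the same.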
(2) Filtering the semantic similarity information between words with the training data
In previous work, word vectors have been applied to the topic-model field as external knowledge. Word vectors are typically trained on large-scale text data, so the semantic information they contain may not suit the training data; for example, "bachelor" and "undergraduate" may have little association under a "family" topic. We therefore use pointwise mutual information (Point Mutual Information, PMI) to filter the semantic similarity information between words so that it better fits the short-text training data. Given words w_i and w_j, the PMI between them is defined as:
PMI(w_i, w_j) = log [ p(w_i, w_j) / (p(w_i) p(w_j)) ]
where p(w_i, w_j) is the probability that w_i and w_j co-occur in the same document, estimated as p(w_i, w_j) = |D_{w_i,w_j}| / |D|, in which |D_{w_i,w_j}| is the number of documents in which w_i and w_j co-occur and |D| is the total number of documents in the training set; p(w) is the probability that word w occurs in the document collection, estimated from the document frequency of the word as p(w) = |D_w| / |D|. According to PMI, the set S(w) of word w is redefined as S(w) = {w_o | SR(w, w_o) > ε, PMI(w, w_o) ≥ η}. That is, if two semantically related words have little association in the training data, then they are most likely not semantically related within the training set. Here η ∈ (−∞, +∞), with the specific value depending on the data set.
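The document-frequency estimates of p(w) and p(w_i, w_j) and the resulting PMI can be sketched as follows; the factory-function structure (`pmi_table` returning a closure) is an implementation choice assumed here, not taken from the patent:

```python
import math
from itertools import combinations
from collections import Counter

def pmi_table(docs):
    # Estimate p(w) = |D_w| / |D| and p(wi, wj) = |D_{wi,wj}| / |D|
    # from document frequencies, as in the text.
    n_docs = len(docs)
    df = Counter()      # document frequency of each word
    co_df = Counter()   # document co-occurrence frequency of each word pair
    for doc in docs:
        words = set(doc)
        df.update(words)
        co_df.update(frozenset(p) for p in combinations(sorted(words), 2))

    def pmi(wi, wj):
        joint = co_df[frozenset((wi, wj))] / n_docs
        if joint == 0:
            return float("-inf")  # never co-occur: PMI -> -infinity
        return math.log(joint / ((df[wi] / n_docs) * (df[wj] / n_docs)))

    return pmi
```

Filtering then keeps w_o in S(w) only when `pmi(w, w_o) >= eta` for the chosen threshold η.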
(3) Defining the generative process of the model
Since the method targets short-text data, the DMM and Twitter-LDA models can be drawn on to define the generative process of this method. In a topic model, the generative process is the assumed procedure by which documents are generated. First assume that the document collection has K topics and that each short document involves only one topic; because short documents are limited in length, generally under 100 words, this assumption is simple and reasonable. Suppose a short document d is associated with topic k. Not every word in the document is related to topic k: for example, words that appear in most documents can be regarded as background words, which make sentences more complete or expression cleaner. A global background topic B can therefore be set up to generate the background words. Each word w in document d, whose topic is z, is associated with a binary indicator variable y: if y = 0, the word comes from topic z; if y = 1, the word comes from the background topic B. The topic is sampled from a multinomial distribution θ, which is itself sampled from a Dirichlet distribution with parameter α. For each topic k and the background topic B, the multinomial distribution over words φ_k is sampled from a Dirichlet distribution with parameter β. The complete generative process is as follows:
a) Sample the topic distribution of the document collection from a Dirichlet distribution with parameter α: θ ~ Dirichlet(α)
b) For the background topic, sample the multinomial distribution over words from a Dirichlet distribution with parameter β: φ_B ~ Dirichlet(β)
c) Sample the distribution of the binary indicator variable: ψ ~ Dirichlet(γ)
d) For each topic k, sample the topic-word distribution: φ_k ~ Dirichlet(β)
e) For every document d in the collection, first sample the topic of the document, z_d ~ Multinomial(θ). For the i-th word of document d, first sample a binary indicator variable y_{d,i} ~ Bernoulli(ψ); if y_{d,i} = 0, the word is generated from topic z_d, i.e. w_{d,i} ~ Multinomial(φ_{z_d}); if y_{d,i} = 1, the word is generated from the background topic, i.e. w_{d,i} ~ Multinomial(φ_B), where w_{d,i} denotes the i-th word of document d.
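The generative process in steps a)–e) can be forward-simulated. The sketch below is illustrative: the helper `generate_corpus`, the symmetric hyperparameter defaults, and the gamma-variate construction of Dirichlet draws are assumptions, not part of the patent:

```python
import random

def generate_corpus(n_docs, doc_len, K, vocab, alpha=0.1, beta=0.01, gamma=0.5):
    # Forward simulation: one topic per short document; each word is drawn
    # either from that topic (y = 0) or from the background topic B (y = 1).
    def dirichlet(dim, conc):
        # Symmetric Dirichlet draw via normalized gamma variates.
        g = [random.gammavariate(conc, 1.0) for _ in range(dim)]
        s = sum(g)
        return [x / s for x in g]

    theta = dirichlet(K, alpha)                            # corpus topic distribution
    phi = [dirichlet(len(vocab), beta) for _ in range(K)]  # topic-word distributions
    phi_B = dirichlet(len(vocab), beta)                    # background topic B
    psi = dirichlet(2, gamma)                              # indicator distribution

    docs = []
    for _ in range(n_docs):
        z = random.choices(range(K), weights=theta)[0]     # z_d ~ Multinomial(theta)
        doc = []
        for _ in range(doc_len):
            y = random.choices([0, 1], weights=psi)[0]     # y_{d,i} ~ Bernoulli(psi)
            dist = phi_B if y == 1 else phi[z]
            doc.append(random.choices(vocab, weights=dist)[0])
        docs.append((z, doc))
    return docs
```

Such a simulation is useful mainly as a sanity check that an inference implementation can recover planted topics.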
The probabilistic graphical model corresponding to the above generative process is shown in Fig. 1. We assume that the documents of the training data are all generated by this process, so once the observed variables are available, the hidden variables of the model must be inferred from the generative process and the observations.
(4) Model-parameter solving
According to the generative process, the likelihood function L of the training data can be written out; in outline, it is the probability of the observed words with the topic assignments z, the indicator variables y, and the distributions θ, ψ, φ_B, φ_1..K marginalized under their Dirichlet priors. The parameters and hidden variables of the model should be obtained by maximizing this likelihood, but because the variables in it are coupled, an exact solution is impossible, so approximate methods must be used. Common approximate methods for probabilistic graphical models include the EM algorithm, variational EM, variational expectation propagation, and Gibbs sampling. Here we solve the parameters with Gibbs sampling, which is fairly simple to implement and, after sufficient sampling, yields a good solution. An equally important reason is that Gibbs sampling lets us incorporate the semantic relations between words into the sampling process. According to the likelihood function, the hidden variables y and z must be sampled, while the hidden variables φ_B, φ_{1,...,K}, θ and ψ can be obtained by maximum a posteriori estimation.
Given a word w and its set of semantically related words S(w), the words in S(w) have a strong semantic relation with w, so if word w has a high probability under a topic z, then any word w_o in S(w) should also have a higher probability under topic z. To achieve this we use the sampling strategy of the Generalized Pólya Urn model.
The Pólya urn model is one of the more classical models in statistics, and many problems can be reduced to it. In the simple Pólya urn model, an urn contains several balls, each painted one color; a ball is drawn at random, and then that ball, together with another ball of the same color, is put back into the urn. The Generalized Pólya Urn model extends this: a ball is drawn at random from the urn, its color is recorded, and the ball is then returned together with a certain number of balls of other, similar colors. In a topic model, the urn is analogous to a topic and the balls to words; the simple Pólya urn corresponds to ordinary Gibbs sampling. In the solving process of this model we use the Generalized Pólya Urn: for a word w, whenever it occurs once under topic k, not only is the count of w under topic k increased, but also the counts under topic k of the words in S(w). That is, during Gibbs sampling, if the topic of word w is set to k, then n_k^w ← n_k^w + 1 and, at the same time, for each w_o ∈ S(w), n_k^{w_o} ← n_k^{w_o} + A_{w,w_o}, where n_k^w denotes the count associating topic k with word w, and the promotion matrix A is defined (with a promotion weight μ whose specific value depends on the data set) as:
A_{w,w_o} = 1 if w_o = w; A_{w,w_o} = μ if w_o ∈ S(w); A_{w,w_o} = 0 otherwise.
For document d, the sampling formula of the topic z is:
p(z_d = k | z_{−d}, w) ∝ (n_k^{−d} + α) × [ Π_{w∈d} Π_{j=1}^{N_d^w} (n_k^{w,−d} + β + j − 1) ] / [ Π_{i=1}^{N_d} (n_k^{(·),−d} + Vβ + i − 1) ]
where n_k is the number of documents with topic k, n_k^w is the count of word w under topic k, n_k^{(·)} is the count of all words under topic k, N_d^w is the number of occurrences of word w in document d, N_d is the length of document d, V is the size of the word list, and the superscript −d means that the information of document d is excluded when computing these counts.
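One resampling step with the Generalized Pólya Urn count update can be sketched as below. This is a simplified illustration, not the patent's implementation: the conditional ignores within-document word repeats, the function name and count-table layout are assumptions, and the promotion weight `mu` plays the role of the matrix entry A_{w,w_o} for w_o ∈ S(w):

```python
import math
import random

def resample_topic(d_words, counts, n_k, n_docs_k, S, K, V, alpha, beta, mu):
    # counts[k]: word -> (possibly fractional) count under topic k
    # n_k[k]:    total word count under topic k; n_docs_k[k]: documents with topic k
    weights = []
    for k in range(K):
        log_w = math.log(n_docs_k[k] + alpha)
        for i, w in enumerate(d_words):
            # Simplified conditional: (n_k^w + beta) / (n_k + V*beta + i),
            # ignoring repeats of w within the document.
            log_w += math.log(counts[k].get(w, 0.0) + beta)
            log_w -= math.log(n_k[k] + V * beta + i)
        weights.append(log_w)
    m = max(weights)
    probs = [math.exp(x - m) for x in weights]
    k = random.choices(range(K), weights=probs)[0]
    # Generalized Polya Urn update: w itself adds 1; each wo in S(w) adds mu.
    n_docs_k[k] += 1
    for w in d_words:
        counts[k][w] = counts[k].get(w, 0.0) + 1.0
        n_k[k] += 1.0
        for wo in S.get(w, ()):
            counts[k][wo] = counts[k].get(wo, 0.0) + mu
            n_k[k] += mu
    return k
```

A full sampler would also decrement the counts of document d before resampling (the −d superscript) and interleave the y-indicator sampling; both are omitted here for brevity.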
For the i-th word of document d, the sampling formula of the binary indicator variable y is:
p(y_{d,i} = 1 | ·) ∝ (n_{y=1}^{−(d,i)} + γ) × (n_B^{w_{d,i},−(d,i)} + β) / (n_B^{−(d,i)} + Vβ)
p(y_{d,i} = 0 | ·) ∝ (n_{y=0}^{−(d,i)} + γ) × (n_{z_d}^{w_{d,i},−(d,i)} + β) / (n_{z_d}^{(·),−(d,i)} + Vβ)
where n_{y=1} is the number of background words in the document collection and, similarly, n_{y=0} is the number of non-background words; n_B^w is the number of times word w is generated by the background topic B, n_B is the total number of background words, and the superscript −(d,i) means that the information of the i-th word of document d is excluded when computing the related counts. After sufficient sampling, the probability of word w under topic k can be obtained as:
φ_k^w = (n_k^w + β) / (n_k^{(·)} + Vβ)
Since the distribution of the indicator variable and the word distribution of the background topic are not needed downstream, in practical applications only φ has to be computed.
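The final smoothed estimate of the topic-word distribution φ can be sketched directly from the counts; the function name and dictionary representation are assumptions:

```python
def topic_word_dist(counts_k, n_k, vocab, beta):
    # phi_k^w = (n_k^w + beta) / (n_k + V * beta): posterior-mean estimate of
    # the topic-word multinomial from the Gibbs counts, with Dirichlet
    # smoothing parameter beta.
    V = len(vocab)
    return {w: (counts_k.get(w, 0.0) + beta) / (n_k + V * beta) for w in vocab}
```

Because the counts include the fractional Generalized Pólya Urn promotions, semantically related words receive correspondingly higher probability under each topic.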
There are various ways to evaluate a topic model. We use the learned topics as document features for document classification and judge topic quality by the classification performance: for a given classifier, the higher the semantic coherence of the topics, the higher the classification accuracy. We use a random forest as the classifier and take classification accuracy and F1 value as the evaluation metrics. In the experiments, the model parameters are set to ε = 0.5 and η = 0.0, with the Amazon review data and the web query data as the model's training data. To further demonstrate the validity of this model, we compare it with 5 other common short-text topic models, with the number of topics K set to {20, 40, 60, 80}. The results are shown in Fig. 2, Fig. 3, Fig. 4 and Fig. 5. Judging from the classification performance, our model obtains better results in most cases, which shows that the quality of the topics it extracts is generally higher. From Fig. 4 and Fig. 5 we can further observe that, as the number of topics grows, the model remains robust: topic quality does not decline sharply merely because the number of topics increases. The experimental results demonstrate the feasibility of the proposed short-text topic model based on word vectors and contextual information.
The above describes specific embodiments of the present invention and the technical principles employed. Any changes conceived under the present invention whose resulting functions do not go beyond the spirit of the specification and drawings shall fall within the protection scope of the present invention.
Claims (1)
1. A short-text topic model based on word vectors and contextual information, characterized in that word vectors and contextual information are used effectively to obtain the semantic similarity between words, and the semantic similarity information is applied to the Gibbs sampling process to increase the semantic coherence of the topics:
(1) Obtaining the semantic similarity between words
Word vectors are trained on Wikipedia or Google News to obtain a vector representation of each word in the training data, and the cosine similarity between vectors expresses the semantic relatedness between two words. For words w_i and w_j with corresponding word vectors v_i and v_j, the semantic similarity between the words is defined as SR(w_i, w_j) = cos(v_i, v_j) = (v_i · v_j) / (||v_i|| ||v_j||). For each word in the training set, its set of semantically related words S(w) is obtained, defined as S(w) = {w_o | SR(w, w_o) > ε}, where the value of ε depends on the data set and ε ∈ [0, 1];
(2) Filtering the semantic similarity information between words with the training data
The word vectors are trained on a large corpus, and the semantic information they contain may not suit the training data. To further incorporate the features of the training data, pointwise mutual information (PMI) is used to filter the obtained semantic similarity information. The PMI between words w_i and w_j is defined as PMI(w_i, w_j) = log [ p(w_i, w_j) / (p(w_i) p(w_j)) ], where p(w_i, w_j) is the probability that words w_i and w_j co-occur in the same document, and p(w) is the probability that word w occurs in the document collection, estimated from the document frequency of the word. According to PMI, the set S(w) is redefined as S(w) = {w_o | SR(w, w_o) > ε, PMI(w, w_o) ≥ η}, where η ∈ (−∞, +∞), with the specific value depending on the data set;
(3) Defining the generative process of the model
The short-document collection is specified to have K topics and one background topic; a short document contains only one topic, and each word in a document is generated either by a normal topic or by the background topic. The specific generative process is:
a) sample the topic distribution of the document collection: θ ~ Dirichlet(α);
b) sample the word distribution of the background topic: φ_B ~ Dirichlet(β);
c) sample the distribution of the binary indicator variable: ψ ~ Dirichlet(γ);
d) for each topic k, sample the topic-word distribution: φ_k ~ Dirichlet(β);
e) for every document d in the collection, first sample the topic of the document, z_d ~ Multinomial(θ); for the i-th word of document d, first sample a binary variable y_{d,i} ~ Bernoulli(ψ); if y_{d,i} = 0, the word is generated from topic z_d, i.e. w_{d,i} ~ Multinomial(φ_{z_d}); if y_{d,i} = 1, the word is generated from the background topic B, i.e. w_{d,i} ~ Multinomial(φ_B);
(4) Model-parameter solving
The model parameters are solved by Gibbs sampling, with maximum a posteriori estimation of the parameters carried out on the obtained samples. To improve the semantic coherence of the topics, the sampling method of the General Polya Urn model is used to increase the counts, under related topics, of words with higher semantic similarity: when word w is assigned topic k, then n_k^w ← n_k^w + 1 and, at the same time, for each w_o ∈ S(w), n_k^{w_o} ← n_k^{w_o} + A_{w,w_o}, where A is defined (with a promotion weight μ) as A_{w,w_o} = 1 if w_o = w, μ if w_o ∈ S(w), and 0 otherwise;
According to the generative process, the hidden variables to be sampled are z and y, while the hidden variables φ_B, φ_{1,...,K}, θ and ψ are obtained by maximum a posteriori estimation; for document d, the sampling formula of the topic z is:
p(z_d = k | z_{−d}, w) ∝ (n_k^{−d} + α) × [ Π_{w∈d} Π_{j=1}^{N_d^w} (n_k^{w,−d} + β + j − 1) ] / [ Π_{i=1}^{N_d} (n_k^{(·),−d} + Vβ + i − 1) ]
where α and β are the hyperparameters of the Dirichlet distributions, V is the size of the word list, n_k^w is the count of word w under topic k, n_k^{(·)} is the count of all words under topic k, n_k is the number of documents with topic k, and the superscript −d means that document d is excluded when computing the current counts; after the samples z are obtained, the distribution of each topic over words is obtained by maximum a posteriori estimation:
φ_k^w = (n_k^w + β) / (n_k^{(·)} + Vβ)
Priority and publication data
Application number: CN201810124600.7A; priority/filing date: 2018-02-07
Publication number: CN108415901A; publication date: 2018-08-17
Legal events: publication (PB01); entry into force of request for substantive examination (SE01); invention patent application deemed withdrawn after publication (WD01), application publication date 2018-08-17.