CN105608192A - Short text recommendation method for user-based biterm topic model - Google Patents

Short text recommendation method for user-based biterm topic model

Info

Publication number
CN105608192A
CN105608192A
Authority
CN
China
Prior art keywords
user
theme
word
short text
ubtm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510979801.1A
Other languages
Chinese (zh)
Inventor
吕建
徐锋
魏杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201510979801.1A priority Critical patent/CN105608192A/en
Publication of CN105608192A publication Critical patent/CN105608192A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9535 - Search customisation based on user profiles and personalisation

Abstract

The invention discloses a short text recommendation method based on text topic analysis. Information forwarded or published by a user is subjected to topic analysis with a text topic model to obtain the user's topic preferences, and information matching those preferences is recommended from large amounts of unread information, so that the information overload problem is better alleviated. Based on the biterm topic model (BTM) and a short text aggregation method, a new topic model for short text topic analysis, the user-based biterm topic model (UBTM), is proposed. Experiments on a real microblog dataset show that UBTM obtains higher-quality topics than conventional short text topic analysis methods, and a UBTM-based short text recommendation experiment further shows that the proposed short text recommendation method achieves a better recommendation effect.

Description

A short text recommendation method based on a user-based biterm topic model
Technical field
The present invention relates to text recommendation, with particular emphasis on the recommendation of short texts. On the basis of topic analysis technology, it extends the biterm topic model by exploiting the author information of texts, which effectively strengthens topic extraction in the short-text setting and improves prediction accuracy in a short text recommendation system.
Background technology
In recent years, with the rapid development of the Internet and smart mobile devices, social media applications represented by Twitter and microblogs have become increasingly popular. Personal websites, blogs, social networking sites, and similar applications generate large amounts of information every day, making it difficult for users to obtain useful information and causing a serious information overload problem: users struggle to find content of interest within the flood of information. Text recommendation, which recommends text matching a user's preferences according to that user's specific situation, has become an effective way to address information overload.
One of the core steps of text recommendation is extracting valuable features from text, and topic analysis is a common feature extraction method. Common topic analysis techniques include latent semantic analysis (LSA) and the latent Dirichlet allocation (LDA) model; on the basis of LDA, several extended models such as sLDA and Labeled-LDA have appeared. These methods all exploit word co-occurrence statistics to analyze the topic distribution of text, and they all take the words in a document as the basic processing unit; when a document contains few words, the extracted topics are of poor quality. Much of the text on current social media such as Twitter and microblogs consists of short texts, and the above topic analysis techniques have difficulty extracting high-quality topic distributions from such short texts.
For this reason, research has also proposed a biterm topic model (BTM) based on LDA, which attempts to expand the word count of a single short text through the co-occurrence relations between words and to process the whole document collection as a single large document. Such methods improve the topic analysis quality of short texts to a certain extent, but they have a fairly obvious defect: they do not consider the author information of the short texts and rely only on the co-occurrence of word pairs within texts to analyze their topics. Because relatively important information is lost, the quality of the topic analysis can hardly meet the requirements of short text recommendation.
Summary of the invention
Objective of the invention: traditional text topic analysis techniques that take the word as the basic processing unit have difficulty extracting high-quality topic features from short texts, which makes them hard to apply to short text recommendation scenarios, while short text recommendation is an effective means of solving the information overload problem in current social media. To this end, the present invention builds on the biterm topic model and further exploits the author information of short texts, proposes a biterm topic model based on user aggregation, and provides a short text recommendation method based on this topic model, thereby effectively solving the above problems.
Technical solution: a short text recommendation method based on a user-based biterm topic model, which introduces a new biterm short text topic analysis technique based on user-level text aggregation and uses it to analyze a user's historical text information and obtain the user's topic preferences, realizing a personalized short text recommendation system. The main contents of the method comprise:
1) building the user-aggregated biterm short text topic model (UBTM);
2) solving the UBTM model with Gibbs sampling and estimating the topics of short texts.
Building the user-aggregated biterm short text topic model (UBTM):
Any two words in a document are combined into a word pair (biterm), and the documents belonging to the same user are aggregated together, yielding a new probabilistic graphical model, UBTM. This model effectively alleviates the content sparsity problem of short documents and can estimate the topic distribution (preference) of an individual user.
Based on the generative process of user documents under the UBTM model, the joint probability distribution of a biterm b = (w_i, w_j) of user u can be inferred as:
$$P(b \mid u) = \sum_z P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z) = \sum_z \theta_{z|u}\,\phi_{w_i|z}\,\phi_{w_j|z}$$
where P(z|u) is the user's topic preference distribution, and P(w_i|z) and P(w_j|z) are the distributions of the two words w_i, w_j of the biterm over topic z.
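For concreteness, the generative process implied by this joint probability can be sketched as follows (a reconstruction from the formula above, not a quotation of the original listing; α and β are the Dirichlet hyperparameters defined below):

$$
\begin{aligned}
&\text{for each topic } z = 1,\ldots,K: && \text{draw } \phi_z \sim \mathrm{Dirichlet}(\beta)\\
&\text{for each user } u: && \text{draw } \theta_u \sim \mathrm{Dirichlet}(\alpha)\\
&\text{for each biterm } b=(w_i,w_j) \text{ of user } u: && \text{draw } z \sim \mathrm{Multinomial}(\theta_u),\ \ w_i, w_j \sim \mathrm{Multinomial}(\phi_z)
\end{aligned}
$$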
Solving the UBTM model with Gibbs sampling and estimating the topics of short texts:
The Gibbs sampling algorithm is a special case of the Markov Chain Monte Carlo family. Its basic idea is to pick one dimension of the random vector at a time and sample a value for that dimension according to the current values of the other dimensions, traversing all random variables in turn; the iteration continues until convergence, after which samples are drawn at intervals and the parameters to be estimated are computed from the collected statistics. In the parameter inference process of UBTM, an initial topic is first assigned at random to each biterm of each user, and then new topics are resampled for the biterms according to the conditional probability P(z | Z_{-b|u}, B, α, β). The posterior probability (iteration update rule) of the UBTM model is computed as follows:
$$P(z \mid Z_{-b|u}, B, \alpha, \beta) \propto \frac{n_{z|u} + \alpha}{n_u + K\alpha} \cdot \frac{(n_{w_i|z} + \beta)(n_{w_j|z} + \beta)}{\left(\sum_w n_{w|z} + M\beta\right)^2}$$
where M denotes the total number of words in the vocabulary and K denotes the total number of topics,
n_{z|u} denotes the total number of times the biterms of user u are assigned to topic z,
n_{w|z} denotes the total number of times the single word w is assigned to topic z,
n_u denotes the total number of topic assignments over all biterms of user u,
B denotes the set of all biterms,
α and β are the hyperparameters of the Dirichlet distributions.
The word distribution φ_{w|z} of each topic and the topic distribution θ_{z|u} of each user are computed as follows:
$$\phi_{w|z} = \frac{n_{w|z} + \beta}{\sum_w n_{w|z} + M\beta}, \qquad \theta_{z|u} = \frac{n_{z|u} + \alpha}{n_u + K\alpha}$$
On this basis, a method is given for inferring the topic distribution of a short text from the user's topic distribution.
The topic distribution of a short text is assumed to equal the expectation of the topic distributions of all biterms in the short text:
$$P(z \mid d, u) = \sum_b P(z \mid b)\,P(b \mid d, u)$$
where d denotes the short text, b denotes a biterm, z denotes a topic, and u denotes the user to whom the short text belongs.
P(z|b) is inferred from the φ_{w|z} and θ_{z|u} computed above according to Bayes' formula:
$$P(z \mid b) = \frac{P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z)}{\sum_z P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z)}$$
where P(z|u) = θ_{z|u} and P(w_i|z) = φ_{w_i|z}, and
$$P(b \mid d, u) = \frac{n_{b|d,u}}{\sum_b n_{b|d,u}}$$
where n_{b|d,u} denotes the number of times the biterm b occurs in the short text d of user u, which can be obtained by counting. Substituting into the formula above then yields the topic distribution P(z|d, u) of the short text.
At present, social media applications such as Twitter and microblogs produce massive amounts of information every day, causing a serious information overload problem in which users have difficulty quickly finding content of interest. Most of this information is presented as short texts, and traditional text topic models such as LDA struggle to accurately analyze the topic distribution of such short texts. The UBTM topic analysis technique mines topics by combining biterms with user aggregation: a user's historical texts are aggregated, decomposed into biterms, and analyzed to learn the user's topic preferences, while the topic distributions of unread messages are inferred at the same time. The similarity between the user's topic distribution and the topic distribution of each unread message is then computed; texts with high similarity are close to the user's interests and can be recommended to the user, thereby forming a short text recommendation system.
Experiments on a real microblog dataset show that the UBTM model obtains higher-quality topics than traditional short text topic analysis techniques. A microblog forwarding recommendation experiment based on the UBTM model also demonstrates that the short text recommendation method proposed by this invention achieves a better recommendation effect.
Brief description of the drawings
Fig. 1 is the graphical model of the user-aggregated biterm short text topic model (UBTM);
Fig. 2 is the architecture diagram of the microblog short text recommendation system based on the UBTM topic analysis technique.
Detailed description of the invention
The present invention is further illustrated below in conjunction with specific embodiments. It should be understood that these embodiments are provided only to illustrate the present invention and not to limit its scope. After reading the present invention, modifications of various equivalent forms by those skilled in the art all fall within the scope defined by the claims of this application.
A short text recommendation method based on a user-based biterm topic model, whose main contents comprise:
1) building the user-aggregated biterm short text topic model (UBTM);
2) solving the UBTM model with Gibbs sampling and estimating the topics of short texts.
Take microblogs as an example. Every microblog post has an author, and we assume that the posts an author writes or forwards share a certain similarity, related to that author's interest preferences. On the basis of this assumption, the present invention builds on the classical biterm topic model and proposes to aggregate the biterms of the same user while processing different users separately; this method is called UBTM. We also give a parameter estimation procedure for the UBTM model based on the Gibbs sampling algorithm and, from the biterm topic distributions of UBTM, a method for estimating the topics of a single short text. Finally, on the basis of UBTM topic analysis, texts are recommended to users, realizing a short text recommendation system based on the UBTM topic analysis technique.
It is generally accepted that a group of mutually related words can represent a "topic", and that the strength of the association between these words is determined by how often they occur together in the same piece of text. For example, "the Forbidden City", "Tian'anmen", "passenger flow", and "holidays" often appear in the same context (a microblog post or a piece of news), so we may consider that they probably belong to the same latent topic (Beijing attractions and crowds). Classical topic analysis methods model this generative process precisely by exploiting such word co-occurrence. Unlike those methods, the present invention draws on biterm statistics and, on that basis, uses user information to group the biterms.
Here we illustrate how biterms are generated. Suppose the following are messages posted by authors A and B, together with the effective words obtained after preprocessing.
Author | Text content | Effective words
A | Russia's fighter jet is shot down by Turkey | fighter jet, Russia, Turkey, shot down
B | Spring Festival travel train tickets go on sale on the 26th | Spring Festival travel, train ticket, 26th, on sale
A | Putin responds: if it is shot down again, we must retaliate! | Putin, shot down, retaliate
B | Jack Ma's chartered-plane group, Evergrande sets a record, progress every day | Jack Ma, chartered plane, Evergrande, progress every day
A biterm is simply a pairwise combination of the effective words within a single text, which expands the short text.
The classical biterm topic model directly aggregates all texts into one large document for training: the 21 biterms above would form a single corpus. It assumes that all short texts follow the same topic distribution; this assumption does not hold in all situations, it ignores personalized topic distributions for short texts, and its topic analysis effect is limited. The present invention instead considers that the documents of one user express the same group of topics: all biterms of a user are combined, and the biterms of different users follow different topic distributions. In the example, the 9 biterms of user A are combined and the 12 biterms of user B are combined, forming 2 corpora.
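As an illustration, a minimal sketch of this per-user biterm extraction (not the code of the invention; the function names and the toy token lists are assumptions):

```python
from itertools import combinations
from collections import defaultdict

def extract_biterms(tokens):
    """All unordered pairs of effective words within one short text."""
    return list(combinations(tokens, 2))

def aggregate_by_user(posts):
    """posts: list of (user, token_list); returns {user: [biterm, ...]}."""
    biterms_by_user = defaultdict(list)
    for user, tokens in posts:
        biterms_by_user[user].extend(extract_biterms(tokens))
    return biterms_by_user

# Toy example mirroring the table above (token lists are illustrative).
posts = [
    ("A", ["fighter_jet", "Russia", "Turkey", "shot_down"]),
    ("B", ["spring_travel", "train_ticket", "26th", "on_sale"]),
    ("A", ["Putin", "shot_down", "retaliate"]),
    ("B", ["Jack_Ma", "chartered_plane", "Evergrande", "progress"]),
]
biterm_sets = aggregate_by_user(posts)
print(len(biterm_sets["A"]), len(biterm_sets["B"]))  # 9 and 12 biterms, 21 in total
```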
The user-aggregated biterm topic method follows the generative process described above.
According to this process, the joint probability distribution of a biterm b = (w_i, w_j) under user u is:
$$P(b \mid u) = \sum_z P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z) = \sum_z \theta_{z|u}\,\phi_{w_i|z}\,\phi_{w_j|z}$$
Correspondingly, the likelihood of user u is:
$$P(u) = \prod_{(i,j)} P(b \mid u)$$
Because all biterms of a user are combined, the topic distributions of all documents of one user are taken to be consistent. This both alleviates the data sparsity problem that classical topic models suffer on short texts and avoids the loss of topic characteristics caused by aggregating too many documents.
Next, we give the parameter inference method of UBTM. The goal of solving the UBTM model is mainly to estimate reasonable values of {φ, θ}, and we adopt Gibbs sampling for approximate inference. Gibbs sampling is a widely accepted stochastic algorithm whose sampling process mirrors the model's generative process; it consumes few resources and is easy to run on large-scale data. The basic procedure of Gibbs sampling in the present invention is described below.
The basic idea of Gibbs sampling is to fix the assignments of all variables but one and to resample that variable according to its conditional probability given the rest, alternating over the variables in turn. The concrete sampling procedure is explained as follows:
1. First, traverse all biterms of all users and randomly assign each a topic, z_{b|u} ~ Multi(1/K), where b denotes a biterm, u denotes a user, z_{b|u} denotes the randomly assigned topic, K denotes the total number of topics, and Multi(θ) denotes the multinomial distribution with parameter θ.
2. Increment the corresponding counters n_{z|u}, n_u, n_{w_i|z}, n_{w_j|z}, and n_z by 1, where
n_{z|u} is the number of times topic z appears in the documents of user u;
n_u is the total number of topic assignments for user u;
n_{w_i|z}, n_{w_j|z} are the numbers of times the two words w_i, w_j of the biterm appear under topic z;
n_z is the total number of times topic z appears.
3. The following operation is iterated:
Traverse all users and, for each user, all of that user's biterms. Suppose the current biterm b of user u is currently assigned topic z. First decrement n_{z|u}, n_u, n_{w_i|z}, n_{w_j|z}, and n_z by 1; then, given the current assignments of all other biterms, resample a new topic z according to the topic probability distribution P(z | Z_{-b|u}, B, α, β); finally increment the corresponding counters n_{z|u}, n_u, n_{w_i|z}, n_{w_j|z}, and n_z by 1. Here Z_{-b|u} denotes the topic assignments of all biterms of the current user u other than b, and B is the set of all biterms of all users:
$$P(z \mid Z_{-b|u}, B, \alpha, \beta) \propto \frac{n_{z|u} + \alpha}{n_u + K\alpha} \cdot \frac{(n_{w_i|z} + \beta)(n_{w_j|z} + \beta)}{\left(\sum_w n_{w|z} + M\beta\right)^2}$$
where M denotes the total number of words in the vocabulary.
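A compact sketch of this collapsed Gibbs sampler is given below. It is an illustrative implementation of the update rule above under assumed data structures (word ids, a dict of per-user biterm lists), not the code of the invention:

```python
import numpy as np

def gibbs_ubtm(biterms_by_user, vocab_size, K, alpha, beta, n_iter=200, seed=0):
    """Collapsed Gibbs sampling of topic assignments for per-user biterms."""
    rng = np.random.default_rng(seed)
    n_zu = {u: np.zeros(K) for u in biterms_by_user}   # n_{z|u}: biterms of u on topic z
    n_wz = np.zeros((vocab_size, K))                   # n_{w|z}: word w on topic z
    n_z = np.zeros(K)                                  # sum_w n_{w|z}
    z_of = {u: rng.integers(K, size=len(bs)) for u, bs in biterms_by_user.items()}

    # Steps 1-2: random initial topics and counter initialization.
    for u, bs in biterms_by_user.items():
        for i, (wi, wj) in enumerate(bs):
            z = z_of[u][i]
            n_zu[u][z] += 1; n_wz[wi, z] += 1; n_wz[wj, z] += 1; n_z[z] += 2

    # Step 3: iterated resampling from P(z | Z_-b|u, B, alpha, beta).
    for _ in range(n_iter):
        for u, bs in biterms_by_user.items():
            for i, (wi, wj) in enumerate(bs):
                z = z_of[u][i]
                n_zu[u][z] -= 1; n_wz[wi, z] -= 1; n_wz[wj, z] -= 1; n_z[z] -= 2
                p = ((n_zu[u] + alpha) / (n_zu[u].sum() + K * alpha)
                     * (n_wz[wi] + beta) * (n_wz[wj] + beta)
                     / (n_z + vocab_size * beta) ** 2)
                z = rng.choice(K, p=p / p.sum())
                z_of[u][i] = z
                n_zu[u][z] += 1; n_wz[wi, z] += 1; n_wz[wj, z] += 1; n_z[z] += 2
    return n_wz, n_z, n_zu
```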
After the iterations are completed, the word distribution φ_{w|z} over each topic z and the topic distribution θ_{z|u} of each user can be inferred:
$$\phi_{w|z} = \frac{n_{w|z} + \beta}{\sum_w n_{w|z} + M\beta}, \qquad \theta_{z|u} = \frac{n_{z|u} + \alpha}{n_u + K\alpha}$$
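Continuing the sketch above, the point estimates follow directly from the final counts (illustrative code; names are assumptions):

```python
def estimate_phi_theta(n_wz, n_z, n_zu, alpha, beta):
    """phi_{w|z} and theta_{z|u} from the final Gibbs counts, per the formulas above."""
    M, K = n_wz.shape                                # vocabulary size, number of topics
    phi = (n_wz + beta) / (n_z + M * beta)           # phi[w, z] = P(w | z)
    theta = {u: (c + alpha) / (c.sum() + K * alpha)  # theta[u][z] = P(z | u)
             for u, c in n_zu.items()}
    return phi, theta
```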
To rank a list of short texts for personalized recommendation, this patent further gives, on the basis of UBTM, a method for estimating the topic distribution of a single short text. We assume that the topic distribution of a single short text equals the expectation of the topic distributions of the biterms in that short text:
$$P(z \mid d, u) = \sum_b P(z \mid b)\,P(b \mid d, u)$$
where d denotes the short text, b denotes a biterm, z denotes a topic, and u denotes the user to whom the short text belongs. P(z|b,u) is inferred from the φ_{w|z} and θ_{z|u} computed above according to Bayes' formula:
$$P(z \mid b, u) = \frac{P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z)}{\sum_z P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z)}$$
where P(z|u) = θ_{z|u} and P(w_i|z) = φ_{w_i|z}, and
$$P(b \mid d, u) = \frac{n_{b|d,u}}{\sum_b n_{b|d,u}}$$
where n_{b|d,u} denotes the number of times the biterm b occurs in the short text d of user u.
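A minimal sketch of this per-document topic inference, using the phi and theta estimated above (illustrative; names are assumptions):

```python
from collections import Counter
from itertools import combinations
import numpy as np

def infer_doc_topics(tokens, user, phi, theta):
    """P(z|d,u): biterm-frequency-weighted expectation of P(z|b,u)."""
    biterm_counts = Counter(combinations(tokens, 2))          # n_{b|d,u}
    total = sum(biterm_counts.values())
    K = len(theta[user])
    if total == 0:                                            # fewer than two tokens
        return np.full(K, 1.0 / K)
    p_zd = np.zeros(K)
    for (wi, wj), n_b in biterm_counts.items():
        p_zb = theta[user] * phi[wi] * phi[wj]                # unnormalized P(z|b,u)
        p_zd += (n_b / total) * p_zb / p_zb.sum()
    return p_zd
```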
The UBTM technique can be effectively used for content recommendation on social media such as Twitter and microblogs. At present, such social media produce large amounts of short text information every day, causing a serious information overload problem in which users have difficulty quickly locating content of interest. We use UBTM to learn each user's topic distribution from large numbers of short texts and to infer the topic distribution of each new short text. In the message feed of Twitter or a microblog, a user can see the messages published by all the friends he follows. Using the method above, the topic distribution P(z|d,f) of each message published by a friend f and the topic distribution P(z|u) of user u himself are computed, and the cosine similarity between them is calculated:
$$P(z \mid d, f) = (a_1, a_2, \ldots, a_K), \qquad P(z \mid u) = (b_1, b_2, \ldots, b_K)$$
$$\cos\theta = \frac{\sum_{i=1}^{K} a_i b_i}{\sqrt{\sum_{i=1}^{K} a_i^2}\,\sqrt{\sum_{i=1}^{K} b_i^2}}$$
The cosine similarity ranges over [-1, 1]; the closer the value is to 1, the closer the two vectors are, and the closer the current microblog is to the preferences of user u. By recommending to user u the N microblogs whose topic distributions are closest to user u's topic distribution, we realize a short text recommendation system based on the UBTM topic analysis technique.
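A minimal sketch of the ranking and top-N selection step (illustrative; function and variable names are assumptions):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two topic-distribution vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend_top_n(user_theta, candidate_docs, n=10):
    """candidate_docs: list of (doc_id, topic_vector); returns the n most similar."""
    scored = [(doc_id, cosine(user_theta, p_zd)) for doc_id, p_zd in candidate_docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```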
Compared with the classical biterm model, the present invention uses users to group the biterms, proposes an effective method for handling short text topics, and applies it in microblog recommendation, improving recommendation accuracy.
Example 1: quantitative evaluation of the topic analysis capability of the UBTM of the present invention
1. Description of the input and output data
We apply the method of the present invention to anonymized data from a real microblog service. The input is a set of microblog data whose statistics are shown in Table 1: the dataset contains 101,212 short texts, divided into 738 groups by user; on average each group contains 137.14 documents, and the average length of each document is 29 words. Several data samples are listed below.
Several samples of short text data
The output is the topic analysis quality evaluation metric of the UBTM topic model of the present invention.
2. Model learning and parameter inference
First, all microblogs and their corresponding users are read in, together with a Chinese stop-word list. For each microblog, meaningless stop words (e.g., "you", "what") are filtered out using the stop-word list, the remaining part is split into single words, and then, taking each microblog as a unit, the words are combined pairwise into biterms; the biterms of the same user are aggregated together, generating the biterm sets B_u, u ∈ U.
According to the model learning and inference procedure described above, through repeated Gibbs sampling iterations, the topic distribution of each user in the microblog community and the word distribution under each topic are learned.
3. Output results
Because a topic itself does not necessarily have a definite physical meaning and is often called a latent topic, we verify the quality of the topic analysis results in the two ways below.
First, we manually compared the topic analysis results of the traditional LDA method and of UBTM. For each method we selected, within the topics representing the same theme, the high-scoring and low-scoring words, where the score of a word is computed as P(w|z), the probability that the word belongs to the given topic. Ideally, the high-scoring words should express the meaning of the topic very clearly, and the low-scoring words should still have some correlation with the high-scoring words. In the experiment we selected the high-frequency word "hospital", found the related topic containing "hospital" under each of the two methods, and then located the high-scoring and low-scoring words under that topic. The high- and low-scoring words of the topics to which "hospital" belongs under each method are given below.
Qualitative comparison of the topics found by the UBTM and LDA topic analysis models
We can see that, compared with the traditional LDA topic analysis method, the high-scoring words of UBTM are clearly more related to one another, all being relevant to "accident", "rescue", and "hospital", and similar related words also appear among the low-scoring words. By comparison, the high-scoring words of LDA contain many words irrelevant to the topic, and words close to the topic hardly appear among its low-scoring words. This shows that, in the short-text setting, the topic analysis of UBTM is more effective than the traditional LDA method.
In addition, we select the coherence score evaluation method to quantify topic quality more precisely. For a topic z and its top T high-scoring words V^{(z)} = (v_1^{(z)}, ..., v_T^{(z)}) (ranked by P(w|z)), the coherence score is defined as follows:
$$C(z; V^{(z)}) = \sum_{m=2}^{T} \sum_{l=1}^{m-1} \log \frac{D(v_m^{(z)}, v_l^{(z)}) + 1}{D(v_l^{(z)})}$$
where D(v) is the number of documents in which word v occurs and D(v, v') is the number of documents in which words v and v' occur together. The higher the probability that words of the same topic occur in the same documents, the higher the topic quality, so the coherence score can be regarded as one measure of topic word quality. Meanwhile, to reduce the influence of the randomness of Gibbs sampling on the results and to measure the stability of the different models, we compute the coherence score of each topic under every method and take the mean as the final score:
$$\frac{1}{K} \sum_{k=1}^{K} C(z_k; V^{(z_k)})$$
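A small sketch of this coherence computation (illustrative; it assumes each document is given as a set of word ids):

```python
import math

def coherence(top_words, docs):
    """Coherence score of one topic, per the definition above; docs is a list of word-id sets."""
    def doc_freq(*words):
        return sum(1 for d in docs if all(w in d for w in words))
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1)
                              / doc_freq(top_words[l]))
    return score

def mean_coherence(topics_top_words, docs):
    """Average coherence over all topics, as used for the final score."""
    return sum(coherence(tw, docs) for tw in topics_top_words) / len(topics_top_words)
```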
The final results are shown in Table 2, where T is the number of high-scoring words chosen per topic. From the experimental results we can see that our method is consistent with expectations and outperforms the classical LDA topic model: whether the top 5, top 10, or top 20 words are chosen, the coherence score is better than that of LDA. Moreover, the standard deviation of our method is lower than that of LDA at T=5 and T=10, showing that the stability of our method is also relatively better.
Example 2: application evaluation in the microblog recommendation scenario
1. Description of the input and output data
In this example, we apply the topic analysis of the present invention to the practical scenario of microblog recommendation. From six months of microblog data we selected more than 7,000 relatively popular microblogs and observed 380,000 records of whether more than 20,000 users forwarded these microblogs. A forward can be taken as factual evidence that the user likes the microblog, so predicting forwarding behavior is the object of this experiment: we recommend microblogs to users according to UBTM, and measure the precision and recall of the recommendations by whether the users actually forwarded them.
The selection rule for the 380,000 records is as follows. First, we split the data into a training set and a test set by time: for each user, the microblogs forwarded by that user are sorted by time, the first 50% of the forwarding records are used as the training set, and the remaining 50% are used as the test set for forwarding prediction; the user's topic distribution is computed from the historical data, and the microblogs to be predicted do not participate in learning the user's topic distribution. It should be noted that we judge whether a user has read a certain microblog by the following rule: if the user forwarded or posted a microblog on a certain day, that day is considered an active day of the user, and every microblog posted or forwarded on an active day by the people the user follows is considered to have been read by the user. We use UBTM to compute the user's topic distribution and the topic distribution of each microblog, rank the microblogs posted by the user's friends during the active days to be predicted by their similarity to the user's own topic distribution, and recommend the top-ranked microblogs, which are likely to be close to the user's preferences; in the experiment, whether the user forwarded a recommended microblog is used to verify whether the user really liked it.
We recommend the top 3, 5, and 10 microblogs to each user, compute precision and recall, and finally average the precision and recall over all users; the results are compared with those of LDA. Table 3 shows the data of this example in detail.
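A minimal sketch of the precision/recall@N computation used here (illustrative; the ranked recommendation lists and forwarding labels are assumed inputs):

```python
def precision_recall_at_n(recommended, forwarded, n):
    """recommended: ranked doc ids; forwarded: set of doc ids the user actually forwarded."""
    top_n = recommended[:n]
    hits = sum(1 for doc_id in top_n if doc_id in forwarded)
    precision = hits / n
    recall = hits / len(forwarded) if forwarded else 0.0
    return precision, recall

def mean_precision_recall(per_user, n):
    """per_user: list of (recommended_list, forwarded_set) pairs, one per user."""
    pairs = [precision_recall_at_n(r, f, n) for r, f in per_user]
    return (sum(p for p, _ in pairs) / len(pairs),
            sum(r for _, r in pairs) / len(pairs))
```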
2. The microblog recommendation process
2.1 Computing user topic distributions with UBTM
Assuming that the topic distributions of all microblogs posted by the same user are consistent, all original and forwarded microblogs belonging to the same user in the training set are aggregated together, and the user's topic distribution and the word distributions over topics are computed with the same method as in Example 1.
2.2 Computing the topic distribution of each microblog in the test set
For every microblog in the test set, the topic distribution of the individual microblog is inferred from its author's topic distribution and the word distributions over topics obtained previously.
2.3 Computing similarity and recommending microblogs
The similarity between each microblog posted by the user's friends on the user's active days and the user's own topic preference distribution (the cosine similarity between the topic distribution vectors) is computed; the microblogs are ranked by similarity and the top 3, 5, or 10 microblogs are recommended to the user. Fig. 2 shows the architecture of our microblog short text recommendation system based on the UBTM topic analysis technique.
3. Output results
We compared the precision and recall of the microblogs recommended by the classical LDA model and by the UBTM model of the present invention. It should be noted that, in practice, whether a user forwards a certain microblog may depend on several factors, while this experiment only considers the influence of topic similarity on whether the user forwards; the absolute values of the predictions are therefore not directly meaningful, but comparing the prediction accuracy of LDA and UBTM under equal conditions does reflect which of them has the greater advantage in the field of short text recommendation. Table 4 shows the concrete precision and recall data of this experiment. We can see that, when recommending the top 3, 5, or 10 microblogs to a user, UBTM improves both precision and recall compared with the traditional LDA method. This proves that the present invention accurately extracts short text topic distributions and effectively predicts users' microblog forwarding behavior, and that the microblog short text recommendation system based on the UBTM topic analysis technique can reasonably be applied in practice.
Table 1. Data statistics of Example 1
Number of microblogs (short text documents) | 101212
Number of users (groups) | 738
Average number of microblogs per user (average documents per group) | 137.14
Average microblog length (average short text length) | 29
Table 2. Coherence score comparison of Example 1
Table 3. Data statistics of Example 2
Number of users | Number of microblogs | Forwarded | Not forwarded | Total records
22296 | 7663 | 209945 | 177396 | 387341
Table 4. Experimental results of the LDA and UBTM models in forwarding prediction

Claims (4)

1. A short text recommendation method based on a user-based biterm topic model, characterized in that, based on a biterm short text topic analysis technique with user-level text aggregation, this topic analysis technique is used to analyze a user's historical text information and obtain the user's topic preferences, thereby realizing a personalized short text recommendation method, specifically comprising:
1) building the user-aggregated biterm short text topic model (UBTM);
2) solving the UBTM model with Gibbs sampling and estimating the topics of short texts.
2. The short text recommendation method based on a user-based biterm topic model as claimed in claim 1, characterized in that the user-aggregated biterm short text topic model (UBTM) is built as follows:
any two words in a document are combined into a word pair (biterm), and the documents belonging to the same user are aggregated together, yielding a new probabilistic graphical model, UBTM;
the joint probability distribution of a biterm b = (w_i, w_j) of user u is inferred as:
$$P(b \mid u) = \sum_z P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z) = \sum_z \theta_{z|u}\,\phi_{w_i|z}\,\phi_{w_j|z}.$$
3. The short text recommendation method based on a user-based biterm topic model as claimed in claim 2, characterized in that the UBTM model is solved and the topics of short texts are estimated based on Gibbs sampling as follows:
in the parameter inference process of UBTM, an initial topic is first assigned at random to each biterm of each user, and then new topics are resampled for the biterms according to the conditional probability P(z | Z_{-b|u}, B, α, β); the posterior probability (iteration update rule) of the UBTM model is computed as follows,
$$P(z \mid Z_{-b|u}, B, \alpha, \beta) \propto \frac{n_{z|u} + \alpha}{n_u + K\alpha} \cdot \frac{(n_{w_i|z} + \beta)(n_{w_j|z} + \beta)}{\left(\sum_w n_{w|z} + M\beta\right)^2}$$
where M denotes the total number of words in the vocabulary and K denotes the total number of topics,
n_{z|u} denotes the total number of times the biterms of user u are assigned to topic z,
n_{w|z} denotes the total number of times the single word w is assigned to topic z,
n_u denotes the total number of topic assignments over all biterms of user u;
the word distribution φ_{w|z} of each topic and the topic distribution θ_{z|u} of each user are computed as follows:
$$\phi_{w|z} = \frac{n_{w|z} + \beta}{\sum_w n_{w|z} + M\beta}, \qquad \theta_{z|u} = \frac{n_{z|u} + \alpha}{n_u + K\alpha}.$$
4. The short text recommendation method based on a user-based biterm topic model as claimed in claim 3, characterized in that the method for inferring the topic distribution of a short text from the user's topic distribution is:
the topic distribution of a short text is assumed to equal the expectation of the topic distributions of all biterms in the short text:
$$P(z \mid d, u) = \sum_b P(z \mid b)\,P(b \mid d, u)$$
where d denotes the short text, b denotes a biterm, z denotes a topic, and u denotes the user to whom the short text belongs;
P(z|b) is inferred from the φ_{w|z} and θ_{z|u} computed above according to Bayes' formula,
$$P(z \mid b) = \frac{P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z)}{\sum_z P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z)}$$
where P(z|u) = θ_{z|u} and P(w_i|z) = φ_{w_i|z}, and
$$P(b \mid d, u) = \frac{n_{b|d,u}}{\sum_b n_{b|d,u}}$$
where n_{b|d,u} denotes the number of times the biterm b occurs in the short text d of user u.
CN201510979801.1A 2015-12-23 2015-12-23 Short text recommendation method for user-based biterm topic model Pending CN105608192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510979801.1A CN105608192A (en) 2015-12-23 2015-12-23 Short text recommendation method for user-based biterm topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510979801.1A CN105608192A (en) 2015-12-23 2015-12-23 Short text recommendation method for user-based biterm topic model

Publications (1)

Publication Number Publication Date
CN105608192A true CN105608192A (en) 2016-05-25

Family

ID=55988131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510979801.1A Pending CN105608192A (en) 2015-12-23 2015-12-23 Short text recommendation method for user-based biterm topic model

Country Status (1)

Country Link
CN (1) CN105608192A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354818A (en) * 2016-08-30 2017-01-25 电子科技大学 Dynamic user attribute extraction method based on social media
CN106447387A (en) * 2016-08-31 2017-02-22 上海交通大学 Air ticket personalized recommendation method based on shared account passenger prediction
CN106484829A (en) * 2016-09-29 2017-03-08 中国国防科技信息中心 A kind of foundation of microblogging order models and microblogging diversity search method
CN106708802A (en) * 2016-12-20 2017-05-24 西南石油大学 Information recommendation method and system
CN106776579A (en) * 2017-01-19 2017-05-31 清华大学 The sampling accelerated method of Biterm topic models
CN106815214A (en) * 2016-12-30 2017-06-09 东软集团股份有限公司 optimal theme number calculating method and device
CN107357785A (en) * 2017-07-05 2017-11-17 浙江工商大学 Theme feature word abstracting method and system, feeling polarities determination methods and system
CN107506377A (en) * 2017-07-20 2017-12-22 南开大学 This generation system is painted in interaction based on commending system
CN108182176A (en) * 2017-12-29 2018-06-19 太原理工大学 Enhance BTM topic model descriptor semantic dependencies and theme condensation degree method
CN108536868A (en) * 2018-04-24 2018-09-14 北京慧闻科技发展有限公司 The data processing method of short text data and application on social networks
CN108763484A (en) * 2018-05-25 2018-11-06 南京大学 A kind of law article recommendation method based on LDA topic models
CN109766431A (en) * 2018-12-24 2019-05-17 同济大学 A kind of social networks short text recommended method based on meaning of a word topic model
CN110134958A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text Topics Crawling method based on semantic word network
CN111191036A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Short text topic clustering method, device, equipment and medium
CN111611380A (en) * 2020-05-19 2020-09-01 北京邮电大学 Semantic search method, system and computer readable storage medium
CN115689089A (en) * 2022-10-25 2023-02-03 深圳市城市交通规划设计研究中心股份有限公司 Urban rail transit passenger riding probability calculation method, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEIZHENG CHEN et al.: "User Based Aggregation for Biterm Topic Model", ACL 2015 *
XIAOHUI YAN et al.: "A Biterm Topic Model for Short Texts", International World Wide Web Conference Committee 2013 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354818B (en) * 2016-08-30 2020-01-10 电子科技大学 Social media-based dynamic user attribute extraction method
CN106354818A (en) * 2016-08-30 2017-01-25 电子科技大学 Dynamic user attribute extraction method based on social media
CN106447387A (en) * 2016-08-31 2017-02-22 上海交通大学 Air ticket personalized recommendation method based on shared account passenger prediction
CN106484829B (en) * 2016-09-29 2019-05-17 中国国防科技信息中心 A kind of foundation and microblogging diversity search method of microblogging order models
CN106484829A (en) * 2016-09-29 2017-03-08 中国国防科技信息中心 A kind of foundation of microblogging order models and microblogging diversity search method
CN106708802A (en) * 2016-12-20 2017-05-24 西南石油大学 Information recommendation method and system
CN106815214B (en) * 2016-12-30 2019-11-22 东软集团股份有限公司 Optimal number of topics acquisition methods and device
CN106815214A (en) * 2016-12-30 2017-06-09 东软集团股份有限公司 optimal theme number calculating method and device
CN106776579A (en) * 2017-01-19 2017-05-31 清华大学 The sampling accelerated method of Biterm topic models
CN106776579B (en) * 2017-01-19 2019-05-31 清华大学 The sampling accelerated method of Biterm topic model
CN107357785A (en) * 2017-07-05 2017-11-17 浙江工商大学 Theme feature word abstracting method and system, feeling polarities determination methods and system
CN107506377A (en) * 2017-07-20 2017-12-22 南开大学 This generation system is painted in interaction based on commending system
CN108182176B (en) * 2017-12-29 2021-08-10 太原理工大学 Method for enhancing semantic relevance and topic aggregation of topic words of BTM topic model
CN108182176A (en) * 2017-12-29 2018-06-19 太原理工大学 Enhance BTM topic model descriptor semantic dependencies and theme condensation degree method
CN108536868A (en) * 2018-04-24 2018-09-14 北京慧闻科技发展有限公司 The data processing method of short text data and application on social networks
CN108536868B (en) * 2018-04-24 2022-04-15 北京慧闻科技(集团)有限公司 Data processing method and device for short text data on social network
CN108763484A (en) * 2018-05-25 2018-11-06 南京大学 A kind of law article recommendation method based on LDA topic models
CN109766431A (en) * 2018-12-24 2019-05-17 同济大学 A kind of social networks short text recommended method based on meaning of a word topic model
CN110134958A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text Topics Crawling method based on semantic word network
CN111191036A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Short text topic clustering method, device, equipment and medium
CN111611380A (en) * 2020-05-19 2020-09-01 北京邮电大学 Semantic search method, system and computer readable storage medium
CN111611380B (en) * 2020-05-19 2021-10-15 北京邮电大学 Semantic search method, system and computer readable storage medium
CN115689089A (en) * 2022-10-25 2023-02-03 深圳市城市交通规划设计研究中心股份有限公司 Urban rail transit passenger riding probability calculation method, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN105608192A (en) Short text recommendation method for user-based biterm topic model
CN107644089B (en) Hot event extraction method based on network media
CN104077417A (en) Figure tag recommendation method and system in social network
Canhoto et al. ‘We (don’t) know how you feel’–a comparative study of automated vs. manual analysis of social media conversations
JP5754854B2 (en) Contributor analysis apparatus, program and method for analyzing poster profile information
Liu et al. Mining urban perceptions from social media data
Salem et al. Personality traits for egyptian twitter users dataset
CN112214661B (en) Emotional unstable user detection method for conventional video comments
Wahyudi et al. Aspect based sentiment analysis in E-commerce user reviews using Latent Dirichlet Allocation (LDA) and Sentiment Lexicon
Zong et al. Measuring forecasting skill from text
Widiyaningtyas et al. Sentiment Analysis Of Hotel Review Using N-Gram And Naive Bayes Methods
CN104572915A (en) User event relevance calculation method based on content environment enhancement
Thrane Modelling tourists’ length of stay: A call for a ‘back-to-basic’approach
Lassen et al. Reviewer Preferences and Gender Disparities in Aesthetic Judgments
JP6699031B2 (en) Model learning method, description evaluation method, and device
Paudel et al. Using personality traits information from social media for music recommendation
Stankevich et al. Analysis of Big Five Personality Traits by Processing of Social Media Users Activity Features.
CN107590742B (en) Behavior-based social network user attribute value inversion method
Alvarez-Carmona et al. A comparative analysis of distributional term representations for author profiling in social media
Zhou et al. Predicting user influence under the environment of big data
KR102502454B1 (en) Real-time comment judgment device and method using ultra-high-speed artificial analysis intelligence
Stankevich et al. Predicting personality traits from social network profiles
CN112487303B (en) Topic recommendation method based on social network user attributes
Liang et al. JST-RR model: joint modeling of ratings and reviews in sentiment-topic prediction
CN107203632A (en) Topic Popularity prediction method based on similarity relation and cooccurrence relation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160525