CN105608192A - Short text recommendation method for user-based biterm topic model - Google Patents

Short text recommendation method for user-based biterm topic model

Info

Publication number
CN105608192A
CN105608192A
Authority
CN
China
Prior art keywords
user
theme
word
short text
ubtm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510979801.1A
Other languages
Chinese (zh)
Inventor
吕建
徐锋
魏杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201510979801.1A priority Critical patent/CN105608192A/en
Publication of CN105608192A publication Critical patent/CN105608192A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9535 - Search customisation based on user profiles and personalisation

Abstract

The invention discloses a short text recommendation method based on text topic analysis. Information forwarded or published by a user is subjected to topic analysis with a text topic model to obtain the user's topic preferences, and information matching those preferences is recommended from large amounts of unread information, so that the information overload problem is better alleviated. Based on the biterm topic model (BTM) and a short text aggregation method, a new topic model for short text topic analysis, the user-based biterm topic model (UBTM), is proposed. Experiments on a real microblog dataset show that UBTM obtains higher-quality topics than conventional short text topic analysis methods, and a UBTM-based short text recommendation experiment further shows that the proposed short text recommendation method achieves a better recommendation effect.

Description

A short text recommendation method based on a user-based biterm topic model
Technical field
The present invention relates to text recommendation, with particular emphasis on the recommendation of short texts. On the basis of topic analysis technology, it extends the biterm topic model by exploiting the author information of texts, which effectively strengthens topic extraction in the short-text setting and improves prediction accuracy in a short text recommendation system.
Background technology
In recent years, with the rapid development of the Internet and smart mobile devices, social media applications represented by Twitter and microblogs have become increasingly popular. Personal websites, blogs, social networking sites, and similar applications generate large amounts of information every day, making it difficult for users to obtain useful information and causing a serious information overload problem: users struggle to find content of interest within the flood of information. Text recommendation, which recommends text matching a user's preferences according to that user's specific situation, has become an effective way to address information overload.
One of the core steps of text recommendation is extracting valuable features from text, and topic analysis is a common feature extraction method. Common topic analysis techniques include latent semantic analysis (LSA) and the latent Dirichlet allocation (LDA) model; on the basis of LDA, several extended models such as sLDA and Labeled-LDA have appeared. These methods all exploit word co-occurrence statistics to analyze the topic distribution of text, and they all take the words in a document as the basic processing unit; when a document contains few words, the extracted topics are of poor quality. Much of the text on current social media such as Twitter and microblogs consists of short texts, and the above topic analysis techniques have difficulty extracting high-quality topic distributions from such short texts.
For this reason, research has also proposed a biterm topic model (BTM) based on LDA, which attempts to expand the word count of a single short text through the co-occurrence relations between words and to process the whole document collection as a single large document. Such methods improve the topic analysis quality of short texts to a certain extent, but they have a fairly obvious defect: they do not consider the author information of the short texts and rely only on the co-occurrence of word pairs within texts to analyze their topics. Because relatively important information is lost, the quality of the topic analysis can hardly meet the requirements of short text recommendation.
Summary of the invention
Objective of the invention: traditional text topic analysis techniques that take the word as the basic processing unit have difficulty extracting high-quality topic features from short texts, which makes them hard to apply to short text recommendation scenarios, while short text recommendation is an effective means of solving the information overload problem in current social media. To this end, the present invention builds on the biterm topic model and further exploits the author information of short texts, proposes a biterm topic model based on user aggregation, and provides a short text recommendation method based on this topic model, thereby effectively solving the above problems.
Technical solution: a short text recommendation method based on a user-based biterm topic model, which introduces a new biterm short text topic analysis technique based on user-level text aggregation and uses it to analyze a user's historical text information and obtain the user's topic preferences, realizing a personalized short text recommendation system. The main contents of the method comprise:
1) building the user-aggregated biterm short text topic model (UBTM);
2) solving the UBTM model with Gibbs sampling and estimating the topics of short texts.
Building the user-aggregated biterm short text topic model (UBTM):
Any two words in a document are combined into a word pair (biterm), and the documents belonging to the same user are aggregated together, yielding a new probabilistic graphical model, UBTM. This model effectively alleviates the content sparsity problem of short documents and can estimate the topic distribution (preference) of an individual user.
Based on the generative process of user documents under the UBTM model, the joint probability distribution of a biterm b = (w_i, w_j) of user u can be inferred as:
$$P(b \mid u) = \sum_z P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z) = \sum_z \theta_{z|u}\,\phi_{w_i|z}\,\phi_{w_j|z}$$
where P(z|u) is the user's topic preference distribution, and P(w_i|z) and P(w_j|z) are the distributions of the two words w_i, w_j of the biterm over topic z.
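For concreteness, the generative process implied by this joint probability can be sketched as follows (a reconstruction from the formula above, not a quotation of the original listing; α and β are the Dirichlet hyperparameters defined below):

$$
\begin{aligned}
&\text{for each topic } z = 1,\ldots,K: && \text{draw } \phi_z \sim \mathrm{Dirichlet}(\beta)\\
&\text{for each user } u: && \text{draw } \theta_u \sim \mathrm{Dirichlet}(\alpha)\\
&\text{for each biterm } b=(w_i,w_j) \text{ of user } u: && \text{draw } z \sim \mathrm{Multinomial}(\theta_u),\ \ w_i, w_j \sim \mathrm{Multinomial}(\phi_z)
\end{aligned}
$$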
Solving the UBTM model with Gibbs sampling and estimating the topics of short texts:
The Gibbs sampling algorithm is a special case of the Markov Chain Monte Carlo family. Its basic idea is to pick one dimension of the random vector at a time and sample a value for that dimension according to the current values of the other dimensions, traversing all random variables in turn; the iteration continues until convergence, after which samples are drawn at intervals and the parameters to be estimated are computed from the collected statistics. In the parameter inference process of UBTM, an initial topic is first assigned at random to each biterm of each user, and then new topics are resampled for the biterms according to the conditional probability P(z | Z_{-b|u}, B, α, β). The posterior probability (iteration update rule) of the UBTM model is computed as follows:
$$P(z \mid Z_{-b|u}, B, \alpha, \beta) \propto \frac{n_{z|u} + \alpha}{n_u + K\alpha} \cdot \frac{(n_{w_i|z} + \beta)(n_{w_j|z} + \beta)}{\left(\sum_w n_{w|z} + M\beta\right)^2}$$
where M denotes the total number of words in the vocabulary and K denotes the total number of topics,
n_{z|u} denotes the total number of times the biterms of user u are assigned to topic z,
n_{w|z} denotes the total number of times the single word w is assigned to topic z,
n_u denotes the total number of topic assignments over all biterms of user u,
B denotes the set of all biterms,
α and β are the hyperparameters of the Dirichlet distributions.
The word distribution φ_{w|z} of each topic and the topic distribution θ_{z|u} of each user are computed as follows:
$$\phi_{w|z} = \frac{n_{w|z} + \beta}{\sum_w n_{w|z} + M\beta}, \qquad \theta_{z|u} = \frac{n_{z|u} + \alpha}{n_u + K\alpha}$$
On this basis, a method is given for inferring the topic distribution of a short text from the user's topic distribution.
The topic distribution of a short text is assumed to equal the expectation of the topic distributions of all biterms in the short text:
$$P(z \mid d, u) = \sum_b P(z \mid b)\,P(b \mid d, u)$$
where d denotes the short text, b denotes a biterm, z denotes a topic, and u denotes the user to whom the short text belongs.
P(z|b) is inferred from the φ_{w|z} and θ_{z|u} computed above according to Bayes' formula:
$$P(z \mid b) = \frac{P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z)}{\sum_z P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z)}$$
where P(z|u) = θ_{z|u} and P(w_i|z) = φ_{w_i|z}, and
$$P(b \mid d, u) = \frac{n_{b|d,u}}{\sum_b n_{b|d,u}}$$
where n_{b|d,u} denotes the number of times the biterm b occurs in the short text d of user u, which can be obtained by counting. Substituting into the formula above then yields the topic distribution P(z|d, u) of the short text.
At present, social media applications such as Twitter and microblogs produce massive amounts of information every day, causing a serious information overload problem in which users have difficulty quickly finding content of interest. Most of this information is presented as short texts, and traditional text topic models such as LDA struggle to accurately analyze the topic distribution of such short texts. The UBTM topic analysis technique mines topics by combining biterms with user aggregation: a user's historical texts are aggregated, decomposed into biterms, and analyzed to learn the user's topic preferences, while the topic distributions of unread messages are inferred at the same time. The similarity between the user's topic distribution and the topic distribution of each unread message is then computed; texts with high similarity are close to the user's interests and can be recommended to the user, thereby forming a short text recommendation system.
Experiments on a real microblog dataset show that the UBTM model obtains higher-quality topics than traditional short text topic analysis techniques. A microblog forwarding recommendation experiment based on the UBTM model also demonstrates that the short text recommendation method proposed by this invention achieves a better recommendation effect.
Brief description of the drawings
Fig. 1 is the graphical model of the user-aggregated biterm short text topic model (UBTM);
Fig. 2 is the architecture diagram of the microblog short text recommendation system based on the UBTM topic analysis technique.
Detailed description of the invention
The present invention is further illustrated below in conjunction with specific embodiments. It should be understood that these embodiments are provided only to illustrate the present invention and not to limit its scope. After reading the present invention, modifications of various equivalent forms by those skilled in the art all fall within the scope defined by the claims of this application.
A short text recommendation method based on a user-based biterm topic model, whose main contents comprise:
1) building the user-aggregated biterm short text topic model (UBTM);
2) solving the UBTM model with Gibbs sampling and estimating the topics of short texts.
Take microblogs as an example. Every microblog post has an author, and we assume that the posts an author writes or forwards share a certain similarity, related to that author's interest preferences. On the basis of this assumption, the present invention builds on the classical biterm topic model and proposes to aggregate the biterms of the same user while processing different users separately; this method is called UBTM. We also give a parameter estimation procedure for the UBTM model based on the Gibbs sampling algorithm and, from the biterm topic distributions of UBTM, a method for estimating the topics of a single short text. Finally, on the basis of UBTM topic analysis, texts are recommended to users, realizing a short text recommendation system based on the UBTM topic analysis technique.
It is generally accepted that a group of mutually related words can represent a "topic", and that the strength of the association between these words is determined by how often they occur together in the same piece of text. For example, "the Forbidden City", "Tian'anmen", "passenger flow", and "holidays" often appear in the same context (a microblog post or a piece of news), so we may consider that they probably belong to the same latent topic (Beijing attractions and crowds). Classical topic analysis methods model this generative process precisely by exploiting such word co-occurrence. Unlike those methods, the present invention draws on biterm statistics and, on that basis, uses user information to group the biterms.
Here we illustrate how biterms are generated. Suppose the following are messages posted by authors A and B, together with the effective words obtained after preprocessing.
Author | Text content | Effective words
A | Russia's fighter jet is shot down by Turkey | fighter jet, Russia, Turkey, shot down
B | Spring Festival travel train tickets go on sale on the 26th | Spring Festival travel, train ticket, 26th, on sale
A | Putin responds: if it is shot down again, we must retaliate! | Putin, shot down, retaliate
B | Jack Ma's chartered-plane group, Evergrande sets a record, progress every day | Jack Ma, chartered plane, Evergrande, progress every day
A biterm is simply a pairwise combination of the effective words within a single text, which expands the short text.
The classical biterm topic model directly aggregates all texts into one large document for training: the 21 biterms above would form a single corpus. It assumes that all short texts follow the same topic distribution; this assumption does not hold in all situations, it ignores personalized topic distributions for short texts, and its topic analysis effect is limited. The present invention instead considers that the documents of one user express the same group of topics: all biterms of a user are combined, and the biterms of different users follow different topic distributions. In the example, the 9 biterms of user A are combined and the 12 biterms of user B are combined, forming 2 corpora.
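As an illustration, a minimal sketch of this per-user biterm extraction (not the code of the invention; the function names and the toy token lists are assumptions):

```python
from itertools import combinations
from collections import defaultdict

def extract_biterms(tokens):
    """All unordered pairs of effective words within one short text."""
    return list(combinations(tokens, 2))

def aggregate_by_user(posts):
    """posts: list of (user, token_list); returns {user: [biterm, ...]}."""
    biterms_by_user = defaultdict(list)
    for user, tokens in posts:
        biterms_by_user[user].extend(extract_biterms(tokens))
    return biterms_by_user

# Toy example mirroring the table above (token lists are illustrative).
posts = [
    ("A", ["fighter_jet", "Russia", "Turkey", "shot_down"]),
    ("B", ["spring_travel", "train_ticket", "26th", "on_sale"]),
    ("A", ["Putin", "shot_down", "retaliate"]),
    ("B", ["Jack_Ma", "chartered_plane", "Evergrande", "progress"]),
]
biterm_sets = aggregate_by_user(posts)
print(len(biterm_sets["A"]), len(biterm_sets["B"]))  # 9 and 12 biterms, 21 in total
```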
The user-aggregated biterm topic method follows the generative process described above.
According to this process, the joint probability distribution of a biterm b = (w_i, w_j) under user u is:
$$P(b \mid u) = \sum_z P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z) = \sum_z \theta_{z|u}\,\phi_{w_i|z}\,\phi_{w_j|z}$$
Correspondingly, the likelihood of user u is:
$$P(u) = \prod_{(i,j)} P(b \mid u)$$
Because all biterms of a user are combined, the topic distributions of all documents of one user are taken to be consistent. This both alleviates the data sparsity problem that classical topic models suffer on short texts and avoids the loss of topic characteristics caused by aggregating too many documents.
Next, we give the parameter inference method of UBTM. The goal of solving the UBTM model is mainly to estimate reasonable values of {φ, θ}, and we adopt Gibbs sampling for approximate inference. Gibbs sampling is a widely accepted stochastic algorithm whose sampling process mirrors the model's generative process; it consumes few resources and is easy to run on large-scale data. The basic procedure of Gibbs sampling in the present invention is described below.
The basic idea of Gibbs sampling is to fix the assignments of all variables but one and to resample that variable according to its conditional probability given the rest, alternating over the variables in turn. The concrete sampling procedure is explained as follows:
1. First, traverse all biterms of all users and randomly assign each a topic, z_{b|u} ~ Multi(1/K), where b denotes a biterm, u denotes a user, z_{b|u} denotes the randomly assigned topic, K denotes the total number of topics, and Multi(θ) denotes the multinomial distribution with parameter θ.
2. Increment the corresponding counters n_{z|u}, n_u, n_{w_i|z}, n_{w_j|z}, and n_z by 1, where
n_{z|u} is the number of times topic z appears in the documents of user u;
n_u is the total number of topic assignments for user u;
n_{w_i|z}, n_{w_j|z} are the numbers of times the two words w_i, w_j of the biterm appear under topic z;
n_z is the total number of times topic z appears.
3. The following operation is iterated:
Traverse all users and, for each user, all of that user's biterms. Suppose the current biterm b of user u is currently assigned topic z. First decrement n_{z|u}, n_u, n_{w_i|z}, n_{w_j|z}, and n_z by 1; then, given the current assignments of all other biterms, resample a new topic z according to the topic probability distribution P(z | Z_{-b|u}, B, α, β); finally increment the corresponding counters n_{z|u}, n_u, n_{w_i|z}, n_{w_j|z}, and n_z by 1. Here Z_{-b|u} denotes the topic assignments of all biterms of the current user u other than b, and B is the set of all biterms of all users:
$$P(z \mid Z_{-b|u}, B, \alpha, \beta) \propto \frac{n_{z|u} + \alpha}{n_u + K\alpha} \cdot \frac{(n_{w_i|z} + \beta)(n_{w_j|z} + \beta)}{\left(\sum_w n_{w|z} + M\beta\right)^2}$$
where M denotes the total number of words in the vocabulary.
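A compact sketch of this collapsed Gibbs sampler is given below. It is an illustrative implementation of the update rule above under assumed data structures (word ids, a dict of per-user biterm lists), not the code of the invention:

```python
import numpy as np

def gibbs_ubtm(biterms_by_user, vocab_size, K, alpha, beta, n_iter=200, seed=0):
    """Collapsed Gibbs sampling of topic assignments for per-user biterms."""
    rng = np.random.default_rng(seed)
    n_zu = {u: np.zeros(K) for u in biterms_by_user}   # n_{z|u}: biterms of u on topic z
    n_wz = np.zeros((vocab_size, K))                   # n_{w|z}: word w on topic z
    n_z = np.zeros(K)                                  # sum_w n_{w|z}
    z_of = {u: rng.integers(K, size=len(bs)) for u, bs in biterms_by_user.items()}

    # Steps 1-2: random initial topics and counter initialization.
    for u, bs in biterms_by_user.items():
        for i, (wi, wj) in enumerate(bs):
            z = z_of[u][i]
            n_zu[u][z] += 1; n_wz[wi, z] += 1; n_wz[wj, z] += 1; n_z[z] += 2

    # Step 3: iterated resampling from P(z | Z_-b|u, B, alpha, beta).
    for _ in range(n_iter):
        for u, bs in biterms_by_user.items():
            for i, (wi, wj) in enumerate(bs):
                z = z_of[u][i]
                n_zu[u][z] -= 1; n_wz[wi, z] -= 1; n_wz[wj, z] -= 1; n_z[z] -= 2
                p = ((n_zu[u] + alpha) / (n_zu[u].sum() + K * alpha)
                     * (n_wz[wi] + beta) * (n_wz[wj] + beta)
                     / (n_z + vocab_size * beta) ** 2)
                z = rng.choice(K, p=p / p.sum())
                z_of[u][i] = z
                n_zu[u][z] += 1; n_wz[wi, z] += 1; n_wz[wj, z] += 1; n_z[z] += 2
    return n_wz, n_z, n_zu
```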
After the iterations are completed, the word distribution φ_{w|z} over each topic z and the topic distribution θ_{z|u} of each user can be inferred:
$$\phi_{w|z} = \frac{n_{w|z} + \beta}{\sum_w n_{w|z} + M\beta}, \qquad \theta_{z|u} = \frac{n_{z|u} + \alpha}{n_u + K\alpha}$$
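Continuing the sketch above, the point estimates follow directly from the final counts (illustrative code; names are assumptions):

```python
def estimate_phi_theta(n_wz, n_z, n_zu, alpha, beta):
    """phi_{w|z} and theta_{z|u} from the final Gibbs counts, per the formulas above."""
    M, K = n_wz.shape                                # vocabulary size, number of topics
    phi = (n_wz + beta) / (n_z + M * beta)           # phi[w, z] = P(w | z)
    theta = {u: (c + alpha) / (c.sum() + K * alpha)  # theta[u][z] = P(z | u)
             for u, c in n_zu.items()}
    return phi, theta
```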
To rank a list of short texts for personalized recommendation, this patent further gives, on the basis of UBTM, a method for estimating the topic distribution of a single short text. We assume that the topic distribution of a single short text equals the expectation of the topic distributions of the biterms in that short text:
$$P(z \mid d, u) = \sum_b P(z \mid b)\,P(b \mid d, u)$$
where d denotes the short text, b denotes a biterm, z denotes a topic, and u denotes the user to whom the short text belongs. P(z|b,u) is inferred from the φ_{w|z} and θ_{z|u} computed above according to Bayes' formula:
$$P(z \mid b, u) = \frac{P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z)}{\sum_z P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z)}$$
where P(z|u) = θ_{z|u} and P(w_i|z) = φ_{w_i|z}, and
$$P(b \mid d, u) = \frac{n_{b|d,u}}{\sum_b n_{b|d,u}}$$
where n_{b|d,u} denotes the number of times the biterm b occurs in the short text d of user u.
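A minimal sketch of this per-document topic inference, using the phi and theta estimated above (illustrative; names are assumptions):

```python
from collections import Counter
from itertools import combinations
import numpy as np

def infer_doc_topics(tokens, user, phi, theta):
    """P(z|d,u): biterm-frequency-weighted expectation of P(z|b,u)."""
    biterm_counts = Counter(combinations(tokens, 2))          # n_{b|d,u}
    total = sum(biterm_counts.values())
    K = len(theta[user])
    if total == 0:                                            # fewer than two tokens
        return np.full(K, 1.0 / K)
    p_zd = np.zeros(K)
    for (wi, wj), n_b in biterm_counts.items():
        p_zb = theta[user] * phi[wi] * phi[wj]                # unnormalized P(z|b,u)
        p_zd += (n_b / total) * p_zb / p_zb.sum()
    return p_zd
```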
The UBTM technique can be effectively used for content recommendation on social media such as Twitter and microblogs. At present, such social media produce large amounts of short text information every day, causing a serious information overload problem in which users have difficulty quickly locating content of interest. We use UBTM to learn each user's topic distribution from large numbers of short texts and to infer the topic distribution of each new short text. In the message feed of Twitter or a microblog, a user can see the messages published by all the friends he follows. Using the method above, the topic distribution P(z|d,f) of each message published by a friend f and the topic distribution P(z|u) of user u himself are computed, and the cosine similarity between them is calculated:
$$P(z \mid d, f) = (a_1, a_2, \ldots, a_K), \qquad P(z \mid u) = (b_1, b_2, \ldots, b_K)$$
$$\cos\theta = \frac{\sum_{i=1}^{K} a_i b_i}{\sqrt{\sum_{i=1}^{K} a_i^2}\,\sqrt{\sum_{i=1}^{K} b_i^2}}$$
The cosine similarity ranges over [-1, 1]; the closer the value is to 1, the closer the two vectors are, and the closer the current microblog is to the preferences of user u. By recommending to user u the N microblogs whose topic distributions are closest to user u's topic distribution, we realize a short text recommendation system based on the UBTM topic analysis technique.
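A minimal sketch of the ranking and top-N selection step (illustrative; function and variable names are assumptions):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two topic-distribution vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend_top_n(user_theta, candidate_docs, n=10):
    """candidate_docs: list of (doc_id, topic_vector); returns the n most similar."""
    scored = [(doc_id, cosine(user_theta, p_zd)) for doc_id, p_zd in candidate_docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```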
Compared with the classical biterm model, the present invention uses users to group the biterms, proposes an effective method for handling short text topics, and applies it in microblog recommendation, improving recommendation accuracy.
Example 1: quantitative evaluation of the topic analysis capability of the UBTM of the present invention
1. Description of the input and output data
We apply the method of the present invention to anonymized data from a real microblog service. The input is a set of microblog data whose statistics are shown in Table 1: the dataset contains 101,212 short texts, divided into 738 groups by user; on average each group contains 137.14 documents, and the average length of each document is 29 words. Several data samples are listed below.
Several samples of short text data
The output is the topic analysis quality evaluation metric of the UBTM topic model of the present invention.
2. Model learning and parameter inference
First, all microblogs and their corresponding users are read in, together with a Chinese stop-word list. For each microblog, meaningless stop words (e.g., "you", "what") are filtered out using the stop-word list, the remaining part is split into single words, and then, taking each microblog as a unit, the words are combined pairwise into biterms; the biterms of the same user are aggregated together, generating the biterm sets B_u, u ∈ U.
According to the model learning and inference procedure described above, through repeated Gibbs sampling iterations, the topic distribution of each user in the microblog community and the word distribution under each topic are learned.
3. Output results
Because a topic itself does not necessarily have a definite physical meaning and is often called a latent topic, we verify the quality of the topic analysis results in the two ways below.
First, we manually compared the topic analysis results of the traditional LDA method and of UBTM. For each method we selected, within the topics representing the same theme, the high-scoring and low-scoring words, where the score of a word is computed as P(w|z), the probability that the word belongs to the given topic. Ideally, the high-scoring words should express the meaning of the topic very clearly, and the low-scoring words should still have some correlation with the high-scoring words. In the experiment we selected the high-frequency word "hospital", found the related topic containing "hospital" under each of the two methods, and then located the high-scoring and low-scoring words under that topic. The high- and low-scoring words of the topics to which "hospital" belongs under each method are given below.
Qualitative comparison of the topics found by the UBTM and LDA topic analysis models
We can see that, compared with the traditional LDA topic analysis method, the high-scoring words of UBTM are clearly more related to one another, all being relevant to "accident", "rescue", and "hospital", and similar related words also appear among the low-scoring words. By comparison, the high-scoring words of LDA contain many words irrelevant to the topic, and words close to the topic hardly appear among its low-scoring words. This shows that, in the short-text setting, the topic analysis of UBTM is more effective than the traditional LDA method.
In addition, we select the coherence score evaluation method to quantify topic quality more precisely. For a topic z and its top T high-scoring words V^{(z)} = (v_1^{(z)}, ..., v_T^{(z)}) (ranked by P(w|z)), the coherence score is defined as follows:
$$C(z; V^{(z)}) = \sum_{m=2}^{T} \sum_{l=1}^{m-1} \log \frac{D(v_m^{(z)}, v_l^{(z)}) + 1}{D(v_l^{(z)})}$$
where D(v) is the number of documents in which word v occurs and D(v, v') is the number of documents in which words v and v' occur together. The higher the probability that words of the same topic occur in the same documents, the higher the topic quality, so the coherence score can be regarded as one measure of topic word quality. Meanwhile, to reduce the influence of the randomness of Gibbs sampling on the results and to measure the stability of the different models, we compute the coherence score of each topic under every method and take the mean as the final score:
$$\frac{1}{K} \sum_{k=1}^{K} C(z_k; V^{(z_k)})$$
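A small sketch of this coherence computation (illustrative; it assumes each document is given as a set of word ids):

```python
import math

def coherence(top_words, docs):
    """Coherence score of one topic, per the definition above; docs is a list of word-id sets."""
    def doc_freq(*words):
        return sum(1 for d in docs if all(w in d for w in words))
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1)
                              / doc_freq(top_words[l]))
    return score

def mean_coherence(topics_top_words, docs):
    """Average coherence over all topics, as used for the final score."""
    return sum(coherence(tw, docs) for tw in topics_top_words) / len(topics_top_words)
```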
The final results are shown in Table 2, where T is the number of high-scoring words chosen per topic. From the experimental results we can see that our method is consistent with expectations and outperforms the classical LDA topic model: whether the top 5, top 10, or top 20 words are chosen, the coherence score is better than that of LDA. Moreover, the standard deviation of our method is lower than that of LDA at T=5 and T=10, showing that the stability of our method is also relatively better.
Example 2: application evaluation in the microblog recommendation scenario
1. Description of the input and output data
In this example, we apply the topic analysis of the present invention to the practical scenario of microblog recommendation. From six months of microblog data we selected more than 7,000 relatively popular microblogs and observed 380,000 records of whether more than 20,000 users forwarded these microblogs. A forward can be taken as factual evidence that the user likes the microblog, so predicting forwarding behavior is the object of this experiment: we recommend microblogs to users according to UBTM, and measure the precision and recall of the recommendations by whether the users actually forwarded them.
The selection rule for the 380,000 records is as follows. First, we split the data into a training set and a test set by time: for each user, the microblogs forwarded by that user are sorted by time, the first 50% of the forwarding records are used as the training set, and the remaining 50% are used as the test set for forwarding prediction; the user's topic distribution is computed from the historical data, and the microblogs to be predicted do not participate in learning the user's topic distribution. It should be noted that we judge whether a user has read a certain microblog by the following rule: if the user forwarded or posted a microblog on a certain day, that day is considered an active day of the user, and every microblog posted or forwarded on an active day by the people the user follows is considered to have been read by the user. We use UBTM to compute the user's topic distribution and the topic distribution of each microblog, rank the microblogs posted by the user's friends during the active days to be predicted by their similarity to the user's own topic distribution, and recommend the top-ranked microblogs, which are likely to be close to the user's preferences; in the experiment, whether the user forwarded a recommended microblog is used to verify whether the user really liked it.
We recommend the top 3, 5, and 10 microblogs to each user, compute precision and recall, and finally average the precision and recall over all users; the results are compared with those of LDA. Table 3 shows the data of this example in detail.
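A minimal sketch of the precision/recall@N computation used here (illustrative; the ranked recommendation lists and forwarding labels are assumed inputs):

```python
def precision_recall_at_n(recommended, forwarded, n):
    """recommended: ranked doc ids; forwarded: set of doc ids the user actually forwarded."""
    top_n = recommended[:n]
    hits = sum(1 for doc_id in top_n if doc_id in forwarded)
    precision = hits / n
    recall = hits / len(forwarded) if forwarded else 0.0
    return precision, recall

def mean_precision_recall(per_user, n):
    """per_user: list of (recommended_list, forwarded_set) pairs, one per user."""
    pairs = [precision_recall_at_n(r, f, n) for r, f in per_user]
    return (sum(p for p, _ in pairs) / len(pairs),
            sum(r for _, r in pairs) / len(pairs))
```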
2. The microblog recommendation process
2.1 Computing user topic distributions with UBTM
Assuming that the topic distributions of all microblogs posted by the same user are consistent, all original and forwarded microblogs belonging to the same user in the training set are aggregated together, and the user's topic distribution and the word distributions over topics are computed with the same method as in Example 1.
2.2 Computing the topic distribution of each microblog in the test set
For every microblog in the test set, the topic distribution of the individual microblog is inferred from its author's topic distribution and the word distributions over topics obtained previously.
2.3 Computing similarity and recommending microblogs
The similarity between each microblog posted by the user's friends on the user's active days and the user's own topic preference distribution (the cosine similarity between the topic distribution vectors) is computed; the microblogs are ranked by similarity and the top 3, 5, or 10 microblogs are recommended to the user. Fig. 2 shows the architecture of our microblog short text recommendation system based on the UBTM topic analysis technique.
3. Output results
We compared the precision and recall of the microblogs recommended by the classical LDA model and by the UBTM model of the present invention. It should be noted that, in practice, whether a user forwards a certain microblog may depend on several factors, while this experiment only considers the influence of topic similarity on whether the user forwards; the absolute values of the predictions are therefore not directly meaningful, but comparing the prediction accuracy of LDA and UBTM under equal conditions does reflect which of them has the greater advantage in the field of short text recommendation. Table 4 shows the concrete precision and recall data of this experiment. We can see that, when recommending the top 3, 5, or 10 microblogs to a user, UBTM improves both precision and recall compared with the traditional LDA method. This proves that the present invention accurately extracts short text topic distributions and effectively predicts users' microblog forwarding behavior, and that the microblog short text recommendation system based on the UBTM topic analysis technique can reasonably be applied in practice.
Table 1. Data statistics of Example 1
Number of microblogs (short text documents) | 101212
Number of users (groups) | 738
Average number of microblogs per user (average documents per group) | 137.14
Average microblog length (average short text length) | 29
Table 2. Coherence score comparison of Example 1
Table 3. Data statistics of Example 2
Number of users | Number of microblogs | Forwarded | Not forwarded | Total records
22296 | 7663 | 209945 | 177396 | 387341
Table 4. Experimental results of the LDA and UBTM models in forwarding prediction

Claims (4)

1. A short text recommendation method based on a user-based biterm topic model, characterized in that, based on a biterm short text topic analysis technique with user-level text aggregation, this topic analysis technique is used to analyze a user's historical text information and obtain the user's topic preferences, thereby realizing a personalized short text recommendation method, specifically comprising:
1) building the user-aggregated biterm short text topic model (UBTM);
2) solving the UBTM model with Gibbs sampling and estimating the topics of short texts.
2. The short text recommendation method based on a user-based biterm topic model as claimed in claim 1, characterized in that the user-aggregated biterm short text topic model (UBTM) is built as follows:
any two words in a document are combined into a word pair (biterm), and the documents belonging to the same user are aggregated together, yielding a new probabilistic graphical model, UBTM;
the joint probability distribution of a biterm b = (w_i, w_j) of user u is inferred as:
$$P(b \mid u) = \sum_z P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z) = \sum_z \theta_{z|u}\,\phi_{w_i|z}\,\phi_{w_j|z}.$$
3. The short text recommendation method based on a user-based biterm topic model as claimed in claim 2, characterized in that the UBTM model is solved and the topics of short texts are estimated based on Gibbs sampling as follows:
in the parameter inference process of UBTM, an initial topic is first assigned at random to each biterm of each user, and then new topics are resampled for the biterms according to the conditional probability P(z | Z_{-b|u}, B, α, β); the posterior probability (iteration update rule) of the UBTM model is computed as follows,
$$P(z \mid Z_{-b|u}, B, \alpha, \beta) \propto \frac{n_{z|u} + \alpha}{n_u + K\alpha} \cdot \frac{(n_{w_i|z} + \beta)(n_{w_j|z} + \beta)}{\left(\sum_w n_{w|z} + M\beta\right)^2}$$
where M denotes the total number of words in the vocabulary and K denotes the total number of topics,
n_{z|u} denotes the total number of times the biterms of user u are assigned to topic z,
n_{w|z} denotes the total number of times the single word w is assigned to topic z,
n_u denotes the total number of topic assignments over all biterms of user u;
the word distribution φ_{w|z} of each topic and the topic distribution θ_{z|u} of each user are computed as follows:
$$\phi_{w|z} = \frac{n_{w|z} + \beta}{\sum_w n_{w|z} + M\beta}, \qquad \theta_{z|u} = \frac{n_{z|u} + \alpha}{n_u + K\alpha}.$$
4. The short text recommendation method based on a user-based biterm topic model as claimed in claim 3, characterized in that the method for inferring the topic distribution of a short text from the user's topic distribution is:
the topic distribution of a short text is assumed to equal the expectation of the topic distributions of all biterms in the short text:
$$P(z \mid d, u) = \sum_b P(z \mid b)\,P(b \mid d, u)$$
where d denotes the short text, b denotes a biterm, z denotes a topic, and u denotes the user to whom the short text belongs;
P(z|b) is inferred from the φ_{w|z} and θ_{z|u} computed above according to Bayes' formula,
$$P(z \mid b) = \frac{P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z)}{\sum_z P(z \mid u)\,P(w_i \mid z)\,P(w_j \mid z)}$$
where P(z|u) = θ_{z|u} and P(w_i|z) = φ_{w_i|z}, and
$$P(b \mid d, u) = \frac{n_{b|d,u}}{\sum_b n_{b|d,u}}$$
where n_{b|d,u} denotes the number of times the biterm b occurs in the short text d of user u.
CN201510979801.1A 2015-12-23 2015-12-23 Short text recommendation method for user-based biterm topic model Pending CN105608192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510979801.1A CN105608192A (en) 2015-12-23 2015-12-23 Short text recommendation method for user-based biterm topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510979801.1A CN105608192A (en) 2015-12-23 2015-12-23 Short text recommendation method for user-based biterm topic model

Publications (1)

Publication Number Publication Date
CN105608192A true CN105608192A (en) 2016-05-25

Family

ID=55988131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510979801.1A Pending CN105608192A (en) 2015-12-23 2015-12-23 Short text recommendation method for user-based biterm topic model

Country Status (1)

Country Link
CN (1) CN105608192A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354818A (en) * 2016-08-30 2017-01-25 电子科技大学 Dynamic user attribute extraction method based on social media
CN106447387A (en) * 2016-08-31 2017-02-22 上海交通大学 Air ticket personalized recommendation method based on shared account passenger prediction
CN106484829A (en) * 2016-09-29 2017-03-08 中国国防科技信息中心 A kind of foundation of microblogging order models and microblogging diversity search method
CN106708802A (en) * 2016-12-20 2017-05-24 西南石油大学 Information recommendation method and system
CN106776579A (en) * 2017-01-19 2017-05-31 清华大学 The sampling accelerated method of Biterm topic models
CN106815214A (en) * 2016-12-30 2017-06-09 东软集团股份有限公司 optimal theme number calculating method and device
CN107357785A (en) * 2017-07-05 2017-11-17 浙江工商大学 Theme feature word abstracting method and system, feeling polarities determination methods and system
CN107506377A (en) * 2017-07-20 2017-12-22 南开大学 This generation system is painted in interaction based on commending system
CN108182176A (en) * 2017-12-29 2018-06-19 太原理工大学 Enhance BTM topic model descriptor semantic dependencies and theme condensation degree method
CN108536868A (en) * 2018-04-24 2018-09-14 北京慧闻科技发展有限公司 The data processing method of short text data and application on social networks
CN108763484A (en) * 2018-05-25 2018-11-06 南京大学 A kind of law article recommendation method based on LDA topic models
CN109766431A (en) * 2018-12-24 2019-05-17 同济大学 A kind of social networks short text recommended method based on meaning of a word topic model
CN110134958A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text Topics Crawling method based on semantic word network
CN111191036A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Short text topic clustering method, device, equipment and medium
CN111611380A (en) * 2020-05-19 2020-09-01 北京邮电大学 Semantic search method, system and computer readable storage medium
CN115689089A (en) * 2022-10-25 2023-02-03 深圳市城市交通规划设计研究中心股份有限公司 Urban rail transit passenger riding probability calculation method, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEIZHENG CHEN et al.: "User Based Aggregation for Biterm Topic Model", ACL 2015 *
XIAOHUI YAN et al.: "A Biterm Topic Model for Short Texts", International World Wide Web Conference Committee 2013 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354818B (en) * 2016-08-30 2020-01-10 电子科技大学 Social media-based dynamic user attribute extraction method
CN106354818A (en) * 2016-08-30 2017-01-25 电子科技大学 Dynamic user attribute extraction method based on social media
CN106447387A (en) * 2016-08-31 2017-02-22 上海交通大学 Air ticket personalized recommendation method based on shared account passenger prediction
CN106484829B (en) * 2016-09-29 2019-05-17 中国国防科技信息中心 A kind of foundation and microblogging diversity search method of microblogging order models
CN106484829A (en) * 2016-09-29 2017-03-08 中国国防科技信息中心 A kind of foundation of microblogging order models and microblogging diversity search method
CN106708802A (en) * 2016-12-20 2017-05-24 西南石油大学 Information recommendation method and system
CN106815214B (en) * 2016-12-30 2019-11-22 东软集团股份有限公司 Optimal number of topics acquisition methods and device
CN106815214A (en) * 2016-12-30 2017-06-09 东软集团股份有限公司 optimal theme number calculating method and device
CN106776579A (en) * 2017-01-19 2017-05-31 清华大学 The sampling accelerated method of Biterm topic models
CN106776579B (en) * 2017-01-19 2019-05-31 清华大学 The sampling accelerated method of Biterm topic model
CN107357785A (en) * 2017-07-05 2017-11-17 浙江工商大学 Theme feature word abstracting method and system, feeling polarities determination methods and system
CN107506377A (en) * 2017-07-20 2017-12-22 南开大学 This generation system is painted in interaction based on commending system
CN108182176B (en) * 2017-12-29 2021-08-10 太原理工大学 Method for enhancing semantic relevance and topic aggregation of topic words of BTM topic model
CN108182176A (en) * 2017-12-29 2018-06-19 太原理工大学 Enhance BTM topic model descriptor semantic dependencies and theme condensation degree method
CN108536868A (en) * 2018-04-24 2018-09-14 北京慧闻科技发展有限公司 The data processing method of short text data and application on social networks
CN108536868B (en) * 2018-04-24 2022-04-15 北京慧闻科技(集团)有限公司 Data processing method and device for short text data on social network
CN108763484A (en) * 2018-05-25 2018-11-06 南京大学 A kind of law article recommendation method based on LDA topic models
CN109766431A (en) * 2018-12-24 2019-05-17 同济大学 A kind of social networks short text recommended method based on meaning of a word topic model
CN110134958A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text Topics Crawling method based on semantic word network
CN111191036A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Short text topic clustering method, device, equipment and medium
CN111611380A (en) * 2020-05-19 2020-09-01 北京邮电大学 Semantic search method, system and computer readable storage medium
CN111611380B (en) * 2020-05-19 2021-10-15 北京邮电大学 Semantic search method, system and computer readable storage medium
CN115689089A (en) * 2022-10-25 2023-02-03 深圳市城市交通规划设计研究中心股份有限公司 Urban rail transit passenger riding probability calculation method, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN105608192A (en) Short text recommendation method for user-based biterm topic model
CN107644089B (en) Hot event extraction method based on network media
CN104077417A (en) Figure tag recommendation method and system in social network
Canhoto et al. ‘We (don’t) know how you feel’–a comparative study of automated vs. manual analysis of social media conversations
JP5754854B2 (en) Contributor analysis apparatus, program and method for analyzing poster profile information
Liu et al. Mining urban perceptions from social media data
Salem et al. Personality traits for egyptian twitter users dataset
CN112214661B (en) Emotional unstable user detection method for conventional video comments
Wahyudi et al. Aspect based sentiment analysis in E-commerce user reviews using Latent Dirichlet Allocation (LDA) and Sentiment Lexicon
Zong et al. Measuring forecasting skill from text
Widiyaningtyas et al. Sentiment Analysis Of Hotel Review Using N-Gram And Naive Bayes Methods
CN104572915A (en) User event relevance calculation method based on content environment enhancement
Thrane Modelling tourists’ length of stay: A call for a ‘back-to-basic’approach
Lassen et al. Reviewer Preferences and Gender Disparities in Aesthetic Judgments
JP6699031B2 (en) Model learning method, description evaluation method, and device
Paudel et al. Using personality traits information from social media for music recommendation
Stankevich et al. Analysis of Big Five Personality Traits by Processing of Social Media Users Activity Features.
CN107590742B (en) Behavior-based social network user attribute value inversion method
Alvarez-Carmona et al. A comparative analysis of distributional term representations for author profiling in social media
Zhou et al. Predicting user influence under the environment of big data
KR102502454B1 (en) Real-time comment judgment device and method using ultra-high-speed artificial analysis intelligence
Stankevich et al. Predicting personality traits from social network profiles
CN112487303B (en) Topic recommendation method based on social network user attributes
Liang et al. JST-RR model: joint modeling of ratings and reviews in sentiment-topic prediction
CN107203632A (en) Topic Popularity prediction method based on similarity relation and cooccurrence relation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160525