CN107613520A - Method for discovering telecommunication user similarity based on an LDA topic model - Google Patents

Publication number
CN107613520A
Authority
CN
China
Prior art keywords
user
represent
theme
document
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710756540.6A
Other languages
Chinese (zh)
Other versions
CN107613520B (en
Inventor
解绍词
吴新凯
徐光侠
刘宴兵
程金伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201710756540.6A priority Critical patent/CN107613520B/en
Publication of CN107613520A publication Critical patent/CN107613520A/en
Application granted granted Critical
Publication of CN107613520B publication Critical patent/CN107613520B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of data mining and discloses a method for discovering telecommunication user similarity based on the LDA (Latent Dirichlet Allocation) topic model. The method combines the multidimensional features of telecommunication users with a topic discovery algorithm based on a probabilistic model, and computes user similarity from four aspects: the user's basic attributes, call records, short message records, and the location information of all base stations the user connects to in one day, together with the start time and end time of each connection. The invention focuses on modeling the corpus of each user's one-day base station connections with the LDA topic model: it uses the statistical properties of the text to mine the latent topic information hidden in it, obtains the topic distribution of each document, and computes document similarity from these distributions. This provides an effective basis for mining the similarity features of users in the telecommunication field in depth.

Description

Method for discovering telecommunication user similarity based on an LDA topic model
Technical field
The present invention relates to a method for discovering telecommunication user similarity based on the LDA (Latent Dirichlet Allocation) topic model, and belongs to the fields of data mining and topic modeling.
Background
In recent years, with the rise of the mobile Internet industry, the global telecommunication market has kept growing, technology has been updated ever faster, and competition among operators and between operators and Internet companies has grown increasingly fierce. Traditional voice and short message services have been hit hard by the social networking products of Internet companies. In response, telecommunication operators worldwide have proposed transformation strategies, shifting their service strategy from business-centric to customer-centric. Operators must therefore understand their customers more deeply, adjust their marketing strategies accordingly, and provide users with better services.
Now that big data has been raised to the level of national strategy, and given the massive user data operators have accumulated over many years, fully mining the potential value of telecommunication user data is significant not only for operators but also for every industry in society.
One research direction toward this goal is community detection on the telecommunication user network, and an important step in community detection is clustering similar users. Clustering assigns similar objects to the same cluster and dissimilar objects to different clusters. Because cluster analysis can establish the macroscopic distribution of the data, reveal the degree of correlation between data attributes, and support inferring correlations, clustering is widely used in data mining.
Existing methods for computing telecommunication user similarity consider feature attributes in user-related dimensions, but do not incorporate other features of mobile users, such as mobile app usage, browsing history, and base station location information. The similarity values they produce are therefore limited, which indirectly affects the accuracy of subsequent clustering. The LDA model is a probabilistic topic model for document collections, that is, a method for modeling the topic information of text data. It is a three-layer generative Bayesian network based on the following assumption: ignoring the syntactic structure of a document and the order in which words appear, a document is composed of several latent topics, and each topic is composed of several specific words. The present invention therefore abstracts each telecommunication user's base station location information as a document, uses the LDA topic model to compute the similarity between document topics, and combines this with three further aspects (the user's basic attributes, call relationships, and short message relationships) to assess user similarity comprehensively.
Summary of the invention
To overcome the deficiencies of the prior art, the object of the invention is to propose a method for discovering telecommunication user similarity based on an LDA topic model. The method combines the multidimensional features of telecommunication users with a topic discovery algorithm based on a probabilistic model and computes telecommunication user similarity from four aspects, which supports the accuracy of clustering.
To achieve this object, embodiments of the invention adopt the following technical scheme, comprising the steps:
S1: Collect user information;
S2: Preprocess the user information collected in S1;
S3: Compute similarity on the basic attributes, user call records, and user short message records in the information preprocessed in S2;
S4: Build an LDA model on the one-day base station connection location information of users in the information preprocessed in S2, and compute the similarity of this information;
S5: Compute the combined similarity and infer correlation;
S6: Cluster using the correlation inferred in S5.
Preprocessing the user information in S2 comprises 4 steps: data cleaning, data integration, data transformation, and data reduction.
The user basic attributes in S3 are the following 14 attributes: spending amount, online duration, whether gender is unknown, whether gender is female, whether gender is male, whether in an urban area, whether in a county town, whether in a rural area, whether the spending amount is between 0 and 100, whether the spending amount is between 100 and 200, whether the spending amount is between 200 and 300, whether the spending amount is between 300 and 500, whether the spending amount is between 500 and 1000, and whether the spending amount is greater than 1000.
Basic attributes of the telecommunication user: each user is abstracted as a feature vector, and the similarity of users' basic attributes is measured by the cosine of the angle between their vectors. The larger the value, the more features the users' basic attributes have in common.
User call records: first, in terms of call duration, the measure depends not only on the time the two users spend calling each other but also on the calls each of them has with their neighboring users. Second, in terms of the number of calls, suppose that within the same statistical period user a and user b had one 30-minute call, while user a and user c had six 5-minute calls; user a is clearly in closer contact with user c. Therefore, the longer the relative call duration between two users and the more calls they have, the higher their similarity.
User short message records: short message records are treated like call records, except that only the number of short messages exchanged between users is considered. The larger the share of the two users' mutual messages in the total number of messages they exchange with their neighboring users, the higher their similarity.
One-day base station connection location information: one day is divided into several time periods, and a place-transition document is built from the location labels of the base stations the user connects to in each period. This document is the input to the LDA topic model, which yields the topic distribution of the document; the similarity between documents is computed from these distributions.
The similarity based on the basic attributes of telecommunication users is computed as follows:
$$\mathrm{sim}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|}$$
where $\vec{a}$ denotes the N-dimensional feature vector of user a, $\vec{b}$ denotes the N-dimensional feature vector of user b, and $|\vec{a}|$ and $|\vec{b}|$ denote the lengths of the vectors.
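The cosine measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `cosine_similarity`, the sample users, and the assumption that spending amount and online duration have already been scaled to [0, 1] are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical 14-dimensional vectors in the attribute order listed in the text:
# [spending, online duration, gender unknown/female/male,
#  urban/county town/rural, six spending-range flags].
# The two numeric attributes are assumed to be pre-scaled to [0, 1].
user_a = [0.15, 0.40, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
user_b = [0.18, 0.35, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
user_c = [0.90, 0.10, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]

sim_ab = cosine_similarity(user_a, user_b)
sim_ac = cosine_similarity(user_a, user_c)
```

Users a and b share gender, region, and spending range, so `sim_ab` comes out larger than `sim_ac`.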
The similarity based on telecommunication users' call records and short message records is computed as follows:
where c denotes call duration, f denotes call frequency, and s denotes the number of short messages. $c_{ij}$ denotes the duration of calls initiated by user i to user j, $c_{ji}$ denotes the duration of calls initiated by user j to user i, $c_i$ denotes the total call duration between user i and its neighboring users (including user j), and $c_j$ denotes the total call duration between user j and its neighboring users (including user i). The other variables are defined analogously.
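The exact formula is not reproduced in this text, so the following Python sketch is only one hypothetical way to turn the quantities defined above into a score. It assumes, purely for illustration, that each of the three measures (duration, frequency, short messages) contributes the mean share of the pair's mutual traffic in each user's total traffic, and that the three shares are averaged. The names `pair_share` and `contact_similarity` are invented here.

```python
def pair_share(x_ij, x_ji, x_i, x_j):
    """Mean share of the i<->j traffic in each user's total traffic.

    Assumed form, not the patent's formula.
    """
    mutual = x_ij + x_ji
    share_i = mutual / x_i if x_i else 0.0
    share_j = mutual / x_j if x_j else 0.0
    return 0.5 * (share_i + share_j)


def contact_similarity(call_duration, call_count, sms_count):
    """Average the relative shares of call duration, call count and SMS count.

    Each argument is a tuple (x_ij, x_ji, x_i, x_j) as defined in the text.
    """
    parts = [
        pair_share(*call_duration),
        pair_share(*call_count),
        pair_share(*sms_count),
    ]
    return sum(parts) / len(parts)


# The example from the text: user a has one 30-minute call with user b
# and six 5-minute calls with user c (no short messages, no other contacts).
sim_ab = contact_similarity((30, 0, 60, 30), (1, 0, 7, 1), (0, 0, 0, 0))
sim_ac = contact_similarity((30, 0, 60, 30), (6, 0, 7, 6), (0, 0, 0, 0))
```

Under these assumptions the two pairs tie on relative call duration, and the higher call count makes `sim_ac` larger than `sim_ab`, which matches the qualitative claim in the text.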
Before the LDA model is built on the one-day base station connection location information, the following steps are performed:
(1) Attach one of 4 labels to each base station location in a given area: home location base station (Home), workplace base station (Work), other base station (Other), and no connection request received by any base station (No Reception). The 4 labels mean, respectively: the user is at home; the user is at work; the user is somewhere away from both home and workplace; the user's phone is switched off.
(2) Abstract the telecommunication user's itinerary for one day as a sequence of geographic location labels. First, build a fine-grained location description: divide one day into 20-minute time blocks and label each block with the base station location label that lasts longest within the block. One day of a user is thus abstracted as a vector of 72 location labels.
(3) To prevent over-fitting, then build a coarse-grained time description by dividing one day into 8 time slices: 0-6 am, 6-9 am, 9 am-12 noon, 12-2 pm, 2-5 pm, 5-7 pm, 7-9 pm, and 9 pm-12 midnight, numbered 0 to 7.
(4) Finally, build the place-transition corpus. Each term of a document in the corpus consists of the fine-grained location labels of 2 consecutive hours plus one coarse-grained time label, for example HHHHHH0 or HWWWWW2.
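The corpus construction steps above can be sketched as follows. Several details are assumptions made for this illustration and are not stated in the text: the single-letter codes `O` and `N` for Other and No Reception (only `H` and `W` appear in the examples), a sliding window that advances one block at a time, and taking the time slice from the window's first block.

```python
BLOCKS_PER_DAY = 72  # 24 hours divided into 20-minute blocks
WINDOW = 6           # 6 blocks = 2 consecutive hours

# Coarse time slices as (start_hour, end_hour), numbered 0..7 as in the text.
SLICES = [(0, 6), (6, 9), (9, 12), (12, 14), (14, 17), (17, 19), (19, 21), (21, 24)]


def slice_id(block_index):
    """Number of the coarse time slice containing the start of a block."""
    minute = block_index * 20
    for k, (start, end) in enumerate(SLICES):
        if start * 60 <= minute < end * 60:
            return k
    raise ValueError("block index out of range")


def day_to_document(labels):
    """Turn 72 fine-grained labels into a list of place-transition terms.

    Each term is 6 consecutive labels followed by the slice number of the
    window's first block (both the stride and this choice are assumptions).
    """
    if len(labels) != BLOCKS_PER_DAY:
        raise ValueError("expected 72 location labels")
    words = []
    for start in range(BLOCKS_PER_DAY - WINDOW + 1):
        window = "".join(labels[start:start + WINDOW])
        words.append(window + str(slice_id(start)))
    return words


# Hypothetical day: home until 8:00, work until 17:00,
# elsewhere until 18:00, then home again.
day = "H" * 24 + "W" * 27 + "O" * 3 + "H" * 18
document = day_to_document(day)
```

Each resulting document is one user-day, and its terms have the same shape as the `HHHHHH0` example in the text.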
An LDA model is built over all documents of the given place-transition corpus. The document set consists of the place-change sequences of the specified users within one day, and each term in the vocabulary consists of 6 fine-grained location labels and 1 coarse-grained time label. The generative process of the LDA model comprises the following steps:
(1) Draw the topic probability distribution of document i as $\vec{\theta}_i \sim \mathrm{Dir}(\vec{\alpha})$, where $\vec{\theta}_i$ denotes the topic distribution of the i-th document, Dir denotes the Dirichlet distribution, $i \in \{1, \dots, M\}$, M is the number of documents, and $\vec{\alpha}$ is the parameter of the Dirichlet prior on the topic distribution of each document, also called a hyperparameter.
(2) Draw the term probability distribution of topic k as $\vec{\phi}_k \sim \mathrm{Dir}(\vec{\beta})$, where $\vec{\phi}_k$ denotes the term distribution of the k-th topic, Dir denotes the Dirichlet distribution, $k \in \{1, \dots, K\}$, K is the number of topics, and $\vec{\beta}$ is the parameter of the Dirichlet prior on the word distribution of each topic, also called a hyperparameter.
(3) For each word $w_{i,j}$ in a document, draw a topic $z_{i,j} \sim \mathrm{Multinomial}(\vec{\theta}_i)$ from a multinomial distribution, then draw a term $w_{i,j} \sim \mathrm{Multinomial}(\vec{\phi}_{z_{i,j}})$ from a multinomial distribution. Here $w_{i,j}$ denotes the j-th term of the i-th document, $z_{i,j}$ denotes the topic index of the j-th term of the i-th document, $\vec{\theta}_i$ denotes the topic distribution of the i-th document, and $\vec{\phi}_{z_{i,j}}$ denotes the term distribution of topic $z_{i,j}$.
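The three generative steps above can be simulated directly. The sketch below uses only the Python standard library and assumes symmetric scalar hyperparameters `alpha` and `beta`; the function names and the toy sizes are hypothetical and chosen only to keep the example small.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    if total == 0.0:
        # Extremely unlikely underflow for very small alpha: fall back to a vertex.
        draws = [0.0] * len(alpha)
        draws[rng.randrange(len(alpha))] = 1.0
        return draws
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with the given probabilities (one multinomial trial)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_len, num_topics, vocab_size, alpha, beta, seed=0):
    rng = random.Random(seed)
    # Step (2): one term distribution phi_k per topic.
    phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(num_topics)]
    docs, thetas = [], []
    for _ in range(num_docs):
        # Step (1): one topic distribution theta_i per document.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            # Step (3): draw a topic, then a term from that topic.
            z = sample_categorical(theta, rng)
            words.append(sample_categorical(phi[z], rng))
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, phi


docs, thetas, phi = generate_corpus(
    num_docs=3, doc_len=10, num_topics=4, vocab_size=7, alpha=1.0, beta=0.5
)
```

In the setting of the text, each term id would stand for one place-transition term and each document for one user-day.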
From the LDA model obtained by the above process, the joint probability distribution of a document given the hyperparameters is computed as:
$$p(\vec{w}_m, \vec{z}_m, \vec{\theta}_m, \Phi \mid \vec{\alpha}, \vec{\beta}) = \prod_{n=1}^{N_m} p(w_{m,n} \mid \vec{\phi}_{z_{m,n}})\, p(z_{m,n} \mid \vec{\theta}_m) \cdot p(\vec{\theta}_m \mid \vec{\alpha}) \cdot p(\Phi \mid \vec{\beta})$$
where $\vec{w}_m$ denotes the vector of all words in document m, $\vec{z}_m$ denotes the topic vector corresponding to document m, $\vec{\theta}_m$ denotes the topic probability distribution of document m, $\Phi$ denotes the term probability distributions of all topics, $\vec{\alpha}$ and $\vec{\beta}$ are the hyperparameters of the Dirichlet distributions, $N_m$ denotes the length of document m, $w_{m,n}$ denotes the n-th term of the m-th document, $\vec{\phi}_{z_{m,n}}$ denotes the term distribution of topic $z_{m,n}$, and $z_{m,n}$ denotes the topic index of the n-th term of the m-th document.
Based on the joint probability distribution obtained above, parameters are estimated during modeling by Gibbs sampling. The initial number of topics is set to K = 30, the hyperparameters to α = 30/K and β = 0.01, and the number of Gibbs sampling iterations to 1000. Topic mining is performed on the corpus to produce the topic probability distribution of each document, $P(z=k) = \theta_k^{(d)}$, and the term probability distribution under each topic, $P(w \mid z=k) = \phi_w^{(k)}$.
The topic probability distribution of each document is computed as follows:
$$\theta_{m,k} = \frac{n_{m,k} + \alpha_k}{\sum_{k'=1}^{K} \left(n_{m,k'} + \alpha_{k'}\right)}$$
where $\theta_{m,k}$ denotes the probability of the k-th topic in the m-th document, $n_{m,k}$ denotes the number of times the k-th topic occurs in document m, K denotes the total number of topics, and α is the first parameter vector.
The term probability distribution under each topic is computed as follows:
$$\phi_{k,w} = \frac{n_{k,w} + \beta_w}{\sum_{w'=1}^{V} \left(n_{k,w'} + \beta_{w'}\right)}$$
where $\phi_{k,w}$ denotes the probability of the w-th term under the k-th topic, $n_{k,w}$ denotes the number of times the w-th term occurs under the k-th topic, V denotes the total number of terms, and β is the second parameter vector.
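For readers who want to see how such estimates arise, here is a compact collapsed Gibbs sampler in plain Python. It is a sketch under stated assumptions rather than the patent's implementation: the priors are symmetric scalars `alpha` and `beta`, documents are lists of integer term ids, and the function name `gibbs_lda` and the toy corpus are hypothetical. The final two list comprehensions are the smoothed count ratios for the per-document topic distribution and the per-topic term distribution.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha, beta, iterations, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric priors."""
    rng = random.Random(seed)
    n_mk = [[0] * num_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # term counts per topic
    n_k = [0] * num_topics                                # total terms per topic
    z = []
    # Random initial topic assignment for every word position.
    for m, doc in enumerate(docs):
        z_m = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_m.append(k)
            n_mk[m][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_m)
    topics = range(num_topics)
    for _ in range(iterations):
        for m, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[m][n]
                n_mk[m][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Full conditional p(z = t | all other assignments), unnormalised.
                weights = [
                    (n_mk[m][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in topics
                ]
                k = rng.choices(topics, weights=weights, k=1)[0]
                z[m][n] = k
                n_mk[m][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    theta = [
        [(n_mk[m][k] + alpha) / (len(doc) + num_topics * alpha) for k in topics]
        for m, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta) for w in range(vocab_size)]
        for k in topics
    ]
    return theta, phi


# Toy corpus of term ids; the text's settings would be K = 30,
# alpha = 30 / K, beta = 0.01 and 1000 iterations.
toy_docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2]]
theta, phi = gibbs_lda(toy_docs, num_topics=2, vocab_size=4,
                       alpha=0.5, beta=0.01, iterations=50)
```

A production system would more likely rely on an established topic modeling library; the point here is only to show where the count ratios come from.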
From the document topic probability distributions obtained above, the distance between the topic probability distributions of two documents is computed as follows:
where $d_1$ and $d_2$ denote the two documents, i denotes the index of the i-th topic, $p_i^{(d_1)}$ denotes the probability that document $d_1$ takes topic i, and $p_i^{(d_2)}$ denotes the probability that document $d_2$ takes topic i.
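The distance formula itself is not reproduced in this text. As a hedged illustration only, the sketch below uses the Jensen-Shannon divergence, a common symmetric choice for comparing topic distributions, and maps it to a similarity in [0, 1]. Both the choice of divergence and the function names are assumptions and may differ from the measure the patent actually uses.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln 2."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)


def topic_similarity(p, q):
    """Map the divergence to a similarity: 1 for identical distributions, 0 for disjoint ones."""
    return 1.0 - js_divergence(p, q) / math.log(2)


# Hypothetical topic distributions of two user-day documents.
d1 = [0.7, 0.2, 0.1]
d2 = [0.6, 0.3, 0.1]
d3 = [0.1, 0.1, 0.8]
```

With these inputs `topic_similarity(d1, d2)` is larger than `topic_similarity(d1, d3)`, since the first two documents concentrate on the same topic.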
The method for discovering telecommunication user similarity based on an LDA topic model combines the multidimensional features of users with the probability-based topic discovery algorithm (LDA), and the combined user similarity is computed as follows:
where u1 and u2 denote user 1 and user 2; $\eta_1$ is the weight of the similarity computed from basic attributes, set to $\eta_1 = 0.1$; $\eta_2$ is the weight of the similarity computed from call records and short message records, set to $\eta_2 = 0.3$; $\eta_3$ is the weight of the similarity computed from the user's one-day base station connection location information, set to $\eta_3 = 0.6$. The remaining parameters are as defined above.
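Because the three weights sum to 1, a natural reading is a weighted sum of the three partial similarities. The sketch below assumes exactly that; the combined formula is not reproduced in this text, so the linear form and the function name `combined_similarity` are assumptions.

```python
def combined_similarity(sim_attr, sim_contact, sim_location,
                        weights=(0.1, 0.3, 0.6)):
    """Weighted sum of the three partial similarities (assumed linear form).

    weights = (eta1, eta2, eta3) for basic attributes, call/SMS records,
    and one-day base station location information, as set in the text.
    """
    eta1, eta2, eta3 = weights
    if abs(eta1 + eta2 + eta3 - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    return eta1 * sim_attr + eta2 * sim_contact + eta3 * sim_location


# Hypothetical partial similarities for a pair of users.
score = combined_similarity(0.5, 0.2, 0.9)
```

With the weights from the text, the location-based similarity dominates the combined score.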
Beneficial effects of the invention:
1. Base station connection information of telecommunication users is introduced, base stations are divided into different categories, and the user's location behavior within one day is modeled with LDA, so that the similarity of users' daily behavior is fully mined.
2. Time slices are introduced to divide the user's day, so that the user's daily habits are described at both a coarse and a fine granularity, which avoids over-fitting as far as possible.
3. The multidimensional features of telecommunication users are combined with the topic discovery algorithm (LDA) based on a probabilistic model, and telecommunication user similarity is assessed from four aspects, which is comprehensive and reasonable.
Brief description of the drawings
Fig. 1 is a schematic diagram of the similarity calculation method of the invention.
Fig. 2 is a diagram of the LDA topic model of the invention.
Fig. 3 is a schematic diagram of the topology of the LDA model used by the invention. The LDA model treats each document as a mixture of multiple topics, and each topic is characterized by multiple terms.
Detailed description
To make the purpose, technical scheme, and advantages of the invention clearer, the invention is described in further detail below with reference to specific embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
The method for discovering telecommunication user similarity based on an LDA topic model assesses user similarity from four aspects:
Basic attributes of the telecommunication user: each user is abstracted as a feature vector, and the similarity of users' basic attributes is measured by the cosine of the angle between their vectors. The larger the value, the more features the users' basic attributes have in common.
User call records: first, in terms of call duration, the measure depends not only on the time the two users spend calling each other but also on the calls each of them has with their neighboring users. Second, in terms of the number of calls, suppose that within the same statistical period user a and user b had one 30-minute call, while user a and user c had six 5-minute calls; user a is clearly in closer contact with user c. Therefore, the longer the relative call duration between two users and the more calls they have, the higher their similarity.
User short message records: short message records are treated like call records, except that only the number of short messages exchanged between users is considered. The larger the share of the two users' mutual messages in the total number of messages they exchange with their neighboring users, the higher their similarity.
One-day base station connection location information: the 24 hours of one day are divided into several time periods, and a place-transition document is built from the location labels of the base stations the user connects to in each period. This document is the input to the LDA topic model, which yields the topic distribution of the document; the similarity between documents is computed from these distributions.
As shown in Fig. 1, the data required by the method undergo the following steps before use: data cleaning, data integration, data reduction, and data transformation.
Next, the user's basic attributes are extracted from the preprocessed data. There are 14 of them: spending amount, online duration, whether gender is unknown, whether gender is female, whether gender is male, whether in an urban area, whether in a county town, whether in a rural area, whether the spending amount is between 0 and 100, whether the spending amount is between 100 and 200, whether the spending amount is between 200 and 300, whether the spending amount is between 300 and 500, whether the spending amount is between 500 and 1000, and whether the spending amount is greater than 1000. These attributes have already been reduced and transformed, so the user attributes can be abstracted as a feature vector and the similarity computed as follows:
$$\mathrm{sim}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|}$$
where $\vec{a}$ denotes the N-dimensional feature vector of user a, $\vec{b}$ denotes the N-dimensional feature vector of user b, and $|\vec{a}|$ and $|\vec{b}|$ denote the lengths of the vectors.
Then the users' call records and short message records are extracted from the data set, the call duration, number of calls, and number of short messages between users are counted, and the similarity is computed as follows:
where c denotes call duration, f denotes call frequency, and s denotes the number of short messages. $c_{ij}$ denotes the duration of calls initiated by user i to user j, $c_{ji}$ denotes the duration of calls initiated by user j to user i, $c_i$ denotes the total call duration between user i and its neighboring users (including user j), and $c_j$ denotes the total call duration between user j and its neighboring users (including user i). The other variables are defined analogously.
Finally, an LDA model is built from the users' base station connection information, in the following steps:
(1) Attach one of 4 labels to each base station location in a given area: home location base station (Home), workplace base station (Work), other base station (Other), and no connection request received by any base station (No Reception). The 4 labels mean, respectively: the user is at home; the user is at work; the user is somewhere away from both home and workplace; the user's phone is switched off.
(2) Abstract the telecommunication user's itinerary for one day as a sequence of geographic location labels. First, build a fine-grained location description: divide one day into 20-minute time blocks and label each block with the base station location label that lasts longest within the block. One day of a user is thus abstracted as a vector of 72 location labels. To prevent over-fitting, then build a coarse-grained time description by dividing one day into 8 time slices: 0-6 am, 6-9 am, 9 am-12 noon, 12-2 pm, 2-5 pm, 5-7 pm, 7-9 pm, and 9 pm-12 midnight, numbered 0 to 7.
(3) Build the place-transition corpus. Each term of a document in the corpus consists of the fine-grained base station location labels of 2 consecutive hours plus one coarse-grained time label. An LDA model is built over all documents of the given place-transition corpus, as shown in Fig. 2.
The generative process of the LDA model comprises the following steps:
(1) Draw the topic probability distribution of document i as $\vec{\theta}_i \sim \mathrm{Dir}(\vec{\alpha})$, where $\vec{\theta}_i$ denotes the topic distribution of the i-th document, Dir indicates that it follows a Dirichlet distribution, $i \in \{1, \dots, M\}$, M is the number of documents, and $\vec{\alpha}$ is a hyperparameter.
(2) Draw the term probability distribution of topic k as $\vec{\phi}_k \sim \mathrm{Dir}(\vec{\beta})$, where $\vec{\phi}_k$ denotes the term distribution of the k-th topic, Dir indicates that it follows a Dirichlet distribution, $k \in \{1, \dots, K\}$, K is the number of topics, and $\vec{\beta}$ is a hyperparameter.
(3) For each word $w_{i,j}$ in a document, draw a topic $z_{i,j} \sim \mathrm{Multinomial}(\vec{\theta}_i)$ from a multinomial distribution, then draw a term $w_{i,j} \sim \mathrm{Multinomial}(\vec{\phi}_{z_{i,j}})$ from a multinomial distribution. Here $w_{i,j}$ denotes the j-th term of the i-th document, $z_{i,j}$ denotes the topic index of the j-th term of the i-th document, $\vec{\theta}_i$ denotes the topic distribution of the i-th document, and $\vec{\phi}_{z_{i,j}}$ denotes the term distribution of topic $z_{i,j}$.
The LDA model obtained by the above process has a clear hierarchical structure, as shown in Fig. 3: each document is a mixture of multiple topics, and each topic is characterized by multiple terms. The joint probability distribution of a document given the hyperparameters is therefore computed as:
$$p(\vec{w}_m, \vec{z}_m, \vec{\theta}_m, \Phi \mid \vec{\alpha}, \vec{\beta}) = \prod_{n=1}^{N_m} p(w_{m,n} \mid \vec{\phi}_{z_{m,n}})\, p(z_{m,n} \mid \vec{\theta}_m) \cdot p(\vec{\theta}_m \mid \vec{\alpha}) \cdot p(\Phi \mid \vec{\beta})$$
where $\vec{w}_m$ denotes the vector of all words in document m, $\vec{z}_m$ denotes the topic vector corresponding to document m, $\vec{\theta}_m$ denotes the topic probability distribution of document m, $\Phi$ denotes the term probability distributions of all topics, $\vec{\alpha}$ and $\vec{\beta}$ are the hyperparameters of the Dirichlet distributions, $N_m$ denotes the length of document m, $w_{m,n}$ denotes the n-th term of the m-th document, $\vec{\phi}_{z_{m,n}}$ denotes the term distribution of topic $z_{m,n}$, and $z_{m,n}$ denotes the topic index of the n-th term of the m-th document.
Based on the joint probability distribution obtained above, parameters are estimated during modeling by Gibbs sampling. The initial number of topics is set to K = 30, the hyperparameters to α = 30/K and β = 0.01, and the number of Gibbs sampling iterations to 1000. Topic mining is performed on the corpus to produce the topic probability distribution of each document, $P(z=k) = \theta_k^{(d)}$, and the term probability distribution under each topic, $P(w \mid z=k) = \phi_w^{(k)}$.
The topic probability distribution of each document is computed as follows:
$$\theta_{m,k} = \frac{n_{m,k} + \alpha_k}{\sum_{k'=1}^{K} \left(n_{m,k'} + \alpha_{k'}\right)}$$
where $\theta_{m,k}$ denotes the probability of the k-th topic in the m-th document, $n_{m,k}$ denotes the number of times the k-th topic occurs in document m, K denotes the total number of topics, and α is the first parameter vector.
The term probability distribution under each topic is computed as follows:
$$\phi_{k,w} = \frac{n_{k,w} + \beta_w}{\sum_{w'=1}^{V} \left(n_{k,w'} + \beta_{w'}\right)}$$
where $\phi_{k,w}$ denotes the probability of the w-th term under the k-th topic, $n_{k,w}$ denotes the number of times the w-th term occurs under the k-th topic, V denotes the total number of terms, and β is the second parameter vector.
Document similarity can therefore be measured by the distance between the topic probability distributions of two documents, computed as follows:
where $d_1$ and $d_2$ denote the two documents, i denotes the index of the i-th topic, $p_i^{(d_1)}$ denotes the probability that document $d_1$ takes topic i, and $p_i^{(d_2)}$ denotes the probability that document $d_2$ takes topic i.
Finally, the multidimensional features of users are combined with the probability-based topic discovery algorithm (LDA), and the combined user similarity is computed as follows:
where u1 and u2 denote user 1 and user 2; $\eta_1$ is the weight of the similarity computed from basic attributes, set to $\eta_1 = 0.1$; $\eta_2$ is the weight of the similarity computed from call records and short message records, set to $\eta_2 = 0.3$; $\eta_3$ is the weight of the similarity computed from the user's one-day base station connection location information, set to $\eta_3 = 0.6$. The remaining parameters are as defined above.
It is obvious to a person skilled in the art that the invention is not limited to the details of the above exemplary embodiment, and that the invention can be realized in other specific forms without departing from its spirit or essential attributes. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and scope of equivalents of the claims are intended to be included in the invention. No reference sign in a claim should be regarded as limiting that claim.
Moreover, although this specification is described in terms of embodiments, not every embodiment contains only one independent technical scheme. The specification is written this way only for clarity. Those skilled in the art should take the specification as a whole; the technical schemes of the various embodiments may also be suitably combined to form other embodiments that those skilled in the art can understand.

Claims (10)

1. A method for discovering telecommunication user similarity based on an LDA topic model, characterized in that it comprises the following steps:
S1: Collect user information;
S2: Preprocess the user information collected in S1;
S3: Compute similarity separately on the basic attributes, user call records, and user short message records in the information preprocessed in S2;
S4: Build an LDA model on the one-day base station connection location information of users in the information preprocessed in S2, and compute the similarity of this information;
S5: Combine the similarities of S3 and S4 and infer correlation;
S6: Cluster using the correlation inferred in S5.
2. The method for discovering telecommunication user similarity based on an LDA topic model according to claim 1, characterized in that the preprocessing in S2 of the user information collected in S1 comprises 4 steps: data cleaning, data integration, data transformation, and data reduction.
3. The method for discovering telecommunication user similarity based on an LDA topic model according to claim 1, characterized in that the user basic attributes in S3 are the following 14 attributes: spending amount, online duration, whether gender is unknown, whether gender is female, whether gender is male, whether in an urban area, whether in a county town, whether in a rural area, whether the spending amount is between 0 and 100, whether the spending amount is between 100 and 200, whether the spending amount is between 200 and 300, whether the spending amount is between 300 and 500, whether the spending amount is between 500 and 1000, and whether the spending amount is greater than 1000.
4. The method for discovering telecommunication user similarity based on an LDA topic model according to claim 1, characterized in that the user basic attribute similarity in S3 is computed as follows:
$$\mathrm{sim}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|}$$
where $\vec{a}$ denotes the N-dimensional feature vector of user a, $\vec{b}$ denotes the N-dimensional feature vector of user b, $|\vec{a}|$ and $|\vec{b}|$ denote the respective lengths of the vectors, and $\mathrm{sim}(\vec{a}, \vec{b})$ is the similarity of the users' basic attributes; the larger the value, the more features the users' basic attributes have in common.
5. The method for discovering telecommunication user similarity based on an LDA topic model according to claim 1, characterized in that the similarity of user call records and user short message records in S3 is computed as follows:
where P(C, S) is the similarity of user call records and user short message records, c denotes call duration, f denotes call frequency, and s denotes the number of short messages; $c_{ij}$ denotes the duration of calls initiated by user i to user j, $c_{ji}$ denotes the duration of calls initiated by user j to user i, $c_i$ denotes the total call duration between user i and its neighboring users, and $c_j$ denotes the total call duration between user j and its neighboring users; $f_{ij}$ denotes the frequency of calls initiated by user i to user j, $f_{ji}$ denotes the frequency of calls initiated by user j to user i, $f_i$ denotes the total call frequency between user i and its neighboring users, and $f_j$ denotes the total call frequency between user j and its neighboring users; $s_{ij}$ denotes the number of short messages sent by user i to user j, $s_{ji}$ denotes the number of short messages sent by user j to user i, $s_i$ denotes the total number of short messages between user i and its neighboring users, and $s_j$ denotes the total number of short messages between user j and its neighboring users.
6. The telecommunication user similarity discovery method based on an LDA topic model according to claim 1, characterized in that, in S4, the steps of building an LDA model from the location information of the base stations a user connects to in one day and calculating the similarity of that information are:
S41: Pre-modeling setup;
S42: Build the LDA model;
S43: Estimate the parameters by Gibbs sampling, and calculate document similarity from the computed topic probability distribution and word probability distribution.
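Step S43 names Gibbs sampling without giving details. A common choice is the collapsed Gibbs sampler for LDA, sketched below in pure Python under that assumption. The function name, the symmetric hyperparameter values and the integer word encoding are hypothetical and not taken from the patent.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (theta, phi): document-topic and topic-word probability estimates.
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_k = [0] * num_topics                                # words per topic
    z = []                                                # topic of every word

    # Random initial topic assignment.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][j]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][j] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    theta = [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return theta, phi


# Hypothetical toy corpus with a 4-word vocabulary.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 0, 1]]
theta, phi = gibbs_lda(toy_docs, num_topics=2, vocab_size=4, iters=50)
```

Each row of `theta` is the topic distribution of one document, which is the quantity the document similarity step compares.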
7. The telecommunication user similarity discovery method based on an LDA topic model according to claim 6, characterized in that the pre-modeling setup in S41 comprises the following steps:
S411: Attach one of 4 labels to each base station location in a given region: home-location base station, work-location base station, other base station, and no base station receiving any connection request. The 4 labels mean, respectively: the user is at home; the user is at work; the user is far from both home and workplace; the user's mobile phone is switched off;
S412: Divide one day into 20-minute time blocks and build a fine-grained location description as a vector of 72 location labels; then divide the day into 8 time slices, namely 0~6am, 6~9am, 9~12am, 12~2pm, 2~5pm, 5~7pm, 7~9pm and 9~12pm, numbered 0~7, to build a coarse-grained time description;
S413: Build the location-transition corpus, in which each term of a document consists of the fine-grained location labels of 2 consecutive hours and one coarse-grained time label.
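Steps S411 to S413 can be sketched as follows. The single-letter label codes, the sliding one-block window and the rule that a term takes the time slice of its first block are assumptions made for illustration; the claim only fixes the 20-minute blocks, the 8 time slices and the 2-hour span of a term.

```python
# Hypothetical one-letter codes for the 4 labels of S411.
LABELS = ("H", "W", "O", "N")  # home, work, other, no connection

BLOCK_MINUTES = 20
BLOCKS_PER_DAY = 24 * 60 // BLOCK_MINUTES  # 72 fine-grained labels per day
WORD_BLOCKS = 2 * 60 // BLOCK_MINUTES      # 6 labels span 2 consecutive hours

# Starting hour of each coarse time slice 0..7 from S412.
SLICE_STARTS = (0, 6, 9, 12, 14, 17, 19, 21)


def time_slice(block_index):
    """Coarse time slice (0..7) containing the start of a 20-minute block."""
    hour = (block_index * BLOCK_MINUTES) // 60
    return max(i for i, start in enumerate(SLICE_STARTS) if hour >= start)


def day_to_document(labels):
    """Turn one user's 72 location labels into the terms of one document."""
    if len(labels) != BLOCKS_PER_DAY:
        raise ValueError("expected 72 location labels for one day")
    if any(label not in LABELS for label in labels):
        raise ValueError("unknown location label")
    terms = []
    # Assumption: the 2-hour window slides forward one block at a time.
    for start in range(BLOCKS_PER_DAY - WORD_BLOCKS + 1):
        window = "".join(labels[start:start + WORD_BLOCKS])
        terms.append(f"{window}{time_slice(start)}")
    return terms


# Hypothetical day: home until 8am, work until 6pm, elsewhere until 8pm, then home.
day = ["H"] * 24 + ["W"] * 30 + ["O"] * 6 + ["H"] * 12
document = day_to_document(day)
```

A term such as `HHHWWW1` then reads as "left home for work during the 6~9am slice", which is the kind of location transition the corpus is meant to capture.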
8. The telecommunication user similarity discovery method based on an LDA topic model according to claim 6, characterized in that building the LDA model in S42 comprises the following steps:
S421: Select the topic probability distribution of document i as θ_i ~ Dir(α), where θ_i denotes the topic distribution of the i-th document, Dir denotes the Dirichlet distribution, i belongs to {1 ... M}, M is the number of documents, and α is the parameter of the Dirichlet prior over the topic distribution of each document, also called a hyperparameter;
S422: Select the term probability distribution of topic k as φ_k ~ Dir(β), where φ_k denotes the term distribution of the k-th topic, Dir denotes the Dirichlet distribution, k belongs to {1 ... K}, K is the number of topics, and β is the parameter of the Dirichlet prior over the word distribution of each topic, also called a hyperparameter;
S423: For each word w_{i,j} in a document, select a topic z_{i,j} ~ Multinomial(θ_i), which follows a multinomial distribution; then select a term w_{i,j} ~ Multinomial(φ_{z_{i,j}}), which also follows a multinomial distribution;
Wherein w_{i,j} denotes the j-th term of the i-th document, z_{i,j} denotes the topic index of the j-th term of the i-th document, θ_i denotes the i-th document, and φ_{z_{i,j}} denotes the distribution of topic z_{i,j}.
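The generative process of S421 to S423 can be simulated directly. The sketch below assumes symmetric Dirichlet priors and draws Dirichlet samples by normalizing gamma variates; the function names and the hyperparameter values are hypothetical.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(num_docs, doc_len, num_topics, vocab_size,
                    alpha=0.5, beta=0.5, seed=0):
    """Simulate the LDA generative process; words are integer term ids."""
    rng = random.Random(seed)
    # S422: one term distribution per topic.
    phi = [sample_dirichlet(rng, beta, vocab_size) for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # S421: one topic distribution per document.
        theta = sample_dirichlet(rng, alpha, num_topics)
        doc = []
        for _ in range(doc_len):
            # S423: draw a topic for the word, then a term from that topic.
            z = rng.choices(range(num_topics), weights=theta)[0]
            w = rng.choices(range(vocab_size), weights=phi[z])[0]
            doc.append(w)
        docs.append(doc)
    return docs, phi


docs, phi = generate_corpus(num_docs=3, doc_len=10, num_topics=2, vocab_size=5)
```

Parameter estimation runs this process in reverse: given only the observed terms, it recovers plausible values for the topic and term distributions.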
9. The telecommunication user similarity discovery method based on an LDA topic model according to claim 6, characterized in that the document similarity in S43 is calculated as follows:
Wherein d1 and d2 denote two documents, i denotes the index of the i-th topic, the two probabilities are those of document d1 and document d2 being assigned topic i, and K denotes the total number of topics in the m documents.
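The claim compares two documents through their per-topic probabilities over K topics, but the expression itself is not legible in this text. As a stand-in, the sketch below uses one minus the Jensen-Shannon divergence (base 2), a common way to compare topic distributions. This is a swapped-in technique chosen for illustration, not the patent's confirmed formula, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_similarity(p, q):
    """1 - Jensen-Shannon divergence, so 1 means identical topic distributions."""
    if len(p) != len(q):
        raise ValueError("both documents must use the same number of topics")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return 1.0 - jsd


# Hypothetical topic distributions of two documents over K = 3 topics.
doc1_topics = [0.7, 0.2, 0.1]
doc2_topics = [0.6, 0.3, 0.1]
doc_similarity = js_similarity(doc1_topics, doc2_topics)
```

With base-2 logarithms the divergence lies in [0, 1], so the resulting similarity is also bounded between 0 and 1.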
10. The telecommunication user similarity discovery method based on an LDA topic model according to claim 1, characterized in that the comprehensive similarity in S5 is calculated as follows:
Wherein u1 and u2 denote user 1 and user 2; η1 is the weight of the similarity calculated from basic attributes, set to η1 = 0.1; η2 is the weight of the similarity calculated from call records and short message records, set to η2 = 0.3; η3 is the weight of the similarity calculated from the location information of the base stations a user connects to in one day, set to η3 = 0.6.
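Three weights attached to three partial similarities suggest a weighted sum, which the sketch below assumes. The weights 0.1, 0.3 and 0.6 come from the claim; the function name, the constant names and the sample inputs are hypothetical.

```python
# Weights stated in the claim.
ETA_ATTRIBUTES = 0.1   # basic-attribute similarity
ETA_CALLS_SMS = 0.3    # call and short message record similarity
ETA_LOCATION = 0.6     # daily base station location similarity


def comprehensive_similarity(attr_sim, comm_sim, loc_sim):
    """Assumed weighted sum of the three partial similarities."""
    return (ETA_ATTRIBUTES * attr_sim
            + ETA_CALLS_SMS * comm_sim
            + ETA_LOCATION * loc_sim)


# Hypothetical partial similarities for a pair of users.
total_similarity = comprehensive_similarity(0.8, 0.25, 0.9)
```

Since the weights sum to 1, the combined score stays within the range of its inputs, and the location term dominates, in line with the emphasis the method places on base station data.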
CN201710756540.6A 2017-08-29 2017-08-29 Telecommunication user similarity discovery method based on LDA topic model Active CN107613520B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710756540.6A CN107613520B (en) 2017-08-29 2017-08-29 Telecommunication user similarity discovery method based on L DA topic model

Publications (2)

Publication Number Publication Date
CN107613520A (en) 2018-01-19
CN107613520B (en) 2020-08-04

Family

ID=61056243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710756540.6A Active CN107613520B (en) 2017-08-29 2017-08-29 Telecommunication user similarity discovery method based on LDA topic model

Country Status (1)

Country Link
CN (1) CN107613520B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103700018A (en) * 2013-12-16 2014-04-02 华中科技大学 Method for dividing users in mobile social network
US20160335345A1 (en) * 2015-05-11 2016-11-17 Stratifyd, Inc. Unstructured data analytics systems and methods
CN105469104A (en) * 2015-11-03 2016-04-06 小米科技有限责任公司 Text information similarity calculating method, device and server
CN106682170A (en) * 2016-12-27 2017-05-17 北京奇虎科技有限公司 Application searching method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
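DIANXI SHI et al.: "Measuring Users Relationship Strength Using a Hierarchical Voting-based Model", IEEE *
ZHONG Xiaoyu (钟晓宇) et al.: "A social network user recommendation scheme based on similar communities and node role division", Journal of Chongqing University of Posts and Telecommunications (重庆邮电大学学报) *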

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065174A (en) * 2018-07-27 2018-12-21 合肥工业大学 Consider the case history theme acquisition methods and device of similar constraint
CN109065174B (en) * 2018-07-27 2022-02-18 合肥工业大学 Medical record theme acquisition method and device considering similarity constraint
CN110856159A (en) * 2018-08-21 2020-02-28 中国移动通信集团湖南有限公司 Method, device and storage medium for determining family circle members
CN110856159B (en) * 2018-08-21 2022-07-26 中国移动通信集团湖南有限公司 Method, device and storage medium for determining family circle members
WO2020055321A1 (en) * 2018-09-10 2020-03-19 Eureka Analytics Pte. Ltd. Telecommunications data used for lookalike analysis
CN109933657A (en) * 2019-03-21 2019-06-25 中山大学 A kind of Topics Crawling sentiment analysis method based on user characteristics optimization
CN109933657B (en) * 2019-03-21 2021-07-09 中山大学 Topic mining emotion analysis method based on user feature optimization
TWI763165B (en) * 2020-12-09 2022-05-01 中華電信股份有限公司 Electronic device and method for predicting spending amount of customer of shopping website
CN112905740A (en) * 2021-02-04 2021-06-04 合肥工业大学 Topic preference mining method for competitive product hierarchy
CN112905740B (en) * 2021-02-04 2022-08-30 合肥工业大学 Topic preference mining method for competitive product hierarchy

Also Published As

Publication number Publication date
CN107613520B (en) 2020-08-04

Similar Documents

Publication Publication Date Title
CN107613520A (en) A kind of telecommunication user similarity based on LDA topic models finds method
CN104899273B (en) A kind of Web Personalization method based on topic and relative entropy
CN103024585B (en) Program recommendation system, program recommendation method and terminal equipment
CN101770520A (en) User interest modeling method based on user browsing behavior
CN103678647A (en) Method and system for recommending information
CN101266610B (en) Web active user website accessing mode on-line excavation method
CN105095433A (en) Recommendation method and device for entities
CN102110170B (en) System with information distribution and search functions and information distribution method
CN101354714B (en) Method for recommending problem based on probability latent semantic analysis
CN105718579A (en) Information push method based on internet-surfing log mining and user activity recognition
CN106845644A (en) A kind of heterogeneous network of the contact for learning user and Mobile solution by correlation
CN106202480A (en) A kind of network behavior based on K means and LDA bi-directional verification custom clustering method
CN109885772A (en) The education content personalized recommendation system of knowledge based map
CN106227714A (en) A kind of method and apparatus obtaining the key word generating poem based on artificial intelligence
CN110110225A (en) Online education recommended models and construction method based on user behavior data analysis
CN113806630B (en) Attention-based multi-view feature fusion cross-domain recommendation method and device
CN112559878B (en) Sequence recommendation system and recommendation method based on graph neural network
CN110009416A (en) A kind of system based on big data cleaning and AI precision marketing
CN111274413A (en) Intelligent heat supply service recommendation method based on knowledge graph
CN105511901A (en) App cold start-up recommending method based on mobile app operation list
CN113344648B (en) Advertisement recommendation method and system based on machine learning
CN110413882A (en) Information-pushing method, device and equipment
CN101901277A (en) Dynamic ontology modeling method and system based on user situation
CN112084418B (en) Microblog user community discovery method based on neighbor information and attribute network characterization learning
CN202041990U (en) Personal loan transaction platform for bank

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant