CN107613520A - A method for discovering telecom user similarity based on the LDA topic model - Google Patents
- Publication number
- CN107613520A (application number CN201710756540.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the field of data mining and discloses a method for discovering telecom user similarity based on the LDA (Latent Dirichlet Allocation) topic model. The method combines the multidimensional features of telecom users with a probabilistic topic discovery algorithm and computes user similarity from four aspects: the user's basic attributes, call records, short message records, and the location information of every base station the user connects to during one day, together with the start and end time of each connection. The invention focuses on applying the LDA topic model to the corpus built from the base stations a user connects to in one day: it uses the statistical properties of the text to uncover the latent topic information hidden in it, obtains the topic distribution of each document, and computes document similarity from these distributions, which supports deeper mining of the similarity features of telecom users.
Description
Technical field
The present invention relates to a method for discovering telecom user similarity based on the LDA (Latent Dirichlet Allocation) topic model, and belongs to the fields of data mining and topic modeling.
Background
In recent years, with the rise of the mobile Internet industry, the global telecom market has kept growing, technology has been updated ever faster, and competition among operators and between operators and Internet companies has become increasingly fierce. Traditional voice and short message services have been hit hard by the social networking products of Internet companies. In response, telecom operators worldwide have proposed transformation strategies that shift their service strategy from business-centric to customer-centric. Operators must therefore understand their customers in greater depth, adjust their strategies accordingly, and provide users with higher-quality services.
With big data elevated to a national strategy and operators having accumulated massive user data over many years, fully mining the potential value of telecom user data is significant not only for operators but also for other industries across society.
To that end, community detection on the telecom user network is one research direction, and an important step in community detection is clustering similar users. Clustering groups similar objects into the same cluster and dissimilar objects into different clusters. Because cluster analysis can establish the macroscopic distribution of the data, reveal the degree of correlation between data attributes, and infer relationships, clustering is widely used in data mining.
Existing methods for computing telecom user similarity consider the feature attributes of some user dimensions but do not incorporate other features of mobile users, such as mobile app usage, browsing history, and base station location information. The similarity values they produce are therefore limited, which indirectly affects the accuracy of subsequent clustering. The LDA model is a probabilistic topic model for document collections, that is, a method for modeling the topic information of text data. It is a three-layer generative Bayesian network based on the following assumption: ignoring syntactic structure and word order, a document is composed of several latent topics, and each topic is composed of several specific terms. The present method therefore abstracts a telecom user's base station location information into a document, uses the LDA topic model to compute the similarity between document topics, and combines this with the user's basic attributes, call relations, and short message relations to assess user similarity comprehensively.
Summary of the invention
To overcome the deficiencies of the prior art, the object of the invention is to propose a method for discovering telecom user similarity based on the LDA topic model. The method combines the multidimensional features of telecom users with a probabilistic topic discovery algorithm and computes telecom user similarity from four aspects, which supports the accuracy of clustering.
To achieve this object, embodiments of the invention adopt the following technical scheme, comprising the steps:
S1: Collect user information;
S2: Preprocess the user information collected in S1;
S3: Compute similarity for the basic attributes, the user call records, and the user short message records in the information preprocessed in S2;
S4: Build an LDA model from the one-day base station connection location information in the information preprocessed in S2, and compute the similarity of that information;
S5: Compute the comprehensive similarity and infer the correlation;
S6: Cluster using the correlation inferred in S5.
The preprocessing in S2 of the collected user information comprises four steps: data cleaning, data integration, data transformation, and data reduction.
The user basic attributes in S3 are the following 14 attributes: spending amount; online duration; whether gender is unknown; whether gender is female; whether gender is male; whether located in an urban district; whether located in a county town; whether located in a rural area; and whether the spending amount falls in 0-100, in 100-200, in 200-300, in 300-500, in 500-1000, or above 1000.
Basic attributes of a telecom user: each user is abstracted into a feature vector, and the similarity of two users' basic attributes is measured by the cosine of the angle between their vectors. The larger the value, the more similar features the two users share in their basic attributes.
User call records: first, in terms of call duration, the measure depends not only on how long the two users talk to each other but also on how long each of them talks to their neighboring users. Second, in terms of call count, suppose that within the same measurement period user a and user b have one 30-minute call, while user a and user c have six 5-minute calls; user a and user c are clearly in closer contact. Therefore, the longer the relative call duration between two users and the more calls they make, the higher their similarity.
User short message records: short message records are treated like call records, but only the number of messages exchanged between users is considered. The larger the share of each user's messages with neighboring users that is exchanged between the two users, the higher their similarity.
One-day base station connection location information: the day is divided into several time periods, and a location-transition document is built from the location labels of the base stations the user connects to in each period. The document is the input of the LDA topic model, which yields the topic distribution of the document; the similarity between documents is computed from these distributions.
The similarity based on the basic attributes of telecom users is calculated as follows:
$$\cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}$$
where $\vec{a}$ is the N-dimensional feature vector of user a, $\vec{b}$ is the N-dimensional feature vector of user b, and $\|\vec{a}\|$ and $\|\vec{b}\|$ are the lengths of the vectors.
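As a minimal sketch of the cosine measure on basic-attribute vectors, the following Python function computes the cosine of the angle between two equal-length vectors. The two 14-dimensional vectors, their scaling to [0, 1], and the choice to return 0 for an all-zero vector are illustrative assumptions, not values from the patent.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an all-zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical users: [spending, online duration, 3 gender flags,
# 3 region flags, 6 spending-bracket flags], numeric values scaled to [0, 1].
user_a = [0.35, 0.60, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
user_b = [0.40, 0.55, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
sim_ab = cosine_similarity(user_a, user_b)
```

Scaling the two numeric attributes before comparison keeps them from dominating the twelve 0/1 flags; whether the patent scales them this way is not stated.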
The similarity based on telecom user call records and short message records is calculated by a formula whose variables are defined as follows. c denotes call duration, f denotes call frequency, and s denotes the number of short messages. $c_{ij}$ is the duration of calls initiated by user i to user j, $c_{ji}$ is the duration of calls initiated by user j to user i, $c_i$ is the total call duration between user i and its neighboring users (including user j), and $c_j$ is the total call duration between user j and its neighboring users (including user i). The variables for f and s are defined analogously.
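The text defines the variables for duration, frequency, and short messages, but the formula that combines them is not reproduced here. The sketch below is therefore only one plausible form, assumed for illustration: each quantity is turned into the share of the two users' total activity that they exchange with each other, and the three shares are averaged. The function names, the factor of 2, and the equal weighting are all assumptions.

```python
def interaction_ratio(x_ij, x_ji, x_i, x_j):
    """Share of two users' total activity that is exchanged between them.

    x_i and x_j each include the pairwise activity, so the raw share is at
    most 0.5; the factor 2 (an assumption) rescales it to the range [0, 1].
    """
    total = x_i + x_j
    return 2 * (x_ij + x_ji) / total if total > 0 else 0.0

def call_sms_similarity(c, f, s):
    """Assumed combination: plain average of the three ratios.

    Each argument is a tuple (x_ij, x_ji, x_i, x_j) for call duration (c),
    call frequency (f) and short message count (s).
    """
    return (interaction_ratio(*c) + interaction_ratio(*f) + interaction_ratio(*s)) / 3

# The example from the text: a and b have one 30-minute call,
# a and c have six 5-minute calls (a initiates all calls, no short messages).
sim_ab = call_sms_similarity(c=(30, 0, 60, 30), f=(1, 0, 7, 1), s=(0, 0, 0, 0))
sim_ac = call_sms_similarity(c=(30, 0, 60, 30), f=(6, 0, 7, 6), s=(0, 0, 0, 0))
```

Under these assumptions the pair (a, c) scores higher than (a, b) because the duration share is equal while the frequency share is larger, which matches the intuition stated in the text.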
Before the LDA model is built from the one-day base station connection location information, the data is prepared in the following steps:
(1) Attach one of four labels to each base station location in a region: home base station (Home), workplace base station (Work), other base station (Other), and no base station receiving any connection request (No Reception). The four labels mean, respectively: the user is at home; the user is at work; the user is far from both home and workplace; the user's phone is switched off.
(2) Abstract the telecom user's daily itinerary into a sequence of location labels. First, build a fine-grained location description: divide the day into 20-minute time blocks and label each block with the base station location label that has the longest duration within the block. A user's day is thus abstracted into a vector of 72 location labels.
(3) To prevent overfitting, then build a coarse-grained time description: divide the day into 8 time slices, 0-6 am, 6-9 am, 9 am-12 pm, 12-2 pm, 2-5 pm, 5-7 pm, 7-9 pm, and 9 pm-12 am, numbered 0 to 7.
(4) Finally, build the location-transition corpus. Each term of a document in the corpus consists of the fine-grained location labels of 2 consecutive hours followed by one coarse-grained time label, for example HHHHHH0 or HWWWWW2.
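The preparation steps above can be sketched in Python as follows. Several details are assumptions made for illustration: the single-letter codes O and N for Other and No Reception (the text only shows H and W in its examples), the use of non-overlapping 2-hour windows, and taking the coarse time label from the hour in which each window starts. The sample day is invented.

```python
# Coarse time slices as (start_hour, end_hour), numbered 0..7 as in the text.
COARSE_SLICES = [(0, 6), (6, 9), (9, 12), (12, 14), (14, 17), (17, 19), (19, 21), (21, 24)]
BLOCK_MINUTES = 20
BLOCKS_PER_DAY = 24 * 60 // BLOCK_MINUTES  # 72 fine-grained blocks
BLOCKS_PER_TERM = 6                        # 6 x 20 minutes = 2 hours

def coarse_slice(hour):
    """Index of the coarse time slice that contains the given hour."""
    for idx, (start, end) in enumerate(COARSE_SLICES):
        if start <= hour < end:
            return idx
    raise ValueError(f"hour out of range: {hour}")

def fine_labels(intervals):
    """Map (start_minute, end_minute, label) intervals to 72 block labels.

    Each block gets the label with the longest total duration inside it.
    """
    labels = []
    for block in range(BLOCKS_PER_DAY):
        lo, hi = block * BLOCK_MINUTES, (block + 1) * BLOCK_MINUTES
        durations = {}
        for start, end, label in intervals:
            overlap = min(end, hi) - max(start, lo)
            if overlap > 0:
                durations[label] = durations.get(label, 0) + overlap
        # Assumption: a block without any connection record is labelled "N".
        labels.append(max(durations, key=durations.get) if durations else "N")
    return labels

def build_document(labels):
    """Turn 72 block labels into 12 terms: 6 location labels + 1 time label."""
    terms = []
    for start in range(0, len(labels), BLOCKS_PER_TERM):
        hour = start * BLOCK_MINUTES // 60  # hour in which the window starts
        window = "".join(labels[start:start + BLOCKS_PER_TERM])
        terms.append(window + str(coarse_slice(hour)))
    return terms

# Hypothetical day, in minutes since midnight.
day = [(0, 500, "H"), (500, 540, "O"), (540, 1080, "W"), (1080, 1120, "O"), (1120, 1440, "H")]
labels = fine_labels(day)
document = build_document(labels)
```

A window that straddles two coarse slices (for example 8-10 am) is ambiguous under the description in the text; labelling it by its starting hour is just one simple way to resolve that.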
An LDA model is built from all documents in the location-transition corpus. The document set consists of the location-change sequences of the specified users within one day, and each term in the vocabulary consists of 6 fine-grained location labels and 1 coarse-grained time label. The generative process of the LDA model comprises the following steps:
(1) Draw the topic probability distribution of document i as $\theta_i \sim \mathrm{Dir}(\alpha)$, where $\theta_i$ is the topic distribution of the i-th document, Dir denotes the Dirichlet distribution, $i \in \{1, \dots, M\}$, M is the number of documents, and $\alpha$ is the parameter of the Dirichlet prior on the per-document topic distribution, also called a hyperparameter.
(2) Draw the term probability distribution of topic k as $\varphi_k \sim \mathrm{Dir}(\beta)$, where $\varphi_k$ is the term distribution of the k-th topic, Dir denotes the Dirichlet distribution, $k \in \{1, \dots, K\}$, K is the number of topics, and $\beta$ is the parameter of the Dirichlet prior on the per-topic term distribution, also called a hyperparameter.
(3) For each word $w_{i,j}$ in a document, draw a topic $z_{i,j} \sim \mathrm{Multinomial}(\theta_i)$ and then draw a term $w_{i,j} \sim \mathrm{Multinomial}(\varphi_{z_{i,j}})$, where $w_{i,j}$ is the j-th term of the i-th document, $z_{i,j}$ is the topic index of the j-th term of the i-th document, $\theta_i$ is the topic distribution of the i-th document, and $\varphi_{z_{i,j}}$ is the term distribution of topic $z_{i,j}$.
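To make the three generative steps concrete, the following sketch simulates them with NumPy. It assumes symmetric scalar priors (one value of alpha shared by all topics, one value of beta shared by all terms) and a fixed document length; the function name and the small sizes are illustrative only.

```python
import numpy as np

def generate_corpus(M, K, V, doc_len, alpha, beta, seed=0):
    """Sample M documents of doc_len term ids from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # Step (1): one topic distribution per document, theta_i ~ Dir(alpha).
    theta = rng.dirichlet(np.full(K, alpha), size=M)  # shape (M, K)
    # Step (2): one term distribution per topic, phi_k ~ Dir(beta).
    phi = rng.dirichlet(np.full(V, beta), size=K)     # shape (K, V)
    docs = []
    for i in range(M):
        # Step (3): draw a topic for every position, then a term from that topic.
        z = rng.choice(K, size=doc_len, p=theta[i])
        w = np.array([rng.choice(V, p=phi[k]) for k in z])
        docs.append(w)
    return theta, phi, docs

# Small illustrative run; 12 terms per document matches one term per 2-hour window.
theta, phi, docs = generate_corpus(M=5, K=3, V=10, doc_len=12, alpha=1.0, beta=0.01)
```

In practice the process runs in reverse: the terms are observed and inference recovers theta and phi, but simulating the forward direction is a quick way to check one's reading of the model.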
From the LDA model obtained by the above process, the joint probability distribution of a document given the hyperparameters is:
$$p(\omega_m, z_m, \theta_m, \varphi \mid \alpha, \beta) = \prod_{n=1}^{N_m} p(w_{m,n} \mid \varphi_{z_{m,n}}) \, p(z_{m,n} \mid \theta_m) \cdot p(\theta_m \mid \alpha) \cdot p(\varphi \mid \beta)$$
where $\omega_m$ is the vector of all words in document m, $z_m$ is the topic vector corresponding to document m, $\theta_m$ is the topic probability distribution of document m, $\varphi$ is the term probability distribution of all topics, $\alpha$ and $\beta$ are the hyperparameters of the Dirichlet distributions, $N_m$ is the length of document m, $w_{m,n}$ is the n-th term of the m-th document, $\varphi_{z_{m,n}}$ is the term distribution of topic $z_{m,n}$, and $z_{m,n}$ is the topic index of the n-th term of the m-th document.
Based on this joint probability distribution, the parameters are estimated by Gibbs sampling during modeling. The initial number of topics is set to K = 30, the hyperparameters to α = 30/K and β = 0.01, and the number of Gibbs sampling iterations to 1000. Topic mining is performed on the corpus to produce the topic probability distribution of each document, $P(z = k) = \theta_k^{(d)}$, and the term probability distribution under each topic, $P(w \mid z = k) = \varphi_w^{(k)}$.
The topic probability distribution of each document is calculated as follows:
$$\theta_{m,k} = \frac{n_{m,k} + \alpha_k}{\sum_{k'=1}^{K} \left( n_{m,k'} + \alpha_{k'} \right)}$$
where $\theta_{m,k}$ is the probability of the k-th topic in the m-th document, $n_{m,k}$ is the number of times the k-th topic occurs in document m, K is the total number of topics, and $\alpha$ is the first parameter vector.
The term probability distribution under each topic is calculated as follows:
$$\varphi_{k,w} = \frac{n_{k,w} + \beta_w}{\sum_{w'=1}^{V} \left( n_{k,w'} + \beta_{w'} \right)}$$
where $\varphi_{k,w}$ is the probability of the w-th term under the k-th topic, $n_{k,w}$ is the number of times the w-th term occurs under the k-th topic, V is the total number of terms, and $\beta$ is the second parameter vector.
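A compact collapsed Gibbs sampler shows how the counts behind these two estimates arise. This is a textbook-style sketch under stated assumptions, not the patent's implementation: the priors are symmetric scalars, documents are lists of integer term ids, and the toy corpus and iteration count are invented (the text itself uses K = 30 and 1000 iterations).

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha, beta, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (theta, phi) estimates."""
    rng = np.random.default_rng(seed)
    M = len(docs)
    n_mk = np.zeros((M, K))  # topic counts per document
    n_kw = np.zeros((K, V))  # term counts per topic
    n_k = np.zeros(K)        # total term count per topic
    z = []
    # Random initial topic assignment for every term position.
    for m, doc in enumerate(docs):
        zm = rng.integers(K, size=len(doc))
        z.append(zm)
        for w, k in zip(doc, zm):
            n_mk[m, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for m, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[m][n]
                # Remove the current assignment from the counts.
                n_mk[m, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Full conditional of the topic given all other assignments.
                p = (n_mk[m] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[m][n] = k
                n_mk[m, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1
    # Smoothed count ratios, as in the two estimation formulas.
    theta = (n_mk + alpha) / (n_mk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Toy corpus of term ids; a real run would first map terms such as "HHHHHH0" to ids.
toy_docs = [[0, 0, 1, 0, 1], [2, 3, 3, 2, 2], [0, 1, 2, 3, 0]]
theta_hat, phi_hat = gibbs_lda(toy_docs, K=2, V=4, alpha=1.0, beta=0.01, n_iter=50)
```

The value alpha = 1.0 equals 30/K at the text's K = 30. For production use, an established library implementation of LDA would normally be preferable to a hand-written sampler.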
From the topic probability distributions of the documents obtained above, the distance between the topic probability distributions of two documents is calculated by a formula in which $d_1$ and $d_2$ denote the two documents, i is the topic index, and the quantities compared for each topic i are the probability that document $d_1$ takes topic i and the probability that document $d_2$ takes topic i.
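The distance formula itself is not reproduced in the text, so the exact measure is unknown. As a stand-in, the sketch below uses the Jensen-Shannon divergence, a common symmetric distance between two topic distributions; this choice is an assumption and may differ from the measure the patent uses.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, 0 for identical inputs, at most ln 2."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)

# Hypothetical topic distributions of two users' location documents (K = 3).
doc1_topics = [0.7, 0.2, 0.1]
doc2_topics = [0.6, 0.3, 0.1]
distance = js_divergence(doc1_topics, doc2_topics)
```

Because the divergence is bounded by ln 2, dividing by `math.log(2)` and subtracting from 1 gives a similarity in [0, 1], which is convenient when combining it with other similarity scores.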
The method combines the multidimensional features of users with the probabilistic topic discovery algorithm (LDA) to obtain the comprehensive similarity of two users. In the comprehensive similarity formula, u1 and u2 denote user 1 and user 2; $\eta_1$ is the weight of the similarity computed from basic attributes, set to $\eta_1 = 0.1$; $\eta_2$ is the weight of the similarity computed from call records and short message records, set to $\eta_2 = 0.3$; $\eta_3$ is the weight of the similarity computed from the one-day base station connection location information, set to $\eta_3 = 0.6$; the remaining parameters are as defined above.
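The combining formula is not reproduced in the text; since the three weights sum to 1, a weighted sum of the three component similarities is a natural reading, and the sketch below assumes exactly that. It also assumes each component has already been expressed as a similarity in [0, 1] (for instance, a distance between topic distributions converted to a similarity). The function name and sample values are illustrative.

```python
def combined_similarity(attr_sim, call_sms_sim, location_sim, eta=(0.1, 0.3, 0.6)):
    """Assumed weighted sum of the three component similarities.

    eta holds the weights for basic attributes, call/short-message records
    and base station location information, using the values given in the text.
    """
    eta1, eta2, eta3 = eta
    return eta1 * attr_sim + eta2 * call_sms_sim + eta3 * location_sim

# Hypothetical component similarities for users u1 and u2.
sim_u1_u2 = combined_similarity(attr_sim=0.71, call_sms_sim=0.31, location_sim=0.85)
```

With these weights the location component dominates: a pair with identical daily location patterns but no calls or shared attributes still scores 0.6.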
Beneficial effects of the invention:
1. The base station connection information of telecom users is introduced and base stations are divided into categories; the user's location behavior within one day is modeled with LDA, which fully mines the similarity of users' daily behavior.
2. Time slices are introduced so that a user's daily habits are described at both coarse and fine granularity, which avoids overfitting as far as possible.
3. The multidimensional features of telecom users are combined with the probabilistic topic discovery algorithm (LDA), and telecom user similarity is considered from four aspects, which is comprehensive and reasonable.
Brief description of the drawings
Fig. 1 is the similarity calculating method schematic diagram of the present invention.
Fig. 2 is the LDA topic model figures of the present invention.
Fig. 3 is a schematic diagram of the topology of the LDA model used by the invention. The LDA model treats each document as a mixture of multiple topics, and each topic is characterized by multiple terms.
Detailed description
To make the purpose, technical scheme, and advantages of the invention clearer, the invention is described in further detail below with reference to specific embodiments. The specific embodiments described here only explain the invention and do not limit it.
A method for discovering telecom user similarity based on the LDA topic model considers user similarity from four aspects:
Basic attributes of a telecom user: each user is abstracted into a feature vector, and the similarity of two users' basic attributes is measured by the cosine of the angle between their vectors. The larger the value, the more similar features the two users share in their basic attributes.
User call records: first, in terms of call duration, the measure depends not only on how long the two users talk to each other but also on how long each of them talks to their neighboring users. Second, in terms of call count, suppose that within the same measurement period user a and user b have one 30-minute call, while user a and user c have six 5-minute calls; user a and user c are clearly in closer contact. Therefore, the longer the relative call duration between two users and the more calls they make, the higher their similarity.
User short message records: short message records are treated like call records, but only the number of messages exchanged between users is considered. The larger the share of each user's messages with neighboring users that is exchanged between the two users, the higher their similarity.
One-day base station connection location information: the 24 hours of a day are divided into several time periods, and a location-transition document is built from the location labels of the base stations the user connects to in each period. The document is the input of the LDA topic model, which yields the topic distribution of the document; the similarity between documents is computed from these distributions.
As shown in Fig. 1, before the data required by the method is used, it goes through the following steps: data cleaning, data integration, data reduction, and data transformation.
Next, the user's basic attributes are extracted from the preprocessed data. There are 14 of them: spending amount; online duration; whether gender is unknown; whether gender is female; whether gender is male; whether located in an urban district; whether located in a county town; whether located in a rural area; and whether the spending amount falls in 0-100, in 100-200, in 200-300, in 300-500, in 500-1000, or above 1000. Because these attributes have already been reduced and transformed, the user attributes can be abstracted into a feature vector, and the similarity is calculated as follows:
$$\cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}$$
where $\vec{a}$ is the N-dimensional feature vector of user a, $\vec{b}$ is the N-dimensional feature vector of user b, and $\|\vec{a}\|$ and $\|\vec{b}\|$ are the lengths of the vectors.
Then the user's call records and short message records are extracted from the data set, and the call duration, call count, and short message count between users are tallied. The similarity is calculated by a formula whose variables are defined as follows. c denotes call duration, f denotes call frequency, and s denotes the number of short messages. $c_{ij}$ is the duration of calls initiated by user i to user j, $c_{ji}$ is the duration of calls initiated by user j to user i, $c_i$ is the total call duration between user i and its neighboring users (including user j), and $c_j$ is the total call duration between user j and its neighboring users (including user i). The variables for f and s are defined analogously.
Finally, an LDA model is built from the user's base station connection information, in the following steps:
(1) Attach one of four labels to each base station location in a region: home base station (Home), workplace base station (Work), other base station (Other), and no base station receiving any connection request (No Reception). The four labels mean, respectively: the user is at home; the user is at work; the user is far from both home and workplace; the user's phone is switched off.
(2) Abstract the telecom user's daily itinerary into a sequence of location labels. First, build a fine-grained location description: divide the day into 20-minute time blocks and label each block with the base station location label that has the longest duration within the block. A user's day is thus abstracted into a vector of 72 location labels. To prevent overfitting, then build a coarse-grained time description: divide the day into 8 time slices, 0-6 am, 6-9 am, 9 am-12 pm, 12-2 pm, 2-5 pm, 5-7 pm, 7-9 pm, and 9 pm-12 am, numbered 0 to 7.
(3) Build the location-transition corpus. Each term of a document in the corpus consists of the fine-grained base station location labels of 2 consecutive hours followed by one coarse-grained time label. The LDA model is built from all documents in the location-transition corpus, as shown in Fig. 2.
The generative process of the LDA model comprises the following steps:
(1) Draw the topic probability distribution of document i as $\theta_i \sim \mathrm{Dir}(\alpha)$, where $\theta_i$ is the topic distribution of the i-th document, Dir indicates that it follows a Dirichlet distribution, $i \in \{1, \dots, M\}$, M is the number of documents, and $\alpha$ is a hyperparameter.
(2) Draw the term probability distribution of topic k as $\varphi_k \sim \mathrm{Dir}(\beta)$, where $\varphi_k$ is the term distribution of the k-th topic, Dir indicates that it follows a Dirichlet distribution, $k \in \{1, \dots, K\}$, K is the number of topics, and $\beta$ is a hyperparameter.
(3) For each word $w_{i,j}$ in a document, draw a topic $z_{i,j} \sim \mathrm{Multinomial}(\theta_i)$ and then draw a term $w_{i,j} \sim \mathrm{Multinomial}(\varphi_{z_{i,j}})$, where $w_{i,j}$ is the j-th term of the i-th document, $z_{i,j}$ is the topic index of the j-th term of the i-th document, $\theta_i$ is the topic distribution of the i-th document, and $\varphi_{z_{i,j}}$ is the term distribution of topic $z_{i,j}$.
The LDA model obtained by the above process has a clear hierarchical structure, as shown in Fig. 3: each document is a mixture of multiple topics, and each topic is characterized by multiple terms. The joint probability distribution of a document given the hyperparameters is therefore:
$$p(\omega_m, z_m, \theta_m, \varphi \mid \alpha, \beta) = \prod_{n=1}^{N_m} p(w_{m,n} \mid \varphi_{z_{m,n}}) \, p(z_{m,n} \mid \theta_m) \cdot p(\theta_m \mid \alpha) \cdot p(\varphi \mid \beta)$$
where $\omega_m$ is the vector of all words in document m, $z_m$ is the topic vector corresponding to document m, $\theta_m$ is the topic probability distribution of document m, $\varphi$ is the term probability distribution of all topics, $\alpha$ and $\beta$ are the hyperparameters of the Dirichlet distributions, $N_m$ is the length of document m, $w_{m,n}$ is the n-th term of the m-th document, $\varphi_{z_{m,n}}$ is the term distribution of topic $z_{m,n}$, and $z_{m,n}$ is the topic index of the n-th term of the m-th document.
Based on this joint probability distribution, the parameters are estimated by Gibbs sampling during modeling. The initial number of topics is set to K = 30, the hyperparameters to α = 30/K and β = 0.01, and the number of Gibbs sampling iterations to 1000. Topic mining is performed on the corpus to produce the topic probability distribution of each document, $P(z = k) = \theta_k^{(d)}$, and the term probability distribution under each topic, $P(w \mid z = k) = \varphi_w^{(k)}$.
The topic probability distribution of each document is calculated as follows:
$$\theta_{m,k} = \frac{n_{m,k} + \alpha_k}{\sum_{k'=1}^{K} \left( n_{m,k'} + \alpha_{k'} \right)}$$
where $\theta_{m,k}$ is the probability of the k-th topic in the m-th document, $n_{m,k}$ is the number of times the k-th topic occurs in document m, K is the total number of topics, and $\alpha$ is the first parameter vector.
The term probability distribution under each topic is calculated as follows:
$$\varphi_{k,w} = \frac{n_{k,w} + \beta_w}{\sum_{w'=1}^{V} \left( n_{k,w'} + \beta_{w'} \right)}$$
where $\varphi_{k,w}$ is the probability of the w-th term under the k-th topic, $n_{k,w}$ is the number of times the w-th term occurs under the k-th topic, V is the total number of terms, and $\beta$ is the second parameter vector.
Document similarity can therefore be measured by the distance between the topic probability distributions of two documents. In the distance formula, $d_1$ and $d_2$ denote the two documents, i is the topic index, and the quantities compared for each topic i are the probability that document $d_1$ takes topic i and the probability that document $d_2$ takes topic i.
Finally, the multidimensional features of users are combined with the probabilistic topic discovery algorithm (LDA) to obtain the comprehensive similarity of two users. In the comprehensive similarity formula, u1 and u2 denote user 1 and user 2; $\eta_1$ is the weight of the similarity computed from basic attributes, set to $\eta_1 = 0.1$; $\eta_2$ is the weight of the similarity computed from call records and short message records, set to $\eta_2 = 0.3$; $\eta_3$ is the weight of the similarity computed from the one-day base station connection location information, set to $\eta_3 = 0.6$; the remaining parameters are as defined above.
It is obvious to a person skilled in the art that the invention is not limited to the details of the above exemplary embodiments and that it can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are intended to be embraced by the invention. No reference sign in a claim should be construed as limiting that claim.
Moreover, although this specification is described in terms of embodiments, not every embodiment contains only one independent technical scheme. The specification is written this way only for clarity; those skilled in the art should take the specification as a whole, and the technical schemes of the embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.
Claims (10)
1. A method for discovering telecom user similarity based on the LDA topic model, characterized by comprising the following steps:
S1: Collect user information;
S2: Preprocess the user information collected in S1;
S3: Compute similarity separately for the basic attributes, the user call records, and the user short message records in the information preprocessed in S2;
S4: Build an LDA model from the one-day base station connection location information in the information preprocessed in S2, and compute the similarity of that information;
S5: Compute the comprehensive similarity from the similarities of S3 and S4, and infer the correlation;
S6: Cluster using the correlation inferred in S5.
2. The method for discovering telecom user similarity based on the LDA topic model according to claim 1, characterized in that the preprocessing in S2 of the user information collected in S1 comprises four steps: data cleaning, data integration, data transformation, and data reduction.
3. The method for discovering telecom user similarity based on the LDA topic model according to claim 1, characterized in that the user basic attributes in S3 are the following 14 attributes: spending amount; online duration; whether gender is unknown; whether gender is female; whether gender is male; whether located in an urban district; whether located in a county town; whether located in a rural area; and whether the spending amount falls in 0-100, in 100-200, in 200-300, in 300-500, in 500-1000, or above 1000.
4. The method for discovering telecom user similarity based on the LDA topic model according to claim 1, characterized in that the user basic attribute similarity in S3 is calculated as follows:
$$\cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}$$
where $\vec{a}$ is the N-dimensional feature vector of user a, $\vec{b}$ is the N-dimensional feature vector of user b, $\|\vec{a}\|$ and $\|\vec{b}\|$ are the respective lengths of the vectors, and $\cos(\vec{a}, \vec{b})$ is the similarity of the users' basic attributes; the larger the value, the more similar features the users share in their basic attributes.
5. The method for discovering telecom user similarity based on the LDA topic model according to claim 1, characterized in that the similarity of user call records and user short message records in S3 is calculated by a formula whose variables are defined as follows:
P(C, S) is the similarity of user call records and user short message records; c denotes call duration, f denotes call frequency, and s denotes the number of short messages; $c_{ij}$ is the duration of calls initiated by user i to user j, $c_{ji}$ is the duration of calls initiated by user j to user i, $c_i$ is the total call duration between user i and its neighboring users, and $c_j$ is the total call duration between user j and its neighboring users; $f_{ij}$ is the frequency of calls initiated by user i to user j, $f_{ji}$ is the frequency of calls initiated by user j to user i, $f_i$ is the total call frequency between user i and its neighboring users, and $f_j$ is the total call frequency between user j and its neighboring users; $s_{ij}$ is the number of short messages sent by user i to user j, $s_{ji}$ is the number of short messages sent by user j to user i, $s_i$ is the total number of short messages between user i and its neighboring users, and $s_j$ is the total number of short messages between user j and its neighboring users.
6. a kind of telecommunication user similarity based on LDA topic models according to claim 1 finds method, its feature exists
In user connected base station position information in one day in the S4, established LDA models, and the step of calculating the information similarity is:
S41:It is default before modeling;
S42:Build LDA models;
S43:Parameter Estimation is carried out using Gibbs model method, is distributed, calculated by calculating theme probability distribution and Word probability
Documents Similarity.
7. The telecommunication user similarity discovery method based on the LDA topic model according to claim 6, characterized in that the pre-modeling preparation in S41 comprises the following steps:
S411: Assign one of 4 labels to the base station locations of a given region: home base station, workplace base station, other base station, and no base station receiving any connection request. These 4 labels mean, respectively: the user is at home; the user is at work; the user is at a location away from both home and workplace; the user's mobile phone is switched off;
S412: Divide one day into 20-minute time blocks and build a fine-grained location description, which is a vector of 72 location labels. Then divide one day into 8 time slices, namely 0~6am, 6~9am, 9am~12pm, 12~2pm, 2~5pm, 5~7pm, 7~9pm, and 9pm~12am, numbered 0~7, and build a coarse-grained time description;
S413: Build the location-transition corpus, in which each term of a document consists of the fine-grained location labels within 2 consecutive hours and one coarse-grained time label.
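Steps S411 to S413 can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the single-letter label codes, the sliding window that advances one block at a time, taking the coarse time slice at the start of the window, and the names `coarse_slice` and `day_to_document` are all hypothetical choices.

```python
from typing import List

# S411: four location labels (the letter codes are an assumption).
LABELS = {"H": "home", "W": "work", "O": "other", "N": "no connection"}

BLOCK_MINUTES = 20
BLOCKS_PER_DAY = 24 * 60 // BLOCK_MINUTES   # 72 fine-grained blocks (S412)
WORD_BLOCKS = 2 * 60 // BLOCK_MINUTES       # 6 blocks span 2 consecutive hours (S413)

# S412: start hour of each of the 8 coarse time slices, numbered 0..7.
SLICE_START_HOURS = [0, 6, 9, 12, 14, 17, 19, 21]

def coarse_slice(block_index: int) -> int:
    """Return the coarse time-slice number (0..7) containing a 20-minute block."""
    hour = block_index * BLOCK_MINUTES // 60
    return max(i for i, start in enumerate(SLICE_START_HOURS) if start <= hour)

def day_to_document(day_labels: List[str]) -> List[str]:
    """Turn one user's 72 location labels for a day into a list of terms.

    Each term is 6 consecutive fine-grained labels followed by the coarse
    time-slice number of the window's first block (an assumed convention).
    """
    if len(day_labels) != BLOCKS_PER_DAY:
        raise ValueError(f"expected {BLOCKS_PER_DAY} labels, got {len(day_labels)}")
    if any(label not in LABELS for label in day_labels):
        raise ValueError("unknown location label")
    words = []
    for start in range(BLOCKS_PER_DAY - WORD_BLOCKS + 1):
        window = "".join(day_labels[start:start + WORD_BLOCKS])
        words.append(f"{window}{coarse_slice(start)}")
    return words
```

For example, a day that begins with eight hours at home yields the first term `HHHHHH0`: six home labels followed by time slice 0.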
8. The telecommunication user similarity discovery method based on the LDA topic model according to claim 6, characterized in that building the LDA model in S42 comprises the following steps:
S421: Select the topic probability distribution of document i as $\theta_i \sim \mathrm{Dir}(\alpha)$, where $\theta_i$ denotes the topic distribution of the i-th document, Dir denotes the Dirichlet distribution, $i \in \{1, \ldots, M\}$, M is the number of documents, and $\alpha$ is the parameter of the Dirichlet prior on the topic distribution of each document, also called a hyperparameter;
S422: Select the term probability distribution of topic k as $\varphi_k \sim \mathrm{Dir}(\beta)$, where $\varphi_k$ denotes the term distribution of the k-th topic, Dir denotes the Dirichlet distribution, $k \in \{1, \ldots, K\}$, K is the number of topics, and $\beta$ is the parameter of the Dirichlet prior on the word distribution of each topic, also called a hyperparameter;
S423: For each word $w_{i,j}$ in the document, select a topic $z_{i,j} \sim \mathrm{Multinomial}(\theta_i)$ from a multinomial distribution, then select a term $w_{i,j} \sim \mathrm{Multinomial}(\varphi_{z_{i,j}})$ from a multinomial distribution;
where $w_{i,j}$ denotes the j-th term of the i-th document, $z_{i,j}$ denotes the topic index of the j-th term of the i-th document, $\theta_i$ denotes the topic distribution of the i-th document, and $\varphi_{z_{i,j}}$ denotes the term distribution of topic $z_{i,j}$.
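The generative process in S421 to S423 can be simulated directly. The sketch below draws a small synthetic corpus with NumPy; the function name `generate_corpus`, the fixed document length, and the default hyperparameter values are assumptions for illustration only.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, n_terms, alpha=0.5, beta=0.1, seed=0):
    """Sample a synthetic corpus from the LDA generative process (sketch)."""
    rng = np.random.default_rng(seed)
    # S421: one topic distribution per document, theta_i ~ Dir(alpha).
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    # S422: one term distribution per topic, phi_k ~ Dir(beta).
    phi = rng.dirichlet(np.full(n_terms, beta), size=n_topics)
    docs = []
    for i in range(n_docs):
        # S423: draw a topic for every position, then a term from that topic.
        z = rng.choice(n_topics, size=doc_len, p=theta[i])
        words = [int(rng.choice(n_terms, p=phi[k])) for k in z]
        docs.append(words)
    return docs, theta, phi
```

Inference runs this process in reverse: given only `docs`, it estimates `theta` and `phi`.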
9. The telecommunication user similarity discovery method based on the LDA topic model according to claim 6, characterized in that the document similarity formula in S43 is as follows:
where $d_1$ and $d_2$ denote two documents, i denotes the topic index, the two probability terms denote the probabilities that documents $d_1$ and $d_2$, respectively, are assigned topic i, and K denotes the total number of topics over the M documents.
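The similarity formula itself did not survive in this text, so the sketch below should not be read as the claimed formula. It shows one common way to compare two documents through their topic distributions, a similarity derived from the Jensen-Shannon divergence; the choice of measure and the name `js_similarity` are assumptions.

```python
import numpy as np

def js_similarity(p, q):
    """Similarity in [0, 1] between two topic distributions (assumed measure).

    Computes 1 minus the Jensen-Shannon divergence with base-2 logarithms,
    so identical distributions score 1 and disjoint ones score 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing to the sum.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 1.0 - (0.5 * kl(p, m) + 0.5 * kl(q, m))
```

Both inputs are length-K vectors of per-topic probabilities for one document each.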
10. The telecommunication user similarity discovery method based on the LDA topic model according to claim 1, characterized in that the formula for calculating the comprehensive similarity in S5 is as follows:
where u1 and u2 denote user 1 and user 2; $\eta_1$ denotes the weight of the similarity calculated from basic attributes, set to $\eta_1 = 0.1$; $\eta_2$ denotes the weight of the similarity calculated from call records and short message records, set to $\eta_2 = 0.3$; and $\eta_3$ denotes the weight of the similarity calculated from the location information of the base stations the user connects to in one day, set to $\eta_3 = 0.6$.
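The three weights sum to 1, which suggests a weighted sum of the three partial similarities. The sketch below assumes exactly that form; the formula itself is not present in this text, and the function and argument names are hypothetical.

```python
# Weights from claim 10: basic attributes, call/SMS records, base station locations.
ETA = (0.1, 0.3, 0.6)

def combined_similarity(sim_attr, sim_comm, sim_loc, eta=ETA):
    """Weighted sum of the three partial similarities (assumed form)."""
    return eta[0] * sim_attr + eta[1] * sim_comm + eta[2] * sim_loc
```

With these weights the base station location similarity dominates: two users with identical location patterns already reach a score of at least 0.6.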
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710756540.6A CN107613520B (en) | 2017-08-29 | 2017-08-29 | Telecommunication user similarity discovery method based on LDA topic model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107613520A (en) | 2018-01-19 |
CN107613520B (en) | 2020-08-04 |
Family
ID=61056243
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710756540.6A Active CN107613520B (en) | 2017-08-29 | 2017-08-29 | Telecommunication user similarity discovery method based on LDA topic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107613520B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103700018A (en) * | 2013-12-16 | 2014-04-02 | 华中科技大学 | Method for dividing users in mobile social network |
US20160335345A1 (en) * | 2015-05-11 | 2016-11-17 | Stratifyd, Inc. | Unstructured data analytics systems and methods |
CN105469104A (en) * | 2015-11-03 | 2016-04-06 | 小米科技有限责任公司 | Text information similarity calculating method, device and server |
CN106682170A (en) * | 2016-12-27 | 2017-05-17 | 北京奇虎科技有限公司 | Application searching method and device |
Non-Patent Citations (2)
Title |
---|
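Dianxi Shi et al.: "Measuring Users Relationship Strength Using a Hierarchical Voting-based Model", IEEE * |
Zhong Xiaoyu et al.: "A social network user recommendation scheme based on similar communities and node role division", Journal of Chongqing University of Posts and Telecommunications * |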
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109065174A (en) * | 2018-07-27 | 2018-12-21 | 合肥工业大学 | Consider the case history theme acquisition methods and device of similar constraint |
CN109065174B (en) * | 2018-07-27 | 2022-02-18 | 合肥工业大学 | Medical record theme acquisition method and device considering similarity constraint |
CN110856159A (en) * | 2018-08-21 | 2020-02-28 | 中国移动通信集团湖南有限公司 | Method, device and storage medium for determining family circle members |
CN110856159B (en) * | 2018-08-21 | 2022-07-26 | 中国移动通信集团湖南有限公司 | Method, device and storage medium for determining family circle members |
WO2020055321A1 (en) * | 2018-09-10 | 2020-03-19 | Eureka Analytics Pte. Ltd. | Telecommunications data used for lookalike analysis |
CN109933657A (en) * | 2019-03-21 | 2019-06-25 | 中山大学 | A kind of Topics Crawling sentiment analysis method based on user characteristics optimization |
CN109933657B (en) * | 2019-03-21 | 2021-07-09 | 中山大学 | Topic mining emotion analysis method based on user feature optimization |
TWI763165B (en) * | 2020-12-09 | 2022-05-01 | 中華電信股份有限公司 | Electronic device and method for predicting spending amount of customer of shopping website |
CN112905740A (en) * | 2021-02-04 | 2021-06-04 | 合肥工业大学 | Topic preference mining method for competitive product hierarchy |
CN112905740B (en) * | 2021-02-04 | 2022-08-30 | 合肥工业大学 | Topic preference mining method for competitive product hierarchy |
Also Published As
Publication number | Publication date |
---|---|
CN107613520B (en) | 2020-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107613520A (en) | A kind of telecommunication user similarity based on LDA topic models finds method | |
CN104899273B (en) | A kind of Web Personalization method based on topic and relative entropy | |
CN103024585B (en) | Program recommendation system, program recommendation method and terminal equipment | |
CN101770520A (en) | User interest modeling method based on user browsing behavior | |
CN103678647A (en) | Method and system for recommending information | |
CN101266610B (en) | Web active user website accessing mode on-line excavation method | |
CN105095433A (en) | Recommendation method and device for entities | |
CN102110170B (en) | System with information distribution and search functions and information distribution method | |
CN101354714B (en) | Method for recommending problem based on probability latent semantic analysis | |
CN105718579A (en) | Information push method based on internet-surfing log mining and user activity recognition | |
CN106845644A (en) | A kind of heterogeneous network of the contact for learning user and Mobile solution by correlation | |
CN106202480A (en) | A kind of network behavior based on K means and LDA bi-directional verification custom clustering method | |
CN109885772A (en) | The education content personalized recommendation system of knowledge based map | |
CN106227714A (en) | A kind of method and apparatus obtaining the key word generating poem based on artificial intelligence | |
CN110110225A (en) | Online education recommended models and construction method based on user behavior data analysis | |
CN113806630B (en) | Attention-based multi-view feature fusion cross-domain recommendation method and device | |
CN112559878B (en) | Sequence recommendation system and recommendation method based on graph neural network | |
CN110009416A (en) | A kind of system based on big data cleaning and AI precision marketing | |
CN111274413A (en) | Intelligent heat supply service recommendation method based on knowledge graph | |
CN105511901A (en) | App cold start-up recommending method based on mobile app operation list | |
CN113344648B (en) | Advertisement recommendation method and system based on machine learning | |
CN110413882A (en) | Information-pushing method, device and equipment | |
CN101901277A (en) | Dynamic ontology modeling method and system based on user situation | |
CN112084418B (en) | Microblog user community discovery method based on neighbor information and attribute network characterization learning | |
CN202041990U (en) | Personal loan transaction platform for bank |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||