CN107613520A

CN107613520A - A method for discovering telecommunication user similarity based on the LDA topic model

Info

Publication number: CN107613520A
Application number: CN201710756540.6A
Authority: CN
Inventors: 解绍词; 吴新凯; 徐光侠; 刘宴兵; 程金伟
Original assignee: Chongqing University of Posts and Telecommunications
Current assignee: Chongqing University of Posts and Telecommunications
Priority date: 2017-08-29
Filing date: 2017-08-29
Publication date: 2018-01-19
Anticipated expiration: 2037-08-29
Also published as: CN107613520B

Abstract

The present invention relates to data mining and discloses a method for discovering telecommunication user similarity based on the LDA (Latent Dirichlet Allocation) topic model. The method combines the multidimensional features of telecommunication users with a topic discovery algorithm based on a probabilistic model, and computes user similarity from four aspects: the user's basic attributes, call records, short message records, and the location information of all base stations the user connects to within one day, together with the start and end time of each connection. The invention focuses on modeling the corpus of a user's one-day base station connections with the LDA topic model: it uses the statistical properties of the text to mine the latent topic information hidden in it, obtains the topic distribution of each document, and computes document similarity from these distributions. This provides a sound basis for mining the similarity features of users in the telecommunications field in depth.

Description

A method for discovering telecommunication user similarity based on the LDA topic model

Technical field

The present invention relates to a method for discovering telecommunication user similarity based on the LDA (Latent Dirichlet Allocation) topic model, and belongs to the fields of data mining and topic modeling.

Background art

In recent years, with the rise of the mobile Internet industry, the global telecommunications market has kept growing, technology has been updated ever faster, and competition among operators, and between operators and Internet companies, has become increasingly fierce. Traditional voice and short message services have been hit hard by the social networking products of Internet companies. In response, telecommunications operators worldwide have proposed transformation strategies, shifting their service strategy from business-centric to customer-centric. Operators must therefore understand their customers in greater depth, adjust their operating strategies accordingly, and provide users with higher-quality services.

With big data elevated to a national strategy, and with the massive user data that operators have accumulated over many years, fully mining the potential value of telecommunication user data is of great significance not only to operators but also to every sector of society.

To achieve this goal, one research direction is community detection on the telecommunication user network, and an important step in community detection is clustering similar users. Clustering groups similar objects into the same cluster and dissimilar objects into different clusters. Because cluster analysis can establish the macroscopic distribution of the data, reveal the correlation between data attributes, and infer relationships, clustering is widely used in data mining.

Existing methods for computing telecommunication user similarity consider the feature attributes of some user-related dimensions, but do not combine other features of mobile users, such as mobile app usage, browsing history, and base station location information. The similarity values they compute are therefore limited, which indirectly affects the accuracy of subsequent clustering. The LDA model is a probabilistic topic model for document collections, that is, a method for modeling the topic information of text data. It is a three-layer generative Bayesian network built on the following assumption: ignoring the syntactic structure of a document and the order in which words appear, a document is composed of several latent topics, and each topic is composed of several specific words. Accordingly, the present invention abstracts the base station location information of a telecommunication user as a document, uses the LDA topic model to compute the similarity between document topics, and combines this with the user's basic attributes, call relations, and short message relations to assess user similarity comprehensively.

Summary of the invention

To overcome the deficiencies of the prior art, the object of the present invention is to provide a method for discovering telecommunication user similarity based on the LDA topic model. The method combines the multidimensional features of telecommunication users with a topic discovery algorithm based on a probabilistic model and computes telecommunication user similarity at four levels, thereby supporting the accuracy of clustering.

To achieve the above object, embodiments of the present invention adopt the following technical solution, which comprises the following steps:

S1: Collect user information;

S2: Preprocess the user information collected in S1;

S3: Compute similarity for the basic attributes, the user call records, and the user short message records in the information preprocessed in S2;

S4: For the location information of the base stations the user connected to within one day in the information preprocessed in S2, build an LDA model and compute the similarity of this information;

S5: Compute the comprehensive similarity and infer the correlation;

S6: Cluster using the correlation inferred in S5.

Preprocessing the collected user information in S2 comprises four steps: data cleaning, data integration, data transformation, and data reduction.

The user basic attributes in S3 are the following 14 attributes: spending amount; online duration; whether the gender is unknown; whether the gender is female; whether the gender is male; whether the user is in an urban district; whether the user is in a county town; whether the user is in a rural area; whether the spending amount is between 0 and 100; whether the spending amount is between 100 and 200; whether the spending amount is between 200 and 300; whether the spending amount is between 300 and 500; whether the spending amount is between 500 and 1000; and whether the spending amount is greater than 1000.
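The 14 attributes above can be laid out as a fixed-length feature vector. The following is a minimal sketch only: the function name `encode_user`, the string values for gender and area, the attribute order, and the treatment of range boundaries (half-open intervals, with exactly 1000 placed in the last bucket) are assumptions made for illustration and are not specified in the text.

```python
def encode_user(spend, online_duration, gender, area):
    """Encode one user as a 14-dimensional feature vector (illustrative layout)."""
    # Assumed half-open spending ranges [low, high); the text does not state
    # how boundary values are handled.
    buckets = [(0, 100), (100, 200), (200, 300), (300, 500), (500, 1000)]

    vec = [float(spend), float(online_duration)]
    # Three gender indicators: unknown, female, male.
    vec += [1.0 if gender == g else 0.0 for g in ("unknown", "female", "male")]
    # Three area indicators: urban district, county town, rural area.
    vec += [1.0 if area == a else 0.0 for a in ("urban", "county", "rural")]
    # Five spending-range indicators plus one for the open-ended top range.
    vec += [1.0 if low <= spend < high else 0.0 for low, high in buckets]
    vec.append(1.0 if spend >= 1000 else 0.0)
    return vec


user = encode_user(spend=250, online_duration=12.5, gender="female", area="urban")
print(len(user), user)
```

In practice the two numeric attributes would probably be rescaled during preprocessing so that they do not dominate the binary indicators; that step is left out here because the text does not describe it.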

Basic attributes of the telecommunication user: each user is abstracted as a feature vector, and the similarity of users' basic attributes is measured by the cosine of the angle between the vectors. The larger the value, the more similar features the users share in their basic attributes.

User call records: first, regarding call duration, the measure depends not only on the duration of calls between the two users but also on each user's calls with neighboring users. Second, regarding the number of calls, suppose that within the same statistical period user a and user b had one 30-minute call, while user a and user c had six 5-minute calls; user a and user c are clearly in closer contact. Therefore, the longer the relative call duration between two users and the more calls they make, the higher their similarity.

User short message records: short message records are handled like call records, except that only the number of messages exchanged between users is considered. The larger the share of the messages exchanged between the two users in the total messages they exchange with neighboring users, the higher the similarity.

Location information of the base stations the user connected to within one day: one day is divided into several time periods. From the location labels of the base stations the user connects to in the different time periods, a place-transition document is built as the input to the LDA topic model; the topic distribution of the document is obtained and used to compute the similarity between documents.

The similarity based on the basic attributes of telecommunication users is calculated as follows:

$$\cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|}$$

where $\vec{a}$ denotes the N-dimensional feature vector of user a, $\vec{b}$ denotes the N-dimensional feature vector of user b, and $|\vec{a}|$ and $|\vec{b}|$ denote the lengths of the vectors.
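The cosine measure described above can be sketched in a few lines. The function name `cosine_similarity` and the convention of returning 0.0 for an all-zero vector are assumptions of this sketch; the text does not say how a zero vector should be handled.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))
```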

The similarity based on telecommunication user call records and short message records is calculated as follows:

where c denotes call duration, f denotes call frequency, and s denotes the number of short messages. c_ij denotes the duration of calls initiated by user i to user j, c_ji denotes the duration of calls initiated by user j to user i, c_i denotes the total duration of calls between user i and neighboring users (including user j), and c_j denotes the total duration of calls between user j and neighboring users (including user i). The other variables are defined analogously.

Before an LDA model is built from the location information of the base stations the user connected to within one day, the following preparatory steps are carried out:

(1) Attach one of four labels to each base station location in a region: home base station (Home), workplace base station (Work), other base station (Other), and no connection request received by any base station (No Reception). These four labels respectively mean: the user is at home; the user is at work; the user is at a place away from both home and workplace; the user's mobile phone is switched off.

(2) Abstract a telecommunication user's one-day itinerary as a sequence of geographic location labels. First, build a fine-grained location description: divide one day into 20-minute time blocks and take the base station location label with the longest duration within a block as the label of that block. A user's day is thus abstracted as a vector of 72 location labels.

(3) To prevent overfitting, then build a coarse-grained time description: divide one day into 8 time slices, namely 0:00 to 6:00, 6:00 to 9:00, 9:00 to 12:00, 12:00 to 14:00, 14:00 to 17:00, 17:00 to 19:00, 19:00 to 21:00, and 21:00 to 24:00, numbered 0 to 7.

(4) Finally, build the place-transition corpus. Each term of a document in the corpus consists of the fine-grained location labels within 2 consecutive hours and one coarse-grained time label, for example HHHHHH0 or HWWWWW2.
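Steps (2) to (4) can be sketched as follows. Several details here are assumptions rather than statements from the text: the single-letter codes `O` and `N` for Other and No Reception (only `H` and `W` appear in the examples), the use of non-overlapping 2-hour windows, the choice of the window's start time to pick the coarse time slice, and the helper names `dominant_label`, `coarse_slice`, and `day_to_terms`.

```python
import bisect

# Start hour of each coarse time slice, numbered 0 to 7.
SLICE_STARTS = [0, 6, 9, 12, 14, 17, 19, 21]


def dominant_label(durations):
    """Pick the label with the longest connection time in one 20-minute block.

    `durations` maps a label ('H', 'W', 'O', 'N') to seconds spent on it.
    An empty block is treated as 'N' (assumed convention).
    """
    if not durations:
        return "N"
    return max(durations, key=durations.get)


def coarse_slice(hour):
    """Return the number (0 to 7) of the coarse time slice containing `hour`."""
    return bisect.bisect_right(SLICE_STARTS, hour) - 1


def day_to_terms(labels, block_minutes=20, window_blocks=6):
    """Turn 72 block labels into terms such as 'HHHHHH0' (one per 2-hour window)."""
    if len(labels) != 24 * 60 // block_minutes:
        raise ValueError("expected one label per 20-minute block of the day")
    terms = []
    for start in range(0, len(labels), window_blocks):
        start_hour = start * block_minutes / 60
        window = "".join(labels[start:start + window_blocks])
        terms.append(window + str(coarse_slice(start_hour)))
    return terms


# A hypothetical day: home until 8:00, work until 18:00, elsewhere until 20:00, then home.
day = ["H"] * 24 + ["W"] * 30 + ["O"] * 6 + ["H"] * 12
print(day_to_terms(day))
```

Each user-day then becomes one document whose terms are these 7-character strings. A sliding window would yield more terms per day; the text does not say which variant is intended.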

An LDA model is built from all documents in the given place-transition corpus. The document set consists of the place-change sequences of the specified users within one day, and each term in the vocabulary consists of 6 fine-grained location labels and 1 coarse-grained time label. The generative process of the LDA model comprises the following steps:

(1) Draw the topic probability distribution of document i as $\theta_i \sim \mathrm{Dir}(\vec{\alpha})$, where $\theta_i$ denotes the topic distribution of the i-th document, Dir denotes the Dirichlet distribution, i belongs to {1, ..., M}, M is the number of documents, and $\vec{\alpha}$ is the parameter of the Dirichlet prior on the topic distribution of each document, also called a hyperparameter.

(2) Draw the term probability distribution of topic k as $\varphi_k \sim \mathrm{Dir}(\vec{\beta})$, where $\varphi_k$ denotes the term distribution of the k-th topic, Dir denotes the Dirichlet distribution, k belongs to {1, ..., K}, K is the number of topics, and $\vec{\beta}$ is the parameter of the Dirichlet prior on the term distribution of each topic, also called a hyperparameter.

(3) For each word $w_{i,j}$ in a document, draw a topic $z_{i,j} \sim \mathrm{Multinomial}(\theta_i)$ from a multinomial distribution, and then draw a term $w_{i,j} \sim \mathrm{Multinomial}(\varphi_{z_{i,j}})$ from a multinomial distribution. Here $w_{i,j}$ denotes the j-th term of the i-th document, $z_{i,j}$ denotes the topic number of the j-th term of the i-th document, $\theta_i$ denotes the topic distribution of the i-th document, and $\varphi_{z_{i,j}}$ denotes the term distribution of topic $z_{i,j}$.
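The three generative steps above can be simulated directly, which is a useful way to check one's reading of the model. This is a sketch under assumptions: symmetric scalar hyperparameters, a fixed document length, integer term ids in place of the 7-character terms, and the function name `generate_corpus` are all choices of this example.

```python
import numpy as np


def generate_corpus(M, K, V, doc_len, alpha, beta, seed=0):
    """Sample M synthetic documents from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # Step (1): one topic distribution per document, theta_i ~ Dir(alpha).
    theta = rng.dirichlet(np.full(K, alpha), size=M)
    # Step (2): one term distribution per topic, phi_k ~ Dir(beta).
    phi = rng.dirichlet(np.full(V, beta), size=K)
    docs = []
    for i in range(M):
        # Step (3): draw a topic for every position, then a term from that topic.
        topics = rng.choice(K, size=doc_len, p=theta[i])
        docs.append([int(rng.choice(V, p=phi[k])) for k in topics])
    return theta, phi, docs


theta, phi, docs = generate_corpus(M=3, K=4, V=10, doc_len=12, alpha=1.0, beta=0.5)
print(docs[0])
```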

From the LDA model obtained by the above process, the joint probability distribution of a document given the hyperparameters is calculated as:

$$p(\omega_m, z_m, \theta_m, \varphi \mid \alpha, \beta) = \prod_{n=1}^{N_m} p(w_{m,n} \mid \varphi_{z_{m,n}})\, p(z_{m,n} \mid \theta_m) \cdot p(\theta_m \mid \alpha) \cdot p(\varphi \mid \beta)$$

where $\omega_m$ denotes the vector of all words in document m, $z_m$ denotes the topic vector corresponding to document m, $\theta_m$ denotes the topic probability distribution of document m, $\varphi$ denotes the term probability distributions of all topics, $\alpha$ and $\beta$ are the hyperparameters of the Dirichlet distributions, $N_m$ denotes the length of document m, $w_{m,n}$ denotes the n-th term of the m-th document, $\varphi_{z_{m,n}}$ denotes the term distribution of topic $z_{m,n}$, and $z_{m,n}$ denotes the topic number of the n-th term of the m-th document.

Based on the joint probability distribution obtained above, the parameters are estimated by Gibbs sampling during modeling. The initial number of topics is set to K = 30, the hyperparameters to α = 30/K and β = 0.01, and the number of Gibbs sampling iterations to 1000. Topic mining is performed on the corpus to generate the topic probability distribution of each document, $P(z=k) = \theta_k^{(d)}$, and the term probability distribution under each topic, $P(w \mid z=k) = \varphi_w^{(k)}$.

The topic probability distribution of each document is calculated as follows:

$$\theta_{m,k} = \frac{n_{m,k} + \alpha_k}{\sum_{k=1}^{K} (n_{m,k} + \alpha_k)}$$

where $\theta_{m,k}$ denotes the probability of the k-th topic in the m-th document, $n_{m,k}$ denotes the number of times the k-th topic occurs in document m, K denotes the total number of topics, and α is the first parameter vector, with $\alpha_k$ its k-th component.

The term probability distribution under each topic is calculated as follows:

$$\varphi_{k,w} = \frac{n_{k,w} + \beta_w}{\sum_{w=1}^{V} (n_{k,w} + \beta_w)}$$

where $\varphi_{k,w}$ denotes the probability of the w-th term under the k-th topic, $n_{k,w}$ denotes the number of times the w-th term occurs under the k-th topic, V denotes the total number of terms, and β is the second parameter vector, with $\beta_w$ its w-th component.
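For orientation, a compact collapsed Gibbs sampler is sketched below; it ends by computing the two estimates of the document-topic and topic-term distributions from the count matrices. This is a minimal sketch of the standard collapsed sampler, assuming symmetric scalar hyperparameters and documents given as lists of integer term ids. The function name `lda_gibbs` and the default values (taken from the settings K = 30, α = 30/K, β = 0.01, 1000 iterations) are choices of this example and do not reproduce any particular implementation.

```python
import numpy as np


def lda_gibbs(docs, V, K=30, alpha=None, beta=0.01, iters=1000, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (theta, phi)."""
    rng = np.random.default_rng(seed)
    alpha = 30.0 / K if alpha is None else alpha
    M = len(docs)
    n_mk = np.zeros((M, K))   # times topic k is assigned in document m
    n_kw = np.zeros((K, V))   # times term w is assigned to topic k
    n_k = np.zeros(K)         # total assignments to topic k

    # Random initial topic assignment for every token.
    z = []
    for m, doc in enumerate(docs):
        z_m = rng.integers(K, size=len(doc))
        z.append(z_m)
        for w, k in zip(doc, z_m):
            n_mk[m, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1

    for _ in range(iters):
        for m, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[m][n]
                n_mk[m, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Conditional distribution of this token's topic given all others.
                p = (n_mk[m] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # Record the new assignment.
                z[m][n] = k
                n_mk[m, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1

    # Point estimates from the final counts.
    theta = (n_mk + alpha) / (n_mk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
    return theta, phi


# Tiny hypothetical corpus with 4 distinct term ids.
theta, phi = lda_gibbs([[0, 0, 1, 1], [2, 2, 3, 3]], V=4, K=2, iters=50)
print(theta.round(3))
```

Each row of `theta` is the topic distribution of one user-day document and can be passed to whatever distance measure is used to compare documents.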

From the document topic probability distributions obtained above, the distance between the topic probability distributions of two documents is calculated as follows:

where d₁ and d₂ denote two documents, i denotes the i-th topic number, and the two probability terms denote the probability of topic i in document d₁ and in document d₂, respectively.
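As an illustration of comparing two topic distributions, the sketch below uses the Jensen-Shannon divergence. This measure is an assumption of the example, swapped in as a common symmetric distance between probability distributions; it is not necessarily the distance the invention uses. The function name `js_divergence` is likewise hypothetical.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two topic distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalize defensively so the inputs are proper distributions.
    p = p / p.sum()
    q = q / q.sum()
    mid = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing to the KL divergence.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


d1 = [0.7, 0.2, 0.1]
d2 = [0.1, 0.3, 0.6]
print(js_divergence(d1, d2))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (disjoint support), so `1 - js_divergence(p, q)` could serve as a similarity score under the same assumption.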

The method for discovering telecommunication user similarity based on the LDA topic model combines the multidimensional features of the user with the probability-based topic discovery algorithm (LDA). The comprehensive user similarity is calculated as follows:

where u1 and u2 denote user 1 and user 2; η₁ denotes the weight of the similarity computed from basic attributes, set to η₁ = 0.1; η₂ denotes the weight of the similarity computed from call records and short message records, set to η₂ = 0.3; η₃ denotes the weight of the similarity computed from the location information of the base stations the user connected to within one day, set to η₃ = 0.6; the remaining symbols are as defined above.
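Since the three weights sum to 1, one natural reading is a weighted linear combination of the three partial similarities. The sketch below assumes exactly that form, and assumes each partial similarity has already been scaled to a comparable range; both points, and the function name `combined_similarity`, are assumptions of this example.

```python
def combined_similarity(attr_sim, comm_sim, loc_sim, weights=(0.1, 0.3, 0.6)):
    """Weighted sum of the three partial similarities (assumed linear form).

    attr_sim: similarity from basic attributes
    comm_sim: similarity from call and short message records
    loc_sim:  similarity from one-day base station location documents
    """
    eta1, eta2, eta3 = weights
    return eta1 * attr_sim + eta2 * comm_sim + eta3 * loc_sim


# Hypothetical partial similarities for a pair of users.
print(combined_similarity(0.5, 0.2, 0.9))
```

The heavy weight on the location term reflects the emphasis placed on the LDA-based base station component.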

Beneficial effects of the present invention:

1. The base station information of telecommunication user connections is introduced and base stations are divided into different categories. The user's location behavior within one day is modeled with LDA, which fully mines the similarity of users in their daily behavior.

2. Time slices are introduced to characterize a user's daily habits within one day at both coarse and fine granularity, which avoids overfitting as far as possible.

3. The multidimensional features of telecommunication users are combined with the topic discovery algorithm (LDA) based on a probabilistic model, and telecommunication user similarity is considered at four levels, which is comprehensive and reasonable.

Brief description of the drawings

Fig. 1 is a schematic diagram of the similarity calculation method of the present invention.

Fig. 2 is a diagram of the LDA topic model of the present invention.

Fig. 3 is a schematic diagram of the topological structure of the LDA model used by the present invention. The LDA model treats each document as a mixture of multiple topics, and each topic as characterized by multiple terms.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is further described below with reference to specific embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

A method for discovering telecommunication user similarity based on the LDA topic model considers the similarity characteristics of users at four levels, including:

Location information of the base stations the user connected to within one day: the 24 hours of one day are divided into several time periods. From the location labels of the base stations the user connects to in the different time periods, a place-transition document is built as the input to the LDA topic model; the topic distribution of the document is obtained and used to compute the similarity between documents.

As shown in Fig. 1, the data required by the method undergo the following steps before use: data cleaning, data integration, data reduction, and data transformation.

Next, the user's basic attributes are extracted from the preprocessed data. There are 14 of them: spending amount; online duration; whether the gender is unknown; whether the gender is female; whether the gender is male; whether the user is in an urban district; whether the user is in a county town; whether the user is in a rural area; whether the spending amount is between 0 and 100; whether the spending amount is between 100 and 200; whether the spending amount is between 200 and 300; whether the spending amount is between 300 and 500; whether the spending amount is between 500 and 1000; and whether the spending amount is greater than 1000. These attributes have already undergone reduction and transformation, so the user attributes can be abstracted as a feature vector and the similarity calculated with the following formula:

$$\cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|}$$

Then the user's call records and short message records are extracted from the data set, the call duration, number of calls, and number of short messages between users are counted, and the similarity is calculated with the following formula:

Finally, an LDA model is built from the user's base station connection information, which specifically comprises the following steps:

(2) Abstract a telecommunication user's one-day itinerary as a sequence of geographic location labels. First, build a fine-grained location description: divide one day into 20-minute time blocks and take the base station location label with the longest duration within a block as the label of that block. A user's day is thus abstracted as a vector of 72 location labels. To prevent overfitting, then build a coarse-grained time description: divide one day into 8 time slices, namely 0:00 to 6:00, 6:00 to 9:00, 9:00 to 12:00, 12:00 to 14:00, 14:00 to 17:00, 17:00 to 19:00, 19:00 to 21:00, and 21:00 to 24:00, numbered 0 to 7.

(3) Build the place-transition corpus. Each term of a document in the corpus consists of the fine-grained base station location labels within 2 consecutive hours and one coarse-grained time label. An LDA model is built from all documents in the given place-transition corpus, as shown in Fig. 2.

The generative process of the LDA model comprises the following steps:

(1) Draw the topic probability distribution of document i as $\theta_i \sim \mathrm{Dir}(\vec{\alpha})$, where $\theta_i$ denotes the topic distribution of the i-th document, Dir indicates that it follows a Dirichlet distribution, i belongs to {1, ..., M}, M is the number of documents, and $\vec{\alpha}$ is a hyperparameter.

(2) Draw the term probability distribution of topic k as $\varphi_k \sim \mathrm{Dir}(\vec{\beta})$, where $\varphi_k$ denotes the term distribution of the k-th topic, Dir indicates that it follows a Dirichlet distribution, k belongs to {1, ..., K}, K is the number of topics, and $\vec{\beta}$ is a hyperparameter.

(3) For each word $w_{i,j}$ in a document, draw a topic $z_{i,j} \sim \mathrm{Multinomial}(\theta_i)$ from a multinomial distribution, and then draw a term $w_{i,j} \sim \mathrm{Multinomial}(\varphi_{z_{i,j}})$ from a multinomial distribution. Here $w_{i,j}$ denotes the j-th term of the i-th document, $z_{i,j}$ denotes the topic number of the j-th term of the i-th document, $\theta_i$ denotes the topic distribution of the i-th document, and $\varphi_{z_{i,j}}$ denotes the term distribution of topic $z_{i,j}$.

From the LDA model obtained by the above process, it can be seen that the LDA model has a clear hierarchical structure. As shown in Fig. 3, each document is a mixture of multiple topics, and each topic is characterized by multiple terms. The joint probability distribution of a document given the hyperparameters is then calculated as:

$$p(\omega_m, z_m, \theta_m, \varphi \mid \alpha, \beta) = \prod_{n=1}^{N_m} p(w_{m,n} \mid \varphi_{z_{m,n}})\, p(z_{m,n} \mid \theta_m) \cdot p(\theta_m \mid \alpha) \cdot p(\varphi \mid \beta)$$

Based on the joint probability distribution obtained above, the parameters are estimated by Gibbs sampling during modeling. The initial number of topics is set to K = 30, the hyperparameters to α = 30/K and β = 0.01, and the number of Gibbs sampling iterations to 1000. Topic mining is performed on the corpus to generate the topic probability distribution of each document, $P(z=k) = \theta_k^{(d)}$, and the term probability distribution under each topic, $P(w \mid z=k) = \varphi_w^{(k)}$.

The term probability distribution under each topic is calculated as follows:

$$\varphi_{k,w} = \frac{n_{k,w} + \beta_w}{\sum_{w=1}^{V} (n_{k,w} + \beta_w)}$$

Document similarity can therefore be measured by the distance between the topic probability distributions of two documents, calculated as follows:

Finally, the multidimensional features of the user are combined with the probability-based topic discovery algorithm (LDA), and the comprehensive user similarity is calculated as follows:

It will be obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are intended to be embraced by the invention. No reference sign in a claim should be construed as limiting the claim concerned.

Moreover, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for the sake of clarity. Those skilled in the art should treat the specification as a whole; the technical solutions in the various embodiments may also be suitably combined to form other embodiments that those skilled in the art can understand.

Claims

1. A method for discovering telecommunication user similarity based on the LDA topic model, characterized by comprising the following steps:

S1: Collect user information;

S2: Preprocess the user information collected in S1;

S3: Compute similarity separately for the basic attributes, the user call records, and the user short message records in the information preprocessed in S2;

S4: For the location information of the base stations the user connected to within one day in the information preprocessed in S2, build an LDA model and compute the similarity of this information;

S5: Combine the similarities from S3 and S4 to compute the comprehensive similarity and infer the correlation;

S6: Cluster using the correlation inferred in S5.

2. The method for discovering telecommunication user similarity based on the LDA topic model according to claim 1, characterized in that the preprocessing in S2 of the user information collected in S1 comprises four steps: data cleaning, data integration, data transformation, and data reduction.

3. The method for discovering telecommunication user similarity based on the LDA topic model according to claim 1, characterized in that the user basic attributes in S3 are the following 14 attributes: spending amount; online duration; whether the gender is unknown; whether the gender is female; whether the gender is male; whether the user is in an urban district; whether the user is in a county town; whether the user is in a rural area; whether the spending amount is between 0 and 100; whether the spending amount is between 100 and 200; whether the spending amount is between 200 and 300; whether the spending amount is between 300 and 500; whether the spending amount is between 500 and 1000; and whether the spending amount is greater than 1000.

4. The method for discovering telecommunication user similarity based on the LDA topic model according to claim 1, characterized in that the similarity of user basic attributes in S3 is calculated as follows:

$$\cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|}$$

where $\vec{a}$ denotes the N-dimensional feature vector of user a, $\vec{b}$ denotes the N-dimensional feature vector of user b, $|\vec{a}|$ and $|\vec{b}|$ denote the lengths of the respective vectors, and $\cos(\vec{a}, \vec{b})$ is the similarity of the users' basic attributes; the larger the value, the more similar features the users share in their basic attributes.

5. The method for discovering telecommunication user similarity based on the LDA topic model according to claim 1, characterized in that the similarity of user call records and user short message records in S3 is calculated as follows:

where P(C, S) is the similarity of user call records and user short message records, c denotes call duration, f denotes call frequency, and s denotes the number of short messages; c_ij denotes the duration of calls initiated by user i to user j, c_ji denotes the duration of calls initiated by user j to user i, c_i denotes the total duration of calls between user i and neighboring users, and c_j denotes the total duration of calls between user j and neighboring users; f_ij denotes the frequency of calls initiated by user i to user j, f_ji denotes the frequency of calls initiated by user j to user i, f_i denotes the total frequency of calls between user i and neighboring users, and f_j denotes the total frequency of calls between user j and neighboring users; s_ij denotes the number of short messages sent by user i to user j, s_ji denotes the number of short messages sent by user j to user i, s_i denotes the total number of short messages between user i and neighboring users, and s_j denotes the total number of short messages between user j and neighboring users.

6. The method for discovering telecommunication user similarity based on the LDA topic model according to claim 1, characterized in that building an LDA model from the location information of the base stations the user connected to within one day and computing the similarity of this information in S4 comprises the steps of:

S41: Perform the preparatory settings before modeling;

S42: Build the LDA model;

S43: Estimate the parameters by Gibbs sampling, and compute document similarity from the computed topic probability distributions and term probability distributions.

7. The method for discovering telecommunication user similarity based on the LDA topic model according to claim 6, characterized in that the preparatory settings before modeling in S41 comprise the following steps:

S411: Attach one of four labels to each base station location in a region: home base station, workplace base station, other base station, and no connection request received by any base station; these four labels respectively mean: the user is at home, the user is at work, the user is at a place away from both home and workplace, and the user's mobile phone is switched off;

S412: Divide one day into 20-minute time blocks to build a fine-grained location description consisting of a vector of 72 location labels; then divide one day into 8 time slices, namely 0:00 to 6:00, 6:00 to 9:00, 9:00 to 12:00, 12:00 to 14:00, 14:00 to 17:00, 17:00 to 19:00, 19:00 to 21:00, and 21:00 to 24:00, numbered 0 to 7, to build a coarse-grained time description;

S413: Build the place-transition corpus, in which each term of a document consists of the fine-grained location labels within 2 consecutive hours and one coarse-grained time label.

8. The method for discovering telecommunication user similarity based on the LDA topic model according to claim 6, characterized in that building the LDA model in S42 comprises the following steps:

S421: Draw the topic probability distribution of document i as $\theta_i \sim \mathrm{Dir}(\vec{\alpha})$, where $\theta_i$ denotes the topic distribution of the i-th document, Dir denotes the Dirichlet distribution, i belongs to {1, ..., M}, M is the number of documents, and $\vec{\alpha}$ is the parameter of the Dirichlet prior on the topic distribution of each document, also called a hyperparameter;

S422: Draw the term probability distribution of topic k as $\varphi_k \sim \mathrm{Dir}(\vec{\beta})$, where $\varphi_k$ denotes the term distribution of the k-th topic, Dir denotes the Dirichlet distribution, k belongs to {1, ..., K}, K is the number of topics, and $\vec{\beta}$ is the parameter of the Dirichlet prior on the term distribution of each topic, also called a hyperparameter;

S423: For each word $w_{i,j}$ in a document, draw a topic $z_{i,j} \sim \mathrm{Multinomial}(\theta_i)$ from a multinomial distribution, and then draw a term $w_{i,j} \sim \mathrm{Multinomial}(\varphi_{z_{i,j}})$ from a multinomial distribution;

where $w_{i,j}$ denotes the j-th term of the i-th document, $z_{i,j}$ denotes the topic number of the j-th term of the i-th document, $\theta_i$ denotes the topic distribution of the i-th document, and $\varphi_{z_{i,j}}$ denotes the term distribution of topic $z_{i,j}$.

9. The method for discovering telecommunication user similarity based on the LDA topic model according to claim 6, characterized in that the document similarity in S43 is calculated as follows:

where d₁ and d₂ denote two documents, i denotes the i-th topic number, the two probability terms denote the probability of topic i in document d₁ and in document d₂, respectively, and K denotes the total number of topics.

10. The method for discovering telecommunication user similarity based on the LDA topic model according to claim 1, characterized in that the comprehensive similarity in S5 is calculated as follows:

where u1 and u2 denote user 1 and user 2; η₁ denotes the weight of the similarity computed from basic attributes, set to η₁ = 0.1; η₂ denotes the weight of the similarity computed from call records and short message records, set to η₂ = 0.3; η₃ denotes the weight of the similarity computed from the location information of the base stations the user connected to within one day, set to η₃ = 0.6.