CN105302866A - OSN community discovery method based on LDA Theme model


Info

Publication number
CN105302866A
Authority
CN
China
Prior art keywords
document
topic
lda
model
probability distribution
Prior art date
Legal status
Pending
Application number
CN201510611455.1A
Other languages
Chinese (zh)
Inventor
曹玖新
马卓
陈巧云
刘波
周涛
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201510611455.1A priority Critical patent/CN105302866A/en
Publication of CN105302866A publication Critical patent/CN105302866A/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01: Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an online social network (OSN) community discovery method based on the Latent Dirichlet Allocation (LDA) topic model. The method first preprocesses the data; then builds LDA topic models (an LDA-F model and an LDA-T model) from the relationships between each user in the online social network and the user's friends, together with the text the user posts, and derives the models' probability distributions; next estimates the parameters with a Gibbs sampling algorithm; and finally discovers OSN communities from the estimated parameters. With this method, a probability model can be obtained by mining the semantic information of users' microblog posts, without relying on network topology connection information; introducing the semantic similarity of microblog content effectively describes the probability distribution of the users' interests; and introducing the closeness of topological connections inside a community allows communities with tightly connected internal topology to be discovered.

Description

OSN community discovery method based on LDA topic model
Technical Field
The invention relates to a community discovery mechanism for online social networks (OSNs) that uses an LDA topic model, and belongs to the field of social computing, in particular to community discovery.
Background
With the rapid development of the Internet, the network has gradually shifted from being data-centered to being human-centered, which has driven the rapid growth of online social networks. Unlike a traditional interpersonal network, an online social network not only has a large number of users and the friend relationships among them, but also a large amount of text spontaneously posted by the users, which brings both new vitality and new challenges to community discovery.
Traditional community discovery methods are mainly link-based, i.e., based on the topological structure of a graph: communities are divided by analyzing the explicit links between individuals, so that links between nodes inside a discovered community are relatively dense while links between different communities are relatively sparse. However, such methods ignore the topic characteristics of the users. On a microblog platform, a user's posts usually imply the user's interests, hobbies, behavior patterns, and similar information, and the topic models used in natural language processing can take these factors into account.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the community discovery method based on the LDA topic model provided by the invention obtains a probability model by mining the semantic information of a user's microblog posts, without relying on network topology connection information; by introducing the semantic similarity of microblog content, it effectively describes the probability distribution of the user's interests; and by introducing the closeness of topological connections inside a community, it mines communities whose internal topology is very tightly connected.
The technical scheme is as follows: to solve the above problems, the invention provides an OSN community discovery method based on an LDA topic model. It performs OSN community discovery using the relationships between users and their friends in an online social network together with the text spontaneously posted by the users, and comprises the following steps:
1) preprocessing the data set: perform word segmentation, stop-word removal, noise removal, and similar preprocessing on the original users' microblog documents. Specifically, extract the [uid, text] fields of each record from the weibo data set and group all microblogs by uid, so that each record has the format [uid, text1; text2; ...]; perform word segmentation with the ICTCLAS 2013 Chinese lexical analysis system of the Chinese Academy of Sciences, and during segmentation remove stop words, tokens with no practical significance to the model (such as URLs, punctuation marks, and modal particles), and microblog emoticons; for the followers data set in the document recording user relationships, make the user relationships bidirectional and eliminate users without friends, so that each record has the format [user, friend1; friend2; ...];
2) building the LDA topic models from the established community elements: a topic model LDA-T built on the semantic similarity of microblog content within a community, and a topic model LDA-F built on the closeness of topological connections within a community. In LDA-T, the term set is the set of terms in all users' posts, the document set is the set of all users' posts, and a topic is a community; in LDA-F, the term set is the set of all friends of the users, the document set is the set of all users, and a topic is a community;
3) for the models LDA-T and LDA-F obtained in step 2, applying the Dirichlet distribution to the topic probability distribution of each document and the term probability distribution of each topic, and generating the joint probability distribution p(w_m, z_m, θ_m, Φ | α, β) based on the hyper-parameters, where α and β are hyper-parameters of the Dirichlet distribution, w_m is the set of all terms in the m-th document, z_m is the set of topics corresponding to all terms in the m-th document, θ_m is the topic probability distribution of the m-th document, and Φ is the set of term probability distributions under all topics;
4) from the joint probability distribution obtained in step 3, estimating the probability distribution θ_m of topics given a document and the probability distribution φ_k of terms given a topic using a Gibbs sampling algorithm;
5) acquiring the communities from the parameters obtained in step 4.
The generation process and parameters of a document in the LDA model of step 2 are defined as follows:
1) for each topic k ∈ [1, K], sample the term probability distribution of topic k: φ_k ~ Dir(β);
2) for each document m ∈ [1, M], sample the topic probability distribution of document m: θ_m ~ Dir(α);
3) for each document m ∈ [1, M], sample the length of document m: N_m ~ Poiss(ξ);
4) for each term n ∈ [1, N_m] in document m, select a latent topic z_{m,n} ~ Mult(θ_m) and generate the term w_{m,n} ~ Mult(φ_{z_{m,n}});
where N_m is the number of terms contained in the m-th document, K is the number of topics, M is the number of documents, and α, β, and ξ are parameters of the probability distributions.
The joint probability distribution generated in step 3 is:

p(w_m, z_m, θ_m, Φ | α, β) = p(θ_m | α) · p(Φ | β) · Π_{n=1}^{N_m} p(z_{m,n} | θ_m) · p(w_{m,n} | φ_{z_{m,n}})   (1)

where w_m is the set of all terms in the m-th document, z_m is the set of topics corresponding to all terms in the m-th document, θ_m is the topic probability distribution of the m-th document, Φ is the set of term probability distributions under all topics, α and β are hyper-parameters of the Dirichlet distribution, w_{m,n} is the n-th term of the m-th document, z_{m,n} is the topic corresponding to the n-th term in the m-th document, and N_m is the number of terms contained in the m-th document.
In step 4, to apply the Gibbs sampling algorithm to the LDA model, the term set, the prior Dirichlet distribution parameters α and β, and the topic number K must be known; the algorithm finally yields the probability distribution θ of topics given a document and the probability distribution φ of terms given a topic, computed as:

θ_{m,k} = (n_m^{(k)} + α_k) / Σ_{k=1}^{K} (n_m^{(k)} + α_k)   (2)

φ_{k,t} = (n_k^{(t)} + β_t) / Σ_{t=1}^{V} (n_k^{(t)} + β_t)   (3)

where θ_{m,k} is the probability that the topic is k given document m; n_m^{(k)} is the number of times topic k appears in document m; α = (α_1, α_2, ..., α_K) is the hyper-parameter of the K-dimensional Dirichlet distribution, each α_k a positive real number reflecting the prior belief about parameter θ_m; K is the number of topics; φ_{k,t} is the probability that the term is t given topic k; n_k^{(t)} is the number of times term t appears in topic k; β = (β_1, β_2, ..., β_V) is the hyper-parameter of the V-dimensional Dirichlet distribution, each β_t a positive real number reflecting the prior belief about parameter φ_k; and V is the number of terms in the vocabulary.
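Equations (2) and (3) amount to a row-wise normalization of the Gibbs count matrices. The sketch below assumes symmetric scalar hyper-parameters and hypothetical variable names (`n_mk` for n_m^(k), `n_kt` for n_k^(t)); it illustrates the formulas and is not the patented implementation.

```python
import numpy as np

def estimate_params(n_mk, n_kt, alpha, beta):
    """Point estimates of theta and phi from Gibbs count matrices.

    n_mk[m, k]: times topic k was assigned in document m (n_m^(k));
    n_kt[k, t]: times term t was assigned to topic k (n_k^(t)).
    Implements equations (2) and (3) with symmetric scalar alpha, beta.
    """
    theta = (n_mk + alpha) / (n_mk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kt + beta) / (n_kt + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Toy counts: 2 documents, 2 topics, vocabulary of 3 terms.
n_mk = np.array([[3.0, 1.0], [0.0, 4.0]])
n_kt = np.array([[2.0, 2.0, 0.0], [1.0, 0.0, 3.0]])
theta, phi = estimate_params(n_mk, n_kt, alpha=0.5, beta=0.5)
```

Each row of `theta` and `phi` sums to 1, as a probability distribution over topics and terms respectively.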
Advantageous effects: with the above technical scheme, the invention has the following advantages:
1) the semantic similarity of microblog content is introduced, effectively describing the probability distribution of the users' interests;
2) the closeness of topological connections inside a community is introduced, so communities whose internal topology is very tightly connected can be mined;
3) the topic model improves on traditional community discovery methods: a probability model is obtained by mining the semantic information of users' microblogs without relying on network topology connection information;
4) the Gibbs sampling algorithm is used for parameter estimation; compared with the other two parameter estimation algorithms, variational inference and the EM algorithm, it handles complex distributions more simply and quickly;
5) a data set preprocessing mechanism is introduced, which ensures the accuracy of the community discovery results.
Drawings
FIG. 1 is a diagram of an LDA topic model of the present invention;
FIG. 2 is a flow chart of the Gibbs sampling algorithm of the present invention;
FIG. 3 is a flow chart of community discovery according to the present invention.
Detailed Description
The present invention is further illustrated by the following examples. It should be understood that these examples are purely illustrative and do not limit the scope of the invention; modifications of its various equivalent forms by those skilled in the art, after reading the invention, fall within the scope defined by the appended claims.
An OSN community discovery method based on an LDA topic model first preprocesses the data set; then builds the LDA topic models (an LDA-F model and an LDA-T model) from the relationships between users and their friends in the online social network and the text spontaneously posted by the users, and derives the models' probability distributions; then performs parameter estimation with a Gibbs sampling algorithm; and finally performs OSN community discovery from the estimated parameters. The specific steps are as follows:
1) Data set preprocessing:
1.1) Data set preprocessing for the LDA-F model
Because a friend relationship defined by the LDA-F model must be a bidirectional edge, the user relationships in the followers data set are made bidirectional and users without friends are eliminated; each record has the format [user, friend1; friend2; ...];
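As a rough illustration of this bidirectionalization step, the sketch below keeps only mutual follow edges and drops friendless users. The function name and the toy record layout (a dict from user id to the set of followed ids) are assumptions for illustration, not the patent's actual data format.

```python
def symmetrize_friends(follows):
    """Keep only mutual (bidirectional) follow edges and drop friendless users.

    `follows` maps each user id to the set of ids that user follows.
    Returns a dict mapping user -> sorted list of mutual friends,
    omitting any user left with no friends.
    """
    mutual = {}
    for user, followees in follows.items():
        friends = {f for f in followees if user in follows.get(f, set())}
        if friends:
            mutual[user] = sorted(friends)
    return mutual

# Toy example: u1<->u2 is mutual; u3 follows u1 unreciprocated, so u3 is dropped.
example = {"u1": {"u2"}, "u2": {"u1"}, "u3": {"u1"}}
records = symmetrize_friends(example)
```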
1.2) Data set preprocessing for the LDA-T model
Extract the [uid, text] fields of each record from the weibo data set and group all microblogs by uid, so that each record has the format [uid, text1; text2; ...]; perform word segmentation on the LDA-T corpus with the ICTCLAS 2013 Chinese lexical analysis system of the Chinese Academy of Sciences, and during segmentation remove stop words, tokens with no practical significance to the model (such as URLs and punctuation), and microblog emoticons.
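The cleaning pipeline for the LDA-T corpus can be sketched as below. ICTCLAS 2013 is not reproduced here: a plain whitespace `tokenize` stands in for the real Chinese segmenter, and the stop-word list and regular expressions are illustrative assumptions only.

```python
import re

STOPWORDS = {"的", "了", "是", "the", "a"}  # illustrative stop-word list only

URL_RE = re.compile(r"https?://\S+")
EMOTICON_RE = re.compile(r"\[[^\[\]]{1,8}\]")      # weibo emoticons like [哈哈]
PUNCT_RE = re.compile(r"[^\w\u4e00-\u9fff ]")      # keep word chars, CJK, spaces

def preprocess(text, tokenize=str.split):
    """Strip URLs, emoticons and punctuation, tokenize, drop stop words.

    `tokenize` stands in for a real Chinese segmenter (the patent uses
    ICTCLAS 2013); whitespace splitting is used here for illustration.
    """
    text = URL_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    return [t for t in tokenize(text) if t and t not in STOPWORDS]

tokens = preprocess("check this http://t.cn/abc [哈哈] great topic model !")
```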
2) Solving the probability distributions of the models:
Both the topic model LDA-T, built on the semantic similarity of microblog content within a community, and the topic model LDA-F, built on the closeness of topological connections within a community, are LDA models.
In the topic model LDA-T, the term set is the set of terms in all users' posts, the document set is the set of all users' posts, and a topic is a community; in the topic model LDA-F, the term set is the set of all friends of the users, the document set is the set of all users, and a topic is a community.
For an LDA model with M documents and K topics, the generation process and parameters of a document are defined as follows:
2.1) for each topic k ∈ [1, K], sample the term probability distribution of topic k: φ_k ~ Dir(β);
2.2) for each document m ∈ [1, M], sample the topic probability distribution of document m: θ_m ~ Dir(α);
2.3) for each document m ∈ [1, M], sample the length of document m: N_m ~ Poiss(ξ);
2.4) for each term n ∈ [1, N_m] in document m, select a latent topic z_{m,n} ~ Mult(θ_m) and generate the term w_{m,n} ~ Mult(φ_{z_{m,n}});
where N_m is the number of terms contained in the m-th document, and α, β, and ξ are parameters of the probability distributions.
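A minimal sketch of the generative process in steps 2.1-2.4, using symmetric Dirichlet priors; the function name and the choice of numpy are assumptions made for illustration.

```python
import numpy as np

def generate_corpus(K, M, V, alpha, beta, xi, seed=0):
    """Sample a toy corpus from the LDA generative process (steps 2.1-2.4).

    K topics, M documents, vocabulary of V terms; alpha and beta are the
    Dirichlet hyper-parameters, xi is the Poisson mean document length.
    """
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(V, beta), size=K)     # 2.1: phi_k ~ Dir(beta)
    theta = rng.dirichlet(np.full(K, alpha), size=M)  # 2.2: theta_m ~ Dir(alpha)
    docs = []
    for m in range(M):
        n_m = max(1, rng.poisson(xi))                 # 2.3: N_m ~ Poiss(xi)
        z = rng.choice(K, size=n_m, p=theta[m])       # 2.4: z_mn ~ Mult(theta_m)
        w = [int(rng.choice(V, p=phi[k])) for k in z] #      w_mn ~ Mult(phi_z)
        docs.append(w)
    return docs, theta, phi

docs, theta, phi = generate_corpus(K=3, M=5, V=20, alpha=0.5, beta=0.1, xi=8)
```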
Following this generation process, the Dirichlet distribution is applied to the topic probability distribution of each document and the term probability distribution of each topic, generating the joint probability distribution based on the hyper-parameters:

p(w_m, z_m, θ_m, Φ | α, β) = p(θ_m | α) · p(Φ | β) · Π_{n=1}^{N_m} p(z_{m,n} | θ_m) · p(w_{m,n} | φ_{z_{m,n}})   (1)

where w_m is the set of all terms in the m-th document, z_m is the set of topics corresponding to all terms in the m-th document, θ_m is the topic probability distribution of the m-th document, Φ is the set of term probability distributions under all topics, α and β are hyper-parameters of the Dirichlet distribution, w_{m,n} is the n-th term of the m-th document, z_{m,n} is the topic corresponding to the n-th term in the m-th document, and N_m is the number of terms contained in the m-th document.
3) Parameter estimation with Gibbs sampling:
The Gibbs sampling algorithm estimates the parameters θ and φ from the topic variable z. To apply the algorithm to an LDA model, the term set, the prior Dirichlet distribution parameters α and β, and the topic number K must be known; the algorithm finally yields the parameters θ and φ to be estimated, where θ is the probability distribution of topics given a document, computed by equation (2), and φ is the probability distribution of terms given a topic, computed by equation (3):

θ_{m,k} = (n_m^{(k)} + α_k) / Σ_{k=1}^{K} (n_m^{(k)} + α_k)   (2)

φ_{k,t} = (n_k^{(t)} + β_t) / Σ_{t=1}^{V} (n_k^{(t)} + β_t)   (3)

where θ_{m,k} is the probability that the topic is k given document m; n_m^{(k)} is the number of times topic k appears in document m; α = (α_1, α_2, ..., α_K) is the hyper-parameter of the K-dimensional Dirichlet distribution, each α_k a positive real number reflecting the prior belief about parameter θ_m; K is the number of topics; φ_{k,t} is the probability that the term is t given topic k; n_k^{(t)} is the number of times term t appears in topic k; β = (β_1, β_2, ..., β_V) is the hyper-parameter of the V-dimensional Dirichlet distribution, each β_t a positive real number reflecting the prior belief about parameter φ_k; and V is the number of terms in the vocabulary. The specific Gibbs sampling algorithm is as follows:
3.1) initialize the global variables n_k^{(t)}, n_m^{(k)}, n_k, and n_m, where n_k^{(t)} is the number of times term t appears in topic k, n_m^{(k)} is the number of times topic k appears in document m, n_k is the sum of n_k^{(t)} over t, and n_m is the sum of n_m^{(k)} over k;
3.2) for each term n ∈ [1, N_m] of each document m ∈ [1, M], sample a topic z_{m,n} = k ~ Mult(1/K) and increment the global variables n_k^{(t)}, n_m^{(k)}, n_k, and n_m accordingly;
3.3) jump back to step 3.2 until all documents have been traversed; after the traversal is finished, jump to step 3.4 to start iterating;
3.4) for each term n ∈ [1, N_m] of each document m ∈ [1, M], decrement the global variables n_k^{(t)}, n_m^{(k)}, n_k, and n_m for the current topic assignment, then sample a new topic from the Gibbs sampling formula below, and then increment n_k^{(t)}, n_m^{(k)}, n_k, and n_m for the new assignment;
3.5) jump back to step 3.4 until the number of iterations I is reached.
Here the formula mentioned in step 3.4,

p(z_i = k | z_{¬i}, w) ∝ (n_{k,¬i}^{(t)} + β_t) / (Σ_{t=1}^{V} n_{k,¬i}^{(t)} + β_t) · (n_{m,¬i}^{(k)} + α_k),

is the Gibbs sampling formula of the LDA model, where the subscript ¬i denotes counts that exclude the current assignment of term i.
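Steps 3.1-3.5 can be sketched as a minimal collapsed Gibbs sampler. The code below is an illustrative reading of the algorithm with assumed names and symmetric scalar hyper-parameters; it uses the standard LDA full conditional for step 3.4 and is not the patented implementation itself.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha, beta, iters=20, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (steps 3.1-3.5).

    docs: list of term-id lists. Returns the count matrices n_mk, n_kt,
    from which theta and phi follow via equations (2) and (3).
    """
    rng = np.random.default_rng(seed)
    M = len(docs)
    n_mk = np.zeros((M, K))  # n_m^(k): topic counts per document
    n_kt = np.zeros((K, V))  # n_k^(t): term counts per topic
    n_k = np.zeros(K)        # row sums of n_kt
    z = []
    for m, doc in enumerate(docs):          # 3.1-3.3: random initialization
        zm = rng.integers(K, size=len(doc))
        for t, k in zip(doc, zm):
            n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1
        z.append(zm)
    for _ in range(iters):                  # 3.4-3.5: resample each assignment
        for m, doc in enumerate(docs):
            for n, t in enumerate(doc):
                k = z[m][n]                 # remove the current assignment
                n_mk[m, k] -= 1; n_kt[k, t] -= 1; n_k[k] -= 1
                # full conditional: (n_kt + beta)/(n_k + V*beta) * (n_mk + alpha)
                p = (n_kt[:, t] + beta) / (n_k + V * beta) * (n_mk[m] + alpha)
                k = rng.choice(K, p=p / p.sum())
                z[m][n] = k                 # add the new assignment back
                n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1
    return n_mk, n_kt

docs = [[0, 0, 1], [2, 3, 3], [0, 1, 1]]
n_mk, n_kt = gibbs_lda(docs, K=2, V=4, alpha=0.5, beta=0.5)
```

The total counts are conserved across iterations: every term occurrence is always assigned to exactly one topic.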
4) From the obtained parameters, the probability distribution θ_m of topics given a document has the same practical meaning in the LDA-T model and the LDA-F model: θ_m gives the probability distribution over communities for a given user, from which the communities represented in the form of probability distributions are obtained.
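One simple way to turn θ_m into concrete community memberships is a hard assignment of each user to its highest-probability community. This argmax choice is an assumption for illustration, since the patent leaves the communities in probability-distribution form.

```python
def assign_communities(theta, user_ids):
    """Hard community assignment: each user goes to the community (topic)
    with the highest probability in its row of theta."""
    communities = {}
    for uid, row in zip(user_ids, theta):
        k = max(range(len(row)), key=lambda j: row[j])
        communities.setdefault(k, []).append(uid)
    return communities

# Toy theta: rows are per-user community distributions.
theta = [[0.7, 0.3], [0.1, 0.9], [0.6, 0.4]]
groups = assign_communities(theta, ["u1", "u2", "u3"])
```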

Claims (6)

1. An OSN community discovery method based on an LDA topic model, characterized in that OSN community discovery is performed using the relationships between users and their friends in an online social network together with the text spontaneously posted by the users, comprising the following steps:
1) preprocessing the data set: performing word segmentation, stop-word removal, noise removal, and similar preprocessing on the original users' microblog documents, making the user relationships bidirectional in the followers data set of the document recording user relationships, and eliminating users without friends;
2) building the LDA topic models from the established community elements: a topic model LDA-T built on the semantic similarity of microblog content within a community, and a topic model LDA-F built on the closeness of topological connections within a community, wherein in LDA-T the term set is the set of terms in all users' posts, the document set is the set of all users' posts, and a topic is a community, and in LDA-F the term set is the set of all friends of the users, the document set is the set of all users, and a topic is a community;
3) for the models LDA-T and LDA-F obtained in step 2, applying the Dirichlet distribution to the topic probability distribution of each document and the term probability distribution of each topic, and generating the joint probability distribution p(w_m, z_m, θ_m, Φ | α, β) based on the hyper-parameters, where α and β are hyper-parameters of the Dirichlet distribution, w_m is the set of all terms in the m-th document, z_m is the set of topics corresponding to all terms in the m-th document, θ_m is the topic probability distribution of the m-th document, and Φ is the set of term probability distributions under all topics;
4) from the joint probability distribution obtained in step 3, estimating the probability distribution θ_m of topics given a document and the probability distribution φ_k of terms given a topic using a Gibbs sampling algorithm;
5) acquiring the communities from the parameters obtained in step 4.
2. The OSN community discovery method based on the LDA topic model according to claim 1, characterized in that the noise removed in step 1 comprises URLs, punctuation marks, modal particles, and emoticons.
3. The OSN community discovery method based on the LDA topic model according to claim 1, characterized in that the generation process and parameters of a document in the LDA model of step 2 are defined as follows:
1) for each topic k ∈ [1, K], sample the term probability distribution of topic k: φ_k ~ Dir(β);
2) for each document m ∈ [1, M], sample the topic probability distribution of document m: θ_m ~ Dir(α);
3) for each document m ∈ [1, M], sample the length of document m: N_m ~ Poiss(ξ);
4) for each term n ∈ [1, N_m] in document m, select a latent topic z_{m,n} ~ Mult(θ_m) and generate the term w_{m,n} ~ Mult(φ_{z_{m,n}});
where N_m is the number of terms contained in the m-th document, K is the number of topics, M is the number of documents, and α, β, and ξ are parameters of the probability distributions.
4. The OSN community discovery method based on the LDA topic model according to claim 3, characterized in that the joint probability distribution generated in step 3 is:

p(w_m, z_m, θ_m, Φ | α, β) = p(θ_m | α) · p(Φ | β) · Π_{n=1}^{N_m} p(z_{m,n} | θ_m) · p(w_{m,n} | φ_{z_{m,n}})   (1)

where w_m is the set of all terms in the m-th document, z_m is the set of topics corresponding to all terms in the m-th document, θ_m is the topic probability distribution of the m-th document, Φ is the set of term probability distributions under all topics, α and β are hyper-parameters of the Dirichlet distribution, w_{m,n} is the n-th term of the m-th document, z_{m,n} is the topic corresponding to the n-th term in the m-th document, and N_m is the number of terms contained in the m-th document.
5. The OSN community discovery method based on the LDA topic model according to claim 4, characterized in that the probability distribution of topics given a document in step 4 is computed as:

θ_{m,k} = (n_m^{(k)} + α_k) / Σ_{k=1}^{K} (n_m^{(k)} + α_k)   (2)

where θ_{m,k} is the probability that the topic is k given document m, n_m^{(k)} is the number of times topic k appears in document m, α = <α_1, α_2, ..., α_K> is the hyper-parameter of the K-dimensional Dirichlet distribution, each α_k a positive real number reflecting the prior belief about parameter θ_m, and K is the number of topics.
6. The OSN community discovery method based on the LDA topic model according to claim 4, characterized in that the probability distribution of terms given a topic in step 4 is computed as:

φ_{k,t} = (n_k^{(t)} + β_t) / Σ_{t=1}^{V} (n_k^{(t)} + β_t)   (3)

where φ_{k,t} is the probability that the term is t given topic k, n_k^{(t)} is the number of times term t appears in topic k, β = <β_1, β_2, ..., β_V> is the hyper-parameter of the V-dimensional Dirichlet distribution, each β_t a positive real number reflecting the prior belief about parameter φ_k, and V is the number of terms in the vocabulary.
CN201510611455.1A 2015-09-23 2015-09-23 OSN community discovery method based on LDA Theme model Pending CN105302866A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510611455.1A CN105302866A (en) 2015-09-23 2015-09-23 OSN community discovery method based on LDA Theme model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510611455.1A CN105302866A (en) 2015-09-23 2015-09-23 OSN community discovery method based on LDA Theme model

Publications (1)

Publication Number Publication Date
CN105302866A 2016-02-03

Family

ID=55200136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510611455.1A Pending CN105302866A (en) 2015-09-23 2015-09-23 OSN community discovery method based on LDA Theme model

Country Status (1)

Country Link
CN (1) CN105302866A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095976A (en) * 2016-06-20 2016-11-09 杭州电子科技大学 A kind of interest Dimensional level extracting method based on microblog data supporting OLAP to apply
CN107122455A (en) * 2017-04-26 2017-09-01 中国人民解放军国防科学技术大学 A kind of network user's enhancing method for expressing based on microblogging
CN107704460A (en) * 2016-06-22 2018-02-16 北大方正集团有限公司 Customer relationship abstracting method and customer relationship extraction system
CN112487110A (en) * 2020-12-07 2021-03-12 中国船舶重工集团公司第七一六研究所 Overlapped community evolution analysis method and system based on network structure and node content
CN112632215A (en) * 2020-12-01 2021-04-09 重庆邮电大学 Community discovery method and system based on word-pair semantic topic model
CN114461879A (en) * 2022-01-21 2022-05-10 哈尔滨理工大学 Semantic social network multi-view community discovery method based on text feature integration

Citations (2)

Publication number Priority date Publication date Assignee Title
CN103488637A (en) * 2012-06-11 2014-01-01 北京大学 Method for carrying out expert search based on dynamic community mining
CN104268271A (en) * 2014-10-13 2015-01-07 北京建筑大学 Interest and network structure double-cohesion social network community discovering method

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN103488637A (en) * 2012-06-11 2014-01-01 北京大学 Method for carrying out expert search based on dynamic community mining
CN104268271A (en) * 2014-10-13 2015-01-07 北京建筑大学 Interest and network structure double-cohesion social network community discovering method

Non-Patent Citations (2)

Title
HAIZHENG ZHANG et al.: "An LDA-based Community Structure Discovery Approach for Large-Scale Social Networks", Intelligence & Security Informatics, 2007 IEEE *
吴小兰 et al.: "Research on community discovery methods combining content and link relations" (结合内容和链接关系的社区发现方法研究), 《情报理论与实践》 (Information Studies: Theory & Application) *

Cited By (8)

Publication number Priority date Publication date Assignee Title
CN106095976A (en) * 2016-06-20 2016-11-09 杭州电子科技大学 A kind of interest Dimensional level extracting method based on microblog data supporting OLAP to apply
CN106095976B (en) * 2016-06-20 2019-09-24 杭州电子科技大学 A kind of interest Dimensional level extracting method based on microblog data for supporting OLAP to apply
CN107704460A (en) * 2016-06-22 2018-02-16 北大方正集团有限公司 Customer relationship abstracting method and customer relationship extraction system
CN107122455A (en) * 2017-04-26 2017-09-01 中国人民解放军国防科学技术大学 A kind of network user's enhancing method for expressing based on microblogging
CN107122455B (en) * 2017-04-26 2019-12-31 中国人民解放军国防科学技术大学 Network user enhanced representation method based on microblog
CN112632215A (en) * 2020-12-01 2021-04-09 重庆邮电大学 Community discovery method and system based on word-pair semantic topic model
CN112487110A (en) * 2020-12-07 2021-03-12 中国船舶重工集团公司第七一六研究所 Overlapped community evolution analysis method and system based on network structure and node content
CN114461879A (en) * 2022-01-21 2022-05-10 哈尔滨理工大学 Semantic social network multi-view community discovery method based on text feature integration

Similar Documents

Publication Publication Date Title
CN107729392B (en) Text structuring method, device and system and non-volatile storage medium
CN105302866A (en) OSN community discovery method based on LDA Theme model
CN107122455B (en) Network user enhanced representation method based on microblog
CN103793501B (en) Based on the theme Combo discovering method of social networks
CN108681557B (en) Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint
CN106886580B (en) Image emotion polarity analysis method based on deep learning
CN107798043B (en) Text clustering method for long text auxiliary short text based on Dirichlet multinomial mixed model
CN109033320B (en) Bilingual news aggregation method and system
CN113051932B (en) Category detection method for network media event of semantic and knowledge expansion theme model
CN102270212A (en) User interest feature extraction method based on hidden semi-Markov model
CN108733647B (en) Word vector generation method based on Gaussian distribution
Nhlabano et al. Impact of text pre-processing on the performance of sentiment analysis models for social media data
CN112487110A (en) Overlapped community evolution analysis method and system based on network structure and node content
Khan et al. Sentiment Analysis using Support Vector Machine and Random Forest
CN114065749A (en) Text-oriented Guangdong language recognition model and training and recognition method of system
WO2016090625A1 (en) Scalable web data extraction
Chen et al. Learning user embedding representation for gender prediction
Shi et al. SRTM: A Sparse RNN-Topic Model for Discovering Bursty Topics in Big Data of Social Networks.
Shi et al. A sparse topic model for bursty topic discovery in social networks.
CN111339289B (en) Topic model inference method based on commodity comments
Yan et al. Multilayer network representation learning method based on random walk of multiple information
Mashayekhi et al. Microblog topic detection using evolutionary clustering and social network information
Jayakumar et al. Analyzing the development of complex social systems of characters in a work of literary fiction
Han et al. An effective heterogeneous information network representation learning framework
Fan et al. Topic modeling methods for short texts: A survey

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20160203