CN112632215A - Community discovery method and system based on word-pair semantic topic model - Google Patents


Info

Publication number
CN112632215A
Authority
CN
China
Prior art keywords
topic
probability distribution
short text
btm
community
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011383171.9A
Other languages
Chinese (zh)
Inventor
刘洪涛
王宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202011383171.9A
Publication of CN112632215A
Legal status: Pending

Classifications

    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/35 Clustering; Classification
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G06F40/216 Parsing using statistical methods
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06Q50/01 Social networking


Abstract

The invention claims protection for a community discovery method based on a word-pair semantic topic model, which comprises the following steps: (1) first, preprocess the acquired short-text data set (microblogs, bullet-screen comments, tweets and the like); (2) then, build a BTM-based topic model (comprising a BTM-R model and a BTM-W model) from the social relationships of users in the social network and the short text information they publish, and derive the probability distributions of the model; (3) next, perform parameter estimation with a Gibbs sampling algorithm; (4) finally, perform community discovery with the estimated parameters and a community discovery algorithm. By mining the semantic information of the short texts published by users, the method obtains a corresponding probability model, and by further introducing the semantic similarity of the text content it obtains the probability distribution of users' interests; at the same time, the social relationships of users within communities are introduced, so that closely related users in the social network are mined, achieving the goal of community discovery.

Description

Community discovery method and system based on word-pair semantic topic model
Technical Field
The invention belongs to the field of social computing, in particular to the field of community discovery, and specifically relates to a community discovery method based on a word-pair semantic topic model (Biterm Topic Model, BTM for short).
Background
Since the Internet entered daily life, and especially since mobile devices became ubiquitous, the relationships people have in real society have also been mapped into the network, forming what we call online social networks. People in the modern technological age can hardly do without social media software. Besides the friend relationships between people, such software carries information published by users; this information generally has a strict character-length limit, which brings a brand-new opportunity to community discovery work.
Traditional community discovery methods are mainly based on the network topology, i.e. the connection relations of graph theory: they divide communities by analyzing the relations among individuals, so that the discovered communities have dense internal connections while connections between different communities are sparse. Such methods do not consider the attributes of users. On social media software, the short text information published by a user implies the user's interests, browsing habits and so on, and the topic models used in natural language processing can extract and mine a user's interests from the text information the user publishes.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. A community discovery method based on a word pair semantic topic model is provided. The technical scheme of the invention is as follows:
a community discovery method based on a word pair semantic topic model comprises the following steps:
s1, preprocessing the acquired short text data set, including removing the non-text parts of the short text documents, word segmentation and stop-word removal, and processing the relationship data set in the acquired data, including user-relationship processing and removal of inactive users, so as to complete the construction of the user topological structure;
s2, building a BTM topic model according to given community labels, including a BTM-R topic model based on the community user topological structure and a BTM-W topic model built on the semantic similarity of short text content within communities; in BTM-R the document set is the set of all users, the term set is the set of attention (follow) relations among users, and a topic is a community; in BTM-W the document set is the set of short text information published by all users, the term set is the set of pairwise combinations of distinct terms in those short texts, i.e. the word-pair set, and a topic is a community;
s3, according to the models BTM-R and BTM-W obtained in S2, using Dirichlet distributions as priors for the document-topic probability distribution and the topic-term probability distribution, thereby obtaining the conditional probability distribution based on word pair b:

P(z_b = z | z_-b, B, α, β)

wherein α and β are hyper-parameters of the Dirichlet distributions, z represents the topic corresponding to a word pair, z_-b denotes the topic assignments of the whole word-pair set with word pair b removed, and B is the set of all word pairs;
s4, estimating, by a Gibbs sampling algorithm and according to the joint probability distribution obtained in S3, the global topic probability distribution θ given the short text information and the term probability distribution φ given the topic;
and S5, carrying out community discovery according to the parameters obtained in the step S4 to obtain communities.
Further, in step S1, the preprocessing of the data includes the following steps:
Preprocessing for the BTM-R model: since the friend relationship defined by the BTM-R model must be mutual attention (reciprocal following), the follow relations in the user relationship data set are made bidirectional, and users without friends are removed.
Preprocessing for the BTM-W model: the short text information published by each user is obtained from the data set, and the non-text parts, including HTML tags, non-English characters, punctuation, modal particles and loanwords, are removed; then jieba word segmentation is applied to the BTM-W corpus.
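In an example, the two preprocessing passes above can be sketched as follows. The function and variable names are illustrative (not from the patent), the stop-word list is a placeholder, and a plain whitespace split stands in for the jieba segmentation that would be used on a real Chinese corpus:

```python
import re

STOPWORDS = {"the", "a", "的", "了"}  # illustrative stop-word list, not the patent's

def clean_short_text(text):
    """BTM-W preprocessing: strip non-text parts, tokenize, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)   # remove html tags
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation and symbols
    tokens = text.split()                  # jieba.lcut(text) for Chinese in practice
    return [t for t in tokens if t.lower() not in STOPWORDS]

def mutual_follow_graph(follows):
    """BTM-R preprocessing: keep only reciprocal follow edges (mutual attention)
    and drop users left without any friend."""
    fset = set(follows)
    edges = {(u, v) for (u, v) in fset if (v, u) in fset}
    users = {u for e in edges for u in e}
    return edges, users
```

Users outside the returned `users` set are the "users without friends" that the method removes before building the topology.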
Further, the step S2 specifically includes:
2. Solving the probability distribution of the topic model:
2.1.1 For each topic z, sample the term probability distribution φ_z ~ Dir(β);
2.1.2 For the whole corpus, sample the topic probability distribution θ ~ Dir(α);
2.1.3 For each word pair, sample a random topic assignment z ~ Multi(θ);
2.1.4 For each word pair, sample the two words ω_i, ω_j ~ Multi(φ_z);
2.1.5 The joint probability of a word pair b = (ω_i, ω_j) is: P(b) = Σ_z P(z)P(ω_i|z)P(ω_j|z), where P(z) is the probability of topic z and P(ω_i|z) is the probability of ω_i under topic z.
2.1.6 The probability of the entire corpus is: P(B) = Π_(i,j) Σ_z θ_z φ_i|z φ_j|z
wherein ω_i and ω_j are the two elements of a word pair, φ_i|z is the probability of word ω_i when the topic is z, φ_j|z is the probability of word ω_j when the topic is z, and θ_z is the probability of topic z; i and j index the two words of a word pair.
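The word-pair (biterm) construction and one draw of the generative process in steps 2.1.3-2.1.4 can be sketched as below; the function names are illustrative, words are represented as vocabulary indices, and θ and φ are passed in as already-normalized probability vectors rather than being sampled from Dirichlet priors:

```python
import random
from itertools import combinations

def extract_biterms(tokens):
    """Word-pair set of one short text: all unordered pairs of distinct terms."""
    vocab = sorted(set(tokens))
    return list(combinations(vocab, 2))

def generate_biterm(theta, phi, rng):
    """One draw of steps 2.1.3-2.1.4: z ~ Multi(theta), then w_i, w_j ~ Multi(phi_z)."""
    z = rng.choices(range(len(theta)), weights=theta)[0]
    wi = rng.choices(range(len(phi[z])), weights=phi[z])[0]
    wj = rng.choices(range(len(phi[z])), weights=phi[z])[0]
    return z, (wi, wj)
```

Summing the per-pair joint probability of 2.1.5 over every biterm produced by `extract_biterms` gives the corpus probability of 2.1.6.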
Further, the generated joint probability distribution is:
3.1: p(ω_m, z_m, θ_m, Φ | α, β) = p(θ_m|α) · p(Φ|β) · Π_{n=1..N_m} p(z_m,n|θ_m) · p(ω_m,n|z_m,n, Φ)
wherein ω_m represents the set of all terms in the m-th short text, z_m represents the set of topics corresponding to all terms in the m-th short text, θ_m is the topic probability distribution of the m-th short text, Φ represents the set of term probability distributions under all topics, ω_m,n is the n-th term in the m-th short text, z_m,n is the topic corresponding to the n-th term in the m-th short text, and N_m is the total number of terms contained in the m-th document; p(ω_m,n|z_m,n, Φ) is the probability of the n-th term of the m-th short text given that it belongs to topic z_m,n, and p(z_m,n|θ_m) is the probability of that topic under the topic distribution of the m-th short text. p(θ_m|α) is the probability of the topic distribution of the m-th short text under the Dirichlet hyper-parameter α, and p(Φ|β) is the probability of the term distributions under all topics given the Dirichlet hyper-parameter β. α and β are the hyper-parameters of the Dirichlet distributions.
The Gibbs sampling formula is:
3.2: P(z_b = z | z_-b, B) ∝ (n_z + α) · (n_ωi|z + β)(n_ωj|z + β) / (Σ_ω n_ω|z + Mβ)²
wherein P(z_b = z | z_-b, B) is the conditional probability of assigning word pair b to topic z given z_-b, the topic assignments of all other word pairs; B is the set of all word pairs; n_z is the number of times a word pair is assigned to topic z, n_ω|z is the number of times word ω is assigned to topic z, and M is the number of terms in the vocabulary.
Further, step S4 specifically includes the following steps:
According to the Gibbs sampling formula 3.2, the topic probability distribution is calculated with the following formula:
4.1: θ_m,z = (n_m,z + α) / (|B| + Kα)
wherein θ_m,z represents the probability that short text m has topic z, n_m,z is the number of times topic z appears in short text m, and α = (α_1, α_2, …, α_K) is the hyper-parameter of the K-dimensional Dirichlet distribution, each α_k a positive real number representing prior knowledge about the parameter θ_m. The hyper-parameters α and β could be estimated with an EM algorithm, but that is not required here; empirically, α is usually set to 50/K (K being the number of topics) and β to 0.01. |B| is the total number of word pairs, and K is the total number of topics in short text m;
the probability distribution of the subject term is calculated in the following way:
4.2:
Figure BDA0002810168880000043
wherein phi isω|zRepresenting the probability of a term being ω, n, given a topic zω|zIs the number of times the word ω is assigned to the subject z, β ═ β12,…,βz) Is a hyperparameter, beta, of the z-dimensional Dirichlet distributionkFor positive real numbers, represents the parameter phizAnd M is the number of terms in the subject z, and the probability distribution parameters of the subject words are obtained.
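Given the counters maintained by a Gibbs sampler, formulas 4.1 and 4.2 reduce to two smoothed normalisations. A minimal sketch with assumed names (`n_z` and `n_wz` are the topic and word-per-topic counts, |B| = Σ_z n_z):

```python
def estimate_theta_phi(n_z, n_wz, alpha=0.1, beta=0.01):
    """Point estimates following formulas 4.1 / 4.2:
    theta_z = (n_z + α) / (|B| + Kα),  phi_w|z = (n_w|z + β) / (Σ_w n_w|z + Vβ)."""
    K, V = len(n_z), len(n_wz[0])
    B = sum(n_z)                        # total number of word pairs
    theta = [(n_z[k] + alpha) / (B + K * alpha) for k in range(K)]
    phi = []
    for k in range(K):
        total = sum(n_wz[k]) + V * beta
        phi.append([(n_wz[k][w] + beta) / total for w in range(V)])
    return theta, phi
```

Both θ and every row of φ sum to 1, as probability distributions must.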
Further, step S5 performs community discovery according to the parameters obtained in step S4 to obtain communities, specifically:
According to the obtained parameters θ_m,z and φ_ω|z, the topic probability distribution θ_m of a given short text is known; from the actual meaning of θ_m in the BTM-R model and the BTM-W model, θ_m is the probability distribution of a given user over communities, so that communities represented in the form of probability distributions are obtained.
A word-pair semantic topic model based community discovery system comprising:
A preprocessing module: configured to preprocess the acquired short text data set, including removing the non-text parts of the short text documents, word segmentation and stop-word removal, and to process the relationship data set in the acquired data, including user-relationship processing and removal of inactive users, so as to complete the construction of the user topological structure;
A BTM topic-model building module: configured to build a BTM topic model according to given community labels, including a BTM-R topic model based on the community user topological structure and a BTM-W topic model built on the semantic similarity of short text content within communities; in BTM-R the document set is the set of all users, the term set is the set of attention relations among users, and a topic is a community; in BTM-W the document set is the set of short text information published by all users, the term set is the set of pairwise combinations of distinct terms in those short texts, i.e. the word-pair set, and a topic is a community;
A joint probability distribution calculation module: according to the obtained models BTM-R and BTM-W, Dirichlet distributions are used as priors for the document-topic probability distribution and the topic-term probability distribution, thereby obtaining the conditional probability distribution based on word pair b: P(z_b = z | z_-b, B, α, β), wherein α and β are hyper-parameters of the Dirichlet distributions, z represents the topic corresponding to a word pair, z_-b denotes the topic assignments of the whole word-pair set with word pair b removed, and B is the set of all word pairs;
A community discovery module: configured to estimate, by a Gibbs sampling algorithm and according to the obtained joint probability distribution, the global topic probability distribution θ given the short text information and the term probability distribution φ given the topic, and to perform community discovery according to the obtained parameters to obtain communities.
The invention has the following advantages and beneficial effects:
the invention provides a community discovery method based on a word pair semantic topic model, which is characterized in that a topic model is obtained by introducing short text information published by a user on social media software according to an original community discovery method, the probability distribution condition of the interests and hobbies of the user is effectively mined, and the community discovery is carried out according to the interests of the user on the topological structure of the original social network. In contrast, the existing methods are all single, and many are methods based on text processing unilaterally or based on network topology, and the method combines the two methods to obtain a better effect.
Drawings
FIG. 1 is a diagram of a BTM topic model of the present invention in accordance with a preferred embodiment;
FIG. 2 is a flow chart of community discovery in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the invention provides a community discovery method based on a word pair semantic Topic Model (BTM for short), and FIG. 2 is a process of community discovery.
The invention provides a community discovery method of a short text-based topic model BTM, which is characterized in that a community discovery process is carried out by utilizing the attention relationship among users in a social network and short text information issued by the users, and comprises the following steps:
and S1, preprocessing the acquired short text data set, including preprocessing operations of removing non-text parts, word segmentation, stop word removal and the like of the short text document, and processing the relation data set in the acquired data set, including user relation processing and elimination of inactive users (users who do not send out a text for a long time or have no friends), so as to complete the construction of a user topological structure.
In an example, the sub-steps of implementing S1 are as follows:
1. Preprocessing of the data set:
Preprocessing for the BTM-R model: since the friend relationship defined by the BTM-R model must be mutual attention, the follow relations in the user relationship data set are made bidirectional, and users without friends are removed.
Preprocessing for the BTM-W model: the short text information published by each user is obtained from the data set, and the non-text parts, including HTML tags, non-English characters, punctuation, modal particles, loanwords and the like, are removed; then jieba word segmentation is applied to the BTM-W corpus.
S2, constructing a BTM topic model according to the given community label, wherein the BTM topic model comprises a BTM-R topic model based on the topology structure of community users and a topic model BTM-W constructed based on the semantic similarity of short text information content in the community. The BTM-R document set is a set formed by all users, the term set is a set formed by concern relations among the users, and the theme is a set of communities, in addition, the BTM-W document set is a set formed by short text information issued by all the users, the term set is a set formed by combining every two terms (word pairs) in the short text information issued by the users, and the theme is a set of communities;
s3, according to the models BTM-R and BTM-W obtained in S2, using Dirichlet distributions as priors for the document-topic probability distribution and the topic-term probability distribution, thereby obtaining the conditional probability distribution based on word pair b:

P(z_b = z | z_-b, B, α, β)

where α and β are hyper-parameters of the Dirichlet distributions, z represents the topic corresponding to a word pair, z_-b denotes the topic assignments of the whole word-pair set with word pair b removed, and B is the set of all word pairs;
in this example, the sub-steps of S2 and S3 are specifically implemented as follows:
2. Solving the probability distribution of the topic model:
2.1.1 For each topic z, sample the term probability distribution φ_z ~ Dir(β);
2.1.2 For the whole corpus, sample the topic probability distribution θ ~ Dir(α);
2.1.3 For each word pair, sample a random topic assignment z ~ Multi(θ);
2.1.4 For each word pair, sample the two words ω_i, ω_j ~ Multi(φ_z);
2.1.5 The joint probability of a word pair is: P(b) = Σ_z P(z)P(ω_i|z)P(ω_j|z);
2.1.6 The probability of the entire corpus is: P(B) = Π_(i,j) Σ_z θ_z φ_i|z φ_j|z
wherein ω_i and ω_j are the two elements of a word pair.
3. The generated joint probability distribution is:
3.1: p(ω_m, z_m, θ_m, Φ | α, β) = p(θ_m|α) · p(Φ|β) · Π_{n=1..N_m} p(z_m,n|θ_m) · p(ω_m,n|z_m,n, Φ)
wherein ω_m represents the set of all terms in the m-th short text, z_m the corresponding set of topics, θ_m the topic probability distribution of the m-th short text, Φ the set of term probability distributions under all topics, ω_m,n the n-th term in the m-th short text, z_m,n its corresponding topic, and N_m the total number of terms contained in the m-th document.
The Gibbs sampling formula is:
3.2: P(z_b = z | z_-b, B) ∝ (n_z + α) · (n_ωi|z + β)(n_ωj|z + β) / (Σ_ω n_ω|z + Mβ)²
wherein n_z is the number of times a word pair is assigned to topic z, n_ω|z is the number of times word ω is assigned to topic z, and M is the number of terms in the vocabulary. The joint probability distribution is calculated first;
s4, estimating the probability distribution theta of the global theme when the short text information is given and the probability distribution phi of the terms when the theme is given by using a Gibbs sampling algorithm according to the joint probability distribution obtained in the S3;
in this example, the sub-steps of implementing S4 are as follows:
According to the Gibbs sampling formula 3.2, the topic probability distribution is calculated with the following formula:
4.1: θ_m,z = (n_m,z + α) / (|B| + Kα)
wherein θ_m,z represents the probability that short text m has topic z, n_m,z is the number of times topic z appears in short text m, and α = (α_1, α_2, …, α_K) is the hyper-parameter of the K-dimensional Dirichlet distribution, each α_k a positive real number representing prior knowledge about the parameter θ_m; |B| is the total number of word pairs, and K is the total number of topics in short text m.
The probability distribution of the topic terms is calculated as follows:
4.2: φ_ω|z = (n_ω|z + β) / (Σ_ω n_ω|z + Mβ)
wherein φ_ω|z represents the probability of term ω given topic z, n_ω|z is the number of times word ω is assigned to topic z, and β = (β_1, β_2, …, β_M) is the hyper-parameter of the M-dimensional Dirichlet distribution, each β_k a positive real number representing prior knowledge about the parameter φ_z; M is the number of terms in the vocabulary. The probability distribution parameters of the topics and of the topic terms are thus obtained;
and S5, carrying out community discovery according to the parameters obtained in the step 4 to obtain communities.
In this example, the sub-steps of implementing S5 are as follows:
According to the obtained parameters, the topic probability distribution θ_m of the short text information is known; from the actual meaning of θ_m in the BTM-R model and the BTM-W model, θ_m is the probability distribution of a given user over communities, thereby obtaining communities represented in the form of probability distributions.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (7)

1. A community discovery method based on a word pair semantic topic model is characterized by comprising the following steps:
s1: preprocessing the acquired short text data set, including removing the non-text parts of the short text documents, word segmentation and stop-word removal, and processing the relationship data set in the acquired data, including user-relationship processing and removal of inactive users, so as to complete the construction of the user topological structure;
s2: building a BTM topic model according to given community labels, including a BTM-R topic model based on the community user topological structure and a BTM-W topic model built on the semantic similarity of short text content within communities, wherein in BTM-R the document set is the set of all users, the term set is the set of attention relations among users, and a topic is a community; in BTM-W the document set is the set of short text information published by all users, the term set is the set of pairwise combinations of distinct terms in those short texts, i.e. the word-pair set, and a topic is a community;
s3: according to the models BTM-R and BTM-W obtained in S2, using Dirichlet distributions as priors for the document-topic probability distribution and the topic-term probability distribution, thereby obtaining the conditional probability distribution based on word pair b:

P(z_b = z | z_-b, B, α, β)

wherein α and β are hyper-parameters of the Dirichlet distributions, z represents the topic corresponding to a word pair, z_-b denotes the topic assignments of the whole word-pair set with word pair b removed, and B is the set of all word pairs;
s4: estimating the probability distribution theta of the global theme when the short text information is given and the probability distribution phi of the terms when the theme is given by using a Gibbs sampling algorithm according to the joint probability distribution obtained in the S3;
s5: and performing community discovery according to the parameters obtained in the step S4 to acquire communities.
2. The community discovery method based on a word-pair semantic topic model according to claim 1, wherein the data preprocessing in step S1 comprises the following steps:
preprocessing for the BTM-R model:
since the friend relationships defined by the BTM-R model must be mutual follows, the follow relationships in the user relation data set are filtered bidirectionally, keeping only reciprocal ones, and users without friends are removed;
preprocessing for the BTM-W model:
the short text messages published by each user are obtained from the data set, and non-text parts, including HTML tags, non-English characters and punctuation, modal particles, and loanwords, are removed from the short text; the jieba word segmenter is then applied to the BTM-W corpus.
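A minimal sketch of the BTM-R relationship preprocessing above — keeping only reciprocal follow edges and dropping friendless users; the input format (a mapping from user to the set of users they follow) is an assumption for illustration:

```python
def mutual_friends(follows):
    """follows maps each user to the set of users they follow.
    Keep only mutual (reciprocal) follow relationships, and drop
    users who are left without any friends."""
    friends = {}
    for u, outs in follows.items():
        mutual = {v for v in outs if u in follows.get(v, set())}
        if mutual:  # users without friends are removed
            friends[u] = mutual
    return friends

# "a" and "b" follow each other; "c" follows nobody back.
graph = mutual_friends({"a": {"b", "c"}, "b": {"a"}, "c": set()})
```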
3. The community discovery method based on a word-pair semantic topic model according to claim 2, wherein step S2 specifically comprises:
2. solving the probability distributions of the topic model:
2.1.1: for each topic, sample the term probability distribution of topic z as φ_z ~ Dir(β);
2.1.2: for the whole document collection, sample the topic probability distribution θ ~ Dir(α);
2.1.3: for each word pair, perform a random topic assignment z ~ Multi(θ);
2.1.4: for each word pair, draw its two words ω_i, ω_j ~ Multi(φ_z);
2.1.5: the joint probability of a word pair is P(b) = Σ_z P(z) P(ω_i|z) P(ω_j|z), where P(z) is the probability of topic z and P(ω_i|z) is the probability of ω_i given that it belongs to topic z;
2.1.6: the probability of the entire corpus is P(B) = Π_{(i,j)} Σ_z θ_z φ_{i|z} φ_{j|z},
wherein ω_i and ω_j are the two elements of a word pair, φ_{i|z} is the probability of word i when the topic is z, φ_{j|z} is the probability of word j when the topic is z, θ_z is the probability of topic z, and i and j index the two words of the pair.
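Steps 2.1.1–2.1.4 describe the BTM generative process; the sketch below illustrates it using only the standard library (Dirichlet draws built from normalized gamma variates) with assumed toy dimensions:

```python
import random

def dirichlet(alphas, rng):
    """Dirichlet sample via normalized gamma variates."""
    xs = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(xs)
    return [x / s for x in xs]

def btm_generate(K, V, n_pairs, alpha=1.0, beta=0.01, seed=7):
    rng = random.Random(seed)
    phi = [dirichlet([beta] * V, rng) for _ in range(K)]     # 2.1.1: phi_z ~ Dir(beta)
    theta = dirichlet([alpha] * K, rng)                      # 2.1.2: theta ~ Dir(alpha)
    biterms = []
    for _ in range(n_pairs):
        z = rng.choices(range(K), weights=theta)[0]          # 2.1.3: z ~ Multi(theta)
        wi, wj = rng.choices(range(V), weights=phi[z], k=2)  # 2.1.4: w_i, w_j ~ Multi(phi_z)
        biterms.append((z, wi, wj))
    return theta, phi, biterms

theta, phi, biterms = btm_generate(K=3, V=10, n_pairs=5)
```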
4. The method of claim 3, wherein the generated joint probability distribution is:
3.1:
p(ω_m, z_m, θ_m, Φ | α, β) = p(θ_m | α) p(Φ | β) Π_{n=1}^{N_m} p(z_{m,n} | θ_m) p(ω_{m,n} | Φ, z_{m,n})
wherein ω_m is the set of all terms in the mth short text message, z_m is the set of topics corresponding to all terms in the mth short text message, θ_m is the topic probability distribution of the mth short text message, Φ is the set of term probability distributions under all topics, ω_{m,n} is the nth term in the mth short text message, z_{m,n} is the topic corresponding to the nth term of the mth short text message, and N_m is the total number of terms contained in the mth document; p(ω_{m,n} | Φ, z_{m,n}) is the probability of the nth term in the mth short text message given that it belongs to topic z_{m,n}, and p(z_{m,n} | θ_m) is the probability of topic z_{m,n} under the topic probability distribution of the mth short text message; p(θ_m | α) is the probability of the topic distribution of the mth short text message given the Dirichlet hyperparameter α, and p(Φ | β) is the probability of the term distributions under all topics given the Dirichlet hyperparameter β; α and β are the hyperparameters of the Dirichlet distributions;
the Gibbs sampling formula is:
3.2:
P(z_b = z | z_{-b}, B) ∝ (n_z + α) · (n_{ω_i|z} + β)(n_{ω_j|z} + β) / (Σ_ω n_{ω|z} + Mβ)²
wherein P(z_b = z | z_{-b}, B) is the conditional probability, based on word pair b, of assigning b to topic z, z_{-b} denotes the topic assignments of all other word pairs, B is the set of all word pairs, n_z is the number of times a word pair is assigned to topic z, n_{ω|z} is the number of times the word ω is assigned to topic z, and M is the number of terms under topic z.
5. The community discovery method based on a word-pair semantic topic model according to claim 4, wherein step S4 specifically comprises the following steps:
based on the Gibbs sampling formula 3.2, the topic probability distribution is computed as:
4.1:
θ_{m,z} = (n_{m,z} + α) / (|B| + Kα)
wherein θ_{m,z} is the probability that short text message m has topic z, n_{m,z} is the number of times topic z appears in short text message m, and α = (α_1, α_2, …, α_m) is the hyperparameter of the m-dimensional Dirichlet distribution, with each α_k a positive real number; the hyperparameters α and β could be estimated by the EM algorithm, but this is not required here: α is usually set to 50/K, where K is the number of topics, and β to 0.01; these empirical values represent prior knowledge about the parameter θ_m; |B| is the total number of word pairs and K is the total number of topics in short text message m;
the topic-term probability distribution is computed as:
4.2:
φ_{ω|z} = (n_{ω|z} + β) / (Σ_ω n_{ω|z} + Mβ)
wherein φ_{ω|z} is the probability of term ω given topic z, n_{ω|z} is the number of times the word ω is assigned to topic z, β = (β_1, β_2, …, β_z) is the hyperparameter of the z-dimensional Dirichlet distribution, with each β_k a positive real number representing prior knowledge about the parameter φ_z, and M is the number of terms under topic z; the probability distribution parameters of the topic words are thus obtained.
6. The community discovery method based on a word-pair semantic topic model according to claim 5, wherein step S5 performs community discovery according to the parameters obtained in step S4 to obtain the communities, specifically:
from the obtained parameters θ_{m,z} and φ_{ω|z}, the topic probability distribution θ_m of a given short text message is known; in the BTM-R and BTM-W models, the actual meaning of θ_m is the probability distribution of a given user over communities, whereby the communities are obtained in the form of probability distributions.
7. A system for community discovery based on a word-pair semantic topic model, comprising:
a preprocessing module: used for preprocessing an acquired short text data set, including removing non-text parts, performing word segmentation, and removing stop words from the short text documents, and for processing the relation data set in the acquired data, including processing user relationships and removing inactive users, so as to complete construction of the user topological structure;
a BTM topic model building module: used for constructing BTM topic models according to given community labels, including a BTM-R topic model based on the community user topological structure and a BTM-W topic model built on the semantic similarity of short text content within a community; in BTM-R, the document set is the set of all users, the term set is the set of follow relationships among users, and a topic is a community; in BTM-W, the document set is the set of short text messages published by all users, the term set is the set of pairwise combinations of distinct terms within a user's short text messages, i.e. the word pairs, and a topic is likewise a community;
a joint probability distribution calculation module: used for placing, according to the obtained models BTM-R and BTM-W, Dirichlet priors on the document-topic probability distribution and the topic-term probability distribution, thereby obtaining, for a word pair b, the probability distribution
P(z_b = z | z_{-b}, B, α, β)
wherein α and β are the hyperparameters of the Dirichlet distributions, z denotes the topic corresponding to a word pair, z_{-b} denotes the topic assignments of the whole word-pair set with the word pair b removed, and B is the set of all word pairs;
a community discovery module: used for estimating, by applying the Gibbs sampling algorithm to the obtained joint probability distribution, the topic probability distribution θ of a given short text message and the term probability distribution φ of a given topic, and for performing community discovery according to the obtained parameters to obtain the communities.
CN202011383171.9A 2020-12-01 2020-12-01 Community discovery method and system based on word-pair semantic topic model Pending CN112632215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011383171.9A CN112632215A (en) 2020-12-01 2020-12-01 Community discovery method and system based on word-pair semantic topic model


Publications (1)

Publication Number Publication Date
CN112632215A true CN112632215A (en) 2021-04-09

Family

ID=75307157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011383171.9A Pending CN112632215A (en) 2020-12-01 2020-12-01 Community discovery method and system based on word-pair semantic topic model

Country Status (1)

Country Link
CN (1) CN112632215A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102675A (en) * 2013-04-15 2014-10-15 中国人民大学 Method for detecting blogger interest community based on user relationship
US20150134402A1 (en) * 2013-11-11 2015-05-14 Yahoo! Inc. System and method for network-oblivious community detection
CN105302866A (en) * 2015-09-23 2016-02-03 东南大学 OSN community discovery method based on LDA Theme model


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wang Yamin et al., "Microblog public-opinion hotspot discovery based on BTM", 《情报杂志》 (Journal of Intelligence), vol. 35, no. 11, 30 November 2016, pages 119-124 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361270A (en) * 2021-05-25 2021-09-07 浙江工业大学 Short text optimization topic model method oriented to service data clustering
CN113378558A (en) * 2021-05-25 2021-09-10 浙江工业大学 RESTful API document theme distribution extraction method based on representative word pairs
CN113378558B (en) * 2021-05-25 2024-04-16 浙江工业大学 RESTful API document theme distribution extraction method based on representative word pairs
CN113361270B (en) * 2021-05-25 2024-05-10 浙江工业大学 Short text optimization topic model method for service data clustering
CN114066669A (en) * 2021-10-28 2022-02-18 华南理工大学 Manufacturing service discovery method for cloud manufacturing
CN114066669B (en) * 2021-10-28 2024-05-03 华南理工大学 Cloud manufacturing-oriented manufacturing service discovery method
CN114491013A (en) * 2021-12-09 2022-05-13 重庆邮电大学 Topic mining method, storage medium and system for merging syntactic structure information
CN116432639A (en) * 2023-05-31 2023-07-14 华东交通大学 News element word mining method based on improved BTM topic model
CN116432639B (en) * 2023-05-31 2023-08-25 华东交通大学 News element word mining method based on improved BTM topic model

Similar Documents

Publication Publication Date Title
Sharma et al. Sentimental analysis of twitter data with respect to general elections in India
Salloum et al. Analysis and classification of Arabic newspapers’ Facebook pages using text mining techniques
Nguyen et al. Real-time event detection for online behavioral analysis of big social data
CN112632215A (en) Community discovery method and system based on word-pair semantic topic model
Beskow et al. Its all in a name: detecting and labeling bots by their name
Zhao et al. Cyberbullying detection based on semantic-enhanced marginalized denoising auto-encoder
Montejo-Ráez et al. Ranked wordnet graph for sentiment polarity classification in twitter
Liu Sentiment analysis: A multi-faceted problem
Khan et al. US Based COVID-19 tweets sentiment analysis using textblob and supervised machine learning algorithms
CN106354818B (en) Social media-based dynamic user attribute extraction method
Anwar et al. A social graph based text mining framework for chat log investigation
US11010687B2 (en) Detecting abusive language using character N-gram features
US9407589B2 (en) System and method for following topics in an electronic textual conversation
WO2021184640A1 (en) Sparse matrix-based product pushing method and apparatus, computer device, and medium
CN106569989A (en) De-weighting method and apparatus for short text
Kim et al. TwitterTrends: a spatio-temporal trend detection and related keywords recommendation scheme
CN111125305A (en) Hot topic determination method and device, storage medium and electronic equipment
CN114298007A (en) Text similarity determination method, device, equipment and medium
Sultana et al. Authorship recognition of tweets: A comparison between social behavior and linguistic profiles
US11061975B2 (en) Cognitive content suggestive sharing and display decay
Phuvipadawat et al. Detecting a multi-level content similarity from microblogs based on community structures and named entities
Sharma et al. Fake news detection using deep learning
Shi et al. SRTM: A Sparse RNN-Topic Model for Discovering Bursty Topics in Big Data of Social Networks.
Richardson et al. Topic models: A tutorial with R
Deshpande et al. A survey on: classification of Twitter data using sentiment analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210409
