CN112632215A - Community discovery method and system based on word-pair semantic topic model - Google Patents


Info

Publication number
CN112632215A
Authority
CN
China
Prior art keywords
topic
probability distribution
short text
btm
community
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011383171.9A
Other languages
Chinese (zh)
Inventor
刘洪涛
王宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202011383171.9A
Publication of CN112632215A
Legal status: Pending

Classifications

    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/35 Clustering; Classification
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G06F40/216 Parsing using statistical methods
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06Q50/01 Social networking


Abstract

The invention claims protection for a community discovery method based on a word-pair semantic topic model, which comprises the following steps: (1) first, preprocess the acquired short-text data set (microblogs, bullet-screen comments, tweets and the like); (2) then, build a BTM-based topic model (comprising a BTM-R model and a BTM-W model) from the social relationships of users in the social network and the short text information they publish, and derive the probability distributions of the model; (3) next, perform parameter estimation with a Gibbs sampling algorithm; (4) finally, perform community discovery with the estimated parameters and a community discovery algorithm. By mining the semantic information of the short texts published by users, the method obtains a corresponding probability model, and by further introducing the semantic similarity of the text content it obtains the probability distribution of users' interests; at the same time, the social relationships of users within communities are introduced, so that closely related users in the social network are mined, achieving the goal of community discovery.

Description

Community discovery method and system based on word-pair semantic topic model
Technical Field
The invention belongs to the field of social computing, in particular to the field of community discovery, and specifically relates to a community discovery method based on a word-pair semantic topic model (Biterm Topic Model, BTM for short).
Background
Since the Internet entered daily life, and especially since mobile devices became ubiquitous, the relationships people have in real society have also been mapped into the network, forming what we call online social networks. People in the modern technological age can hardly do without social media software. Besides the friend relationships between people, such software carries information published by users; this information generally has a strict character-length limit, which brings a brand-new opportunity to community discovery work.
Traditional community discovery methods are mainly based on the network topology, i.e. the connection relations of graph theory: they divide communities by analyzing the relations among individuals, so that the discovered communities have dense internal connections while connections between different communities are sparse. Such methods do not consider the attributes of users. On social media software, the short text information published by a user implies the user's interests, browsing habits and so on, and the topic models used in natural language processing can extract and mine a user's interests from the text information the user publishes.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. A community discovery method based on a word pair semantic topic model is provided. The technical scheme of the invention is as follows:
a community discovery method based on a word pair semantic topic model comprises the following steps:
s1, preprocessing the acquired short text data set, including removing the non-text parts of the short text documents, word segmentation and stop-word removal, and processing the relationship data set in the acquired data, including user-relationship processing and removal of inactive users, so as to complete the construction of the user topological structure;
s2, building a BTM topic model according to given community labels, including a BTM-R topic model based on the community user topological structure and a BTM-W topic model built on the semantic similarity of short text content within communities; in BTM-R the document set is the set of all users, the term set is the set of attention (follow) relations among users, and a topic is a community; in BTM-W the document set is the set of short text information published by all users, the term set is the set of pairwise combinations of distinct terms in those short texts, i.e. the word-pair set, and a topic is a community;
s3, according to the models BTM-R and BTM-W obtained in S2, using Dirichlet distributions as priors for the document-topic probability distribution and the topic-term probability distribution, thereby obtaining the conditional probability distribution based on word pair b:

P(z_b = z | z_-b, B, α, β)

wherein α and β are hyper-parameters of the Dirichlet distributions, z represents the topic corresponding to a word pair, z_-b denotes the topic assignments of the whole word-pair set with word pair b removed, and B is the set of all word pairs;
s4, estimating, by a Gibbs sampling algorithm and according to the joint probability distribution obtained in S3, the global topic probability distribution θ given the short text information and the term probability distribution φ given the topic;
and S5, carrying out community discovery according to the parameters obtained in the step S4 to obtain communities.
Further, in step S1, the preprocessing of the data includes the following steps:
Preprocessing for the BTM-R model: since the friend relationship defined by the BTM-R model must be mutual attention (reciprocal following), the follow relations in the user relationship data set are made bidirectional, and users without friends are removed.
Preprocessing for the BTM-W model: the short text information published by each user is obtained from the data set, and the non-text parts, including HTML tags, non-English characters, punctuation, modal particles and loanwords, are removed; then jieba word segmentation is applied to the BTM-W corpus.
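In an example, the two preprocessing passes above can be sketched as follows. The function and variable names are illustrative (not from the patent), the stop-word list is a placeholder, and a plain whitespace split stands in for the jieba segmentation that would be used on a real Chinese corpus:

```python
import re

STOPWORDS = {"the", "a", "的", "了"}  # illustrative stop-word list, not the patent's

def clean_short_text(text):
    """BTM-W preprocessing: strip non-text parts, tokenize, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)   # remove html tags
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation and symbols
    tokens = text.split()                  # jieba.lcut(text) for Chinese in practice
    return [t for t in tokens if t.lower() not in STOPWORDS]

def mutual_follow_graph(follows):
    """BTM-R preprocessing: keep only reciprocal follow edges (mutual attention)
    and drop users left without any friend."""
    fset = set(follows)
    edges = {(u, v) for (u, v) in fset if (v, u) in fset}
    users = {u for e in edges for u in e}
    return edges, users
```

Users outside the returned `users` set are the "users without friends" that the method removes before building the topology.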
Further, the step S2 specifically includes:
2. Solving the probability distribution of the topic model:
2.1.1 For each topic z, sample the term probability distribution φ_z ~ Dir(β);
2.1.2 For the whole corpus, sample the topic probability distribution θ ~ Dir(α);
2.1.3 For each word pair, sample a random topic assignment z ~ Multi(θ);
2.1.4 For each word pair, sample the two words ω_i, ω_j ~ Multi(φ_z);
2.1.5 The joint probability of a word pair b = (ω_i, ω_j) is: P(b) = Σ_z P(z)P(ω_i|z)P(ω_j|z), where P(z) is the probability of topic z and P(ω_i|z) is the probability of ω_i under topic z.
2.1.6 The probability of the entire corpus is: P(B) = Π_(i,j) Σ_z θ_z φ_i|z φ_j|z
wherein ω_i and ω_j are the two elements of a word pair, φ_i|z is the probability of word ω_i when the topic is z, φ_j|z is the probability of word ω_j when the topic is z, and θ_z is the probability of topic z; i and j index the two words of a word pair.
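The word-pair (biterm) construction and one draw of the generative process in steps 2.1.3-2.1.4 can be sketched as below; the function names are illustrative, words are represented as vocabulary indices, and θ and φ are passed in as already-normalized probability vectors rather than being sampled from Dirichlet priors:

```python
import random
from itertools import combinations

def extract_biterms(tokens):
    """Word-pair set of one short text: all unordered pairs of distinct terms."""
    vocab = sorted(set(tokens))
    return list(combinations(vocab, 2))

def generate_biterm(theta, phi, rng):
    """One draw of steps 2.1.3-2.1.4: z ~ Multi(theta), then w_i, w_j ~ Multi(phi_z)."""
    z = rng.choices(range(len(theta)), weights=theta)[0]
    wi = rng.choices(range(len(phi[z])), weights=phi[z])[0]
    wj = rng.choices(range(len(phi[z])), weights=phi[z])[0]
    return z, (wi, wj)
```

Summing the per-pair joint probability of 2.1.5 over every biterm produced by `extract_biterms` gives the corpus probability of 2.1.6.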
Further, the generated joint probability distribution is:
3.1: p(ω_m, z_m, θ_m, Φ | α, β) = p(θ_m|α) · p(Φ|β) · Π_{n=1..N_m} p(z_m,n|θ_m) · p(ω_m,n|z_m,n, Φ)
wherein ω_m represents the set of all terms in the m-th short text, z_m represents the set of topics corresponding to all terms in the m-th short text, θ_m is the topic probability distribution of the m-th short text, Φ represents the set of term probability distributions under all topics, ω_m,n is the n-th term in the m-th short text, z_m,n is the topic corresponding to the n-th term in the m-th short text, and N_m is the total number of terms contained in the m-th document; p(ω_m,n|z_m,n, Φ) is the probability of the n-th term of the m-th short text given that it belongs to topic z_m,n, and p(z_m,n|θ_m) is the probability of that topic under the topic distribution of the m-th short text. p(θ_m|α) is the probability of the topic distribution of the m-th short text under the Dirichlet hyper-parameter α, and p(Φ|β) is the probability of the term distributions under all topics given the Dirichlet hyper-parameter β. α and β are the hyper-parameters of the Dirichlet distributions.
The Gibbs sampling formula is:
3.2: P(z_b = z | z_-b, B) ∝ (n_z + α) · (n_ωi|z + β)(n_ωj|z + β) / (Σ_ω n_ω|z + Mβ)²
wherein P(z_b = z | z_-b, B) is the conditional probability of assigning word pair b to topic z given z_-b, the topic assignments of all other word pairs; B is the set of all word pairs; n_z is the number of times a word pair is assigned to topic z, n_ω|z is the number of times word ω is assigned to topic z, and M is the number of terms in the vocabulary.
Further, step S4 specifically includes the following steps:
According to the Gibbs sampling formula 3.2, the topic probability distribution is calculated with the following formula:
4.1: θ_m,z = (n_m,z + α) / (|B| + Kα)
wherein θ_m,z represents the probability that short text m has topic z, n_m,z is the number of times topic z appears in short text m, and α = (α_1, α_2, …, α_K) is the hyper-parameter of the K-dimensional Dirichlet distribution, each α_k a positive real number representing prior knowledge about the parameter θ_m. The hyper-parameters α and β could be estimated with an EM algorithm, but that is not required here; empirically, α is usually set to 50/K (K being the number of topics) and β to 0.01. |B| is the total number of word pairs, and K is the total number of topics in short text m;
the probability distribution of the subject term is calculated in the following way:
4.2:
Figure BDA0002810168880000043
wherein phi isω|zRepresenting the probability of a term being ω, n, given a topic zω|zIs the number of times the word ω is assigned to the subject z, β ═ β12,…,βz) Is a hyperparameter, beta, of the z-dimensional Dirichlet distributionkFor positive real numbers, represents the parameter phizAnd M is the number of terms in the subject z, and the probability distribution parameters of the subject words are obtained.
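Given the counters maintained by a Gibbs sampler, formulas 4.1 and 4.2 reduce to two smoothed normalisations. A minimal sketch with assumed names (`n_z` and `n_wz` are the topic and word-per-topic counts, |B| = Σ_z n_z):

```python
def estimate_theta_phi(n_z, n_wz, alpha=0.1, beta=0.01):
    """Point estimates following formulas 4.1 / 4.2:
    theta_z = (n_z + α) / (|B| + Kα),  phi_w|z = (n_w|z + β) / (Σ_w n_w|z + Vβ)."""
    K, V = len(n_z), len(n_wz[0])
    B = sum(n_z)                        # total number of word pairs
    theta = [(n_z[k] + alpha) / (B + K * alpha) for k in range(K)]
    phi = []
    for k in range(K):
        total = sum(n_wz[k]) + V * beta
        phi.append([(n_wz[k][w] + beta) / total for w in range(V)])
    return theta, phi
```

Both θ and every row of φ sum to 1, as probability distributions must.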
Further, step S5 performs community discovery according to the parameters obtained in step S4 to obtain communities, specifically:
According to the obtained parameters θ_m,z and φ_ω|z, the topic probability distribution θ_m of a given short text is known; from the actual meaning of θ_m in the BTM-R model and the BTM-W model, θ_m is the probability distribution of a given user over communities, so that communities represented in the form of probability distributions are obtained.
A word-pair semantic topic model based community discovery system comprising:
A preprocessing module: configured to preprocess the acquired short text data set, including removing the non-text parts of the short text documents, word segmentation and stop-word removal, and to process the relationship data set in the acquired data, including user-relationship processing and removal of inactive users, so as to complete the construction of the user topological structure;
A BTM topic-model building module: configured to build a BTM topic model according to given community labels, including a BTM-R topic model based on the community user topological structure and a BTM-W topic model built on the semantic similarity of short text content within communities; in BTM-R the document set is the set of all users, the term set is the set of attention relations among users, and a topic is a community; in BTM-W the document set is the set of short text information published by all users, the term set is the set of pairwise combinations of distinct terms in those short texts, i.e. the word-pair set, and a topic is a community;
A joint probability distribution calculation module: according to the obtained models BTM-R and BTM-W, Dirichlet distributions are used as priors for the document-topic probability distribution and the topic-term probability distribution, thereby obtaining the conditional probability distribution based on word pair b: P(z_b = z | z_-b, B, α, β), wherein α and β are hyper-parameters of the Dirichlet distributions, z represents the topic corresponding to a word pair, z_-b denotes the topic assignments of the whole word-pair set with word pair b removed, and B is the set of all word pairs;
A community discovery module: configured to estimate, by a Gibbs sampling algorithm and according to the obtained joint probability distribution, the global topic probability distribution θ given the short text information and the term probability distribution φ given the topic, and to perform community discovery according to the obtained parameters to obtain communities.
The invention has the following advantages and beneficial effects:
the invention provides a community discovery method based on a word pair semantic topic model, which is characterized in that a topic model is obtained by introducing short text information published by a user on social media software according to an original community discovery method, the probability distribution condition of the interests and hobbies of the user is effectively mined, and the community discovery is carried out according to the interests of the user on the topological structure of the original social network. In contrast, the existing methods are all single, and many are methods based on text processing unilaterally or based on network topology, and the method combines the two methods to obtain a better effect.
Drawings
FIG. 1 is a diagram of a BTM topic model of the present invention in accordance with a preferred embodiment;
FIG. 2 is a flow chart of community discovery in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the invention provides a community discovery method based on a word pair semantic Topic Model (BTM for short), and FIG. 2 is a process of community discovery.
The invention provides a community discovery method of a short text-based topic model BTM, which is characterized in that a community discovery process is carried out by utilizing the attention relationship among users in a social network and short text information issued by the users, and comprises the following steps:
and S1, preprocessing the acquired short text data set, including preprocessing operations of removing non-text parts, word segmentation, stop word removal and the like of the short text document, and processing the relation data set in the acquired data set, including user relation processing and elimination of inactive users (users who do not send out a text for a long time or have no friends), so as to complete the construction of a user topological structure.
In an example, the sub-steps of implementing S1 are as follows:
1. Preprocessing of the data set:
Preprocessing for the BTM-R model: since the friend relationship defined by the BTM-R model must be mutual attention, the follow relations in the user relationship data set are made bidirectional, and users without friends are removed.
Preprocessing for the BTM-W model: the short text information published by each user is obtained from the data set, and the non-text parts, including HTML tags, non-English characters, punctuation, modal particles, loanwords and the like, are removed; then jieba word segmentation is applied to the BTM-W corpus.
S2, constructing a BTM topic model according to the given community label, wherein the BTM topic model comprises a BTM-R topic model based on the topology structure of community users and a topic model BTM-W constructed based on the semantic similarity of short text information content in the community. The BTM-R document set is a set formed by all users, the term set is a set formed by concern relations among the users, and the theme is a set of communities, in addition, the BTM-W document set is a set formed by short text information issued by all the users, the term set is a set formed by combining every two terms (word pairs) in the short text information issued by the users, and the theme is a set of communities;
s3, according to the models BTM-R and BTM-W obtained in S2, using Dirichlet distributions as priors for the document-topic probability distribution and the topic-term probability distribution, thereby obtaining the conditional probability distribution based on word pair b:

P(z_b = z | z_-b, B, α, β)

where α and β are hyper-parameters of the Dirichlet distributions, z represents the topic corresponding to a word pair, z_-b denotes the topic assignments of the whole word-pair set with word pair b removed, and B is the set of all word pairs;
in this example, the sub-steps of S2 and S3 are specifically implemented as follows:
2. Solving the probability distribution of the topic model:
2.1.1 For each topic z, sample the term probability distribution φ_z ~ Dir(β);
2.1.2 For the whole corpus, sample the topic probability distribution θ ~ Dir(α);
2.1.3 For each word pair, sample a random topic assignment z ~ Multi(θ);
2.1.4 For each word pair, sample the two words ω_i, ω_j ~ Multi(φ_z);
2.1.5 The joint probability of a word pair is: P(b) = Σ_z P(z)P(ω_i|z)P(ω_j|z);
2.1.6 The probability of the entire corpus is: P(B) = Π_(i,j) Σ_z θ_z φ_i|z φ_j|z
wherein ω_i and ω_j are the two elements of a word pair.
3. The generated joint probability distribution is:
3.1: p(ω_m, z_m, θ_m, Φ | α, β) = p(θ_m|α) · p(Φ|β) · Π_{n=1..N_m} p(z_m,n|θ_m) · p(ω_m,n|z_m,n, Φ)
wherein ω_m represents the set of all terms in the m-th short text, z_m the corresponding set of topics, θ_m the topic probability distribution of the m-th short text, Φ the set of term probability distributions under all topics, ω_m,n the n-th term in the m-th short text, z_m,n its corresponding topic, and N_m the total number of terms contained in the m-th document.
The Gibbs sampling formula is:
3.2: P(z_b = z | z_-b, B) ∝ (n_z + α) · (n_ωi|z + β)(n_ωj|z + β) / (Σ_ω n_ω|z + Mβ)²
wherein n_z is the number of times a word pair is assigned to topic z, n_ω|z is the number of times word ω is assigned to topic z, and M is the number of terms in the vocabulary. The joint probability distribution is calculated first;
s4, estimating the probability distribution theta of the global theme when the short text information is given and the probability distribution phi of the terms when the theme is given by using a Gibbs sampling algorithm according to the joint probability distribution obtained in the S3;
in this example, the sub-steps of implementing S4 are as follows:
According to the Gibbs sampling formula 3.2, the topic probability distribution is calculated with the following formula:
4.1: θ_m,z = (n_m,z + α) / (|B| + Kα)
wherein θ_m,z represents the probability that short text m has topic z, n_m,z is the number of times topic z appears in short text m, and α = (α_1, α_2, …, α_K) is the hyper-parameter of the K-dimensional Dirichlet distribution, each α_k a positive real number representing prior knowledge about the parameter θ_m; |B| is the total number of word pairs, and K is the total number of topics in short text m.
The probability distribution of the topic terms is calculated as follows:
4.2: φ_ω|z = (n_ω|z + β) / (Σ_ω n_ω|z + Mβ)
wherein φ_ω|z represents the probability of term ω given topic z, n_ω|z is the number of times word ω is assigned to topic z, and β = (β_1, β_2, …, β_M) is the hyper-parameter of the M-dimensional Dirichlet distribution, each β_k a positive real number representing prior knowledge about the parameter φ_z; M is the number of terms in the vocabulary. The probability distribution parameters of the topics and of the topic terms are thus obtained;
and S5, carrying out community discovery according to the parameters obtained in the step 4 to obtain communities.
In this example, the sub-steps of implementing S5 are as follows:
According to the obtained parameters, the topic probability distribution θ_m of the short text information is known; from the actual meaning of θ_m in the BTM-R model and the BTM-W model, θ_m is the probability distribution of a given user over communities, thereby obtaining communities represented in the form of probability distributions.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (7)

1. A community discovery method based on a word pair semantic topic model is characterized by comprising the following steps:
s1: preprocessing the acquired short text data set, including removing the non-text parts of the short text documents, word segmentation and stop-word removal, and processing the relationship data set in the acquired data, including user-relationship processing and removal of inactive users, so as to complete the construction of the user topological structure;
s2: building a BTM topic model according to given community labels, including a BTM-R topic model based on the community user topological structure and a BTM-W topic model built on the semantic similarity of short text content within communities, wherein in BTM-R the document set is the set of all users, the term set is the set of attention relations among users, and a topic is a community; in BTM-W the document set is the set of short text information published by all users, the term set is the set of pairwise combinations of distinct terms in those short texts, i.e. the word-pair set, and a topic is a community;
s3: according to the models BTM-R and BTM-W obtained in S2, using Dirichlet distributions as priors for the document-topic probability distribution and the topic-term probability distribution, thereby obtaining the conditional probability distribution based on word pair b:

P(z_b = z | z_-b, B, α, β)

wherein α and β are hyper-parameters of the Dirichlet distributions, z represents the topic corresponding to a word pair, z_-b denotes the topic assignments of the whole word-pair set with word pair b removed, and B is the set of all word pairs;
s4: estimating the probability distribution theta of the global theme when the short text information is given and the probability distribution phi of the terms when the theme is given by using a Gibbs sampling algorithm according to the joint probability distribution obtained in the S3;
s5: and performing community discovery according to the parameters obtained in the step S4 to acquire communities.
2. The community discovery method based on a word-pair semantic topic model according to claim 1, wherein the data preprocessing in step S1 comprises the following steps:
preprocessing for the BTM-R model:
since the friend relationships defined by the BTM-R model must be mutual follows, the follow relationships in the user relation data set are filtered bidirectionally, keeping only reciprocal ones, and users without friends are removed;
preprocessing for the BTM-W model:
the short text messages published by each user are obtained from the data set, and non-text parts, including HTML tags, non-English characters and punctuation, modal particles, and loanwords, are removed from the short text; the jieba word segmenter is then applied to the BTM-W corpus.
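A minimal sketch of the BTM-R relationship preprocessing above — keeping only reciprocal follow edges and dropping friendless users; the input format (a mapping from user to the set of users they follow) is an assumption for illustration:

```python
def mutual_friends(follows):
    """follows maps each user to the set of users they follow.
    Keep only mutual (reciprocal) follow relationships, and drop
    users who are left without any friends."""
    friends = {}
    for u, outs in follows.items():
        mutual = {v for v in outs if u in follows.get(v, set())}
        if mutual:  # users without friends are removed
            friends[u] = mutual
    return friends

# "a" and "b" follow each other; "c" follows nobody back.
graph = mutual_friends({"a": {"b", "c"}, "b": {"a"}, "c": set()})
```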
3. The community discovery method based on a word-pair semantic topic model according to claim 2, wherein step S2 specifically comprises:
2. solving the probability distributions of the topic model:
2.1.1: for each topic, sample the term probability distribution of topic z as φ_z ~ Dir(β);
2.1.2: for the whole document collection, sample the topic probability distribution θ ~ Dir(α);
2.1.3: for each word pair, perform a random topic assignment z ~ Multi(θ);
2.1.4: for each word pair, draw its two words ω_i, ω_j ~ Multi(φ_z);
2.1.5: the joint probability of a word pair is P(b) = Σ_z P(z) P(ω_i|z) P(ω_j|z), where P(z) is the probability of topic z and P(ω_i|z) is the probability of ω_i given that it belongs to topic z;
2.1.6: the probability of the entire corpus is P(B) = Π_{(i,j)} Σ_z θ_z φ_{i|z} φ_{j|z},
wherein ω_i and ω_j are the two elements of a word pair, φ_{i|z} is the probability of word i when the topic is z, φ_{j|z} is the probability of word j when the topic is z, θ_z is the probability of topic z, and i and j index the two words of the pair.
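Steps 2.1.1–2.1.4 describe the BTM generative process; the sketch below illustrates it using only the standard library (Dirichlet draws built from normalized gamma variates) with assumed toy dimensions:

```python
import random

def dirichlet(alphas, rng):
    """Dirichlet sample via normalized gamma variates."""
    xs = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(xs)
    return [x / s for x in xs]

def btm_generate(K, V, n_pairs, alpha=1.0, beta=0.01, seed=7):
    rng = random.Random(seed)
    phi = [dirichlet([beta] * V, rng) for _ in range(K)]     # 2.1.1: phi_z ~ Dir(beta)
    theta = dirichlet([alpha] * K, rng)                      # 2.1.2: theta ~ Dir(alpha)
    biterms = []
    for _ in range(n_pairs):
        z = rng.choices(range(K), weights=theta)[0]          # 2.1.3: z ~ Multi(theta)
        wi, wj = rng.choices(range(V), weights=phi[z], k=2)  # 2.1.4: w_i, w_j ~ Multi(phi_z)
        biterms.append((z, wi, wj))
    return theta, phi, biterms

theta, phi, biterms = btm_generate(K=3, V=10, n_pairs=5)
```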
4. The method of claim 3, wherein the generated joint probability distribution is:
3.1:
p(ω_m, z_m, θ_m, Φ | α, β) = p(θ_m | α) p(Φ | β) Π_{n=1}^{N_m} p(z_{m,n} | θ_m) p(ω_{m,n} | Φ, z_{m,n})
wherein ω_m is the set of all terms in the mth short text message, z_m is the set of topics corresponding to all terms in the mth short text message, θ_m is the topic probability distribution of the mth short text message, Φ is the set of term probability distributions under all topics, ω_{m,n} is the nth term in the mth short text message, z_{m,n} is the topic corresponding to the nth term of the mth short text message, and N_m is the total number of terms contained in the mth document; p(ω_{m,n} | Φ, z_{m,n}) is the probability of the nth term in the mth short text message given that it belongs to topic z_{m,n}, and p(z_{m,n} | θ_m) is the probability of topic z_{m,n} under the topic probability distribution of the mth short text message; p(θ_m | α) is the probability of the topic distribution of the mth short text message given the Dirichlet hyperparameter α, and p(Φ | β) is the probability of the term distributions under all topics given the Dirichlet hyperparameter β; α and β are the hyperparameters of the Dirichlet distributions;
the Gibbs sampling formula is:
3.2:
P(z_b = z | z_{-b}, B) ∝ (n_z + α) · (n_{ω_i|z} + β)(n_{ω_j|z} + β) / (Σ_ω n_{ω|z} + Mβ)²
wherein P(z_b = z | z_{-b}, B) is the conditional probability, based on word pair b, of assigning b to topic z, z_{-b} denotes the topic assignments of all other word pairs, B is the set of all word pairs, n_z is the number of times a word pair is assigned to topic z, n_{ω|z} is the number of times the word ω is assigned to topic z, and M is the number of terms under topic z.
5. The community discovery method based on a word-pair semantic topic model according to claim 4, wherein step S4 specifically comprises the following steps:
based on the Gibbs sampling formula 3.2, the topic probability distribution is computed as:
4.1:
θ_{m,z} = (n_{m,z} + α) / (|B| + Kα)
wherein θ_{m,z} is the probability that short text message m has topic z, n_{m,z} is the number of times topic z appears in short text message m, and α = (α_1, α_2, …, α_m) is the hyperparameter of the m-dimensional Dirichlet distribution, with each α_k a positive real number; the hyperparameters α and β could be estimated by the EM algorithm, but this is not required here: α is usually set to 50/K, where K is the number of topics, and β to 0.01; these empirical values represent prior knowledge about the parameter θ_m; |B| is the total number of word pairs and K is the total number of topics in short text message m;
the topic-term probability distribution is computed as:
4.2:
φ_{ω|z} = (n_{ω|z} + β) / (Σ_ω n_{ω|z} + Mβ)
wherein φ_{ω|z} is the probability of term ω given topic z, n_{ω|z} is the number of times the word ω is assigned to topic z, β = (β_1, β_2, …, β_z) is the hyperparameter of the z-dimensional Dirichlet distribution, with each β_k a positive real number representing prior knowledge about the parameter φ_z, and M is the number of terms under topic z; the probability distribution parameters of the topic words are thus obtained.
6. The community discovery method based on a word-pair semantic topic model according to claim 5, wherein step S5 performs community discovery according to the parameters obtained in step S4 to obtain the communities, specifically:
from the obtained parameters θ_{m,z} and φ_{ω|z}, the topic probability distribution θ_m of a given short text message is known; in the BTM-R and BTM-W models, the actual meaning of θ_m is the probability distribution of a given user over communities, whereby the communities are obtained in the form of probability distributions.
7. A system for community discovery based on a word-pair semantic topic model, comprising:
a preprocessing module: used for preprocessing an acquired short text data set, including removing non-text parts, performing word segmentation, and removing stop words from the short text documents, and for processing the relation data set in the acquired data, including processing user relationships and removing inactive users, so as to complete construction of the user topological structure;
a BTM topic model building module: used for constructing BTM topic models according to given community labels, including a BTM-R topic model based on the community user topological structure and a BTM-W topic model built on the semantic similarity of short text content within a community; in BTM-R, the document set is the set of all users, the term set is the set of follow relationships among users, and a topic is a community; in BTM-W, the document set is the set of short text messages published by all users, the term set is the set of pairwise combinations of distinct terms within a user's short text messages, i.e. the word pairs, and a topic is likewise a community;
a joint probability distribution calculation module: used for placing, according to the obtained models BTM-R and BTM-W, Dirichlet priors on the document-topic probability distribution and the topic-term probability distribution, thereby obtaining, for a word pair b, the probability distribution
P(z_b = z | z_{-b}, B, α, β)
wherein α and β are the hyperparameters of the Dirichlet distributions, z denotes the topic corresponding to a word pair, z_{-b} denotes the topic assignments of the whole word-pair set with the word pair b removed, and B is the set of all word pairs;
a community discovery module: used for estimating, by applying the Gibbs sampling algorithm to the obtained joint probability distribution, the topic probability distribution θ of a given short text message and the term probability distribution φ of a given topic, and for performing community discovery according to the obtained parameters to obtain the communities.
CN202011383171.9A 2020-12-01 2020-12-01 Community discovery method and system based on word-pair semantic topic model Pending CN112632215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011383171.9A CN112632215A (en) 2020-12-01 2020-12-01 Community discovery method and system based on word-pair semantic topic model


Publications (1)

Publication Number Publication Date
CN112632215A true CN112632215A (en) 2021-04-09

Family

ID=75307157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011383171.9A Pending CN112632215A (en) 2020-12-01 2020-12-01 Community discovery method and system based on word-pair semantic topic model

Country Status (1)

Country Link
CN (1) CN112632215A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102675A (en) * 2013-04-15 2014-10-15 中国人民大学 Method for detecting blogger interest community based on user relationship
US20150134402A1 (en) * 2013-11-11 2015-05-14 Yahoo! Inc. System and method for network-oblivious community detection
CN105302866A (en) * 2015-09-23 2016-02-03 东南大学 OSN community discovery method based on LDA Theme model


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wang Yamin et al., "Microblog public-opinion hotspot discovery based on BTM", 《情报杂志》 (Journal of Intelligence), vol. 35, no. 11, 30 November 2016, pages 119-124 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361270A (en) * 2021-05-25 2021-09-07 浙江工业大学 Short text optimization topic model method oriented to service data clustering
CN113378558A (en) * 2021-05-25 2021-09-10 浙江工业大学 RESTful API document theme distribution extraction method based on representative word pairs
CN113378558B (en) * 2021-05-25 2024-04-16 浙江工业大学 RESTful API document theme distribution extraction method based on representative word pairs
CN113361270B (en) * 2021-05-25 2024-05-10 浙江工业大学 Short text optimization topic model method for service data clustering
CN114066669A (en) * 2021-10-28 2022-02-18 华南理工大学 Manufacturing service discovery method for cloud manufacturing
CN114066669B (en) * 2021-10-28 2024-05-03 华南理工大学 Cloud manufacturing-oriented manufacturing service discovery method
CN114491013A (en) * 2021-12-09 2022-05-13 重庆邮电大学 Topic mining method, storage medium and system for merging syntactic structure information
CN116432639A (en) * 2023-05-31 2023-07-14 华东交通大学 News element word mining method based on improved BTM topic model
CN116432639B (en) * 2023-05-31 2023-08-25 华东交通大学 News element word mining method based on improved BTM topic model

Similar Documents

Publication Publication Date Title
Sharma et al. Sentimental analysis of twitter data with respect to general elections in India
Salloum et al. Analysis and classification of Arabic newspapers’ Facebook pages using text mining techniques
Nguyen et al. Real-time event detection for online behavioral analysis of big social data
CN112632215A (en) Community discovery method and system based on word-pair semantic topic model
Beskow et al. Its all in a name: detecting and labeling bots by their name
Zhao et al. Cyberbullying detection based on semantic-enhanced marginalized denoising auto-encoder
Montejo-Ráez et al. Ranked wordnet graph for sentiment polarity classification in twitter
Liu Sentiment analysis: A multi-faceted problem
Khan et al. US Based COVID-19 tweets sentiment analysis using textblob and supervised machine learning algorithms
CN106354818B (en) Social media-based dynamic user attribute extraction method
Anwar et al. A social graph based text mining framework for chat log investigation
US11010687B2 (en) Detecting abusive language using character N-gram features
US9407589B2 (en) System and method for following topics in an electronic textual conversation
WO2021184640A1 (en) Sparse matrix-based product pushing method and apparatus, computer device, and medium
CN106569989A (en) De-weighting method and apparatus for short text
Kim et al. TwitterTrends: a spatio-temporal trend detection and related keywords recommendation scheme
CN111125305A (en) Hot topic determination method and device, storage medium and electronic equipment
CN114298007A (en) Text similarity determination method, device, equipment and medium
Sultana et al. Authorship recognition of tweets: A comparison between social behavior and linguistic profiles
US11061975B2 (en) Cognitive content suggestive sharing and display decay
Phuvipadawat et al. Detecting a multi-level content similarity from microblogs based on community structures and named entities
Sharma et al. Fake news detection using deep learning
Shi et al. SRTM: A Sparse RNN-Topic Model for Discovering Bursty Topics in Big Data of Social Networks.
Richardson et al. Topic models: A tutorial with R
Deshpande et al. A survey on: classification of Twitter data using sentiment analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210409
