CN113870041A - Microblog topic detection method based on message passing and graph prior distribution - Google Patents
- Publication number
- CN113870041A (CN202111052898.3A)
- Authority
- CN
- China
- Prior art keywords
- user
- users
- distribution
- topic
- graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
Abstract
The invention discloses a microblog topic detection method based on message passing and graph prior distribution, which comprises the following steps: (1) constructing a user-level social network from a microblog corpus according to the interaction relationships among users; (2) message-passing-based user node embedded representation: a graph convolutional network integrates the content information of posts with the structural information of the social network, so that the interaction relationships among users are embedded into the user node embedded representations; (3) topic generation based on a graph-prior variational autoencoder: the user node embedded representations, which already integrate the user interaction relationships, are used as input, the standard Gaussian prior distribution of the variational autoencoder is replaced by a graph prior distribution that contains the user interactions, and the correlation among users is taken into account during topic inference. Overall, user interactions are integrated at two stages: user node embedded representation and topic inference. The topics detected by the method pay closer attention to the correlation among users and achieve higher coherence.
Description
Technical Field
The invention relates to the technical field of natural language processing and social media data mining, in particular to a microblog topic detection method based on message passing and graph prior distribution.
Background
The rapid development of the internet has brought great progress to our lives. The popularity of social media has given everyone a platform on which to post opinions and views. As a result, a large amount of short text is generated every day. Analyzing the topics in this short text is an important task, but manual analysis is time-consuming and labor-intensive. A topic model can automatically extract the document-topic distribution and the topic-word distribution, helping people analyze texts and grasp their content quickly.
Traditional topic models, such as LDA, are widely used to find latent topics in a text corpus. Essentially, these methods reveal latent topics by implicitly capturing word co-occurrence patterns. However, they face a severe data sparsity problem (i.e., sparse post-level word co-occurrence patterns) when applied to short posts.
To address this problem, several lines of work have been successful: (1) Aggregation-based methods: some studies aggregate multiple posts using heuristic strategies, such as aggregation by author relationship or by conversation relationship; methods such as BTM directly model the generation of biterms (i.e., word pairs). (2) Representation-learning-based methods: some methods reveal topics by modeling co-occurrence patterns of latent concepts, and others effectively fuse the context information of words. (3) Social-context-based methods: such methods jointly model textual information and social network structure information, for example by modeling the social network structure and dividing messages into leader messages and follower messages. However, standard methods for learning probabilistic generative models, such as variational inference and Gibbs sampling, have high computational complexity in posterior inference, which prevents them from being applied to complex social media scenarios.
The variational autoencoder (VAE) is a common parameter inference framework for topic detection; it can identify the structure of the data and learn its latent distribution. NVDM is a typical VAE-based topic model. It feeds each document independently into the inference network and computes the mean and variance of the topic posterior distribution. A latent topic vector is then sampled from the variational posterior distribution, and finally the input document is reconstructed by the generative network. NVDM is designed for long documents. IATM is a classic VAE-based neural topic model for social media topic detection. It takes multiple short posts as input and learns edge embedded representations in the social network by mining the dynamic interactions of users. The edge embedded representations are likewise fed independently into the VAE to infer the corpus-level topic-word distribution. In essence, IATM integrates representation learning and social context on top of a VAE.
Although the previous approach embeds user interactions into the edge representations, the VAE assumes that each data point is independent. The correlation between users or posts is therefore weakened when the latent semantic vectors are computed. In a social network, an interaction may indicate a relationship or a shared interest, and the latent semantic vector is crucial for topic inference. It is therefore more reasonable to integrate the interaction features among users into the latent semantic vectors.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a microblog topic detection method based on message passing and graph prior distribution. User interaction information in social media is considered at both the user node embedded representation stage and the topic inference stage. In the user node embedded representation stage, a graph convolutional network learns user node embedded representations that integrate the social network structure information with the content of the posts, so that the interaction relationships of the users are embedded into the user node embedded representations. In the topic inference stage, a graph prior distribution is introduced and the interaction relationships are incorporated into the prior distribution of the VAE, so that the latent topic vectors of the users reflect the interaction relationships. Finally, VAE inference yields topic distributions that take user correlation into account and produces topics with higher coherence.
The purpose of the invention is realized by the following technical scheme:
A microblog topic detection method based on message passing and graph prior distribution, characterized by comprising the following steps:
(1) constructing a user-level social network: users are taken as the network nodes and interaction relationships as the edges of the network;
(2) encoding user interactions through a message passing mechanism: a graph neural network is introduced, the content information of posts and the structural information of the social network are integrated through the message passing mechanism, and the interaction relationships among users are embedded into the user node embedded representations;
(3) generating topics based on a graph-prior variational autoencoder: the user node embedded representations that integrate the user interaction relationships are used as input, the standard Gaussian prior of a conventional variational autoencoder (VAE) is replaced by a graph prior distribution that contains the user interaction relationships, and the correlation among users is taken into account during topic inference.
Further, the step (1) specifically comprises:
constructing a user-level social network $G = (V, E, T)$ according to the forwarding and comment relationships among users; where $V = \{v_i \mid 1 \le i \le n\}$ is the node set, $v_i$ denotes the $i$-th user in the social network, and $n$ denotes the number of users; $E = \{e_{ij} \mid 1 \le i, j \le n\}$ is the edge set: if user $i$ (represented by $v_i$) and user $j$ (represented by $v_j$) have interacted, then $e_{ij} = 1$; if they have never interacted, then $e_{ij} = 0$; the posts published by a user serve as the attribute information of that user node; $T = \{t_1, t_2, \ldots, t_n\}$ is the post set, where each $t_i$ denotes the post content of the $i$-th user; to alleviate data sparsity, a user-based aggregation strategy is adopted that aggregates all posts of a user, including source posts, forwarded posts and reply messages; the adjacency matrix $A$ of the user-level social network is obtained from the interaction relationships among users; based on the post set $T$, each word in a post is replaced by its word embedding to obtain the attribute vector of each user, and thus the attribute matrix $X$ of the social network; the word embedding of each word is randomly initialized.
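The construction of $A$ and $X$ described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `build_adjacency` and `build_attributes` are hypothetical, the adjacency matrix is assumed to be symmetric, and the attribute vector is assumed to be the mean of the word embeddings (the text does not state how the per-word embeddings are combined).

```python
import random

def build_adjacency(n_users, interactions):
    """Return the 0/1 adjacency matrix A (assumed symmetric) of the user network."""
    A = [[0] * n_users for _ in range(n_users)]
    for i, j in interactions:
        if i != j:
            A[i][j] = 1
            A[j][i] = 1
    return A

def build_attributes(posts, dim, seed=0):
    """Return the attribute matrix X, one row per user.

    Assumption: a user's attribute vector is the mean of the randomly
    initialised embeddings of the words in that user's aggregated post.
    """
    rng = random.Random(seed)
    embedding = {}  # word -> randomly initialised vector
    X = []
    for text in posts:
        words = text.split()
        for w in words:
            if w not in embedding:
                embedding[w] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        if words:
            row = [sum(embedding[w][k] for w in words) / len(words)
                   for k in range(dim)]
        else:
            row = [0.0] * dim
        X.append(row)
    return X

# Toy corpus: three users, each with one aggregated post.
posts = ["world cup final tonight", "cup final goal", "new phone release"]
interactions = [(0, 1)]  # user 0 forwarded or commented on user 1
A = build_adjacency(len(posts), interactions)
X = build_attributes(posts, dim=4)
```

In this toy example only users 0 and 1 are connected, so user 2 keeps an all-zero row in `A`.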
Further, the step (2) specifically comprises:
learning user node embedded representations using network embedding techniques; each post in a social network is short and informally expressed, so representation learning for posts is important. Using only bag-of-words (BoW) vectors as the representation of user nodes would face the data sparsity problem and degrade topic inference. According to social correlation theory, friends tend to pay attention to similar content. Data sparsity is therefore mitigated by modeling the user interactions in the social network; given the ability of the graph convolutional network (GCN) to aggregate information from surrounding nodes, a GCN is used to model the interaction behavior among friends and to learn the user node embedded representations; specifically, the microblog topic detection method adopts a two-layer GCN, as shown in the following formula:
where $\tilde{A} = A + I$, and $I$ denotes the identity matrix (a diagonal matrix whose diagonal elements are all 1); $\tilde{D}$ denotes the degree matrix of $\tilde{A}$; $X$ denotes the attribute matrix; $W_1$ and $W_2$ are the parameters of the graph convolutional network; ReLU is used as the activation function; and $H_2$ is the matrix formed by the embedded representations of all user nodes in the social network;
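The formula itself is not reproduced in this text. As a hedged sketch, the code below assumes the widely used GCN propagation rule with symmetric normalization, i.e. each layer computes ReLU of (D~^{-1/2} (A + I) D~^{-1/2}) times the previous layer's output times a weight matrix. The helper names and the use of ReLU on the second layer are assumptions for illustration.

```python
import math

def matmul(P, Q):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

def normalize_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} (assumed symmetric normalization)."""
    n = len(A)
    A_tilde = [[A[i][j] + (1 if i == j else 0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in A_tilde]  # degrees including the self-loop
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def gcn_forward(A, X, W1, W2):
    """Two-layer GCN: each layer averages neighbour features, then projects."""
    A_hat = normalize_adjacency(A)
    H1 = relu(matmul(matmul(A_hat, X), W1))
    H2 = relu(matmul(matmul(A_hat, H1), W2))
    return H2

# Two connected users with one-hot attributes and identity weights.
A = [[0, 1], [1, 0]]
X = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
H2 = gcn_forward(A, X, identity, identity)
```

With identity weights the two connected users end up with identical embeddings, which illustrates how message passing spreads a neighbor's content into a user's own representation.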
topic detection is unsupervised, so the graph convolutional network has no labels for training. The microblog topic detection method therefore uses an unsupervised loss function, as shown in the following formula:
given a user $v_i$, the aim is to maximize the similarity between $v_i$ and its related user nodes $v_j \in N_i$; the related user nodes are the user nodes in the first-order neighbor set $N_i$, i.e. those directly connected to $v_i$ by an edge in the social network; in formula (5), $h_i$ is the embedded representation of user node $i$ in $H_2$, $h_j$ is the embedded representation of user node $j$ in $H_2$, $h_u$ is the embedded representation of user node $u$ in $H_2$, and $v_u \in V$ ranges over all user nodes in the social network.
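Formula (5) is not reproduced in this text. One form consistent with the symbol definitions above is a softmax over inner products, in which each neighbor $j$ of user $i$ contributes $-\log\big(\exp(h_i \cdot h_j) / \sum_{v_u \in V} \exp(h_i \cdot h_u)\big)$. The sketch below implements that assumed form; the function name `neighbour_softmax_loss` is hypothetical and the actual loss in the patent may differ.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neighbour_softmax_loss(H, A):
    """Assumed unsupervised loss: negative log-softmax similarity to neighbours."""
    n = len(H)
    loss = 0.0
    for i in range(n):
        scores = [dot(H[i], H[u]) for u in range(n)]
        m = max(scores)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        for j in range(n):
            if A[i][j] == 1:  # j is a first-order neighbour of i
                loss -= scores[j] - log_norm
    return loss

A = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
H_close = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # neighbours 0 and 1 agree
H_far = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]    # neighbours 0 and 1 disagree
loss_close = neighbour_softmax_loss(H_close, A)
loss_far = neighbour_softmax_loss(H_far, A)
```

The loss is lower when connected users have similar embeddings, which is the behavior the text asks for.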
Through the message passing mechanism of the GCN, the related content of first-order neighbor users is propagated into the attributes of the connected users, which compensates for the data sparsity of any single user; at the same time, the embedded representations of connected user nodes become more similar, which preserves the correlation between friends in the social network.
Further, the step (3) specifically comprises:
in step (2) the interaction relationships among users are encoded into the user node embedded representations, which serve as the input of the graph-prior variational autoencoder in step (3); a variational autoencoder with a standard Gaussian prior comprises an encoder and a decoder: the encoder computes the mean and variance of the topic posterior distribution, a topic vector is sampled from the topic posterior distribution through the reparameterization trick, and the topic distribution is obtained through softmax; each user node embedded representation is then reconstructed by the decoder;
a variational autoencoder with a standard Gaussian prior can infer latent topics from independent long documents; when given multiple users as input, it assumes that the users are independent, which weakens the correlation between users during topic inference. The prior distribution in the VAE is a standard Gaussian distribution, which makes the data points independent. In the microblog topic detection method, a graph prior distribution is constructed to replace the standard Gaussian distribution; the graph prior distribution contains the user interaction relationships, so that the topic vector of each user obeys the corresponding interaction relationships among users in the social network. The graph prior distribution is shown in the following formula:
where $z_i$ and $z_j$ are the latent topic vectors of users $v_i$ and $v_j$; $p_s(z_i)$ is a standard Gaussian distribution; and the pairwise prior over $(z_i, z_j)$ takes the following form:
where $\alpha$ is a hyper-parameter and $I$ denotes the identity matrix; based on the graph prior distribution, a new variational lower bound for the graph-prior variational autoencoder is obtained, as follows:
where the variational distribution $q(z_i, z_j \mid h_i, h_j)$ takes the following form:
where $\mu_i$, $\mu_j$ and $\sigma_i^2$, $\sigma_j^2$ are the means and variances of the variational distribution, and $c_{ij}$ is the correlation coefficient of $z_i$ and $z_j$; the final loss function is as follows:
the graph-prior variational autoencoder defined by this loss function consists of the following three parts: 1) a variational network, which takes a user node embedded representation $[h_i]$ as input and computes the mean $\mu_i$ and variance $\sigma_i^2$; 2) a correlation encoding network, which takes a pair of user node embedded representations $[h_i, h_j]$ as input and computes the correlation coefficient $c_{ij}$ of the latent topic vectors of the two users; 3) a generative network which, as in a variational autoencoder with a standard Gaussian prior, takes the latent variable $z_i$ as input and reconstructs the original user node embedded representation, yielding $h'_i$. Overall, the method preserves the interactions between users at both the user node embedded representation stage and the topic inference stage and takes the correlation between friends into account, thereby obtaining more coherent topics.
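The formulas for the pairwise prior, the lower bound and the variational distribution are not reproduced in this text. As a hedged sketch of one term such a bound typically needs, the code below computes the KL divergence between a pairwise variational Gaussian with correlation $c_{ij}$ and a pairwise zero-mean, unit-variance Gaussian prior whose cross-correlation is $\alpha$ in every latent dimension. The covariance structure (blocks $I$ and $\alpha I$), the scalar $c_{ij}$ shared across dimensions, and the name `pair_kl` are all assumptions, not details confirmed by the patent.

```python
import math

def pair_kl(mu_i, mu_j, var_i, var_j, c_ij, alpha):
    """KL(q(z_i, z_j) || p(z_i, z_j)), summed over latent dimensions.

    Assumed forms, per dimension:
      q: means (mu_i, mu_j), variances (var_i, var_j), correlation c_ij
      p: zero means, unit variances, correlation alpha
    Requires |alpha| < 1 and |c_ij| < 1.
    """
    total = 0.0
    for mi, mj, vi, vj in zip(mu_i, mu_j, var_i, var_j):
        si, sj = math.sqrt(vi), math.sqrt(vj)
        # tr(Sigma_p^{-1} Sigma_q) for the 2x2 covariance matrices
        trace = (vi + vj - 2 * alpha * c_ij * si * sj) / (1 - alpha ** 2)
        # mu^T Sigma_p^{-1} mu
        maha = (mi * mi + mj * mj - 2 * alpha * mi * mj) / (1 - alpha ** 2)
        # log(det Sigma_p / det Sigma_q)
        log_det = math.log(1 - alpha ** 2) - math.log(vi * vj * (1 - c_ij ** 2))
        total += 0.5 * (trace + maha - 2 + log_det)
    return total

zeros, ones = [0.0, 0.0], [1.0, 1.0]
kl_matched = pair_kl(zeros, zeros, ones, ones, c_ij=0.3, alpha=0.3)
kl_independent = pair_kl(zeros, zeros, ones, ones, c_ij=0.0, alpha=0.3)
```

Under these assumptions the divergence vanishes when the variational correlation matches the prior correlation, and it is positive when two connected users are treated as independent, which is how a prior of this kind would push the latent topic vectors of interacting users to be correlated.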
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. To alleviate the data sparsity problem in social media topic detection, the method considers both the text content of the posts and the social network structure, and integrates the user interaction relationships. The structural information serves as a supplement that enriches the context information in the social network.
2. To introduce user correlation, the user interaction relationships are integrated at two stages, user node embedded representation and topic inference, so that user correlation is considered throughout the whole process of social media topic detection.
3. In the user node embedded representation stage, the message passing mechanism of the graph convolutional network makes it possible, on the one hand, to aggregate the information of the friends around each user and thus alleviate sparsity and, on the other hand, to integrate the social network structure into the user node embedded representations so that the user interaction relationships are preserved in them.
4. In the topic inference stage, topic inference is carried out with a variational autoencoder based on a graph prior. Unlike a conventional VAE, the inventive method replaces the standard Gaussian distribution with a graph prior distribution. The graph prior distribution takes user interactions into account, so the latent topic vectors of the users obey the interaction structure among users. The topics finally inferred are more coherent.
5. The experimental results on three months of Sina Weibo data demonstrate the effectiveness of the method and of the introduced graph prior distribution for microblog topic mining.
Drawings
FIG. 1 is a schematic flow diagram of the method of the present invention.
FIG. 2a is a visualization of the topic vectors inferred by a standard variational autoencoder with a standard Gaussian prior; FIG. 2b is a visualization of the topic vectors inferred by the graph-prior variational autoencoder.
FIG. 3 shows how the evaluation metric, topic coherence, varies with the parameter α of the graph prior distribution in the specific embodiment.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The specific implementation of the invention is described using a real microblog data set covering three months. The overall algorithm comprises three steps: constructing a user-level social network, message-passing-based user node embedded representation, and topic generation based on the graph-prior variational autoencoder, as shown in FIG. 1. The specific steps are as follows:
S1. Constructing a user-level social network:
Previous work collected microblogs from Sina Weibo covering 50 hot topics during the three months of May, June and July 2014. In this embodiment, a user-level social network is constructed from this microblog corpus as follows: 1) users without any forwarding or comment relationship are filtered out; 2) all posts of a user are concatenated to serve as the attribute information of that user; 3) if two users have an interaction relationship, an edge exists between the two user nodes, and otherwise no edge exists. The post text of a user serves as the attribute information of the user node in the social network.
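The three construction steps above can be sketched in a few lines. This is an illustration only: the function name `build_user_network` and the input layout (a list of `(user, text)` posts and a list of `(user_a, user_b)` interaction pairs) are assumptions, since the format of the original corpus is not described.

```python
def build_user_network(posts, interactions):
    """Filter isolated users, concatenate posts per user, and collect edges.

    posts: list of (user, text); interactions: list of (user_a, user_b).
    Edges are treated as undirected (an assumption).
    """
    # Step 3: one undirected edge per interacting pair of distinct users.
    edges = {tuple(sorted(pair)) for pair in interactions if pair[0] != pair[1]}
    # Step 1: keep only users that appear in at least one edge.
    active = {user for edge in edges for user in edge}
    # Step 2: concatenate all posts of each remaining user.
    documents = {}
    for user, text in posts:
        if user in active:
            documents[user] = (documents.get(user, "") + " " + text).strip()
    return documents, edges

posts = [("a", "hello world"), ("b", "world cup"),
         ("a", "cup final"), ("c", "lonely post")]
interactions = [("a", "b"), ("b", "a")]
documents, edges = build_user_network(posts, interactions)
```

User `c` has no forwarding or comment relationship and is therefore dropped, while the two posts of user `a` are merged into one document.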
Table 1 shows the statistics of the three monthly data sets.
TABLE 1 Microblog data set statistics
Data set    Users    Interactions    Vocabulary size
May         8907     10435           5914
June        19293    35962           9368
July        16990    20971           9663
S2. User node embedded representation based on message passing:
Using only bag-of-words (BoW) vectors as the representation of user nodes would face the data sparsity problem and degrade topic inference. Because each post is short and informally expressed, representation learning for posts in social networks is important. Given the ability of the graph convolutional network to aggregate information from surrounding nodes, a two-layer GCN is used to model the interaction behavior between friends and to learn the user node embedded representations. Through the message passing mechanism of the GCN, the related content of neighbor users is propagated into the attributes of the connected users, which compensates for the data sparsity of any single user. At the same time, the embedded representations of connected user nodes become more similar, which preserves the correlation between friends in the social network. The loss function for this step is shown in the following equation:
S3. Generating topics based on the graph-prior variational autoencoder:
With the user node embedded representations as input, topics are inferred using the variational autoencoder. The variational autoencoder comprises an encoder and a decoder. The encoder computes the mean and variance of the topic posterior distribution, as follows:
$\mu_i = \mathrm{MLP}(h_i)$
where $h_i$ denotes the embedded representation of the $i$-th user node, and $\mu_i$ and $\sigma_i^2$ denote the mean and variance, respectively. MLP stands for multi-layer perceptron. With the reparameterization trick $z_i = \mu_i + \epsilon * \sigma_i$, the latent topic vector $z_i$ can be sampled from the posterior distribution. The topic distribution $\theta = (p(t_1 \mid h), p(t_2 \mid h), \ldots, p(t_K \mid h))$ is obtained through the softmax function:
$\theta_i = \mathrm{softmax}(z_i)$
where $h$ denotes the input user representation, $t_1$ denotes the first topic, $p(t_1 \mid h)$ denotes the probability of the first topic, and $K$ denotes the total number of topics. Each user node embedded representation is reconstructed by the decoder network, which is also an MLP. The decoder parameter $W$ is the corpus-level topic-word distribution $\phi_{word} = (p(w \mid t_1), p(w \mid t_2), \ldots, p(w \mid t_K))$. The specific formulas are as follows:
$d_i = \mathrm{softmax}(\theta_i W)$
$h'_i = f(W_d d_i + b_d)$
where $p(w \mid t_1)$ denotes the probability of each word under the first topic, $d_i$ denotes the probability of each word appearing in the attribute information of user node $i$, $W_d$ denotes the weights of the neural network, $b_d$ denotes its bias, and $h'_i$ denotes the user node embedded representation reconstructed by the decoder.
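The encoder and decoder equations above can be traced end to end in a small sketch. Several details are assumptions made only for illustration: a single linear layer stands in for each MLP, the encoder is assumed to output a log-variance (the variance formula is not reproduced in this text), and the unspecified function $f$ is taken to be tanh. The name `vae_forward` and the parameter dictionary keys are hypothetical.

```python
import math
import random

def linear(W, x, b):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias
            for row, bias in zip(W, b)]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def vae_forward(h, params, rng):
    """One stochastic forward pass: h -> (theta, d, reconstructed h)."""
    mu = linear(params["W_mu"], h, params["b_mu"])
    log_var = linear(params["W_lv"], h, params["b_lv"])  # assumed log-variance
    sigma = [math.exp(0.5 * lv) for lv in log_var]
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + e * s for m, e, s in zip(mu, eps, sigma)]   # z = mu + eps * sigma
    theta = softmax(z)                                   # topic distribution
    W = params["W"]                                      # K x V topic-word weights
    logits = [sum(theta[k] * W[k][w] for k in range(len(theta)))
              for w in range(len(W[0]))]
    d = softmax(logits)                                  # d = softmax(theta W)
    h_rec = [math.tanh(x)                                # f assumed to be tanh
             for x in linear(params["W_d"], d, params["b_d"])]
    return theta, d, h_rec

rng = random.Random(0)
D, K, V = 3, 2, 4  # embedding size, number of topics, vocabulary size

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

params = {
    "W_mu": rand_matrix(K, D), "b_mu": [0.0] * K,
    "W_lv": rand_matrix(K, D), "b_lv": [0.0] * K,
    "W": rand_matrix(K, V),
    "W_d": rand_matrix(D, V), "b_d": [0.0] * D,
}
theta, d, h_rec = vae_forward([0.2, -0.1, 0.4], params, rng)
```

Both `theta` and `d` are probability vectors, and `h_rec` has the same dimension as the input embedding, matching the roles of $\theta_i$, $d_i$ and $h'_i$ in the text.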
A variational autoencoder with a standard Gaussian prior can infer latent topics from independent long documents. When given multiple users as input, it assumes that the users are independent, which weakens the correlation between users during topic inference. The invention first constructs a graph prior distribution to replace the standard Gaussian. The graph prior distribution contains the user interaction relationships, so that the topic vector of each user obeys the corresponding relationships among users in the social network. The loss function is then computed from the new variational lower bound, as follows:
The symbols in the formula have the meanings described above.
In the specific implementation, the post text of each user node is first preprocessed. After aggregation, the post text of each user contains 50 words. Word embeddings are randomly initialized with dimension 200. In the GCN, the dimension of the hidden layer is set to 200. In the variational autoencoder, the dimension of the first encoder layer is set to 200. The learning rate is set to 0.001. Dropout is applied in both the GCN and the VAE to avoid overfitting. Adam is used to optimize the loss function of each module.
To verify the effectiveness of the method of the invention, the method (MGTM) is compared with current advanced and representative methods (BAT [1], BTM [2], LCTM [3], LeadLDA [4], AdjEnc [5], IATM [6]) and with a variant of the method, MGTM (Standard Gaussian).
BAT explores the application of bidirectional adversarial training to neural topic models. It is designed for long documents and faces a severe data sparsity problem when applied to short documents.
BTM learns topics by directly modeling the generation of word pairs across the whole corpus.
LCTM reveals topics by modeling co-occurrence patterns of latent concepts, which are used to capture the conceptual similarity of words.
LeadLDA distinguishes leader posts from follower posts, which contain key topic words to different degrees.
AdjEnc introduces network structure into topic inference for structured long documents such as academic papers and web pages.
IATM models the dynamic interactions of users to learn interaction-aware edge embeddings, and generates topics using neural variational inference.
MGTM (Standard Gaussian) falls back to a standard Gaussian distribution as the prior, in order to verify the effect of the graph prior distribution.
Model performance is evaluated with topic coherence, defined as follows:
Tables 2, 3 and 4 show the topic coherence results of the inventive method and all comparison methods on the three monthly microblog data sets, respectively. For each data set, the coherence scores of the top 10 (N = 10), 15 (N = 15) and 20 (N = 20) words of the inferred topics are recorded for topic numbers K = 50 and K = 100. Higher topic coherence indicates better model performance.
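The coherence formula is not reproduced in this text. As a hedged illustration, the sketch below implements one common document co-occurrence definition, in which each pair of top words $(w_i, w_j)$ with $j < i$ contributes $\log\big((D(w_i, w_j) + 1) / D(w_j)\big)$, where $D(\cdot)$ counts the documents containing the given words. Whether the patent uses exactly this variant is an assumption, and the function name `topic_coherence` is hypothetical.

```python
import math

def topic_coherence(top_words, documents):
    """Co-occurrence coherence of one topic's top-N words (assumed variant).

    Every word in top_words must occur in at least one document,
    otherwise the document frequency in the denominator is zero.
    """
    doc_sets = [set(doc.split()) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            joint = doc_freq(top_words[i], top_words[j])
            score += math.log((joint + 1) / doc_freq(top_words[j]))
    return score

documents = ["cup final goal", "cup final", "phone release"]
coherent = topic_coherence(["cup", "final"], documents)
incoherent = topic_coherence(["cup", "phone"], documents)
```

Words that frequently appear in the same user documents score higher than words that never co-occur; a corpus-level score would average this value over all K topics.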
TABLE 2 Performance comparison of the inventive method and the comparison methods on the May data set
TABLE 3 Performance comparison of the inventive method and the comparison methods on the June data set
TABLE 4 Performance comparison of the inventive method and the comparison methods on the July data set
As the topic coherence results in Tables 2, 3 and 4 show, modeling the user interaction relationships at both the user node embedded representation stage and the topic inference stage lets the topics capture user correlation and further improves topic coherence. To study whether the graph prior distribution helps preserve user interactions in the latent topic vectors of the users, FIGS. 2a and 2b visualize the latent topic vectors. FIG. 2a shows the topic vectors inferred by the variational autoencoder with a standard Gaussian prior; FIG. 2b shows the topic vectors inferred by the graph-prior variational autoencoder. As the circled regions show, the inventive method yields user topic vectors that cluster better. To further study the influence of the parameter α of the graph prior distribution on topic coherence, FIG. 3 shows how the topic coherence score of the inventive method changes with α on the three monthly microblog data sets.
The above contents are intended to schematically illustrate the technical solution of the present invention, and the present invention is not limited to the above described embodiments. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.
References:
[1] Rui Wang, Xuemeng Hu, Deyu Zhou, Yulan He, Yuxuan Xiong, Chenchen Ye, and Haiyang Xu. 2020. Neural Topic Modeling with Bidirectional Adversarial Training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 340-350.
[2] Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. 2013. A Biterm Topic Model for Short Texts. In Proceedings of the 22nd International Conference on World Wide Web. ACM. 1445-1456.
[3] Weihua Hu and Jun’ichi Tsujii. 2016. A Latent Concept Topic Model for Robust Topic Inference Using Word Embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 380-386.
[4] Jing Li, Ming Liao, Wei Gao, Yulan He, and Kam-Fai Wong. 2016. Topic Extraction from Microblog Posts Using Conversation Structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2114-2123.
[5] Ce Zhang and Hady W. Lauw. 2020. Topic Modeling on Document Networks with Adjacent-Encoder. Proceedings of the AAAI Conference on Artificial Intelligence 34, 04 (2020), 6737-6745.
[6] Ruifang He, Xuefei Zhang, Di Jin, Longbiao Wang, Jianwu Dang, and Xiangang Li. 2018. Interaction-Aware Topic Model for Microblog Conversations through Network Embedding and User Attention. In Proceedings of the 27th International Conference on Computational Linguistics. 1398-1409.
Claims (4)
1. A microblog topic detection method based on message passing and graph prior distribution is characterized by comprising the following steps:
(1) constructing a user-level social network: users are taken as the network nodes and the interaction relationships between users as the edges of the network;
(2) encoding user interactions through a message passing mechanism: a graph neural network is introduced, the content information and the structural information of posts in social media are integrated by means of the message passing mechanism, and the interaction relationships among users are embedded into the user node embedded representations;
(3) generating topics with a graph-prior variational autoencoder: the user node embedded representations, which incorporate the user interaction relationships, are taken as input; the standard Gaussian prior of a variational autoencoder (VAE) that adopts a standard Gaussian distribution as its prior is replaced by a graph prior distribution containing the user interaction relationships, so that the correlation among users is taken into account during topic inference.
2. The microblog topic detection method based on message passing and graph prior distribution as claimed in claim 1, wherein the step (1) specifically comprises:
constructing a user-level social network $G = (V, E, T)$ according to the forwarding and comment relationships among users; wherein $V = \{v_i \mid 1 \le i \le n\}$ is the node set, $v_i$ represents the i-th user in the social network, and n represents the number of users; $E = \{e_{ij} \mid 1 \le i, j \le n\}$ is the edge set: if user i, represented by $v_i$, and user j, represented by $v_j$, have interacted, then $e_{ij} = 1$; if they have never interacted, then $e_{ij} = 0$; the posts published by a user are used as the attribute information of the user node; $T = \{t_1, t_2, \dots, t_n\}$ is the post set, where each $t_i$ represents the content of the posts of the i-th user; in order to alleviate data sparsity, a user-based aggregation strategy is adopted to aggregate all posts of a user, including source posts, forwarded posts and replies; an adjacency matrix A of the user-level social network is obtained according to the interaction relationships among users; according to the post set T, each word in a post is replaced by its corresponding word embedding representation to obtain an attribute vector for each user, and thereby the attribute matrix X of the social network; the word embedding representation of each word is obtained by random initialization.
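The construction described in this claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `build_user_graph`, the treatment of interactions as undirected, the embedding dimension, and the use of mean pooling to turn word embeddings into a user attribute vector are all assumptions made for illustration, since the text does not state how the word embeddings of a user are combined.

```python
import numpy as np


def build_user_graph(interactions, posts_by_user, embed_dim=8, seed=0):
    """Build the adjacency matrix A and attribute matrix X of a user-level network.

    interactions: iterable of (user_a, user_b) pairs (a forward or a comment).
    posts_by_user: dict mapping user id -> list of tokenized posts (lists of words).
    """
    users = sorted(posts_by_user)
    index = {u: i for i, u in enumerate(users)}
    n = len(users)

    # e_ij = 1 if users i and j have interacted, otherwise 0.
    # Assumption: interactions are treated as undirected.
    A = np.zeros((n, n), dtype=np.float64)
    for a, b in interactions:
        if a in index and b in index and a != b:
            A[index[a], index[b]] = 1.0
            A[index[b], index[a]] = 1.0

    # User-based aggregation: merge all posts of a user into one document t_i.
    docs = {u: [w for post in posts_by_user[u] for w in post] for u in users}

    # Randomly initialized word embeddings, one row per vocabulary word.
    vocab = sorted({w for d in docs.values() for w in d})
    word_id = {w: k for k, w in enumerate(vocab)}
    rng = np.random.default_rng(seed)
    E = rng.normal(scale=0.1, size=(len(vocab), embed_dim))

    # Assumption: a user's attribute vector is the mean of its word embeddings.
    X = np.zeros((n, embed_dim))
    for u in users:
        ids = [word_id[w] for w in docs[u]]
        if ids:
            X[index[u]] = E[ids].mean(axis=0)
    return users, A, X


users, A, X = build_user_graph(
    interactions=[("u1", "u2"), ("u2", "u3")],
    posts_by_user={
        "u1": [["graph", "topic"], ["topic", "model"]],
        "u2": [["graph", "network"]],
        "u3": [["microblog", "topic"]],
    },
)
```

Here `A` is a symmetric 3 x 3 matrix with a zero diagonal and `X` has one row per user. Aggregating per user before embedding is what keeps short, sparse posts from being modeled in isolation.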
3. The microblog topic detection method based on message passing and graph prior distribution as claimed in claim 1, wherein the step (2) specifically comprises:
learning user node embedded representations using a network embedding technique; alleviating data sparsity by modeling user interactions in the social network; in view of the ability of a graph convolutional network (GCN) to aggregate information from neighboring nodes, the interaction behavior among friends is modeled with a GCN and the user node embedded representations are learned; specifically, the microblog topic detection method adopts a two-layer GCN, as shown in the following formula:
wherein I represents the identity matrix, i.e. a diagonal matrix whose diagonal elements are all 1; the degree matrix is computed from the adjacency matrix; X represents the attribute matrix; W1 and W2 are the parameters of the graph convolutional network; ReLU is used as the activation function; and H2 is the matrix formed by the embedded representations of all user nodes in the social network;
the microblog topic detection method uses an unsupervised loss function, as shown in the following formula:
given a user $v_i$, the aim is to maximize the similarity between $v_i$ and its related user nodes $v_j \in N_i$; the related user nodes are the user nodes in the first-order neighbor set $N_i$, i.e. those directly connected to $v_i$ by an edge in the social network; in formula (5), $h_i$ is the embedded representation of user node i in H2, $h_j$ is the embedded representation of user node j in H2, $h_u$ is the embedded representation of user node u in H2, and $v_u \in V$ ranges over all user nodes in the social network;
based on the message passing mechanism of the GCN, the content of first-order neighbor users is propagated into the attributes of the connected users, which compensates for the data sparsity of a single user; at the same time, the embedded representations of connected user nodes become more similar, so that the relevance between friends in the social network is preserved.
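The two formulas referred to in this claim are not reproduced in the text, so the following Python sketch only illustrates the kind of computation described. It assumes the widely used GCN propagation rule with self-loops and symmetric degree normalization, a ReLU on both layers, and a softmax-over-all-nodes form for the unsupervised loss. The function names and the dot-product similarity are hypothetical choices, not details confirmed by the source.

```python
import numpy as np


def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])
    deg = A_tilde.sum(axis=1)  # at least 1 because of the self-loop
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_forward(A, X, W1, W2):
    """Two-layer GCN; the returned matrix plays the role of H2 in the claim."""
    A_hat = normalize_adjacency(A)
    H1 = np.maximum(A_hat @ X @ W1, 0.0)   # first layer with ReLU
    H2 = np.maximum(A_hat @ H1 @ W2, 0.0)  # ReLU on this layer is an assumption
    return H2


def neighbor_softmax_loss(A, H):
    """Negative log-probability of first-order neighbors under a softmax over all nodes.

    Assumed form: -sum_i sum_{j in N_i} log( exp(h_i.h_j) / sum_u exp(h_i.h_u) ).
    """
    scores = H @ H.T                                      # h_i . h_u for all pairs
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -(A * log_prob).sum()


rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph u1-u2-u3
X = rng.normal(size=(3, 4))
W1 = rng.normal(size=(4, 5))
W2 = rng.normal(size=(5, 2))
H2 = gcn_forward(A, X, W1, W2)
loss = neighbor_softmax_loss(A, H2)
```

Minimizing a loss of this shape pulls the embeddings of connected users together relative to all other users, which matches the stated goal of keeping friends similar in the embedding space.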
4. The microblog topic detection method based on message passing and graph prior distribution as claimed in claim 1, wherein the step (3) specifically comprises:
the user node embedded representations obtained in step (2), which encode the interaction relationships between users, are used as the input of the graph-prior variational autoencoder in step (3); a variational autoencoder that adopts a standard Gaussian distribution as its prior comprises an encoder and a decoder, wherein the encoder computes the mean and variance of the topic posterior distribution, a topic vector is sampled from the topic posterior distribution by means of the reparameterization trick, and the topic distribution is obtained through softmax; each user node embedded representation is then reconstructed by the decoder;
in the microblog topic detection method, a graph prior distribution is constructed to replace the standard Gaussian distribution; the graph prior distribution contains the user interaction relationships, so that the topic vectors of all users follow the corresponding interaction relationships among the users in the social network; the graph prior distribution is shown in the following formula:
wherein $z_i$ and $z_j$ are the latent topic vectors of users $v_i$ and $v_j$; $p_s(z_i)$ is the singleton marginal distribution, for which a standard Gaussian distribution is used; the pairwise marginal distribution of $z_i$ and $z_j$ adopts the following form:
α is a hyper-parameter and I represents a diagonal matrix; based on the graph prior distribution, a new variational lower bound for the graph-prior variational autoencoder is obtained, as follows:
wherein the variational distribution $q(z_i, z_j \mid h_i, h_j)$ takes the following form:
wherein $\mu_i$, $\mu_j$ and $\sigma_i^2$, $\sigma_j^2$ are the means and variances of the variational distribution; $c_{ij}$ is the correlation coefficient of $z_i$ and $z_j$; the final loss function is as follows:
the graph-prior variational autoencoder obtained from the loss function consists of the following three parts: 1) a variational network, which takes a user node embedded representation $[h_i]$ as input and computes the mean $\mu_i$ and the variance $\sigma_i^2$; 2) a correlation encoding network, which takes a pair of user node embedded representations $[h_i, h_j]$ as input and computes the correlation coefficient $c_{ij}$ of the latent topic vectors of the two users; 3) a generative network, which, as in a variational autoencoder that uses a standard Gaussian distribution as its prior, takes the latent variable $z_i$ as input and reconstructs the original user node embedded representation to obtain $h'_i$.
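Because the formulas of this claim are not reproduced in the text, the Python sketch below is only an illustration under stated assumptions. It assumes that, per latent dimension, the pairwise prior of $(z_i, z_j)$ is a zero-mean bivariate Gaussian with unit variances and correlation α, and that the variational distribution is a bivariate Gaussian with means $\mu_i, \mu_j$, standard deviations $\sigma_i, \sigma_j$ and a scalar correlation $c_{ij}$. The function names are hypothetical, and the closed-form KL term is the standard expression for two Gaussians, not a formula taken from the patent.

```python
import numpy as np


def pairwise_kl(mu_i, mu_j, sigma_i, sigma_j, c_ij, alpha):
    """KL( q(z_i, z_j) || p(z_i, z_j) ), summed over latent dimensions.

    q: per-dimension bivariate Gaussian with correlation c_ij (assumed form).
    p: zero mean, unit variances, correlation alpha (assumed form), |alpha| < 1.
    """
    one_minus_a2 = 1.0 - alpha ** 2
    trace_term = (sigma_i ** 2 + sigma_j ** 2
                  - 2.0 * alpha * c_ij * sigma_i * sigma_j) / one_minus_a2
    mean_term = (mu_i ** 2 + mu_j ** 2 - 2.0 * alpha * mu_i * mu_j) / one_minus_a2
    log_det_term = (np.log(one_minus_a2)
                    - np.log(sigma_i ** 2 * sigma_j ** 2 * (1.0 - c_ij ** 2)))
    return 0.5 * np.sum(trace_term + mean_term - 2.0 + log_det_term)


def sample_pair(mu_i, mu_j, sigma_i, sigma_j, c_ij, rng):
    """Reparameterized sample of (z_i, z_j) whose dimensions have correlation c_ij."""
    eps_i = rng.standard_normal(mu_i.shape)
    eps_j = rng.standard_normal(mu_j.shape)
    z_i = mu_i + sigma_i * eps_i
    z_j = mu_j + sigma_j * (c_ij * eps_i + np.sqrt(1.0 - c_ij ** 2) * eps_j)
    return z_i, z_j


def softmax(z):
    """Map a latent topic vector to a topic distribution."""
    e = np.exp(z - z.max())
    return e / e.sum()


rng = np.random.default_rng(0)
mu_i, mu_j = np.zeros(3), np.zeros(3)
sigma_i, sigma_j = np.ones(3), np.ones(3)
z_i, z_j = sample_pair(mu_i, mu_j, sigma_i, sigma_j, c_ij=0.5, rng=rng)
theta_i = softmax(z_i)  # topic distribution of user i
kl = pairwise_kl(mu_i, mu_j, sigma_i, sigma_j, c_ij=0.5, alpha=0.5)
```

In this sketch the KL term vanishes when the variational pair matches the prior exactly (zero means, unit variances, $c_{ij} = \alpha$), so α acts as the target correlation that the topic vectors of interacting users are pulled toward.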
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111052898.3A CN113870041B (en) | 2021-09-07 | Microblog topic detection method based on message passing and graph priori distribution |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113870041A (en) | 2021-12-31 |
CN113870041B (en) | 2024-05-24 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150213370A1 (en) * | 2014-01-27 | 2015-07-30 | Facebook, Inc. | Label inference in a social network |
CN110232434A (en) * | 2019-04-28 | 2019-09-13 | 吉林大学 | A kind of neural network framework appraisal procedure based on attributed graph optimization |
CN110348573A (en) * | 2019-07-16 | 2019-10-18 | 腾讯科技(深圳)有限公司 | The method of training figure neural network, figure neural network unit, medium |
US20200293902A1 (en) * | 2019-03-15 | 2020-09-17 | Baidu Usa Llc | Systems and methods for mutual learning for topic discovery and word embedding |
CN112364161A (en) * | 2020-09-25 | 2021-02-12 | 天津大学 | Microblog theme mining method based on dynamic behaviors of heterogeneous social media users |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |