CN113870041A - Microblog topic detection method based on message passing and graph prior distribution - Google Patents

Publication number
CN113870041A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111052898.3A
Other languages
Chinese (zh)
Other versions
CN113870041B (en
Inventor
贺瑞芳
王浩成
刘焕宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Abstract

The invention discloses a microblog topic detection method based on message passing and graph prior distribution, comprising the following steps: (1) constructing a user-level social network from a microblog corpus according to the interaction relations among users; (2) learning message-passing-based user node embeddings: a graph convolutional network integrates the content information and structural information of posts in social media and encodes the interaction relations among users into the user node embeddings; (3) generating topics with a graph-prior variational autoencoder: the user node embeddings, which already encode the user interaction relations, are used as input, the standard Gaussian prior of the variational autoencoder is replaced by a graph prior distribution that contains the user interactions, and the correlation among users is thereby taken into account during topic inference. Overall, user interactions are integrated at two stages: user node embedding and topic inference. Topics detected by the method better reflect the correlation among users and achieve higher coherence.

Description

Microblog topic detection method based on message passing and graph prior distribution
Technical Field
The invention relates to the technical field of natural language processing and social media data mining, in particular to a microblog topic detection method based on message passing and graph prior distribution.
Background
The rapid development of the internet has greatly changed our lives. The popularity of social media gives everyone a platform for posting opinions and views. As a result, a large amount of short text is generated every day. Analyzing the topics in these short texts is an important task, but manual analysis is time-consuming and labor-intensive. A topic model can automatically extract the document-topic distribution and the topic-word distribution, helping people analyze texts and grasp their content quickly.
Traditional topic models, such as LDA, are widely used to find potential topics from a text corpus. Essentially, these methods reveal underlying topics by implicitly capturing word co-occurrence patterns. However, they face a severe data sparseness problem (i.e., sparse post-level word co-occurrence patterns) when applied to short posts.
Several lines of research have addressed this problem with some success: (1) Aggregation-based methods: some studies aggregate multiple posts using heuristic strategies, such as aggregation by author relationship or by conversation relationship; other methods, such as BTM, directly model the generation process of biterms (i.e., word pairs). (2) Representation-learning-based methods: some methods reveal topics by modeling the co-occurrence patterns of latent concepts, and others effectively fuse the context information of words. (3) Social-context-based methods: such methods jointly model textual information and social network structure information, for example by modeling the social network structure and dividing messages into leader messages and follower messages. However, the standard methods for learning probabilistic generative models, such as variational techniques and Gibbs sampling, incur high computational complexity in posterior inference, which prevents these methods from being applied to complex social media scenarios.
The variational autoencoder (VAE) is a common parameter inference framework for topic detection; it can identify the structure of data and learn its latent distribution. NVDM is a typical VAE-based topic model. It feeds each document independently into the inference network and computes the mean and variance of the topic posterior distribution. A latent topic vector is then sampled from the variational posterior, and finally the input document is reconstructed by the generative network. NVDM is designed for long documents. IATM is a classical VAE-based neural topic model for social media topic detection. It takes multiple short posts as input and learns edge embeddings in the social network by mining dynamic user interactions. The edge embeddings are also fed independently into the VAE to infer the corpus-level topic-word distribution. In essence, IATM integrates representation learning and social context on top of a VAE.
Although the previous approach embeds user interactions into the edge representation, the VAE assumes that each data point is independent. The correlation between users or posts is therefore weakened when the latent semantic vectors are computed. In a social network, an interaction may indicate related relationships or shared interests, and the latent semantic vector is crucial for topic inference. It is therefore more reasonable to integrate the interaction features among users into the latent semantic vector.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a microblog topic detection method based on message passing and graph prior distribution. User interaction information in social media is considered at both the user node embedding stage and the topic inference stage. In the user node embedding stage, a graph convolutional network learns user node embeddings that integrate the social network structure information and the post content information, so that the interaction relations of users are encoded into the embeddings. In the topic inference stage, a graph prior distribution is introduced and the interaction relations are incorporated into the prior distribution of the VAE, so that the latent topic vectors of users contain the interaction relations. Finally, VAE inference yields a topic distribution that takes user correlation into account and produces topics with higher coherence.
The purpose of the invention is realized by the following technical scheme:
A microblog topic detection method based on message passing and graph prior distribution, characterized by comprising the following steps:
(1) constructing a user-level social network: users are taken as network nodes and interaction relations as edges of the network;
(2) encoding user interactions through a message passing mechanism: a graph neural network is introduced, its message passing mechanism integrates the content information and structural information of posts in social media, and the interaction relations among users are encoded into the user node embeddings;
(3) generating topics with a graph-prior variational autoencoder: the user node embeddings that encode the user interaction relations are used as input, the standard Gaussian prior of a variational autoencoder (VAE) is replaced by a graph prior distribution containing the user interaction relations, and the correlation among users is taken into account during topic inference.
Further, the step (1) specifically comprises:
A user-level social network $G = (V, E, T)$ is constructed according to the forwarding and comment relations among users, where $V = \{v_i \mid 1 \le i \le n\}$ is the node set, $v_i$ represents the $i$-th user in the social network, and $n$ is the number of users; $E = \{e_{ij} \mid 1 \le i, j \le n\}$ is the edge set: if user $i$ (represented by $v_i$) and user $j$ (represented by $v_j$) have interacted, then $e_{ij} = 1$; if they have never interacted, then $e_{ij} = 0$. The posts published by a user serve as the attribute information of that user's node; $T = \{t_1, t_2, \ldots, t_n\}$ is the set of posts, where $t_i$ represents the post content of the $i$-th user. To relieve data sparsity, a user-based aggregation strategy aggregates all posts of a user, including source posts, forwarded posts and reply messages. The adjacency matrix $A$ of the user-level social network is obtained from the interaction relations among users. According to the post set $T$, each word in a post is replaced with its word embedding to obtain the attribute vector of each user and hence the attribute matrix $X$ of the social network; the word embedding of each word is randomly initialized.
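The construction in step (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_user_network`, the input formats, and the choice to treat interactions as undirected (a symmetric adjacency matrix) are hypothetical and not taken from the patent.

```python
from collections import defaultdict

def build_user_network(posts, interactions):
    """Build a user-level network (hypothetical helper).

    posts: list of (user_id, text) pairs; source posts, forwards and replies alike.
    interactions: list of (user_a, user_b) pairs, one per forward/comment relation.
    Returns (users, adjacency, documents).
    """
    # User-based aggregation: collect every post of a user into one document t_i.
    aggregated = defaultdict(list)
    for user, text in posts:
        aggregated[user].append(text)

    users = sorted(aggregated)                       # fixed node order v_1..v_n
    index = {user: i for i, user in enumerate(users)}
    n = len(users)

    # 0/1 adjacency matrix A: e_ij = 1 if users i and j ever interacted.
    # Assumption: interactions are treated as undirected, so A is symmetric.
    adjacency = [[0] * n for _ in range(n)]
    for a, b in interactions:
        if a in index and b in index and a != b:
            i, j = index[a], index[b]
            adjacency[i][j] = adjacency[j][i] = 1

    documents = [" ".join(aggregated[user]) for user in users]
    return users, adjacency, documents
```

Interactions that mention a user with no posts are skipped here; whether such users should instead be kept with an empty document is a design choice the text does not settle.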
Further, the step (2) specifically comprises:
User node embeddings are learned using network embedding techniques. Each post in a social network is short and informal in expression, so representation learning for posts is important. Using only bag-of-words (BoW) vectors as the representation of user nodes suffers from data sparsity, which degrades topic inference. According to social correlation theory, friends tend to pay attention to similar content, so data sparsity can be mitigated by modeling user interactions in the social network. Because a graph convolutional network (GCN) can aggregate information from surrounding nodes, a GCN is used to model the interaction behavior among friends and learn the user node embeddings. Specifically, the microblog topic detection method adopts a two-layer GCN, as shown in the following formulas:
Figure BDA0003251221360000031
Figure BDA0003251221360000032
Figure BDA0003251221360000033
where
Figure BDA0003251221360000034
$I$ is the identity matrix (a diagonal matrix whose diagonal elements are all 1);
Figure BDA0003251221360000035
is the degree matrix of the adjacency matrix; $X$ is the attribute matrix; $W_1$ and $W_2$ are the parameters of the graph convolutional network; ReLU is used as the activation function, and $H_2$ is the matrix formed by the embeddings of all user nodes in the social network;
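Because the formula images are not reproduced in this text, the following NumPy sketch assumes the widely used GCN propagation rule: self-loops are added ($\tilde{A} = A + I$), the result is symmetrically normalized by the degree matrix $\tilde{D}$, ReLU is applied after the first layer, and the second layer is left linear. These details, and the function names, are assumptions for illustration rather than the patent's exact formulas (1)-(3).

```python
import numpy as np

def normalize_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2}, the symmetric normalization assumed here."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    degree = A_tilde.sum(axis=1)                # diagonal of the degree matrix D~
    d_inv_sqrt = 1.0 / np.sqrt(degree)          # degree >= 1 thanks to self-loops
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(A, X, W1, W2):
    """Two-layer GCN: H1 = ReLU(A_hat X W1), H2 = A_hat H1 W2 (assumed form)."""
    A_hat = normalize_adjacency(A)
    H1 = np.maximum(A_hat @ X @ W1, 0.0)        # ReLU activation
    H2 = A_hat @ H1 @ W2                        # one embedding row per user node
    return H2
```

Each multiplication by the normalized adjacency matrix mixes a user's attributes with those of its direct neighbors, which is the message passing step the text relies on to offset sparse per-user content.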
Topic detection is an unsupervised task, so no labels are available for training the graph convolutional network. The microblog topic detection method therefore uses an unsupervised loss function, as shown in the following formulas:
Figure BDA0003251221360000036
Figure BDA0003251221360000037
Given a user $v_i$, the goal is to maximize the similarity between $v_i$ and its related user nodes $v_j \in N_i$, where the related user nodes are those in the first-order neighbor set $N_i$, i.e., the nodes directly connected to $v_i$ by an edge in the social network. In formula (5), $h_i$ is the embedding of user node $i$ in $H_2$, $h_j$ is the embedding of user node $j$ in $H_2$, $h_u$ is the embedding of user node $u$ in $H_2$, and $v_u \in V$ ranges over all user nodes in the social network.
Through the message passing mechanism of the GCN, the related content of first-order neighbor users is propagated into the attributes of the connected users, which compensates for the data sparsity of a single user; at the same time, the embeddings of connected user nodes become more similar, which preserves the correlation between friends in the social network.
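One plausible reading of this objective, given that formula (5) involves $h_i$, $h_j$ and a term over all nodes $v_u \in V$, is a softmax over inner products with a negative log-likelihood summed over first-order neighbors. The sketch below implements that reading; the exact form of formulas (4) and (5) is not available in this text, so both the loss shape and the function name are assumptions.

```python
import numpy as np

def neighbor_softmax_loss(H, A):
    """Negative log-likelihood of first-order neighbors under a softmax over all nodes.

    H: (n, d) user node embeddings (row i is h_i); A: (n, n) 0/1 adjacency matrix.
    Assumed form: p(v_j | v_i) = exp(h_j . h_i) / sum_u exp(h_u . h_i).
    """
    scores = H @ H.T                                      # scores[i, u] = h_i . h_u
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Sum -log p(v_j | v_i) over every pair (i, j) with v_j in N_i.
    return -(A * log_prob).sum()
```

Minimizing this quantity pulls the embeddings of connected users together relative to all other users, which matches the stated aim of keeping friends similar.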
Further, the step (3) specifically comprises:
Step (2) encodes the interaction relations among users into the user node embeddings, which serve as the input of the graph-prior variational autoencoder in step (3). A variational autoencoder with a standard Gaussian prior comprises an encoder and a decoder: the encoder computes the mean and variance of the topic posterior distribution, a topic vector is sampled from this posterior through the reparameterization trick, and the topic distribution is obtained through softmax; the decoder then reconstructs each user node embedding.
A variational autoencoder with a standard Gaussian prior can infer latent topics from independent long documents. For multi-user input, however, it assumes that the users are independent, which weakens the correlation among users during topic inference: the standard Gaussian prior of the VAE makes the data points independent. The microblog topic detection method therefore constructs a graph prior distribution to replace the standard Gaussian distribution. The graph prior distribution contains the user interaction relations, so that the topic vectors of the users follow the corresponding interaction relations among users in the social network. The graph prior distribution is shown in the following formula:
Figure BDA0003251221360000041
where $z_i$ and $z_j$ are the latent topic vectors of users $v_i$ and $v_j$, and $p_s(z_i)$ is a standard Gaussian distribution;
Figure BDA0003251221360000042
takes the following form:
Figure BDA0003251221360000043
$\alpha$ is a hyperparameter and $I$ is the identity matrix. Based on the graph prior distribution, a new variational lower bound is obtained for the graph-prior variational autoencoder:
Figure BDA0003251221360000044
where the variational distribution $q(z_i, z_j \mid h_i, h_j)$ takes the following form:
Figure BDA0003251221360000045
where $\mu_i$, $\mu_j$ and
Figure BDA0003251221360000046
are the means and variances of the variational distribution, and $c_{ij}$ is the correlation coefficient of $z_i$ and $z_j$. The final loss function is as follows:
Figure BDA0003251221360000047
The graph-prior variational autoencoder derived from the loss function consists of three parts: 1) a variational network, which takes the user node embedding $[h_i]$ as input and computes the mean $\mu_i$ and variance
Figure BDA0003251221360000048
2) a correlation encoding network, which takes a pair of user node embeddings $[h_i, h_j]$ as input and computes the correlation coefficient $c_{ij}$ of the latent topic vectors of the two users; 3) a generative network, which, as in a variational autoencoder with a standard Gaussian prior, takes the latent variable $z_i$ as input and reconstructs the original user node embedding to obtain $h'_i$. Overall, the method preserves the interactions among users at both the user node embedding stage and the topic inference stage and takes the correlation between friends into account, thereby obtaining more coherent topics.
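To make the role of $\alpha$ and $c_{ij}$ concrete, the sketch below computes the KL divergence between a pairwise variational distribution and a pairwise prior for one connected pair of users. It assumes that the pairwise prior is a zero-mean Gaussian with unit variances and cross-covariance $\alpha I$, and that the variational distribution is Gaussian with per-dimension correlation $c_{ij}$. Both forms are assumptions, since the formula images are not available in this text, and `pair_kl` is a hypothetical name.

```python
import numpy as np

def pair_kl(mu_i, mu_j, sigma_i, sigma_j, c_ij, alpha):
    """KL( q(z_i, z_j) || p(z_i, z_j) ) summed over latent dimensions.

    Per dimension, q is a 2-D Gaussian with means (mu_i, mu_j), standard
    deviations (sigma_i, sigma_j) and correlation c_ij; p is a zero-mean
    2-D Gaussian with unit variances and correlation alpha (assumed form).
    """
    prior_cov = np.array([[1.0, alpha], [alpha, 1.0]])
    prior_inv = np.linalg.inv(prior_cov)
    total = 0.0
    for m_i, m_j, s_i, s_j, c in zip(mu_i, mu_j, sigma_i, sigma_j, c_ij):
        mean = np.array([m_i, m_j])
        cov = np.array([[s_i ** 2, c * s_i * s_j],
                        [c * s_i * s_j, s_j ** 2]])
        # Closed-form KL between two bivariate Gaussians.
        total += 0.5 * (
            np.trace(prior_inv @ cov)
            + mean @ prior_inv @ mean
            - 2.0
            + np.log(np.linalg.det(prior_cov) / np.linalg.det(cov))
        )
    return total
```

Under these assumptions the term is zero exactly when the variational pair matches the prior, so a nonzero $\alpha$ rewards correlated topic vectors for users who interact, while $\alpha = 0$ and $c_{ij} = 0$ recover the usual independent standard Gaussian KL term.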
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. To relieve the data sparsity problem in social media topic detection, the method considers both the post text content and the social network structure and integrates the user interaction relations. The structural information serves as a supplement that enriches the context information in the social network;
2. To introduce user correlation, the user interaction relations are integrated at two stages, user node embedding and topic inference, so that user correlation is considered throughout the whole process of social media topic detection;
3. In the user node embedding stage, the message passing mechanism of the graph convolutional network aggregates the information of the friends around each user, which relieves sparsity, and integrates the social network structure into the user node embeddings, which preserves the user interaction relations in them;
4. In the topic inference stage, topics are inferred with a variational autoencoder based on a graph prior. Unlike a conventional VAE, the method replaces the standard Gaussian distribution with a graph prior distribution. The graph prior distribution takes user interactions into account, so that the latent topic vectors of the users follow the interaction structure among users, and the inferred topics are more coherent;
5. Experimental results on three months of Sina Weibo data demonstrate the effectiveness of the method and of the introduced graph prior distribution for microblog topic mining.
Drawings
FIG. 1 is a schematic flow diagram of the method of the invention.
FIG. 2a visualizes the topic vectors inferred by a standard variational autoencoder with a standard Gaussian prior; FIG. 2b visualizes the topic vectors inferred by the graph-prior variational autoencoder.
FIG. 3 shows how the topic coherence evaluation metric varies with the parameter $\alpha$ of the graph prior distribution in the specific embodiment.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
A specific implementation of the invention is described using a real microblog data set spanning three months. The overall algorithm comprises three steps: constructing a user-level social network, learning message-passing-based user node embeddings, and generating topics with a graph-prior variational autoencoder, as shown in FIG. 1. The specific steps are as follows:
S1, constructing a user-level social network:
Earlier work collected microblogs from Sina Weibo covering 50 hot topics over three months: May, June and July 2014. In this embodiment, a user-level social network is constructed from this microblog corpus in the following steps: 1) users without any forwarding or comment relation are filtered out; 2) all posts of a user are concatenated to serve as the attribute information of that user; 3) if two users have an interaction relation, an edge is placed between the two user nodes, and otherwise no edge exists. The post text of a user serves as the attribute information of the user node in the social network.
Table 1 shows the statistics of the three monthly data sets: the May data set contains 8907 users and 10435 interactions, with a vocabulary size of 5914; the June data set contains 19293 users and 35962 interactions, with a vocabulary size of 9368; the July data set contains 16990 users and 20971 interactions, with a vocabulary size of 9663.
TABLE 1 microblog data set statistics
Figure BDA0003251221360000061
S2, message-passing-based user node embedding:
Using only bag-of-words (BoW) vectors as the representation of user nodes suffers from data sparsity, which degrades topic inference. Because each post is short and informal in expression, representation learning for posts in social networks is important. Because a graph convolutional network can aggregate information from surrounding nodes, a two-layer GCN is used to model the interaction behavior between friends and learn the user node embeddings. Through the message passing mechanism of the GCN, the related content of neighbor users is propagated into the attributes of the connected users, which compensates for the data sparsity of a single user. At the same time, the embeddings of connected user nodes become more similar, which preserves the correlation between friends in the social network. The loss function for this step is shown in the following formulas:
Figure BDA0003251221360000062
Figure BDA0003251221360000063
S3, generating topics with a graph-prior variational autoencoder:
With the user node embeddings as input, topics are inferred using the variational autoencoder. The variational autoencoder comprises an encoder and a decoder; the encoder computes the mean and variance of the topic posterior distribution as follows:
$\mu_i = \mathrm{MLP}(h_i)$
Figure BDA0003251221360000071
where $h_i$ is the embedding of the $i$-th user node, and $\mu_i$ and
Figure BDA0003251221360000072
denote the mean and variance, respectively. MLP stands for multi-layer perceptron. Through the reparameterization trick $z_i = \mu_i + \epsilon * \sigma_i$, the latent topic vector $z_i$ can be sampled from the posterior distribution. The topic distribution $\theta = (p(t_1 \mid h), p(t_2 \mid h), \ldots, p(t_K \mid h))$ is obtained through the softmax function:
$\theta_i = \mathrm{softmax}(z_i)$
where $h$ is the input user representation, $t_1$ is the first topic, $p(t_1 \mid h)$ is the probability of the first topic, and $K$ is the total number of topics. Each user node embedding is reconstructed by the decoder network, which is also an MLP. The parameter $W$ of the decoder is the topic-word distribution of the corpus, $\phi_{word} = (p(w \mid t_1), p(w \mid t_2), \ldots, p(w \mid t_K))$. The specific formulas are as follows:
$d_i = \mathrm{softmax}(\theta_i W)$
$h'_i = f(W_d d_i + b_d)$
where $p(w \mid t_1)$ is the probability of each word under the first topic, $d_i$ is the probability of each word appearing in the attribute information of user node $i$, $W_d$ denotes the parameters of the neural network, $b_d$ denotes its bias, and $h'_i$ is the user node embedding reconstructed by the decoder.
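The encoder and decoder equations above can be traced end to end with a small NumPy sketch. Several details are assumptions made only for illustration: single linear maps stand in for the MLPs, the variance is parameterized through its logarithm, $\epsilon$ is drawn from a standard normal, $f$ is taken to be tanh, and vectors are treated as rows so that $W_d d_i$ is written `word_probs @ W_d`.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def vae_forward(h, enc_mu, enc_logvar, W, W_d, b_d, rng):
    """One forward pass for a single user embedding h of shape (dim,).

    enc_mu, enc_logvar: (dim, K) matrices standing in for the MLP encoders.
    W: (K, V) topic-word parameters; W_d: (V, dim) and b_d: (dim,) map the
    word probabilities back to the embedding space.
    """
    mu = h @ enc_mu                             # mean of the topic posterior
    sigma = np.exp(0.5 * (h @ enc_logvar))      # std from assumed log-variance output
    eps = rng.standard_normal(mu.shape)         # noise for the reparameterization trick
    z = mu + eps * sigma                        # z_i = mu_i + eps * sigma_i
    theta = softmax(z)                          # topic distribution theta_i
    word_probs = softmax(theta @ W)             # d_i = softmax(theta_i W)
    h_rec = np.tanh(word_probs @ W_d + b_d)     # h'_i = f(W_d d_i + b_d), f = tanh assumed
    return theta, word_probs, h_rec
```

Sampling through `mu + eps * sigma` keeps the draw differentiable with respect to the encoder outputs, which is what allows the whole pipeline to be trained with gradient-based optimizers.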
A variational autoencoder with a standard Gaussian prior can infer latent topics from independent long documents. For multi-user input, however, it assumes that the users are independent, which weakens the correlation among users during topic inference. The invention therefore first constructs a graph prior distribution to replace the standard Gaussian. The graph prior distribution contains the user interaction relations, so that the topic vector of each user follows the corresponding relations among users in the social network. The loss function is then computed from the new variational lower bound as follows:
Figure BDA0003251221360000073
The symbols in the formula have the meanings described above.
In the specific implementation, the post text of each user node is first preprocessed. After aggregation, the post text of each user contains 50 words. Word embeddings are randomly initialized with dimension 200. In the GCN, the hidden layer dimension is set to 200. In the variational autoencoder, the dimension of the first encoder layer is set to 200. The learning rate is set to 0.001. Dropout is applied in both the GCN and the associated VAE to avoid overfitting, and Adam is used to optimize the loss function of each module.
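For reference, the stated settings can be collected in one configuration object. The class name `MGTMConfig` and the field names are hypothetical; the dropout rate is not given in the text, so it is left unset rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MGTMConfig:
    """Hyperparameters as stated in the embodiment (names are illustrative)."""
    words_per_user: int = 50          # words in each user's aggregated post text
    embedding_dim: int = 200          # randomly initialized word embeddings
    gcn_hidden_dim: int = 200         # hidden layer of the GCN
    encoder_hidden_dim: int = 200     # first encoder layer of the VAE
    learning_rate: float = 0.001      # used with the Adam optimizer
    dropout: Optional[float] = None   # rate not stated in the text; left unset
    num_topics: int = 50              # K; the experiments report K = 50 and K = 100
```

A run with one hundred topics would then be configured as `MGTMConfig(num_topics=100)`.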
To verify the effectiveness of the method of the invention, the method (MGTM) is compared with currently advanced and representative methods (BAT [1], BTM [2], LCTM [3], LeadLDA [4], AdjEnc [5], IATM [6]) and with a variant of the method, MGTM (Standard Gaussian).
BAT explores the application of bidirectional adversarial training in neural topic models. It is designed for long documents and faces a severe data sparsity problem when applied to short documents.
BTM learns topics by directly modeling the generation of word pairs throughout the corpus.
LCTM reveals topics by modeling the co-occurrence patterns of latent concepts, which are used to capture the conceptual similarity of words.
LeadLDA divides posts into leader posts and follower posts, which contain key topic words to different degrees.
AdjEnc introduces network structure into topic inference for structured long documents such as academic papers and web pages.
IATM models the dynamic interactions of users to learn interaction-aware edge embeddings and generates topics using neural variational inference.
MGTM (Standard Gaussian) falls back to a standard Gaussian distribution as the prior and is used to verify the effect of the graph prior distribution.
Model performance is evaluated with topic coherence, computed as follows:
Figure BDA0003251221360000081
Tables 2, 3 and 4 show the topic coherence results of the method and all comparison methods on the three monthly microblog data sets. For each data set, the coherence scores of the top 10 (N = 10), top 15 (N = 15) and top 20 (N = 20) words of the inferred topics are recorded for topic numbers K = 50 and K = 100. A higher topic coherence score indicates better model performance.
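The coherence formula itself is an image that is not reproduced in this text. As an illustration only, the sketch below assumes a document co-occurrence coherence of the kind often used for short-text topic models: for each pair of top words, the log of the smoothed co-document frequency divided by the document frequency of the higher-ranked word, averaged over all pairs and topics. Whether the patent uses exactly this definition is not confirmed.

```python
import math
from itertools import combinations

def topic_coherence(topics, documents):
    """Average log co-occurrence coherence over topics (assumed definition).

    topics: list of top-N word lists, most probable word first.
    documents: list of token lists forming the reference corpus.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    total, pairs = 0.0, 0
    for words in topics:
        # combinations() keeps rank order, so w_hi is ranked above w_lo.
        for w_hi, w_lo in combinations(words, 2):
            denom = doc_freq(w_hi)
            if denom == 0:
                continue  # skip words absent from the reference corpus
            total += math.log((doc_freq(w_hi, w_lo) + 1) / denom)
            pairs += 1
    return total / pairs if pairs else 0.0
```

Under this definition scores are at most slightly above zero and usually negative, and a topic whose top words frequently share documents scores higher, consistent with the statement that higher coherence indicates a better model.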
TABLE 2 comparison of Performance of the inventive and comparative methods on a 5 month dataset
Figure BDA0003251221360000082
Figure BDA0003251221360000091
TABLE 3 comparison of Performance of the inventive and comparative methods on a 6 month dataset
Figure BDA0003251221360000092
TABLE 4 comparison of Performance of the inventive and comparative methods on a 7 month dataset
Figure BDA0003251221360000093
The topic coherence results in Tables 2, 3 and 4 show that modeling the user interaction relations at both the user node embedding stage and the topic inference stage lets the topics capture user correlation and further improves topic coherence. To examine whether the graph prior distribution helps preserve user interactions in the users' latent topic vectors, FIG. 2a and FIG. 2b visualize the latent topic vectors: FIG. 2a shows the topic vectors inferred by the variational autoencoder with a standard Gaussian prior, and FIG. 2b shows those inferred by the graph-prior variational autoencoder. As the circled regions show, the method of the invention yields user topic vectors that cluster more tightly. To further study the influence of the parameter $\alpha$ of the graph prior distribution on topic coherence, FIG. 3 shows how the topic coherence score of the method changes with $\alpha$ on the three monthly microblog data sets.
The above contents are intended to schematically illustrate the technical solution of the present invention, and the present invention is not limited to the above described embodiments. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.
References:
[1] Rui Wang, Xuemeng Hu, Deyu Zhou, Yulan He, Yuxuan Xiong, Chenchen Ye, and Haiyang Xu. 2020. Neural Topic Modeling with Bidirectional Adversarial Training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 340-350.
[2] Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. 2013. A biterm topic model for short texts. In Proceedings of the 22nd international conference on World Wide Web. ACM. 1445-1456.
[3] Weihua Hu and Jun’ichi Tsujii. 2016. A Latent Concept Topic Model for Robust Topic Inference Using Word Embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 380-386.
[4] Jing Li, Ming Liao, Wei Gao, Yulan He, and Kam-Fai Wong. 2016. Topic Extraction from Microblog Posts Using Conversation Structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2114-2123.
[5] Ce Zhang and Hady W. Lauw. 2020. Topic Modeling on Document Networks with Adjacent-Encoder. Proceedings of the AAAI Conference on Artificial Intelligence 34, 04 (2020), 6737-6745.
[6] Ruifang He, Xuefei Zhang, Di Jin, Longbiao Wang, Jianwu Dang, and Xiangang Li. 2018. Interaction-Aware Topic Model for Microblog Conversations through Network Embedding and User Attention. In Proceedings of the 27th International Conference on Computational Linguistics. 1398-1409.
The foregoing description of the specific embodiments is intended to describe and illustrate the technical solutions of the present invention; the specific embodiments are merely illustrative and not restrictive.

Claims (4)

1. A microblog topic detection method based on message passing and graph prior distribution is characterized by comprising the following steps:
(1) constructing a user-level social network: taking users as network nodes and the interaction relationships among them as edges of the network;
(2) encoding user interactions through a message passing mechanism: introducing a graph neural network, using the message passing mechanism to integrate the content information and structural information of posts in social media, and embedding the interaction relationships among users into the user node embedded representations;
(3) generating topics with a graph prior variational autoencoder: taking the user node embedded representations that incorporate the user interaction relationships as input, and replacing the standard Gaussian prior of a variational autoencoder (VAE) with a graph prior distribution that contains the user interaction relationships, so that the correlation among users is considered during topic inference.
2. The microblog topic detection method based on message passing and graph prior distribution as claimed in claim 1, wherein the step (1) specifically comprises:
constructing a user-level social network $G = (V, E, T)$ according to the forwarding and comment relationships among users; wherein $V = \{v_i \mid 1 \le i \le n\}$ is the node set, $v_i$ represents the $i$-th user in the social network, and $n$ represents the number of users; $E = \{e_{ij} \mid 1 \le i, j \le n\}$ is the edge set, where $e_{ij} = 1$ if user $i$ represented by $v_i$ and user $j$ represented by $v_j$ have interacted, and $e_{ij} = 0$ if they have never interacted; the posts published by a user serve as the attribute information of the user node; $T = \{t_1, t_2, \ldots, t_n\}$ is the post set, where $t_i$ represents the content of the posts of the $i$-th user; to alleviate data sparsity, a user-based aggregation strategy is adopted to aggregate all posts of each user, including source posts, forwarded posts and replies; the adjacency matrix $A$ of the user-level social network is obtained from the interaction relationships among users; according to the post set $T$, each word in a post is replaced with its corresponding word embedding to obtain the attribute vector of each user, and thereby the attribute matrix $X$ of the social network; the word embedding of each word is obtained by random initialization.
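The construction in this claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `build_social_network`, the whitespace tokenization, the undirected treatment of interactions, and the mean pooling of word embeddings into one attribute vector per user are all assumptions made for illustration, since the claim does not fix these details.

```python
import numpy as np

def build_social_network(user_posts, interactions, embed_dim=8, seed=0):
    """Build the adjacency matrix A and attribute matrix X of a user-level network.

    user_posts:   dict user_id -> list of posts (source posts, forwards, replies)
    interactions: iterable of (user_a, user_b) forwarding/comment pairs
    """
    rng = np.random.default_rng(seed)
    users = sorted(user_posts)
    index = {u: k for k, u in enumerate(users)}
    n = len(users)

    # e_ij = 1 if users i and j have interacted, 0 otherwise.
    # Assumption: interactions are treated as undirected edges.
    A = np.zeros((n, n))
    for a, b in interactions:
        if a != b:
            A[index[a], index[b]] = 1.0
            A[index[b], index[a]] = 1.0

    # User-based aggregation: all posts of one user form a single document t_i.
    # Assumption: whitespace tokenization (real microblog text needs a segmenter).
    docs = {u: " ".join(user_posts[u]).split() for u in users}

    # Randomly initialized word embeddings, one vector per vocabulary word.
    vocab = sorted({w for words in docs.values() for w in words})
    word_vec = {w: rng.normal(size=embed_dim) for w in vocab}

    # Assumption: a user's attribute vector is the mean of its word embeddings.
    X = np.zeros((n, embed_dim))
    for u in users:
        if docs[u]:
            X[index[u]] = np.mean([word_vec[w] for w in docs[u]], axis=0)
    return users, A, X
```

A user without any post keeps an all-zero attribute row in this sketch, and a user without interactions keeps an all-zero row in `A`.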
3. The microblog topic detection method based on message passing and graph prior distribution as claimed in claim 1, wherein the step (2) specifically comprises:
learning user node embedded representations using a network embedding technique; alleviating data sparsity by modeling user interactions in the social network; in view of the ability of the graph convolutional network (GCN) to aggregate information from neighboring nodes, the GCN is used to model the interaction behavior among friends and to learn the user node embedded representations; specifically, the microblog topic detection method adopts a two-layer GCN, as shown in the following formulas:
[Formulas (1) to (3): equation images in the original; not reproduced in this text]
wherein, in the definition given by a further equation image (not reproduced), $I$ represents the identity matrix, i.e. a diagonal matrix whose diagonal elements are all 1; the symbol given by another equation image (not reproduced) represents the degree matrix of the adjacency matrix; $X$ represents the attribute matrix; $W_1$ and $W_2$ are the parameters of the graph convolutional network; ReLU is used as the activation function; and H2 is the matrix formed by the embedded representations of all user nodes in the social network;
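The formulas for the two-layer GCN are not legible in this text, so the following NumPy sketch only illustrates one plausible reading of the surrounding definitions. It assumes the commonly used symmetric normalization of the adjacency matrix with self-loops, a ReLU after the first layer, and no activation after the second layer; the function names are hypothetical and the exact form in the patent may differ.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])      # add self-loops via the identity matrix I
    deg = A_tilde.sum(axis=1)             # degrees are >= 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_two_layer(A, X, W1, W2):
    """Two-layer GCN forward pass producing the user node embedding matrix H2."""
    A_hat = normalize_adjacency(A)
    H1 = np.maximum(A_hat @ X @ W1, 0.0)  # first layer with ReLU activation
    H2 = A_hat @ H1 @ W2                  # second layer (no activation: an assumption)
    return H2
```

Each multiplication by the normalized adjacency matrix mixes a user's attributes with those of its first-order neighbors, which is the message passing effect the claim relies on to offset the sparsity of a single user's posts.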
the microblog topic detection method uses an unsupervised loss function, as shown in the following formulas:
[Formulas (4) and (5): equation images in the original; not reproduced in this text]
given a user $v_i$, the objective is to maximize the similarity between $v_i$ and its related user nodes $v_j \in N_i$; the related user nodes are the user nodes in the first-order neighbor set $N_i$, i.e. those directly connected to $v_i$ by an edge in the social network; in formula (5), $h_i$ is the embedded representation of user node $i$ in H2, $h_j$ is the embedded representation of user node $j$ in H2, $h_u$ is the embedded representation of user node $u$ in H2, and $v_u \in V$ ranges over all user nodes in the social network;
based on the message passing mechanism of the GCN, the relevant content of first-order neighbor users is propagated into the attributes of the connected users, compensating for the data sparsity of individual users; meanwhile, the embedded representations of connected user nodes become more similar, which preserves the relevance among friends in the social network.
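As a hedged illustration of the unsupervised objective described above, the sketch below assumes that the similarity between a user and a neighbor is a softmax over inner products of embeddings, normalized over all user nodes, and that the loss is the negative log-likelihood summed over all edges. This form is an assumption inferred from the symbol definitions ($h_i$, $h_j$, $h_u$, $N_i$), not a transcription of formulas (4) and (5); the function name is hypothetical.

```python
import numpy as np

def neighbor_softmax_loss(H, A):
    """Negative log-likelihood of first-order neighbors under a softmax over all nodes.

    H: (n, d) matrix of user node embeddings (rows h_i)
    A: (n, n) 0/1 adjacency matrix; A[i, j] = 1 means v_j is in N_i
    """
    scores = H @ H.T                                      # inner products h_i . h_u
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilize the softmax
    log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -(A * log_prob).sum()                          # sum over i and over j in N_i
```

Minimizing this quantity pushes the embeddings of connected users toward each other relative to all other users, which matches the stated goal of maximizing similarity within the first-order neighbor set.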
4. The microblog topic detection method based on message passing and graph prior distribution as claimed in claim 1, wherein the step (3) specifically comprises:
in step (2), the interaction relationships among users are encoded into the user node embedded representations, which serve as the input of the graph prior variational autoencoder in step (3); a variational autoencoder that uses a standard Gaussian distribution as the prior comprises an encoder and a decoder, wherein the encoder computes the mean and variance of the topic posterior distribution, a topic vector is sampled from the topic posterior distribution via the reparameterization trick, and the topic distribution is obtained through softmax; each user node embedded representation is then reconstructed by the decoder;
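The encoder path described here (mean and variance, reparameterized sampling, softmax) can be sketched as follows. The single linear layers `W_mu` and `W_logvar` and the log-variance parameterization are assumptions chosen to keep the example short; the claim does not specify the encoder architecture.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def infer_topics(H, W_mu, W_logvar, rng):
    """Map user node embeddings H to latent topic vectors z and topic distributions.

    Assumption: the encoder is one linear map per statistic (mean, log-variance).
    """
    mu = H @ W_mu                           # mean of the topic posterior
    logvar = H @ W_logvar                   # log-variance of the topic posterior
    eps = rng.standard_normal(mu.shape)     # noise drawn independently of the parameters
    z = mu + np.exp(0.5 * logvar) * eps     # reparameterization trick
    theta = softmax(z)                      # topic distribution of each user
    return mu, logvar, z, theta
```

Because the randomness enters only through `eps`, the sample `z` is a differentiable function of `mu` and `logvar`, which is what makes gradient-based training of the encoder possible.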
in the microblog topic detection method, a graph prior distribution is constructed to replace the standard Gaussian distribution; the graph prior distribution contains the user interaction relationships, so that the topic vectors of all users obey the corresponding interaction relationships among users in the social network; the graph prior distribution is shown in the following formula:
[Formula: graph prior distribution; equation image in the original, not reproduced in this text]
wherein $z_i$ and $z_j$ are the latent topic vectors of users $v_i$ and $v_j$; $p_s(z_i)$ is the single-variable marginal distribution, for which a standard Gaussian distribution is used; the distribution denoted by an inline equation image (not reproduced) is the pairwise marginal distribution of $z_i$ and $z_j$, which takes the following form:
[Formula: pairwise marginal distribution; equation image in the original, not reproduced in this text]
wherein $\alpha$ is a hyper-parameter and $I$ represents the identity matrix; based on the graph prior distribution, a new lower bound for the graph prior variational autoencoder is obtained, as follows:
[Formula: lower bound; equation image in the original, not reproduced in this text]
wherein the variational distribution $q(z_i, z_j \mid h_i, h_j)$ takes the following form:
[Formula: variational distribution; equation image in the original, not reproduced in this text]
wherein $\mu_i$, $\mu_j$ and $\sigma_i^2$, $\sigma_j^2$ are the means and variances of the variational distribution, and $c_{ij}$ is the correlation coefficient of $z_i$ and $z_j$; the final loss function is as follows:
[Formula: final loss function; equation image in the original, not reproduced in this text]
the graph prior variational autoencoder obtained from the loss function consists of the following three parts: 1) a variational network, which takes the user node embedded representation $[h_i]$ as input and computes the mean $\mu_i$ and the variance $\sigma_i^2$; 2) a correlation encoding network, which takes a pair of user node embedded representations $[h_i, h_j]$ as input and computes the correlation coefficient $c_{ij}$ of the latent topic vectors of the two users; 3) a generative network, which, as in a variational autoencoder that uses a standard Gaussian distribution as the prior, takes the latent variable $z_i$ as input and reconstructs the original user node embedded representation to yield $h'_i$.
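To make the role of the hyper-parameter α and the correlation coefficient c_ij concrete, the sketch below computes a KL divergence between a pairwise variational Gaussian and a pairwise prior. Both covariance structures are assumptions for illustration: the prior is taken to be a zero-mean Gaussian with identity diagonal blocks and α times the identity as cross blocks, and the variational distribution is taken to have per-dimension variances with cross-covariance c_ij σ_i σ_j. The actual formulas in the claim are not legible here and may differ; all function names are hypothetical. Both α and c_ij must lie strictly between -1 and 1 for the covariances to be valid.

```python
import numpy as np

def pair_cov(var_i, var_j, corr):
    """Covariance of (z_i, z_j): diagonal blocks var_i, var_j; cross block corr*sd_i*sd_j."""
    cross = corr * np.sqrt(var_i * var_j)
    top = np.hstack([np.diag(var_i), np.diag(cross)])
    bottom = np.hstack([np.diag(cross), np.diag(var_j)])
    return np.vstack([top, bottom])

def gaussian_kl(mean_q, cov_q, mean_p, cov_p):
    """KL( N(mean_q, cov_q) || N(mean_p, cov_p) ) for full covariance matrices."""
    k = mean_q.shape[0]
    cov_p_inv = np.linalg.inv(cov_p)
    diff = mean_p - mean_q
    _, logdet_q = np.linalg.slogdet(cov_q)
    _, logdet_p = np.linalg.slogdet(cov_p)
    return 0.5 * (np.trace(cov_p_inv @ cov_q) + diff @ cov_p_inv @ diff
                  - k + logdet_p - logdet_q)

def pairwise_kl(mu_i, mu_j, var_i, var_j, c_ij, alpha):
    """KL between the assumed pairwise variational distribution and pairwise prior."""
    K = mu_i.shape[0]
    ones = np.ones(K)
    cov_q = pair_cov(var_i, var_j, c_ij)   # variational covariance with correlation c_ij
    cov_p = pair_cov(ones, ones, alpha)    # prior covariance with correlation alpha
    mean_q = np.concatenate([mu_i, mu_j])
    return gaussian_kl(mean_q, cov_q, np.zeros(2 * K), cov_p)
```

Under these assumptions the divergence vanishes when the variational distribution matches the prior exactly, and a pair of connected users whose inferred correlation is far from α is penalized, which is one way the prior can pull the topic vectors of interacting users together.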
CN202111052898.3A 2021-09-07 Microblog topic detection method based on message passing and graph prior distribution Active CN113870041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111052898.3A CN113870041B (en) 2021-09-07 Microblog topic detection method based on message passing and graph prior distribution

Publications (2)

Publication Number Publication Date
CN113870041A 2021-12-31
CN113870041B 2024-05-24

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150213370A1 (en) * 2014-01-27 2015-07-30 Facebook, Inc. Label inference in a social network
CN110232434A (en) * 2019-04-28 2019-09-13 吉林大学 Neural network architecture evaluation method based on attributed graph optimization
CN110348573A (en) * 2019-07-16 2019-10-18 腾讯科技(深圳)有限公司 Method for training graph neural network, graph neural network unit, and medium
US20200293902A1 (en) * 2019-03-15 2020-09-17 Baidu Usa Llc Systems and methods for mutual learning for topic discovery and word embedding
CN112364161A (en) * 2020-09-25 2021-02-12 天津大学 Microblog theme mining method based on dynamic behaviors of heterogeneous social media users

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant