CN113870041B - Microblog topic detection method based on message passing and graph prior distribution

Info

Publication number: CN113870041B
Authority: CN (China)
Prior art keywords: user, topic, graph, distribution, users
Legal status: Active (as listed; the status is an assumption and not a legal conclusion)
Application number: CN202111052898.3A
Other languages: Chinese (zh)
Other versions: CN113870041A
Inventors: 贺瑞芳, 王浩成, 刘焕宇
Current Assignee: Tianjin University
Original Assignee: Tianjin University
Application filed by: Tianjin University
Priority: CN202111052898.3A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Abstract

The invention discloses a microblog topic detection method based on message passing and a graph prior distribution, comprising the following steps: (1) constructing a user-level social network from a microblog corpus according to the interaction relationships among users; (2) message-passing-based user node embedded representation: a graph convolutional network integrates the content information and structure information of posts in social media and embeds the interaction relationships among users into the user node embedded representations; (3) topic generation based on a graph-prior variational auto-encoder: the user node embedded representations that integrate the user interaction relationships are taken as input, the standard Gaussian prior distribution of the variational auto-encoder is replaced with a graph prior distribution containing the user interactions, and the correlation among users is considered during topic inference. Overall, user interactions are integrated in two stages, user node embedded representation and topic inference. The topics detected by the method better reflect the correlation among users and achieve higher coherence.

Description

Microblog topic detection method based on message passing and graph prior distribution
Technical Field
The invention relates to the technical fields of natural language processing and social media data mining, and in particular to a microblog topic detection method based on message passing and a graph prior distribution.
Background
The rapid development of the Internet has brought tremendous progress to our lives. The popularity of social media has given everyone a platform for posting their own opinions and perspectives. As a result, a large number of short texts are generated every day. Analyzing the topics in these texts is an important task, but manual analysis is time-consuming and labor-intensive. A topic model can automatically extract document-topic and topic-word distributions, helping people analyze texts quickly and grasp their content.
Traditional topic models, such as LDA, are widely used to discover latent topics in a text corpus. Essentially, these methods reveal latent topics by implicitly capturing word co-occurrence patterns. However, when applied to short posts, they face a serious data sparsity problem (i.e., sparse post-level word co-occurrence patterns).
To address this problem, several lines of work have had some success. (1) Aggregation-based methods: some studies aggregate multiple posts using heuristic strategies, including author-based aggregation, conversation-based aggregation, and so on; BTM and similar models directly model the generation of biterms (i.e., word pairs). (2) Representation-learning-based methods: some methods reveal topics by modeling the co-occurrence patterns of latent concepts, while others effectively fuse word context information. (3) Social-context-based methods: these methods jointly model text information and social network structure information, for example by modeling the social network structure and classifying messages as leader messages or follower messages. However, the standard methods for learning probabilistic generative models, such as variational techniques and Gibbs sampling, have high computational complexity in posterior inference, which prevents these methods from being applied to complex social media scenarios.
The variational auto-encoder (VAE) is a common parametric inference framework for topic detection; it can identify the structure of the data and learn its latent distribution. NVDM is a typical VAE-based topic model. Each document is fed independently into an inference network, which computes the mean and variance of the topic posterior distribution. A latent topic vector is then sampled from the variational posterior distribution, and finally the input document is reconstructed by a generative network. NVDM is designed for long documents. For social media topic detection, IATM is a classical VAE-based neural topic model. It takes multiple short posts as input and learns edge embedded representations in a social network by mining dynamic user interactions. The edge embedded representations are likewise fed independently into the VAE to infer corpus-level topic-word distributions. Essentially, IATM integrates representation learning and social context on the basis of a VAE.
Although the previous approach embeds user interactions into the edge representations, the VAE assumes that each data point is independent. The correlation between users or posts is therefore weakened when the latent semantic vectors are computed. In a social network, an interaction may indicate a relationship or shared interests. Latent semantic vectors are critical to topic inference, so it is more reasonable to integrate the interaction features between users into the latent semantic vectors.
Disclosure of Invention
The invention aims to overcome the deficiencies of the prior art and provides a microblog topic detection method based on message passing and a graph prior distribution. User interaction information in social media is considered in two stages: user node embedded representation and topic inference. In the user node embedding stage, a graph convolutional network learns user node embedded representations that integrate the social network structure information and the content of the posts, and at the same time embeds the interaction relationships of the users into these representations. In the topic inference stage, a graph prior distribution is introduced and the interaction relationships are fused into the prior distribution of the VAE, so that the latent topic vectors of the users contain the interaction relationships. Finally, VAE inference yields topic distributions that take user correlation into account, and thus topics with higher coherence.
The object of the invention is achieved through the following technical solution:
A microblog topic detection method based on message passing and a graph prior distribution, characterized by comprising the following steps:
(1) Constructing a user-level social network: users are taken as network nodes and interaction relationships as edges in the network;
(2) Encoding user interaction relationships through a message passing mechanism: a graph neural network is introduced, the content information and structure information of posts in social media are integrated through the message passing mechanism, and the interaction relationships among users are embedded into the user node embedded representations;
(3) Topic generation based on a graph-prior variational auto-encoder: the user node embedded representations that integrate the user interaction relationships are taken as input; the standard Gaussian prior of a variational auto-encoder (VAE) is replaced with a graph prior distribution containing the user interaction relationships; and the correlation among users is considered during topic inference.
Further, step (1) specifically comprises:
A user-level social network G = (V, E, T) is constructed according to the forwarding and comment relations among users, where V = {v_i | 1 ≤ i ≤ n} is the node set, v_i represents the i-th user in the social network, and n is the number of users; E = {e_ij | 1 ≤ i ≤ j ≤ n} is the edge set, with e_ij = 1 if user i (represented by v_i) has interacted with user j (represented by v_j) and e_ij = 0 if they have never interacted; the posts published by a user serve as the attribute information of that user's node; T = {t_1, t_2, ..., t_n} is the post set, where t_i represents the content of the posts of the i-th user. To alleviate data sparsity, a user-based aggregation strategy is adopted that aggregates all posts of each user, including source posts, forwarded posts and reply messages. The adjacency matrix A of the user-level social network is obtained from the interaction relations among users. According to the post set T, each word in a post is replaced with its corresponding word embedding to obtain the attribute vector of each user, and thus the attribute matrix X of the social network; the word embedding of each word is randomly initialized.
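The construction of A and X described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `build_social_network`, the toy data, the symmetric treatment of interactions, and the mean-pooling of word embeddings into one attribute vector per user are not specified in the text and are chosen here for brevity.

```python
import numpy as np

def build_social_network(interactions, posts, embed_dim=4, seed=0):
    """Build the adjacency matrix A and attribute matrix X of a user-level network.

    interactions: iterable of (user, user) pairs (forwarding or comment relations).
    posts: dict mapping user -> list of post strings (source, forwarded, replies).
    """
    users = sorted(posts)
    index = {u: i for i, u in enumerate(users)}
    n = len(users)

    # e_ij = 1 if users i and j have interacted, else 0.
    # Assumption: interactions are treated as undirected.
    A = np.zeros((n, n), dtype=np.float32)
    for u, v in interactions:
        if u in index and v in index and u != v:
            A[index[u], index[v]] = 1.0
            A[index[v], index[u]] = 1.0

    # User-based aggregation: all posts of a user form one document.
    docs = {u: " ".join(posts[u]).split() for u in users}
    vocab = sorted({w for words in docs.values() for w in words})
    word_id = {w: k for k, w in enumerate(vocab)}

    # Randomly initialised word embeddings, as stated in the text.
    rng = np.random.default_rng(seed)
    E = rng.normal(scale=0.1, size=(len(vocab), embed_dim)).astype(np.float32)

    # Assumption: a user's attribute vector is the mean of its word embeddings.
    X = np.zeros((n, embed_dim), dtype=np.float32)
    for u in users:
        ids = [word_id[w] for w in docs[u]]
        if ids:
            X[index[u]] = E[ids].mean(axis=0)
    return users, A, X

# Toy example (hypothetical data): u1 and u2 interact, u3 is isolated.
posts = {"u1": ["a b", "c"], "u2": ["b c"], "u3": ["d"]}
users, A, X = build_social_network([("u1", "u2")], posts)
```

In the embodiment, users without any forwarding or comment relation are filtered out beforehand, so an isolated node such as `u3` would not normally reach this step.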
Further, step (2) specifically comprises:
The user node embedded representations are learned with a network embedding technique. Each post in a social network is short and informally worded, so representation learning for posts is important. Using only Bag-of-Words (BoW) vectors as the representation of user nodes would run into the data sparsity problem and hurt topic inference. According to social correlation theory, friends tend to focus on more similar content. Data sparsity is therefore mitigated by modeling user interactions in the social network. Given the ability of the graph convolutional network (GCN) to aggregate information from surrounding nodes, a GCN is used to model the interaction behavior among friends and learn the user node embedded representations. Specifically, the microblog topic detection method adopts a two-layer GCN, as shown in the following formula:
where I denotes the identity matrix (a diagonal matrix whose diagonal elements are all 1); the degree matrix is computed from the adjacency matrix; X denotes the attribute matrix; W_1 and W_2 are the parameters of the graph convolutional network; ReLU is used as the activation function; and H_2 is the matrix of embedded representations of all user nodes in the social network;
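The formula itself is not reproduced in this text. A two-layer GCN consistent with the symbols defined above (A, I, a degree matrix, X, W_1, W_2, ReLU) is the standard propagation rule H_1 = ReLU(Â X W_1), H_2 = ReLU(Â H_1 W_2) with Â = D^(-1/2) (A + I) D^(-1/2), where D is the degree matrix of A + I. The sketch below assumes exactly that form; the function names and the ReLU on the second layer are assumptions, not a statement of the patented formula.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])
    deg = A_tilde.sum(axis=1)            # >= 1 for every node because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(A, X, W1, W2):
    """Two-layer GCN forward pass: each layer averages neighbour features, then projects."""
    A_hat = normalize_adjacency(A)
    H1 = np.maximum(A_hat @ X @ W1, 0.0)   # first message-passing layer + ReLU
    H2 = np.maximum(A_hat @ H1 @ W2, 0.0)  # second layer; rows are user node embeddings
    return H2

# Toy example: two connected users, 4-dim attributes, 3-dim embeddings.
rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0], [1.0, 0.0]])
X = rng.normal(size=(2, 4))
W1 = rng.normal(size=(4, 5))
W2 = rng.normal(size=(5, 3))
H2 = gcn_forward(A, X, W1, W2)
```

With only two mutually connected nodes, every entry of the normalised adjacency is 0.5, so both users receive the same averaged message and end up with identical embeddings, which is the sparsity-compensating effect the text describes in its simplest form.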
Topic detection is an unsupervised task, so there are no labels with which to train the graph convolutional network. The microblog topic detection method therefore uses an unsupervised loss function, as shown in the following formula:
Given user v_i, the goal is to maximize the similarity between v_i and its related user nodes v_j ∈ N_i, where the related user nodes are those in the first-order neighbor set N_i, i.e. the nodes directly connected to v_i by an edge in the social network. In equation (5), h_i, h_j and h_u are the embedded representations of user nodes i, j and u in H_2, and v_u ∈ V ranges over all user nodes in the social network.
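Equation (5) is not reproduced in this text. One loss that matches the description (neighbors v_j ∈ N_i in the numerator, all nodes v_u ∈ V in the normalizer) is a softmax over inner products, L = -Σ_i Σ_{j∈N_i} log( exp(h_i·h_j) / Σ_{v_u∈V} exp(h_i·h_u) ). The sketch below assumes this form, including the inner product as the similarity measure and the averaging over edges; `neighbor_softmax_loss` is a hypothetical name.

```python
import numpy as np

def neighbor_softmax_loss(H, A):
    """Negative log-likelihood of each user's neighbours under a softmax over all users.

    H: (n, d) user node embeddings; A: (n, n) 0/1 adjacency matrix.
    """
    scores = H @ H.T                                         # s_iu = h_i . h_u
    scores = scores - scores.max(axis=1, keepdims=True)      # for numerical stability
    log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Average -log p(j | i) over all (i, j) with an edge.
    return -(A * log_prob).sum() / max(A.sum(), 1.0)

A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
aligned = np.array([[2.0, 0.0], [2.0, 0.0], [0.0, 2.0]])     # connected users are similar
misaligned = np.array([[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]])  # connected users differ
loss_aligned = neighbor_softmax_loss(aligned, A)
loss_misaligned = neighbor_softmax_loss(misaligned, A)
```

Under this assumed form, embeddings in which connected users point in the same direction give a lower loss than embeddings in which they do not, so minimizing it pulls the representations of friends together.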
Through the message passing mechanism of the GCN, the related content of first-order neighbor users is propagated to the attributes of the connected users, which compensates for the data sparsity of a single user; at the same time, the embedded representations of connected nodes become more similar, which preserves the correlation among friends in the social network.
Further, step (3) specifically comprises:
Step (2) encodes the interaction relationships between users into the user node embedded representations, which serve as the input of the graph-prior variational auto-encoder in step (3). A variational auto-encoder that uses a standard Gaussian distribution as its prior comprises an encoder and a decoder: the encoder computes the mean and variance of the topic posterior distribution, a topic vector is sampled from the posterior via the reparameterization trick, and the topic distribution is obtained through softmax; each user node embedded representation is then reconstructed by the decoder.
A variational auto-encoder with a standard Gaussian prior can infer latent topics from independent long documents; with multi-user input, however, it assumes that the users are independent, which weakens the correlation between users during topic inference. This is because the prior distribution in the VAE is a standard Gaussian, which makes the data points independent. The microblog topic detection method therefore first constructs a graph prior distribution to replace the standard Gaussian distribution; the graph prior contains the user interaction relationships, so that the topic vectors of the users follow the corresponding interaction relationships among users in the social network. The graph prior distribution is shown in the following formula:
where z_i and z_j are the latent topic vectors of users v_i and v_j; p_s(z_i) is a standard Gaussian distribution; and the pairwise distribution of z_i and z_j takes the following form:
where α is a hyperparameter and I denotes the identity matrix. Based on the graph prior distribution, a new variational lower bound for the graph-prior variational auto-encoder is obtained, as shown in the following formula:
where the variational distribution q(z_i, z_j | h_i, h_j) takes the following form:
where μ_i, μ_j and the corresponding variances are the means and variances of the variational distribution, and c_ij is the correlation coefficient of z_i and z_j. The final loss function is given by the following formula:
As the loss function shows, the graph-prior variational auto-encoder consists of three parts: 1) a variational network, which takes the user node embedded representation h_i as input and computes the mean μ_i and the variance; 2) a correlation encoding network, which takes a pair of user node representations (h_i, h_j) as input and computes the correlation coefficient c_ij of the two users' latent topic vectors; 3) a generative network, which, as in a variational auto-encoder with a standard Gaussian prior, takes the latent variable z_i as input and reconstructs the original user node embedded representation to obtain h'_i. Overall, the method preserves the interactions between users in two stages, user node embedded representation and topic inference, and takes the correlation between friends into account, thereby obtaining more coherent topics.
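The prior, lower-bound and loss formulas are not reproduced in this text, so the following is only an illustration of how a correlation coefficient c_ij and a hyperparameter α can enter a KL term. It assumes that, for an interacting pair, each latent dimension of (z_i, z_j) has a zero-mean, unit-variance Gaussian prior with correlation α (covariance blocks [[I, αI], [αI, I]]), and that the variational distribution is Gaussian with means μ_i, μ_j, standard deviations σ_i, σ_j and correlation c_ij. Both the assumed forms and the names `standard_kl` and `pairwise_kl` are hypothetical.

```python
import numpy as np

def standard_kl(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), the usual VAE term for a single user."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

def pairwise_kl(mu_i, mu_j, sigma_i, sigma_j, c_ij, alpha):
    """KL( q(z_i, z_j) || p(z_i, z_j) ) for correlated Gaussians, summed over dimensions.

    Per dimension, q has covariance [[s_i^2, c s_i s_j], [c s_i s_j, s_j^2]]
    and the assumed prior has covariance [[1, alpha], [alpha, 1]].
    Requires |c_ij| < 1 and |alpha| < 1.
    """
    one_m_a2 = 1.0 - alpha**2
    trace = (sigma_i**2 + sigma_j**2 - 2.0 * alpha * c_ij * sigma_i * sigma_j) / one_m_a2
    maha = (mu_i**2 + mu_j**2 - 2.0 * alpha * mu_i * mu_j) / one_m_a2
    log_det = np.log(one_m_a2) - np.log(sigma_i**2 * sigma_j**2 * (1.0 - c_ij**2))
    return 0.5 * np.sum(trace + maha - 2.0 + log_det)

mu_i, mu_j = np.array([0.5, -1.0]), np.array([0.2, 0.3])
sigma_i, sigma_j = np.array([0.8, 1.2]), np.array([1.1, 0.9])

# With alpha = 0 and c_ij = 0 the pair decouples into two standard VAE terms.
independent = pairwise_kl(mu_i, mu_j, sigma_i, sigma_j, c_ij=0.0, alpha=0.0)
# When q matches the assumed prior exactly, the divergence vanishes.
matched = pairwise_kl(np.zeros(2), np.zeros(2), np.ones(2), np.ones(2), c_ij=0.3, alpha=0.3)
```

The two checks show the intended behavior of such a term: it reduces to the standard Gaussian case when no correlation is modeled, and it is smallest when the inferred correlation between two interacting users agrees with the correlation the prior expects.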
Compared with the prior art, the technical solution of the invention has the following beneficial effects:
1. To alleviate the data sparsity problem in social media topic detection, the method considers both the text content of posts and the social network structure, and integrates the user interaction relationships. The structure information serves as a supplement that enriches the context information in the social network;
2. To introduce user correlation, user interactions are integrated in two stages, user node embedded representation and topic inference, so that user correlation is considered throughout social media topic detection;
3. In the user node embedding stage, the message passing mechanism of the graph convolutional network, on the one hand, aggregates the information of the friends around each user and alleviates sparsity; on the other hand, it integrates the social network structure into the user node embedded representations, so that the user interaction relationships are preserved in them;
4. In the topic inference stage, topics are inferred with a graph-prior variational auto-encoder. Unlike a conventional variational auto-encoder (VAE), the method uses a graph prior distribution instead of a standard Gaussian distribution. The graph prior takes user interactions into account, so the latent topic vectors of the users follow the interaction structure between users, and the inferred topics are more coherent;
5. Experimental results on three months of Sina Weibo data demonstrate the effectiveness of the method and of the introduced graph prior distribution for microblog topic mining.
Drawings
FIG. 1 is a schematic flow chart of the method of the invention.
FIG. 2a is a visualization of the topic vectors inferred by the standard variational auto-encoder with a standard Gaussian prior; FIG. 2b is a visualization of the topic vectors inferred by the graph-prior variational auto-encoder.
FIG. 3 shows how the evaluation metric, topic coherence, varies with the parameter α of the graph prior distribution in the specific embodiment.
Detailed Description
The invention is described in further detail below with reference to the drawings and the specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Taking a real microblog dataset covering 3 months as an example, a specific implementation of the invention is given. The overall algorithm flow comprises constructing a user-level social network, message-passing-based user node embedded representation, and topic generation based on a graph-prior variational auto-encoder; see FIG. 1. The specific steps are as follows:
S1, constructing a user-level social network:
Previous work collected Sina Weibo microblogs covering 50 trending topics in three months of 2014: May, June and July. Based on this microblog corpus, the embodiment constructs a user-level social network in the following steps: 1) filter out users without any forwarding or comment relation; 2) concatenate all posts of each user to serve as the attribute information of that user; 3) according to the interaction relations between users, add an edge between two user nodes if the two users have interacted, and no edge otherwise. The post text of a user serves as the attribute information of the user node in the social network.
Table 1 shows the statistics of the three monthly datasets: the number of users, the number of interactions and the vocabulary size.
Table 1. Microblog dataset statistics
Dataset    Users    Interactions    Vocabulary size
May         8907           10435               5914
June       19293           35962               9368
July       16990           20971               9663
S2, user node embedded representation based on message passing:
Using only Bag-of-Words (BoW) vectors as the representation of user nodes would run into the data sparsity problem and hurt topic inference. Because each post is short and informally worded, representation learning for posts in a social network is important. Given the ability of the graph convolutional network to aggregate information from surrounding nodes, a two-layer GCN is used to model the interaction behavior between friends and learn the user node embedded representations. Through the message passing mechanism of the GCN, the related content of neighbor users is propagated to the attributes of the connected users, compensating for the data sparsity of a single user. At the same time, the embedded representations of connected nodes become more similar, which preserves the correlation among friends in the social network. The loss function for this step is shown in the following formula:
S3, topic generation based on a graph-prior variational auto-encoder:
With the user node embedded representations as input, topics are inferred with the variational auto-encoder. The variational auto-encoder comprises an encoder and a decoder; the encoder computes the mean and variance of the topic posterior distribution, as shown in the following formula:
μ_i = MLP(h_i)
where h_i denotes the embedded representation of the i-th user node, and μ_i and σ_i^2 denote the mean and variance, respectively. MLP stands for multi-layer perceptron. The latent topic vector z_i is sampled from the posterior distribution with the reparameterization trick, z_i = μ_i + ε * σ_i. The topic distribution θ = (p(t_1|h), p(t_2|h), ..., p(t_K|h)) is obtained with the softmax function:
θ_i = softmax(z_i)
where h represents the input user representation, t_1 represents the first topic, p(t_1|h) represents the probability of the first topic, and K is the total number of topics. Each user node embedded representation is reconstructed by a decoder network, which is also an MLP. The decoder parameter W is the corpus-level topic-word distribution φ_word = (p(w|t_1), p(w|t_2), ..., p(w|t_K)). The specific formulas are as follows:
d_i = softmax(θ_i W)
h'_i = f(W_d d_i + b_d)
where p(w|t_1) represents the probability of each word under the first topic; d_i represents the probability of each word appearing in the attribute information of user node i; W_d denotes the weights of the neural network and b_d its bias; and h'_i denotes the user node embedded representation reconstructed by the decoder.
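The encoder, reparameterization, softmax and decoder formulas above can be traced end to end with a small NumPy sketch. Several details are assumptions made only for illustration: a single linear layer stands in for each MLP, the variance branch outputs a log-variance, and the unspecified activation f is taken to be tanh. The parameter names in `params` are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def vae_forward(h, params, rng):
    """One forward pass: h -> (mu, sigma) -> z -> theta -> d -> h'."""
    # Encoder (assumption: one linear layer per statistic instead of a full MLP).
    mu = h @ params["W_mu"] + params["b_mu"]
    log_var = h @ params["W_lv"] + params["b_lv"]   # assumption: log-variance output
    sigma = np.exp(0.5 * log_var)

    # Reparameterization trick: z_i = mu_i + eps * sigma_i, eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    z = mu + eps * sigma
    theta = softmax(z)                               # theta_i = softmax(z_i)

    # Decoder: d_i = softmax(theta_i W), h'_i = f(W_d d_i + b_d).
    d = softmax(theta @ params["W"])                 # W is the K x |vocab| topic-word matrix
    h_rec = np.tanh(d @ params["W_d"] + params["b_d"])  # assumption: f = tanh
    return theta, d, h_rec

# Toy sizes: 3 users, 4-dim embeddings, K = 2 topics, vocabulary of 5 words.
rng = np.random.default_rng(0)
n, dim, K, vocab_size = 3, 4, 2, 5
params = {
    "W_mu": rng.normal(size=(dim, K)), "b_mu": np.zeros(K),
    "W_lv": rng.normal(size=(dim, K)), "b_lv": np.zeros(K),
    "W": rng.normal(size=(K, vocab_size)),
    "W_d": rng.normal(size=(vocab_size, dim)), "b_d": np.zeros(dim),
}
h = rng.normal(size=(n, dim))
theta, d, h_rec = vae_forward(h, params, rng)
```

Each row of `theta` is a distribution over the K topics and each row of `d` a distribution over the vocabulary, which is why the rows of W can be read off as topic-word weights after training.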
A variational auto-encoder with a standard Gaussian prior can infer latent topics from independent long documents. With multi-user input, however, it assumes that the users are independent, which weakens the correlation between users during topic inference. The invention constructs a graph prior distribution to replace the standard Gaussian. The graph prior contains the user interactions, so that the topic vector of each user follows the corresponding relationships among users in the social network. The loss function is then computed from the new variational lower bound; the specific loss function is as follows:
The symbols in the formula have the meanings given above.
In the implementation, the post text of each user node is first preprocessed. After aggregation, each user's post text contains 50 words. Word embeddings are randomly initialized with dimension 200. In the GCN, the hidden layer dimension is set to 200. In the variational auto-encoder, the dimension of the first encoder layer is set to 200. The learning rate is set to 0.001. The method uses dropout in both the GCN and the graph-prior VAE to avoid overfitting, and Adam is used to optimize the loss function of each module.
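For reference, the settings listed above can be collected in one place. The container below is a hypothetical convenience, not part of the method; values the text does not state (the dropout rate, the value of α) are deliberately left unset rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MGTMConfig:
    """Hyperparameters reported in the embodiment (field names are illustrative)."""
    words_per_user: int = 50           # words kept per user after aggregation
    embedding_dim: int = 200           # randomly initialised word embeddings
    gcn_hidden_dim: int = 200          # hidden layer of the two-layer GCN
    encoder_hidden_dim: int = 200      # first encoder layer of the VAE
    learning_rate: float = 1e-3
    optimizer: str = "adam"
    dropout: Optional[float] = None    # dropout is used, but the rate is not stated
    alpha: Optional[float] = None      # graph prior hyperparameter; value not stated
    num_topics: int = 50               # the experiments report K = 50 and K = 100

config = MGTMConfig()
config_100 = MGTMConfig(num_topics=100)
```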
To verify the effectiveness of the method, the method of the invention (MGTM) is compared with advanced and representative existing methods (BAT [1], BTM [2], LCTM [3], LeadLDA [4], AdjEnc [5], IATM [6]) and with a variant of the method itself (MGTM (Standard Gaussian)).
BAT explores bidirectional adversarial training in a neural topic model. It is designed for long documents and faces a serious data sparsity problem when applied to short texts.
BTM learns topics by directly modeling the generation of word pairs over the whole corpus.
LCTM reveals topics by modeling the co-occurrence patterns of latent concepts, which capture the conceptual similarity of words.
LeadLDA distinguishes leader posts from follower posts and assumes that leader posts and follower posts contain key words to different degrees.
AdjEnc introduces the network structure into topic inference for structured long documents such as academic papers and web pages.
IATM models the dynamic interactions of users, learns interaction-aware edge embeddings, and generates topics using neural variational inference.
MGTM (Standard Gaussian) is the variant that falls back to a standard Gaussian prior; it is used to verify the effect of the graph prior distribution.
Model performance is evaluated with topic coherence, defined as follows:
Tables 2, 3 and 4 show the topic coherence results of the method and all comparison methods on the three monthly microblog datasets, respectively. For each dataset, the coherence scores of the top 10 (N = 10), 15 (N = 15) and 20 (N = 20) words of the inferred topics were recorded for topic numbers K = 50 and K = 100. The higher the topic coherence, the better the model performance.
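Whatever coherence measure is used, it is computed over the top-N words of each topic. A minimal sketch of that extraction step, assuming the topic-word matrix is available as a K x |vocab| array; `top_words` and the toy matrix are hypothetical, and the coherence score itself is not implemented here because its formula is not reproduced in this text.

```python
import numpy as np

def top_words(topic_word, vocab, n=10):
    """Return the n highest-weight words of each topic (one row of topic_word per topic)."""
    order = np.argsort(-topic_word, axis=1)[:, :n]   # indices sorted by descending weight
    return [[vocab[j] for j in row] for row in order]

# Toy topic-word matrix with K = 2 topics over a 3-word vocabulary.
W = np.array([[0.1, 0.7, 0.2],
              [0.5, 0.2, 0.3]])
topics = top_words(W, ["a", "b", "c"], n=2)
print(topics)  # [['b', 'c'], ['a', 'c']]
```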
Table 2. Performance of the method of the invention and the comparison methods on the May dataset
Table 3. Performance of the method of the invention and the comparison methods on the June dataset
Table 4. Performance of the method of the invention and the comparison methods on the July dataset
The topic coherence results in Tables 2, 3 and 4 show that modeling the interaction relationships between users in both stages, user node embedded representation and topic inference, lets the topics reflect user correlation and further improves topic coherence. To investigate whether the graph prior distribution helps the users' latent topic vectors preserve user interactions, FIG. 2a and FIG. 2b visualize the latent topic vectors: FIG. 2a shows the topic vectors inferred by the variational auto-encoder with a standard Gaussian prior, and FIG. 2b shows the topic vectors inferred by the graph-prior variational auto-encoder. The method yields user topic vectors that cluster better. To further investigate the effect of the parameter α of the graph prior distribution on topic coherence, FIG. 3 shows how the topic coherence score of the method varies with α on the three monthly microblog datasets.
The above schematically illustrates the technical solution of the invention, which is not limited to the embodiments described above. Those skilled in the art can make numerous specific modifications without departing from the spirit of the invention and the scope of the claims, all of which fall within the scope of protection of the invention.
References:
[1] Rui Wang, Xuemeng Hu, Deyu Zhou, Yulan He, Yuxuan Xiong, Chenchen Ye, and Haiyang Xu. 2020. Neural Topic Modeling with Bidirectional Adversarial Training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 340-350.
[2] Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. 2013. A biterm topic model for short texts. In Proceedings of the 22nd international conference on World Wide Web. ACM. 1445-1456.
[3] Weihua Hu and Jun'ichi Tsujii. 2016. A Latent Concept Topic Model for Robust Topic Inference Using Word Embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 380-386.
[4] Jing Li, Ming Liao, Wei Gao, Yulan He, and Kam-Fai Wong. 2016. Topic Extraction from Microblog Posts Using Conversation Structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2114-2123.
[5] Ce Zhang and Hady W. Lauw. 2020. Topic Modeling on Document Networks with Adjacent-Encoder. Proceedings of the AAAI Conference on Artificial Intelligence 34, 04 (2020), 6737-6745.
[6] Ruifang He, Xuefei Zhang, Di Jin, Longbiao Wang, Jianwu Dang, and Xiangang Li. 2018. Interaction-Aware Topic Model for Microblog Conversations through Network Embedding and User Attention. In Proceedings of the 27th International Conference on Computational Linguistics. 1398-1409.
The invention is not limited to the embodiments described above. The above description of specific embodiments is intended to describe and illustrate the technical solution of the invention and is illustrative only, not limiting. Those skilled in the art can make numerous specific modifications without departing from the spirit of the invention and the scope of the claims, all of which fall within the scope of protection of the invention.

Claims (2)

1. A microblog topic detection method based on message passing and graph prior distribution, characterized by comprising the following steps:
(1) Constructing a user-level social network: taking a user as a network node and taking an interactive relation as an edge in a network;
(2) Encoding user interaction relationships through a message passing mechanism: introducing a graph neural network, integrating the content information and structure information of posts in social media by means of the message passing mechanism, and embedding the interaction relationships among users into the user node embedded representations; specifically:
learning user node embedded representations using a network embedding technique; modeling user interactions in the social network to mitigate data sparsity; in view of the ability of the graph convolutional network (GCN) to aggregate information from surrounding nodes, modeling the interaction behavior among friends with the GCN and learning the user node embedded representations; specifically, the microblog topic detection method adopts a two-layer GCN, as shown in the following formula:
where Ã = A + I, A being the adjacency matrix and I the identity matrix (a diagonal matrix whose diagonal elements are all 1); D̃ denotes the degree matrix of the adjacency matrix; X denotes the attribute matrix; W_1 and W_2 are the parameters of the graph convolutional network; ReLU is used as the activation function; and H_2 is the matrix of embedded representations of all user nodes in the social network;
The microblog topic detection method uses an unsupervised loss function, as shown in the following formula:
Given a user v_i, the goal is to maximize the similarity between v_i and its related user nodes v_j ∈ N_i, where the related user nodes are those in the first-order neighbor set N_i, i.e., the nodes directly connected to v_i by an edge in the social network; in equation (5), h_i, h_j and h_u are the embedded representations in H_2 of user nodes i, j and u respectively, and v_u ∈ V ranges over all user nodes in the social network;
Based on the message passing mechanism of the GCN, the related content of first-order neighbor users is propagated to the attributes of the connected users, which compensates for the data sparsity of a single user; meanwhile, the user node embedded representations of connected nodes become more similar, which further preserves the correlation among friends in the social network;
(3) Topic generation based on a graph-prior variational autoencoder: taking the user node embedded representations that integrate the user interaction relationships as input, replacing the standard Gaussian prior of a variational autoencoder (VAE) with a graph prior distribution containing the user interaction relationships, so that the correlation among users is taken into account during topic inference; specifically:
In step (2), the interaction relationships among users are encoded into the user node embedded representations, which serve as the input of the graph-prior variational autoencoder in step (3); a variational autoencoder that adopts the standard Gaussian distribution as its prior comprises an encoder and a decoder, wherein the encoder computes the mean and variance of the topic posterior distribution, a topic vector is sampled from the topic posterior distribution by the reparameterization trick, and the topic distribution is obtained through softmax; each user node embedded representation is reconstructed by the decoder;
The microblog topic detection method first constructs a graph prior distribution to replace the standard Gaussian distribution; the graph prior distribution contains the user interaction relationships, so that the topic vector of each user conforms to the corresponding interaction relationships among users in the social network; the graph prior distribution is shown in the following formula:
where z_i and z_j are the latent topic vectors of users v_i and v_j; p_s(z_i) is the singleton marginal distribution, for which a standard Gaussian distribution is used here; and the pairwise marginal distribution of (z_i, z_j) takes the following form:
where α is a hyperparameter and I denotes the identity matrix; based on the graph prior distribution, a new variational lower bound of the graph-prior variational autoencoder is obtained, as shown in the following formula:
where the variational distribution q(z_i, z_j | h_i, h_j) takes the following form:
where μ_i, μ_j and the associated variances are the means and variances of the variational distribution, and c_ij is the correlation coefficient of z_i and z_j; the final loss function is given by the following formula:
From the loss function it follows that the graph-prior variational autoencoder consists of three parts: 1) a variational network, which takes the user node embedded representation h_i as input and computes the mean μ_i and the variance; 2) a correlation encoding network, which takes a pair of user nodes (h_i, h_j) as input and computes the correlation coefficient c_ij of the two users' latent topic vectors; 3) a generative network, which takes the latent variable z_i as input and reconstructs the original user node embedded representation h_i, in the same way as the variational autoencoder that adopts the standard Gaussian distribution as its prior.
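The formulas of claim 1 are not reproduced in this text, so the following is only a minimal Python sketch of steps (2) and (3) under stated assumptions, not the patented implementation. It assumes the usual symmetric GCN normalization with ReLU on the hidden layer only, a softmax-style neighbor loss over inner products of embeddings, and a pairwise prior whose two components have zero mean, unit variance and correlation alpha in every latent dimension. All function names (`gcn_forward`, `neighbor_loss`, `sample_topic_distribution`, `pair_kl`) are hypothetical, and plain lists are used instead of a numerical library to keep the example self-contained.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(A):
    """D~^(-1/2) (A + I) D~^(-1/2), with D~ the degree matrix of A + I (assumed form)."""
    n = len(A)
    a_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_forward(A, X, W1, W2):
    """Two-layer GCN; ReLU on the hidden layer only (an assumption)."""
    a_hat = normalized_adjacency(A)
    hidden = [[max(0.0, v) for v in row] for row in matmul(a_hat, matmul(X, W1))]
    return matmul(a_hat, matmul(hidden, W2))


def neighbor_loss(H, A):
    """Assumed unsupervised loss: negative log-softmax of h_i . h_j over all
    users u, summed over the first-order neighbors j of every user i."""
    loss = 0.0
    for i, h_i in enumerate(H):
        scores = [sum(a * b for a, b in zip(h_i, h_u)) for h_u in H]
        top = max(scores)  # subtract the maximum for numerical stability
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        loss -= sum(scores[j] - log_z for j in range(len(H)) if A[i][j])
    return loss


def sample_topic_distribution(mu, sigma, rng):
    """Reparameterization trick (z = mu + sigma * eps) followed by softmax."""
    z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
    top = max(z)
    exp_z = [math.exp(v - top) for v in z]
    total = sum(exp_z)
    return [v / total for v in exp_z]


def pair_kl(mu_i, sigma_i, mu_j, sigma_j, c_ij, alpha):
    """KL(q || p) for one connected user pair, summed over latent dimensions.
    q: bivariate Gaussian per dimension with means (mu_i, mu_j), standard
       deviations (sigma_i, sigma_j) and correlation c_ij.
    p: assumed pairwise prior with zero means, unit variances, correlation alpha."""
    kl = 0.0
    for mi, si, mj, sj in zip(mu_i, sigma_i, mu_j, sigma_j):
        trace = (si * si + sj * sj - 2 * alpha * c_ij * si * sj) / (1 - alpha ** 2)
        quad = (mi * mi + mj * mj - 2 * alpha * mi * mj) / (1 - alpha ** 2)
        log_det = math.log((1 - alpha ** 2) / (si * si * sj * sj * (1 - c_ij ** 2)))
        kl += 0.5 * (trace + quad - 2 + log_det)
    return kl


# Toy example: two users who interact with each other.
A = [[0, 1], [1, 0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
H = gcn_forward(A, identity, identity, identity)
loss = neighbor_loss(H, A)
theta = sample_topic_distribution([0.0, 0.0], [1.0, 1.0], random.Random(0))
kl_same = pair_kl([0.0], [1.0], [0.0], [1.0], c_ij=0.5, alpha=0.5)
```

In this sketch the KL term vanishes when the variational pair distribution matches the assumed prior exactly (c_ij equal to alpha, zero means, unit variances), which is the sense in which the graph prior pulls the topic vectors of interacting users toward a shared correlation structure.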
2. The microblog topic detection method based on message passing and graph prior distribution according to claim 1, wherein step (1) specifically comprises:
Constructing a user-level social network G = (V, E, T) according to the forwarding and comment relationships among users, wherein V = {v_i | 1 ≤ i ≤ n} is the node set, v_i denotes the i-th user in the social network, and n denotes the number of users; E = {e_ij | 1 ≤ i ≤ j ≤ n} is the edge set, where e_ij = 1 if user i, represented by v_i, has interacted with user j, represented by v_j, and e_ij = 0 if they have never interacted; the posts published by a user serve as the attribute information of that user node; T = {t_1, t_2, …, t_n} is the post set, where t_i denotes the post content of the i-th user; to alleviate data sparsity, a user-based aggregation strategy is adopted to aggregate all posts of each user, including source posts, forwarded posts and replies; the adjacency matrix A of the user-level social network is obtained from the interaction relationships among users; according to the post set T, each word in a post is replaced with its corresponding word embedding representation to obtain the attribute vector of each user, and thereby the attribute matrix X of the social network; the word embedding representation of each word is obtained by random initialization.
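The construction in claim 2 can be sketched in Python as follows. This is an illustrative sketch rather than the patented implementation: the function names `build_user_graph` and `attribute_matrix` are hypothetical, edges are treated as undirected, words are split on whitespace (real microblog text would need a proper tokenizer or word segmenter), and the attribute vector of a user is assumed to be the mean of the randomly initialized embeddings of that user's words, since the claim does not state how word embeddings are combined.

```python
import random
from collections import defaultdict


def build_user_graph(posts, interactions):
    """posts: list of (user, text) covering source posts, forwards and replies.
    interactions: list of (user_a, user_b) forwarding or comment pairs.
    Returns the user list, the adjacency matrix A and the aggregated posts T."""
    users = sorted({u for u, _ in posts} | {u for pair in interactions for u in pair})
    index = {u: i for i, u in enumerate(users)}

    # User-based aggregation: all posts of one user form a single document t_i.
    aggregated = defaultdict(list)
    for user, text in posts:
        aggregated[user].append(text)
    T = [" ".join(aggregated[u]) for u in users]

    # e_ij = 1 if users i and j have interacted, 0 otherwise.
    A = [[0] * len(users) for _ in users]
    for a, b in interactions:
        if a != b:  # self-interactions are ignored (an assumption)
            A[index[a]][index[b]] = A[index[b]][index[a]] = 1
    return users, A, T


def attribute_matrix(T, dim, seed=0):
    """Attribute matrix X from randomly initialized word embeddings,
    mean-pooled per user (the pooling choice is an assumption)."""
    rng = random.Random(seed)
    vocab = {}
    X = []
    for doc in T:
        words = doc.split()
        for w in words:
            if w not in vocab:
                vocab[w] = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if words:
            X.append([sum(vocab[w][k] for w in words) / len(words) for k in range(dim)])
        else:
            X.append([0.0] * dim)  # user with no text gets a zero vector
    return X


posts = [
    ("alice", "flood warning issued"),
    ("bob", "stay safe everyone"),
    ("alice", "roads closed"),
]
interactions = [("bob", "alice")]
users, A, T = build_user_graph(posts, interactions)
X = attribute_matrix(T, dim=4)
```

The resulting A and X are the inputs that the GCN of step (2) in claim 1 consumes.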
CN202111052898.3A 2021-09-07 2021-09-07 Microblog topic detection method based on message passing and graph priori distribution Active CN113870041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111052898.3A CN113870041B (en) 2021-09-07 2021-09-07 Microblog topic detection method based on message passing and graph priori distribution

Publications (2)

Publication Number Publication Date
CN113870041A CN113870041A (en) 2021-12-31
CN113870041B true CN113870041B (en) 2024-05-24

Family

ID=78995057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111052898.3A Active CN113870041B (en) 2021-09-07 2021-09-07 Microblog topic detection method based on message passing and graph priori distribution

Country Status (1)

Country Link
CN (1) CN113870041B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232434A (en) * 2019-04-28 2019-09-13 吉林大学 A kind of neural network framework appraisal procedure based on attributed graph optimization
CN110348573A (en) * 2019-07-16 2019-10-18 腾讯科技(深圳)有限公司 The method of training figure neural network, figure neural network unit, medium
CN112364161A (en) * 2020-09-25 2021-02-12 天津大学 Microblog theme mining method based on dynamic behaviors of heterogeneous social media users

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9552613B2 (en) * 2014-01-27 2017-01-24 Facebook, Inc. Label inference in a social network
US11568266B2 (en) * 2019-03-15 2023-01-31 Baidu Usa Llc Systems and methods for mutual learning for topic discovery and word embedding

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
陈静 ; 刘琰 ; 王煦中.基于概率生成模型的微博话题传播群体划分方法.计算机科学.43(8),全文. *
鲁骁 ; 李鹏 ; 王斌 ; 李应博 ; 房婧.一种基于用户互动话题的微博推荐算法.中文信息学报.30(3),全文. *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant