CN112199607A - Microblog topic mining method based on fusion of parallel social contexts in variable neighborhood - Google Patents


Info

Publication number
CN112199607A
CN112199607A
Authority
CN
China
Prior art keywords
user
sequence
topic
embedding
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011192126.5A
Other languages
Chinese (zh)
Inventor
贺瑞芳
刘宏宇
朱永凯
王浩成
韩迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202011192126.5A priority Critical patent/CN112199607A/en
Publication of CN112199607A publication Critical patent/CN112199607A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a microblog topic mining method based on the fusion of parallel social contexts in a variable neighborhood, which comprises the following steps: (1) constructing a user-level dialogue network; (2) parallel social context sequences: setting different random walk lengths to obtain parallel content and structure context sequences containing user proximities of different orders; (3) self-fused network representation: capturing the non-linear association between text content and network structure, and introducing an attention mechanism to model the influence of different users on the topic along the sequence, thereby obtaining the user sequence representation; (4) topic generation based on neural variational inference: taking the user sequence representation as the input of neural variational inference, which adaptively balances the intrinsic complementarity between content and structure, thereby mining topics with better coherence.

Description

Microblog topic mining method based on fusion of parallel social contexts in variable neighborhood
Technical Field
The invention relates to the technical fields of natural language processing and social media data mining, and in particular to a microblog topic mining method based on the fusion of parallel social contexts in a variable neighborhood.
Background
The emergence of social media websites (e.g., Sina Weibo) has dramatically changed the form of content on the Internet. Microblogging allows users to publish and browse information and offers strong social functions such as forwarding and commenting. Microblog platforms store huge amounts of text data and grow at an astonishing rate every day. Microblog text contains a large amount of information, and the topic information mined from it can be used for topic recommendation, emergency detection, precision marketing, and so on. At present, text topic mining techniques perform well when applied to text data such as news and articles. However, microblog texts are short, generally limited to 140 characters, and characteristics such as sparse information and casual word usage greatly increase the difficulty of processing them. Therefore, microblog-oriented topic mining requires methods different from traditional topic mining.
Currently, related research on microblog topic mining mainly falls into three lines: (1) exploiting co-occurrence patterns across documents. These methods aggregate short messages into long pseudo-documents according to heuristic rules such as authors and hashtags, or according to the topical attributes of the texts, and then mine latent topics with a three-layer Bayesian topic model; or they directly model the generation of word pairs over the whole corpus to alleviate the data sparsity of short texts. (2) Exploiting short text semantics. These methods use the rich semantic information carried by word embeddings, treat a short text as a set of word embeddings, assume the topic-word distribution is a multivariate Gaussian, and then infer topics with a hierarchical Bayesian model; or they model topics by integrating the semantic associations between words and their contexts in short texts, achieving a degree of deeper semantic understanding. (3) Exploiting social network context. These methods introduce structural features of the social network, such as user-forwarding networks and user-follower networks, supplementing the microblog text content with static context information so as to find more word co-occurrence features; or they introduce the dynamic context of the social network and infer topics by mining user behavior features such as dynamic interactions among users and different user interests.
Although the above methods achieve good performance, they either model only the microblog text content or consider only the first-order proximity of the social network, ignoring the impact of larger user neighborhoods in microblog conversations on topic inference. In a larger user neighborhood, users may talk about highly related topics, and users talking about the same topic may hold similar opinions on it. In addition, learning independent content and structure representations for a user [1] ignores the non-linear correlation between the two, and concatenating the content and structure representations with equal weight ignores their different importance in topic inference. These observations provide favorable clues for social-media-based microblog topic mining.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a microblog topic mining method based on the fusion of parallel social contexts in a variable neighborhood. To capture neighborhood contexts of different scales, the method constructs a user-level dialogue network based on forwarding and comment relations, where nodes represent users and edges represent forwarding or comment relations between pairs of users. By setting different random walk lengths, user proximities of different orders are first modeled; a self-fused network is then designed to capture the non-linear association between the parallel content and structure contexts within the neighborhood; an attention mechanism is introduced to model the different influences of users on the topic along the sequence; the two are finally integrated seamlessly into a user sequence embedding, and microblog topics with better coherence are generated through neural variational inference. Compared with existing models, the microblog topics mined by the method are optimal under the Topic Coherence Score evaluation metric.
The purpose of the invention is realized by the following technical scheme:
a microblog topic mining method based on parallel social context fusion in variable domains comprises the following steps:
(1) building user-level dialog networks
Users are regarded as nodes in the dialogue network, and all microblogs related to a user, including source microblogs, comment microblogs, and forwarding microblogs, are aggregated into one document regarded as the text information of the node. If microblog forwarding or comments exist between users in the dialogue network, the nodes denoting those users are connected. These operations construct a user-level dialogue network G = (V, E, T), where V is the set of nodes in the dialogue network G, E ⊆ V × V is the set of edges in the dialogue network G, and T is the set of text documents attached to the nodes. v_i denotes the i-th user in V, u_i denotes the identity of user v_i, and t_i = (w_1, w_2, ..., w_n) denotes the document of user v_i, where n is the number of words in t_i.
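As an illustrative sketch only (not the patented implementation), the construction of G = (V, E, T) can be expressed in a few lines of Python; the interaction pairs and per-user microblog lists below are hypothetical inputs:

```python
def build_dialogue_network(interactions, user_docs):
    """Build a user-level dialogue network G = (V, E, T).

    interactions: iterable of (user_a, user_b) pairs, meaning user_a
                  forwarded or commented on a microblog of user_b.
    user_docs:    dict mapping each user to the list of all microblogs
                  (source, comment, forward) associated with that user.
    """
    V, E = set(), set()
    for a, b in interactions:
        V.update((a, b))
        # Treat forwarding/comment links as undirected: store both directions.
        E.add((a, b))
        E.add((b, a))
    # Aggregate every user's microblogs into one document (the node's text in T).
    T = {u: " ".join(user_docs.get(u, [])) for u in V}
    return V, E, T

V, E, T = build_dialogue_network(
    [("u1", "u2"), ("u2", "u3")],
    {"u1": ["post A"], "u2": ["post B", "reply to A"], "u3": ["post C"]},
)
```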
(2) Parallel social context sequences: set different random walk lengths to obtain parallel content and structure context sequences containing user proximities of different orders.
In the user-level dialogue network, a random walk starting from an arbitrary node yields a truncated user sequence. By setting different random walk lengths, neighborhood contexts of different scales are captured for a user. Suppose S = (v_1, v_2, ..., v_k) represents a user sequence; the sampled user content sequence S^c = (t_1, t_2, ..., t_k) and user structure sequence S^s = (u_1, u_2, ..., u_k) form a pair of parallel social context sequences containing content information and structure information respectively, where k denotes the length of the random walk.
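A minimal sketch of the truncated random walk, assuming a simple adjacency-dict representation of the dialogue network (the network and documents below are hypothetical):

```python
import random

def sample_user_sequence(adj, start, k, seed=0):
    """Truncated random walk of length at most k starting from `start`,
    over an adjacency dict `adj` (node -> set of neighbors)."""
    rng = random.Random(seed)
    seq = [start]
    while len(seq) < k:
        neighbors = adj.get(seq[-1])
        if not neighbors:
            break  # dead end: the sequence is truncated early
        seq.append(rng.choice(sorted(neighbors)))
    return seq

# Hypothetical 3-user dialogue network and per-user documents.
adj = {"u1": {"u2"}, "u2": {"u1", "u3"}, "u3": {"u2"}}
docs = {"u1": "post A", "u2": "post B", "u3": "post C"}

S = sample_user_sequence(adj, "u1", k=4)
S_c = [docs[v] for v in S]  # parallel content context sequence
S_s = list(S)               # parallel structure context sequence (identities)
```

Varying `k` per data set is what produces neighborhood contexts of different scales.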
(3) Self-fused network representation: capture the non-linear association between text content and network structure [2], and introduce an attention mechanism to model the influence of different users on the topic along the sequence, obtaining the user sequence representation.
Given a user content sequence S^c, each word w_i in a user's text t_i is replaced by its corresponding word embedding w_i ∈ R^{d'}, yielding a text embedding matrix E_i = (w_1, w_2, ..., w_n), where d' denotes the dimension of the word embeddings. Convolution and max-pooling operations are applied to E_i to preserve the latent local syntactic and semantic information in the user text [3] and encode it into a user content embedding, as detailed in formula (1):
v_i = max(CNN(E_i))  (1)
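A NumPy sketch of formula (1), assuming (hypothetically) 1-D convolution filters of a fixed width followed by max-pooling over word positions; filter shapes and dimensions are illustrative:

```python
import numpy as np

def cnn_maxpool_encode(E, W_filters, width=3):
    """Encode a text embedding matrix E (n_words x d') into a user content
    embedding via 1-D convolution and max-pooling over positions, as in
    formula (1): v_i = max(CNN(E_i)). W_filters: (n_filters, width, d')."""
    n, d = E.shape
    n_filters = W_filters.shape[0]
    n_pos = n - width + 1
    feats = np.empty((n_pos, n_filters))
    for p in range(n_pos):
        window = E[p:p + width]  # (width, d') slice of consecutive words
        # Each filter's response: elementwise product summed over the window.
        feats[p] = np.tensordot(W_filters, window, axes=([1, 2], [0, 1]))
    return feats.max(axis=0)  # max-pool over positions -> (n_filters,)

rng = np.random.default_rng(0)
E_i = rng.normal(size=(7, 16))    # 7 words, 16-dim word embeddings
W = rng.normal(size=(32, 3, 16))  # 32 filters of width 3 (assumed parameters)
v_i = cnn_maxpool_encode(E_i, W)
```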
Through the convolution and pooling operations, the original user content sequence S^c is converted into the user content embedding sequence (v_1, v_2, ..., v_k). Together with the user structure sequence S^s, it is fed into a bidirectional LSTM to capture forward and backward context information on the sequence for each user:
h_i^→ = LSTM^→(v_i, h_{i-1}^→)  (2)
h_i^← = LSTM^←(v_i, h_{i+1}^←)  (3)
In equations (2) and (3), h_i^→ and h_i^← are the hidden states of the forward LSTM and the backward LSTM respectively; concatenating h_i^→ and h_i^← yields the user embedding h_i. On the user embedding sequence (h_1, h_2, ..., h_k), the influence of different users on the topic is converted into corresponding importance coefficients computed with an attention mechanism, as detailed in formula (4):
(α_1, α_2, ..., α_k) = att(h_1, h_2, ..., h_k)  (4)
in the formula (4), αiRepresenting a user viThe contribution to the topic is specifically calculated in formula (5). First embed the user in hiNon-linear transformation is carried out, and then similarity is calculated with the attention vector q of the user to obtain alphai
α_i = q^T · tanh(W · h_i + b)  (5)
In formula (5), W and b are neural network parameters which, together with the user attention vector q, are shared across all user sequences and user embeddings, and tanh(·) is a non-linear activation function. The coefficients are further normalized by a softmax function to obtain the importance of user v_i to the topic, as detailed in formula (6):
α_i = exp(α_i) / Σ_{v_j∈N_i} exp(α_j)  (6)
In formula (6), N_i denotes the sequence-based neighbors of user v_i (including v_i itself), and α_j denotes the importance of neighbor v_j to the topic. By weighting all user embeddings on the sequence, a user sequence embedding s is obtained that captures the non-linear association between the parallel content and structure contexts within the variable social neighborhood, see formula (7):
s = Σ_{i∈N} α_i · h_i  (7)
where N denotes all users on the current sequence. To obtain the user sequence embedding s, the following objective function is minimized:
L_seq = −Σ_{v_i∈S} Σ_{v_j∈C_i} log p(v_j | v_i)  (8)
In formula (8), L_seq denotes the loss function for learning the user sequence embedding s, and C_i = {v_j | v_j ∈ N_i, |j − i| ≤ c, j ≠ i} denotes the context of user v_i, where c is the window size. p(v_j | v_i) denotes the conditional probability of neighbor v_j given user v_i, formalized as in formula (9):
p(v_j | v_i) = exp(h_j^T · h_i) / Σ_{v_k∈C_i} exp(h_k^T · h_i)  (9)
where h_k is the embedding of an arbitrary sequence-based neighbor v_k ∈ C_i of user v_i. Negative sampling is used to further reduce the computational cost of the conditional probability in formula (9), giving the following optimized objective function, see formula (10):
L_seq = −Σ_{v_i∈S} Σ_{v_j∈C_i} [ log σ(h_j^T · h_i) + Σ_{l=1}^{L} E_{v_l∼P_n(v)} log σ(−h_l^T · h_i) ]  (10)
where σ (x) · 1/(1+ exp (-x)) denotes a sigmoid (sigmoid) function, and L denotes the number of negative samples.
(4) Topic generation based on neural variational inference: the user sequence representation serves as the input of neural variational inference [4], which adaptively balances the intrinsic complementarity between content and structure, thereby mining topics with better coherence.
The document-topic distribution θ_d = (p(t_1|d), p(t_2|d), ..., p(t_K|d)) and the topic-word distribution φ_w = (p(w|t_1), p(w|t_2), ..., p(w|t_K)) are inferred using neural variational inference, where d denotes a document, t_i denotes the i-th topic, K denotes the number of topics, and w denotes a word. p(t_i|d) (i = 1, 2, ..., K) denotes the probability that document d belongs to the i-th topic, and p(w|t_i) (i = 1, 2, ..., K) denotes the probability that word w belongs to the i-th topic.
Document-topic distribution: given a user sequence embedding s, it is first mapped into a non-linear latent space h_enc as follows:
h_enc = ReLU(W_h · s + b_h)  (11)
where W_h and b_h are parameters of the encoder and ReLU is a non-linear activation function. Assuming that both the prior and the posterior distribution of the user sequence embedding s are Gaussian, the mean μ and variance σ² of the posterior Gaussian distribution are obtained by linear transformations, see formulas (12) and (13):
μ = W_μ · h_enc + b_μ  (12)
log(σ²) = W_σ · h_enc + b_σ  (13)
where W_μ, b_μ, W_σ, and b_σ are parameters of the encoder.
The latent semantic vector z ∈ R^K is then obtained with the reparameterization trick, formalized in formula (14):
z = μ + ε ⊙ σ  (14)
In formula (14), ε is sampled from the Gaussian distribution N(0, I). The latent semantic vector z is normalized with a softmax function to obtain the document-topic distribution θ_d.
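The reparameterization step of formula (14) and the subsequent softmax can be sketched as follows; the topic number K and the Gaussian parameters are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_doc_topic(mu, log_sigma2, rng):
    """Reparameterization trick of formula (14), then the softmax that
    yields the document-topic distribution theta_d."""
    sigma = np.exp(0.5 * log_sigma2)
    eps = rng.normal(size=mu.shape)  # eps ~ N(0, I)
    z = mu + eps * sigma             # formula (14)
    return softmax(z)

rng = np.random.default_rng(3)
K = 10  # number of topics (hypothetical)
theta_d = sample_doc_topic(rng.normal(size=K), rng.normal(size=K), rng)
```

Because the noise ε is sampled outside the deterministic transform, gradients can flow through μ and σ during training; that is the point of the trick.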
Topic-word distribution: the topic-word distribution φ_w can be regarded as the parameters of the decoder, see formula (15):
h_dec = softmax(φ_w × (θ_d)^T)  (15)
The decoder reconstructs the user sequence embedding s to obtain the reconstructed embedding s′, computed as in formula (16), where W^(d) and b^(d) are parameters of the decoder:
s′ = ReLU(W^(d) · h_dec + b^(d))  (16)
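A sketch of the decoder side, formulas (15) and (16); the dimension conventions are illustrative (φ_w is taken here as a K × |V| matrix, so φ_w^T · θ_d mixes the topic-word distributions by the document-topic weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode(theta_d, phi, W_d, b_d):
    """Formula (15): mix topic-word distributions by theta_d and normalize;
    formula (16): reconstruct the sequence embedding s' with a ReLU layer."""
    h_dec = softmax(phi.T @ theta_d)                  # (|V|,) over the vocabulary
    s_prime = np.maximum(0.0, W_d @ h_dec + b_d)      # ReLU(W^(d) h_dec + b^(d))
    return h_dec, s_prime

rng = np.random.default_rng(4)
K, V_size, D = 5, 50, 16                   # topics, vocabulary, embedding dim
phi = rng.dirichlet(np.ones(V_size), size=K)  # K x |V| topic-word distributions
theta = rng.dirichlet(np.ones(K))             # document-topic distribution
W_d = rng.normal(size=(D, V_size))
b_d = rng.normal(size=D)
h_dec, s_prime = decode(theta, phi, W_d, b_d)
```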
For topic generation, the objective function of this part is formula (17):
L_gen = D_KL(q(z) ‖ p(z|s)) − E_{q(z)}[log p(s|z)]  (17)
In formula (17), L_gen denotes the loss function for learning the document-topic and topic-word distributions; the KL divergence measures how closely the variational distribution q(z), with the prior taken as the Gaussian N(0, I), approximates the true posterior p(z|s).
Combining formulas (10) and (17), an overall objective function is defined to mine the latent microblog topics, as detailed in formula (18), where λ is a hyper-parameter trading off L_seq and L_gen:
L = (1 − λ) · L_seq + λ · L_gen  (18)
compared with the prior art, the technical scheme of the invention has the following beneficial effects:
(1) To address problems such as sparse microblog text data and casual word usage, the method considers social media content and the social network topology simultaneously, enriching the context information of microblog texts;
(2) To capture user proximities of different orders, the method sets different random walk lengths on the dialogue networks constructed from different data sets, capturing neighborhood contexts of different scales for users according to their different interaction characteristics;
(3) To capture the non-linear association between text content and network structure within a user neighborhood, the method uses a self-fused network representation and an attention mechanism to seamlessly integrate the two into the user sequence embedding;
(4) To generate topics with better coherence, the method feeds the user sequence embedding, which captures the non-linear association between parallel content and structure contexts within the variable social neighborhood, into neural variational inference, adaptively balancing their intrinsic complementarity and different importance in topic inference;
(5) Experimental results on real Sina Weibo data sets demonstrate the effectiveness of the method, confirming that capturing user proximities of different orders and modeling the non-linear association between content and structure benefit microblog topic mining.
Drawings
Fig. 1a and Fig. 1b are framework diagrams of the microblog topic mining method based on the fusion of parallel social contexts in a variable neighborhood provided by the invention, wherein the left part of Fig. 1a shows the construction of the user-level dialogue network in the embodiment; the right part of Fig. 1a shows the acquisition of the parallel social context sequences; the left part of Fig. 1b shows the self-fused network representation; and the right part of Fig. 1b shows topic generation based on neural variational inference;
FIG. 2a shows the variation of the topic coherence score with random walk length on the 5-month data set in the embodiment;
FIG. 2b shows the variation of the topic coherence score with random walk length on the 6-month data set in the embodiment;
FIG. 2c shows the variation of the topic coherence score with random walk length on the 7-month data set in the embodiment.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and the detailed description. It should be understood that the embodiments described herein are only for illustrating the present invention and are not to be construed as limiting the present invention.
The specific implementation of the present invention is described by taking 3 real microblog data sets as an example; the overall framework of the method is shown in Figs. 1a and 1b. The overall algorithm comprises four steps: user-level dialogue network input, acquisition of parallel social context sequences containing user proximities of different orders, self-fused network representation capturing the non-linear association between content and structure, and topic generation based on neural variational inference.
The method comprises the following specific steps:
(1) user-level conversational web input
On the Sina Weibo platform, previous work collected microblogs covering 50 hot topics during the three months of May, June, and July 2014 using the hashtag-search application programming interface (hashtag-search API). The invention takes these 3 public real microblog data sets as the original corpora and preprocesses them as follows to construct the user-level dialogue networks: 1) filter out users without forwarding or comment relations; 2) aggregate all microblogs related to the same user, including source microblogs, forwarding microblogs, and comment microblogs, into one document as the text information of the node denoting that user.
Table 1 shows detailed statistics of the three data sets: the 5-month data set includes 44,395 users in total, and users with forwarding or comment relations account for 64,292 of its 70,893 microblogs; the 6-month data set includes 89,979 users in total, and users with forwarding or comment relations account for 151,427 of its 163,420 microblogs; the 7-month data set includes 119,269 users in total, and users with forwarding or comment relations account for 178,154 of its 188,657 microblogs. FIG. 1a illustrates a user-level dialogue network constructed from the forwarding or comment relations between users.
TABLE 1 microblog data set statistics
(2) Obtaining parallel social context sequences
The user-level dialogue network constructed in the previous step is processed as follows:
and carrying out random walk by taking any node as a starting point to obtain a truncated user sequence. By setting different random walk lengths, neighborhood contexts with different specifications are captured for the user. Suppose S ═ v1,v2,...,vk) Representing a sequence of users, obtained by random walk sampling of length k, the sequence of user contents
Figure BDA0002753033090000062
And user structure sequence
Figure BDA0002753033090000071
Is a pair of parallel social context sequences containing content information and structural information, respectively.
(3) Self-fused network representation
To capture the non-linear correlation between parallel content and structural context within a variable social neighborhood, we train the model according to the following objective function, learning the user sequence embedding s:
L_seq = −Σ_{v_i∈S} Σ_{v_j∈C_i} [ log σ(h_j^T · h_i) + Σ_{l=1}^{L} E_{v_l∼P_n(v)} log σ(−h_l^T · h_i) ]
the meaning of the symbols in the formula is as described above. By training the objective function, the model seamlessly integrates the content information and the structure information of the user and models the influence of different users on the topic of discussion on the sequence.
(4) Topic generation based on neural variational inference
To integrate the intrinsic complementarity between social media content and network structure for topic inference, the user sequence embedding s is fed into the variational auto-encoder and reconstructed according to the following objective function:
L_gen = D_KL(q(z) ‖ p(z|s)) − E_{q(z)}[log p(s|z)]
the meaning of the symbols in the formula is as described above. By training the objective function, the model adaptively balances different importance of content and structure to the topic to generate the topic together.
The model overall objective function is as follows:
L = (1 − λ) · L_seq + λ · L_gen
in the specific implementation process, various hyper-parameters are set in advance, namely the embedding dimension is 200, the weighing coefficient lambda is 0.9, and the number of random walks is 10, so that the topic of microblog data is deduced. In order to capture user neighborhood contexts with different specifications, different random walk step lengths are set on different data sets, specifically, the random walk step length of a data set of 5 months is set to be 7, the random walk step length of a data set of 6 months is set to be 10, and the random walk step length of a data set of 7 months is set to be 3.
To verify the effectiveness of the proposed method, the method of the invention (PCFTM) was compared with currently advanced and representative models (BTM, LCTM, LeadLDA, ForumLDA, IATM) and with two variants of the method (PCFTM (-seq), PCFTM (-fus)).
BTM (Biterm Topic Model) assumes that the two words in a word pair belong to the same topic and performs topic inference by modeling the generation of all word pairs in the entire corpus.
LCTM (Latent Concept Topic Model) introduces word embeddings to enhance the understanding of short text semantics and address the data sparsity problem of short texts. The model also introduces a new latent variable, the concept, to capture the semantic similarity of words, assuming that a topic is a distribution over concepts and a concept is a distribution over word embeddings.
LeadLDA [5] constructs conversation trees from the forwarding and reply relations among microblogs and infers latent topics from the topical dependencies between leader and follower messages on the trees.
ForumLDA models the topic generation process by distinguishing whether the forwarded microblog is related to the original microblog in topic.
The IATM (Interaction-Aware Topic Model) considers text content and dynamic user behaviors in the social network simultaneously, mining topics by modeling dynamic user interactions and different user interests and then applying neural variational inference.
PCFTM (-seq) considers only the first-order proximity of users and fuses content and structure information for topic generation based on neural variational inference.
PCFTM (-fus) considers user neighborhoods of different scales but simply concatenates the content and structure contexts of the user sequence for topic generation based on neural variational inference.
Topic coherence is adopted as the evaluation metric of experimental performance, with the formula:
C(t; V^(t)) = Σ_{m=2}^{M} Σ_{l=1}^{m−1} log [ (D(v_m^(t), v_l^(t)) + 1) / D(v_l^(t)) ]
where V^(t) = (v_1^(t), ..., v_M^(t)) are the top M words of topic t, D(v) is the number of documents containing word v, and D(v, v′) is the number of documents containing both v and v′.
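A minimal sketch of computing such a topic coherence score in the standard document co-occurrence form (the toy corpus is hypothetical, and the exact variant used in the experiments may differ):

```python
import math

def topic_coherence(top_words, docs):
    """Coherence of one topic: sum of log((D(w_m, w_l) + 1) / D(w_l))
    over ordered pairs of the topic's top words, where D counts the
    documents containing the given word(s). Higher is better."""
    doc_sets = [set(d) for d in docs]

    def D(*words):
        return sum(all(w in ds for w in words) for ds in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((D(top_words[m], top_words[l]) + 1) / D(top_words[l]))
    return score

# Hypothetical toy corpus: "a" appears in 3 documents, "a" and "b" co-occur in 1.
docs = [["a", "b"], ["a"], ["a"]]
score = topic_coherence(["a", "b"], docs)  # log((1 + 1) / 3)
```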
tables 2, 3, and 4 show topic coherence results of the model and all comparison methods on three microblog data sets, respectively. For each data set, consistency score values of top 10(N is 10), 15(N is 15), and 20(N is 20) words of the inferred topic when the topic number K is 50 and 100 are recorded. Higher topic continuity score values indicate better performance of the model.
TABLE 2 comparison of Performance of the method of the present invention with other methods on a 5 month dataset
TABLE 3 comparison of Performance of the method of the present invention with other methods on a 6 month dataset
TABLE 4 comparison of Performance of the method of the present invention with other methods on a 7 month dataset
As can be seen from the topic coherence results in Tables 2, 3, and 4, the proposed method achieves a larger performance improvement by modeling user neighborhoods of different scales and capturing the non-linear association between content and structure contexts. To further examine the appropriate proximity settings on different data sets, Figs. 2a to 2c show how the topic coherence score of the method varies with the random walk length on the three microblog data sets, illustrating the effectiveness of the proposed microblog topic mining method based on the fusion of parallel social contexts in a variable neighborhood.
The above contents are intended to schematically illustrate the technical solution of the present invention, and the present invention is not limited to the above described embodiments. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.
Reference documents:
[1] He R, Zhang X, Jin D, et al. Interaction-Aware Topic Model for Microblog Conversations through Network Embedding and User Attention. In: Proc. of the 27th International Conference on Computational Linguistics, 2018: 1398-1409.
[2] Liu J, He Z, Wei L, et al. Content to Node: Self-Translation Network Embedding. In: Proc. of the 24th International Conference on Knowledge Discovery & Data Mining, 2018: 1794-1802.
[3] Liu J, Li N, He Z, et al. Network Embedding with Dual Generation Tasks. In: Proc. of the 28th International Joint Conference on Artificial Intelligence, 2019: 5102-5108.
[4] Srivastava A, Sutton C. Autoencoding Variational Inference for Topic Models. In: Proc. of the 5th International Conference on Learning Representations, 2017.
[5] Li J, Liao M, Gao W, et al. Topic Extraction from Microblog Posts Using Conversation Structures. In: Proc. of the 54th Annual Meeting of the Association for Computational Linguistics, 2016: 1722-1731.

Claims (5)

1. a microblog topic mining method based on parallel social context fusion in a variable neighborhood is characterized by comprising the following steps:
(1) constructing a user-level dialogue network;
(2) parallel social context sequences: setting different random walk lengths to obtain parallel content and structure context sequences containing user proximities of different orders;
(3) self-fused network representation: capturing the non-linear association between text content and network structure, and introducing an attention mechanism to model the influence of different users on the topic along the sequence to obtain the user sequence representation;
(4) topic generation based on neural variational inference: taking the user sequence representation as the input of neural variational inference, and adaptively balancing the intrinsic complementarity between content and structure, thereby mining topics with better coherence.
2. The microblog topic mining method based on parallel social context fusion within a variable neighborhood according to claim 1, wherein step (1) specifically comprises the following steps:
Each user is regarded as a node in the conversation network; all microblogs associated with that user, including source microblogs, comment microblogs and repost microblogs, are aggregated into one document and regarded as the text information of the node; if a microblog or comment interaction exists between two users in the conversation network, the nodes representing those users are connected. Through this operation a user-level conversation network G = (V, E, T) is constructed, where V is the set of nodes in the conversation network G, E ⊆ V × V is the set of edges in the conversation network G, and T is the set of text documents attached to the nodes; v_i denotes the i-th user in V, id_{v_i} denotes the identity of user v_i, and d_{v_i} = (w_1, w_2, ..., w_n) denotes the document of user v_i, where the subscript n is the number of words in the document d_{v_i}.
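The network construction in step (1) can be sketched in plain Python. The interaction records, field names and helper function below are illustrative assumptions for a toy example, not part of the patent:

```python
from collections import defaultdict

def build_conversation_network(interactions):
    """Build a user-level conversation network G = (V, E, T).

    `interactions` is a list of (author, target, text) triples, where
    `target` is None for a source microblog and another user for a
    comment or repost. All of a user's microblogs are merged into one
    document; a comment/repost creates an undirected edge between users.
    """
    nodes = set()
    edges = set()                 # undirected edges as frozensets
    texts = defaultdict(list)     # user -> merged document (word list)
    for author, target, text in interactions:
        nodes.add(author)
        texts[author].extend(text.split())
        if target is not None and target != author:
            nodes.add(target)
            edges.add(frozenset((author, target)))
    return nodes, edges, dict(texts)

# Hypothetical toy interactions: u2 comments on u1, u3 reposts u1.
records = [
    ("u1", None, "flood relief news"),
    ("u2", "u1", "stay safe everyone"),
    ("u3", "u1", "flood relief news reposted"),
]
V, E, T = build_conversation_network(records)
```

Merging all of a user's posts into one node document is what lets the later steps treat each node as carrying both text and structure.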
3. The microblog topic mining method based on parallel social context fusion within a variable neighborhood according to claim 1, wherein step (2) specifically comprises the following steps:
In the user-level conversation network, a truncated user sequence is obtained by performing a random walk starting from an arbitrary node; by setting different random-walk lengths, neighborhood contexts of different scales are captured for each user. Suppose S = (v_1, v_2, ..., v_k) denotes a sampled user sequence; the sampled user content sequence S_c = (d_{v_1}, d_{v_2}, ..., d_{v_k}) and user structure sequence S_s = (v_1, v_2, ..., v_k) form a pair of parallel social context sequences containing content information and structure information respectively, where k denotes the length of the random walk.
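The truncated random walk of step (2) can be sketched as follows; the graph, documents and seed are toy assumptions, and the walk policy (uniform neighbor choice) is a standard default the patent does not pin down:

```python
import random

def random_walk(adj, start, k, seed=0):
    """Sample a truncated random walk of length k starting at `start`.

    `adj` maps each node to its neighbor list; varying k changes the
    neighborhood scale captured for the user, giving the variable
    neighborhood of step (2).
    """
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < k:
        neighbors = adj.get(walk[-1], [])
        if not neighbors:          # dead end: truncate early
            break
        walk.append(rng.choice(neighbors))
    return walk

# Hypothetical conversation graph and node documents.
adj = {"u1": ["u2", "u3"], "u2": ["u1"], "u3": ["u1"]}
docs = {"u1": "flood relief", "u2": "stay safe", "u3": "reposted"}

S = random_walk(adj, "u1", k=5)
# The parallel sequences share the same node order: the structure
# sequence is S itself, the content sequence is each node's document.
S_content = [docs[v] for v in S]
```

Running the sampler twice with different k values yields the different-order user proximities the claim refers to.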
4. The microblog topic mining method based on parallel social context fusion within a variable neighborhood according to claim 1, wherein step (3) specifically comprises the following steps:
Given a user content sequence S_c = (d_{v_1}, d_{v_2}, ..., d_{v_k}), each word w_i in the user's text document d_{v_i} is replaced by its corresponding word embedding w_i ∈ R^{d'}, yielding a text embedding matrix E_i = (w_1, w_2, ..., w_n), where d' denotes the dimension of the word embeddings; convolution and max-pooling operations applied to the text embedding matrix E_i preserve the latent local syntactic and semantic information in the user text and encode it into a user content embedding, see formula (1):
v_i = max(CNN(E_i)) (1)
Through the convolution and pooling operations, the original user content sequence S_c is converted into a user content embedding sequence (v_1, v_2, ..., v_k); the user content embedding sequence, serving as the user structure sequence S_s, is then fed into a bidirectional LSTM that captures for each user the forward and backward context information on the sequence:
h_i^→ = LSTM^→(v_i, h_{i-1}^→) (2)
h_i^← = LSTM^←(v_i, h_{i+1}^←) (3)
In formulas (2) and (3), h_i^→ and h_i^← are the hidden states of the forward and backward LSTM respectively; concatenating h_i^→ and h_i^← gives the user embedding h_i. On the user embedding sequence (h_1, h_2, ..., h_k), the influence of different users on the topic is converted into importance coefficients computed with an attention mechanism, see formula (4):
(α_1, α_2, ..., α_k) = att(h_1, h_2, ..., h_k) (4)
where α_i denotes the contribution of user v_i to the topic and is computed by formula (5): the user embedding h_i is first transformed non-linearly, and its similarity with the user attention vector q is then computed to obtain α_i:
α_i = q^T · tanh(W · h_i + b) (5)
In formula (5), W and b are parameters of the neural network; together with the user attention vector q, they are shared across all user sequences and user embeddings to be learned; tanh(·) is a non-linear activation function. Normalization with a softmax function then yields the importance of user v_i to the topic, see formula (6):
α_i = exp(α_i) / Σ_{v_j∈N_i} exp(α_j) (6)
In formula (6), N_i denotes the neighborhood of user v_i (including v_i itself), and α_j denotes the importance of neighbor v_j to the topic. By weighting and summing all user embeddings on the sequence, the user sequence embedding s is obtained, which captures the non-linear correlation between the parallel content and structure contexts within the variable social neighborhood, see formula (7):
s = Σ_{v_i∈N} α_i · h_i (7)
where N denotes all users on the current sequence. To obtain the user sequence embedding s, the following objective function is minimized:
L_seq = -Σ_{v_i∈S} Σ_{v_j∈C_i} log p(v_j | v_i) (8)
In formula (8), L_seq is the loss function for learning the user sequence embedding s; C_i = {v_j | v_j ∈ N_i, |j - i| ≤ c, j ≠ i} denotes the sequence-based neighbors of user v_i, where c is the window size; p(v_j | v_i), the conditional probability of neighbor v_j given user v_i, is formalized as formula (9):
p(v_j | v_i) = exp(h_j^T · h_i) / Σ_{v_k∈C_i} exp(h_k^T · h_i) (9)
where h_k is the embedding of an arbitrary sequence-based neighbor v_k ∈ C_i of user v_i; negative sampling is used to reduce the computational cost of the conditional probability in formula (9), giving the optimized objective function:
L_seq = -Σ_{v_i∈S} Σ_{v_j∈C_i} [log σ(h_j^T · h_i) + Σ_{l=1}^{L} E_{v_l∼P_n(v)} log σ(-h_l^T · h_i)] (10)
where σ(x) = 1/(1 + exp(-x)) denotes the sigmoid function and L denotes the number of negative samples.
5. The microblog topic mining method based on parallel social context fusion within a variable neighborhood according to claim 1, wherein step (4) specifically comprises the following steps:
Neural variational inference is used to infer the document-topic distribution θ_d = (p(t_1|d), p(t_2|d), ..., p(t_K|d)) and the topic-word distribution φ_w = (p(w|t_1), p(w|t_2), ..., p(w|t_K)), where d denotes a document, t_i denotes the i-th topic, K denotes the number of topics and w denotes a word; p(t_i|d) (i = 1, 2, ..., K) denotes the probability that document d belongs to the i-th topic, and p(w|t_i) (i = 1, 2, ..., K) denotes the probability that word w belongs to the i-th topic.
Document-topic distribution: given a user sequence embedding s, it is first mapped into a non-linear latent space h_enc:
h_enc = ReLU(W_h · s + b_h) (11)
where W_h and b_h are parameters of the encoder and ReLU is a non-linear activation function. Assuming that both the prior and the posterior distribution of the user sequence embedding s are Gaussian, the mean μ and variance σ² of the posterior Gaussian are obtained by linear transformations, see formulas (12) and (13):
μ = W_μ · h_enc + b_μ (12)
log(σ²) = W_σ · h_enc + b_σ (13)
where W_μ, b_μ, W_σ and b_σ are parameters of the encoder.
The latent semantic vector z is then obtained with the reparameterization trick, formalized as formula (14):
z = μ + ε × σ (14)
where ε is sampled from the Gaussian distribution N(0, I); normalizing the latent semantic vector z with a softmax function yields the document-topic distribution θ_d.
Topic-word distribution: the topic-word distribution φ_w can be regarded as a parameter of the decoder, see formula (15):
h_dec = softmax(φ_w × (θ_d)^T) (15)
The user sequence embedding s is then reconstructed by the decoder to obtain the reconstructed user sequence embedding s', computed as formula (16), where W_d and b_d are parameters of the decoder:
s' = ReLU(W_d · h_dec + b_d) (16)
For topic generation, the objective function of this part is formula (17):
L_gen = D_KL(q(z) ‖ p(z|s)) - E_{q(z)}[log p(s|z)] (17)
In formula (17), L_gen denotes the loss function for learning the document-topic and topic-word distributions; the KL divergence measures the closeness between the prior distribution q(z) and the true posterior distribution p(z|s), where q(z) is the prior Gaussian N(0, I).
Combining formulas (10) and (17), the overall objective function for mining latent microblog topics is defined as formula (18), where λ is a hyper-parameter trading off L_seq against L_gen:
L = (1 - λ)·L_seq + λ·L_gen (18).
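The reparameterization of formula (14), the softmax normalization that produces θ_d, and the combined objective of formula (18) can be sketched in pure Python; the toy values of μ, log σ², L_seq, L_gen and λ below are illustrative, not from the patent:

```python
import math
import random

def reparameterize(mu, log_sigma2, rng):
    """z = mu + eps * sigma, eps ~ N(0, I), as in formula (14)."""
    return [m + rng.gauss(0.0, 1.0) * math.exp(0.5 * ls)
            for m, ls in zip(mu, log_sigma2)]

def softmax(z):
    """Normalize z into a probability vector (the document-topic
    distribution theta_d in the method)."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def total_loss(l_seq, l_gen, lam):
    """Overall objective L = (1 - lambda)*L_seq + lambda*L_gen
    from formula (18)."""
    return (1.0 - lam) * l_seq + lam * l_gen

rng = random.Random(0)
mu, log_sigma2 = [0.2, -0.1, 0.4], [0.0, 0.0, 0.0]  # K = 3 toy topics
theta_d = softmax(reparameterize(mu, log_sigma2, rng))
L = total_loss(l_seq=1.5, l_gen=0.5, lam=0.3)
```

Sampling ε outside the network parameters is what keeps the stochastic layer differentiable, so the encoder producing μ and log σ² can be trained by backpropagation.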
CN202011192126.5A 2020-10-30 2020-10-30 Microblog topic mining method based on fusion of parallel social contexts in variable neighborhood Pending CN112199607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011192126.5A CN112199607A (en) 2020-10-30 2020-10-30 Microblog topic mining method based on fusion of parallel social contexts in variable neighborhood


Publications (1)

Publication Number Publication Date
CN112199607A 2021-01-08

Family

ID=74012162


Country Status (1)

Country Link
CN (1) CN112199607A (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033069A (en) * 2018-06-16 2018-12-18 天津大学 A kind of microblogging Topics Crawling method based on Social Media user's dynamic behaviour
CN109684646A (en) * 2019-01-15 2019-04-26 江苏大学 A kind of microblog topic sentiment analysis method based on topic influence


Non-Patent Citations (1)

Title
HONGYU LIU et al.: "Fusing Parallel Social Contexts within Flexible-Order Proximity for Microblog Topic Detection", Proceedings of the 29th ACM International Conference on Information & Knowledge Management *

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN113449849A (en) * 2021-06-29 2021-09-28 桂林电子科技大学 Learning type text hash method based on self-encoder
CN113449849B (en) * 2021-06-29 2022-05-27 桂林电子科技大学 Learning type text hash method based on self-encoder
CN113870040A (en) * 2021-09-07 2021-12-31 天津大学 Double-flow graph convolution network microblog topic detection method fusing different propagation modes
CN113870040B (en) * 2021-09-07 2024-05-21 天津大学 Double-flow chart convolution network microblog topic detection method integrating different propagation modes
CN115879515A (en) * 2023-02-20 2023-03-31 江西财经大学 Document network theme modeling method, variation neighborhood encoder, terminal and medium

Similar Documents

Publication Publication Date Title
CN112364161B (en) Microblog theme mining method based on dynamic behaviors of heterogeneous social media users
CN109033069B (en) Microblog theme mining method based on social media user dynamic behaviors
CN112199607A (en) Microblog topic mining method based on fusion of parallel social contexts in variable neighborhood
CN111914185B (en) Text emotion analysis method in social network based on graph attention network
CN108681557B (en) Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint
Sang et al. Context-dependent propagating-based video recommendation in multimodal heterogeneous information networks
Chen et al. Zero-shot text classification via knowledge graph embedding for social media data
Piao et al. Sparse structure learning via graph neural networks for inductive document classification
Dritsas et al. An apache spark implementation for graph-based hashtag sentiment classification on twitter
Fu et al. Improving distributed word representation and topic model by word-topic mixture model
Li et al. Sentiment analysis of Weibo comments based on graph neural network
Yang et al. PostCom2DR: Utilizing information from post and comments to detect rumors
CN110889505B (en) Cross-media comprehensive reasoning method and system for image-text sequence matching
Ma et al. A time-series based aggregation scheme for topic detection in Weibo short texts
Zhou Research on sentiment analysis model of short text based on deep learning
Sang et al. AAANE: Attention-based adversarial autoencoder for multi-scale network embedding
Zhang et al. Exploring coevolution of emotional contagion and behavior for microblog sentiment analysis: a deep learning architecture
Zhu et al. Intuitive topic discovery by incorporating word-pair's connection into LDA
Richardson et al. Integrating summarization and retrieval for enhanced personalization via large language models
Dai et al. REVAL: Recommend Which Variables to Log With Pretrained Model and Graph Neural Network
CN113343118A (en) Hot event discovery method under mixed new media
Lee et al. Overwhelmed by fear: emotion analysis of COVID-19 Vaccination Tweets
CN113870040B (en) Double-flow chart convolution network microblog topic detection method integrating different propagation modes
Bai et al. Low-rank multimodal fusion algorithm based on context modeling
Liu A comparative study of vector space language models for sentiment analysis using reddit data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210108