CN112199607A - Microblog topic mining method based on fusion of parallel social contexts in variable neighborhood - Google Patents
- Publication number: CN112199607A
- Application number: CN202011192126.5A
- Authority: CN (China)
- Prior art keywords: user, sequence, topic, embedding, content
- Legal status: Pending
Classifications
- G06F16/9536 — Retrieval from the web; search customisation based on social or collaborative filtering
- G06F40/211 — Natural language analysis; syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/30 — Semantic analysis
- G06N3/044 — Neural network architectures; recurrent networks, e.g. Hopfield networks
- G06N3/045 — Neural network architectures; combinations of networks
Abstract
The invention discloses a microblog topic mining method based on the fusion of parallel social contexts within a variable neighborhood, which comprises the following steps: (1) construct a user-level dialogue network; (2) sample parallel social context sequences: set different random walk lengths to obtain parallel content and structure context sequences containing user proximities of different orders; (3) build a self-fused network representation: capture the nonlinear association between text content and network structure, and introduce an attention mechanism to model the influence of different users in the sequence on the topic, obtaining the user sequence representation; (4) generate topics by neural variational inference: the user sequence representation serves as the input of neural variational inference, which adaptively balances the intrinsic complementarity between content and structure, so that topics with better coherence are mined.
Description
Technical Field
The invention relates to the technical fields of natural language processing and social media data mining, and in particular to a microblog topic mining method based on the fusion of parallel social contexts within a variable neighborhood.
Background
The emergence of social media websites (e.g., Sina Weibo) has dramatically changed the form of content on the Internet. Microblogging platforms allow users to publish and browse information and offer strong social functions such as forwarding and commenting. They store huge amounts of text data that grow at a striking rate every day. Microblog text contains a wealth of information, and the topics mined from it can be used for topic recommendation, emergency detection, precision marketing, and more. Existing text topic mining techniques work well on text such as news and articles. However, microblog posts are short, generally limited to 140 characters, and their information sparsity and casual wording greatly increase the difficulty of processing them. Topic mining for microblogs therefore requires methods different from traditional topic mining.
Current research on microblog topic mining falls mainly into three lines. (1) Exploiting co-occurrence patterns across documents. These methods aggregate short messages into long pseudo-documents according to heuristic rules such as authors and hashtags or according to the topical attributes of the texts, and then mine latent topics with a three-layer Bayesian topic model; alternatively, they directly model the generation of word pairs over the whole corpus to alleviate the data sparsity of short texts. (2) Exploiting short-text semantics. These methods use the rich semantic information carried by word embeddings, treat a short text as a set of word embeddings, assume the topic-word distribution to be a multivariate Gaussian, and then infer topics with a hierarchical Bayesian model; or they integrate the semantic associations between words and their contexts in the short text to model topics, understanding short-text semantics more deeply to some extent. (3) Exploiting social network context. These methods introduce structural features of the social network, such as user-forwarding and user-follower networks, supplementing the microblog text content with static context information so as to find more word co-occurrence features; or they introduce the dynamic context of the social network and infer topics by mining user behavior features such as dynamic interactions between users and the different concerns of users.
Although the above methods achieve good performance, they model only the microblog text content, or at most also consider the first-order proximity of the social network, ignoring the impact that a larger user neighborhood in a microblog conversation has on topic inference. In a larger user neighborhood, users may talk about highly related topics, and users discussing the same topic may hold similar opinions about it. In addition, learning independent content and structure representations for a user [1] ignores the nonlinear association between the two, and concatenating the content and structure representations with equal weight ignores their different importance in topic inference. These observations provide useful clues for social-media-based microblog topic mining.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a microblog topic mining method based on the fusion of parallel social contexts within a variable neighborhood. To capture neighborhood contexts of different sizes, the method constructs a user-level dialogue network from forwarding and comment relations, where nodes represent users and edges represent forwarding or comment relations between pairs of users. By setting different random walk lengths, user proximities of different orders are first modeled; a self-fused network is then designed to capture the nonlinear association between the parallel content and structure contexts within the neighborhood, and an attention mechanism is introduced to model the different influences of the users in a sequence on the topic; the two are finally integrated seamlessly into a user sequence embedding, and microblog topics with better coherence are generated through neural variational inference. Compared with existing models, the microblog topics mined by this method are optimal under the topic coherence score evaluation index.
The purpose of the invention is realized by the following technical scheme:
A microblog topic mining method based on the fusion of parallel social contexts within a variable neighborhood comprises the following steps:
(1) building user-level dialog networks
Each user is regarded as a node of the dialogue network, and all microblogs related to the user, including source microblogs, comment microblogs, and forwarded microblogs, are aggregated into one document regarded as the text information of that node. If forwarding or comment relations exist between two users in the network, the corresponding nodes are connected by an edge. This yields the user-level dialogue network G = (V, E, T), where V is the set of nodes in G, E ⊆ V × V is the set of edges in G, and T is the set of text documents attached to the nodes. v_i denotes the i-th user in V, id_i denotes the identity of user v_i, and d_i = (w_1, w_2, ..., w_n) denotes the document of user v_i, where n is the number of words in d_i.
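The construction step above can be sketched in a few lines. This is a minimal illustration with plain dictionaries and hypothetical names (`build_dialogue_network`, `interactions`, `user_posts` are not from the patent); a real pipeline would work over the raw microblog records.

```python
from collections import defaultdict

def build_dialogue_network(interactions, user_posts):
    """Build a user-level dialogue network G = (V, E, T).

    interactions: iterable of (user_a, user_b) forwarding/comment pairs.
    user_posts:   dict user -> list of microblog texts (source, comment, forward).
    Returns (V, E, T): node set, undirected edge set, per-user aggregated document.
    """
    V = set(user_posts)
    E = set()
    for a, b in interactions:
        V.update((a, b))
        if a != b:
            E.add(tuple(sorted((a, b))))   # one undirected edge per user pair
    # Aggregate all microblogs of a user into a single document T[v]
    T = {v: " ".join(user_posts.get(v, [])) for v in V}
    return V, E, T
```

A user with several posts thus contributes one longer document, which is exactly the aggregation the patent uses to fight short-text sparsity.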
(2) Parallel social context sequences: setting different random walk lengths to obtain parallel content and structure context sequences containing user proximities of different orders
In the user-level dialogue network, a random walk starting from an arbitrary node produces a truncated user sequence. By setting different random walk lengths, neighborhood contexts of different sizes are captured for a user. Suppose S = (v_1, v_2, ..., v_k) represents a sampled user sequence; then the user content sequence S_c = (d_1, d_2, ..., d_k) and the user structure sequence S_s = (id_1, id_2, ..., id_k) form a pair of parallel social context sequences, containing content information and structure information respectively, where k denotes the length of the random walk.
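The truncated random walks can be sampled as follows; a minimal sketch with hypothetical names (`adj`, `random_walks`), where varying `walk_length` is what captures neighborhood contexts of different orders.

```python
import random

def random_walks(adj, walk_length, walks_per_node, seed=0):
    """Sample truncated random walks of fixed length from every node.

    adj: dict node -> list of neighbour nodes (the dialogue network).
    Each walk is a user sequence S = (v_1, ..., v_k); a walk is cut short
    only when it reaches a node with no neighbours.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in adj:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break                      # dead end: truncate early
                walk.append(rng.choice(nbrs))  # uniform next-hop choice
            walks.append(walk)
    return walks
```

Each returned walk is one user sequence S, from which the parallel content sequence S_c and structure sequence S_s are read off.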
(3) Self-fused network representation: capturing the nonlinear association between text content and network structure [2], and introducing an attention mechanism to model the influence of different users in the sequence on the topic, to obtain the user sequence representation
Given a user content sequence S_c, each word w_i in a user document d_i is replaced by its corresponding word embedding w_i ∈ R^{d'}, yielding a text embedding matrix E_i = (w_1, w_2, ..., w_n), where d' denotes the dimension of the word embeddings. Convolution and max-pooling operations over E_i preserve latent local syntactic and semantic information in the user text [3] and encode it into a user content embedding, see equation (1):

v_i = max(CNN(E_i)) (1)
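Equation (1) can be sketched with a plain numpy 1-D convolution followed by max pooling over positions. This is an assumption-laden illustration (single filter bank, ReLU nonlinearity, names `content_embedding`, `W_conv` are hypothetical), not the patent's exact architecture.

```python
import numpy as np

def content_embedding(E, W_conv, b_conv, window=3):
    """Sketch of v_i = max(CNN(E_i)): one 1-D convolution + max pooling.

    E:      (n, d') text embedding matrix of one user's document.
    W_conv: (window * d', f) convolution filters; b_conv: (f,) bias.
    Returns an (f,)-dimensional user content embedding.
    """
    n, dp = E.shape
    feats = []
    for t in range(n - window + 1):
        patch = E[t:t + window].reshape(-1)                     # local word window
        feats.append(np.maximum(patch @ W_conv + b_conv, 0.0))  # ReLU feature map
    return np.max(np.stack(feats), axis=0)                      # max pool over positions
```

Max pooling keeps, per filter, the strongest local syntactic/semantic response anywhere in the document, which is what makes the embedding length-independent.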
Through the convolution and pooling operations, the original user content sequence S_c is converted into the content embedding sequence (v_1, v_2, ..., v_k). Used as the representation of the user structure sequence S_s, it is then fed into a bidirectional LSTM that captures forward context and backward context information over the sequence for each user:

h_i^f = LSTM_f(v_i, h_{i-1}^f) (2)
h_i^b = LSTM_b(v_i, h_{i+1}^b) (3)
In equations (2) and (3), h_i^f and h_i^b are the hidden states of the forward and backward LSTM respectively; concatenating h_i^f and h_i^b yields the user embedding h_i. Over the user embedding sequence (h_1, h_2, ..., h_k), the influence of different users on the topic is converted into importance coefficients computed with an attention mechanism, see equation (4):

(α_1, α_2, ..., α_k) = att(h_1, h_2, ..., h_k) (4)
In equation (4), α_i represents the contribution of user v_i to the topic and is computed as in equation (5): the user embedding h_i is first nonlinearly transformed, and its similarity with the user attention vector q is then computed to obtain α_i:

α_i = q^T · tanh(W · h_i + b) (5)
In equation (5), W and b are neural network parameters that, together with the user attention vector q, are shared across all user sequences and user embeddings, and tanh(·) is a nonlinear activation function. Further normalization through a softmax function gives the importance β_i of user v_i to the topic, see equation (6):

β_i = exp(α_i) / Σ_{v_j ∈ N_i} exp(α_j) (6)

In equation (6), N_i represents the sequence-based neighbors of user v_i, including v_i itself, and α_j represents the importance of neighbor v_j to the topic. Weighting and summing all user embeddings on the sequence yields a user sequence embedding s that captures the nonlinear association between the parallel content and structure contexts within the variable social neighborhood, see equation (7):

s = Σ_{v_i ∈ N} β_i · h_i (7)
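The attention steps of equations (4) to (7) can be sketched in numpy as follows; `sequence_embedding` and the parameter shapes are illustrative assumptions.

```python
import numpy as np

def sequence_embedding(H, W, b, q):
    """Attention over user embeddings: score, softmax-normalise, weighted sum.

    H: (k, d) user embeddings h_1..h_k from the BiLSTM.
    Raw score alpha_i = q^T tanh(W h_i + b), softmax over the sequence,
    then s = sum_i weight_i * h_i is the user sequence embedding.
    """
    scores = np.tanh(H @ W.T + b) @ q              # (k,) raw importance coefficients
    scores = scores - scores.max()                 # numerical stability for softmax
    alpha = np.exp(scores) / np.exp(scores).sum()  # normalised importances
    return alpha @ H, alpha                        # weighted sum and the weights
```

The softmax makes the per-user weights comparable across sequences of different lengths, so users that matter more to the topic dominate the pooled embedding s.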
In equation (7), N represents all users on the current sequence. To obtain the user sequence embedding s, the following objective function is minimized:

L_seq = − Σ_{v_i ∈ S} Σ_{v_j ∈ C_i} log p(v_j | v_i) (8)

In equation (8), L_seq denotes the loss function for learning the user sequence embedding s, and C_i = {v_j | v_j ∈ N_i, |j − i| ≤ c, j ≠ i} represents the window context of user v_i, where c is the window size. p(v_j | v_i) denotes the conditional probability of neighbor v_j given user v_i, formalized as equation (9):

p(v_j | v_i) = exp(h_j · h_i) / Σ_{v_k ∈ C_i} exp(h_k · h_i) (9)
where h_k is the embedding of an arbitrary sequence-based neighbor v_k ∈ C_i of user v_i. A negative sampling technique further reduces the computational cost of the conditional probability in equation (9), yielding the following optimized objective function, see equation (10):

L_seq = − Σ_{v_i ∈ S} Σ_{v_j ∈ C_i} [ log σ(h_j · h_i) + Σ_{l=1}^{L} E_{v_l ∼ P_n(v)} log σ(−h_l · h_i) ] (10)

where σ(x) = 1/(1 + exp(−x)) denotes the sigmoid function and L denotes the number of negative samples.
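A single term of the negative-sampling objective can be sketched as below; a minimal illustration assuming plain dot-product similarities (the function name is hypothetical).

```python
import numpy as np

def neg_sampling_loss(h_i, h_j, h_negs):
    """One skip-gram term with negative sampling, returned as a loss to minimise.

    Rewards sigma(h_j . h_i) for a true sequence neighbour v_j and
    sigma(-h_l . h_i) for each of the L sampled negatives h_negs.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = -np.log(sigmoid(h_j @ h_i))        # pull the true neighbour close
    for h_l in h_negs:                        # push L negatives away
        loss -= np.log(sigmoid(-h_l @ h_i))
    return loss
```

Summing this term over all users and their window contexts gives the full objective; the negatives replace the expensive normalisation over the whole context.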
(4) Topic generation based on neural variational inference: the user sequence representation serves as the input to neural variational inference [4], which adaptively balances the intrinsic complementarity between content and structure, so that topics with better coherence are mined.
Neural variational inference is used to infer the document-topic distribution θ_d = (p(t_1|d), p(t_2|d), ..., p(t_K|d)) and the topic-word distribution φ_w = (p(w|t_1), p(w|t_2), ..., p(w|t_K)), where d denotes a document, t_i denotes the i-th topic, K denotes the number of topics, and w denotes a word. p(t_i|d) (i = 1, 2, ..., K) is the probability that document d belongs to the i-th topic, and p(w|t_i) (i = 1, 2, ..., K) is the probability that word w belongs to the i-th topic.
Document-topic distribution: given a user sequence embedding s, it is first mapped into a nonlinear hidden space h_enc:

h_enc = ReLU(W_h · s + b_h) (11)
where W_h and b_h are encoder parameters and ReLU is a nonlinear activation function. Assuming that both the prior and the posterior distribution of the user sequence embedding s are Gaussian, the mean μ and variance σ² of the posterior Gaussian are obtained by linear transformations, see equations (12) and (13):

μ = W_μ · h_enc + b_μ (12)
log(σ²) = W_σ · h_enc + b_σ (13)

where W_μ, b_μ, W_σ, and b_σ are parameters of the encoder.
The latent semantic vector z is then obtained with the reparameterization trick, formalized as equation (14):

z = μ + ε × σ (14)

In equation (14), ε is sampled from the Gaussian distribution N(0, I). Normalizing the latent semantic vector z with a softmax function yields the document-topic distribution θ_d.
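Equations (11) to (14) amount to an encoder with the reparameterization trick; a minimal numpy sketch follows, with hypothetical names and shapes.

```python
import numpy as np

def encode_theta(s, W_h, b_h, W_mu, b_mu, W_sig, b_sig, rng):
    """Encoder sketch: h_enc = ReLU(W_h s + b_h), Gaussian mu / log-variance
    by linear maps, z = mu + eps * sigma with eps ~ N(0, I), then softmax(z)
    gives the document-topic distribution theta_d.
    """
    h_enc = np.maximum(W_h @ s + b_h, 0.0)
    mu = W_mu @ h_enc + b_mu
    log_var = W_sig @ h_enc + b_sig
    eps = rng.standard_normal(mu.shape)       # reparameterization: sample once
    z = mu + eps * np.exp(0.5 * log_var)      # differentiable w.r.t. mu, sigma
    theta = np.exp(z - z.max())
    return theta / theta.sum()                # softmax -> theta_d
```

Because the randomness lives only in eps, gradients flow through mu and sigma, which is the whole point of the reparameterization trick.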
Topic-word distribution: the topic-word distribution φ_w can be regarded as part of the decoder parameters, see equation (15):

h_dec = softmax(φ_w × (θ_d)^T) (15)
The decoder reconstructs the user sequence embedding s to obtain the reconstructed embedding s′, computed as equation (16), where W_d and b_d are decoder parameters:

s′ = ReLU(W_d · h_dec + b_d) (16)
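The decoder side of equations (15) and (16) can be sketched as below; shapes and names (`decode`, `phi_w` as a K×V matrix) are illustrative assumptions.

```python
import numpy as np

def decode(theta_d, phi_w, W_d, b_d):
    """Decoder sketch: mix topic-word rows by theta_d, softmax-normalise,
    then a ReLU layer reconstructs the sequence embedding s'.

    theta_d: (K,) document-topic distribution.
    phi_w:   (K, V) topic-word distribution, treated as decoder parameters.
    """
    logits = phi_w.T @ theta_d                  # (V,) expected word scores
    h_dec = np.exp(logits - logits.max())
    h_dec = h_dec / h_dec.sum()                 # softmax over the vocabulary
    return np.maximum(W_d @ h_dec + b_d, 0.0)   # reconstructed embedding s'
```

Training the reconstruction error of s′ against s forces θ_d and φ_w to jointly explain the fused content-plus-structure evidence carried by s.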
for topic generation, the objective function for this section is formula (17):
in formula (17), LgenThe closeness of the variational distribution q (z), which is a prior gaussian distribution N (0, I), to the true posterior distribution p (zs) is measured using KL divergence, expressed as loss function values for learning document-topic distributions and topic-term distributions.
Combining equations (10) and (17), the overall objective function for mining latent microblog topics is defined as equation (18), where λ is a hyperparameter trading off L_seq and L_gen:
L=(1-λ)Lseq+λLgen (18)
compared with the prior art, the technical scheme of the invention has the following beneficial effects:
(1) in order to solve the problems of sparse microblog text data, random word use and the like, the method simultaneously considers social media content and a social network topological structure, and enriches the context information of the microblog text;
(2) in order to capture user proximities of different orders, the method sets different random walk lengths on the dialogue networks constructed from different data sets, capturing neighborhood contexts of different sizes for users according to their different interaction characteristics;
(3) in order to capture the nonlinear association between the text content and the network structure in the user neighborhood, the method utilizes a self-fused network representation and an attention mechanism to seamlessly integrate the text content and the network structure into a user sequence for embedding;
(4) in order to generate topics with better coherence, the method feeds the user sequence embedding, which captures the nonlinear association between the parallel content and structure contexts within the variable social neighborhood, into neural variational inference, adaptively balancing the intrinsic complementarity of the two and their different importance in topic inference to generate topics jointly;
(5) experimental results on real Sina Weibo data sets show the effectiveness of the method, demonstrating that capturing user proximities of different orders and modeling the nonlinear association between content and structure benefit microblog topic mining.
Drawings
Fig. 1a and Fig. 1b are framework diagrams of the microblog topic mining method based on the fusion of parallel social contexts within a variable neighborhood provided by the invention; the left part of Fig. 1a shows the construction of the user-level dialogue network in the embodiment; the right part of Fig. 1a shows the acquisition of the parallel social context sequences; the left part of Fig. 1b shows the self-fused network representation; the right part of Fig. 1b shows topic generation based on neural variational inference;
FIG. 2a shows the variation of the topic coherence score with the random walk length on the May data set in the embodiment;
FIG. 2b shows the variation of the topic coherence score with the random walk length on the June data set in the embodiment;
FIG. 2c shows the variation of the topic coherence score with the random walk length on the July data set in the embodiment.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and the detailed description. It should be understood that the embodiments described herein are only for illustrating the present invention and are not to be construed as limiting the present invention.
The specific implementation of the present invention is illustrated with 3 real microblog data sets; the overall framework of the method is shown in Figs. 1a and 1b. The overall algorithm comprises four steps: user-level dialogue network input, acquisition of parallel social context sequences containing user proximities of different orders, self-fused network representation capturing the nonlinear association between content and structure, and topic generation based on neural variational inference.
The method comprises the following specific steps:
(1) user-level conversational web input
On the Sina Weibo platform, prior work used the hashtag-search application programming interface to collect microblogs covering 50 hot topics during the three months of May, June, and July 2014. The invention takes these 3 public real microblog data sets as the original corpora and preprocesses them as follows to construct the user-level dialogue networks: 1) filter out users without forwarding or comment relations; 2) aggregate all microblogs related to the same user, including source microblogs, forwarded microblogs, and comment microblogs, into one document as the text information of the node for that user.
Table 1 shows detailed statistics of the three data sets: the May data set contains 44395 users in total, with 64292 forwarding or comment relations and 70893 microblogs among the users having such relations; the June data set contains 89979 users in total, with 151427 forwarding or comment relations and 163420 microblogs; the July data set contains 119269 users in total, with 178154 forwarding or comment relations and 188657 microblogs. FIG. 1a illustrates a user-level dialogue network constructed from the forwarding or comment relations between users.
TABLE 1 microblog data set statistics
(2) Obtaining parallel social context sequences
And (3) processing the user-level dialogue network constructed in the previous step as follows:
and carrying out random walk by taking any node as a starting point to obtain a truncated user sequence. By setting different random walk lengths, neighborhood contexts with different specifications are captured for the user. Suppose S ═ v1,v2,...,vk) Representing a sequence of users, obtained by random walk sampling of length k, the sequence of user contentsAnd user structure sequenceIs a pair of parallel social context sequences containing content information and structural information, respectively.
(3) Self-converged network representation
To capture the nonlinear association between the parallel content and structure contexts within the variable social neighborhood, the model is trained to learn the user sequence embedding s by minimizing the objective function L_seq of equation (10); the meaning of the symbols is as described above. By training this objective, the model seamlessly integrates the content information and structure information of users and models the influence of different users in the sequence on the topic under discussion.
(4) Topic generation based on neural variational reasoning
To integrate the intrinsic complementarity between social media content and network structure for topic inference, the user sequence embedding s is fed into the variational autoencoder and reconstructed, trained with the objective function L_gen of equation (17); the meaning of the symbols is as described above. By training this objective, the model adaptively balances the different importance of content and structure to the topic to generate topics jointly.
The model overall objective function is as follows:
L=(1-λ)Lseq+λLgen
in the specific implementation process, various hyper-parameters are set in advance, namely the embedding dimension is 200, the weighing coefficient lambda is 0.9, and the number of random walks is 10, so that the topic of microblog data is deduced. In order to capture user neighborhood contexts with different specifications, different random walk step lengths are set on different data sets, specifically, the random walk step length of a data set of 5 months is set to be 7, the random walk step length of a data set of 6 months is set to be 10, and the random walk step length of a data set of 7 months is set to be 3.
To verify the effectiveness of the method of the invention, the method of the invention (PCFTM) was compared with currently advanced and representative models (BTM, LCTM, LeadLDA, ForumLDA, IATM) and two variants of the method of the invention (PCFTM (-seq), PCFTM (-fus)).
BTM (Biterm Topic Model) assumes that both words of a word pair belong to the same topic, and performs topic inference by modeling the generation of all word pairs in the entire corpus.
LCTM (Latent Concept Topic Model) introduces word embeddings to enhance the understanding of short-text semantics and alleviate the data sparsity problem of short texts. The model also introduces a new latent variable, the concept, to capture the semantic similarity of words, assuming that a topic is a distribution over concepts and a concept is a distribution over word embeddings.
LeadLDA [5] constructs a conversation tree from the forwarding and reply relations among microblogs, and then infers latent topics from the topical dependency between leader messages and follower messages on the conversation tree.
ForumLDA models the topic generation process by distinguishing whether a forwarded microblog is topically related to the original microblog.
IATM (Interaction-Aware Topic Model) considers text content and dynamic user behavior in the social network simultaneously, mining topics by modeling dynamic user interactions and the different concerns of users, followed by neural variational inference.
The PCFTM (-seq) considers only the first-order proximity of the user and fuses content and structural information for topic generation based on neural variational reasoning.
The PCFTM (-fus) considers user neighborhoods of different specifications, but simply combines the content and structural context of the user sequence for topic generation based on neuro-variational reasoning.
The evaluation index of the experiments is the topic coherence score, which measures how well the top words of each inferred topic co-occur in the corpus; higher scores indicate more coherent topics.
tables 2, 3, and 4 show topic coherence results of the model and all comparison methods on three microblog data sets, respectively. For each data set, consistency score values of top 10(N is 10), 15(N is 15), and 20(N is 20) words of the inferred topic when the topic number K is 50 and 100 are recorded. Higher topic continuity score values indicate better performance of the model.
TABLE 2 comparison of Performance of the method of the present invention with other methods on a 5 month dataset
TABLE 3 comparison of Performance of the method of the present invention with other methods on a 6 month dataset
TABLE 4 comparison of Performance of the method of the present invention with other methods on a 7 month dataset
As can be seen from the topic coherence results in Tables 2, 3, and 4, the proposed method achieves a clear performance improvement by modeling user neighborhoods of different sizes and capturing the nonlinear association between content and structure contexts. To further examine the appropriate proximity settings on different data sets, Figs. 2a to 2c show how the topic coherence score varies with the random walk length on the three microblog data sets, illustrating the effectiveness of the microblog topic mining method based on the fusion of parallel social contexts within a variable neighborhood provided by the invention.
The above contents are intended to schematically illustrate the technical solution of the present invention, and the present invention is not limited to the above described embodiments. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.
Reference documents:
[1] He R, Zhang X, Jin D, et al. Interaction-Aware Topic Model for Microblog Conversations through Network Embedding and User Attention. In: Proc. of the 27th International Conference on Computational Linguistics, 2018: 1398-1409.
[2] Liu J, He Z, Wei L, et al. Content to Node: Self-translation Network Embedding. In: Proc. of the 24th International Conference on Knowledge Discovery & Data Mining, 2018: 1794-1802.
[3] Liu J, Li N, He Z, et al. Network Embedding with Dual Generation Tasks. In: Proc. of the 28th International Joint Conference on Artificial Intelligence, 2019: 5102-5108.
[4] Srivastava A, Sutton C. Autoencoding Variational Inference for Topic Models. In: Proc. of the 5th International Conference on Learning Representations, 2017.
[5] Li J, Liao M, Gao W, et al. Topic Extraction from Microblog Posts Using Conversation Structures. In: Proc. of the 54th Annual Meeting of the Association for Computational Linguistics, 2016: 1722-1731.
Claims (5)
1. a microblog topic mining method based on parallel social context fusion in a variable neighborhood is characterized by comprising the following steps:
(1) constructing a user-level dialogue network;
(2) parallel social context sequences: setting different random walk lengths to obtain parallel content and structure context sequences containing user proximities of different orders;
(3) self-fused network representation: capturing the nonlinear association between text content and network structure, and introducing an attention mechanism to model the influence of different users in the sequence on the topic, to obtain the user sequence representation;
(4) topic generation based on neural variational inference: taking the user sequence representation as the input of neural variational inference and adaptively balancing the intrinsic complementarity between content and structure, so as to mine topics with better coherence.
2. The microblog topic mining method based on the variable neighborhood parallel social context fusion according to claim 1, wherein the step (1) specifically comprises the following steps:
the method comprises the steps that users are regarded as nodes in a conversation network, all microblogs related to the corresponding users, including source microblogs, comment microblogs and forwarding microblogs, are gathered into a document and regarded as text information of the nodes, and if the microblogs or comments exist among the users in the conversation network, the nodes representing the users are connected; constructing a user-level dialogue network G ═ (V, E, T) by the operation, wherein V is the collection of nodes in the dialogue network G,is the set of edges in the dialogue network G, and T is the set of text information attached by the nodes; v. ofiIndicating the ith user in the V,representing a user viThe identity of (a) of (b),representing a user viWherein the subscript n is the documentThe number of words in (2).
3. The microblog topic mining method based on the variable neighborhood parallel social context fusion according to claim 1, wherein the step (2) specifically comprises the following steps:
in the user-level dialogue network, a random walk starting from an arbitrary node produces a truncated user sequence; neighborhood contexts of different sizes are captured for a user by setting different random walk lengths; suppose S = (v_1, v_2, ..., v_k) represents a sampled user sequence; then the user content sequence S_c = (d_1, d_2, ..., d_k) and the user structure sequence S_s = (id_1, id_2, ..., id_k) form a pair of parallel social context sequences containing content information and structure information respectively, where k represents the length of the random walk.
4. The microblog topic mining method based on the variable neighborhood parallel social context fusion according to claim 1, wherein the step (3) specifically comprises the following steps:
given a user content sequence (d_1, d_2, ..., d_k), each word w_i in the text information d_i of user v_i is replaced by the corresponding word embedding w_i ∈ R^(d'), thereby obtaining a text embedding matrix E_i = (w_1, w_2, ..., w_n), where d' represents the dimension of the word embedding; for the text embedding matrix E_i, the latent local syntactic and semantic information in the user text is preserved by convolution and max-pooling operations and encoded into the user content embedding, see formula (1):
v_i = max(CNN(E_i))    (1)
via the convolution and pooling operations, the original user content sequence (d_1, d_2, ..., d_k) is converted into the user content embedding sequence (v_1, v_2, ..., v_k); the user content embedding sequence, serving as the representation of the user structure sequence, is then input into a bidirectional LSTM that captures the forward context and backward context information over the sequence for each user:

h_i^f = LSTM_f(v_i, h_(i-1)^f)    (2)

h_i^b = LSTM_b(v_i, h_(i+1)^b)    (3)
in equations (2) and (3), h_i^f and h_i^b are the hidden states corresponding to the forward LSTM and the backward LSTM; splicing h_i^f and h_i^b yields the user embedding h_i = [h_i^f; h_i^b]; on the user embedding sequence (h_1, h_2, ..., h_k), the influence of different users on the topic is converted into corresponding importance coefficients calculated with an attention mechanism, see formula (4):
(α_1, α_2, ..., α_k) = att(h_1, h_2, ..., h_k)    (4)
wherein α_i represents the contribution of user v_i to the topic, computed as shown in formula (5): the user embedding h_i is first transformed nonlinearly, and its similarity with the user attention vector q is then computed to obtain α_i:
α_i = q^T · tanh(W·h_i + b)    (5)
W and b in formula (5) are parameters of the neural network and, together with the user attention vector q to be learned, are shared across all user sequences and user embeddings; tanh(·) is a nonlinear activation function; normalization by the softmax function then yields the importance of user v_i to the topic, see formula (6):

a_i = exp(α_i) / Σ_(v_j∈N_i) exp(α_j)    (6)
in formula (6), N_i represents the neighborhood of user v_i and includes v_i itself, and α_j represents the importance of neighbor v_j to the topic; by weighted summation of all user embeddings on the sequence, the user sequence embedding s, which captures the nonlinear correlation between the parallel content and structure contexts within the variable social neighborhood, is obtained, see formula (7):

s = Σ_(v_i∈N) a_i · h_i    (7)
wherein N represents all users on the current sequence; to obtain the user sequence embedding s, the following objective function needs to be minimized:

L_seq = − Σ_(v_i∈S) Σ_(v_j∈C_i) log p(v_j | v_i)    (8)
in formula (8), L_seq denotes the loss function for learning the user sequence embedding s, and C_i = {v_j | v_j ∈ N_i, |j−i| ≤ c, j ≠ i} represents the sequence-based context of user v_i, wherein c is the window size; p(v_j | v_i) represents the conditional probability of neighbor v_j given user v_i, formalized as shown in formula (9):

p(v_j | v_i) = exp(h_j^T · h_i) / Σ_(v_k∈C_i) exp(h_k^T · h_i)    (9)
wherein h_k is the embedding of any sequence-based neighbor v_k ∈ C_i of user v_i; the computational cost of the conditional probability in formula (9) is reduced by negative sampling, giving the optimized objective function:

L_seq = − Σ_(v_i∈S) Σ_(v_j∈C_i) [ log σ(h_j^T · h_i) + Σ_(l=1)^L E_(v_l∼P_n(v)) log σ(−h_l^T · h_i) ]    (10)
wherein σ(x) = 1/(1 + exp(−x)) denotes the sigmoid function, and L denotes the number of negative samples.
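The attention pooling of formulas (4)-(7) and the negative-sampling objective of formula (10) can be sketched in NumPy; the shapes, random weights, and sampled negatives below are illustrative assumptions, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_pool(H, W, b, q):
    """Formulas (4)-(7): score each user embedding h_i with
    alpha_i = q^T tanh(W h_i + b), softmax-normalize the scores,
    and return the weighted sum s over the sequence."""
    scores = np.array([q @ np.tanh(W @ h + b) for h in H])   # formula (5)
    a = np.exp(scores - scores.max())
    a = a / a.sum()                                          # formula (6)
    return a, a @ H                                          # formula (7)

def neg_sampling_loss(H, i, pos, neg):
    """Negative-sampling loss for one centre user v_i, per formula (10):
    attract in-window neighbours j, repel L sampled negatives l."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = 0.0
    for j in pos:
        loss -= np.log(sig(H[j] @ H[i]))
        loss -= sum(np.log(sig(-(H[l] @ H[i]))) for l in neg)
    return loss

k, d = 5, 8                        # walk length, embedding dimension
H = rng.normal(size=(k, d))        # stand-in for BiLSTM embeddings h_1..h_k
W = rng.normal(size=(d, d)); b = np.zeros(d); q = rng.normal(size=d)
a, s = attention_pool(H, W, b, q)
loss = neg_sampling_loss(H, i=2, pos=[1, 3], neg=[0, 4])
```

The softmax weights sum to one by construction, so `s` is a convex combination of the sequence's user embeddings.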
5. The microblog topic mining method based on variable-neighborhood parallel social context fusion according to claim 1, wherein step (4) specifically comprises the following steps:
the document-topic distribution θ_d = (p(t_1|d), p(t_2|d), ..., p(t_K|d)) and the topic-word distribution φ_w = (p(w|t_1), p(w|t_2), ..., p(w|t_K)) are inferred by neural variational inference, where d represents a document, t_i represents the i-th topic, K represents the number of topics, and w represents a word; p(t_i|d) (i = 1, 2, ..., K) represents the probability that document d belongs to the i-th topic, and p(w|t_i) (i = 1, 2, ..., K) represents the probability that word w belongs to the i-th topic;
document-topic distribution: given the user sequence embedding s, it is first mapped into a nonlinear hidden space h_enc as follows:
h_enc = ReLU(W_h·s + b_h)    (11)
wherein W_h and b_h are parameters of the encoder, and ReLU is a nonlinear activation function; assuming that both the prior distribution and the posterior distribution of the user sequence embedding s are Gaussian, the mean μ and the variance σ² of the posterior Gaussian distribution can be obtained by linear transformation, see formulas (12) and (13):
μ = W_μ·h_enc + b_μ    (12)
log(σ²) = W_σ·h_enc + b_σ    (13)
wherein W_μ, b_μ, W_σ and b_σ are parameters of the encoder;
the latent semantic vector z is further obtained by the reparameterization trick, formalized as shown in formula (14):
z = μ + ε × σ    (14)
wherein ε is sampled from the Gaussian distribution N(0, I); the latent semantic vector z is normalized by the softmax function to obtain the document-topic distribution θ_d;
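The encoder of formulas (11)-(14) can be sketched in NumPy; all weight matrices below are randomly initialized placeholders standing in for learned encoder parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(s, Wh, bh, Wmu, bmu, Wsig, bsig):
    """Formulas (11)-(13): map the sequence embedding s to the mean and
    log-variance of the posterior Gaussian."""
    h_enc = np.maximum(0.0, Wh @ s + bh)      # formula (11), ReLU
    mu = Wmu @ h_enc + bmu                    # formula (12)
    log_var = Wsig @ h_enc + bsig             # formula (13)
    return mu, log_var

def reparameterize(mu, log_var, rng):
    """Formula (14): z = mu + eps * sigma with eps ~ N(0, I); sampling
    stays differentiable with respect to mu and sigma."""
    eps = rng.normal(size=mu.shape)
    return mu + eps * np.exp(0.5 * log_var)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, K = 6, 4                                   # embedding dim, topic number
s = rng.normal(size=d)                        # user sequence embedding
Wh = rng.normal(size=(d, d)); bh = np.zeros(d)
Wmu = rng.normal(size=(K, d)); bmu = np.zeros(K)
Wsig = rng.normal(size=(K, d)); bsig = np.zeros(K)
mu, log_var = encode(s, Wh, bh, Wmu, bmu, Wsig, bsig)
z = reparameterize(mu, log_var, rng)
theta_d = softmax(z)                          # document-topic distribution
```

The softmax at the end guarantees that θ_d is a proper distribution over the K topics.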
topic-word distribution: the topic-word distribution φ_w can be regarded as a parameter of the decoder, see formula (15):
h_dec = softmax(φ_w × (θ_d)^T)    (15)
the user sequence embedding s is then reconstructed by the decoder to obtain the reconstructed user sequence embedding s′, calculated as formula (16), wherein W^(d) and b^(d) are parameters of the decoder:
s′ = ReLU(W^(d)·h_dec + b^(d))    (16)
for topic generation, the objective function of this part is formula (17):

L_gen = D_KL(q(z) ‖ p(z|s)) − E_q(z)[log p(s|z)]    (17)
in formula (17), L_gen denotes the loss function for learning the document-topic distribution and the topic-word distribution; the KL divergence measures the closeness of the prior distribution q(z) to the true posterior distribution p(z|s), where q(z) is the prior Gaussian distribution N(0, I);
by combining formulas (10) and (17), an overall objective function is defined to mine latent microblog topics, see formula (18), where λ is a hyper-parameter trading off L_seq against L_gen:
L = (1 − λ)·L_seq + λ·L_gen    (18).
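The combination in formula (18) can be sketched as follows; the closed-form KL term uses the standard VAE form KL(N(μ, σ²) ‖ N(0, I)), which may differ in notation from the patent's formula (17), and the scalar inputs are illustrative:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence KL(N(mu, sigma^2) || N(0, I)), the
    Gaussian regularizer referenced around formula (17)."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def overall_loss(l_seq, recon_err, mu, log_var, lam):
    """Formula (18): L = (1 - lambda) * L_seq + lambda * L_gen, where
    L_gen combines the reconstruction error of s' with the KL term."""
    l_gen = recon_err + kl_to_standard_normal(mu, log_var)
    return (1.0 - lam) * l_seq + lam * l_gen

# With mu = 0 and log_var = 0 the posterior equals the prior, so KL = 0
mu = np.zeros(4)
log_var = np.zeros(4)
total = overall_loss(l_seq=2.0, recon_err=1.0, mu=mu, log_var=log_var, lam=0.5)
```

Small λ emphasizes the structural skip-gram loss of formula (10); large λ emphasizes topic reconstruction, matching the trade-off described above.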
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011192126.5A CN112199607A (en) | 2020-10-30 | 2020-10-30 | Microblog topic mining method based on fusion of parallel social contexts in variable neighborhood |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112199607A true CN112199607A (en) | 2021-01-08 |
Family
ID=74012162
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113449849A (en) * | 2021-06-29 | 2021-09-28 | 桂林电子科技大学 | Learning type text hash method based on self-encoder |
CN113870040A (en) * | 2021-09-07 | 2021-12-31 | 天津大学 | Double-flow graph convolution network microblog topic detection method fusing different propagation modes |
CN115879515A (en) * | 2023-02-20 | 2023-03-31 | 江西财经大学 | Document network theme modeling method, variation neighborhood encoder, terminal and medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109033069A (en) * | 2018-06-16 | 2018-12-18 | 天津大学 | A kind of microblogging Topics Crawling method based on Social Media user's dynamic behaviour |
CN109684646A (en) * | 2019-01-15 | 2019-04-26 | 江苏大学 | A kind of microblog topic sentiment analysis method based on topic influence |
Non-Patent Citations (1)
Title |
---|
HONGYU LIU et al.: "Fusing Parallel Social Contexts within Flexible-Order Proximity for Microblog Topic Detection", Proceedings of the 29th ACM International Conference on Information & Knowledge Management * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112364161B (en) | Microblog theme mining method based on dynamic behaviors of heterogeneous social media users | |
CN109033069B (en) | Microblog theme mining method based on social media user dynamic behaviors | |
CN112199607A (en) | Microblog topic mining method based on fusion of parallel social contexts in variable neighborhood | |
CN111914185B (en) | Text emotion analysis method in social network based on graph attention network | |
CN108681557B (en) | Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint | |
Sang et al. | Context-dependent propagating-based video recommendation in multimodal heterogeneous information networks | |
Chen et al. | Zero-shot text classification via knowledge graph embedding for social media data | |
Piao et al. | Sparse structure learning via graph neural networks for inductive document classification | |
Dritsas et al. | An apache spark implementation for graph-based hashtag sentiment classification on twitter | |
Fu et al. | Improving distributed word representation and topic model by word-topic mixture model | |
Li et al. | Sentiment analysis of Weibo comments based on graph neural network | |
Yang et al. | PostCom2DR: Utilizing information from post and comments to detect rumors | |
CN110889505B (en) | Cross-media comprehensive reasoning method and system for image-text sequence matching | |
Ma et al. | A time-series based aggregation scheme for topic detection in Weibo short texts | |
Zhou | Research on sentiment analysis model of short text based on deep learning | |
Sang et al. | AAANE: Attention-based adversarial autoencoder for multi-scale network embedding | |
Zhang et al. | Exploring coevolution of emotional contagion and behavior for microblog sentiment analysis: a deep learning architecture | |
Zhu et al. | Intuitive topic discovery by incorporating word-pair's connection into LDA | |
Richardson et al. | Integrating summarization and retrieval for enhanced personalization via large language models | |
Dai et al. | REVAL: Recommend Which Variables to Log With Pretrained Model and Graph Neural Network | |
CN113343118A (en) | Hot event discovery method under mixed new media | |
Lee et al. | Overwhelmed by fear: emotion analysis of COVID-19 Vaccination Tweets | |
CN113870040B (en) | Double-flow chart convolution network microblog topic detection method integrating different propagation modes | |
Bai et al. | Low-rank multimodal fusion algorithm based on context modeling | |
Liu | A comparative study of vector space language models for sentiment analysis using reddit data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20210108 |