CN112199607A - Microblog topic mining method based on fusion of parallel social contexts in variable neighborhood - Google Patents


Info

Publication number
CN112199607A
CN112199607A
Authority
CN
China
Prior art keywords
user
sequence
topic
embedding
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011192126.5A
Other languages
Chinese (zh)
Inventor
贺瑞芳
刘宏宇
朱永凯
王浩成
韩迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202011192126.5A priority Critical patent/CN112199607A/en
Publication of CN112199607A publication Critical patent/CN112199607A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a microblog topic mining method based on the fusion of parallel social contexts in a variable neighborhood, which comprises the following steps: (1) constructing a user-level dialogue network; (2) parallel social context sequences: setting different random walk lengths to obtain parallel content and structure context sequences containing user proximities of different orders; (3) self-fused network representation: capturing the non-linear association between text content and network structure, and introducing an attention mechanism to model the influence of different users on the topic along the sequence, thereby obtaining the user sequence representation; (4) topic generation based on neural variational inference: taking the user sequence representation as the input of neural variational inference, which adaptively balances the intrinsic complementarity between content and structure, thereby mining topics with better coherence.

Description

Microblog topic mining method based on fusion of parallel social contexts in variable neighborhood
Technical Field
The invention relates to the technical fields of natural language processing and social media data mining, and in particular to a microblog topic mining method based on the fusion of parallel social contexts in a variable neighborhood.
Background
The emergence of social media websites (e.g., Sina Weibo) has dramatically changed the form of content on the Internet. Microblogging allows users to publish and browse information and offers strong social functions such as forwarding and commenting. Microblog platforms store huge amounts of text data and grow at an astonishing rate every day. Microblog text contains a large amount of information, and the topic information mined from it can be used for topic recommendation, emergency detection, precision marketing, and so on. At present, text topic mining techniques perform well when applied to text data such as news and articles. However, microblog texts are short, generally limited to 140 characters, and characteristics such as sparse information and casual word usage greatly increase the difficulty of processing them. Therefore, microblog-oriented topic mining requires methods different from traditional topic mining.
Currently, related research on microblog topic mining mainly falls into three lines: (1) exploiting co-occurrence patterns across documents. These methods aggregate short messages into long pseudo-documents according to heuristic rules such as authors and hashtags, or according to the topical attributes of the texts, and then mine latent topics with a three-layer Bayesian topic model; or they directly model the generation of word pairs over the whole corpus to alleviate the data sparsity of short texts. (2) Exploiting short text semantics. These methods use the rich semantic information carried by word embeddings, treat a short text as a set of word embeddings, assume the topic-word distribution is a multivariate Gaussian, and then infer topics with a hierarchical Bayesian model; or they model topics by integrating the semantic associations between words and their contexts in short texts, achieving a degree of deeper semantic understanding. (3) Exploiting social network context. These methods introduce structural features of the social network, such as user-forwarding networks and user-follower networks, supplementing the microblog text content with static context information so as to find more word co-occurrence features; or they introduce the dynamic context of the social network and infer topics by mining user behavior features such as dynamic interactions among users and different user interests.
Although the above methods achieve good performance, they either model only the microblog text content or consider only the first-order proximity of the social network, ignoring the impact of larger user neighborhoods in microblog conversations on topic inference. In a larger user neighborhood, users may talk about highly related topics, and users talking about the same topic may hold similar opinions on it. In addition, learning independent content and structure representations for a user [1] ignores the non-linear correlation between the two, and concatenating the content and structure representations with equal weight ignores their different importance in topic inference. These observations provide favorable clues for social-media-based microblog topic mining.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a microblog topic mining method based on the fusion of parallel social contexts in a variable neighborhood. To capture neighborhood contexts of different scales, the method constructs a user-level dialogue network based on forwarding and comment relations, where nodes represent users and edges represent forwarding or comment relations between pairs of users. By setting different random walk lengths, user proximities of different orders are first modeled; a self-fused network is then designed to capture the non-linear association between the parallel content and structure contexts within the neighborhood; an attention mechanism is introduced to model the different influences of users on the topic along the sequence; the two are finally integrated seamlessly into a user sequence embedding, and microblog topics with better coherence are generated through neural variational inference. Compared with existing models, the microblog topics mined by the method are optimal under the Topic Coherence Score evaluation metric.
The purpose of the invention is realized by the following technical scheme:
a microblog topic mining method based on parallel social context fusion in variable domains comprises the following steps:
(1) building user-level dialog networks
Users are regarded as nodes in the dialogue network, and all microblogs related to a user, including source microblogs, comment microblogs, and forwarding microblogs, are aggregated into one document regarded as the text information of the node. If microblog forwarding or comments exist between users in the dialogue network, the nodes denoting those users are connected. These operations construct a user-level dialogue network G = (V, E, T), where V is the set of nodes in the dialogue network G, E ⊆ V × V is the set of edges in the dialogue network G, and T is the set of text documents attached to the nodes. v_i denotes the i-th user in V, u_i denotes the identity of user v_i, and t_i = (w_1, w_2, ..., w_n) denotes the document of user v_i, where n is the number of words in t_i.
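As an illustrative sketch only (not the patented implementation), the construction of G = (V, E, T) can be expressed in a few lines of Python; the interaction pairs and per-user microblog lists below are hypothetical inputs:

```python
def build_dialogue_network(interactions, user_docs):
    """Build a user-level dialogue network G = (V, E, T).

    interactions: iterable of (user_a, user_b) pairs, meaning user_a
                  forwarded or commented on a microblog of user_b.
    user_docs:    dict mapping each user to the list of all microblogs
                  (source, comment, forward) associated with that user.
    """
    V, E = set(), set()
    for a, b in interactions:
        V.update((a, b))
        # Treat forwarding/comment links as undirected: store both directions.
        E.add((a, b))
        E.add((b, a))
    # Aggregate every user's microblogs into one document (the node's text in T).
    T = {u: " ".join(user_docs.get(u, [])) for u in V}
    return V, E, T

V, E, T = build_dialogue_network(
    [("u1", "u2"), ("u2", "u3")],
    {"u1": ["post A"], "u2": ["post B", "reply to A"], "u3": ["post C"]},
)
```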
(2) Parallel social context sequences: set different random walk lengths to obtain parallel content and structure context sequences containing user proximities of different orders.
In the user-level dialogue network, a random walk starting from an arbitrary node yields a truncated user sequence. By setting different random walk lengths, neighborhood contexts of different scales are captured for a user. Suppose S = (v_1, v_2, ..., v_k) represents a user sequence; the sampled user content sequence S^c = (t_1, t_2, ..., t_k) and user structure sequence S^s = (u_1, u_2, ..., u_k) form a pair of parallel social context sequences containing content information and structure information respectively, where k denotes the length of the random walk.
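A minimal sketch of the truncated random walk, assuming a simple adjacency-dict representation of the dialogue network (the network and documents below are hypothetical):

```python
import random

def sample_user_sequence(adj, start, k, seed=0):
    """Truncated random walk of length at most k starting from `start`,
    over an adjacency dict `adj` (node -> set of neighbors)."""
    rng = random.Random(seed)
    seq = [start]
    while len(seq) < k:
        neighbors = adj.get(seq[-1])
        if not neighbors:
            break  # dead end: the sequence is truncated early
        seq.append(rng.choice(sorted(neighbors)))
    return seq

# Hypothetical 3-user dialogue network and per-user documents.
adj = {"u1": {"u2"}, "u2": {"u1", "u3"}, "u3": {"u2"}}
docs = {"u1": "post A", "u2": "post B", "u3": "post C"}

S = sample_user_sequence(adj, "u1", k=4)
S_c = [docs[v] for v in S]  # parallel content context sequence
S_s = list(S)               # parallel structure context sequence (identities)
```

Varying `k` per data set is what produces neighborhood contexts of different scales.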
(3) Self-fused network representation: capture the non-linear association between text content and network structure [2], and introduce an attention mechanism to model the influence of different users on the topic along the sequence, obtaining the user sequence representation.
Given a user content sequence S^c, each word w_i in a user's text t_i is replaced by its corresponding word embedding w_i ∈ R^{d'}, yielding a text embedding matrix E_i = (w_1, w_2, ..., w_n), where d' denotes the dimension of the word embeddings. Convolution and max-pooling operations are applied to E_i to preserve the latent local syntactic and semantic information in the user text [3] and encode it into a user content embedding, as detailed in formula (1):
v_i = max(CNN(E_i))  (1)
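A NumPy sketch of formula (1), assuming (hypothetically) 1-D convolution filters of a fixed width followed by max-pooling over word positions; filter shapes and dimensions are illustrative:

```python
import numpy as np

def cnn_maxpool_encode(E, W_filters, width=3):
    """Encode a text embedding matrix E (n_words x d') into a user content
    embedding via 1-D convolution and max-pooling over positions, as in
    formula (1): v_i = max(CNN(E_i)). W_filters: (n_filters, width, d')."""
    n, d = E.shape
    n_filters = W_filters.shape[0]
    n_pos = n - width + 1
    feats = np.empty((n_pos, n_filters))
    for p in range(n_pos):
        window = E[p:p + width]  # (width, d') slice of consecutive words
        # Each filter's response: elementwise product summed over the window.
        feats[p] = np.tensordot(W_filters, window, axes=([1, 2], [0, 1]))
    return feats.max(axis=0)  # max-pool over positions -> (n_filters,)

rng = np.random.default_rng(0)
E_i = rng.normal(size=(7, 16))    # 7 words, 16-dim word embeddings
W = rng.normal(size=(32, 3, 16))  # 32 filters of width 3 (assumed parameters)
v_i = cnn_maxpool_encode(E_i, W)
```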
Through the convolution and pooling operations, the original user content sequence S^c is converted into the user content embedding sequence (v_1, v_2, ..., v_k). Together with the user structure sequence S^s, it is fed into a bidirectional LSTM to capture forward and backward context information on the sequence for each user:
h_i^→ = LSTM^→(v_i, h_{i-1}^→)  (2)
h_i^← = LSTM^←(v_i, h_{i+1}^←)  (3)
In equations (2) and (3), h_i^→ and h_i^← are the hidden states of the forward LSTM and the backward LSTM respectively; concatenating h_i^→ and h_i^← yields the user embedding h_i. On the user embedding sequence (h_1, h_2, ..., h_k), the influence of different users on the topic is converted into corresponding importance coefficients computed with an attention mechanism, as detailed in formula (4):
(α_1, α_2, ..., α_k) = att(h_1, h_2, ..., h_k)  (4)
in the formula (4), αiRepresenting a user viThe contribution to the topic is specifically calculated in formula (5). First embed the user in hiNon-linear transformation is carried out, and then similarity is calculated with the attention vector q of the user to obtain alphai
α_i = q^T · tanh(W · h_i + b)  (5)
In formula (5), W and b are neural network parameters which, together with the user attention vector q, are shared across all user sequences and user embeddings, and tanh(·) is a non-linear activation function. The coefficients are further normalized by a softmax function to obtain the importance of user v_i to the topic, as detailed in formula (6):
α_i = exp(α_i) / Σ_{v_j∈N_i} exp(α_j)  (6)
In formula (6), N_i denotes the sequence-based neighbors of user v_i (including v_i itself), and α_j denotes the importance of neighbor v_j to the topic. By weighting all user embeddings on the sequence, a user sequence embedding s is obtained that captures the non-linear association between the parallel content and structure contexts within the variable social neighborhood, see formula (7):
s = Σ_{i∈N} α_i · h_i  (7)
where N denotes all users on the current sequence. To obtain the user sequence embedding s, the following objective function is minimized:
L_seq = −Σ_{v_i∈S} Σ_{v_j∈C_i} log p(v_j | v_i)  (8)
In formula (8), L_seq denotes the loss function for learning the user sequence embedding s, and C_i = {v_j | v_j ∈ N_i, |j − i| ≤ c, j ≠ i} denotes the context of user v_i, where c is the window size. p(v_j | v_i) denotes the conditional probability of neighbor v_j given user v_i, formalized as in formula (9):
p(v_j | v_i) = exp(h_j^T · h_i) / Σ_{v_k∈C_i} exp(h_k^T · h_i)  (9)
where h_k is the embedding of an arbitrary sequence-based neighbor v_k ∈ C_i of user v_i. Negative sampling is used to further reduce the computational cost of the conditional probability in formula (9), giving the following optimized objective function, see formula (10):
L_seq = −Σ_{v_i∈S} Σ_{v_j∈C_i} [ log σ(h_j^T · h_i) + Σ_{l=1}^{L} E_{v_l∼P_n(v)} log σ(−h_l^T · h_i) ]  (10)
where σ (x) · 1/(1+ exp (-x)) denotes a sigmoid (sigmoid) function, and L denotes the number of negative samples.
(4) Topic generation based on neural variational inference: the user sequence representation serves as the input of neural variational inference [4], which adaptively balances the intrinsic complementarity between content and structure, thereby mining topics with better coherence.
The document-topic distribution θ_d = (p(t_1|d), p(t_2|d), ..., p(t_K|d)) and the topic-word distribution φ_w = (p(w|t_1), p(w|t_2), ..., p(w|t_K)) are inferred using neural variational inference, where d denotes a document, t_i denotes the i-th topic, K denotes the number of topics, and w denotes a word. p(t_i|d) (i = 1, 2, ..., K) denotes the probability that document d belongs to the i-th topic, and p(w|t_i) (i = 1, 2, ..., K) denotes the probability that word w belongs to the i-th topic.
Document-topic distribution: given a user sequence embedding s, it is first mapped into a non-linear latent space h_enc as follows:
h_enc = ReLU(W_h · s + b_h)  (11)
where W_h and b_h are parameters of the encoder and ReLU is a non-linear activation function. Assuming that both the prior and the posterior distribution of the user sequence embedding s are Gaussian, the mean μ and variance σ² of the posterior Gaussian distribution are obtained by linear transformations, see formulas (12) and (13):
μ = W_μ · h_enc + b_μ  (12)
log(σ²) = W_σ · h_enc + b_σ  (13)
where W_μ, b_μ, W_σ, and b_σ are parameters of the encoder.
The latent semantic vector z ∈ R^K is then obtained with the reparameterization trick, formalized in formula (14):
z = μ + ε ⊙ σ  (14)
In formula (14), ε is sampled from the Gaussian distribution N(0, I). The latent semantic vector z is normalized with a softmax function to obtain the document-topic distribution θ_d.
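The reparameterization step of formula (14) and the subsequent softmax can be sketched as follows; the topic number K and the Gaussian parameters are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_doc_topic(mu, log_sigma2, rng):
    """Reparameterization trick of formula (14), then the softmax that
    yields the document-topic distribution theta_d."""
    sigma = np.exp(0.5 * log_sigma2)
    eps = rng.normal(size=mu.shape)  # eps ~ N(0, I)
    z = mu + eps * sigma             # formula (14)
    return softmax(z)

rng = np.random.default_rng(3)
K = 10  # number of topics (hypothetical)
theta_d = sample_doc_topic(rng.normal(size=K), rng.normal(size=K), rng)
```

Because the noise ε is sampled outside the deterministic transform, gradients can flow through μ and σ during training; that is the point of the trick.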
Topic-word distribution: the topic-word distribution φ_w can be regarded as the parameters of the decoder, see formula (15):
h_dec = softmax(φ_w × (θ_d)^T)  (15)
The decoder reconstructs the user sequence embedding s to obtain the reconstructed embedding s′, computed as in formula (16), where W^(d) and b^(d) are parameters of the decoder:
s′ = ReLU(W^(d) · h_dec + b^(d))  (16)
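A sketch of the decoder side, formulas (15) and (16); the dimension conventions are illustrative (φ_w is taken here as a K × |V| matrix, so φ_w^T · θ_d mixes the topic-word distributions by the document-topic weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode(theta_d, phi, W_d, b_d):
    """Formula (15): mix topic-word distributions by theta_d and normalize;
    formula (16): reconstruct the sequence embedding s' with a ReLU layer."""
    h_dec = softmax(phi.T @ theta_d)                  # (|V|,) over the vocabulary
    s_prime = np.maximum(0.0, W_d @ h_dec + b_d)      # ReLU(W^(d) h_dec + b^(d))
    return h_dec, s_prime

rng = np.random.default_rng(4)
K, V_size, D = 5, 50, 16                   # topics, vocabulary, embedding dim
phi = rng.dirichlet(np.ones(V_size), size=K)  # K x |V| topic-word distributions
theta = rng.dirichlet(np.ones(K))             # document-topic distribution
W_d = rng.normal(size=(D, V_size))
b_d = rng.normal(size=D)
h_dec, s_prime = decode(theta, phi, W_d, b_d)
```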
For topic generation, the objective function of this part is formula (17):
L_gen = D_KL(q(z) ‖ p(z|s)) − E_{q(z)}[log p(s|z)]  (17)
In formula (17), L_gen denotes the loss function for learning the document-topic and topic-word distributions; the KL divergence measures how closely the variational distribution q(z), with the prior taken as the Gaussian N(0, I), approximates the true posterior p(z|s).
Combining formulas (10) and (17), an overall objective function is defined to mine the latent microblog topics, as detailed in formula (18), where λ is a hyper-parameter trading off L_seq and L_gen:
L = (1 − λ) · L_seq + λ · L_gen  (18)
compared with the prior art, the technical scheme of the invention has the following beneficial effects:
(1) To address problems such as sparse microblog text data and casual word usage, the method considers social media content and the social network topology simultaneously, enriching the context information of microblog texts;
(2) To capture user proximities of different orders, the method sets different random walk lengths on the dialogue networks constructed from different data sets, capturing neighborhood contexts of different scales for users according to their different interaction characteristics;
(3) To capture the non-linear association between text content and network structure within a user neighborhood, the method uses a self-fused network representation and an attention mechanism to seamlessly integrate the two into the user sequence embedding;
(4) To generate topics with better coherence, the method feeds the user sequence embedding, which captures the non-linear association between parallel content and structure contexts within the variable social neighborhood, into neural variational inference, adaptively balancing their intrinsic complementarity and different importance in topic inference;
(5) Experimental results on real Sina Weibo data sets demonstrate the effectiveness of the method, confirming that capturing user proximities of different orders and modeling the non-linear association between content and structure benefit microblog topic mining.
Drawings
Fig. 1a and Fig. 1b are framework diagrams of the microblog topic mining method based on the fusion of parallel social contexts in a variable neighborhood provided by the invention, wherein the left part of Fig. 1a shows the construction of the user-level dialogue network in the embodiment; the right part of Fig. 1a shows the acquisition of the parallel social context sequences; the left part of Fig. 1b shows the self-fused network representation; and the right part of Fig. 1b shows topic generation based on neural variational inference;
FIG. 2a shows the variation of the topic coherence score with random walk length on the 5-month data set in the embodiment;
FIG. 2b shows the variation of the topic coherence score with random walk length on the 6-month data set in the embodiment;
FIG. 2c shows the variation of the topic coherence score with random walk length on the 7-month data set in the embodiment.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and the detailed description. It should be understood that the embodiments described herein are only for illustrating the present invention and are not to be construed as limiting the present invention.
The specific implementation of the present invention is described by taking 3 real microblog data sets as an example; the overall framework of the method is shown in Figs. 1a and 1b. The overall algorithm comprises four steps: user-level dialogue network input, acquisition of parallel social context sequences containing user proximities of different orders, self-fused network representation capturing the non-linear association between content and structure, and topic generation based on neural variational inference.
The method comprises the following specific steps:
(1) user-level conversational web input
On the Sina Weibo platform, previous work collected microblogs covering 50 hot topics during the three months of May, June, and July 2014 using the hashtag-search application programming interface (hashtag-search API). The invention takes these 3 public real microblog data sets as the original corpora and preprocesses them as follows to construct the user-level dialogue networks: 1) filter out users without forwarding or comment relations; 2) aggregate all microblogs related to the same user, including source microblogs, forwarding microblogs, and comment microblogs, into one document as the text information of the node denoting that user.
Table 1 shows detailed statistics of the three data sets: the 5-month data set includes 44,395 users in total, and users with forwarding or comment relations account for 64,292 of its 70,893 microblogs; the 6-month data set includes 89,979 users in total, and users with forwarding or comment relations account for 151,427 of its 163,420 microblogs; the 7-month data set includes 119,269 users in total, and users with forwarding or comment relations account for 178,154 of its 188,657 microblogs. FIG. 1a illustrates a user-level dialogue network constructed from the forwarding or comment relations between users.
TABLE 1 microblog data set statistics
(2) Obtaining parallel social context sequences
The user-level dialogue network constructed in the previous step is processed as follows:
and carrying out random walk by taking any node as a starting point to obtain a truncated user sequence. By setting different random walk lengths, neighborhood contexts with different specifications are captured for the user. Suppose S ═ v1,v2,...,vk) Representing a sequence of users, obtained by random walk sampling of length k, the sequence of user contents
Figure BDA0002753033090000062
And user structure sequence
Figure BDA0002753033090000071
Is a pair of parallel social context sequences containing content information and structural information, respectively.
(3) Self-fused network representation
To capture the non-linear correlation between parallel content and structural context within a variable social neighborhood, we train the model according to the following objective function, learning the user sequence embedding s:
L_seq = −Σ_{v_i∈S} Σ_{v_j∈C_i} [ log σ(h_j^T · h_i) + Σ_{l=1}^{L} E_{v_l∼P_n(v)} log σ(−h_l^T · h_i) ]
the meaning of the symbols in the formula is as described above. By training the objective function, the model seamlessly integrates the content information and the structure information of the user and models the influence of different users on the topic of discussion on the sequence.
(4) Topic generation based on neural variational inference
To integrate the intrinsic complementarity between social media content and network structure for topic inference, the user sequence embedding s is fed into the variational auto-encoder and reconstructed according to the following objective function:
L_gen = D_KL(q(z) ‖ p(z|s)) − E_{q(z)}[log p(s|z)]
the meaning of the symbols in the formula is as described above. By training the objective function, the model adaptively balances different importance of content and structure to the topic to generate the topic together.
The model overall objective function is as follows:
L = (1 − λ) · L_seq + λ · L_gen
in the specific implementation process, various hyper-parameters are set in advance, namely the embedding dimension is 200, the weighing coefficient lambda is 0.9, and the number of random walks is 10, so that the topic of microblog data is deduced. In order to capture user neighborhood contexts with different specifications, different random walk step lengths are set on different data sets, specifically, the random walk step length of a data set of 5 months is set to be 7, the random walk step length of a data set of 6 months is set to be 10, and the random walk step length of a data set of 7 months is set to be 3.
To verify the effectiveness of the proposed method, the method of the invention (PCFTM) was compared with currently advanced and representative models (BTM, LCTM, LeadLDA, ForumLDA, IATM) and with two variants of the method (PCFTM (-seq), PCFTM (-fus)).
BTM (Biterm Topic Model) assumes that the two words in a word pair belong to the same topic and performs topic inference by modeling the generation of all word pairs in the entire corpus.
LCTM (Latent Concept Topic Model) introduces word embeddings to enhance the understanding of short text semantics and address the data sparsity problem of short texts. The model also introduces a new latent variable, the concept, to capture the semantic similarity of words, assuming that a topic is a distribution over concepts and a concept is a distribution over word embeddings.
LeadLDA [5] constructs conversation trees from the forwarding and reply relations among microblogs and infers latent topics from the topical dependencies between leader and follower messages on the trees.
ForumLDA models the topic generation process by distinguishing whether the forwarded microblog is related to the original microblog in topic.
The IATM (Interaction-Aware Topic Model) considers text content and dynamic user behaviors in the social network simultaneously, mining topics by modeling dynamic user interactions and different user interests and then applying neural variational inference.
PCFTM (-seq) considers only the first-order proximity of users and fuses content and structure information for topic generation based on neural variational inference.
PCFTM (-fus) considers user neighborhoods of different scales but simply concatenates the content and structure contexts of the user sequence for topic generation based on neural variational inference.
Topic coherence is adopted as the evaluation metric of experimental performance, with the formula:
C(t; V^(t)) = Σ_{m=2}^{M} Σ_{l=1}^{m−1} log [ (D(v_m^(t), v_l^(t)) + 1) / D(v_l^(t)) ]
where V^(t) = (v_1^(t), ..., v_M^(t)) are the top M words of topic t, D(v) is the number of documents containing word v, and D(v, v′) is the number of documents containing both v and v′.
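A minimal sketch of computing such a topic coherence score in the standard document co-occurrence form (the toy corpus is hypothetical, and the exact variant used in the experiments may differ):

```python
import math

def topic_coherence(top_words, docs):
    """Coherence of one topic: sum of log((D(w_m, w_l) + 1) / D(w_l))
    over ordered pairs of the topic's top words, where D counts the
    documents containing the given word(s). Higher is better."""
    doc_sets = [set(d) for d in docs]

    def D(*words):
        return sum(all(w in ds for w in words) for ds in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((D(top_words[m], top_words[l]) + 1) / D(top_words[l]))
    return score

# Hypothetical toy corpus: "a" appears in 3 documents, "a" and "b" co-occur in 1.
docs = [["a", "b"], ["a"], ["a"]]
score = topic_coherence(["a", "b"], docs)  # log((1 + 1) / 3)
```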
tables 2, 3, and 4 show topic coherence results of the model and all comparison methods on three microblog data sets, respectively. For each data set, consistency score values of top 10(N is 10), 15(N is 15), and 20(N is 20) words of the inferred topic when the topic number K is 50 and 100 are recorded. Higher topic continuity score values indicate better performance of the model.
TABLE 2 comparison of Performance of the method of the present invention with other methods on a 5 month dataset
TABLE 3 comparison of Performance of the method of the present invention with other methods on a 6 month dataset
TABLE 4 comparison of Performance of the method of the present invention with other methods on a 7 month dataset
As can be seen from the topic coherence results in Tables 2, 3, and 4, the proposed method achieves a larger performance improvement by modeling user neighborhoods of different scales and capturing the non-linear association between content and structure contexts. To further examine the appropriate proximity settings on different data sets, Figs. 2a to 2c show how the topic coherence score of the method varies with the random walk length on the three microblog data sets, illustrating the effectiveness of the proposed microblog topic mining method based on the fusion of parallel social contexts in a variable neighborhood.
The above contents are intended to schematically illustrate the technical solution of the present invention, and the present invention is not limited to the above described embodiments. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.
Reference documents:
[1] He R, Zhang X, Jin D, et al. Interaction-Aware Topic Model for Microblog Conversations through Network Embedding and User Attention. In: Proc. of the 27th International Conference on Computational Linguistics, 2018: 1398-1409.
[2] Liu J, He Z, Wei L, et al. Content to Node: Self-Translation Network Embedding. In: Proc. of the 24th International Conference on Knowledge Discovery & Data Mining, 2018: 1794-1802.
[3] Liu J, Li N, He Z, et al. Network Embedding with Dual Generation Tasks. In: Proc. of the 28th International Joint Conference on Artificial Intelligence, 2019: 5102-5108.
[4] Srivastava A, Sutton C. Autoencoding Variational Inference for Topic Models. In: Proc. of the 5th International Conference on Learning Representations, 2017.
[5] Li J, Liao M, Gao W, et al. Topic Extraction from Microblog Posts Using Conversation Structures. In: Proc. of the 54th Annual Meeting of the Association for Computational Linguistics, 2016: 1722-1731.

Claims (5)

1. a microblog topic mining method based on parallel social context fusion in a variable neighborhood is characterized by comprising the following steps:
(1) constructing a user-level dialogue network;
(2) parallel social context sequences: setting different random walk lengths to obtain parallel content and structure context sequences containing user proximities of different orders;
(3) self-fused network representation: capturing the non-linear association between text content and network structure, and introducing an attention mechanism to model the influence of different users on the topic along the sequence to obtain the user sequence representation;
(4) topic generation based on neural variational inference: taking the user sequence representation as the input of neural variational inference, and adaptively balancing the intrinsic complementarity between content and structure, thereby mining topics with better coherence.
2. The microblog topic mining method based on parallel social context fusion within a variable neighborhood according to claim 1, wherein step (1) specifically comprises the following steps:
Each user is regarded as a node in the conversation network; all microblogs associated with that user, including source microblogs, comment microblogs and repost microblogs, are aggregated into one document and regarded as the text information of the node; if a microblog or comment interaction exists between two users in the conversation network, the nodes representing those users are connected. Through this operation a user-level conversation network G = (V, E, T) is constructed, where V is the set of nodes in the conversation network G, E ⊆ V × V is the set of edges in the conversation network G, and T is the set of text documents attached to the nodes; v_i denotes the i-th user in V, id_{v_i} denotes the identity of user v_i, and d_{v_i} = (w_1, w_2, ..., w_n) denotes the document of user v_i, where the subscript n is the number of words in the document d_{v_i}.
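The network construction in step (1) can be sketched in plain Python. The interaction records, field names and helper function below are illustrative assumptions for a toy example, not part of the patent:

```python
from collections import defaultdict

def build_conversation_network(interactions):
    """Build a user-level conversation network G = (V, E, T).

    `interactions` is a list of (author, target, text) triples, where
    `target` is None for a source microblog and another user for a
    comment or repost. All of a user's microblogs are merged into one
    document; a comment/repost creates an undirected edge between users.
    """
    nodes = set()
    edges = set()                 # undirected edges as frozensets
    texts = defaultdict(list)     # user -> merged document (word list)
    for author, target, text in interactions:
        nodes.add(author)
        texts[author].extend(text.split())
        if target is not None and target != author:
            nodes.add(target)
            edges.add(frozenset((author, target)))
    return nodes, edges, dict(texts)

# Hypothetical toy interactions: u2 comments on u1, u3 reposts u1.
records = [
    ("u1", None, "flood relief news"),
    ("u2", "u1", "stay safe everyone"),
    ("u3", "u1", "flood relief news reposted"),
]
V, E, T = build_conversation_network(records)
```

Merging all of a user's posts into one node document is what lets the later steps treat each node as carrying both text and structure.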
3. The microblog topic mining method based on parallel social context fusion within a variable neighborhood according to claim 1, wherein step (2) specifically comprises the following steps:
In the user-level conversation network, a truncated user sequence is obtained by performing a random walk starting from an arbitrary node; by setting different random-walk lengths, neighborhood contexts of different scales are captured for each user. Suppose S = (v_1, v_2, ..., v_k) denotes a sampled user sequence; the sampled user content sequence S_c = (d_{v_1}, d_{v_2}, ..., d_{v_k}) and user structure sequence S_s = (v_1, v_2, ..., v_k) form a pair of parallel social context sequences containing content information and structure information respectively, where k denotes the length of the random walk.
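The truncated random walk of step (2) can be sketched as follows; the graph, documents and seed are toy assumptions, and the walk policy (uniform neighbor choice) is a standard default the patent does not pin down:

```python
import random

def random_walk(adj, start, k, seed=0):
    """Sample a truncated random walk of length k starting at `start`.

    `adj` maps each node to its neighbor list; varying k changes the
    neighborhood scale captured for the user, giving the variable
    neighborhood of step (2).
    """
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < k:
        neighbors = adj.get(walk[-1], [])
        if not neighbors:          # dead end: truncate early
            break
        walk.append(rng.choice(neighbors))
    return walk

# Hypothetical conversation graph and node documents.
adj = {"u1": ["u2", "u3"], "u2": ["u1"], "u3": ["u1"]}
docs = {"u1": "flood relief", "u2": "stay safe", "u3": "reposted"}

S = random_walk(adj, "u1", k=5)
# The parallel sequences share the same node order: the structure
# sequence is S itself, the content sequence is each node's document.
S_content = [docs[v] for v in S]
```

Running the sampler twice with different k values yields the different-order user proximities the claim refers to.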
4. The microblog topic mining method based on parallel social context fusion within a variable neighborhood according to claim 1, wherein step (3) specifically comprises the following steps:
Given a user content sequence S_c = (d_{v_1}, d_{v_2}, ..., d_{v_k}), each word w_i in the user's text document d_{v_i} is replaced by its corresponding word embedding w_i ∈ R^{d'}, yielding a text embedding matrix E_i = (w_1, w_2, ..., w_n), where d' denotes the dimension of the word embeddings; convolution and max-pooling operations applied to the text embedding matrix E_i preserve the latent local syntactic and semantic information in the user text and encode it into a user content embedding, see formula (1):
v_i = max(CNN(E_i)) (1)
Through the convolution and pooling operations, the original user content sequence S_c is converted into a user content embedding sequence (v_1, v_2, ..., v_k); the user content embedding sequence, serving as the user structure sequence S_s, is then fed into a bidirectional LSTM that captures for each user the forward and backward context information on the sequence:
h_i^→ = LSTM^→(v_i, h_{i-1}^→) (2)
h_i^← = LSTM^←(v_i, h_{i+1}^←) (3)
In formulas (2) and (3), h_i^→ and h_i^← are the hidden states of the forward and backward LSTM respectively; concatenating h_i^→ and h_i^← gives the user embedding h_i. On the user embedding sequence (h_1, h_2, ..., h_k), the influence of different users on the topic is converted into importance coefficients computed with an attention mechanism, see formula (4):
(α_1, α_2, ..., α_k) = att(h_1, h_2, ..., h_k) (4)
where α_i denotes the contribution of user v_i to the topic and is computed by formula (5): the user embedding h_i is first transformed non-linearly, and its similarity with the user attention vector q is then computed to obtain α_i:
α_i = q^T · tanh(W · h_i + b) (5)
In formula (5), W and b are parameters of the neural network; together with the user attention vector q, they are shared across all user sequences and user embeddings to be learned; tanh(·) is a non-linear activation function. Normalization with a softmax function then yields the importance of user v_i to the topic, see formula (6):
α_i = exp(α_i) / Σ_{v_j∈N_i} exp(α_j) (6)
In formula (6), N_i denotes the neighborhood of user v_i (including v_i itself), and α_j denotes the importance of neighbor v_j to the topic. By weighting and summing all user embeddings on the sequence, the user sequence embedding s is obtained, which captures the non-linear correlation between the parallel content and structure contexts within the variable social neighborhood, see formula (7):
s = Σ_{v_i∈N} α_i · h_i (7)
where N denotes all users on the current sequence. To obtain the user sequence embedding s, the following objective function is minimized:
L_seq = -Σ_{v_i∈S} Σ_{v_j∈C_i} log p(v_j | v_i) (8)
In formula (8), L_seq is the loss function for learning the user sequence embedding s; C_i = {v_j | v_j ∈ N_i, |j - i| ≤ c, j ≠ i} denotes the sequence-based neighbors of user v_i, where c is the window size; p(v_j | v_i), the conditional probability of neighbor v_j given user v_i, is formalized as formula (9):
p(v_j | v_i) = exp(h_j^T · h_i) / Σ_{v_k∈C_i} exp(h_k^T · h_i) (9)
where h_k is the embedding of an arbitrary sequence-based neighbor v_k ∈ C_i of user v_i; negative sampling is used to reduce the computational cost of the conditional probability in formula (9), giving the optimized objective function:
L_seq = -Σ_{v_i∈S} Σ_{v_j∈C_i} [log σ(h_j^T · h_i) + Σ_{l=1}^{L} E_{v_l∼P_n(v)} log σ(-h_l^T · h_i)] (10)
where σ(x) = 1/(1 + exp(-x)) denotes the sigmoid function and L denotes the number of negative samples.
5. The microblog topic mining method based on parallel social context fusion within a variable neighborhood according to claim 1, wherein step (4) specifically comprises the following steps:
Neural variational inference is used to infer the document-topic distribution θ_d = (p(t_1|d), p(t_2|d), ..., p(t_K|d)) and the topic-word distribution φ_w = (p(w|t_1), p(w|t_2), ..., p(w|t_K)), where d denotes a document, t_i denotes the i-th topic, K denotes the number of topics and w denotes a word; p(t_i|d) (i = 1, 2, ..., K) denotes the probability that document d belongs to the i-th topic, and p(w|t_i) (i = 1, 2, ..., K) denotes the probability that word w belongs to the i-th topic.
Document-topic distribution: given a user sequence embedding s, it is first mapped into a non-linear latent space h_enc:
h_enc = ReLU(W_h · s + b_h) (11)
where W_h and b_h are parameters of the encoder and ReLU is a non-linear activation function. Assuming that both the prior and the posterior distribution of the user sequence embedding s are Gaussian, the mean μ and variance σ² of the posterior Gaussian are obtained by linear transformations, see formulas (12) and (13):
μ = W_μ · h_enc + b_μ (12)
log(σ²) = W_σ · h_enc + b_σ (13)
where W_μ, b_μ, W_σ and b_σ are parameters of the encoder.
The latent semantic vector z is then obtained with the reparameterization trick, formalized as formula (14):
z = μ + ε × σ (14)
where ε is sampled from the Gaussian distribution N(0, I); normalizing the latent semantic vector z with a softmax function yields the document-topic distribution θ_d.
Topic-word distribution: the topic-word distribution φ_w can be regarded as a parameter of the decoder, see formula (15):
h_dec = softmax(φ_w × (θ_d)^T) (15)
The user sequence embedding s is then reconstructed by the decoder to obtain the reconstructed user sequence embedding s', computed as formula (16), where W_d and b_d are parameters of the decoder:
s' = ReLU(W_d · h_dec + b_d) (16)
For topic generation, the objective function of this part is formula (17):
L_gen = D_KL(q(z) ‖ p(z|s)) - E_{q(z)}[log p(s|z)] (17)
In formula (17), L_gen denotes the loss function for learning the document-topic and topic-word distributions; the KL divergence measures the closeness between the prior distribution q(z) and the true posterior distribution p(z|s), where q(z) is the prior Gaussian N(0, I).
Combining formulas (10) and (17), the overall objective function for mining latent microblog topics is defined as formula (18), where λ is a hyper-parameter trading off L_seq against L_gen:
L = (1 - λ)·L_seq + λ·L_gen (18).
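The reparameterization of formula (14), the softmax normalization that produces θ_d, and the combined objective of formula (18) can be sketched in pure Python; the toy values of μ, log σ², L_seq, L_gen and λ below are illustrative, not from the patent:

```python
import math
import random

def reparameterize(mu, log_sigma2, rng):
    """z = mu + eps * sigma, eps ~ N(0, I), as in formula (14)."""
    return [m + rng.gauss(0.0, 1.0) * math.exp(0.5 * ls)
            for m, ls in zip(mu, log_sigma2)]

def softmax(z):
    """Normalize z into a probability vector (the document-topic
    distribution theta_d in the method)."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def total_loss(l_seq, l_gen, lam):
    """Overall objective L = (1 - lambda)*L_seq + lambda*L_gen
    from formula (18)."""
    return (1.0 - lam) * l_seq + lam * l_gen

rng = random.Random(0)
mu, log_sigma2 = [0.2, -0.1, 0.4], [0.0, 0.0, 0.0]  # K = 3 toy topics
theta_d = softmax(reparameterize(mu, log_sigma2, rng))
L = total_loss(l_seq=1.5, l_gen=0.5, lam=0.3)
```

Sampling ε outside the network parameters is what keeps the stochastic layer differentiable, so the encoder producing μ and log σ² can be trained by backpropagation.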
CN202011192126.5A 2020-10-30 2020-10-30 Microblog topic mining method based on fusion of parallel social contexts in variable neighborhood Pending CN112199607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011192126.5A CN112199607A (en) 2020-10-30 2020-10-30 Microblog topic mining method based on fusion of parallel social contexts in variable neighborhood


Publications (1)

Publication Number Publication Date
CN112199607A 2021-01-08

Family

ID=74012162


Country Status (1)

Country Link
CN (1) CN112199607A (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033069A (en) * 2018-06-16 2018-12-18 天津大学 A kind of microblogging Topics Crawling method based on Social Media user's dynamic behaviour
CN109684646A (en) * 2019-01-15 2019-04-26 江苏大学 A kind of microblog topic sentiment analysis method based on topic influence


Non-Patent Citations (1)

Title
HONGYU LIU et al.: "Fusing Parallel Social Contexts within Flexible-Order Proximity for Microblog Topic Detection", Proceedings of the 29th ACM International Conference on Information & Knowledge Management *

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN113449849A (en) * 2021-06-29 2021-09-28 桂林电子科技大学 Learning type text hash method based on self-encoder
CN113449849B (en) * 2021-06-29 2022-05-27 桂林电子科技大学 Learning type text hash method based on self-encoder
CN113870040A (en) * 2021-09-07 2021-12-31 天津大学 Double-flow graph convolution network microblog topic detection method fusing different propagation modes
CN113870040B (en) * 2021-09-07 2024-05-21 天津大学 Double-flow chart convolution network microblog topic detection method integrating different propagation modes
CN115879515A (en) * 2023-02-20 2023-03-31 江西财经大学 Document network theme modeling method, variation neighborhood encoder, terminal and medium

Similar Documents

Publication Publication Date Title
CN112364161B (en) Microblog theme mining method based on dynamic behaviors of heterogeneous social media users
CN109033069B (en) Microblog theme mining method based on social media user dynamic behaviors
CN112199607A (en) Microblog topic mining method based on fusion of parallel social contexts in variable neighborhood
CN111914185B (en) Text emotion analysis method in social network based on graph attention network
CN108681557B (en) Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint
Sang et al. Context-dependent propagating-based video recommendation in multimodal heterogeneous information networks
Chen et al. Zero-shot text classification via knowledge graph embedding for social media data
Piao et al. Sparse structure learning via graph neural networks for inductive document classification
Dritsas et al. An apache spark implementation for graph-based hashtag sentiment classification on twitter
Fu et al. Improving distributed word representation and topic model by word-topic mixture model
Li et al. Sentiment analysis of Weibo comments based on graph neural network
Yang et al. PostCom2DR: Utilizing information from post and comments to detect rumors
CN110889505B (en) Cross-media comprehensive reasoning method and system for image-text sequence matching
Ma et al. A time-series based aggregation scheme for topic detection in Weibo short texts
Zhou Research on sentiment analysis model of short text based on deep learning
Sang et al. AAANE: Attention-based adversarial autoencoder for multi-scale network embedding
Zhang et al. Exploring coevolution of emotional contagion and behavior for microblog sentiment analysis: a deep learning architecture
Zhu et al. Intuitive topic discovery by incorporating word-pair's connection into LDA
Richardson et al. Integrating summarization and retrieval for enhanced personalization via large language models
Dai et al. REVAL: Recommend Which Variables to Log With Pretrained Model and Graph Neural Network
CN113343118A (en) Hot event discovery method under mixed new media
Lee et al. Overwhelmed by fear: emotion analysis of COVID-19 Vaccination Tweets
CN113870040B (en) Double-flow chart convolution network microblog topic detection method integrating different propagation modes
Bai et al. Low-rank multimodal fusion algorithm based on context modeling
Liu A comparative study of vector space language models for sentiment analysis using reddit data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210108