CN110336700B - Microblog popularity prediction method based on time and user forwarding sequence

Info

Publication number: CN110336700B
Application number: CN201910621977.8A
Authority: CN
Inventors: 黄宏宇; 刘海燕
Original assignee: Chongqing University
Current assignee: Chongqing University
Priority date: 2019-07-10
Filing date: 2019-07-10
Publication date: 2021-09-14
Anticipated expiration: 2039-07-10
Also published as: CN110336700A

Abstract

The invention relates to a microblog popularity prediction model based on time and the user forwarding sequence, and belongs to the field of message popularity prediction in social networks. The method comprises: S1: modeling the forwarding sequence of a microblog with a recurrent neural network to capture the long-distance dependence of the message propagation process; S2: applying a nonlinear transformation to the output of the hidden layer to learn the rate at each time step of the propagation process; S3: using the rate to obtain the early trend acceleration and the early popularity and, under the optimization of user activity, predicting the future popularity of the microblog. The invention allows the future popularity trend to be predicted more accurately in the early stage of message propagation; the model not only uses historical propagation information but also describes the propagation process of the microblog well.

Description

Microblog popularity prediction method based on time and user forwarding sequence

Technical Field

The invention belongs to the field of message popularity prediction in social networks, and relates to a microblog popularity prediction model based on time and the forwarding sequence of users.

Background

The popularity and low cost of Web 2.0 services have changed the way content is generated and consumed online. Internet technology has developed rapidly in recent years, and with the rise and spread of the Internet, daily life has become inseparable from the network. Through the network, content producers can reach audiences that would be unimaginable through traditional channels, and services for sharing video, photos, and music, weblogs, social bookmarking sites, collaboration portals, and news aggregators that let users submit, browse, rate, and discuss content are available worldwide. Social networking services such as Facebook, Twitter, Weibo, and WeChat play an important role in spreading trending events, and users rely on these networks for updates on personal and global news.

As social networks have grown, people increasingly publish their own opinions and comment on events online. Social networks such as microblogs make it very convenient to acquire and share information. However, while enjoying these benefits, users are also exposed to the harms of social networks, such as false messages and defamation spread online. If such messages spread rapidly through the network, they distort people's judgment and can cause unpredictable losses. If the popularity trend of an event can be predicted early in its life, relevant government departments can better manage public opinion and companies can prepare for emergencies in advance. Popularity prediction is also valuable for anticipating server outages when a hot topic breaks out. It is significant for network dimensioning (e.g., caching and replication), online marketing (e.g., recommendation systems and media advertising), real-world outcome prediction (e.g., economic trends), and emergency management, but it is also a very difficult problem because of the structure of social networks and the large number of users.

Currently, the popularity prediction problem is generally addressed by three kinds of methods. The first is feature-based machine learning, which builds a classification or regression model, so that feature extraction becomes the key to the problem. The second is based on stochastic point processes, which model the message propagation process, describe it well, and learn the arrival process of messages. The third is based on epidemic models, which express the rules of message propagation with dynamical equations. Classification or regression models rely on feature extraction and do not characterize the propagation process. Point-process methods fall short in predictive performance, cannot adapt to every social network because social networks are diverse, and do not exploit the supervision available from historical messages. Based on this analysis, a microblog popularity prediction model based on time and the user forwarding sequence is proposed.

Disclosure of Invention

In view of the above, the present invention provides a microblog popularity prediction model based on time and the forwarding sequence of users. The model uses a recurrent neural network to model the forwarding sequence of a microblog and capture the long-distance dependence of the message propagation process, applies a nonlinear transformation network to the output of the hidden layer to learn the rate at each time step of the propagation process, and finally predicts the future popularity of the microblog from the early trend acceleration and the early popularity obtained from the rate, under the optimization of user activity.

In order to achieve the purpose, the invention provides the following technical scheme:

A microblog popularity prediction model based on time and the user forwarding sequence comprises the following steps:

S1: modeling the microblog forwarding sequence with a recurrent neural network to capture the long-distance dependence of the message propagation process;

S2: applying a nonlinear transformation network to the output of the hidden layer to learn the rate at each time step of the propagation process;

S3: predicting the future popularity of the microblog from the early trend acceleration and the early popularity obtained from the rate, under the optimization of user activity.

Further, step S1 includes the steps of:

S11: mapping of the time vector: each time component unit is converted into the length of that unit measured in the unit one level above it, and that length is allotted to it in the vector. User information is then vectorized: the historical microblog text of each user in the microblog is collected and aggregated into a document representing that user, and all user documents are aggregated into a document set. The topic-word distribution of each topic and the document-topic distribution of each user's microblog document are randomly generated, the words in all documents are generated according to the document-topic distribution and the topic-word distribution, and the model is trained continuously by Gibbs sampling of an LDA topic model. The topic distribution finally obtained for each user document is used as the interest vector of that user;

S12: the time vector and the user vector are concatenated and input as a whole, and an embedding operation is performed according to a certain rule;

S13: the result of step S12 is input into the recurrent neural network: it passes through the embedding layer into the underlying RNN for propagation training. An LSTM is adopted as the recurrent neural network to solve the vanishing gradient problem of a standard recurrent neural network, and the hidden layer output of each time step is finally obtained through a forget gate, an input gate, and an output gate;

The forget gate is:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f),$$

where $x_t$ is the input at the $t$-th time step, $h_t$ is the hidden layer information at the current time step, $h_{t-1}$ is the hidden layer information at the previous time step, "$\cdot$" denotes multiplication, square brackets denote the concatenation of two vectors, $\sigma(\cdot)$ is the sigmoid activation function, $W_f$ is a weight matrix, and $b_f$ is a bias vector.

The input gate and the network state update are:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i),$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C),$$

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t,$$

where $W_i$ and $b_i$ are the weight matrix and bias vector of the input gate, $W_C$ and $b_C$ are the weight matrix and bias vector of the candidate state $\tilde{C}_t$, $C_t$ is the cell state, and $\tanh$ is the hyperbolic tangent function;

The output gate is:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \quad h_t = o_t * \tanh(C_t),$$

where $W_o$ and $b_o$ are the weight matrix and bias vector of the output gate, respectively.
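As an illustration of the gate equations in step S13, the following minimal Python sketch performs one LSTM time step with plain lists. The candidate state and cell-state update follow the standard LSTM form implied by the symbols W_C, b_C, tanh, and C_t. The function names (`lstm_step`, `affine`), the layout of the parameter dictionary, and the all-zero weights in the usage lines are assumptions made for this example and are not part of the described method.

```python
import math


def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; all vectors are plain Python lists."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["W_f"], z, params["b_f"])]  # forget gate
    i_t = [sigmoid(a) for a in affine(params["W_i"], z, params["b_i"])]  # input gate
    c_hat = [math.tanh(a) for a in affine(params["W_C"], z, params["b_C"])]  # candidate
    o_t = [sigmoid(a) for a in affine(params["W_o"], z, params["b_o"])]  # output gate
    # Cell state: keep part of the old state, add part of the candidate state.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    # Hidden output: gated, squashed cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Illustrative usage with all-zero parameters (hypothetical sizes).
hidden, inputs = 2, 3
zero_W = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {name: zero_W for name in ("W_f", "W_i", "W_C", "W_o")}
params.update({name: [0.0] * hidden for name in ("b_f", "b_i", "b_C", "b_o")})
h_t, c_t = lstm_step([1.0, 0.0, 0.0], [0.0, 0.0], [1.0, -1.0], params)
```

With zero weights every gate evaluates to 0.5 and the candidate state to 0, so the step simply halves the previous cell state, which makes the data flow through the gates easy to check by hand.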

Further, in step S2, the hidden layer output of the recurrent neural network is obtained and then nonlinearly transformed to obtain the propagation rate of the microblog at each forwarding time. The forwarding process of the message is modeled as a stochastic point process, and the calculation formula is as follows:

$$v_t = \exp(W^m h_t + b^m)$$

where $W^m$ is a weight matrix, $b^m$ is a bias parameter, the influence of the history $H_t$ is reflected in the term $W^m h_t$, and $h_t$ is the hidden layer information of the recurrent neural network, which also represents the historical information in the sequence data.
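The rate formula of step S2 can be sketched in a few lines of Python. The sketch assumes that the rate at each time step is a scalar, so that W^m acts as a single row of weights; the function name `propagation_rate` and the numeric values are made up for illustration.

```python
import math


def propagation_rate(h_t, w_m, b_m):
    """v_t = exp(w_m . h_t + b_m); the exponential keeps the rate strictly positive."""
    return math.exp(sum(w * h for w, h in zip(w_m, h_t)) + b_m)


# Illustrative hidden states for three forwarding times (hypothetical values).
hidden_states = [[0.1, -0.3], [0.4, 0.2], [0.0, 0.0]]
w_m = [0.5, -1.0]
b_m = 0.1
rates = [propagation_rate(h, w_m, b_m) for h in hidden_states]
```

The exponential is a natural choice for this transformation because a propagation rate must be positive whatever values the hidden layer produces.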

Further, step S3 includes the steps of:

S31: the obtained rate function is used to calculate the propagation trend acceleration of the microblog within the observation time. Propagation trends differ greatly between types of microblogs, and these differences determine future popularity, so a feature that indicates changes in the popularity trend must be found and incorporated into the model in order to predict the future popularity of the microblog more accurately. The calculation formula is as follows:

where $T_{obs}$ is the observation time, $n$ is the number of elements in the forwarding sequence, and $v_i$ is the value of the rate function at each forwarding time;

S32: the user activity is quantified to obtain the user activity of each time period on the microblog platform. The specific quantization formula is as follows:

where $N(t)$ is the average number of microblogs posted by users from the start of a day to the current time $t$, and $\eta$ is the average number of microblogs posted by users per unit time on the microblog platform; the unit of time may be hours, minutes, or seconds.

S33: the trend acceleration of step S31 and the early popularity of the message are each divided by the user activity of step S32, yielding a relative trend acceleration and a relative popularity, as follows:

The two are then combined to establish a linear regression model, with the following calculation formula:

where $\beta_0$, $\beta_1$, and $\beta_2$ are model parameters.
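Step S33 can be illustrated with a small Python sketch. The exact regression formula is not reproduced in this text, so the sketch assumes the simplest reading: the target is linear in the relative trend acceleration and the relative popularity, with intercept β0 and coefficients β1 and β2, fitted by ordinary least squares. All function names and the sample data are hypothetical.

```python
def relative_features(acceleration, early_popularity, activity):
    """Divide both early-stage quantities by the user activity (step S33)."""
    return acceleration / activity, early_popularity / activity


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_linear_regression(features, targets):
    """Least squares for y = b0 + b1 * a + b2 * p via the normal equations."""
    X = [[1.0, a, p] for a, p in features]
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
    Xty = [sum(row[i] * y for row, y in zip(X, targets)) for i in range(3)]
    return solve_linear_system(XtX, Xty)


def predict(betas, rel_acceleration, rel_popularity):
    """Assumed linear form of the final popularity prediction."""
    return betas[0] + betas[1] * rel_acceleration + betas[2] * rel_popularity


# Made-up samples: (trend acceleration, early popularity, user activity).
samples = [(0.5, 20.0, 0.5), (1.0, 10.0, 0.5), (3.0, 50.0, 1.0),
           (8.0, 60.0, 2.0), (2.5, 40.0, 0.5)]
features = [relative_features(*s) for s in samples]
# Synthetic targets generated from known parameters so the fit can be checked.
targets = [2.0 + 3.0 * a + 0.5 * p for a, p in features]
betas = fit_linear_regression(features, targets)
```

Because the synthetic targets are generated from fixed parameters, the fit should recover them up to rounding, which gives a quick sanity check of the pipeline before it is applied to real forwarding data.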

The beneficial effects of the invention are as follows: the future popularity trend of a message can be predicted more accurately in the early stage of message propagation, and the model not only uses historical propagation information but also describes the propagation process of the microblog well.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.

Drawings

For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 is a system diagram of the microblog popularity prediction model based on time and the user forwarding sequence;

FIG. 2 shows the process of generating the user vectors in a forwarding sequence;

FIG. 3 is a schematic diagram of how the LSTM operates on an input vector.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.

The drawings are for illustration only and are not intended to limit the invention; they are schematic representations rather than depictions of an actual product. To better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged, or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.

The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.

Before the solution is described, seven concepts needed to understand the invention are introduced.

Concept 1: message popularity prediction. A message is a piece of information generated in a social network, such as a microblog on Sina Weibo. Popularity is the final outcome of the message's future propagation and can be measured by the number of times the microblog is forwarded. Message popularity prediction means predicting, early after a message is published, the specific number of times it will be forwarded in the future.

Concept 2: recurrent neural network. A recurrent neural network is a neural network for processing sequence data, such as time series data, which is collected at different points in time and reflects how an object or phenomenon changes over time. The invention uses an LSTM network, whose central idea is the use of three gates. The forget gate controls how much of the long-term cell state continues to be kept; the input gate controls how much of the network's current input is written into the long-term cell state; the output gate controls whether the long-term cell state is used as the current LSTM output.

Concept 3: topic model. A topic model is a method for modeling text and learning the latent topic distribution in it. It overcomes the shortcomings of the document similarity measures used in traditional information retrieval and can automatically discover semantic topics in text across massive Internet data.

Concept 4: linear regression model. A linear regression model learns a linear function that predicts the output for an input value as accurately as possible. In this model the dependent variable is continuous, and the independent variables may be continuous or discrete. If the analysis includes only one independent variable and one dependent variable and their relationship can be approximated by a straight line, it is called simple (unary) linear regression. If it includes two or more independent variables and the dependent variable is linearly related to them, it is called multiple linear regression.

Concept 5: stochastic point process. The forwarding times in a microblog forwarding sequence are taken to be non-negative random variables generated in chronological order, which is called a point process on the positive real line. The defining formula is as follows:

where $H_t$ denotes the historical propagation process before forwarding time $t$. The formula expresses how the rate changes over time during microblog propagation; $H_t$ is included because the current forwarding action is considered to be influenced by the historical propagation process.

Concept 6: observation time. The time for which a message has propagated after its publication before prediction begins.

Concept 7: the time at which the popularity of a message stabilizes and no longer grows.
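The topic model of Concept 3, which step S11 uses to turn each user's aggregated microblog text into an interest vector, can be sketched as a collapsed Gibbs sampler for LDA. This is a minimal illustration under stated assumptions: documents are already tokenized into integer word ids, and the hyperparameters `alpha` and `beta`, the iteration count, the function name `lda_gibbs`, and the toy corpus are all chosen for the example rather than taken from the described method.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one topic distribution per document."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]             # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                           # total words per topic
    assignments = []
    # Random initial topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][n]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed document-topic distributions serve as user interest vectors.
    return [
        [(doc_topic[d][j] + alpha) / (len(doc) + num_topics * alpha)
         for j in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


# Toy corpus: each inner list is one user's aggregated document as word ids.
user_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 3, 5]]
interest_vectors = lda_gibbs(user_docs, num_topics=2, vocab_size=6)
```

Each returned vector sums to one and has one entry per topic, so it can be concatenated directly with the time vector as the input of the recurrent network.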

The invention provides a microblog popularity prediction model based on time and the user forwarding sequence. It uses source microblogs on Sina Weibo and the subsequent forwarded microblogs as training sets and, after training, predicts the future popularity of a microblog more accurately. The model is built on the forwarding sequence of the microblog and ultimately predicts the future popularity of the message. It is divided into three parts, as shown in FIG. 1. The first part models the forwarding sequence of the microblog with a recurrent neural network to capture the long-distance dependence of the message propagation process. The second part applies a nonlinear transformation network to the output of the hidden layer and learns the rate at each time step of the propagation process. The third part predicts the future popularity of the microblog from the early trend acceleration and the early popularity obtained from the rate, under the optimization of user activity.

1. The first part comprises the following three steps:

Step 1: mapping of the time vector. Each time component unit is converted into the length of that unit measured in the unit one level above it, and that length is allotted to it in the vector. For example, for the minute unit, the unit one level above is the hour, and one hour contains 60 minutes, so by this definition the minute occupies 60 positions in the vector. Given any time, its minute value is taken modulo the length of the unit to obtain a number m; the m-th position of the minute segment in the time vector is set to 1 and the remaining positions to 0, so that the minute value is represented in the time vector. User information is then vectorized: the historical microblog text of each user in the microblog is collected and aggregated into a document representing that user, and all user documents are aggregated into a document set. The topic-word distribution of each topic and the document-topic distribution of each user's microblog document are randomly generated, the words in all documents are generated according to the document-topic distribution and the topic-word distribution, and the model is trained continuously by Gibbs sampling of an LDA topic model. The topic distribution finally obtained for each user document is used as the interest vector of that user. The specific process is shown in FIG. 2.

Step 2: the time vector and the user vector are concatenated and input as a whole, and an embedding operation is performed according to a certain rule.

Step 3: the result of Step 2 is input into the recurrent neural network: it passes through the embedding layer into the underlying RNN for propagation training. A standard recurrent neural network suffers from the vanishing gradient problem, and a gate-based LSTM can be adopted to solve it. In the LSTM, the output of the hidden layer depends not only on the current input but also on the output of the previous time step; it is obtained through the forget gate, the input gate, and the output gate, as shown in FIG. 3. The forget gate is $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$. The input gate and the network state update are $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$, $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$, and $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$.

The output gate is $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$, with $h_t = o_t * \tanh(C_t)$.
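The time-vector mapping of Step 1 and the concatenation of Step 2 can be sketched as follows. The text only works through the minute example, so the choice of units (hour, minute, second), the function names, and the three-dimensional interest vector are assumptions made for illustration.

```python
from datetime import datetime

# (attribute name, length of the unit inside the unit one level above it).
# Which units to include is an assumption; the text only details the minute.
UNITS = [("hour", 24), ("minute", 60), ("second", 60)]


def one_hot(value, length):
    """Set position (value mod length) to 1 and every other position to 0."""
    vec = [0] * length
    vec[value % length] = 1
    return vec


def time_vector(ts):
    """Concatenate one one-hot segment per time unit."""
    vec = []
    for name, length in UNITS:
        vec.extend(one_hot(getattr(ts, name), length))
    return vec


def build_input(ts, user_interest):
    """Concatenate the time vector with the user interest vector (Step 2)."""
    return time_vector(ts) + list(user_interest)


# Illustrative forwarding event: a timestamp and a hypothetical interest vector.
ts = datetime(2019, 7, 10, 14, 35, 7)
x = build_input(ts, [0.7, 0.2, 0.1])
```

For this timestamp the hour segment has its 1 at index 14, the minute segment at index 35 of its 60 positions, and the second segment at index 7, so the time part of the input is a sparse vector of length 144 with exactly three non-zero entries.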

2. The second part comprises one step:

Step 1: the hidden layer output of the recurrent neural network is obtained and then nonlinearly transformed to obtain the propagation rate of the microblog at each forwarding time. The forwarding process of the message is modeled as a stochastic point process, and the specific calculation formula is as follows:

$$v_t = \exp(W^m h_t + b^m)$$

3. The third part comprises the following three steps:

Step 1: the obtained rate function is used to calculate the propagation trend acceleration of the microblog within the observation time. Propagation trends differ greatly between types of microblogs, and these differences determine future popularity, so a feature that indicates changes in the popularity trend must be found and incorporated into the model in order to predict the future popularity of the microblog accurately. The calculation formula is as follows:

where $T_{obs}$ is the observation time, $n$ is the number of elements in the forwarding sequence, and $v_i$ is the value of the rate function at each forwarding time.

Step 2: the user activity is quantified to obtain the user activity of each time period on the microblog platform. The specific quantization formula is as follows:

Step 3: the trend acceleration of Step 1 and the early popularity of the message are each divided by the user activity of Step 2 to obtain the relative trend acceleration and the relative popularity, as follows:

where $\beta_0$, $\beta_1$, and $\beta_2$ are model parameters.

Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims

1. A microblog popularity prediction method based on time and the user forwarding sequence, characterized by comprising the following steps:

S1: modeling the forwarding sequence of a microblog with a recurrent neural network to capture the long-distance dependence of the message propagation process;

S2: obtaining the hidden layer output of the recurrent neural network, and then performing a nonlinear transformation to obtain the propagation rate of the microblog at each forwarding time;

S3: predicting the future popularity of the microblog from the early trend acceleration and the early popularity obtained from the rate, under the optimization of user activity, including the following steps:

S31: using the obtained rate function to calculate the propagation trend acceleration of the microblog within the observation time, with the following calculation formula:

where $T_{obs}$ is the observation time, $n$ is the number of elements in the forwarding sequence, and $v_i$ is the value of the rate function at each forwarding time;

S32: quantifying the user activity to obtain the user activity of each time period on the microblog platform, with the following quantization formula:

where $N(t)$ is the average number of microblogs posted by users from the start of the day to the current time $t$, and $\eta$ is the average number of microblogs posted by users per unit time on the microblog platform;

S33: dividing the trend acceleration of step S31 and the early popularity of the message each by the user activity of step S32 to obtain the relative trend acceleration and the relative popularity, as follows:

and then combining the two to establish a linear regression model, with the following calculation formula:

where $\beta_0$, $\beta_1$, and $\beta_2$ are model parameters.

2. The microblog popularity prediction method based on time and the user forwarding sequence according to claim 1, characterized in that step S1 comprises the following steps:

S11: mapping of the time vector: each time component unit is converted into the length of that unit measured in the unit one level above it, and that length is allotted to it in the vector; the user information is then vectorized: the historical microblog text of each user in the microblog is collected and aggregated into a document representing that user, all user documents are aggregated into a document set, the topic-word distribution of each topic and the document-topic distribution of each user's microblog document are randomly generated, the words in all documents are generated according to the document-topic distribution and the topic-word distribution, the model is trained continuously by Gibbs sampling of the LDA topic model, and the topic distribution finally obtained for each user document is used as the user's interest vector;

S12: concatenating the time vector and the user vector as a whole input, and performing an embedding operation according to a certain rule;

S13: inputting the result of step S12 into the recurrent neural network, where it passes through the embedding layer into the underlying RNN for propagation training; an LSTM is used as the recurrent neural network to solve the vanishing gradient problem of a standard recurrent neural network, and the hidden layer output of each time step is finally obtained through the forget gate, the input gate, and the output gate;

The forget gate is:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f),$$

where $x_t$ is the input at the $t$-th time step, $h_t$ is the hidden layer information at the current time step, $h_{t-1}$ is the hidden layer information at the previous time step, "$\cdot$" denotes multiplication, square brackets denote the concatenation of two vectors, $\sigma(\cdot)$ is the sigmoid activation function, $W_f$ is a weight matrix, and $b_f$ is a bias vector;

The input gate and the network state update are:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i),$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C),$$

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t,$$

where $W_C$ and $b_C$ are a weight matrix and a bias vector, respectively, and $\tanh$ is the hyperbolic tangent function;

The output gate is:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \quad h_t = o_t * \tanh(C_t),$$

where $W_o$ and $b_o$ are the weight matrix and bias parameters of the output gate, respectively.

3. The microblog popularity prediction method based on time and the user forwarding sequence according to claim 1, characterized in that, in step S2, the hidden layer output of the recurrent neural network is obtained and then nonlinearly transformed to obtain the propagation rate of the microblog at each forwarding time, and the forwarding process of the message is modeled as a stochastic point process, with the following calculation formula:

$$v_t = \exp(W^m h_t + b^m)$$

where $W^m$ is a weight matrix, $b^m$ is a bias parameter, the influence of $H_t$ is reflected in the term $W^m h_t$, and $h_t$ is the hidden layer information of the recurrent neural network, which also represents the historical information in the sequence data.